Deviations & The Chrysalids

If people remember the British writer John Wyndham (1903-1969) at all, it will be because of The Day of the Triffids, a wonderful low-key science fiction novel about what happens when certain very large plants (Triffids) are developed, grown and harvested... Wyndham also wrote other great novels such as The Midwich Cuckoos (filmed as Village of the Damned) and The Kraken Wakes (about interstellar entities that take to The Deep, causing maelstroms and tidal waves / tsunamis, putting up the price of air travel because everyone’s scared to travel by sea, and then They or their Allies begin to invade coastal regions...).

But the Wyndham book most applicable to statisticians is The Chrysalids (Michael Joseph 1955, Penguin 1958). It depicts a post-apocalyptic society scared of mutations, any mutations, in plants, in animals, and particularly in humans, known as Deviations.

‘Blessed is the Norm’, ‘Watch thou for the Mutant’. So when young David befriends Sophie, who turns out to be a Deviation because she has six toes (“each foot five toes, and each toe shall end with a flat nail”)... well, you’ll just have to read it. A very thoughtful book indeed.

Hobart and Randomicity

Mona, lower case, is a great 50s song by Bo Diddley, covered a few years later by the Rolling Stones on their first album.

MONA, upper case, standing for Museum of Old and New Art, is an amazing underground (literally) art gallery in Hobart, the capital of Tasmania, the island state of Australia.

Hobart is the second oldest state capital in Australia (after Sydney), was liked by both Mark Twain and Agatha Christie, and is the birthplace of Hollywood actor Errol Flynn (1909-1959), as well as the final resting place of the thylacine or ‘Tasmanian Tiger’, a carnivorous marsupial, the last known of which died in captivity in 1936. Hobart is also where, in the mid 1930s, Edward James George Pitman (1897-1993) developed randomization or permutation tests, which Sir Ronald Aylmer Fisher had also worked on. Permutation tests rely (these days) on computers, and don’t require reference to statistical arcana such as the Normal and Student’s t distributions, etc.

As shown by the late Julian Simon and more recently in that wonderful stats book that sounds like a law firm (Lock, Frazer Lock, Lock Morgan, Lock and Lock, 2012), permutation tests can also be easier for students to understand than the parametric alternatives.
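For the curious, here is a minimal sketch of the idea in Python: shuffle the group labels many times and see how often the shuffled difference in means is at least as big as the one actually observed. The function name `permutation_test` and the scores are made up for illustration (they are not real data), and this is only one simple variant, not Pitman's or Fisher's original formulation.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_difference(values):
        # Mean of the first n_a values minus mean of the remainder
        first, second = values[:n_a], values[n_a:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = mean_difference(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly re-assign the group labels
        if abs(mean_difference(pooled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up illustrative scores, not real data
treated = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [11.0, 12.2, 11.8, 12.9, 11.5, 12.4]
observed_diff, p_value = permutation_test(treated, control)
print(f"observed difference = {observed_diff:.2f}, p = {p_value:.4f}")
```

No Normal or Student's t tables in sight: the reference distribution is built from the data themselves.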

MONA itself is currently showing the movie ‘David Bowie Is’, a segment of which talks about the London singer’s use of the William Burroughs / Brion Gysin cut-up technique and later a computer program called Verbasizer, to randomly suggest combinations of particular words to aid in the creative song-writing process.

You may or may not be interested in randomicity, and the David Bowie movie may no longer be showing, but whether it’s out of the desire for adventure, curiosity, necessity or for purely random reasons, visit MONA and Hobart!!

Further reading:

Lock RH, Frazer Lock P, Lock Morgan K, Lock EF, Lock DF (2012). Statistics: unlocking the power of data. Wiley.

McKenzie D (2013). Chapter 14: Statistics and the Computer. In McKenzie S: Vital Statistics: an introduction for health science students. Elsevier.

Robinson ES (2011). Shift linguals: cut-up narratives from William S. Burroughs to the present. Rodopi.

Timms P (2012). Hobart. (revised edition). University of New South Wales Press.

Divisive Rules OK: Clustering #1

Back about 1965, while (whilst?) attending Primary (grade) School in a little northern Victorian town, lunchtimes would see a behaviour pattern in which, say, two boys would link arms and march around the boys’ playground chanting “join on the boys who want to play Chasey” or some other sport or game, and soon, unless something bizarre or boring was called out for, there’d be three, five, eight, ten etc until there were enough people for that particular game.

Now, let’s imagine a slightly more surreal version: a big group of boys, or girls, or indeed a mixture thereof, wanders around the playground. But what sport are they going to play? It’s unlikely (especially for our purposes here) they’ll all want to play the same thing, and even if they did, there may be too many, or some people may well be better suited to some other sport or game. If we were brave enough and if lunchtimes extended into infinity, we could try every possible way of splitting our big group into two or more smaller groups as, in the general field of cluster analysis, Edwards and Cavalli-Sforza showed back in ’65.

Alternatively, we could pick out the single person most different from the rest of the main group M in terms of the game they wanted to play. That person (let’s call them Brian after Brian Everitt, who wrote a great book on cluster analysis in several editions, and Brian Setzer as in Stray Cats and the Brian Setzer Orchestra, this being the unofficial Brian Setzer Summer) splits off from the group and forms a splinter group S. For each of the remaining members, we check whether, on average, they’re more dissimilar to the other members of M than to the members of S (i.e. Brian et al.). If so, then they too join S.

Known as divisive clustering (the earlier “join on” syndrome is sorta kinda like agglomerative clustering: start off with individuals and group ’em together), this particular method was published in ’64 by Macnaughton-Smith and colleagues. Described in Kaufman and Rousseeuw’s book as DIANA, with shades of a great steak sauce and an old song by Paul Anka, DIANA is available in R as part of the cluster package.
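The splinter-group step described above can be sketched in a few lines of Python. This is only a rough illustration of a single split, assuming a user-supplied distance function; the names `split_once` and `average_dissimilarity` and the "game preference" scores are invented here, and it is not the R implementation (which goes on splitting the resulting groups).

```python
def average_dissimilarity(i, group, dist):
    """Average distance from member i to the other members of group."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(dist(i, j) for j in others) / len(others)


def split_once(points, distance):
    """One divisive split; returns (main, splinter) as lists of indices."""
    def dist(i, j):
        return distance(points[i], points[j])

    main = list(range(len(points)))
    # 'Brian': the member most dissimilar, on average, to everyone else
    seed = max(main, key=lambda i: average_dissimilarity(i, main, dist))
    main.remove(seed)
    splinter = [seed]

    while len(main) > 1:
        # Find the member who is most clearly closer to the splinter group
        best, best_gain = None, 0.0
        for i in main:
            gain = (average_dissimilarity(i, main, dist)
                    - average_dissimilarity(i, splinter, dist))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nobody else wants to leave the main group
        main.remove(best)
        splinter.append(best)
    return main, splinter


# Made-up one-dimensional 'game preference' scores, for illustration only
scores = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8]
main, splinter = split_once(scores, lambda a, b: abs(a - b))
print("main group:", sorted(main))
print("splinter group:", sorted(splinter))
```

Here the three high scorers wander off together, leaving the other four to get on with their game.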

Now if you’ll excuse me, there’s a group looking for members to march down the road for a cold drink, on this hot Australian summer night! Once we get to the bar, the most dissimilar, perhaps a nondrinker, will split off, clusters will be formed, and through the night there may be re-splitting and re-joining of groups or cliques, as some go off to the pinball parlour, others to the pizza joint, while some return to the bar, all in the manner of another great clustering algorithm, Ball and Hall’s ISODATA.

Bottled Sources:

Ball GH, Hall DJ (1965). A novel method of data analysis and pattern classification. Technical Report, Stanford Research Institute, California.

Edwards AWF, Cavalli-Sforza LL (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt BS (1974 and more recent editions). Cluster analysis. Heinemann: London.

Kaufman L, Rousseeuw PJ (1990). Finding groups in data: an introduction to cluster analysis. Wiley: New York.

Macnaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964). Dissimilarity analysis: A new technique of hierarchical sub-division. Nature, 202, 1034-1035.

Who gives a toss: the statistics of coins

Spring is here in Melbourne, and with it a time for fashionable horse racing, including The Melbourne Cup in November, once attended by Mark Twain. Australia is also home of the “two-up” coin tossing game (descended from the British pitch and toss), played in outback pubs, hidden city lanes and now Australian casinos, and described in great old Australian novels such as Come In Spinner, and the eerie book and 1971 movie Wake in Fright (aka Outback).

In the 18th century, the Comte de Buffon obtained 2048 heads from 4040 tosses, while more recently, and not to be outdone, the statistician Karl Pearson obtained 12,012 heads out of 24,000 tosses (The Jungles of Randomness by Ivars Peterson, 1998). Of course, a misunderstanding of the law of large numbers, or so-called law of averages, makes the uninitiated think that if there are, say, seven heads in a row, a cosmic force will decide “hang on, that coin is coming up heads more than 50%, better make the next one a tail”.
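A quick check by simulation, as a sketch only: toss a simulated fair coin many times and look at what happens on the toss immediately following a run of seven or more heads. The function name `next_toss_after_streak`, the number of tosses and the seed are arbitrary choices made for this illustration.

```python
import random

# Proportions of heads reported for Buffon and Pearson in the text
buffon = 2048 / 4040
pearson = 12012 / 24000
print(f"Buffon: {buffon:.4f}, Pearson: {pearson:.4f}")


def next_toss_after_streak(n_tosses=500_000, streak=7, seed=1):
    """Proportion of heads on tosses that follow at least `streak` heads in a row."""
    rng = random.Random(seed)
    run = 0            # length of the current run of heads
    followed = 0       # tosses made straight after a long-enough run
    heads_after = 0    # how many of those tosses came up heads
    for _ in range(n_tosses):
        heads = rng.random() < 0.5
        if run >= streak:
            followed += 1
            heads_after += heads
        run = run + 1 if heads else 0
    return followed, heads_after / followed


followed, proportion = next_toss_after_streak()
print(f"{followed} tosses followed a streak; proportion of heads = {proportion:.3f}")
```

The proportion hovers around a half: the coin, like the cards, has no memory.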

While it doesn’t look at two-up, “Digital Dice” by the always entertaining Paul Nahin (2008) examines a tricky coin-tossing problem posed in 1941 and not solved until 1966. Prof Paul shows how to solve it using a computer-based Monte Carlo method, itself named after that famous casino in Monaco, where James Bond correctly observed that “the cards have no memory”.

And who says stats isn’t relevant?!

Applied Australian Change-Point Analysis: Before the Shark Gets Jumped?

OK, I saw the (in)famous Season 5 Episode 3 “Jump the Shark” episode of Happy Days (when Fonzie water-skis over a shark pool) when I was 18 back in 1977, and hated it.

Definitely Uncool.
But one Saturday morning a month or two ago I saw it again and loved it. It’s wild! It’s glorious!

The term has come to mean the point at which a TV series goes downhill, when the wolf becomes a dog, to riff on a previous post.

Anyhow, Australia’s Professor Kerrie Mengersen and Dr Hassan Assareh have developed a snazzy new Bayesian Markov Chain Monte Carlo procedure for working out the change-point in a process, for example the point where a key change happened to a hospital patient’s condition, helping to identify the ‘why’ as well as the ‘when’.
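To give a flavour of what a Bayesian change-point calculation looks like, here is a deliberately tiny Python sketch. It is not the Mengersen and Assareh procedure: it assumes a single change in the rate of a yes/no outcome, uniform priors on the two rates and on the change-point, and it enumerates every possible change-point exactly instead of using MCMC. The function names and the outcome sequence are made up for illustration.

```python
import math


def log_marginal(successes, n):
    """Log marginal likelihood of a 0/1 segment under a uniform prior on its rate.

    Equals log Beta(successes + 1, n - successes + 1).
    """
    return (math.lgamma(successes + 1) + math.lgamma(n - successes + 1)
            - math.lgamma(n + 2))


def change_point_posterior(outcomes):
    """Posterior over tau, where the rate changes after the tau-th observation.

    Returns a list whose index 0 corresponds to tau = 1.
    """
    n = len(outcomes)
    total = sum(outcomes)
    log_post = []
    before = 0
    for tau in range(1, n):
        before += outcomes[tau - 1]
        log_post.append(log_marginal(before, tau)
                        + log_marginal(total - before, n - tau))
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]  # avoid underflow
    norm = sum(weights)
    return [w / norm for w in weights]


# Made-up sequence: adverse events are rare for 30 cases, then common for 20
outcomes = [1, 0, 0, 0, 0] * 6 + [1, 1, 0, 1] * 5
posterior = change_point_posterior(outcomes)
most_likely_tau = posterior.index(max(posterior)) + 1
print("most probable change-point: after observation", most_likely_tau)
```

The real thing handles messier processes and patient-level risk, but the basic idea is the same: let the data say where the ‘when’ most plausibly lies.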

It’s a great idea and yet another instance of how Statistics can help save the world, again!

Snappy Stepwise Regression

Stepwise regression, the technique that attempts to select a smaller subset of variables from a larger set by at each step choosing the ‘best’ or dropping the ‘worst’, was developed back in the late 1950s by applied statisticians in the petroleum and automotive industries. With an ancestry like this, it’s no wonder that it is often regarded as the statistical version of the early 60s Chev Corvair: at best only ‘driveable’ by expert, careful users or, in Ralph Nader’s immortal words and the title of his 1965 book, ‘Unsafe at Any Speed’.

Well, maybe. But if used with cross-validation and good sense, it’s an old-tech alternative to later-model ‘lasso’ and ‘elastic net’ techniques. However, there’s an easy way to give the old stepwise routine a bit of a softshoe shuffle. See how well (preferably on a fresh set of data) forward entry with just one, or maybe two, or at most three variables does, compared with larger models. (SAS and SPSS allow the number of steps to be specified.)

Or if you’d like to do some slightly fancier steps in two-tone spats, try a best subset regression (available in SAS, in SPSS through automatic linear modelling, and in Minitab, R etc.) of all one-variable combinations, then two variables, then three variables.
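For those without SAS or SPSS to hand, the small-model check can be sketched in Python with NumPy. This is only an illustration on simulated data (the coefficients, sample sizes and the names `fit_ols`, `mse` and `best_small_models` are all assumptions made here): for each size from one to three, pick the subset that fits the training half best, then score it on the fresh half alongside the full model.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mse(coef, X, y):
    """Mean squared prediction error of a fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - A @ coef) ** 2))


def best_small_models(X_train, y_train, X_test, y_test, max_size=3):
    """Best subset of each size (chosen on training data), scored on fresh data."""
    results = {}
    for size in range(1, max_size + 1):
        best = None
        for cols in combinations(range(X_train.shape[1]), size):
            idx = list(cols)
            coef = fit_ols(X_train[:, idx], y_train)
            train_error = mse(coef, X_train[:, idx], y_train)
            if best is None or train_error < best[0]:
                best = (train_error, cols, coef)
        _, cols, coef = best
        results[size] = (cols, mse(coef, X_test[:, list(cols)], y_test))
    return results


# Simulated data for illustration: only columns 2 and 5 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 2] + 0.5 * X[:, 5] + rng.normal(size=200)
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]

results = best_small_models(X_train, y_train, X_test, y_test)
full_mse = mse(fit_ols(X_train, y_train), X_test, y_test)
for size, (cols, error) in results.items():
    print(f"{size} variable(s) {cols}: fresh-data MSE = {error:.3f}")
print(f"all 8 variables: fresh-data MSE = {full_mse:.3f}")
```

If the one- or two-variable model does about as well on the fresh data as the big one, that is worth knowing before reaching for anything fancier.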

The inspiration for this is partly from Gerd Gigerenzer’s ‘take the best’ heuristic: taking the best cue or clue often beats more complex techniques, including multiple regression. ‘Take the best’ is described in Prof Gigerenzer’s great new general book Risk Savvy: How to Make Good Decisions (Penguin, 2014) as well as his earlier academic books such as Simple Heuristics That Make Us Smart (Oxford University Press, 1999).

See if a good little model can do as well as a good (or bad) big ’un!


Further Future Reading

Draper NR, Smith H (1966) Applied regression analysis (and later editions). Wiley: New York.

Who wrote what: Statistics and the Federalist

Stats is of course not just about numbers; it’s also often used to analyse words, even more so now with the explosion of social media in the past few years. But the General Inquirer, the program for the quantitative analysis of text by the late great Phil Stone of Harvard University, was developed back in the early 1960s. A few years later, in 1964, the year the Ford Mustang and Pontiac GTO pony/muscle cars were released, the late great Fred Mosteller and the great (and still with us) David Wallace published their book on the (mainly) Bayesian analysis of who wrote the Federalist Papers, a year after an introductory paper had appeared in the Journal of the American Statistical Association.

In the late 18th Century, three key figures in the foundation of the United States, Alexander Hamilton, John Jay and James Madison, wrote 85 newspaper articles to help ratify the American Constitution.

The papers were published anonymously, but scholars had figured out the authorship of all but twelve, not knowing for sure whether these had been written by Madison or Hamilton. The papers were written in a very formal, and very similar, style, and so Mosteller and Wallace turned to function words like “an” and “of” and “upon”, and particularly “while” and “whilst”, a researcher from back in 1916 having noticed that Hamilton tended towards the former, Madison the latter. Computers back in the 60s were pretty slow, expensive and hard to come by. There weren’t any at Harvard, where Mosteller had recently established a Statistics Department, so they had to use the one at MIT.
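The raw ingredient of this kind of analysis, how often each function word turns up per thousand words of text, is easy to compute these days. The Python sketch below shows only that counting step, not Mosteller and Wallace’s Bayesian model; the function name `rates_per_thousand` is invented here and the sample sentence is made up, not a quotation from the Federalist.

```python
import re
from collections import Counter

# The function words mentioned in the text above
FUNCTION_WORDS = ("an", "of", "upon", "while", "whilst")


def rates_per_thousand(text, words=FUNCTION_WORDS):
    """Occurrences of each chosen word per 1000 words of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: 1000 * counts[word] / len(tokens) for word in words}


# Made-up sentence for illustration only, not taken from the Federalist papers
sample = ("Whilst the powers of the union rest upon the consent of the people, "
          "an energetic executive is needed while the states deliberate.")
rates = rates_per_thousand(sample)
for word, rate in rates.items():
    print(f"{word:>6}: {rate:6.1f} per 1000 words")
```

Run on papers of known authorship, rates like these are what let an author’s habits be compared with those in the disputed papers.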

In Mosteller and Wallace’s own words, after the combined work of themselves and a huge band of helpers, they “tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers” using works of known authorship to conclude that Madison wrote all 12.

In 1984, M & W published a newer edition of their groundbreaking and highly readable book, with a slightly different title, while a few years later the late great Colin Martindale (with a Harvard doctorate) and I re-analysed the original data using Stone’s General Inquirer thematic dictionary as well as function words, and a type of kernel discriminant analysis / neural network, coming to the same conclusion.

Case closed? Not quite. It has recently been proposed that the disputed 12 papers were a collaboration. A summary of the evidence, and some other citations to classical and recent quantitative Federalist research, are available here.

Either way, when you’re getting a bit jaded with numbers, and the 0’s are starting to look like o’s, analyse text!

Further/Future reading

Mosteller F, Wallace DL (1964) Inference and disputed authorship, the Federalist. Addison-Wesley.

McGrayne, SB (2011) The theory that would not die: how Bayes’ rule cracked the Enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy. Yale University Press.

Martindale C, McKenzie D (1995) On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.

Stone PJ, Bales RF, Namenwirth JZ, Ogilvie DM (1962). The General Inquirer: a computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science, 7, 484-498.