Who wrote what: Statistics and the Federalist

Stats is of course not just about numbers; it's also often used to analyse words, even more so now with the explosion of social media in the past few years. But quantitative text analysis is hardly new: the late great Phil Stone of Harvard University developed the General Inquirer for the quantitative analysis of text in the early 1960s. A few years later, in 1964, the year the Ford Mustang and the Pontiac GTO pony/muscle cars were released, the late great Fred Mosteller and the great (and still with us) David Wallace published their book on the (mainly) Bayesian analysis of who wrote the Federalist Papers, a year after an introductory paper had appeared in the Journal of the American Statistical Association.

In the late 18th Century, three key figures in the foundation of the United States (Alexander Hamilton, John Jay and James Madison) wrote 85 newspaper articles to help ratify the American Constitution.

The papers were published anonymously, but scholars had figured out the authorship of all but twelve, not knowing for sure whether these had been written by Madison or Hamilton. The papers were written in a very formal, and very similar, style, and so Mosteller and Wallace turned to function words like "an", "of" and "upon", and particularly "while" and "whilst": a researcher back in 1916 had noticed that Hamilton tended towards the former and Madison the latter. Computers in the 60s were pretty slow, expensive and hard to come by; there weren't any at Harvard, where Mosteller had recently established a Statistics Department, and so they had to use the one at MIT.

In Mosteller and Wallace’s own words, after the combined work of themselves and a huge band of helpers, they “tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers” using works of known authorship to conclude that Madison wrote all 12.
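(For the curious, here is a minimal Python sketch of the marker-word idea only, nothing like Mosteller and Wallace's actual Bayesian machinery: count how often a few function words occur per thousand words. The text snippets are placeholders; in practice you would feed in whole papers of known and disputed authorship.)

```python
# A toy illustration of marker-word rates (not Mosteller & Wallace's method):
# compute occurrences per 1,000 words of a few function words.
import re
from collections import Counter

MARKERS = ["upon", "while", "whilst"]  # illustrative subset of function words

def marker_rates(text):
    """Rate per 1,000 words of each marker word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return {m: 1000 * counts[m] / total for m in MARKERS}

# Placeholder snippets only; real analyses would use the full papers.
known_hamilton = "while the union endures ... upon which the safety of the states rests"
disputed_text = "whilst the government of the union remains ..."

print(marker_rates(known_hamilton))
print(marker_rates(disputed_text))
```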

In 1984, Mosteller and Wallace published a new edition of their groundbreaking, and highly readable, book under a slightly different title, while a few years later the late great Colin Martindale (with a Harvard doctorate) and I re-analysed the original data using Stone's General Inquirer thematic dictionary as well as function words, and a type of kernel discriminant analysis / neural network, coming to the same conclusion.

Case closed? Not quite. It has recently been proposed that the twelve disputed papers were a collaboration; a summary of the evidence, along with some other citations to classical & recent quantitative Federalist research, is available here:
http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/the-twelve-disputed-federalist-papers-a-case-for-collaboration/

Either way, when you’re getting a bit jaded with numbers, and the 0’s are starting to look like o’s, analyse text!

Further/Future reading

Mosteller F, Wallace DL (1964) Inference and Disputed Authorship: The Federalist. Addison-Wesley.

McGrayne, SB (2011) The theory that would not die: how Bayes’ rule cracked the Enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy. Yale University Press.

Martindale C, McKenzie D (1995) On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.

Stone PJ, Bales RF, Namenwirth JZ, Ogilvie DM (1962). The General Inquirer: a computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science, 7, 484-498.

When Boogie becomes Woogie, when Dog becomes Wolf

An exciting (and not just for statisticians!) area of application in statistics/analytics/data science is change/anomaly/outlier detection. The general notion of outliers (e.g. 'unlikely' values) was covered in a previous post, which looked at, amongst other things, very long pregnancies.

But tonight's fr'instance comes from Fleming's wonderful James Bond Jamaican adventure novel Dr No (also a jazzy 1962 movie), which talks of London Radio Security shutting down radio connections with secret agents if a change in their message-transmitting style is detected, since this may indicate that their radio has fallen into enemy hands.

To use a somewhat less exotic example, imagine someone, probably not James Bond, tenpin bowling and keeping track of their scores, this scenario coming from HJ Harrington et al’s excellent Statistical Analysis Simplified: the Easy-to-Understand Guide to SPC and Data Analysis (McGraw-Hill, 1998).

In the 10th week, the score suddenly drops more than three standard deviations (a measure of the scatter or variation around the mean or average) below the mean.

Enemy agents? Forgotten bowling shoes? Too many milk shakes?

Once again, an anomaly or change, something often examined in industry (Statistical Process Control (SPC) and related areas) to determine the point at which, in the words of Tom Robbins' great novel Even Cowgirls Get the Blues, 'the boogie stopped and the woogie began'.

Sudden changes in operations & processes can happen, and so a usual, everyday assembly line ('dog') can in milliseconds become the unusual, and possibly even dangerous ('wolf'), at which point hopefully an alarm goes off and corrective action is taken.
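For readers who like to tinker, here is a minimal Python sketch of the three-standard-deviation control check described above, using made-up weekly bowling scores (it is not taken from Harrington et al's book):

```python
# A simple Shewhart-style 3-sigma check on hypothetical weekly bowling scores;
# week 10 is deliberately poor.
import statistics

scores = [151, 148, 155, 150, 153, 149, 152, 154, 150, 118]  # made-up data
baseline = scores[:9]                        # weeks assumed to be "in control"
mean = statistics.mean(baseline)
sd = statistics.stdev(baseline)
lower, upper = mean - 3 * sd, mean + 3 * sd  # control limits

for week, score in enumerate(scores, start=1):
    if not lower <= score <= upper:
        print(f"Week {week}: score {score} is outside ({lower:.1f}, {upper:.1f}) - investigate!")
```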

The basics of SPC were developed many years ago (and taken to Japan after WW2, a story in itself), and anomaly detection is a fast-growing area. For further experimentation / reading, a recent method based upon calculating the closeness of points to their neighbours is described in John Foreman's marvellous Data Smart: Using Data Science to Transform Information into Insight (Wiley, 2014).
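As a flavour of the neighbour-based idea (a generic k-nearest-neighbour distance score, not a reproduction of Foreman's Excel workflow), here is a short Python sketch on made-up two-dimensional data:

```python
# k-nearest-neighbour anomaly score: points whose nearest neighbours are far
# away get a high score and are candidate anomalies. Data are made up.
import math

points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (5.0, 5.0)]  # last point is odd
k = 2

def knn_score(p, others, k):
    """Mean distance from p to its k nearest neighbours."""
    dists = sorted(math.dist(p, q) for q in others if q is not p)
    return sum(dists[:k]) / k

for p, s in sorted(((p, knn_score(p, points, k)) for p in points), key=lambda t: -t[1]):
    print(p, round(s, 2))   # (5.0, 5.0) should top the list
```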

We might want to determine if a credit card has been stolen on the basis of different spending patterns/places, or, to return to the opening example, detect an unauthorised intruder to a computer network (e.g. Clifford Stoll’s trailblazing The Cuckoo’s Egg: Tracking a Spy Through the Maze of Computer Espionage).

Finally, we might just want to figure out just exactly when it was that our bowling performance dropped off!

Telstar, Cortina & the Median Quartile Test: where were you in ’62?

It was 1962, the setting of the iconic 1973 movie American Graffiti, from which comes the subtitle of this post. The Beatles had released Love Me Do, their first single. That year also heard and saw Telstar, the eerie but joyful, Clavioline-driven Joe Meek instrumental by the Tornados, celebrating the privately built communications satellite then circling the Earth and relaying transatlantic television. The British Ford Cortina, named after an Italian ski resort, saw out the humpty-dumpty rounded Prefects and 50s Zephyrs, while in the US the first of 50 beautiful, mysterious and largely lost Chrysler Ghia Turbine cars was driven in Detroit.

Meanwhile, the world of statistics was not to be outdone. Rainald Bauer's Median Quartile Test, an extension of Brown and Mood's early-50s Median Test, was published, in German, in 1962. The latter test, still available in statistics packages such as IBM SPSS, SAS and Stata, simply compares groups on counts below and above the overall median, providing, in the case of two groups, a two-by-two table.
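For those who prefer code to packages, the Brown and Mood median test is also available in SciPy; a minimal Python sketch, with two made-up groups of scores:

```python
# Brown & Mood median test: counts above/below the grand median per group,
# then a chi-square test on the resulting 2 x 2 table. Data are made up.
from scipy.stats import median_test

group_a = [12, 15, 14, 10, 18, 17, 11, 13]
group_b = [22, 19, 16, 25, 21, 14, 20, 23]

stat, p, grand_median, table = median_test(group_a, group_b)
print("grand median:", grand_median)
print("2 x 2 table (above / below the grand median, by group):")
print(table)
print("chi-square:", round(stat, 2), "p-value:", round(p, 4))
```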

The Median Quartile Test (MQT), as the name suggests, compares each group on the four quartiles.  But the MQT is largely unknown, mainly discussed in books and papers published in, or translated from, German.

The MQT conveys similar information to John Tukey's boxplot, shows both analysts and their customers and colleagues where the data tend to fall, and provides a test of statistical significance to boot. Does one group show a preponderance of scores in the lower and upper quartiles, for example, suggesting, in pharma fr'instance, that one group either gets much better or much worse?
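Bauer's exact formulation is in the 1962 paper (and the NASA translation below), but the basic idea can be sketched in a few lines of Python: classify every observation into the four quarters defined by the pooled quartiles, cross-tabulate by group, and apply a chi-square test to the resulting table. The data here are made up.

```python
# A sketch of the Median Quartile Test idea (not Bauer's exact formulation):
# group x pooled-quarter contingency table, tested with chi-square.
import numpy as np
from scipy.stats import chi2_contingency

group_a = np.array([12, 15, 14, 10, 18, 17, 11, 13], dtype=float)
group_b = np.array([22, 19, 16, 25, 21, 14, 20, 23], dtype=float)

pooled = np.concatenate([group_a, group_b])
q1, q2, q3 = np.percentile(pooled, [25, 50, 75])     # pooled quartile cut points

def quarter_counts(x):
    """Counts of observations falling into each pooled quarter."""
    quarters = np.digitize(x, [q1, q2, q3])           # 0..3 = lowest..highest quarter
    return np.bincount(quarters, minlength=4)

table = np.vstack([quarter_counts(group_a), quarter_counts(group_b)])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print("chi-square:", round(chi2, 2), "df:", dof, "p-value:", round(p, 4))
```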

A 1967 NASA English translation of the original 1962 Bauer paper is available in the Downloadables section of this site.

Recent application in the Journal of Cell Biology

809.full.pdf

Further / Future reading

Bauer RK (1962) Der “Median-Quartile Test”… Metrika, 5, 1-16.

Von Eye A  et al (1996) The median quartiles test revisited. Studia Psychologica, 38, 79-84.

Visual Trees: The Book of Trees by Manuel Lima

Apart from the iconic and mysterious Australian Nullarbor http://www.nullarbornet.com.au/ (literally 'no trees', though there are actually a few) and the baddest and saddest of outer suburban concrete jungles, trees are a major part of our daily life. Trees produce shade and oxygen, and provide inspiration for dreaming scientists watching apples fall. 'Tree of life'. 'Family trees'. Tree branches have also long provided a metaphor for branches of knowledge and classification systems.

In his excellent new book ‘The book of trees: visualizing branches of knowledge’,

https://www.papress.com/html/book.details.page.tpl?isbn=9781616892180

Manuel Lima, designer and Fellow of the Royal Society of Arts (http://www.visualcomplexity.com/vc/), examines the role of trees in history, religion, philosophy, biology, computer science, data visualization, information graphics and data analysis / statistics.

Covering various types of tree graphs, including radial trees, sunbursts, Ben Shneiderman's Treemaps and Voronoi Treemaps, Lima's treatise provides inspirational historical and contemporary pictures, including timely applications such as looking at the words that appear with 'I' and 'you' in Google texts.

Statistical applications covered are mainly confined to icicle plots or trees, used in applications such as cluster analysis, the grouping of observations into related classes, 'taxa' or clusters such as disease categories.
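By way of illustration (my own sketch, not from Lima's book), hierarchical clustering in Python produces exactly the kind of tree that icicle plots and dendrograms display; the two 'taxa' below are simulated.

```python
# Hierarchical (agglomerative) clustering builds the tree that icicle plots
# and dendrograms both display. Minimal sketch on simulated 2-D data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (10, 2)),     # one simulated 'taxon'
                  rng.normal(5, 1, (10, 2))])    # another simulated 'taxon'

Z = linkage(data, method="ward")                 # the cluster tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
print(labels)

dendrogram(Z)    # the same hierarchy an icicle plot would show
plt.show()
```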

Not published in the Northern hemisphere until April 2014, the book is available now in Melbourne, Australia for around $50, e.g. www.ngvshop.ngv.vic.gov.au (the online search does not work) or http://metropolisbookshop.com.au

Accompanied by sources of information on how to construct such diagrams (e.g. http://www.flowingdata.com), Lima's new book will serve as an accessible and constant source of information on visualizing trees for new, as well as existing, 'arborists'.

‘Velut arbor aevo’

‘May the Tree Thrive’!

 

Treemap software

http://www.cs.umd.edu/hcil/treemap/

http://www.treemap.com/

http://www.tableausoftware.com/

Statistical Outliers: of Baldness and Long Gestations

At what point is a human gestation period 'impossibly' long? This was the question a British court had to consider in the 1949 appeal to the 1948 judgement in Hadlum vs Hadlum.

Ms Hadlum had an apparent gestation period of 349 days, counting from when Mr Hadlum went off to the war. The average human gestation is 40 weeks or 280 days, although newer research suggests an average of 268 days or 38 weeks, varying by ±37 days: http://www.sciencedaily.com/releases/2013/08/130806203327.htm

The widely used statistical definition of an outlier was given by Douglas Hawkins in 1980: 'an observation which deviates so much from other observations as to cause suspicions that it was generated by a different mechanism' (Hawkins DM, 1980, Identification of Outliers. Chapman & Hall).

Hmm! The court upheld the 1948 finding that such a long gestation was possible, and so Ms Hadlum had not been 'unfaithful' to Mr Hadlum, grounds for divorce back in those dark days. In the 1951 case of Preston-Jones vs Preston-Jones, however, the court found a gestation period of 360 days to be the limit. The judge concluded that 'If a line has to be drawn I think it should be drawn so as to allow an ample and generous margin'.

Statisticians have established guidelines for 'outliers' that are lines in the sand, if not in concrete.
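One widely used line in the sand is John Tukey's boxplot rule: flag anything more than 1.5 times the interquartile range beyond the quartiles. A minimal Python sketch, using illustrative gestation lengths (not the Hadlum data):

```python
# Tukey's 1.5 x IQR 'fences' for flagging outliers. Gestation lengths (days)
# below are illustrative only, not real case data.
import numpy as np

gestations = np.array([265, 270, 274, 278, 280, 281, 283, 285, 290, 349], dtype=float)
q1, q3 = np.percentile(gestations, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = gestations[(gestations < lower) | (gestations > upper)]
print(f"fences: ({lower:.1f}, {upper:.1f}) days; flagged: {outliers}")
```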

But speaking of sand, at what point do grains of sand form a heap of sand?

How many hairs constitutes the threshold distinguishing between bald and not bald?

(Philosophers call this the Sorites or 'heap' paradox.)

The world ‘forgot’ how to make concrete from about 500-1300 AD, but was there a day when we could still make concrete, and a day in which we couldn’t? Something to think about on a Sunday afternoon!

2014 Excel implementation of some simple outlier detection techniques, by John Foreman http://au.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html

References on the above legal cases

1978 Statistics journal: http://www.jstor.org/discover/10.2307/2347159?uid=2&uid=4&sid=21103476515283

1953 Medical journal: http://link.springer.com/article/10.1007/BF02949756

Olden goldies: Cybernetic forests 1967

Richard Brautigan was an American author and poet who, in 1967's Summer of Love in San Francisco, published 'All Watched Over by Machines of Loving Grace', wishing for a future in which computers could save humans from drudgery (such as performing statistical operations by hand?).

Apart from the perkier PDP 'mini-computers', computers of Brautigan's day were hulking behemoths with more brawn than brain, and a scary dark side, as seen through HAL in 1968's '2001: A Space Odyssey'.

Brautigan's poem applied sweet 1960s kandy-green hues to these cold & clanging monsters, just a few years away from the friendly little Apple and other micros of the 70s & 80s. Now we are all linked on the web, and if we get tired of that we can talk to the electronic aide and confidante Siri, developed at Menlo Park, California – not too far away in space, if not time, from where Brautigan wrote.

We can get a glimpse of a ‘rosy’ future in the even friendlier electronic personality operating system of Spike Jonze’s new movie ‘Her’.

In his BBC documentaries of 2011, named after Brautigan’s poem, filmmaker Adam Curtis argues that computers have not liberated humanity much, if at all.

Yet there is still something about Richard Brautigan’s original 1967 poem, something still worth wishing for!

All three verses, further information and audio of Mr Brautigan reading his poem can be found at

http://www.brautigan.net/machines.html

CSIRAC, a real-life Australian digital dinosaur that stomped the earth from 1949 to 1964, and the only intact but dormant (hopefully!) first-generation computer left anywhere in the world, can be viewed on the lower ground floor of the Melbourne Museum.

(and yes, this computer was used for statistical analyses, as well as other activities, such as composing electronic music)

http://museumvictoria.com.au/csirac/

SecretSource: of Minitab and Dataviz

When the goers go and the stayers stay, when shirts loosen and tattoos glisten, it’s time for the statisticians and the miners and the data scientists to talk, and walk, Big Iron.

R. S-Plus. SAS. Tableau. Stata. GnuPlot. Mondrian. DataDesk. Minitab.   MINITAB?????? Okay, we’ll leave the others to get back to their arm wrasslin’, but if you want to produce high quality graphs, simply, readily and quickly, then Minitab could be for you.

A commercialized version of Omnitab, Minitab appeared at Penn State in 1972 and has long been associated with students learning stats, but also now with business, industrial and medical/health quality management, six sigma and the like. There are some other real 'rough and tumble' applications involving Minitab – DR Helsel's 'Statistics for Censored Environmental Data Using Minitab and R' (Wiley, 2012), for instance.

IBM SPSS and Microsoft Excel can produce good graphs ('good' in the 'good sense' of John Tukey, Edward Tufte, William Cleveland, Howard Wainer, Stephen Few, Nathan Yau etc.), with the soft pedal down and 'caution switches' on, but Minitab is probably going to be easier.

For example, the Statistical Consulting Centre at the University of Melbourne uses Minitab for most of its graphs (R for the trickiest ones). As well as general short courses on Minitab, R, SPSS and GenStat there’s a one day course in Minitab graphics in November, which I’ve done and can recommend.

More details on the Producing Excellent Graphics Simply (PEGS) course using Minitab at Melbourne are at

http://www.scc.ms.unimelb.edu.au/pegs.html

student and academic pricing for Minitab is at http://onthehub.com/

What, I wonder, would Florence Nightingale have used for graphics software if she were alive today?