Hovercrafts and Aunties: Learning Statistics as a Foreign Language

To many who are not members of our Craft, and even some who are, Statistics is something of a Foreign Language: difficult to grasp without a good understanding of its grammar, or at least a whole swag of useful rules.

Stats is also difficult to teach: note our students' look of bored angst when we try to explain p values.

So could we teach Stats like a foreign language?

For starters, why don’t we teach statistical ‘tourists’/’travellers’/’consumers’ some useful ‘phrases’ they can actually use, like how to read Excel files into a stats package, how to do a box plot, how to check for odd values, or how to do some basic recodes?

Such things rarely appear in texts. Instead, we tumble about teaching the statistical equivalent of ‘the pen of my aunt is on the table’ or ‘my hovercraft is full of eels’ (Monty Python), or ‘a wolverine is eating my leg’ (Tim Cahill).

For example, as well as assuming that the data are all clean and ready to go, why do stats books persist in showing how to read in a list of 10 or so numbers, rather than reading in an actual file?
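
To make that concrete, here is roughly what a few of those useful ‘phrases’ look like in R (a minimal sketch: the file and variable names are made up for illustration, and the readxl package is assumed to be installed):

```r
# Read a real Excel file, eyeball the variables, and check for odd values
library(readxl)                                        # install.packages("readxl") if needed
surveys <- read_excel("surveys_2014.xlsx", sheet = 1)  # hypothetical file
summary(surveys)                                       # quick look at ranges and missing values
boxplot(surveys$age, main = "Age", ylab = "Years")     # any odd values sticking out?
surveys$age[surveys$age > 120] <- NA                   # a basic recode: impossible ages become missing
```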

Just as human languages may or may not share universal concepts, the same may apply to stats packages. The objects of R, for example, are very succinct in conception, but very difficult to explain.

Such apparent lack of universality may be why English borrows words like ‘gourmand’ (to cite from my own book chapter), as English doesn’t otherwise have a word for a person who eats for pleasure. Similarly, courgette/zucchini sounds better than baby marrow (and have you ever seen how big they can actually grow?).

Yet it’s a two-way street, with English lending words to other languages, such as ‘weekend’.

According to the old Sapir-Whorf hypothesis, language precedes or at least shapes thought (but see John McWhorter’s 2014 book The Language Hoax), so if there’s no word for something, it’s supposedly hard to think about it.

In stats-package terms, instructors would somehow have to explain that it is very easy to extract and store, say, correlation values in R for further processing (putting smiley faces beside large ones, etc.). But in SPSS and SAS we would normally have to use OMS/ODS, and think in terms of capturing information that would otherwise be displayed on a line printer. This is a difficult concept to explain to anyone under 45 or so!
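
To make the R side concrete, here is a minimal sketch (using R’s built-in mtcars data, with a purely decorative ‘smiley’ flag) of what extracting and storing correlations for further processing looks like; in SPSS or SAS the equivalent would first involve capturing the output via OMS or ODS:

```r
# Correlations come back as an ordinary object: compute it, keep it, post-process it
r   <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])    # correlation matrix stored in r
big <- abs(r) > 0.8 & r != 1                          # flag the large ones (off-diagonal)
ifelse(big, paste0(round(r, 2), " :)"), round(r, 2))  # smiley faces beside the big values
```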

Although there are many great books on learning stats packages (something for a later post), and I myself can ‘speak’ SPSS almost like a native after 33 years, I only know a few words of other human languages (and, funnily enough, only a few “words” of R).

If you’ll excuse me, my aunt and her pen are now going for a ride on a hovercraft.
(I hope there are no eels!)

REFS

Sapir-Whorf Hypothesis

https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Sapir%E2%80%93Whorf_hypothesis.html

Counter to the Sapir-Whorf Hypothesis

http://global.oup.com/academic/product/the-language-hoax-9780199361588?cc=au&lang=en

Hovercraft, Gourmands and Stats Packages

McKenzie D (2013) Chapter 14: ‘Statistics and the Computer’ in
http://www.elsevierhealth.com.au/epidemiology-and-public-health/vital-statistics-paperbound/9780729541497/

 

Who wrote what: Statistics and the Federalist

Stats is of course not just about numbers; it’s also often used to analyse words, even more so now with the explosion of social media in the past few years. But the late great Phil Stone of Harvard University developed the General Inquirer for the quantitative analysis of text back in the early 1960s. A few years later, in 1964 (the year of the Ford Mustang and the Pontiac GTO pony/muscle cars), the late great Fred Mosteller and the great (and still with us) David Wallace published their book on the (mainly) Bayesian analysis of who wrote the Federalist Papers, a year after an introductory paper had appeared in the Journal of the American Statistical Association.

In the late 18th Century, three key figures in the foundation of the United States (Alexander Hamilton, John Jay and James Madison) wrote 85 newspaper articles to help ratify the American Constitution.

The papers were published anonymously, but scholars had figured out the authorship of all but twelve, not knowing for sure whether these had been written by Madison or Hamilton. The papers were written in a very formal, and very similar, style, and so Mosteller and Wallace turned to function words like “an”, “of” and “upon”, and particularly “while” and “whilst”, a researcher back in 1916 having noticed that Hamilton tended towards the former, Madison the latter. Computers back in the 60s were pretty slow, expensive and hard to come by; there weren’t any at Harvard, where Mosteller had recently established a Statistics Department, and so they had to use the one at MIT.
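
A minimal sketch of the core idea in R (the file names are hypothetical, and a real analysis uses many more marker words and a proper probability model rather than raw rates):

```r
# Rate per 1,000 words of a given marker word in a plain-text file
rate_per_1000 <- function(path, word) {
  txt   <- tolower(paste(readLines(path, warn = FALSE), collapse = " "))
  words <- unlist(strsplit(txt, "[^a-z]+"))
  words <- words[words != ""]
  1000 * sum(words == word) / length(words)
}

# Compare 'while' versus 'whilst' in papers of known authorship (hypothetical files)
sapply(c(hamilton = "hamilton_known.txt", madison = "madison_known.txt"),
       function(f) c(while_rate  = rate_per_1000(f, "while"),
                     whilst_rate = rate_per_1000(f, "whilst")))
```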

In Mosteller and Wallace’s own words, after the combined work of themselves and a huge band of helpers, they “tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers” using works of known authorship to conclude that Madison wrote all 12.

In 1984, M & W published a newer edition of their groundbreaking, and highly readable book with a slightly different title, while a few years later, the late great Colin Martindale (with a Harvard doctorate) and myself re-analysed the original data using Stone’s General Inquirer thematic dictionary as well as function words, and a type of kernel discriminant analysis / neural network, coming to the same conclusion.

Case closed? Not quite. It has recently been proposed that the twelve disputed papers were a collaboration; a summary of the evidence, and some other citations to classical and recent quantitative Federalist research, is available here:
http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/the-twelve-disputed-federalist-papers-a-case-for-collaboration/

Either way, when you’re getting a bit jaded with numbers, and the 0’s are starting to look like o’s, analyse text!

Further/Future reading

Mosteller F, Wallace DL (1964) Inference and disputed authorship: The Federalist. Addison-Wesley.

McGrayne, SB (2011) The theory that would not die: how Bayes’ rule cracked the Enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy. Yale University Press.

Martindale C, McKenzie D (1995) On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.

Stone PJ, Bales RF, Namenwirth JZ, Ogilvie DM (1962). The General Inquirer: a computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science, 7, 484-498.

“AIC/BIC not p”: comparing means using information criteria

A basic principle in Science is that of parsimony, or reducing complexity where possible, as typified in the application of Occam’s Razor.

William of Occam (or Ockham), a philosopher monk named after the English town that he came from, said something to the effect of ‘pluralitas non est ponenda sine necessitate’ (‘plurality should not be posited without necessity’). In other words, don’t increase, beyond what is necessary, the number of entities needed to explain something.

Occam’s Razor doesn’t necessarily mean that ‘less is always better’; it merely suggests that more complex models shouldn’t be used unless required, to increase model performance, for example. As the saying commonly, but probably mistakenly, attributed to Albert Einstein goes: ‘everything should be made as simple as possible, but not simpler’.

 

Common methods of measuring performance or ‘bang’, taking into account the cost, complexity or ‘buck’, are the Akaike Information Criterion (AIC), Bayesian or Schwarz Information Criterion (BIC), Minimum Message Length (MML) and Minimum Description Length (MDL).

Unlike the standard AIC, the latter three techniques take sample size into account, while MDL and MML also take the precision of the model estimates into account. But let’s keep to the comparatively simpler AIC and BIC here.
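
For the record, here is what the two penalties look like (a small sketch in R, where ll is the maximised log-likelihood, k the number of estimated parameters and n the sample size):

```r
# AIC charges a flat 2 per extra parameter;
# BIC's charge grows with log(n), which is how sample size enters
aic <- function(ll, k)    -2 * ll + 2 * k
bic <- function(ll, k, n) -2 * ll + log(n) * k
```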

An excellent new book by Thom Baguley, ‘Serious Stats’ (serious in this case meaning powerful rather than scary), http://seriousstats.wordpress.com/, shows how to do a t-test using AIC/BIC in SPSS and R.

I’ll do it here using Stata regression, the idea being to compare a null model (e.g. just the constant) with a model including the group. In this case we’re looking at the difference in headroom between American and ‘Foreign’ cars in 1978 (well, it’s Thursday night!).

Here are the t-test results:

(1978 Automobile Data)
Group    |  Obs      Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+----------------------------------------------------------------
Domestic |   52  3.153846    .1269928    .9157578    2.898898    3.408795
Foreign  |   22  2.613636    .103676     .4862837    2.39803     2.829242

 

Domestic cars have a slightly bigger mean headroom (but also larger variation!). The p value is 0.011, indicating that the probability of getting a difference in means as large as or larger than the one above (0.540), IF the null hypothesis that the population means are actually identical holds, is around 1 in 100.
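
As a quick check, the quoted p value can be reproduced in R from just the summary statistics above (an ordinary pooled-variance two-sample t-test):

```r
# Reproduce the p value from the group means, SDs and sample sizes shown above
m <- c(3.153846, 2.613636); s <- c(0.9157578, 0.4862837); n <- c(52, 22)
sp2  <- sum((n - 1) * s^2) / (sum(n) - 2)        # pooled variance
tval <- (m[1] - m[2]) / sqrt(sp2 * sum(1 / n))   # t statistic, about 2.61
2 * pt(-abs(tval), df = sum(n) - 2)              # two-sided p, about 0.011
```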

 

Using the method shown in Dr Thom’s book (Stata implementation on my Downloadables page) we get

 

Akaike’s information criterion and Bayesian information criterion

Model      |  Obs    ll(null)   ll(model)   df        AIC        BIC
-----------+----------------------------------------------------------
nullmodel  |   74  -92.12213   -92.12213     1   186.2443   188.5483
groupmodel |   74  -92.12213   -88.78075     2   181.5615   186.1696

 

AIC and BIC values are lower for the model including group, suggesting in this case that the increase in complexity (the two groups) also commensurately increases performance (i.e. we need to take into account the two group means for US and non-US cars, rather than assuming there’s just one common mean, or universal headroom).
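
Incidentally, the AIC and BIC values follow directly from the log-likelihoods in the table (note that packages differ in whether the residual variance is counted as a parameter, so the absolute values can differ slightly from package to package, although the comparison between models stays the same):

```r
# Reproduce the table's AIC/BIC from its own log-likelihoods (n = 74 cars)
ll <- c(null = -92.12213, group = -88.78075)
k  <- c(null = 1, group = 2)
n  <- 74
-2 * ll + 2 * k        # AIC: 186.2443 181.5615
-2 * ll + log(n) * k   # BIC: 188.5483 186.1696
```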

Of course, things get a little more complex when comparing several means, when variances differ, etc. (as the example above actually does, although the means are still “significantly” different when the difference in variances is taken into account using a separate-variance t-test). Something to think about; more information on applying AIC/BIC to a variety of statistical methods can be found in the references below, particularly 3 and 5.

 

Further Reading (refs 2, 3, 4 and 5 are the most approachable, with Thom Baguley’s book, referred to above, more approachable still)

      1. Akaike, H., A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974. 19: p. 716-723.
      2. Anderson, D.R., Model based inference in the life sciences: a primer on evidence. 2007, New York: Springer.
      3. Dayton, C.M., Information criteria for the paired-comparisons problem. American Statistician, 1998. 52: p. 144-151.
      4. Forsyth, R.S., D.D. Clarke, and R.L. Wright, Overfitting revisited: an information-theoretic approach to simplifying discrimination trees. Journal of Experimental and Theoretical Artificial Intelligence, 1994. 6: p. 289-302.
      5. Sakamoto, Y., M. Ishiguro, and G. Kitagawa, Akaike information criterion statistics. 1986, Boston, MA: Dordrecht.
      6. Schwarz, G., Estimating the dimension of a model. Annals of Statistics, 1978. 6: p. 461-464.
      7. Wallace, C.S., Statistical and inductive inference by minimum message length. 2005, New York: Springer.
      8. Wallace, C.S. and D.M. Boulton, An information measure for classification. Computer Journal, 1968. 11: p. 185-194.

 


Minitab 17: think Mini Cooper, not Minnie Mouse

As it has been 3 or 4 years since the previous version, the new release of the Minitab 17 statistical package is surely cause for rejoicing, merriment, and an extra biscuit with a strong cup of tea.

At one of the centres where I work, the data analysts sit at the same lunch table but are known by their packages: the Stata people, the SAS person, the R person, the SPSS person and so on. No Minitab person as yet, but maybe there should be: not only for its easy-to-use graphics, mentioned in a previous post, but for its all-round interface, its programmability (Minitab syntax looks a little like BASIC, that great Kemeny-Kurtz language from 1964 Dartmouth College, but more powerful), a few new features (Poisson regression for relative risks and counted data, although alas no negative binomial regression for trickier counted data), and even better graphics.

There are bubble plots, outlier tests and the Box-Cox transformation (another great collaboration from 1964), and Minitab was also one of the first packages to include Exploratory Data Analysis (e.g. box plots and smoothed regression), for when the data are about as well behaved as the next-door neighbours strung out on espresso coffee mixed with red cordial.

Not as much cachet for when the R and SAS programmers come a-swaggering in, but still worth recommending for those who may not be getting as much as they should be out of SPSS, particularly for graphics, yet find the other packages a little too high to climb.

http://www.minitab.com/en-us/

Resulting Consulting: Excel for Stats – 800 pound Gorilla or just Monkeying around?

When hearing of folks running statistical analyses with Excel, statisticians often have panicky images of ‘Home Haircutting, with Electric Shears, in the Wet’!

Mind you, Excel really is great for processing data, but analysing it in a more formal or even exploratory sense can be a trifle tricky.

On the upside, many work computers have Excel installed, it’s readily available for quite a low price even if one is not a student or an academic, and for the most part it is well designed and simple to use. It’s very easy to develop a spreadsheet that shows each individual calculation needed for a particular formula, the standard deviation, for instance. Such flexibility is wonderful for learning and teaching stats, because everyone can see the steps involved in actually getting an answer, more so than with the usual button pressing, window clicking or typing of ‘esoteric’ commands.
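
The same step-by-step transparency can be spelt out in a stats package too; here’s a minimal sketch in R (with made-up numbers) of the spreadsheet-style ‘one column per step’ calculation of a standard deviation:

```r
# Each step a teaching spreadsheet would show as a separate column
x      <- c(4, 8, 6, 5, 3, 7)        # made-up data
dev    <- x - mean(x)                # deviations from the mean
sq_dev <- dev^2                      # squared deviations
sqrt(sum(sq_dev) / (length(x) - 1))  # sample standard deviation; same as sd(x)
```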

On the downside, pre-2010 versions of Excel had both practical accuracy issues (with functions and the add-in Analysis ToolPak) and validity issues (they employed unusual methods for things like handling ties in ranked data). There are still no nonparametric tests (e.g. Wilcoxon), and Excel is still a bit light on confidence intervals, regression diagnostics, and production, shop-floor-type statistical analyses. More of an adjustable wrench than a set of spanners?

In sum, if used wisely, Excel is a useful adjunct to third-party statistical add-ins or statistical packages, but please avoid pie charts, especially 3D ones, and watch out for those banana skins…

**Excel 2010 (& Gnumeric & OpenOffice) Accuracy / Validity**

http://www.tandfonline.com/doi/abs/10.1198/tas.2011.09076#.UvH4rp24a70

http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf

**Some Excel Statistics Books**

Conrad Carlberg http://www.quepublishing.com/store/statistical-analysis-microsoft-excel-2013-9780789753113

Mark Gardener http://www.pelagicpublishing.com/statistics-for-ecologists-using-r-and-excel-data-collection-exploration-analysis-and-presentation.html

Neil Salkind http://www.sagepub.com/books/Book236672?siteId=sage-us&prodTypes=any&q=salkind&fs=1

**Some Statistical Add-Ins for Excel**

Analyse-It http://analyse-it.com     DataDesk/XL http://www.datadesk.com

RExcel (interfaces Excel to open source R) http://rcom.univie.ac.at/

XLStat http://www.xlstat.com/en/

**Some Open Source Spreadsheets**

Gnumeric https://projects.gnome.org/gnumeric/  OpenOffice http://www.openoffice.org.au/

Olden goldies: Cybernetic forests 1967

Richard Brautigan was an American author and poet who, in 1967’s Summer of Love in San Francisco, published ‘All Watched Over by Machines of Loving Grace’, wishing for a future in which computers could save humans from drudgery (such as performing statistical operations by hand?).

Apart from the perkier PDP ‘mini-computers’, computers of Brautigan’s day were hulking behemoths with more brawn than brain, and a scary dark side, as seen through HAL in 1968’s ‘2001: A Space Odyssey’.

Brautigan’s poem applied sweet 1960s kandy-green hues to these cold and clanging monsters, just a few years away from friendly little Apple and the other micros of the 70s and 80s. Now we are all linked on the web, and if we get tired of that we can talk to the electronic aide and confidante Siri, developed at Menlo Park, California, not too far away in space, if not time, from where Brautigan wrote.

We can get a glimpse of a ‘rosy’ future in the even friendlier electronic personality operating system of Spike Jonze’s new movie ‘Her’.

In his BBC documentaries of 2011, named after Brautigan’s poem, filmmaker Adam Curtis argues that computers have not liberated humanity much, if at all.

Yet there is still something about Richard Brautigan’s original 1967 poem, something still worth wishing for!

All three verses, further information and audio of Mr Brautigan reading his poem can be found at

http://www.brautigan.net/machines.html

CSIRAC, a real-life Australian digital dinosaur that stomped the earth from 1949 to 1964, and the only intact but dormant (hopefully!) first-generation computer left anywhere in the world, can be viewed on the lower ground floor of the Melbourne Museum.

(and yes, this computer was used for statistical analyses, as well as other activities, such as composing electronic music)

http://museumvictoria.com.au/csirac/

SecretSource: of Minitab and Dataviz

When the goers go and the stayers stay, when shirts loosen and tattoos glisten, it’s time for the statisticians and the miners and the data scientists to talk, and walk, Big Iron.

R. S-Plus. SAS. Tableau. Stata. GnuPlot. Mondrian. DataDesk. Minitab.   MINITAB?????? Okay, we’ll leave the others to get back to their arm wrasslin’, but if you want to produce high quality graphs, simply, readily and quickly, then Minitab could be for you.

A commercialized version of Omnitab, Minitab appeared in Pennsylvania in 1972 and has long been associated with students learning stats, but also now with business, industrial and medical/health quality management, six sigma, and so on. There are some other real ‘rough and tumble’ applications involving Minitab as well, for instance D. R. Helsel’s ‘Statistics for Censored Environmental Data Using Minitab and R’ (Wiley 2012).

IBM SPSS and Microsoft Excel can produce good graphs (‘good’ in the ‘good sense’ of John Tukey, Edward Tufte, William Cleveland, Howard Wainer, Stephen Few, Nathan Yau, etc.), with the soft pedal down and the ‘caution switches’ on, but Minitab is probably going to be easier.

For example, the Statistical Consulting Centre at the University of Melbourne uses Minitab for most of its graphs (R for the trickiest ones). As well as general short courses on Minitab, R, SPSS and GenStat, there’s a one-day course in Minitab graphics in November, which I’ve done and can recommend.

More details on the Producing Excellent Graphics Simply (PEGS) course using Minitab at Melbourne are at

http://www.scc.ms.unimelb.edu.au/pegs.html

Student and academic pricing for Minitab is at http://onthehub.com/

What, I wonder, would Florence Nightingale have used for graphics software if she were alive today???