statistics – Synergistic Statistical Consulting, Analysis, Arboronics

Wisdom of the Cloud

Many summers ago when I started out in the Craft, I could log onto the trusty DEC-20 literally anywhere in the world, and use SPSS or BMDP to analyse data. Nowadays, I have to have IBM SPSS or Stata installed on the right laptop or computer, and bring it with me, wherever I may roam, and wonder dreamily if I could just access my licensed stats packages from anywhere, like a library, a beach, a forest, a coffee shop.

One option would be to subscribe to a stats package in the Cloud! In terms of mainline stats packages, https://www.apponfly.com/en/ has R (free, plus 8 euros ($A12.08) per month for the platform), NCSS 10 at 18 euros ($A27.19) per month + platform, IBM SPSS 23 Base at 99 euros ($A149.54) ditto, and Standard (adds logistic regression, hierarchical linear modelling, survival analysis etc.) for 199 euros ($A300.59) per month + platform.

Another option, particularly if you’re more into six sigma / quality control type analyses, is Engineroom from http://www.moresteam.com at $US275 ($A378.55) per year.

Obviously, compare the prices against actually buying the software, but to be able to log in from anywhere, on different computers, and analyse data, sigh, it’s almost like the summer of ’85!

Hobart and Randomicity

Mona, lower case, is a great 50’s song by Bo Diddley, covered a few years later by the Rolling Stones on their first album.

MONA, upper case, standing for Museum of Old and New Art, is an amazing underground (literally) art gallery in Hobart, the capital of Tasmania, the island state of Australia.

Hobart is the second oldest state capital in Australia (after Sydney), was liked by both Mark Twain and Agatha Christie, and is the birthplace of Hollywood actor Errol Flynn (1909-1959), as well as the final resting place of the last thylacine or ‘Tasmanian Tiger’, a carnivorous marsupial, which died in captivity in 1936. Hobart is also where, in the mid 1930’s, Edward James George Pitman (1897-1993) developed randomization or permutation tests, which Sir Ronald Aylmer Fisher had also worked on. Permutation tests rely (these days) on computers, and don’t require reference to statistical arcana such as the Normal and Student’s t distributions, etc.

As shown by the late Julian Simon and more recently in that wonderful stats book that sounds like a law firm (Lock, Frazer Lock, Lock Morgan, Lock and Lock, 2012), permutation tests can also be easier for students to understand than the parametric alternatives.
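For the curious, here is a minimal Python sketch of a two-sample permutation test on a difference in means. The function name, the made-up scores and the choice of 10,000 random shuffles are all my own illustrative assumptions, not anything taken from Pitman, Simon or the Lock book. The idea is simply to shuffle the group labels many times and see how often chance alone produces a difference at least as big as the one observed.

```python
import random

def permutation_test(group_a, group_b, n_shuffles=10_000, seed=1936):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled values is the same as re-dealing the group labels
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Small tolerance so floating-point ties with the observed value count
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction: the observed arrangement is itself one permutation
    p_value = (extreme + 1) / (n_shuffles + 1)
    return observed, p_value

# Made-up scores, purely for illustration
a = [12.1, 14.3, 13.8, 15.2, 14.9]
b = [10.4, 11.9, 12.2, 11.1, 10.8]
obs, p = permutation_test(a, b)
print(f"observed difference = {obs:.2f}, permutation p = {p:.4f}")
```

No Normal or t distribution is consulted anywhere; the reference distribution is built from the data themselves.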

MONA itself is currently showing the movie ‘David Bowie Is’, a segment of which talks about the London singer’s use of the William Burroughs / Brion Gysin cut-up technique and later a computer program called Verbasizer, to randomly suggest combinations of particular words to aid in the creative song-writing process.

You may or may not be interested in randomicity, and the David Bowie movie may no longer be showing, but whether it’s out of the desire for adventure, curiosity, necessity or for purely random reasons, visit MONA and Hobart!!

Divisive Rules OK: Clustering #1

Back about 1965, while (whilst?) attending Primary (grade) School in a little northern Victorian town, lunchtimes would see a behaviour pattern in which, say, two boys would link arms and march around the boys’ playground chanting “join on the boys who want to play Chasey” or some other sport or game, and soon, unless something bizarre or boring was called out for, there’d be three, five, eight, ten etc until there were enough people for that particular game.

Now, let’s imagine a slightly more surreal version: a big group of boys, or girls, or indeed a mixture thereof, wanders around the playground. But what sport are they going to play? It’s unlikely (especially for our purposes here) they’ll all want to play the same thing, and even if they did, there may be too many, or some people may well be better suited to some other sport or game. If we were brave enough and if lunchtimes extended into infinity, we could try every possible way of splitting our big group into two or more smaller groups, as Edwards and Cavalli-Sforza showed in the general field of cluster analysis back in ’65.

Alternatively, we could pick out the single person most different from the rest of the main group M in terms of the game they wanted to play. That person (let’s call them Brian, after Brian Everitt, who wrote a great book on cluster analysis in several editions, and Brian Setzer as in Stray Cats and the Brian Setzer Orchestra, this being the unofficial Brian Setzer Summer) splits off from the group and forms a splinter group S. For each of the remaining members, we check whether, on average, they’re more dissimilar to the other members of M than to the members of S (i.e. Brian et al). If so, then they too join S.

Known as divisive clustering (the earlier “join on” syndrome is sorta kinda like agglomerative clustering, start off with individuals and group em together), this particular method was published in ’64 by Macnaughton-Smith. Described in Kaufman and Rousseeuw’s book as DIANA, with shades of a great steak sauce and an old song by Paul Anka, DIANA is available in R as part of the cluster package.
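A rough Python sketch of a single splinter-group split might look like the following. It is a minimal illustration only, not the cluster package’s DIANA code: the function names and the toy ‘keenness for Chasey’ scores are hypothetical, and I’ve assumed the common variant in which members move across one at a time, most eligible first, until nobody left in M is closer on average to S.

```python
def average(values):
    values = list(values)
    return sum(values) / len(values)

def splinter_split(dist):
    """Split one group into (main, splinter) using a symmetric dissimilarity matrix."""
    main = set(range(len(dist)))
    # "Brian": the member with the largest average dissimilarity to everyone else
    first = max(main, key=lambda i: average(dist[i][j] for j in main if j != i))
    splinter = {first}
    main.remove(first)
    while len(main) > 1:
        def gain(i):
            # Positive when i is, on average, closer to the splinter group
            # than to the other members of the main group
            to_main = average(dist[i][j] for j in main if j != i)
            to_splinter = average(dist[i][j] for j in splinter)
            return to_main - to_splinter
        candidate = max(main, key=gain)
        if gain(candidate) <= 0:
            break  # nobody left in M prefers S
        main.remove(candidate)
        splinter.add(candidate)
    return sorted(main), sorted(splinter)

# Hypothetical keenness-for-Chasey scores for seven children
scores = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8]
dist = [[abs(x - y) for y in scores] for x in scores]
main_group, splinter_group = splinter_split(dist)
print("M:", main_group, "S:", splinter_group)
```

The full method then repeats the same split on whichever group is currently the most spread out, until everyone is on their own.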

Now if you’ll excuse me, there’s a group looking for members to march down the road for a cold drink, on this hot Australian summer night! Once we get to the bar, the most dissimilar, perhaps a nondrinker, will split off, clusters will be formed, and through the night there may be re-splitting and re-joining of groups or cliques, as some go off to the pinball parlour, others to the pizza joint, while some return to the bar, all in the manner of another great clustering algorithm, Ball and Hall’s ISODATA.

Bottled Sources:

Ball GH, Hall DJ (1965). A novel method of data analysis and pattern classification. Technical Report, Stanford Research Institute, California.

Edwards AWF, Cavalli-Sforza LL (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt BS (1974 and more recent editions). Cluster analysis. Heinemann: London.

Kaufman L, Rousseeuw PJ (1990). Finding groups in data: an introduction to cluster analysis. Wiley: New York.

Macnaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964). Dissimilarity analysis: A new technique of hierarchical sub-division. Nature, 202, 1034-1035.

Who wrote what: Statistics and the Federalist

Stats is of course not just about numbers; it’s also often used to analyse words, even more so now with the explosion of social media in the past few years. But the General Inquirer, a system for the quantitative analysis of text from the late great Phil Stone of Harvard University, was developed back in the early 1960’s. A few years later, in 1964, the year the Ford Mustang and the Pontiac GTO pony/muscle cars were released, the late great Fred Mosteller and the great (and still with us) David Wallace published their book on the (mainly) Bayesian analysis of who wrote the Federalist Papers, a year after an introductory paper had appeared in the Journal of the American Statistical Association.

In the late 18th Century, three key figures in the foundation of the United States – Alexander Hamilton, John Jay and James Madison – wrote 85 newspaper articles to help ratify the American Constitution.

The papers were published anonymously, but scholars had figured out the authorship of all but twelve, not knowing for sure whether these had been written by Madison or Hamilton. The papers were written in a very formal, and very similar, style, and so Mosteller and Wallace turned to function words like “an”, “of” and “upon”, and particularly “while” and “whilst”, a researcher from back in 1916 having noticed that Hamilton tended towards the former, Madison the latter. Computers back in the 60’s were pretty slow, expensive and hard to come by. There weren’t any at Harvard, where Mosteller had recently established a Statistics Department, and so they had to use the one at MIT.

In Mosteller and Wallace’s own words, after the combined work of themselves and a huge band of helpers, they “tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers” using works of known authorship to conclude that Madison wrote all 12.
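To give a flavour of how function words can be turned into evidence, here is a deliberately simplified Python sketch. It counts three marker words in a snippet of text and adds up Poisson log odds for ‘Author A’ versus ‘Author B’. The per-thousand-word rates, the author labels and the sample sentence are all invented for illustration; they are not Mosteller and Wallace’s figures, and their actual analysis used many more words and more elaborate models.

```python
import math
import re
from collections import Counter

MARKERS = ["upon", "while", "whilst"]

def marker_counts(text):
    """Count each marker word and the total number of words in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in MARKERS)
    return {m: counts[m] for m in MARKERS}, len(words)

def poisson_log_odds(counts, n_words, rates_a, rates_b):
    """Log odds (A versus B), treating each word count as an independent Poisson.

    Rates are expected occurrences per 1000 words of text.
    """
    log_odds = 0.0
    for m in MARKERS:
        mu_a = rates_a[m] * n_words / 1000
        mu_b = rates_b[m] * n_words / 1000
        # Poisson log-likelihood ratio; the log(count!) terms cancel
        log_odds += counts[m] * math.log(mu_a / mu_b) - (mu_a - mu_b)
    return log_odds

# Invented rates per 1000 words, for illustration only
rates_a = {"upon": 3.0, "while": 0.3, "whilst": 0.05}
rates_b = {"upon": 0.2, "while": 0.05, "whilst": 0.5}

sample = ("Whilst the people deliberate, the government must act; "
          "and whilst it acts, it must answer to the people.")
counts, n_words = marker_counts(sample)
log_odds = poisson_log_odds(counts, n_words, rates_a, rates_b)
print(counts, n_words, round(log_odds, 2))
```

A negative total favours Author B, a positive one Author A; in a Bayesian analysis these log odds would be added to the prior log odds for each disputed paper.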

In 1984, M & W published a newer edition of their groundbreaking and highly readable book with a slightly different title, while a few years later, the late great Colin Martindale (with a Harvard doctorate) and I re-analysed the original data using Stone’s General Inquirer thematic dictionary as well as function words, and a type of kernel discriminant analysis / neural network, coming to the same conclusion.

Case closed? Not quite. It has recently been proposed that the disputed 12 papers were a collaboration. A summary of the evidence, and some other citations to classical & recent quantitative Federalist research, are available here:
http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/the-twelve-disputed-federalist-papers-a-case-for-collaboration/

Either way, when you’re getting a bit jaded with numbers, and the 0’s are starting to look like o’s, analyse text!

Further/Future reading

Mosteller F, Wallace DL (1964) Inference and disputed authorship, the Federalist. Addison-Wesley.

McGrayne, SB (2011) The theory that would not die: how Bayes’ rule cracked the Enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy. Yale University Press.

Martindale C, McKenzie D (1995) On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.

Stone PJ, Bales RF, Namenwirth JZ, Ogilvie DM (1962). The General Inquirer: a computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science, 7, 484-498.

“AIC/BIC not p” : comparing means using information criteria

A basic principle in Science is that of parsimony, or reducing complexity where possible, as typified in the application of Occam’s Razor.

William of Occam (or Ockham), a philosopher monk named after the English town that he came from, said something to the effect of ‘pluralitas non est ponenda sine necessitate’ (‘plurality should not be posited without necessity’). In other words, don’t increase, beyond what is necessary, the number of entities needed to explain something.

Occam’s Razor doesn’t necessarily mean that ‘less is always better’; it merely suggests that more complex models shouldn’t be used unless required, for example to increase model performance. As the saying commonly, but probably mistakenly, attributed to Albert Einstein goes, ‘everything should be made as simple as possible, but not simpler’.

Common methods of measuring performance or ‘bang’, taking into account the cost, complexity or ‘buck’, are the Akaike Information Criterion (AIC), Bayesian or Schwarz Information Criterion (BIC), Minimum Message Length (MML) and Minimum Description Length (MDL).

Unlike the standard AIC, the latter three techniques take sample size into account, while MDL and MML also take the precision of the model estimates into account, but let’s just keep to the comparatively simpler AIC/BIC here.

An excellent new book by Thom Baguley, ‘Serious Stats’ (serious in this case meaning powerful rather than scary) http://seriousstats.wordpress.com/ shows how to do a t-test using AIC/BIC in SPSS and R.

I’ll do it here using Stata regression, the idea being to compare a null model (e.g. just the constant) with a model including the group. In this case we’re looking at the difference in headroom between American and ‘Foreign’ cars in 1978 (well, it’s Thursday night!).

Here are the t-test results:

(1978 Automobile Data)
-------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.    [95% Conf. Interval]
---------+---------------------------------------------------------------------
Domestic |      52    3.153846     .1269928    .9157578    2.898898    3.408795
 Foreign |      22    2.613636      .103676    .4862837     2.39803    2.829242

Domestic has a slightly bigger mean headroom (but also larger variation!). The p value is 0.011, indicating that the probability of getting a difference in means as large as or larger than the one above (0.540), IF the null hypothesis that the population means are actually identical holds, is around 1 in 100.
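As a quick check, the t statistic behind that p value can be rebuilt from the summary table alone. The Python sketch below assumes the usual equal-variance (pooled) two-sample t-test was the one reported; the function name is my own. Turning the statistic into a p value needs the t distribution on 72 degrees of freedom, which isn’t in Python’s standard library, so that step is left to your stats package.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df

# Obs, Mean and Std. Dev. for Domestic and Foreign, copied from the table above
t_stat, df = pooled_t(52, 3.153846, 0.9157578, 22, 2.613636, 0.4862837)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```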

Using the method shown in Dr Thom’s book (Stata implementation on my Downloadables page) we get

Akaike’s information criterion and Bayesian information criterion

----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df         AIC         BIC
-------------+--------------------------------------------------------------
   nullmodel |     74   -92.12213   -92.12213      1    186.2443    188.5483
  groupmodel |     74   -92.12213   -88.78075      2    181.5615    186.1696

AIC and BIC values are lower for the model including group, suggesting in this case that the increase in complexity (the two groups) is repaid by a commensurate increase in performance (i.e. we need to take into account the two group means for US and non-US cars, rather than assuming there’s just one common mean, or universal headroom).
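For anyone wanting to see where these numbers come from, the Python sketch below rebuilds the log-likelihoods, AIC and BIC from the t-test summary statistics. It assumes the usual definitions, AIC = 2k - 2ll and BIC = k ln(n) - 2ll, with k counting only the mean parameters (matching the df column), and a Gaussian log-likelihood with the variance estimated as the residual sum of squares over n. The function names are mine, and this is not the Stata implementation mentioned above.

```python
import math

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood of a least-squares fit (variance = rss / n)."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik

# Summary statistics from the t-test table
n1, mean1, sd1 = 52, 3.153846, 0.9157578
n2, mean2, sd2 = 22, 2.613636, 0.4862837
n = n1 + n2

# Residual sum of squares around the two group means (group model) ...
rss_group = (n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
# ... and around the single grand mean (null model)
rss_null = rss_group + n1 * n2 / n * (mean1 - mean2) ** 2

results = {}
for name, rss, k in [("nullmodel", rss_null, 1), ("groupmodel", rss_group, 2)]:
    ll = gaussian_loglik(rss, n)
    results[name] = (ll, aic(ll, k), bic(ll, k, n))
    print(f"{name:>10}  ll = {ll:.4f}  AIC = {aic(ll, k):.4f}  BIC = {bic(ll, k, n):.4f}")
```

BIC’s penalty per extra parameter is ln(74), a bit over 4, against AIC’s flat 2, which is why the gap between the two models is narrower on BIC than on AIC.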

Of course, things get a little more complex when comparing several means, having different variances etc. (as the example above actually does, although the means are still “significantly” different when the difference in variances is taken into account using a separate variance t-test). Something to think about, and more info on applying AIC/BIC to a variety of statistical methods can be found in the refs below, particularly 3 and 5.

Further Reading (refs 2,3,4 and 5 are the most approachable, with Thom Baguley’s book referred to above, more approachable still)

1. Akaike, H., A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974. 19: p. 716-723.
2. Anderson, D.R., Model based inference in the life sciences: a primer on evidence. 2007, New York: Springer.
3. Dayton, C.M., Information criteria for the paired-comparisons problem. American Statistician, 1998. 52: p. 144-151.
4. Forsyth, R.S., D.D. Clarke, and R.L. Wright, Overfitting revisited: an information-theoretic approach to simplifying discrimination trees. Journal of Experimental and Theoretical Artificial Intelligence, 1994. 6: p. 289-302.
5. Sakamoto, Y., M. Ishiguro, and G. Kitagawa, Akaike information criterion statistics. 1986, Dordrecht: D. Reidel.
6. Schwarz, G., Estimating the dimension of a model. Annals of Statistics, 1978. 6: p. 461-464.
7. Wallace, C.S., Statistical and inductive inference by minimum message length. 2005, New York: Springer.
8. Wallace, C.S. and D.M. Boulton, An information measure for classification. Computer Journal, 1968. 11: p. 185-194.


Hot Cross Buns: How Much Bang for the Buck?

Good Friday and Easter Monday are public holidays in Australia and the UK (Good Friday is also a holiday in 12 US states). For many down here, including those who don’t pay much nevermind to symbols, Good Friday is traditionally the day to eat Hot Cross Buns. For the last few years, the Melbourne Age newspaper has rated a dozen such buns for quality, as well as listing their price.

http://www.goodfood.com.au/good-food/search.html?ss=Good+Food&text=bunfight&type=

We would expect that quality would increase, to some extent, with price, although it would eventually flatten out (e.g. thrice as expensive doesn’t always mean thrice as good). Graphing programs such as Graphpad, Kaleidagraph and SigmaPlot, as well as R and most Stats packages, can readily fit a plethora of polynomial and other nonlinearities, but I used Stata to perform a preliminary scatterplot of the relationship between tasters’ score (out of 10) and price per bun (A$), smoothed using Bill Cleveland’s locally weighted least squares Lowess/Loess algorithm. http://en.wikipedia.org/wiki/Lowess

The relationship shown above is vaguely linear or, rather, ‘monotonic’, at least until I can have a better go with some nonlinear routines.

A simple linear regression model accounts for around 42% of the variation in taste, in this small and hardly random sample, returning the equation y=1.71*unitprice+1.98, suggesting (at best) that subjective taste, not necessarily representing anyone in particular, increases by 1.7 with every dollar increase in unit price.

But the fun really begins when looking at the residuals, the difference between the actual taste score, and that predicted using the above model. Some buns had negative residuals, indicating (surprise surprise!) that their taste was (much) lower than expected, given their price. I won’t mention the negatives.

As to the positives, buns from two bakeries, Woodfrog Bakery in St. Kilda (Melbourne, Australia) and Candied Bakery in Spotswood (ditto), cost $2.70 each and so were predicted to have a taste score out of 10 of 6.6, yet Woodfrog hopped in with an actual score of 8.5 and Candied with an actual score of 8.
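The arithmetic is easy to check. This small Python sketch plugs the $2.70 price into the fitted line quoted above (y = 1.71*unitprice + 1.98) and subtracts the prediction from each actual score; the function and variable names are just my own labels.

```python
def predicted_score(unit_price, slope=1.71, intercept=1.98):
    """Taste score predicted by the fitted line y = 1.71 * unitprice + 1.98."""
    return slope * unit_price + intercept

# (price per bun in A$, actual taste score out of 10), as quoted in the post
buns = {"Woodfrog Bakery": (2.70, 8.5), "Candied Bakery": (2.70, 8.0)}

residuals = {}
for bakery, (price, score) in buns.items():
    # Residual = actual score minus the score predicted from price alone
    residuals[bakery] = score - predicted_score(price)
    print(f"{bakery}: predicted {predicted_score(price):.1f}, "
          f"actual {score}, residual {residuals[bakery]:+.1f}")
```

A positive residual here means a bun that tasted better than its price alone would suggest.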

The results can’t be generalised, prove nothing at all, and mean extremely little, except to suggest that regression residuals can perhaps be put to interesting uses, but please take care in trying this at home! Tread softly and carry a big (regression) book, e.g. Tabachnick and Fidell’s Using Multivariate Statistics

(or the Manga Guide to Regression, when published! http://www.nostarch.com/mg_regressionanalysis.htm)

Statistics of Social Spiders: postscript

Some new information has just scuttled across my desk!

It seems that there is in fact a type of social spider, and a recent paper in the Proceedings of the Royal Society has looked at the role of social interaction amongst these creatures (Stegodyphus mimosarum).

http://www.the-scientist.com/?articles.view/articleNo/39577/title/Behavior-Brief/

This is all very well, but has someone counted these spiders up and ascertained whether the counts fit a Poisson distribution (i.e. are these spiders Poisson-ous!!!) or some other distribution? And what are their views on competing sundew plants?

Laskowski KL, Pruitt JN (March, 2014). Evidence of social niche construction: persistent and repeated social interactions generate stronger personalities in a social spider. Proc Royal Society B.