stata – Synergistic Statistical Consulting, Analysis, Arboronics

Wisdom of the Cloud

Many summers ago when I started out in the Craft, I could log onto the trusty DEC-20 literally anywhere in the world, and use SPSS or BMDP to analyse data. Nowadays, I have to have IBM SPSS or Stata installed on the right laptop or computer, and bring it with me, wherever I may roam, and wonder dreamily if I could just access my licensed stats packages from anywhere, like a library, a beach, a forest, a coffee shop.

One option would to subscribe to a stats package in the Cloud! Iin terms of main line stats packages, https://www.apponfly.com/en/ has R (free plus 8 euro’s ($A12.08) per month for platform, NCSS 10 at 18/27.19 per month + platform, IBM SPSS 23 Base 99/149.54 ditto and Standard (adds logistic regression, hierarchical linear modelling, survival analysis etc) for 199/300.59 per month + platform.

Another option, particularly if you’re more into six sigma / quality control type analyses, is Engineroom from http://www.moresteam.com at $US275 ($A378.55) per year.

Obviously, compare the prices against actually buying the software , but to be able to log in from anywhere, on different computers, and analyse data, sigh, it’s almost like the summer of ’85!

Snappy Stepwise Regression

Stepwise regression, the technique that attempts to select a smaller subset of variables from a larger set by at each step choosing the ‘best’ or dropping the ‘worst’ was developed back in the late 1950’s by applied statisticians in the petroleum and automotive industries. With an ancestry like this, there’s no wonder that it is often regarded as the statistical version of the early 60’s Chev Corvair, at best only ‘driveable’ by expert careful users, or in Ralph Nader’s immortal words and title of his 1966 book ‘Unsafe at Any Speed’.

Well maybe. But if used with cross-validation and good sense, it’s an old-tech standby to later model ‘lasso’ and ‘elastic net’ techniques. However, there’s an easy way for a bit of a softshoe shuffle of the old stepwise routine. See how well (preferably on a fresh set of data) forward entry with just one or, maybe two, or at most three variables do, compared with larger models. (SAS and SPSS allow the number of steps to be specified).

Of if you’d like to do some slightly fancier steps it in twotone spats, try a best subset regression (available in SAS, and SPSS through automatic linear, and Minitab and R etc), of all one variable combinations, two variables, three variables.

The inspiration for this is partly from Gerd Gigerenzer’s ‘take the best’ heuristic, taking the best cue or clue often beats more complex techniques including multiple regression etc. ‘Take the best’ is described in Prof Gigerenzer’s great new general book Risk Savvy: How to Make Good Decisions (Penguin, 2014) http://www.penguin.co.uk/nf/Book/BookDisplay/0,,9781846144745,00.html as well as his earlier academic books such as Simple Heuristics That Make Us Smart (Oxford University Press, 1999)

See if a good little model can do as well as a good (or bad) big ‘un!.

Further Future Reading

Draper NR, Smith H (1966) Applied regression analysis (and later editions). Wiley: New York.

John and Betty’s Journey into Statistics Packages*

In past days of our lives, those who wanted to learn a stats package, would attend courses, and bail up/bake cakes for statisticians, but would mainly raise the drawbridge, lock the computer lab door and settle down with the VT100 terminal or Apple II or IBM PC and a copy of the brown or update blue SPSS Manual, or whatever.

Nowadays, folks tend to look things up on the web, something of a mixed blessing, and so maybe software consultants will now say LIUOTFW (‘Look It Up On The Flipping Web’) rather than the late, great RYFM (‘Read Your Flipping Manual’).

And yes, there are some great websites, and great online documentation supplied by the software venders, but there are also some great books, available in electronic and print form. A list of three of the many wonderful texts available for each package (IBM SPSS, SAS, Stata, R and Minitab) can be downloaded from the Downloadables section on this site.

IBM SPSS (in particular), R (ever growing), and to a slightly lesser extent SAS, seem to have the best range of primers and introductory texts.
IMHO though, Stata could do with a new colourful, fun primer (not necessarily a Dummies Guide, although there’s Roberto Pedace’s Econometrics for Dummies (Wiley, New York, 2013) which features Stata), perhaps one by Andy Field, who has already done superb books on SPSS, R and SAS.

While up on the soapbox, I reckon Minitab could do with a new primer for Psychologists / Social Scientists, much like that early ripsnorter by Ray Watson, Pip Pattison and Sue Finch, Beginning Statistics for Psychology (Prentice Hall, Sydney, 1993).

Anyway, in memories of days gone by, brew a pot of coffee or tea, unplug email, turn off the phone and the mobile/cell, and settle in for an initial night’s journey, on a set or two of real and interesting data, with a good stats package book, or two!

*(The title of this post riffs off the improbably boring and stereotyped 1950’s early readers still used in Victorian primary (grade) schools in the 1960’s
http://nla.gov.au/nla.aus-vn4738114 (think Dick and Jane, or Alice and Jerry), as well as the far more entertaining and recent John and Betty’s Journey into Complex Numbers by Matt Bower http://www.slideshare.net/aus_autarch/john-and-betty )

“AIC/BIC not p” : comparing means using information criteria

A basic principle in Science is that of parsimony, or reducing complexity where possible, as typified in the application of Occam’s Razor.

William of Occam (or Ockham), a philosopher monk named after the English town that he came from, said something to the effect of ‘pluralitas non est ponenda sine necessitate’ (‘plurality should not be posited without necessity’). In other words, don’t increase, beyond what is necessary, the number of entities needed to explain something.

Occam’s Razor doesn’t necessarily mean that ‘less is always better’, it merely suggests that more complex models shouldn’t be used unless required, to increase model performance, for example. As is commonly, but probably mistakenly believed to have been proposed by Albert Einstein, ‘everything should be made as simple as possible, but not simpler’.

Common methods of measuring performance or ‘bang’, taking into account the cost, complexity or ‘buck’, are the Akaike Information Criterion (AIC), Bayesian or Schwarz Information Criterion (BIC), Minimum Message Length (MML) and Minimum Description Length (MDL).

Unlike the standard AIC, the latter three techniques take sample size into account, while MDL and MML also take the precision of the model estimates into account, but let’s just keep to the comparatively simpler AIC/BIC here.

An excellent new book by Thom Baguley ‘Serious Stats’ (serious in this case meaning powerful rather than scarey) http://seriousstats.wordpress.com/ shows how to do a t-test using AIC/BIC in SPSS and R.

I’ll do it here using Stata regression, the idea being to compare a null model (e.g. just the constant) with a model including the group. In this case we’re looking at the difference between headroom in American and ‘Foreign’ cars in 1978. (well, it’s Thursday night!).

Here’s the t-test results

(1978 Automobile Data)
——————————————————————————
Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
———+——————————————————————–
Domestic |      52    3.153846    .1269928    .9157578    2.898898    3.408795
Foreign |        22    2.613636     .103676    .4862837     2.39803      2.829242

Domestic has slightly bigger mean headroom (but also larger variation!), p value is 0.011, indicating that the probability of getting a difference in means as large as or larger than the one above (0.540), IF the null hypothesis, that the populations means are actually identical, holds, is around 1 in a 100.

Using the method shown in Dr Thom’s book (Stata implementation on my Downloadables page) we get

Akaike’s information criterion and Bayesian information criterion

—————————————————————————–
Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
————-+—————————————————————
nullmodel |     74   -92.12213   -92.12213      1     186.2443    188.5483
groupmodel |     74   -92.12213   -88.78075      2     181.5615    186.1696

AIC and BIC values are lower for the model including group, suggesting in this case that increasing complexity (the two groups), also commensurately increases performance (i.e. need to take into account the two group means for US and non-US cars, rather than assuming there’s just one common mean, or universal headroom)

Of course, things get a little more complex when comparing several means, having different variances etc (as the example above actually does, although means still “significantly” different when differences in variances taken into account using separate variance t-test). Something to think about, and more info on applying AIC/BIC to variety of statistical methods can be found in refs below, particularly 3 and 5.

Further Reading (refs 2,3,4 and 5 are the most approachable, with Thom Baguley’s book referred to above, more approachable still)

Akaike, H., A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974. 19: p. 716-723.
Anderson, D.R., Model based inference in the life sciences: a primer on evidence. 2007, New York: Springer.
Dayton, C.M., Information criteria for the paired-comparisons problem. American Statistician, 1998. 52: p. 144-151.
Forsyth, R.S., D.D. Clarke, and R.L. Wright, Overfitting revisited : an information-theoretic approach to simplifying discrimination trees. Journal of Experimental and Artificial Intelligence, 1994. 6: p. 289-302.
Sakamoto, Y., M. Ishiguro, and G. Kitagawa, Akaike information criterion statistics. 1986, Boston, MA: Dordrecht.
Schwarz, G., Estimating the dimension of a model. Annals of Statistics, 1978. 6: p. 461-464.
Wallace, C.S., Statistical and inductive inference by minimum message length. 2005, New York: Springer.
Wallace, C.S. and D.M. Boulton, An information measure for classification. Computer Journal, 1968. 11: p. 185-194.

—————————————————————————–

Hot Cross Buns: How Much Bang for the Buck?

Good Friday and Easter Monday are public holidays in Australia and UK (the former day is holiday in US in 12 states). For many down here, including those who don’t pay much nevermind to symbols, Good Friday is traditionally the day to eat Hot Cross Buns. For the last few years, the Melbourne Age newspaper has rated a dozen such buns for quality, as well as listing their price.

http://www.goodfood.com.au/good-food/search.html?ss=Good+Food&text=bunfight&type=

We would expect that, quality would increase, to some extent with price, although it would eventually flatten out (e.g. thrice as expensive doesn’t always mean thrice as good). Graphing programs such as Graphpad, Kaleidagraph and SigmaPlot, as well as R and most Stats packages, can readily fit a plethora of polynomial and other nonlinearities, but I used Stata to perform a preliminary scatterplot of the relationship between tasters’ score (out of 10) and price per bun (A$), smoothed using Bill Cleveland’s locally weighted least squares Lowess/Loess algorithm. http://en.wikipedia.org/wiki/Lowess

The relationship shown above is vaguely linear or, rather, ‘monotonic’, at least until I can have a better go with some nonlinear routines.

A simple linear regression model accounts for around 42% of the variation in taste, in this small and hardly random sample, returning the equation y=1.71*unitprice+1.98, suggesting (at best) that subjective taste, not necessarily representing anyone in particular, increases by 1.7 with every dollar increase in unit price.

But the fun really begins when looking at the residuals, the difference between the actual taste score, and that predicted using the above model. Some buns had negative residuals, indicating (surprise surprise!) that their taste was (much) lower than expected, given their price. I won’t mention the negatives.

As to the positives, two bakeries, Woodfrog Bakery in St. Kilda (Melbourne, Australia) and Candied Bakery in Spotswood (ditto), both cost $2.70 each and so were predicted to have a taste score out of 10 of 6.6, yet Woodfrog hopped in with an actual score 8.5 and Candied with an actual score of 8.

The results can’t be generalised, prove nothing at all, and mean extremely little, except to suggest that regression residuals can perhaps be put to interesting uses, but please take care in trying this at home! Tread softly and carry a big (regression) book e.g Tabachnick and Fidell’s Using Multivariate Statistics

(or the Manga Guide to Regression, when published! http://www.nostarch.com/mg_regressionanalysis.htm)