Power Analysis Surge

December-February (and generally March) is Summer in the Southern Hemisphere. It can get pretty hot even in the more temperate southern states (= northern states in the Northern Hemisphere). Today, the last day of 2015, has an expected top of 40 Celsius, which is 104 Fahrenheit and a nose-peeling 313.15 Kelvin!

Summer is also the start of the Australian Academic Year, and Grant Season, when everyone is looking for Statisticians (who are in swimming pools or pool halls) to run power analyses for them. With no statisticians to be found, it’s off to the library to borrow whatever randoms they can find, or to use one of the free web packages, which often works out like a home haircut.

There’s great specific software such as PASS http://www.ncss.com and NQuery Adviser & Nterim  http://www.statsols.com/products/nquery-advisor-nterim/

They’re not cheap (but neither are grants!), and aid professional statisticians as well.  There’s R of course, and the excellent menu-driven power and sample size routines in SAS and Stata.

But first, define the differences you’re expecting, based on the actual Literature as well as Clinical Judgement, and always see a Statistician!
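To see why the expected difference matters so much, here is a minimal sketch in Python of the textbook normal-approximation formula for comparing two group means. It is only an illustration: the function name n_per_group and the defaults (two-sided alpha of 0.05, power of 0.80) are my own assumptions, and the dedicated packages above use more exact methods that typically ask for a participant or two more per group.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes).

    effect_size is the standardised difference: the expected difference
    in means divided by the common standard deviation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Halving the expected difference roughly quadruples the sample size.
for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))
```

The expected difference sits in the denominator and gets squared, which is why it pays to pin it down from the Literature and Clinical Judgement before anything else.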

Wisdom of the Cloud

Many summers ago when I started out in the Craft, I could log onto the trusty DEC-20 literally anywhere in the world, and use SPSS or BMDP to analyse data. Nowadays, I have to have IBM SPSS or Stata installed on the right laptop or computer and bring it with me wherever I may roam, while wondering dreamily if I could just access my licensed stats packages from anywhere: a library, a beach, a forest, a coffee shop.

One option would be to subscribe to a stats package in the Cloud! In terms of mainline stats packages, https://www.apponfly.com/en/ has R (free, plus 8 euros ($A12.08) per month for the platform), NCSS 10 at 18 euros ($A27.19) per month plus platform, IBM SPSS 23 Base at 99 euros ($A149.54) per month plus platform, and IBM SPSS 23 Standard (adds logistic regression, hierarchical linear modelling, survival analysis etc) at 199 euros ($A300.59) per month plus platform.

Another option, particularly if you’re more into six sigma / quality control type analyses, is Engineroom from http://www.moresteam.com at $US275 ($A378.55) per year.

Obviously, compare the prices against actually buying the software, but to be able to log in from anywhere, on different computers, and analyse data, sigh, it’s almost like the summer of ’85!

Divisive Rules OK: Clustering #1

Back about 1965, while (whilst?) attending Primary (grade) School in a little northern Victorian town, lunchtimes would see a behaviour pattern in which, say, two boys would link arms and march around the boys’ playground chanting “join on the boys who want to play Chasey” or some other sport or game, and soon, unless something bizarre or boring was called out for, there’d be three, five, eight, ten etc until there were enough people for that particular game.

Now, let’s imagine a slightly more surreal version: a big group of boys, or girls, or indeed a mixture thereof, wanders around the playground. But what sport are they going to play? It’s unlikely (especially for our purposes here) they’ll all want to play the same thing, and even if they did, there may be too many, or some people may well be better suited to some other sport or game. If we were brave enough and if lunchtimes extended into infinity, we could try every possible way of splitting our big group into two or more smaller groups as, in the general field of cluster analysis, Edwards and Cavalli-Sforza showed back in ’65.

Alternatively, we could ask the single person most different from the rest of the main group M in terms of the game they wanted to play. That person (let’s call them Brian after Brian Everitt, who wrote a great book on cluster analysis in several editions, and Brian Setzer as in Stray Cats and the Brian Setzer Orchestra, this being the unofficial Brian Setzer Summer) splits off from the group and forms a splinter group S. For each of the remaining members, we check whether, on average, they’re more dissimilar to the other members of M than to the members of S (i.e. Brian et al). If so, then they too join S.

Known as divisive clustering (the earlier “join on” syndrome is sorta kinda like agglomerative clustering: start off with individuals and group em together), this particular method was published in ’64 by Macnaughton-Smith. Described in Kaufman and Rousseeuw’s book as DIANA, with shades of a great steak sauce and an old song by Paul Anka, DIANA is available in R as part of the cluster package.
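For the curious, a single playground split of this kind can be sketched in a few lines of Python. This is only my own rough illustration of the splinter-group idea, not the DIANA routine from the R cluster package: it performs one split rather than building the whole hierarchy, the function names are made up, and the “game preference” scores are invented data.

```python
def average(values):
    """Mean of an iterable, or 0.0 if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def splinter_split(dissim):
    """Split one group in two, given a square symmetric matrix
    (list of lists) of pairwise dissimilarities.

    Returns (main, splinter) as sets of row indices.
    """
    main = set(range(len(dissim)))
    if len(main) < 2:
        return main, set()

    # "Brian": the member with the largest average dissimilarity to the rest.
    first = max(sorted(main),
                key=lambda i: average(dissim[i][j] for j in main if j != i))
    main.remove(first)
    splinter = {first}

    def gain(i):
        # Positive when i is, on average, closer to the splinter group
        # than to the other members of the main group.
        to_main = average(dissim[i][j] for j in main if j != i)
        to_splinter = average(dissim[i][j] for j in splinter)
        return to_main - to_splinter

    # Move the strongest candidate across, one at a time, until nobody
    # left in the main group would rather be in the splinter group.
    while len(main) > 1:
        best = max(sorted(main), key=gain)
        if gain(best) <= 0:
            break
        main.remove(best)
        splinter.add(best)
    return main, splinter


# Invented scores: three children keen on one game, two keen on another.
scores = [1.0, 1.2, 0.9, 5.0, 5.3]
dissim = [[abs(a - b) for b in scores] for a in scores]
main, splinter = splinter_split(dissim)
print(sorted(main), sorted(splinter))
```

Running the same split again on each resulting group, and again on their offspring, is what turns one splinter into a full divisive hierarchy.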

Now if you’ll excuse me, there’s a group looking for members to march down the road for a cold drink, on this hot Australian summer night! Once we get to the bar, the most dissimilar, perhaps a nondrinker, will split off, clusters will be formed, and through the night there may be re-splitting and re-joining of groups or cliques, as some go off to the pinball parlour, others to the pizza joint, while some return to the bar, all in the manner of another great clustering algorithm, Ball and Hall’s ISODATA.

Bottled Sources:

Ball GH, Hall DJ (1965). A novel method of data analysis and pattern classification. Technical Report, Stanford Research Institute, California.

Edwards AWF, Cavalli-Sforza LL (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt BS (1974 and more recent editions). Cluster analysis. Heinemann: London.

Kaufman L, Rousseeuw PJ (1990). Finding groups in data: an introduction to cluster analysis. Wiley: New York.

Macnaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964). Dissimilarity analysis: A new technique of hierarchical sub-division. Nature, 202, 1034-1035.

Who gives a toss: the statistics of coins

Spring is here in Melbourne, and a time for fashionable horse racing, including The Melbourne Cup in November, once attended by Mark Twain. Australia is also home of the “two-up” coin tossing game (descended from the British pitch and toss), played in outback pubs, hidden city lanes and now Australian casinos, described in great old Australian novels such as Come In Spinner, and the eerie book and 1971 movie Wake in Fright (aka Outback).

In the 18th century, the Comte de Buffon obtained 2048 heads from 4040 tosses, while more recently, and not to be outdone, the statistician Karl Pearson obtained 12,012 heads out of 24,000 tosses (The Jungles of Randomness by Ivars Peterson, 1998). Of course, a misunderstanding of the law of large numbers, or the so-called law of averages, makes the uninitiated think that if there are, say, seven heads in a row, a cosmic force will decide “hang on, that coin is coming up heads more than 50%, better make the next one a tail”.
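The cosmic force is easy enough to put on trial. The following is a small simulation sketch in Python, with a made-up function name and an arbitrary seed and number of tosses, that looks only at the tosses which follow a run of at least seven heads and counts how often the next toss is also a head.

```python
import random


def heads_after_streak(n_tosses=500_000, streak=7, seed=2015):
    """Toss a simulated fair coin n_tosses times and look only at tosses
    that come straight after a run of at least `streak` heads.

    Returns (heads_next, followups): the number of those follow-up tosses
    that were heads, and the number of follow-up tosses examined.
    """
    rng = random.Random(seed)
    run = 0          # length of the current run of heads
    followups = 0    # tosses made immediately after a long enough run
    heads_next = 0   # how many of those follow-up tosses were heads
    for _ in range(n_tosses):
        is_head = rng.random() < 0.5
        if run >= streak:
            followups += 1
            heads_next += is_head
        run = run + 1 if is_head else 0
    return heads_next, followups


heads_next, followups = heads_after_streak()
print(followups, "tosses followed a run of seven or more heads;",
      round(heads_next / followups, 3), "of them came up heads")
```

The proportion should hover around one half, give or take sampling error: the coin, like the cards, has no memory.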

While it doesn’t look at two-up, “Digital Dice” by the always entertaining Paul Nahin (2008) examines a tricky coin-tossing problem posed in 1941 and not solved until 1966. Prof Paul shows how to solve it using a computer-based Monte Carlo method, itself named after that famous casino in Monaco, where James Bond correctly observed that “the cards have no memory”.

And who says stats isn’t relevant?!

Applied Australian Change-Point Analysis: Before the Shark Gets Jumped?

OK, I saw the (in)famous “Jump the Shark” episode of Happy Days (Season 5, Episode 3, when Fonzie water-skis over a shark pool) when I was 18 back in 1977, and hated it.

Definitely Uncool.
But one Saturday morning a month or two ago I saw it again and loved it. It’s wild! It’s glorious!

The term has come to mean the point at which a TV series goes downhill, when the wolf becomes a dog, to riff on a previous post.


Anyhow, Australia’s Professor Kerrie Mengersen and Dr Hassen Assareh have developed a snazzy new Bayesian Markov Chain Monte Carlo procedure for working out the change-point in a process, specifically the point where a key change happened to a hospital patient’s condition for example, helping to identify the ‘why’ as well as the ‘when’.
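To give a flavour of what a Bayesian change-point calculation does, here is a toy sketch in Python. It is emphatically not the Mengersen and Assareh procedure: it assumes a single change in the rate of a yes/no event, puts flat Beta(1, 1) priors on the rate before and after, puts a uniform prior on the change-point, and works out the posterior by brute-force enumeration rather than MCMC. The data and function names are invented for illustration.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def change_point_posterior(events):
    """Posterior probability of each possible change-point tau, where the
    first tau observations share one event rate and the rest share another.

    events is a list of 0/1 outcomes in time order.
    """
    n = len(events)
    total = sum(events)
    log_post = {}
    before = 0  # number of events among the first tau observations
    for tau in range(1, n):
        before += events[tau - 1]
        after = total - before
        # Marginal likelihood of each segment under a Beta(1, 1) prior.
        log_post[tau] = (log_beta(1 + before, 1 + tau - before)
                         + log_beta(1 + after, 1 + (n - tau) - after))
    top = max(log_post.values())
    weights = {t: math.exp(lp - top) for t, lp in log_post.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


# Invented data: 40 observations with rare events, then 20 with frequent ones.
events = [0] * 40
events[5] = events[20] = 1
events += [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]

posterior = change_point_posterior(events)
most_likely = max(posterior, key=posterior.get)
print(most_likely, round(posterior[most_likely], 3))
```

The appeal of the Bayesian version is that you get a whole distribution over the ‘when’, not just a single best guess.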


It’s a great idea and yet another instance of how Statistics can help save the world, again!

Snappy Stepwise Regression

Stepwise regression, the technique that attempts to select a smaller subset of variables from a larger set by, at each step, choosing the ‘best’ or dropping the ‘worst’, was developed back in the late 1950’s by applied statisticians in the petroleum and automotive industries. With an ancestry like this, it’s no wonder that it is often regarded as the statistical version of the early 60’s Chev Corvair: at best only ‘driveable’ by expert careful users, or in Ralph Nader’s immortal words and the title of his 1965 book, ‘Unsafe at Any Speed’.

Well maybe. But if used with cross-validation and good sense, it’s an old-tech standby to later-model ‘lasso’ and ‘elastic net’ techniques. However, there’s an easy way for a bit of a softshoe shuffle of the old stepwise routine. See how well (preferably on a fresh set of data) forward entry with just one, or maybe two, or at most three variables does, compared with larger models. (SAS and SPSS allow the number of steps to be specified.)

Or if you’d like to do some slightly fancier steps in two-tone spats, try a best subset regression (available in SAS, in SPSS through automatic linear modelling, and in Minitab, R etc) of all one-variable combinations, two-variable combinations and three-variable combinations.
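As a rough sketch of the idea, the Python below fits every one-, two- and three-variable least-squares model on a training set, keeps the best of each size, and then scores it on a fresh set of data. Everything here is an assumption for illustration: the data are simulated (only the first two of six predictors actually matter), the function names are my own, and the little normal-equations solver is a stand-in for whatever your stats package does far more carefully.

```python
import itertools
import random


def columns(rows, subset):
    """Keep only the predictor columns listed in subset."""
    return [[row[j] for j in subset] for row in rows]


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal
    equations, solved by Gauss-Jordan elimination with partial pivoting."""
    X = [[1.0] + list(row) for row in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def mse(coefs, rows, y):
    """Mean squared prediction error of a fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
             for row in rows]
    return sum((pred - yi) ** 2 for pred, yi in zip(preds, y)) / len(y)


def best_small_models(train_X, train_y, test_X, test_y, max_size=3):
    """For each model size, pick the best subset on the training data,
    then report its error on the fresh (holdout) data."""
    results = {}
    for size in range(1, max_size + 1):
        fits = []
        for subset in itertools.combinations(range(len(train_X[0])), size):
            coefs = fit_ols(columns(train_X, subset), train_y)
            fits.append((mse(coefs, columns(train_X, subset), train_y),
                         subset, coefs))
        _, subset, coefs = min(fits, key=lambda fit: fit[0])
        results[size] = (subset, mse(coefs, columns(test_X, subset), test_y))
    return results


def simulate(n, rng):
    """Simulated data: y depends strongly on x0, weakly on x1, not on x2..x5."""
    X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(n)]
    y = [3.0 * row[0] + 0.5 * row[1] + rng.gauss(0, 1) for row in X]
    return X, y


rng = random.Random(1966)
train_X, train_y = simulate(200, rng)
test_X, test_y = simulate(200, rng)
results = best_small_models(train_X, train_y, test_X, test_y)
for size, (subset, holdout_mse) in results.items():
    print(size, subset, round(holdout_mse, 3))
```

If the holdout error barely moves as variables are added, the little model is probably doing most of the work.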

The inspiration for this is partly from Gerd Gigerenzer’s ‘take the best’ heuristic: taking the best cue or clue often beats more complex techniques, including multiple regression etc. ‘Take the best’ is described in Prof Gigerenzer’s great new general book Risk Savvy: How to Make Good Decisions (Penguin, 2014) http://www.penguin.co.uk/nf/Book/BookDisplay/0,,9781846144745,00.html as well as his earlier academic books such as Simple Heuristics That Make Us Smart (Oxford University Press, 1999).

See if a good little model can do as well as a good (or bad) big ‘un!


Further Future Reading

Draper NR, Smith H (1966) Applied regression analysis (and later editions). Wiley: New York.

John and Betty’s Journey into Statistics Packages*

In past days of our lives, those who wanted to learn a stats package, would attend courses, and bail up/bake cakes for statisticians, but would mainly raise the drawbridge, lock the computer lab door and settle down with the VT100 terminal or Apple II or IBM PC and a copy of the brown or update blue SPSS Manual, or whatever.

Nowadays, folks tend to look things up on the web, something of a mixed blessing, and so maybe software consultants will now say LIUOTFW (‘Look It Up On The Flipping Web’) rather than the late, great RYFM (‘Read Your Flipping Manual’).

And yes, there are some great websites, and great online documentation supplied by the software vendors, but there are also some great books, available in electronic and print form. A list of three of the many wonderful texts available for each package (IBM SPSS, SAS, Stata, R and Minitab) can be downloaded from the Downloadables section on this site.

IBM SPSS (in particular), R (ever growing), and to a slightly lesser extent SAS, seem to have the best range of primers and introductory texts.
IMHO though, Stata could do with a new colourful, fun primer (not necessarily a Dummies Guide, although there’s Roberto Pedace’s Econometrics for Dummies (Wiley, New York, 2013) which features Stata), perhaps one by Andy Field, who has already done superb books on SPSS, R and SAS.

While up on the soapbox, I reckon Minitab could do with a new primer for Psychologists / Social Scientists, much like that early ripsnorter by Ray Watson, Pip Pattison and Sue Finch, Beginning Statistics for Psychology (Prentice Hall, Sydney, 1993).

Anyway, in memories of days gone by, brew a pot of coffee or tea, unplug email, turn off the phone and the mobile/cell, and settle in for an initial night’s journey, on a set or two of real and interesting data, with a good stats package book, or two!

*(The title of this post riffs off the improbably boring and stereotyped 1950’s early readers still used in Victorian primary (grade) schools in the 1960’s http://nla.gov.au/nla.aus-vn4738114 (think Dick and Jane, or Alice and Jerry), as well as the far more entertaining and recent John and Betty’s Journey into Complex Numbers by Matt Bower http://www.slideshare.net/aus_autarch/john-and-betty )