Spiders, sowbugs and sundew statistics

Statisticians often like to think that non-statisticians don’t know what exactly it is that we do. The truth is, of course, that not only do they not know, they do not particularly care! With the possible exception of someone like Nate ‘2012 US election’ Silver, what statisticians are thought to do is about as exciting as driving around in a cardigan and slippers in a two-tone ’74 Morris Marina with no radio.

But what if statisticians went around ripping up floorboards and counting up spiders? Now you’re talking!

Back in ’46 a scientist named LC Cole published some data on counts of spiders and sowbugs (also known as woodlice, roly-polys or slaters).

Cole, and various bright sparks ever since, had the idea of fitting the spider and sowbug counts to various types of probability distribution. Voilà! It was found that spider counts could be quite happily fitted by the Poisson distribution, as can the number of typewriter errors made on a page, the number of people killed by horse kicks in the Prussian cavalry, and so on.

But not sowbug counts, which are better fitted by a ‘contagious distribution’, such as the ‘generalized Poisson’ or ‘generalized negative binomial’, in which the occurrence of one event affects the chances of further events. Sowbugs, it seems, are a social breed, and when they notice their numbers dwindling, to the point where there are only one or two left, they up sticks and try the house down the road, in search of other sowbugs, if not adventure.

Spiders, on the other hand, are more individualistic or anti-social and don’t care if they’re left by themselves. (In fact they probably appreciate the peace and quiet after those pesky sowbugs have marched off elsewhere, unless of course the spiders belong to the type known as woodlouse spiders or sowbug hunters, which is a very different kettle of fish, or spiders, altogether, as are ‘shy spiders’ and ‘social spiders’.)
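If you fancy trying this at home (floorboards optional), a quick diagnostic is the variance-to-mean ratio, or index of dispersion: it sits near 1 for Poisson-ish counts and climbs well above 1 for clumped, ‘contagious’ counts. Here is a minimal sketch in Python, using made-up counts rather than Cole’s actual 1946 figures:

```python
import numpy as np
from scipy import stats

# Hypothetical board-by-board counts -- not Cole's actual data
spiders = np.array([0, 1, 0, 2, 1, 0, 1, 3, 0, 1, 2, 0, 1, 0, 1])
sowbugs = np.array([0, 0, 7, 0, 12, 0, 0, 9, 0, 0, 0, 15, 0, 1, 0])

def dispersion_check(counts, label):
    """Index of dispersion: ~1 suggests Poisson, >>1 suggests clumping."""
    d = counts.var(ddof=1) / counts.mean()
    # Under the Poisson null, (n-1)*d is roughly chi-squared with n-1 df
    p = stats.chi2.sf((len(counts) - 1) * d, df=len(counts) - 1)
    print(f"{label}: dispersion index = {d:.2f}, p = {p:.3f}")

dispersion_check(spiders, "spiders")
dispersion_check(sowbugs, "sowbugs")
```

The pretend spiders come out Poisson-flavoured; the clumpy pretend sowbugs most certainly do not.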

Finally, a 2010 paper in the highly prestigious Royal Society’s Proceedings B found that carnivorous wolf spiders (Lycosidae) and pink sundew plants (Drosera capillaris) competed with each other for available food, in statistically interesting ways. Indeed, the lead author described the study as ‘awfully fun’: http://www.livescience.com/8566-plant-spider-compete-food.html
http://www.americanscientist.org/issues/pub/2010/6/in-the-news-30
http://ittakes30.wordpress.com/2010/10/25/feed-me-seymour/

So, next time someone asks (without really caring what the answer is) ‘just what is it that statisticians actually do?’…

[updated, 9 October 2016]

References

Cole LC (1946) A study of the cryptozoa of an Illinois woodland. Ecological Monographs, 16, 49-86.

Consul PC (1989) Generalized Poisson distributions. Marcel Dekker, New York.

Forbes C, Evans M et al. (2011) Statistical distributions, 4th ed. Wiley, Hoboken, New Jersey.

Janardan KG et al. (1979) Biological applications of the Lagrangian Poisson distribution. BioScience, 29, 599-602.

Jennings DE et al. (2010) Evidence for competition between carnivorous plants and spiders. Proceedings of the Royal Society B, 277, 3001-3008.

Raja TA, Mir AH (2011) On applications of some probability distributions. Journal of Research & Development, 11, 107-116.

Watch the Skies: Manga Regression!

Although it’s probably the technique most employed by statisticians, regression, or at least multiple linear or multiple logistic regression, is often the concept most feared or misunderstood by students and newbie researchers. If someone you know is in the latter categories and would like a fun and straightforward introduction to regression that literally uses pictures (cartoons), announce that The Manga Guide to Regression has now been published, around May 2016, by the friendly folks at No Starch Press http://www.nostarch.com/regression

and is available on http://www.oreilly.com, http://www.amazon.com, http://www.bookdepository.com, http://www.dymocks.com.au and elsewhere.

Authored by Shin Takahashi (The Manga Guide to Statistics, 2008), the new book uses manga http://en.wikipedia.org/wiki/Manga (think Osamu Tezuka’s Jungle Emperor, made into the 1965 Japanese anime TV series of the same name and the 1966 US overdub ‘Kimba the White Lion’ http://en.wikipedia.org/wiki/Jungle_Emperor, as well as ‘Atomu’ / ‘Astro Boy’).

Okay, Kimba himself is not actually included in the Manga Regression book, but there’s both linear and logistic regression, demonstrated using Microsoft Excel (just a slight gritting of teeth; go for Excel 2010/2013/2016 or higher), and at least it might allow getting ‘down and dirty with data’.
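And if even gritted teeth won’t get you into Excel, both flavours of regression take only a few lines of Python. A minimal sketch on made-up exam data, assuming the statsmodels library is installed:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Made-up data: hours of study vs exam score (linear), and pass/fail (logistic)
hours = rng.uniform(0, 10, 50)
score = 40 + 5 * hours + rng.normal(0, 8, 50)
passed = (score > 65).astype(int)

X = sm.add_constant(hours)                   # add the intercept column
print(sm.OLS(score, X).fit().summary())      # linear regression
print(sm.Logit(passed, X).fit().summary())   # logistic regression
```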

 

Happy Numbers

Why there are so very few statisticians as heroes (or even dashing villains) in novels is a pop culture mystery even bigger than the true identity of reggae magicians Johnny and the Attractions, or the actual final resting place of Butch and Sundance.

I have heard of, but don’t have, the 2008 novel Dancing with Dr Kildare, which features a British medical statistician named Nina, as well as the Finnish composer Sibelius and the tango. It’s by Jane Yardley PhD, in real life a co-ordinator of medical trials for a small pharma.

http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2009.00341.x/abstract

http://www.transworld-publishers.co.uk/catalog/book.htm?command=Search&db=twmain.txt&eqisbndata=0552773107

I’m now performing statistical consulting at two major hospitals, so I’m about to re-read that wonderful book by scriptwriter and drama writer Jim Keeble, ‘The Happy Numbers of Julius Miles’, originally published in 2012 by the independent outfit Alma

http://www.almabooks.com/the-happy-numbers-of-julius-miles-p-387-book.html

but there seems to be an April 2014 printing for the US.

It’s a great book about a big fellow, Julius Miles, a professional statistician with Barts Health NHS Trust, Royal London Hospital, Whitechapel, East London, England. Julius loves stats – nose-counting ones such as the fact that it takes him 2 minutes to polish his shoes (with 30 seconds airing between polish, application and buff), as well as meaty methods such as multilevel Poisson regression for length of hospital stay.

Julius is about 1.93 metres (6 foot 4 inches) and wears size 13 (UK) shoes, a solid fellow (although not reminiscent of the solid Ignatius Reilly in John Kennedy Toole’s classic posthumous 1980 novel ‘A Confederacy of Dunces’).

There’s something about the name Julius: Julius Sumner Miller, the US physicist and educator whose ‘Why Is It So?’ ran on Australian TV for over 20 years from the 1960s, and the frothy US drink Orange Julius, named after Julius Freed, around since 1926 and taking off in ’29 (the official drink of the 1964 New York World’s Fair).

I can thoroughly recommend this colourful & warm book about Julius Miles, medical statistician.

Give p’s a Chance? Hoochie Coochie Hypothesis Tests

A common request for jobbing analysts is to ‘run these results through the computer and see if they’re significant’. Now, unfortunately, many folk, including, scarily, even lecturers in our craft, have a misconception as to what ‘significance’ actually means.

Shout in a desperate monotone “it’s the probability of getting a result as large as, or larger than, the one we actually obtained, if the ‘null hypothesis’ of no difference or association were true” and people look flummoxed, yes flummoxed, as if you were speaking to them in the language of the ancient Huns, (another) language no-one has been able to figure out.

True, testing ‘something’ against the concept of ‘nothing’ is a bit kooky. If we really did have a situation where two groups ended up with identical averages, we’d think it was a trifle dodgy, to say the least.

And as for the notion of effect sizes! Picture, on an enchanted desert isle, two group means of 131.5 and 130, with a pooled standard deviation (sd) of 15. The difference of 1.5, divided by the sd of 15, gives a Cohen’s effect size of 0.10 (that’s the late, great Jacob Cohen: Cohen’s kappa, populariser of power analysis, maven of multiple regression). Given Jack’s arbitrary but conventional guidelines for mean differences, 0.20 is a small effect size, 0.50 medium and 0.80 large.

Using an online calculator, e.g.

http://www.graphpad.com/quickcalcs/ttest1/

we find that if there were 1000 in each group, the t test value would be 2.24 and our p value 0.026.

Voilà, Eureka, Significance, as cook smiles and puts an extra dollop of custard on our pudding!

But if we ‘only’ had 100 in each group, our t value would be 0.71, our p value would be 0.48, and there’d be a sigh, a frown, a closing of doors and a grim-faced cook doling out the thrice-boiled cabbage…

But they’re the same means, the same sd, and the same effect size!
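You can check the sums yourself without the online calculator; a minimal sketch using scipy’s t test from summary statistics:

```python
from scipy import stats

# Same means and sd as above; only the group sizes differ
for n in (1000, 100):
    t, p = stats.ttest_ind_from_stats(mean1=131.5, std1=15, nobs1=n,
                                      mean2=130.0, std2=15, nobs2=n)
    d = (131.5 - 130.0) / 15   # Cohen's effect size, identical in both cases
    print(f"n = {n} per group: t = {t:.2f}, p = {p:.3f}, effect size = {d:.2f}")
```

Same effect size, very different custard prospects.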

Coming Up: Guest Post on a possible, probable, Salvation.

Further/Future reading

Cumming G (2014) How significant is P? Australasian Science, March 2014, p. 37.

http://www.australasianscience.com.au/article/issue-march-2014/how-significant-p.html

also check out Prof G’s website

http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci

with the free Excel ESCI program and details of his illuminating 2012 book ‘The New Statistics’.

Now, back to honest resting from honest labour!

When Boogie becomes Woogie, when Dog becomes Wolf

An exciting (and not just for statisticians!) area of application in statistics/analytics/data science relates to change/anomaly/outlier detection. The general notion of outliers (e.g. ‘unlikely’ values) was covered in a previous post, looking at, amongst other things, very long pregnancies.

But tonight’s fr’instance comes from Fleming’s wonderful James Bond Jamaican adventure novel, Dr No (also a jazzy 1962 movie), which talks of London Radio Security shutting down radio connections with secret agents if a change in their message-transmitting style is detected, as this may indicate that their radio has fallen into enemy hands.

To use a somewhat less exotic example, imagine someone, probably not James Bond, tenpin bowling and keeping track of their scores, a scenario from HJ Harrington et al.’s excellent Statistical Analysis Simplified: The Easy-to-Understand Guide to SPC and Data Analysis (McGraw-Hill, 1998).

On the 10th week, the score suddenly drops more than three standard deviations (the standard deviation being a measure of scatter or variation around the mean, or average) below the mean.

Enemy agents? Forgotten bowling shoes? Too many milk shakes?
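Whatever the cause, the three-standard-deviation check itself is simple enough; a minimal sketch in Python, with made-up weekly scores rather than Harrington’s actual data:

```python
import numpy as np

# Hypothetical weekly bowling scores; week 10 takes a dive
scores = np.array([152, 148, 155, 150, 149, 153, 151, 147, 154, 98])

baseline = scores[:9]                      # establish 'in-control' behaviour first
mean, sd = baseline.mean(), baseline.std(ddof=1)

for week, s in enumerate(scores, start=1):
    if abs(s - mean) > 3 * sd:
        print(f"Week {week}: score {s} is more than 3 sd from the mean of {mean:.0f}")
```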

Once again, an anomaly or change, something often examined in industry (Statistical Process Control (SPC) and related areas) to determine the point at which, in the words of Tom Robbins’ great novel Even Cowgirls Get The Blues, ‘the boogie stopped and the woogie began’.

Sudden changes in operations & processes can happen, and so a usual everyday assembly line (‘dog’) can in milliseconds become the unusual, and possibly even dangerous (‘wolf’), at which point hopefully an alarm goes off and corrective action is taken.

The basics of SPC were developed many years ago (and taken to Japan after WW2, a story in itself). Anomaly detection is a fast-growing area. For further experimentation and reading, a recent method based upon calculating the closeness of points to their neighbours is described in John Foreman’s marvellous Data Smart: Using Data Science to Transform Information into Insight (Wiley, 2014).
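Here is a minimal sketch of the general neighbour-distance idea (not Foreman’s exact recipe): points whose nearest neighbours are unusually far away get flagged.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 ordinary points in a cluster, plus one far-flung anomaly
points = np.vstack([rng.normal(0, 1, (50, 2)), [[8.0, 8.0]]])

# Distance from each point to every other point
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)            # ignore distance to self

k = 5
knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)  # mean distance to k nearest

threshold = np.percentile(knn_mean, 98)    # flag the most isolated points
print("Flagged points:", np.where(knn_mean > threshold)[0])
```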

We might want to determine if a credit card has been stolen on the basis of different spending patterns/places, or, to return to the opening example, detect an unauthorised intruder to a computer network (e.g. Clifford Stoll’s trailblazing The Cuckoo’s Egg: Tracking a Spy Through the Maze of Computer Espionage).

Finally, we might just want to figure out just exactly when it was that our bowling performance dropped off!

Telstar, Cortina & the Median Quartile Test: where were you in ’62?

It was 1962, the setting of the iconic 1973 movie American Graffiti, from which comes the subtitle of this post. The Beatles had released Love Me Do, their first single. That year also heard and saw ‘Telstar’, the eerie but joyful, Clavioline-led Joe Meek instrumental by the Tornados, celebrating the private transatlantic communications and television satellite it was named after. The British Ford Cortina, named after an Italian ski resort, saw out the humpty-dumpty rounded Prefects and ’50s Zephyrs, while in the US the first of 50 beautiful, mysterious and largely lost Chrysler Ghia Turbine cars was driven in Detroit.

Meanwhile, the world of statistics was not to be outdone. Rainald Bauer’s Median Quartile Test, an extension of Brown and Mood’s early-’50s Median Test, was published, in German, in 1962. The latter test, still available in statistics packages such as IBM SPSS, SAS and Stata, simply compares groups on counts below and above the overall median, providing, in the case of two groups, a two-by-two table.

The Median Quartile Test (MQT), as the name suggests, compares the groups across all four quartile intervals. But the MQT is largely unknown, mainly discussed in books and papers published in, or translated from, German.

The MQT conveys similar information to John Tukey’s boxplot, shows both analysts and their customers and colleagues where the data tend to fall, and provides a test of statistical significance to boot. Does one group show a preponderance of scores in the lower and upper quartiles, for example, suggesting, in the field of pharma fr’instance, that one group either gets much better or much worse?
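The mechanics are straightforward to improvise, even without a package that carries the test. A minimal sketch of the idea in Python (my reading of the description above, not a transcription of Bauer’s paper): cut the combined data at its quartiles, cross-tabulate group against quartile interval, and apply a chi-square test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up data: group b is more variable, so it piles up in the outer quartiles
a = rng.normal(50, 5, 80)
b = rng.normal(50, 12, 80)

pooled = np.concatenate([a, b])
cuts = np.quantile(pooled, [0.25, 0.5, 0.75])   # quartile boundaries of pooled data

# Count how many of each group fall into each of the four quartile intervals
table = np.array([np.bincount(np.digitize(g, cuts), minlength=4) for g in (a, b)])
print(table)

chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```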

A 1967 NASA English translation of the original 1962 Bauer paper is available in the Downloadables section of this site.

Recent Application in Journal of Cell Biology

Click to access 809.full.pdf

Further / Future reading

Bauer RK (1962) Der “Median-Quartile Test”… Metrika, 5, 1-16.

Von Eye A et al. (1996) The median quartiles test revisited. Studia Psychologica, 38, 79-84.

Minitab 17: think Mini Cooper, not Minnie Mouse

As it has been 3 or 4 years since the previous version, the new release of the Minitab 17 statistical package is surely cause for rejoicing, merriment, and an extra biscuit with a strong cup of tea.

At one of the centres where I work, the data analysts sit at the same lunch table but are known by their packages: the Stata people, the SAS person, the R person, the SPSS person and so on. No Minitab person as yet, but maybe there should be. Not only for its easy-to-use graphics, mentioned in a previous post, but for its all-round interface, its programmability (Minitab syntax looks a little like BASIC, that great Kemeny-Kurtz language from 1964 Dartmouth College, but more powerful), a few new features (Poisson regression for relative risks and counted data, although alas no negative binomial regression for trickier counted data), and even better graphics.

There are bubble plots, outlier tests, and the Box-Cox transformation (another great collaboration from 1964). Minitab was also one of the first packages to include Exploratory Data Analysis (e.g. box plots and smoothed regression), for when the data are about as well-behaved as the next-door neighbours strung out on espresso coffee mixed with red cordial.
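Non-Minitabbers can try the Box-Cox transformation just as easily; a minimal sketch in Python, assuming scipy and some deliberately skewed made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Skewed, badly-behaved data (lognormal), the kind Box-Cox was made for
raw = rng.lognormal(mean=2.0, sigma=0.8, size=200)

transformed, lam = stats.boxcox(raw)   # best lambda chosen by maximum likelihood
print(f"estimated lambda = {lam:.2f}")
print(f"skewness before: {stats.skew(raw):.2f}, after: {stats.skew(transformed):.2f}")
```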

Not as much cachet for when the R and SAS programmers come a-swaggering in, but still worth recommending for those who may not be getting as much as they should out of SPSS, particularly for graphics, yet find the other packages a little too high a hill to climb.

http://www.minitab.com/en-us/