February 2014 – Synergistic Statistical Consulting, Analysis, Arboronics

Minitab 17: think Mini Cooper, not Minnie Mouse

As it has been 3 or 4 years since the previous version, the new release of the Minitab 17 statistical package is surely cause for rejoicing, merriment, and an extra biscuit with a strong cup of tea.

At one of the centres where I work, the data analysts sit at the same lunch table, but are known by their packages: the Stata people, the SAS person, the R person, the SPSS person and so on. No Minitab person as yet, but maybe there should be. Not only for its easy-to-use graphics, mentioned in a previous post, but for its all-round interface, programmability (Minitab syntax looks a little like BASIC, that great Kemeny-Kurtz language from Dartmouth College in 1964, but more powerful), a few new features (Poisson regression for relative risks & counted data, although alas no negative binomial regression for trickier counted data), and even better graphics.

Also on offer are bubble plots, outlier tests, and the Box-Cox transformation (another great collaboration from 1964). Minitab was also one of the first packages to include Exploratory Data Analysis (e.g. box plots and smoothed regression), for when the data are about as well-behaved as the next-door neighbours strung out on espresso coffee mixed with red cordial.
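For the curious, the Box-Cox transformation is the power family (y^λ − 1)/λ for λ ≠ 0, and log(y) for λ = 0, applied to strictly positive data. Here is a minimal Python sketch using only the standard library. The function name and the data are my own inventions for illustration, not Minitab's implementation, and a real analysis would also estimate λ from the data rather than pick it by hand.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox power transformation to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        # The limit of (y**lam - 1) / lam as lam -> 0 is the natural log.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical right-skewed data, invented for illustration only.
skewed = [1, 2, 4, 8, 16, 32]
logged = box_cox(skewed, 0)      # lambda = 0: log transform
rooted = box_cox(skewed, 0.5)    # lambda = 0.5: square-root-like transform
```

With λ = 0 the doubling sequence above becomes evenly spaced, which is the sort of taming of skewness the transformation is used for.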

Not as much cachet for when the R and SAS programmers come a-swaggering in, but still worth recommending for those who may not be getting as much as they should be out of SPSS, particularly for graphics, yet find the other packages a little too high to climb.

http://www.minitab.com/en-us/

Simple Stats: Food, Friends, Families and F values

Way back when I was a young data analyst, there were limitations to the techniques available for analysing certain types of data. If the data involved counts, for example, there were certain types of transformation, and for repeated measurements over time, one needed ‘fiddle factors’ such as the G-G and H-F, or ‘scattergun’ mighty MANOVA approaches, that lacked in statistical power what they made up in firepower.

These days, even dear old SPSS has some sophisticated regression models. But whereas once there was a ‘trees not forest’ approach of running a whole lot of basic tests, looking for ‘significant’ p values rather than practical effect sizes and generality, now there are complex ‘forest’ tests, run without understanding the output, or even the question.

When talking about simplicity, analysts often recall the monk William of Occam and his “razor” (‘vain to do with more what can be done with fewer’) or misquote Albert Einstein, who probably never actually said ‘everything should be made as simple as possible, but not simpler’.

I like the ancient Greek, Epicurus of Athens, who was big on simple things like food, and friends and families (although his name has come to be associated with a sort of false, hoggish hedonism, which defeats the purpose). I reckon we need to get a wooden table, some nice fresh food, jugs of (unfermented & fermented) grape, and after the important things like art and sport and the latest clips on Rage night music have been discussed, then talk about research questions, how they are to be answered, and in what sensible but creative manner, so as to get back to other things.

We’d begin with graphical techniques, with the purpose of saying ‘aha’ or ‘Eureka’; not ‘gosh’ or ‘wow’ or ‘huh?’. Building up with fundamental methods, then perhaps more complex methods if needed, we’d test our models on fresh samples, looking at how well they hold up, and at effect sizes as well as confidence intervals and p values. I reckon that’s the sort of data party that even old Epicurus might have attended! http://textpublishing.com.au/books-and-authors/book/travels-with-epicurus/

http://www.dkstatisticalconsulting.com/practical-statistics/ <great book for analysing counts etc using SPSS & Stata>

Expected Unexpected: Power bands, performance curves, rogue waves and black swans

Many years ago, I had a ride on a Kawasaki 500 Mach III 2-stroke motorcycle, which along with its even more horrendous 750cc version was known as the ‘widow-maker’. It was incredibly fast in a straight line, but if it went around corners at all, the rider had long since fallen (or jumped) off!

It also had a very narrow ‘power band’ http://en.wikipedia.org/wiki/Power_band, in that it would have no real power until about 7,000 revs per minute, and then all of a sudden it would whoop and holler like the proverbial bat out of hell, the front wheel would lift, the rider’s jaw drop, and well, you get the idea! In statistical terms, this was a nonlinear relationship between twisting the throttle and the available power.

A somewhat less dramatic example of a nonlinear effect is the Yerkes-Dodson ‘law’ http://en.wikipedia.org/wiki/Yerkes%E2%80%93Dodson_law, in which optimum task performance is associated with medium levels of arousal (too much arousal = the ‘heebie-jeebies’, too little = ‘half asleep’).

Various simple & esoteric methods exist for finding global relationships (the data follow a standard pattern such as a U shape, or upside down U) or local ones (different parts of the data might be better explained by different models, rather than ‘one size fits all’). A popular ‘local’ method is known as a ‘spline’, after the flexible metal ruler that draftspeople once fitted curves with. The ‘GT’ version, Multivariate Adaptive Regression Splines http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines, is available in R (itself a little reminiscent of a Mach III cycle at times!), the big-iron ‘1960’s 390 cubic inch Ford Galaxie V8’ of the SAS statistical package, and the original, sleek ‘Ferrari V12’ Salford Systems version.
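To give a flavour of the ‘local’ idea, here is a toy Python sketch (standard library only) that fits a single ‘hinge’ function, max(0, x − knot), of the kind MARS uses as building blocks, and tries each observed x as a candidate knot. This is a deliberately stripped-down illustration with made-up ‘power band’ numbers and my own function names. Real MARS software uses pairs of hinges, several predictors, and forward and backward selection passes, none of which appear here.

```python
def hinge(x, knot):
    """Hinge basis function: zero up to the knot, linear after it."""
    return max(0.0, x - knot)

def fit_one_hinge(xs, ys, knot):
    """Least-squares fit of y = a + b * hinge(x, knot); returns (a, b, sse)."""
    hs = [hinge(x, knot) for x in xs]
    n = len(xs)
    h_mean = sum(hs) / n
    y_mean = sum(ys) / n
    sxx = sum((h - h_mean) ** 2 for h in hs)
    if sxx == 0:
        # The hinge is flat for every point, so only an intercept can be fitted.
        b = 0.0
    else:
        b = sum((h - h_mean) * (y - y_mean) for h, y in zip(hs, ys)) / sxx
    a = y_mean - b * h_mean
    sse = sum((y - (a + b * h)) ** 2 for h, y in zip(hs, ys))
    return a, b, sse

def best_knot(xs, ys):
    """Try each observed x as a candidate knot and keep the smallest SSE."""
    return min(sorted(set(xs)), key=lambda k: fit_one_hinge(xs, ys, k)[2])

# Hypothetical data: revs in thousands per minute, power in arbitrary units.
revs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
power = [5, 5, 5, 5, 5, 5, 5, 25, 45, 65]
knot = best_knot(revs, power)
a, b, sse = fit_one_hinge(revs, power, knot)
```

On these invented numbers the search settles on a knot at 7 (thousand revs), flat before and climbing steeply after, which a single straight line through all the points would smear out.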

Other nonlinear methods are available http://en.wikipedia.org/wiki/Loess_curve, but the thing to remember is that life doesn’t always fit within the lines, or follow some human’s idea of a ‘natural law’.

For example, freak or rogue waves, which can literally break supertankers in half, were observed for centuries by mariners but have only recently been accepted by shore-bound scientists. Much the same goes for the black swans (actually native to Australia) of the stock market http://www.fooledbyrandomness.com/

When analysing data, fitting models, (or riding motorcycles), please be careful!

Visual Trees: The Book of Trees by Manuel Lima

Apart from the iconic and mysterious Australian Nullarbor http://www.nullarbornet.com.au/ (literally ‘no’, or actually very few, trees) and the baddest and saddest of outer suburban concrete jungles, trees are a major part of our daily life. Trees produce shade and oxygen, and provide inspiration for dreaming scientists watching apples fall. ‘Tree of life’. ‘Family trees’. Tree branches have also long provided a metaphor for branches of knowledge and classification systems.

In his excellent new book ‘The book of trees: visualizing branches of knowledge’,

https://www.papress.com/html/book.details.page.tpl?isbn=9781616892180

Manuel Lima, Designer and Fellow of the Royal Society of Arts (http://www.visualcomplexity.com/vc/) examines the role of trees in history, religion, philosophy, biology, computer science, data visualization, information graphics and data analysis / statistics.

Covering various types of tree graphs, including radial trees, sunbursts, Ben Shneiderman’s Treemaps and Voronoi Treemaps, Lima’s treatise provides inspirational historical and contemporary pictures, including timely applications such as looking at the words that appear with ‘I’ and ‘you’ in Google texts.

Statistical applications covered are mainly confined to Icicle plots or trees, used in applications such as cluster analysis, or the grouping of observations into related classes, ‘taxa’ or clusters such as disease categories.

Not published in the Northern hemisphere until April 2014, the book is available now in Melbourne, Australia for around $50, e.g. www.ngvshop.ngv.vic.gov.au (the online search does not work) or http://metropolisbookshop.com.au

Accompanied by sources of information on how to construct such diagrams (e.g. http://www.flowingdata.com), Lima’s new book will serve as an accessible and constant source of information on visualizing trees for new, as well as existing, ‘arborists’.

‘Velut arbor aevo’

‘May the Tree Thrive’!

Treemap software

http://www.cs.umd.edu/hcil/treemap/

http://www.treemap.com/

http://www.tableausoftware.com/

Statistical Outliers: of Baldness and Long Gestations

At what point is a human gestation period ‘impossibly’ long? This was the question a British court had to consider in the 1949 appeal to the 1948 judgement in Hadlum vs Hadlum.

Ms Hadlum had a gestation period of 349 days, taking into account when Mr Hadlum went off to the war. The average human gestation is 40 weeks or 280 days, although new research shows an average of 268 days or 38 weeks, varying by as much as 37 days http://www.sciencedaily.com/releases/2013/08/130806203327.htm

The widely used statistical definition of an outlier was given by Douglas Hawkins in 1980, ‘an observation which deviates so much from other observations as to cause suspicions that it was generated by a different mechanism’. (Hawkins DM, 1980, Identification of outliers. Chapman & Hall).

Hmm! The court upheld the 1948 finding that such a long gestation was possible, and so Ms Hadlum had not been ‘unfaithful’ to Mr Hadlum, cause for divorce back in those dark days. In the 1951 case of Preston-Jones vs Preston-Jones, however, the court found a gestation period of 360 days to be the limit. The judge concluded that ‘If a line has to be drawn I think it should be drawn so as to allow an ample and generous margin’.

Statisticians have established guidelines for ‘outliers’, that are lines in the sand, if not in concrete.

But speaking of sand, at what point do grains of sand form a heap of sand?

How many hairs constitute the threshold distinguishing between bald and not bald?

(philosophers call this the Sorites or ‘heap’ paradox).

The world ‘forgot’ how to make concrete from about 500-1300 AD, but was there a day when we could still make concrete, and a day in which we couldn’t? Something to think about on a Sunday afternoon!

2014 Excel implementation of some simple outlier detection techniques, by John Foreman http://au.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html
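One simple ‘line in the sand’ of the kind statisticians use is Tukey’s fences: flag anything more than 1.5 interquartile ranges below the first quartile or above the third. Here is a small Python sketch using only the standard library. The gestation figures are invented purely for illustration (apart from the 349 days from Hadlum vs Hadlum), the function name is my own, and this is certainly not the reasoning the courts used.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical gestation lengths in days; only the 349 comes from the post.
gestations = [262, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 349]

low, high = tukey_fences(gestations)
outliers = [g for g in gestations if g < low or g > high]
```

Note that different packages compute quartiles in slightly different ways, so the fences can shift a little from one program to the next. Lines in the sand, indeed.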

References on the above legal cases

1978 Statistics journal: http://www.jstor.org/discover/10.2307/2347159?uid=2&uid=4&sid=21103476515283

1953 Medical journal: http://link.springer.com/article/10.1007/BF02949756

Resulting Consulting: Excel for Stats – 800 pound Gorilla or just Monkeying around?

When hearing of folks running statistical analysis with Excel, statisticians often have panicky images of ‘Home Haircutting, with Electric Shears, in the Wet’!

Mind you, Excel really is great for processing data, but analysing it in a more formal or even exploratory sense can be a trifle tricky.

On the upside, many work computers have Excel installed, it’s readily available for quite a low price even if one is not a student or an academic, and for the most part is well designed and simple to use. It’s very easy to develop a spreadsheet that shows each individual calculation needed for a particular formula such as the standard deviation, for instance. Such flexibility is wonderful for learning and teaching stats, because everyone can see the steps involved in actually getting an answer, more so than the usual press-button, window click, typing ‘esoteric’ commands.
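The same ‘show every step’ idea can be sketched outside a spreadsheet too. Below is a minimal Python version (standard library only) of the sample standard deviation, with each intermediate ‘column’ kept as its own variable. The scores are made up for illustration.

```python
import math

# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean = sum(scores) / n                        # step 1: the mean
deviations = [x - mean for x in scores]       # step 2: each score minus the mean
squared = [d ** 2 for d in deviations]        # step 3: square each deviation
sum_of_squares = sum(squared)                 # step 4: add up the squares
sample_variance = sum_of_squares / (n - 1)    # step 5: divide by n - 1
sample_sd = math.sqrt(sample_variance)        # step 6: take the square root
```

Each line corresponds to one column or cell in the spreadsheet layout described above, so a learner can check any step by hand.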

On the downside, pre-2010 versions of Excel had both practical accuracy issues (with functions & the add-in statistics toolpak) and validity issues (employing non-usual methods for things like handling ties in ranked data). There are still no nonparametric tests (e.g. Wilcoxon), and Excel is still a bit light on for confidence intervals, regression diagnostics, and for performing production, shop-floor type statistical analyses. More of an adjustable wrench than a set of spanners?
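As for ties, the usual textbook approach in rank-based tests is to give tied values the average of the ranks they would otherwise occupy (‘midranks’). Here is a minimal Python sketch of that approach, using only the standard library. The function name and data are mine, for illustration only.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2   # mean of the 1-based ranks in the block
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks

# Hypothetical data with a three-way tie at 20.
data = [10, 20, 20, 30, 20, 40]
ranks = average_ranks(data)
```

The three tied 20s would have taken ranks 2, 3 and 4, so each gets 3, and the ranks still add up to n(n + 1)/2, which is what the standard rank-test formulas assume.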

In sum, if used wisely, Excel is a useful adjunct to third party statistical add-ins or statistical packages, but please avoid pie charts, especially 3D ones, and watch out for those banana skins….

**Excel 2010 (& Gnumeric & OpenOffice) Accuracy / Validity**

http://www.tandfonline.com/doi/abs/10.1198/tas.2011.09076#.UvH4rp24a70

http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf

**Some Excel Statistics Books**

Conrad Carlberg http://www.quepublishing.com/store/statistical-analysis-microsoft-excel-2013-9780789753113

Mark Gardener http://www.pelagicpublishing.com/statistics-for-ecologists-using-r-and-excel-data-collection-exploration-analysis-and-presentation.html

Neil Salkind http://www.sagepub.com/books/Book236672?siteId=sage-us&prodTypes=any&q=salkind&fs=1

**Some Statistical Add-Ins for Excel**

Analyse-It http://analyse-it.com

DataDesk /XL http://www.datadesk.com

RExcel (interfaces Excel to open source R) http://rcom.univie.ac.at/

XLStat http://www.xlstat.com/en/

**Some Open Source Spreadsheets**

Gnumeric https://projects.gnome.org/gnumeric/

OpenOffice http://www.openoffice.org.au/

Olden goldies: Cybernetic forests 1967

Richard Brautigan was an American author and poet who, in 1967’s Summer of Love in San Francisco, published ‘All Watched Over by Machines of Loving Grace’, wishing for a future in which computers could save humans from drudgery (such as performing statistical operations by hand?).

Apart from the perkier PDP ‘mini-computers’, computers of Brautigan’s day were hulking behemoths with more brawn than brain, and a scary dark side, as seen through HAL in 1968’s ‘2001: A Space Odyssey’.

Brautigan’s poem applied sweet 1960’s kandy-green hues to these cold & clanging monsters, just a few years away from friendly little Apple and other micro’s of the 70’s & 80’s. Now we are all linked on the web, and if we get tired of that we can talk to the electronic aide and confidante Siri, developed at Menlo Park, California – not too far away in space, if not time, from where Brautigan wrote.

We can get a glimpse of a ‘rosy’ future in the even friendlier electronic personality operating system of Spike Jonze’s new movie ‘Her’.

In his BBC documentaries of 2011, named after Brautigan’s poem, filmmaker Adam Curtis argues that computers have not liberated humanity much, if at all.

Yet there is still something about Richard Brautigan’s original 1967 poem, something still worth wishing for!

All three verses, further information and audio of Mr Brautigan reading his poem can be found at

http://www.brautigan.net/machines.html

CSIRAC, a real-life Australian digital dinosaur, that stomped the earth from 1949 to 1964, and is the only intact but dormant (hopefully!) first generation computer left anywhere in the world, can be viewed on the lower ground floor of the Melbourne Museum.

(and yes, this computer was used for statistical analyses, as well as other activities, such as composing electronic music)

http://museumvictoria.com.au/csirac/