Synergistic Statistical Consulting, Analysis, Arboronics

Visual Trees: The Book of Trees by Manuel Lima

Apart from the iconic and mysterious Australian Nullarbor http://www.nullarbornet.com.au/ (literally ‘no’, or actually very few, trees) and the baddest and saddest of outer suburban concrete jungles, trees are a major part of our daily life. Trees produce shade and oxygen, and provide inspiration for dreaming scientists watching apples fall. ‘Tree of life’. ‘Family trees’. Tree branches have also long provided a metaphor for branches of knowledge and classification systems.

In his excellent new book ‘The book of trees: visualizing branches of knowledge’,

https://www.papress.com/html/book.details.page.tpl?isbn=9781616892180

Manuel Lima, Designer and Fellow of the Royal Society of Arts (http://www.visualcomplexity.com/vc/), examines the role of trees in history, religion, philosophy, biology, computer science, data visualization, information graphics and data analysis / statistics.

Covering various types of tree graphs, including radial trees, sunbursts, Ben Shneiderman’s Treemaps and Voronoi Treemaps, Lima’s treatise provides inspirational historical and contemporary pictures, including timely applications such as looking at the words that appear with ‘I’ and ‘you’ in Google texts.

Statistical applications covered are mainly confined to Icicle plots or trees, used in applications such as cluster analysis, or the grouping of observations into related classes, ‘taxa’ or clusters such as disease categories.

Not published in the Northern hemisphere until April 2014, the book is available now in Melbourne, Australia for around $50, e.g. www.ngvshop.ngv.vic.gov.au (the online search does not work) or http://metropolisbookshop.com.au

Accompanied by sources of information on how to construct such diagrams (e.g. http://www.flowingdata.com), Lima’s new book will serve as an accessible and constant source of information on visualizing trees for new, as well as existing, ‘arborists’.

‘Velut arbor aevo’

‘May the Tree Thrive’!

Treemap software

http://www.cs.umd.edu/hcil/treemap/

http://www.treemap.com/

http://www.tableausoftware.com/

Statistical Outliers: of Baldness and Long Gestations

At what point is a human gestation period ‘impossibly’ long? This was the question a British court had to consider in the 1949 appeal against the 1948 judgement in Hadlum vs Hadlum.

Ms Hadlum had a gestation period of 349 days, taking into account when Mr Hadlum went off to the war. The average human gestation is 40 weeks or 280 days, although new research shows an average of 268 days or 38 weeks, varying by ± 37 days http://www.sciencedaily.com/releases/2013/08/130806203327.htm

The widely used statistical definition of an outlier was given by Douglas Hawkins in 1980, ‘an observation which deviates so much from other observations as to cause suspicions that it was generated by a different mechanism’. (Hawkins DM, 1980, Identification of outliers. Chapman & Hall).

Hmm! The court upheld the 1948 finding that such a long gestation was possible, and so Ms Hadlum had not been ‘unfaithful’ to Mr Hadlum (unfaithfulness being grounds for divorce back in those dark days). In the 1951 case of Preston-Jones vs Preston-Jones, however, the court found a gestation period of 360 days to be the limit. The judge concluded that ‘If a line has to be drawn I think it should be drawn so as to allow an ample and generous margin’.

Statisticians have established guidelines for ‘outliers’ that are lines in the sand, if not in concrete.
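One such ‘line in the sand’ is John Tukey’s boxplot fence rule, sketched below in Python. This is only an illustration: the gestation lengths are made up (apart from the 349 days from the Hadlum case), the multiplier of 1.5 is the conventional default rather than anything the courts used, and the function names are my own.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # 'inclusive' treats the data as the whole population of interest
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """List the values falling outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical gestation lengths in days (invented for illustration),
# with the 349 days from Hadlum vs Hadlum added at the end.
gestations = [266, 270, 272, 275, 278, 280, 281, 283, 285, 288, 291, 349]

print(tukey_fences(gestations))
print(flag_outliers(gestations))
```

Changing `k` moves the line: a larger multiplier gives the ‘ample and generous margin’ the judge had in mind, a smaller one flags more observations.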

But speaking of sand, at what point do grains of sand form a heap of sand?

How many hairs constitutes the threshold distinguishing between bald and not bald?

(Philosophers call this the Sorites or ‘heap’ paradox.)

The world ‘forgot’ how to make concrete from about 500-1300 AD, but was there a day when we could still make concrete, and a day in which we couldn’t? Something to think about on a Sunday afternoon!

2014 Excel implementation of some simple outlier detection techniques, by John Foreman http://au.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html

References on the above legal cases

1978 Statistics journal: http://www.jstor.org/discover/10.2307/2347159?uid=2&uid=4&sid=21103476515283

1953 Medical journal: http://link.springer.com/article/10.1007/BF02949756

Resulting Consulting: Excel for Stats – 800 pound Gorilla or just Monkeying around?

When hearing of folks running statistical analysis with Excel, statisticians often have panicky images of ‘Home Haircutting, with Electric Shears, in the Wet’!

Mind you, Excel really is great for processing data, but analysing it in a more formal or even exploratory sense, can be a trifle tricky.

On the upside, many work computers have Excel installed, it’s readily available for quite a low price even if one is not a student or an academic, and for the most part is well designed and simple to use. It’s very easy to develop a spreadsheet that shows each individual calculation needed for a particular formula such as the standard deviation, for instance. Such flexibility is wonderful for learning and teaching stats, because everyone can see the steps involved in actually getting an answer, more so than the usual press-button, window click, typing ‘esoteric’ commands.
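The same ‘show every step’ idea can be sketched outside a spreadsheet too. The Python below lays out the sample standard deviation one column at a time, much as one might in Excel; the data values and the name `sd_steps` are invented for illustration.

```python
import math
import statistics

def sd_steps(values):
    """Sample standard deviation, keeping every intermediate 'column'."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]      # column: x - mean
    squared = [d * d for d in deviations]        # column: (x - mean)^2
    sum_sq = sum(squared)
    variance = sum_sq / (n - 1)                  # sample variance (n - 1)
    return {
        "n": n,
        "mean": mean,
        "deviations": deviations,
        "squared": squared,
        "sum_sq": sum_sq,
        "variance": variance,
        "sd": math.sqrt(variance),
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
steps = sd_steps(scores)
for name, value in steps.items():
    print(name, value)

# Cross-check against the library's 'press-button' answer
print(statistics.stdev(scores))
```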

On the downside, pre-2010 versions of Excel had both practical accuracy issues (with functions & the add-in Analysis ToolPak) and validity issues (non-standard methods were employed for things like handling ties in ranked data). There are still no nonparametric tests (e.g. Wilcoxon), and Excel is still a bit light on for confidence intervals, regression diagnostics, and for performing production, shop-floor type statistical analyses. More of an adjustable wrench than a set of spanners?
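To see why the handling of ties matters, here is a rough Python sketch comparing two ways of ranking tied values. It assumes the issue is of the ‘every tie gets the same lowest rank’ versus ‘ties share the average rank’ kind; it is not a reproduction of any particular Excel function, and the function names and data are my own.

```python
def competition_ranks(values):
    """Tied values all receive the smallest rank in their group (1 = smallest)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def average_ranks(values):
    """Tied values share the mean of the ranks they jointly occupy (midranks)."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1     # lowest rank held by this value
        count = ordered.count(v)         # how many observations are tied
        ranks.append(first + (count - 1) / 2)
    return ranks

data = [10, 20, 20, 30]   # made-up data with one tie
print(competition_ranks(data))
print(average_ranks(data))
```

Rank-based tests such as the Wilcoxon generally rely on midranks, which keep the rank total at n(n+1)/2; the other scheme loses that property whenever there are ties.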

In sum, if used wisely, Excel is a useful adjunct to third party statistical add-ins or statistical packages, but please avoid pie charts, especially 3D ones, and watch out for those banana skins….

**Excel 2010 (& Gnumeric & OpenOffice) Accuracy / Validity**

http://www.tandfonline.com/doi/abs/10.1198/tas.2011.09076#.UvH4rp24a70

http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf

**Some Excel Statistics Books**

Conrad Carlberg http://www.quepublishing.com/store/statistical-analysis-microsoft-excel-2013-9780789753113

Mark Gardener http://www.pelagicpublishing.com/statistics-for-ecologists-using-r-and-excel-data-collection-exploration-analysis-and-presentation.html

Neil Salkind http://www.sagepub.com/books/Book236672?siteId=sage-us&prodTypes=any&q=salkind&fs=1

**Some Statistical Add-Ins for Excel**

Analyse-It http://analyse-it.com

DataDesk/XL http://www.datadesk.com

RExcel (interfaces Excel to open source R) http://rcom.univie.ac.at/

XLStat http://www.xlstat.com/en/

**Some Open Source Spreadsheets**

Gnumeric https://projects.gnome.org/gnumeric/

OpenOffice http://www.openoffice.org.au/

Olden goldies: Cybernetic forests 1967

Richard Brautigan was an American author and poet who, in 1967’s Summer of Love in San Francisco, published ‘All Watched Over by Machines of Loving Grace’, wishing for a future in which computers could save humans from drudgery (such as performing statistical operations by hand?).

Apart from the perkier PDP ‘mini-computers’, computers of Brautigan’s day were hulking behemoths with more brawn than brain, and a scary dark side, as seen through HAL in 1968’s ‘2001: A Space Odyssey’.

Brautigan’s poem applied sweet 1960s kandy-green hues to these cold & clanging monsters, just a few years away from friendly little Apple and other micros of the 70s & 80s. Now we are all linked on the web, and if we get tired of that we can talk to the electronic aide and confidante Siri, developed at Menlo Park, California – not too far away in space, if not time, from where Brautigan wrote.

We can get a glimpse of a ‘rosy’ future in the even friendlier electronic personality operating system of Spike Jonze’s new movie ‘Her’.

In his BBC documentaries of 2011, named after Brautigan’s poem, filmmaker Adam Curtis argues that computers have not liberated humanity much, if at all.

Yet there is still something about Richard Brautigan’s original 1967 poem, something still worth wishing for!

All three verses, further information and audio of Mr Brautigan reading his poem can be found at

http://www.brautigan.net/machines.html

CSIRAC, a real-life Australian digital dinosaur, that stomped the earth from 1949 to 1964, and is the only intact but dormant (hopefully!) first generation computer left anywhere in the world, can be viewed on the lower ground floor of the Melbourne Museum.

(and yes, this computer was used for statistical analyses, as well as other activities, such as composing electronic music)

http://museumvictoria.com.au/csirac/

SecretSource: of Minitab and Dataviz

When the goers go and the stayers stay, when shirts loosen and tattoos glisten, it’s time for the statisticians and the miners and the data scientists to talk, and walk, Big Iron.

R. S-Plus. SAS. Tableau. Stata. GnuPlot. Mondrian. DataDesk. Minitab. MINITAB?????? Okay, we’ll leave the others to get back to their arm wrasslin’, but if you want to produce high quality graphs, simply, readily and quickly, then Minitab could be for you.

A commercialized version of Omnitab, Minitab appeared at Pennsylvania State University in 1972 and has long been associated with students learning stats, but also now with business, industrial and medical/health quality management and six sigma, etc. There are some other real ‘rough and tumble’ applications involving Minitab – DR Helsel’s ‘Statistics for Censored Environmental Data using Minitab and R’ (Wiley 2012), for instance.

IBM SPSS and Microsoft Excel can produce good graphs (‘good’ in the ‘good sense’ of John Tukey, Edward Tufte, William Cleveland, Howard Wainer, Stephen Few & Nathan Yau etc.), with the soft pedal down and ‘caution switches’ on, but Minitab is probably going to be easier.

For example, the Statistical Consulting Centre at the University of Melbourne uses Minitab for most of its graphs (R for the trickiest ones). As well as general short courses on Minitab, R, SPSS and GenStat there’s a one day course in Minitab graphics in November, which I’ve done and can recommend.

More details on the Producing Excellent Graphics Simply (PEGS) course using Minitab at Melbourne are at

http://www.scc.ms.unimelb.edu.au/pegs.html

Student and academic pricing for Minitab is at http://onthehub.com/

What, I wonder, would Florence Nightingale have used for graphic software if she were alive today?

2014 Books: Medical Illuminations and another Trout in the Milk

The first cab off the rank for 2014 is Howard Wainer’s ‘Medical Illuminations: Using Evidence, Visualization & Statistical Thinking to Improve Healthcare’, Oxford University Press, 2014. It costs around $40 Australian.

Dr Wainer has written several great graphics books, including 2005’s ‘Graphic Discovery: a Trout in the Milk and Other Visual Adventures’, Princeton University Press.

The new book has more of a medical theme, including extremely useful chapters on medical prediction, the importance of showing diabetes patients real-time and understandable information on their blood sugar levels, and the over-use of pie charts.

Although not mentioned in the above books, Florence Nightingale, nursing pioneer and first female Fellow of what was to become the Royal Statistical Society, developed and used graphs and charts (admittedly an early form of pie chart). Ms Nightingale used such graphs to clearly show Queen Victoria, who wasn’t a statistician and wouldn’t have appreciated heaps and heaps of tables, the very real problems that soldiers were facing in the Crimean War due to poor sanitation.

Since then, much medical data is routinely collected and statistically analysed, but there is still a long way to go in terms of portraying and illuminating that information to medical staff and the patients and carers themselves. Books like Medical Illuminations, supplemented by general info on the ‘how’ of graphic presentation using readily available software (Wainer’s texts focus mainly on the ‘who’, ‘what’ and ‘why’), will help to achieve such an important goal.

Recommended, for non-statisticians and statisticians alike!

Oxford University Press website: http://www.oup.com.au/titles/academic/medicine/9780199668793

Electric Stats: PSPP and SPSS

Most people use computer stats packages if they want to perform statistical or data analysis. One of the most popular packages, particularly in psychology and physiotherapy, is SPSS, now known as IBM SPSS. Although there is room for growth in some areas such as ‘robust regression’ (regression for handling data that may not follow the usual assumptions), IBM SPSS has many jazzy features / options such as decision trees and neural nets and Monte Carlo simulation, as well as all the old faves like ANOVA, t-tests and chi-square.

I love SPSS and have been using it since 1981, back when SPSS analyses had to be submitted to run after 11 pm (23:00) so as not to hog the ‘mainframe’ computer resources. Alas, as with Minitab, SAS, Stata and others, SPSS can be expensive if you’re not a student or academic. An open source alternative that is free as in sarsaparilla and free as in speech is GNU PSPP, which has nothing whatsoever to do with IBM or the former SPSS Inc.

PSPP has a syntax or command line / program interface for old school users such as myself, *and* a snazzy GUI or Graphical User Interface. Currently, it doesn’t have all the features that 1981 SPSS had (e.g. ‘two-way ANOVA’), let alone the more recent features, although it does have logistic regression for binary outcomes such as depressed / non depressed. PSPP is easy to use (easier than open source R and perhaps even R Commander, although nowhere near as powerful).

PSPP can handle most basic analyses, and is great for starters and for those who need to run basic analyses or test syntax on a computer, at a worksite etc, where SPSS is not installed. The PSPP team is to be congratulated!

http://www.gnu.org/software/pspp/ free, open-source PSPP

http://www-01.ibm.com/software/analytics/spss/ IBM SPSS

(students and academics can obtain less expensive versions of IBM SPSS from http://onthehub.com)