Snappy Stepwise Regression

Stepwise regression, the technique that attempts to select a smaller subset of variables from a larger set by, at each step, choosing the ‘best’ or dropping the ‘worst’, was developed back in the late 1950s by applied statisticians in the petroleum and automotive industries. With an ancestry like this, it’s no wonder that it is often regarded as the statistical version of the early ’60s Chev Corvair: at best only ‘driveable’ by expert, careful users or, in the immortal words of the title of Ralph Nader’s 1965 book, ‘Unsafe at Any Speed’.

Well, maybe. But if used with cross-validation and good sense, it’s an old-tech standby to the later-model ‘lasso’ and ‘elastic net’ techniques. However, there’s an easy way to put a bit of a softshoe shuffle into the old stepwise routine: see how well (preferably on a fresh set of data) forward entry with just one, or maybe two, or at most three variables does, compared with larger models. (SAS and SPSS allow the number of steps to be specified.)
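For those who like to see the shuffle danced, here is a minimal Python sketch of forward entry capped at one, two or three steps, with each model then checked on a fresh half of the data. Everything in it is an assumption for illustration: the data are simulated (only columns 2 and 7 truly matter), the helper names are made up, and entry is decided by training mean squared error rather than the F-to-enter tests of SAS or SPSS.

```python
import numpy as np

def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(len(X)), X[:, list(cols)]])

def fit(X, y, cols):
    """Ordinary least-squares coefficients for the chosen columns."""
    coef, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return coef

def mse(X, y, cols, coef):
    """Mean squared prediction error of a fitted model on (X, y)."""
    return float(np.mean((y - design(X, cols) @ coef) ** 2))

def forward_select(X, y, max_steps):
    """Add, one at a time, the variable that most reduces training error."""
    chosen = []
    for _ in range(max_steps):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        if not candidates:
            break
        best = min(candidates,
                   key=lambda j: mse(X, y, chosen + [j], fit(X, y, chosen + [j])))
        chosen.append(best)
    return chosen

# Simulated data (an assumption): ten candidate predictors, two real ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = 3.0 * X[:, 2] + 1.5 * X[:, 7] + rng.normal(size=400)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]   # the 'fresh set of data'

results = {}
for steps in (1, 2, 3):
    cols = forward_select(X_train, y_train, steps)
    results[steps] = (cols, mse(X_test, y_test, cols, fit(X_train, y_train, cols)))
    print(steps, "step(s):", cols, "holdout MSE", round(results[steps][1], 3))

all_cols = list(range(X.shape[1]))
full_mse = mse(X_test, y_test, all_cols, fit(X_train, y_train, all_cols))
print("all ten variables: holdout MSE", round(full_mse, 3))
```

The comparison to look at is the holdout error of the small models against that of the full ten-variable model.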

Or, if you’d like to do some slightly fancier steps in two-tone spats, try a best subset regression (available in SAS, in SPSS through Automatic Linear Modeling, and in Minitab, R etc.) of all one-variable combinations, then two variables, then three variables.
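A best subset search over one, two and three variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated (columns 0 and 3 carry the signal), the function names are hypothetical, and the ‘best’ subset of each size is simply the one with the smallest training error, which is then scored on held-out data.

```python
from itertools import combinations

import numpy as np

def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(len(X)), X[:, list(cols)]])

def fit(X, y, cols):
    """Ordinary least-squares coefficients for the chosen columns."""
    coef, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return coef

def mse(X, y, cols, coef):
    """Mean squared prediction error of a fitted model on (X, y)."""
    return float(np.mean((y - design(X, cols) @ coef) ** 2))

def best_subset(X, y, size):
    """Try every combination of `size` columns; keep the best training fit."""
    return min(combinations(range(X.shape[1]), size),
               key=lambda cols: mse(X, y, cols, fit(X, y, cols)))

# Simulated data (an assumption): six candidate predictors, two real ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=300)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

best = {}
for size in (1, 2, 3):
    cols = best_subset(X_train, y_train, size)
    best[size] = (cols, mse(X_test, y_test, cols, fit(X_train, y_train, cols)))
    print(size, "variable(s):", cols, "holdout MSE", round(best[size][1], 3))
```

With only a handful of candidate variables the exhaustive search is cheap; the number of combinations grows quickly as the candidate list lengthens.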

The inspiration for this is partly from Gerd Gigerenzer’s ‘take the best’ heuristic: taking the best cue or clue often beats more complex techniques, including multiple regression. ‘Take the best’ is described in Prof Gigerenzer’s great new general book Risk Savvy: How to Make Good Decisions (Penguin, 2014) http://www.penguin.co.uk/nf/Book/BookDisplay/0,,9781846144745,00.html as well as in his earlier academic books such as Simple Heuristics That Make Us Smart (Oxford University Press, 1999).

See if a good little model can do as well as a good (or bad) big ‘un!

 

Further Future Reading

Draper NR, Smith H (1966) Applied regression analysis (and later editions). Wiley: New York.

John and Betty’s Journey into Statistics Packages*

In past days of our lives, those who wanted to learn a stats package would attend courses, and bail up/bake cakes for statisticians, but would mainly raise the drawbridge, lock the computer lab door and settle down with the VT100 terminal or Apple II or IBM PC and a copy of the brown or update blue SPSS Manual, or whatever.

Nowadays, folks tend to look things up on the web, something of a mixed blessing, and so maybe software consultants will now say LIUOTFW (‘Look It Up On The Flipping Web’) rather than the late, great RYFM (‘Read Your Flipping Manual’).

And yes, there are some great websites, and great online documentation supplied by the software vendors, but there are also some great books, available in electronic and print form. A list of three of the many wonderful texts available for each package (IBM SPSS, SAS, Stata, R and Minitab) can be downloaded from the Downloadables section on this site.

IBM SPSS (in particular), R (ever growing), and to a slightly lesser extent SAS, seem to have the best range of primers and introductory texts.
IMHO though, Stata could do with a new colourful, fun primer (not necessarily a Dummies Guide, although there’s Roberto Pedace’s Econometrics for Dummies (Wiley, New York, 2013) which features Stata), perhaps one by Andy Field, who has already done superb books on SPSS, R and SAS.

While up on the soapbox, I reckon Minitab could do with a new primer for Psychologists / Social Scientists, much like that early ripsnorter by Ray Watson, Pip Pattison and Sue Finch, Beginning Statistics for Psychology (Prentice Hall, Sydney, 1993).

Anyway, in memories of days gone by, brew a pot of coffee or tea, unplug email, turn off the phone and the mobile/cell, and settle in for an initial night’s journey, on a set or two of real and interesting data, with a good stats package book, or two!

*(The title of this post riffs off the improbably boring and stereotyped 1950s early readers still used in Victorian primary (grade) schools in the 1960s http://nla.gov.au/nla.aus-vn4738114 (think Dick and Jane, or Alice and Jerry), as well as the far more entertaining and recent John and Betty’s Journey into Complex Numbers by Matt Bower http://www.slideshare.net/aus_autarch/john-and-betty)

Hovercrafts and Aunties: Learning Statistics as a Foreign Language

To many who are not members of our Craft, and even some that are, Statistics is something of a Foreign Language, difficult to grasp without a good understanding of its grammar, or at least a whole swag of useful rules.

Stats is also difficult to teach: note our students’ look of bored angst when we try to explain p values.

So could we teach Stats like a foreign language?

For starters, why don’t we teach statistical ‘tourists’/‘travellers’/‘consumers’ some useful ‘phrases’ they can actually use, like how to read Excel files into a stats package, how to do a box plot, check for odd values, do some basic recodes etc.

Such things rarely appear in texts. Instead, we tumble about teaching the statistical equivalent of ‘the pen of my aunt is on the table’ or ‘my hovercraft is full of eels’ (Monty Python), or ‘a wolverine is eating my leg’ (Tim Cahill).

For example, as well as assuming that the data are all clean and ready to go, why do stats books persist in showing how to read in a list of 10 or so numbers, rather than reading in an actual file?

Just as human languages may or may not directly have universal concepts, the same may apply for stats packages. The objects of R, for example, are very succinct in conception, but very difficult to explain.

Such apparent lack of universality may be why English borrows words like ‘gourmand’ (to cite from my own book chapter), as English doesn’t otherwise have a word for a person who eats for pleasure. Similarly, courgette/zucchini sounds better than baby marrow (and have you ever seen how big they can actually grow?).

Yet it’s a two way street, with English providing words to other languages, such as ‘weekend’.

According to the old Sapir-Whorf hypothesis, language precedes or at least shapes thought (but see John McWhorter’s recent 2014 book The Language Hoax), so if there’s no word for something, it’s supposedly hard to think about it.

In Stats package terms, instructors would have to somehow explain that it is very easy to extract and store, say, correlation values in R, for further processing, putting smiley faces beside large ones etc. But in SPSS and SAS we would normally have to use OMS/ODS, and think in terms of capturing information that would otherwise be displayed on a line printer. This is a difficult concept to explain to anyone under 45 or so!
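The ‘store it, then process it’ idea can be sketched in a few lines. The snippet below is in Python with NumPy rather than R, purely as an illustration; the variables, the simulated data and the 0.5 cut-off for a ‘large’ correlation are all assumptions.

```python
import numpy as np

# Simulated data (an assumption): weight tracks height, the third
# variable is unrelated noise.
rng = np.random.default_rng(1)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height + rng.normal(0, 5, n)
noise = rng.normal(0, 1, n)

names = ["height", "weight", "noise"]
data = np.column_stack([height, weight, noise])

# The correlations are stored as an ordinary array, ready for reuse.
corr = np.corrcoef(data, rowvar=False)

# Further processing: a smiley face beside each large correlation.
flags = {}
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = float(corr[i, j])
        smiley = ":-)" if abs(r) >= 0.5 else ""
        flags[(names[i], names[j])] = smiley
        print(f"{names[i]:>6} vs {names[j]:<6} r = {r:+.2f} {smiley}")
```

The point is simply that the correlations live on as an object after they are computed, rather than being captured from printed output.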

Although there are many great books on learning stats packages (something for a later post), and I myself can ‘speak’ SPSS almost like a native after 33 years, I only know a few words of other human languages (and, funnily enough, only a few ‘words’ of R).

If you’ll excuse me, my aunt and her pen are now going for a ride on a hovercraft.
(I hope there’s no eels!)

REFS

Sapir-Whorf Hypothesis

https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Sapir%E2%80%93Whorf_hypothesis.html

Counter to the Sapir-Whorf Hypothesis

http://global.oup.com/academic/product/the-language-hoax-9780199361588;jsessioniid=455D218B50BA25BC37B119092C3F7CDE?cc=au&lang=en&

Hovercraft, Gourmands and Stats Packages

McKenzie D (2013) Chapter 14: ‘Statistics and the Computer’ in
http://www.elsevierhealth.com.au/epidemiology-and-public-health/vital-statistics-paperbound/9780729541497/

 

A Probability Book your Gran & Grandad could read: David Hand’s “The Improbability Principle”

Most people have heard of, or have actually experienced, ‘strange coincidences’, of the ‘losing wedding ring on honeymoon in coastal village and then years later, when fishing, finding the ring in the belly of a trout’ variety. Sometimes, the story is helped along a little over the years, such as the 1911 demise of Green, Berry and Hill who’d murdered Sir Edmund Berry Godfrey on *Greenberry* Hill, as used in the opening sequence of the 1999 Magnolia movie featuring the late great Philip Seymour Hoffman. The murder, however, actually took place in the 17th century, and on *Primrose* Hill, which was later renamed Greenberry Hill.

Still, odd things do happen, leading many to wonder ‘wow and what’s the probability of that!’. Strange events can however occur without the need for ghostly Theremin music to suddenly play in the background, in that they’re actually merely examples of coincidence, helped along by human foibles.

Coincidences and foibles are entertainingly and educationally examined in Professor David Hand’s excellent new 2014 book ‘The Improbability Principle: why coincidences, miracles and rare events happen every day’.

http://improbability-principle.com/.

Prof Hand is an Emeritus Professor of Mathematics at Imperial College London who, like fellow British statistician Brian ‘Chance Rules OK’ Everitt, has been writing instructive as well as readable texts and general books for nigh on forty years.

The book is not scarily mathematical at all, and illustrates using cards, dice, marbles in urns etc, although it might have been fun in the book, or at least the book’s website, to have some actual exercises that more active readers could have undertaken, using dice, cards or electronic versions thereof, such as the free Java version of Simon and Bruce’s classic Resampling Stats software, known as Stats 101 http://www.statistics101.net/ (commercial Excel version available at http://www.resample.com/excel/)
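As a flavour of the kind of exercise meant here, the following Python sketch rolls electronic dice to estimate the chance of at least one double six in 24 throws of a pair, and compares the estimate with the exact answer. The exercise is an assumption chosen for illustration (it is the classic Chevalier de Méré problem, not taken from the book), and the function name, number of trials and seed are arbitrary.

```python
import random

def estimate_double_six(n_throws=24, n_trials=20_000, seed=2014):
    """Share of trials with at least one double six in n_throws throws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        for _ in range(n_throws):
            first, second = rng.randint(1, 6), rng.randint(1, 6)
            if first == 6 and second == 6:
                hits += 1
                break   # one double six is enough for this trial
    return hits / n_trials

# Exact answer: one minus the chance of missing double six 24 times running.
exact = 1 - (35 / 36) ** 24
simulated = estimate_double_six()

print(f"exact     {exact:.4f}")
print(f"simulated {simulated:.4f}")
```

The exact probability is a touch under a half, which is why the bet was such a famous puzzle; the simulated figure should land close to it.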

 

All in all though, The Improbability Principle is not only highly readable, entertaining and inexpensive, it is an absolute snorter of a book, for a wide audience, including Uncles, Aunties, Grandmamas and Grandpapas, and is thoroughly recommended!

 

Watch the Skies: Manga Regression!

Although it’s probably the technique most employed by statisticians, regression, or at least multiple linear or multiple logistic regression, is often the concept that is most feared or misunderstood by students and newbie researchers. If someone you know is in the latter categories, and they would like a fun and straightforward introduction to regression that literally uses pictures (cartoons), let them know that The Manga Guide to Regression book has now been published, around May 2016, by the friendly folks at No Starch Press http://www.nostarch.com/regression

and is available on http://www.oreilly.com http://www.amazon.com http://www.bookdepository.com http://www.dymocks.com.au etc.

Authored by Shin Takahashi (Manga Guide to Statistics, 2008), the new book uses Manga http://en.wikipedia.org/wiki/Manga (think Osamu Tezuka’s Jungle Emperor, made into the 1965 Japanese anime TV series of the same name and the 1966 US overdub ‘Kimba the White Lion’ http://en.wikipedia.org/wiki/Jungle_Emperor, as well as ‘Atomu’ / ‘Astro Boy’).

Okay, Kimba himself is not actually included in the Manga Regression book, but there’s both linear and logistic regression, demonstrated using Microsoft Excel (just a slight gritting of teeth; go for Excel 2010/2013/2016 or higher), but at least it might allow getting ‘down and dirty with data’.

 

Happy Numbers

Why there are so very few statisticians as heroes (or even dashing villains) in novels is a pop culture mystery even bigger than the true identity of reggae magicians Johnny and the Attractions, or the actual final resting place of Butch and Sundance.

I have heard of, but don’t have, the 2008 novel Dancing with Dr Kildare, which features British medical statistician Nina, as well as the Finnish composer Sibelius, and the Tango, by Jane Yardley PhD, in real life a co-ordinator of medical trials for a small pharma.

http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2009.00341.x/abstract.

http://www.transworld-publishers.co.uk/catalog/book.htm?command=Search&db=twmain.txt&eqisbndata=0552773107

I’m now performing statistical consulting at two major hospitals so I’m about to re-read that wonderful book by major scriptwriter / drama writer Jim Keeble ‘The Happy Numbers of Julius Miles’, originally published in 2012 by independent outfit Alma

http://www.almabooks.com/the-happy-numbers-of-julius-miles-p-387-book.html

but there seems to be an April 2014 printing for the US.

It’s a great book about a big fellow, Julius Miles, a professional statistician with Barts Health NHS Trust, Royal London Hospital, Whitechapel, East London, England. Julius loves stats – nose-counting ones such as the fact that it takes him 2 minutes to polish his shoes (with 30 seconds airing between polish, application and buff), as well as meaty methods such as multilevel Poisson regression for length of hospital stay.

Julius is about 1.93 metres (6 foot 4 inches) and wears size 13 (UK) shoes, a solid fellow (although not reminiscent of the solid Ignatius Reilly in John Kennedy Toole’s classic posthumous 1980 novel ‘A Confederacy of Dunces’).

There’s something about the name Julius: Julius Sumner Miller, the US physicist and educator whose ‘Why is it so?’ ran on Australian TV for over 20 years from the 1960s, and the frothy US drink Orange Julius, named after Julius Freed, around since 1926 and taking off in ’29 (the official drink of the 1964 New York World’s Fair).

I can thoroughly recommend this colourful & warm book about Julius Miles, medical statistician.

When Boogie becomes Woogie, when Dog becomes Wolf

An exciting (and not just for statisticians!) area of application in statistics/analytics/data science relates to change/anomaly/outlier detection, the general notion of outliers (e.g. ‘unlikely’ values) having been covered in a previous post, looking at, amongst other things, very long pregnancies.

But tonight’s fr’instance comes from Fleming’s wonderful James Bond Jamaican adventure novel, Dr No (also a jazzy 1962 movie), which talks of London Radio Security shutting down radio connections with secret agents if a change in their message-transmitting style is detected. Such a change might indicate that their radio had fallen into enemy hands.

To use a somewhat less exotic example, imagine someone, probably not James Bond, tenpin bowling and keeping track of their scores, this scenario coming from HJ Harrington et al’s excellent Statistical Analysis Simplified: the Easy-to-Understand Guide to SPC and Data Analysis (McGraw-Hill, 1998).

On the 10th week, the score suddenly drops more than three standard deviations (scatter or variation around the mean or average) below the mean.

Enemy agents? Forgotten bowling shoes? Too many milk shakes?

Once again, an anomaly or change, something often examined in industry (Statistical Process Control (SPC) and related areas) to determine the point at which, in the words of Tom Robbins’ great novel Even Cowgirls Get The Blues, ‘the boogie stopped and the woogie began’.

Sudden changes in operations & processes can happen, and so a usual everyday assembly line (‘dog’) can in milliseconds become the unusual, and possibly even dangerous (‘wolf’), at which point hopefully an alarm goes off and corrective action is taken.
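The bowling alarm can be sketched as a bare-bones three-sigma check in Python. The scores are made up for illustration (they are not from Harrington’s book), and the function name and the choice to set the limits from the first nine weeks only are assumptions.

```python
import statistics

# Hypothetical scores for weeks 1 to 9, used to set the control limits.
baseline_scores = [152, 148, 155, 150, 147, 153, 149, 151, 154]

centre = statistics.mean(baseline_scores)
spread = statistics.stdev(baseline_scores)   # sample standard deviation

def out_of_control(score, centre, spread, k=3.0):
    """True if the score lies more than k standard deviations from the mean."""
    return abs(score - centre) > k * spread

week_10_score = 120   # hypothetical: the night the woogie began
alarm = out_of_control(week_10_score, centre, spread)

print(f"centre {centre:.1f}, limits +/- {3 * spread:.1f}")
print("week 10 alarm:", alarm)
```

Real control charts add further run rules and usually estimate the spread differently, but the basic ‘three standard deviations from the mean’ alarm is the one described above.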

The basics of SPC were developed many years ago (and taken to Japan after WW2, a story in itself). Anomaly detection is a fast-growing area. For further experimentation / reading, a recent method based upon calculating the closeness of points to their neighbours is described in John Foreman’s marvellous DataSmart: using Data Science to Transform Information into Insight (Wiley, 2014).
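The closeness-to-neighbours idea can be illustrated with a simplified Python sketch that scores each point by the distance to its k-th nearest neighbour, so that isolated points score high. It is not claimed to be the exact procedure in the book; the simulated points, the planted outlier, the function name and k = 3 are all assumptions.

```python
import numpy as np

def kth_neighbour_distance(points, k):
    """Distance from each point to its k-th nearest neighbour."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero distance from a point
    # to itself, so column k is the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]

# Simulated data (an assumption): a cluster of 30 ordinary points plus
# one planted oddity far away at (8, 8).
rng = np.random.default_rng(3)
cluster = rng.normal(0.0, 1.0, size=(30, 2))
points = np.vstack([cluster, [[8.0, 8.0]]])

scores = kth_neighbour_distance(points, k=3)
most_isolated = int(np.argmax(scores))

print("most isolated point:", most_isolated, points[most_isolated])
```

Points that sit a long way from even their third-closest neighbour are the candidates for a closer look.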

We might want to determine if a credit card has been stolen on the basis of different spending patterns/places, or, to return to the opening example, detect an unauthorised intruder to a computer network (e.g. Clifford Stoll’s trailblazing The Cuckoo’s Egg: Tracking a Spy Through the Maze of Computer Espionage).

Finally, we might just want to figure out exactly when it was that our bowling performance dropped off!
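One simple way to pin down the ‘when’ is to try every possible split of the series into a before and an after, and keep the split where the two halves sit most tightly around their own averages. The Python sketch below does just that for a single drop in level; the scores and function names are made up for illustration, and it assumes exactly one change.

```python
def sum_sq(values):
    """Sum of squared deviations of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def first_week_of_new_level(scores):
    """Week (counting from 1) at which the new level is estimated to start."""
    best_split = min(range(1, len(scores)),
                     key=lambda t: sum_sq(scores[:t]) + sum_sq(scores[t:]))
    return best_split + 1

# Hypothetical weekly scores: steady around 150, then a drop.
scores = [152, 148, 155, 150, 147, 153, 149, 151, 154,
          126, 122, 128, 124, 125]

change_week = first_week_of_new_level(scores)
print("performance dropped off in week", change_week)
```

With several changes, or a gradual slide rather than a sudden drop, something more elaborate would be needed.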