Snappy Stepwise Regression

Stepwise regression, the technique that attempts to select a smaller subset of variables from a larger set by choosing the ‘best’ or dropping the ‘worst’ at each step, was developed back in the late 1950s by applied statisticians in the petroleum and automotive industries. With an ancestry like this, it’s no wonder that it is often regarded as the statistical version of the early ’60s Chev Corvair: at best only ‘driveable’ by expert, careful users, or, in the immortal words of the title of Ralph Nader’s 1965 book, ‘Unsafe at Any Speed’.

Well maybe. But if used with cross-validation and good sense, it’s an old-tech standby to later-model ‘lasso’ and ‘elastic net’ techniques. However, there’s an easy way to do a bit of a softshoe shuffle with the old stepwise routine: see how well (preferably on a fresh set of data) forward entry with just one, or maybe two, or at most three variables does, compared with larger models. (SAS and SPSS allow the number of steps to be specified.)
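For those who like to see the steps written down, here is a minimal sketch of the idea in Python with NumPy (not SAS or SPSS). Everything in it is an assumption for illustration: the data are simulated, the function names are made up, and entry is decided by training error alone rather than by the F-to-enter tests the packages use.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared prediction error of a fitted model on (X, y)."""
    return float(np.mean((y - (coef[0] + X @ coef[1:])) ** 2))

def forward_select(X, y, max_steps):
    """Forward entry: at each step add the column that most lowers training error."""
    chosen = []
    for _ in range(min(max_steps, X.shape[1])):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = min(
            remaining,
            key=lambda j: mse(fit_ols(X[:, chosen + [j]], y), X[:, chosen + [j]], y),
        )
        chosen.append(best)
    return chosen

# Made-up data: 8 candidate predictors, of which only columns 2 and 5 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 2] + 0.5 * X[:, 5] + rng.normal(size=200)
X_train, y_train = X[:100], y[:100]   # data used to choose and fit the model
X_fresh, y_fresh = X[100:], y[100:]   # the 'fresh set of data'

results = {}
for steps in (1, 2, 3, 8):            # 8 steps = the full 'big' model
    cols = forward_select(X_train, y_train, steps)
    coef = fit_ols(X_train[:, cols], y_train)
    results[steps] = mse(coef, X_fresh[:, cols], y_fresh)
    print(steps, cols, round(results[steps], 3))
```

The point of the printout is the comparison: if the one, two or three step models predict the fresh data about as well as the full model, the little model has earned its keep.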

Or if you’d like to do some slightly fancier steps in two-tone spats, try a best subset regression (available in SAS, in SPSS through Automatic Linear Modeling, and in Minitab, R etc.) of all one-variable, two-variable and three-variable combinations.
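A best subset search over one, two and three variable models can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not how any of the packages above do it: the data are simulated, the helper names are hypothetical, and the search is brute force over every combination, which is only sensible for a handful of candidate variables.

```python
from itertools import combinations

import numpy as np

def design(X, cols):
    """Design matrix: a column of ones plus the chosen columns of X."""
    return np.column_stack([np.ones(X.shape[0]), X[:, list(cols)]])

def fit(X, y, cols):
    """Least-squares coefficients for y on the chosen columns."""
    coef, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return coef

def sse(X, y, cols, coef):
    """Sum of squared errors of a fitted model on (X, y)."""
    return float(np.sum((y - design(X, cols) @ coef) ** 2))

def best_subset(X, y, size):
    """Try every combination of `size` columns; keep the lowest training SSE."""
    return min(
        combinations(range(X.shape[1]), size),
        key=lambda cols: sse(X, y, cols, fit(X, y, cols)),
    )

# Made-up data: 8 candidate predictors, of which only columns 2 and 5 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 2] + 0.5 * X[:, 5] + rng.normal(size=200)
X_train, y_train = X[:100], y[:100]
X_fresh, y_fresh = X[100:], y[100:]

best = {size: best_subset(X_train, y_train, size) for size in (1, 2, 3)}
for size, cols in best.items():
    coef = fit(X_train, y_train, cols)
    fresh_mse = sse(X_fresh, y_fresh, cols, coef) / len(y_fresh)
    print(size, cols, round(fresh_mse, 3))
```

As with forward entry, the subsets are chosen on one half of the data and judged on the other, so a bigger subset has to win on data it has not seen.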

The inspiration for this is partly from Gerd Gigerenzer’s ‘take the best’ heuristic: taking the best cue or clue often beats more complex techniques, including multiple regression. ‘Take the best’ is described in Prof Gigerenzer’s great new general book Risk Savvy: How to Make Good Decisions (Penguin, 2014) http://www.penguin.co.uk/nf/Book/BookDisplay/0,,9781846144745,00.html as well as in his earlier academic books such as Simple Heuristics That Make Us Smart (Oxford University Press, 1999).

See if a good little model can do as well as a good (or bad) big ‘un!

 

Further Future Reading

Draper NR, Smith H (1966) Applied regression analysis (and later editions). Wiley: New York.

John and Betty’s Journey into Statistics Packages*

In past days of our lives, those who wanted to learn a stats package would attend courses, and bail up/bake cakes for statisticians, but would mainly raise the drawbridge, lock the computer lab door and settle down with the VT100 terminal or Apple II or IBM PC and a copy of the brown or update blue SPSS Manual, or whatever.

Nowadays, folks tend to look things up on the web, something of a mixed blessing, and so maybe software consultants will now say LIUOTFW (‘Look It Up On The Flipping Web’) rather than the late, great RYFM (‘Read Your Flipping Manual’).

And yes, there are some great websites, and great online documentation supplied by the software vendors, but there are also some great books, available in electronic and print form. A list of three of the many wonderful texts available for each package (IBM SPSS, SAS, Stata, R and Minitab) can be downloaded from the Downloadables section on this site.

IBM SPSS (in particular), R (ever growing), and to a slightly lesser extent SAS, seem to have the best range of primers and introductory texts.
IMHO though, Stata could do with a new colourful, fun primer (not necessarily a Dummies Guide, although there’s Roberto Pedace’s Econometrics for Dummies (Wiley, New York, 2013) which features Stata), perhaps one by Andy Field, who has already done superb books on SPSS, R and SAS.

While up on the soapbox, I reckon Minitab could do with a new primer for Psychologists / Social Scientists, much like that early ripsnorter by Ray Watson, Pip Pattison and Sue Finch, Beginning Statistics for Psychology (Prentice Hall, Sydney, 1993).

Anyway, in memory of days gone by, brew a pot of coffee or tea, unplug email, turn off the phone and the mobile/cell, and settle in for an initial night’s journey, on a set or two of real and interesting data, with a good stats package book, or two!

*(The title of this post riffs off the improbably boring and stereotyped 1950s early readers still used in Victorian primary (grade) schools in the 1960s
http://nla.gov.au/nla.aus-vn4738114 (think Dick and Jane, or Alice and Jerry), as well as the far more entertaining and recent John and Betty’s Journey into Complex Numbers by Matt Bower http://www.slideshare.net/aus_autarch/john-and-betty )

Hovercrafts and Aunties: Learning Statistics as a Foreign Language

To many who are not members of our Craft, and even some that are, Statistics is something of a Foreign Language, difficult to grasp without a good understanding of its grammar, or at least a whole swag of useful rules.

Stats is also difficult to teach: note our students’ look of bored angst when we try to explain p values.

So could we teach Stats like a foreign language?

For starters, why don’t we teach statistical ‘tourists’/‘travellers’/‘consumers’ some useful ‘phrases’ they can actually use, like how to read Excel files into a stats package, how to do a box plot, check for odd values, do some basic recodes etc.?

Such things rarely appear in texts. Instead, we tumble about teaching the statistical equivalent of ‘the pen of my aunt is on the table’ or ‘my hovercraft is full of eels’ (Monty Python), or ‘a wolverine is eating my leg’ (Tim Cahill).

For example, as well as assuming that the data are all clean and ready to go, why do stats books persist in showing how to read in a list of 10 or so numbers, rather than reading in an actual file?
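By way of a phrasebook entry, here is a rough sketch in plain Python of what such ‘tourist phrases’ might look like: read a file, get the numbers behind a box plot, check for odd values, and do a basic recode. The data, the column names and the file name in the comment are all made up for illustration, and a CSV export stands in for the Excel file so that the sketch needs no extra libraries. The 1.5 x IQR fences are the usual box plot rule of thumb, assumed here rather than taken from any particular package.

```python
import csv
import io
import statistics

# Stand-in for a real file; with one on disk you would write something like
# open("survey.csv", newline="") instead (hypothetical file name).
raw = io.StringIO(
    "id,age,sex\n"
    "1,34,F\n"
    "2,29,M\n"
    "3,41,f\n"
    "4,38,M\n"
    "5,999,F\n"
    "6,45,M\n"
    "7,31,F\n"
    "8,36,M\n"
)
rows = list(csv.DictReader(raw))
ages = [float(r["age"]) for r in rows]

# The numbers behind a box plot: quartiles and the 1.5 x IQR fences.
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Check for odd values: anything outside the fences gets a second look.
odd_ids = [r["id"] for r, a in zip(rows, ages) if a < low_fence or a > high_fence]
print("odd ages for ids:", odd_ids)

# Basic recodes: tidy the sex codes, blank out the odd ages, add an age group.
for r, a in zip(rows, ages):
    r["sex"] = r["sex"].strip().upper()
    r["age"] = None if r["id"] in odd_ids else a
    if r["age"] is None:
        r["age_group"] = None
    else:
        r["age_group"] = "under 35" if r["age"] < 35 else "35 plus"
```

Here the age of 999 is treated as a stray missing-value code and set to missing, which is a judgment call that would need checking against the real codebook.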

Just as human languages may or may not directly have universal concepts, the same may apply to stats packages. The objects of R, for example, are very succinct in conception, but very difficult to explain.

Such apparent lack of universality may be why English borrows words like ‘gourmand’ (to cite from my own book chapter), as English doesn’t otherwise have a word for a person who eats for pleasure. Similarly, courgette/zucchini sounds better than baby marrow (and have you ever seen how big they can actually grow?).

Yet it’s a two-way street, with English providing words to other languages, such as ‘weekend’.

According to the old Sapir-Whorf hypothesis, language precedes or at least shapes thought (but see John McWhorter’s recent 2014 book The Language Hoax), so if there’s no word for something, it’s supposedly hard to think about it.

In Stats package terms, instructors would have to somehow explain that it is very easy to extract and store, say, correlation values in R, for further processing, putting smiley faces beside large ones etc. But in SPSS and SAS we would normally have to use OMS/ODS, and think in terms of capturing information that would otherwise be displayed on a line printer. This is a difficult concept to explain to anyone under 45 or so!
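The ‘extract, store and decorate’ idea can be sketched in plain Python (standing in for R here, which is an assumption of this example, as are the made-up variables and the 0.5 cut-off for a ‘large’ correlation). The correlations are kept in an ordinary dictionary, so they can be processed further rather than just read off a printout.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 90],
    "luck": [3, 9, 1, 7, 4],
}

# Store every pairwise correlation as a value we can reuse later.
corr = {
    (a, b): pearson(data[a], data[b])
    for a, b in combinations(data, 2)
}

# Further processing: put a smiley face beside the large ones.
flagged = [pair for pair, r in corr.items() if abs(r) >= 0.5]
for (a, b), r in corr.items():
    smiley = " :-)" if (a, b) in flagged else ""
    print(f"{a} vs {b}: {r:.2f}{smiley}")
```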

Although there are many great books on learning stats packages (something for a later post), and I myself can ‘speak’ SPSS almost like a native after 33 years, I only know a few words of other human languages (and, funnily enough, only a few ‘words’ of R).

If you’ll excuse me, my aunt and her pen are now going for a ride on a hovercraft.
(I hope there are no eels!)

REFS

Sapir-Whorf Hypothesis

https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Sapir%E2%80%93Whorf_hypothesis.html

Counter to the Sapir-Whorf Hypothesis

http://global.oup.com/academic/product/the-language-hoax-9780199361588;jsessioniid=455D218B50BA25BC37B119092C3F7CDE?cc=au&lang=en&

Hovercraft, Gourmands and Stats Packages

McKenzie D (2013) Chapter 14: ‘Statistics and the Computer’ in
http://www.elsevierhealth.com.au/epidemiology-and-public-health/vital-statistics-paperbound/9780729541497/

 

Expected Unexpected: Power bands, performance curves, rogue waves and black swans

Many years ago, I had a ride on a Kawasaki 500 Mach III 2-stroke motorcycle, which, along with its even more horrendous 750cc version, was known as the ‘widow-maker’. It was incredibly fast in a straight line, but if it went around corners at all, the rider had long since fallen (or jumped) off!

It also had a very narrow ‘power band’ http://en.wikipedia.org/wiki/Power_band, in that it would have no real power until about 7,000 revs per minute, and then all of a sudden it would whoop and holler like the proverbial bat out of hell, the front wheel would lift, the rider’s jaw drop, and well, you get the idea! In statistical terms, this was a nonlinear relationship between twisting the throttle and the available power.

A somewhat less dramatic example of a nonlinear effect is the Yerkes-Dodson ‘law’ http://en.wikipedia.org/wiki/Yerkes%E2%80%93Dodson_law, in which optimum task performance is associated with medium levels of arousal (too much arousal = the ‘heebie-jeebies’, too little = ‘half asleep’).

Various simple & esoteric methods exist for finding global relationships (which follow a standard pattern such as a U shape, or upside down U) or local ones (where different parts of the data might be better explained by different models, rather than ‘one size fits all’). A popular ‘local’ method is known as a ‘spline’, after the flexible metal ruler that draftspeople once fitted curves with. The ‘GT’ version, Multivariate Adaptive Regression Splines http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines, is available in R (itself a little reminiscent of a Mach III cycle at times!), in the big-iron ‘1960s 390 cubic inch Ford Galaxie V8’ of the SAS statistical package, and in the original, sleek ‘Ferrari V12’ Salford Systems version.

Other nonlinear methods are available http://en.wikipedia.org/wiki/Loess_curve, but the thing to remember is that life doesn’t always fit within the lines, or follow some human’s idea of a ‘natural law’.
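To make the ‘global’ case concrete, here is a small sketch in Python with NumPy that fits a straight line and an upside down U (a quadratic) to the same data and compares them. The arousal and performance numbers are simulated purely for illustration, not real Yerkes-Dodson data, and the variable names are made up. A ‘local’ method such as a spline or loess would need a dedicated library, so it is left out of this sketch.

```python
import numpy as np

# Simulated inverted-U data: performance peaks at a medium level of arousal.
rng = np.random.default_rng(1)
arousal = np.linspace(0, 10, 60)
performance = 25 - (arousal - 5) ** 2 + rng.normal(scale=2, size=arousal.size)

def r_squared(degree):
    """Share of variance explained by a polynomial of the given degree."""
    coefs = np.polyfit(arousal, performance, degree)
    fitted = np.polyval(coefs, arousal)
    ss_res = np.sum((performance - fitted) ** 2)
    ss_tot = np.sum((performance - performance.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

r2_line = r_squared(1)    # straight line: 'more arousal is always better (or worse)'
r2_curve = r_squared(2)   # quadratic: allows an upside down U

# Where the fitted curve peaks: the vertex of a*x^2 + b*x + c is at -b / (2a).
a, b, c = np.polyfit(arousal, performance, 2)
peak = float(-b / (2 * a))

print(f"R-squared, straight line: {r2_line:.2f}")
print(f"R-squared, quadratic:     {r2_curve:.2f}")
print(f"estimated best arousal:   {peak:.1f}")
```

A straight line fitted to data like these explains next to nothing, which is a reminder that a weak linear correlation does not mean there is no relationship.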

For example, freak or rogue waves, which can literally break supertankers in half, were observed for centuries by mariners but have only recently been accepted by shore-bound scientists. Similarly, consider the black swans (actually native to Australia) of the stock market http://www.fooledbyrandomness.com/

When analysing data, fitting models (or riding motorcycles), please be careful!