SecretSource – Synergistic Statistical Consulting, Analysis, Arboronics

Coin Chops: Can the Law of Averages be Replaced by the Law of Probability?

Alas, to the ‘average’ consumer of statistics, unlike us statisticians and data analysts, Probability is a sort of Comic, I mean Cosmic, Force, i.e. ‘The Laws of Probability’. David Hand OBE FBA has entertainingly looked at misunderstandings of this Comic Force, and at Coincidences, in ‘The Improbability Principle: Why Coincidences, Miracles and Rare Events Happen All the Time’ (2014).

But sitting here in the State Library of Victoria, I’m reading Frank ‘Power Without Glory’ Hardy’s novel ‘The Four-Legged Lottery’ (1958). On page 179 of the Gold Star paperback edition there’s a bit of blarney about the ‘law of probability’ replacing the ‘law of averages’, in which one of the two main characters, a professional gambler by the name of Jim Roberts, talks about the Anglo-Australian game of Two-Up, which involves throws of pairs of coins and is legal in Australian casinos and, traditionally, on the streets on ANZAC Day (25th April).

‘in an honestly conducted two-up school, an equal number of heads and tails will be thrown over a long period; both head and tail bettor must lose [as the ‘house’ must take a percentage]. [To try and overcome the Law of Averages, giggle!] a tail bettor can back the tail on every spin – only for two throws, doubling [the] stake on the second throw if the spinner [bets heads] the first time. In this way [he or she] defeats the law of average <by winning> every time a spinner throws [both tails or one head and one tail and only loses when spinner throws both heads]’. Time for a simulation!

Watch this space. Same Stat Time! Same Stat Channel!
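In the meantime, here is a minimal sketch of what such a simulation might look like. It rests on several assumptions that are mine, not Hardy’s: a spin is two fair coins rethrown until they match, the tail bettor stakes one unit and doubles it only after losing the first throw, a round stops after two throws, and the house takes a hypothetical `commission` fraction of any winnings. All function names are made up for illustration.

```python
import random


def spin(rng):
    """Throw two fair coins until they match; return 'H' or 'T'."""
    while True:
        first = rng.random() < 0.5
        second = rng.random() < 0.5
        if first == second:          # a head and a tail ('odds') is rethrown
            return "H" if first else "T"


def play_round(rng, stake=1.0, commission=0.0):
    """Back tails for at most two throws, doubling the stake after a loss.

    Returns the net profit for the round. `commission` is an assumed
    house cut taken from winnings only.
    """
    profit = 0.0
    bet = stake
    for _ in range(2):
        if spin(rng) == "T":
            return profit + bet * (1.0 - commission)
        profit -= bet                # lost this throw
        bet *= 2                     # double up for the second throw
    return profit                    # lost both throws


def simulate(n_rounds=200_000, seed=1958, commission=0.0):
    """Return (share of winning rounds, average profit per round)."""
    rng = random.Random(seed)
    results = [play_round(rng, commission=commission) for _ in range(n_rounds)]
    win_rate = sum(r > 0 for r in results) / n_rounds
    mean_profit = sum(results) / n_rounds
    return win_rate, mean_profit


if __name__ == "__main__":
    print("no house cut:  ", simulate())
    print("10% house cut: ", simulate(commission=0.10))
```

Under these assumptions the bettor wins about three rounds in four, which is what makes the scheme feel like a winner, but each win is one unit and each loss is three, so the expected profit with no house cut is zero, and any cut pushes it below zero.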

Divisive Rules OK: Clustering #1

Back about 1965, while (whilst?) attending Primary (grade) School in a little northern Victorian town, lunchtimes would see a behaviour pattern in which, say, two boys would link arms and march around the boys’ playground chanting “join on the boys who want to play Chasey” or some other sport or game, and soon, unless something bizarre or boring was called out for, there’d be three, five, eight, ten etc until there were enough people for that particular game.

Now, let’s imagine a slightly more surreal version: a big group of boys, or girls, or indeed a mixture thereof, wanders around the playground. But what sport are they going to play? It’s unlikely (especially for our purposes here) they’ll all want to play the same thing, and even if they did, there may be too many, or some people may well be better suited to some other sport or game. If we were brave enough and if lunchtimes extended into infinity, we could try every possible way of splitting our big group into two or more smaller groups as, in the general field of cluster analysis, Edwards and Cavalli-Sforza showed back in ’65.

Alternatively, we could pick out the single person most different from the rest of the main group M in terms of the game they want to play. That person (let’s call them Brian after Brian Everitt, who wrote a great book on cluster analysis in several editions, and Brian Setzer as in Stray Cats and the Brian Setzer Orchestra, this being the unofficial Brian Setzer Summer) splits off from the group and forms a splinter group S. For each of the remaining members, we check whether, on average, they’re more dissimilar to the other members of M than to the members of S (i.e. Brian et al). If so, then they too join S.
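For the keen, one such split can be sketched in a few lines. This is only an illustration under my own assumptions: each person is an index into a dissimilarity matrix `d`, people move to the splinter group one at a time (the one with the largest positive difference first), and the function names and toy ‘game preference’ numbers are invented.

```python
def average_dissimilarity(i, members, d):
    """Mean dissimilarity between person i and everyone else in members."""
    others = [j for j in members if j != i]
    if not others:
        return 0.0
    return sum(d[i][j] for j in others) / len(others)


def split_once(items, d):
    """Split items into (main group M, splinter group S)."""
    main = list(items)
    # 'Brian': the person most dissimilar, on average, to everyone else.
    seed = max(main, key=lambda i: average_dissimilarity(i, main, d))
    main.remove(seed)
    splinter = [seed]
    while len(main) > 1:
        best, best_gain = None, 0.0
        for i in main:
            # Positive gain: i is further from the rest of M than from S.
            gain = (average_dissimilarity(i, main, d)
                    - average_dissimilarity(i, splinter, d))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:             # nobody else wants to leave
            break
        main.remove(best)
        splinter.append(best)
    return main, splinter


if __name__ == "__main__":
    # Hypothetical one-number 'game preference' scores for five children.
    scores = [1.0, 1.2, 0.9, 5.0, 5.3]
    d = [[abs(a - b) for b in scores] for a in scores]
    print(split_once(range(len(scores)), d))
```

Repeating the same split on each resulting group until everyone stands alone gives the full divisive hierarchy.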

Known as divisive clustering (the earlier “join on” syndrome is sorta kinda like agglomerative clustering: start off with individuals and group ’em together), this particular method was published in ’64 by Macnaughton-Smith and colleagues. Described in Kaufman and Rousseeuw’s book as DIANA, with shades of a great steak sauce and an old song by Paul Anka, DIANA is available in R as part of the cluster package.

Now if you’ll excuse me, there’s a group looking for members to march down the road for a cold drink, on this hot Australian summer night! Once we get to the bar, the most dissimilar, perhaps a nondrinker, will split off, clusters will be formed, and through the night there may be re-splitting and re-joining of groups or cliques, as some go off to the pinball parlour, others to the pizza joint, while some return to the bar, all in the manner of another great clustering algorithm, Ball and Hall’s ISODATA.

Bottled Sources:

Ball GH, Hall DJ (1965). A novel method of data analysis and pattern classification. Technical Report, Stanford Research Institute, California.

Edwards AWF, Cavalli-Sforza LL (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt BS (1974 and more recent editions). Cluster analysis. Heinemann: London.

Kaufman L, Rousseeuw PJ (1990). Finding groups in data: an introduction to cluster analysis. Wiley: New York.

Macnaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964). Dissimilarity analysis: A new technique of hierarchical sub-division. Nature, 202, 1034-1035.

When I grow up, I’m gonna be a Statistician!

How many of us said that, I wonder? Rather than children dressing up as sheriffs or doctors or possibly even scientists (?), how many dressed up like Statisticians? Did anyone even know much about Statisticians then? Mathematicians yes, they were sort of nerdy (although that word wasn’t around when I was a kid) but could do important things, like calculate odds of winning at Las Vegas or horse racing, and the chance of thermonuclear war.

But when I was young, inspired by Get Smart and The Man from UNCLE and James Bond, I mainly wanted to be a secret agent! I played with the idea of becoming a private detective, sorry investigator, for a while until I found out that in real life, as opposed to TvLand and BookWorld, they mainly seemed to be involved in divorces. So, when I was in my very early teens, I toyed with the idea of joining the FBI. As an Australian citizen, this would have been rather difficult; I would have had to become a US citizen, as well as either a lawyer or an accountant first. So I put that idea in the ‘too hard basket’. (Imagine, a lawyer or an accountant!).

Well, I suppose it shows evidence of an inquiring mind. Further steps, trots, canters and gallops along the road to Statistics are a story for another time. But there were a couple of ‘residuals’ from that childhood long ago. Asking questions, even if no one else was. The desire to do the right thing, and wear the right colour hat (even if in truth the Jack Palance baddie wearing black was far cooler/groovier/jazzier in the Shane movie, although not the book, than the light-coloured-cloth-wearing goodie, Alan Ladd).

And a 1963 book which I got for Christmas a year or two later, called The How and Why Wonder Book of Robots and Electronic Brains. I still have that book and I cited it in my PhD Thesis, although back then I was more interested in the robots, especially the black and red tin ones that could be wound up with a key!

But it was a 1979 Texas Instruments TI-55 (simple) programmable LED calculator I got for my 21st, that came with quite a thick manual, showing how one could do fun things like predicting future sales from advertising expenditure, that gave much more excitement, practicality and crunch to the Psych 101 Stats that I was undertaking.

http://www.datamath.org/Sci/MAJESTIC/TI-55.htm

And then, in the early summer of 1981, when I first used SPSS (submitted to be run at 2300 hours) on a DEC System 20-60, I was truly hooked.

True, James Bond had his Beretta and Walther PPK and Aston Martin and Bentley and Sea Island shirts and Shaken Not Stirred, but at least in the early days, he never used a programmable calculator, let alone a Computer!

A Probability Book your Gran & Grandad could read: David Hand’s “The Improbability Principle”

Most people have heard of, or have actually experienced, ‘strange coincidences’, of the ‘losing wedding ring on honeymoon in coastal village and then years later, when fishing, finding the ring in the belly of a trout’ variety. Sometimes, the story is helped along a little over the years, such as the 1911 demise of Green, Berry and Hill, who’d murdered Sir Edmund Berry Godfrey on *Greenberry* Hill, as used in the opening sequence of the 1999 movie Magnolia, featuring the late great Philip Seymour Hoffman. The murder, however, actually took place in the 17th century, and on *Primrose* Hill, which was later renamed Greenberry Hill.

Still, odd things do happen, leading many to wonder ‘wow and what’s the probability of that!’. Strange events can however occur without the need for ghostly Theremin music to suddenly play in the background, in that they’re actually merely examples of coincidence, helped along by human foibles.

Coincidences and foibles are entertainingly and educationally examined in Professor David Hand’s excellent new 2014 book ‘The Improbability Principle: why coincidences, miracles and rare events happen every day’.

http://improbability-principle.com/

Prof Hand is an Emeritus Professor of Mathematics at Imperial College London who, like fellow British Statistician Brian ‘Chance Rules OK’ Everitt, has been writing instructive as well as readable texts and general books for nigh on forty years.

The book is not scarily mathematical at all, and illustrates its points using cards, dice, marbles in urns etc. It might have been fun, though, for the book, or at least the book’s website, to have some actual exercises that more active readers could have undertaken, using dice, cards or electronic versions thereof, such as the free Java version of Simon and Bruce’s classic Resampling Stats software, known as Stats 101 http://www.statistics101.net/ (commercial Excel version available at http://www.resample.com/excel/).

All in all though, The Improbability Principle is not only highly readable, entertaining and inexpensive, it is an absolute snorter of a book, for a wide audience, including Uncles, Aunties, Grandmamas and Grandpapas, and is thoroughly recommended!

When Boogie becomes Woogie, when Dog becomes Wolf

An exciting (and not just for statisticians!) area of application in statistics/analytics/data science relates to change/anomaly/outlier detection, the general notion of outliers (e.g. ‘unlikely’ values) having been covered in a previous post, looking at, amongst other things, very long pregnancies.

But tonight’s fr’instance comes from Fleming’s wonderful James Bond Jamaican adventure novel, Dr No (also a jazzy 1962 movie), which talks of London Radio Security shutting down radio connections with secret agents if a change in their message-transmitting style is detected. Such a change may indicate that their radio has fallen into enemy hands.

To use a somewhat less exotic example, imagine someone, probably not James Bond, tenpin bowling and keeping track of their scores, this scenario coming from HJ Harrington et al’s excellent Statistical Analysis Simplified: the Easy-to-Understand Guide to SPC and Data Analysis (McGraw-Hill, 1998).

In the 10th week, the score suddenly drops to more than three standard deviations (scatter or variation around the mean or average) below the mean.
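A rough sketch of that check follows. The scores are invented (they are not Harrington’s data), the mean and standard deviation are taken from the first nine weeks only, and the plain sample standard deviation is used as a simple stand-in for the way a proper control chart would estimate its limits.

```python
from statistics import mean, stdev


def flag_outliers(baseline, new_scores, n_sigma=3.0):
    """Return the (week, score) pairs falling outside mean +/- n_sigma * sd.

    baseline   -- scores from the settled, 'in control' weeks
    new_scores -- (week, score) pairs to be checked against the baseline
    """
    centre = mean(baseline)
    spread = stdev(baseline)
    lower = centre - n_sigma * spread
    upper = centre + n_sigma * spread
    return [(week, score) for week, score in new_scores
            if score < lower or score > upper]


if __name__ == "__main__":
    # Hypothetical tenpin scores for weeks 1 to 9.
    first_nine_weeks = [152, 148, 155, 150, 147, 153, 149, 151, 154]
    # Week 10 is the bad night; week 11 is back to normal.
    print(flag_outliers(first_nine_weeks, [(10, 120), (11, 150)]))
```

Keeping the suspect week out of the baseline matters: a single very low score would drag the mean down and inflate the standard deviation, making itself look less unusual.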

Enemy agents? Forgotten bowling shoes? Too many milk shakes?

Once again, an anomaly or change, something often examined in industry (Statistical Process Control (SPC) and related areas) to determine the point at which, in the words of Tom Robbins’s great novel Even Cowgirls Get the Blues, ‘the boogie stopped and the woogie began’.

Sudden changes in operations & processes can happen, and so a usual everyday assembly line (‘dog’) can in milliseconds become the unusual, and possibly even dangerous (‘wolf’), at which point hopefully an alarm goes off and corrective action is taken.

The basics of SPC were developed many years ago (and taken to Japan after WW2, a story in itself). Anomaly detection is a fast-growing area. For further experimentation / reading, a recent method based upon calculating the closeness of points to their neighbours is described in John Foreman’s marvellous DataSmart: using Data Science to Transform Information into Insight (Wiley, 2014).
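To give the flavour of the neighbour-closeness idea, here is a small sketch that scores each point by its average distance to its k nearest neighbours. This is a generic illustration and not necessarily the method as set out in Foreman’s book; the function name, the choice of k and the toy points are all my own assumptions.

```python
import math


def knn_scores(points, k=2):
    """Average distance from each point to its k nearest neighbours.

    A large score means the point sits a long way from its closest
    companions, making it a candidate anomaly.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q)
                           for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


if __name__ == "__main__":
    # Hypothetical two-variable observations; the last one is far from the rest.
    points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
    scores = knn_scores(points)
    print(scores)
    print("most isolated point:", points[scores.index(max(scores))])
```

How big a score has to be before it counts as an anomaly is left open here; in practice that threshold is a judgment call that depends on the data.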

We might want to determine if a credit card has been stolen on the basis of different spending patterns/places, or, to return to the opening example, detect an unauthorised intruder to a computer network (e.g. Clifford Stoll’s trailblazing The Cuckoo’s Egg: Tracking a Spy Through the Maze of Computer Espionage).

Finally, we might just want to figure out exactly when it was that our bowling performance dropped off!

Telstar, Cortina & the Median Quartile Test: where were you in ’62?

It was 1962, the setting of the iconic 1973 movie American Graffiti, from which comes the subtitle of this post. The Beatles had released Love Me Do, their first single. That year also heard and saw Telstar, the eerie but joyful Claviolined Joe Meek instrumental by the Tornados, celebrating the privately owned transatlantic communications and television satellite after which it was named. The British Ford Cortina, named after an Italian ski resort, saw out the humpty-dumpty rounded Prefects and 50’s Zephyrs, while in the US, the first of 50 beautiful, mysterious and largely lost Chrysler Ghia Turbine cars was driven in Detroit.

Meanwhile, the world of statistics was not to be outdone. Rainald Bauer’s Median Quartile test, an extension of Brown and Mood’s early 50’s Median Test, was published, in German, in 1962. The latter test, still available in statistics packages such as IBM SPSS, SAS and Stata, simply compares groups on counts below and above the overall median, providing, in the case of two groups, a two-by-two table.

The Median Quartile Test (MQT), as the name suggests, compares each group on the four quartiles. But the MQT is largely unknown, mainly discussed in books and papers published in, or translated from, German.
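As a rough illustration of the idea, the sketch below counts how many scores from each group fall into each quarter of the pooled data, then computes an ordinary Pearson chi-square statistic on the resulting table. Bauer’s own test statistic is not reproduced here, so the chi-square step is an assumption on my part, as are the function names and the made-up data.

```python
from statistics import quantiles


def quartile_table(groups):
    """Count each group's values in the four quarters of the pooled data.

    groups -- dict mapping a group name to a list of numbers.
    Returns a dict mapping each group name to [n_Q1, n_Q2, n_Q3, n_Q4].
    """
    pooled = [x for values in groups.values() for x in values]
    q1, q2, q3 = quantiles(pooled, n=4)      # pooled quartile cut points
    table = {}
    for name, values in groups.items():
        counts = [0, 0, 0, 0]
        for x in values:
            # Number of cut points exceeded gives the quarter index 0..3.
            counts[(x > q1) + (x > q2) + (x > q3)] += 1
        table[name] = counts
    return table


def chi_square(table):
    """Pearson chi-square statistic for a groups-by-quarters count table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:                 # skip empty quarters
                stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical scores: group A sits at the extremes, group B in the middle.
    data = {"A": [1, 2, 3, 4, 17, 18, 19, 20],
            "B": [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}
    table = quartile_table(data)
    print(table)
    print(chi_square(table))
```

With two groups and four quarters the table has (2 − 1) × (4 − 1) = 3 degrees of freedom. Cutting at the pooled median alone, instead of at all three quartiles, would give the two-column table of the Brown and Mood Median Test.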

The MQT conveys similar information to John Tukey’s boxplot, shows both analysts and their customers and colleagues where the data tend to fall, and provides a test of statistical significance to boot. Does one group show a preponderance of scores in the lower and upper quartiles, for example, suggesting, in the field of pharma fr’instance, that one group either gets much better or much worse?

A 1967 NASA English translation of the original 1962 Bauer paper is available in the Downloadables section of this site.

Recent Application in Journal of Cell Biology

809.full.pdf

Further / Future reading

Bauer RK (1962) Der “Median-Quartile Test”… Metrika, 5, 1-16.

Von Eye A et al (1996) The median quartiles test revisited. Studia Psychologica, 38, 79-84.

Minitab 17: think Mini Cooper, not Minnie Mouse

As it has been 3 or 4 years since the previous version, the new release of the Minitab 17 statistical package is surely cause for rejoicing, merriment, and an extra biscuit with a strong cup of tea.

At one of the centres where I work, the data analysts sit at the same lunch table, but are known by their packages: the Stata people, the SAS person, the R person, the SPSS person and so on. No Minitab person as yet, but maybe there should be. Not only for its easy-to-use graphics, mentioned in a previous post, but for its all-round interface, programmability (Minitab syntax looks a little like that great Kemeny-Kurtz language from 1964 Dartmouth College, BASIC, but more powerful), and a few new features (Poisson regression for relative risks & counted data, although alas no negative binomial regression for trickier counted data), and even better graphics.

Bubble plots, outlier tests, and the Box-Cox transformation (another great collaboration from 1964) are all there. Minitab was also one of the first packages to include Exploratory Data Analysis (e.g. box plots and smoothed regression), for when the data are about as well-behaved as the next-door neighbours strung out on espresso coffee mixed with red cordial.

Not as much cachet for when the R and SAS programmers come a-swaggering in, but still worth recommending for those who may not be getting as much as they should be out of SPSS, particularly for graphics, yet find the other packages a little too high to climb.

http://www.minitab.com/en-us/