cluster analysis – Synergistic Statistical Consulting, Analysis, Arboronics

Back about 1965, while (whilst?) I was attending Primary (grade) School in a little northern Victorian town, lunchtimes would see a behaviour pattern in which, say, two boys would link arms and march around the boys’ playground chanting “join on the boys who want to play Chasey” or some other sport or game. Soon, unless something bizarre or boring was called out for, there’d be three, five, eight, ten and so on, until there were enough people for that particular game.

Now, let’s imagine a slightly more surreal version: a big group of boys, or girls, or indeed a mixture thereof, wanders around the playground. But what sport are they going to play? It’s unlikely (especially for our purposes here) they’ll all want to play the same thing, and even if they did, there may be too many, or some people may well be better suited to some other sport or game. If we were brave enough and if lunchtimes extended into infinity, we could try every possible way of splitting our big group into two or more smaller groups, as Edwards and Cavalli-Sforza showed in the general field of cluster analysis back in ’65.
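Just how long would that infinite lunchtime need to be? Here is a rough sketch in Python that counts the possibilities. The function names are made up for illustration; the counts themselves are standard combinatorics (2^(n-1) - 1 ways to split n people into two non-empty groups, and the Bell numbers for splits into any number of groups).

```python
from math import comb


def two_group_splits(n):
    """Ways to split n people into exactly two non-empty groups."""
    return 2 ** (n - 1) - 1


def all_groupings(n):
    """Bell number B(n): ways to partition n people into any number of
    non-empty groups (this count includes leaving everyone in one group)."""
    bell = [1]  # B(0) = 1
    for m in range(n):
        # Recurrence: B(m + 1) = sum over k of C(m, k) * B(k)
        bell.append(sum(comb(m, k) * bell[k] for k in range(m + 1)))
    return bell[n]


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, two_group_splits(n), all_groupings(n) - 1)
```

Subtracting one from the Bell number drops the "everyone stays together" case, leaving the splits into two or more groups. Even a modest playground gang makes checking every split impractical, which is why step-by-step methods are attractive.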

Alternatively, we could find the single person most different from the rest of the main group M in terms of the game they want to play. That person (let’s call them Brian, after Brian Everitt, who wrote a great book on cluster analysis in several editions, and Brian Setzer, as in Stray Cats and the Brian Setzer Orchestra, this being the unofficial Brian Setzer Summer) splits off from the group and forms a splinter group S. For each of the remaining members, we check whether, on average, they’re more dissimilar to the other members of M than to the members of S (i.e. Brian et al). If so, then they too join S.
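For the curious, one splinter step can be sketched in a few lines of Python. This is a minimal illustration and not the code behind any particular package. It assumes the input is a symmetric matrix of pairwise dissimilarities, and it assumes that members move to S one at a time, largest difference first, until nobody else is closer to S than to M. The names `splinter_split` and `avg_to` are hypothetical.

```python
def splinter_split(dissim):
    """Split objects 0..n-1 into (main, splinter) groups.

    dissim is assumed to be a symmetric n x n matrix (list of lists)
    of pairwise dissimilarities with zeros on the diagonal.
    """
    n = len(dissim)
    main = set(range(n))
    if n < 2:
        return sorted(main), []

    def avg_to(i, group):
        # Average dissimilarity from object i to the other members of group
        others = [dissim[i][j] for j in group if j != i]
        return sum(others) / len(others) if others else 0.0

    # "Brian": the object with the largest average dissimilarity to the rest
    brian = max(sorted(main), key=lambda i: avg_to(i, main))
    main.remove(brian)
    splinter = {brian}

    # Keep moving the object that is most clearly closer to S than to M
    while len(main) > 1:
        gains = {i: avg_to(i, main) - avg_to(i, splinter) for i in sorted(main)}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # everyone left is at least as close to M as to S
        main.remove(best)
        splinter.add(best)

    return sorted(main), sorted(splinter)


if __name__ == "__main__":
    # Toy example: five people placed on a line by how much they like Chasey
    points = [0, 1, 2, 10, 11]
    d = [[abs(a - b) for b in points] for a in points]
    print(splinter_split(d))
```

In the toy example the person at 11 plays Brian, the person at 10 follows, and the other three stay put. A full divisive method would then repeat the step inside each resulting group.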

Known as divisive clustering (the earlier “join on” syndrome is sorta kinda like agglomerative clustering: start off with individuals and group ’em together), this particular method was published in ’64 by Macnaughton-Smith and colleagues. Described in Kaufman and Rousseeuw’s book as DIANA, with shades of a great steak sauce and an old song by Paul Anka, DIANA is available in R as part of the cluster package.

Now if you’ll excuse me, there’s a group looking for members to march down the road for a cold drink, on this hot Australian summer night! Once we get to the bar, the most dissimilar, perhaps a nondrinker, will split off, clusters will be formed, and through the night there may be re-splitting and re-joining of groups or cliques, as some go off to the pinball parlour, others to the pizza joint, while some return to the bar, all in the manner of another great clustering algorithm, Ball and Hall’s ISODATA.

Bottled Sources:

Ball GH, Hall DJ (1965). A novel method of data analysis and pattern classification. Technical Report, Stanford Research Institute, California.

Edwards AWF, Cavalli-Sforza LL (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt BS (1974 and more recent editions). Cluster analysis. Heinemann: London.

Kaufman L, Rousseeuw PJ (1990). Finding groups in data: an introduction to cluster analysis. Wiley: New York.

Macnaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964). Dissimilarity analysis: A new technique of hierarchical sub-division. Nature, 202, 1034-1035.

Apart from on the iconic and mysterious Australian Nullarbor http://www.nullarbornet.com.au/ (literally ‘no’, or actually very few, trees) and in the baddest and saddest of outer suburban concrete jungles, trees are a major part of our daily life. Trees produce shade and oxygen, and provide inspiration for dreaming scientists watching apples fall. ‘Tree of life’. ‘Family trees’. Tree branches have also long provided a metaphor for branches of knowledge and classification systems.

In his excellent new book ‘The book of trees: visualizing branches of knowledge’,

https://www.papress.com/html/book.details.page.tpl?isbn=9781616892180

Manuel Lima, Designer and Fellow of the Royal Society of Arts (http://www.visualcomplexity.com/vc/), examines the role of trees in history, religion, philosophy, biology, computer science, data visualization, information graphics and data analysis / statistics.

Covering various types of tree graphs, including radial trees, sunbursts, Ben Shneiderman’s Treemaps and Voronoi Treemaps, Lima’s treatise provides inspirational historical and contemporary pictures, including timely applications such as looking at the words that appear with ‘I’ and ‘you’ in Google texts.

Statistical applications covered are mainly confined to Icicle plots or trees, used in applications such as cluster analysis, or the grouping of observations into related classes, ‘taxa’ or clusters such as disease categories.

Not published in the Northern hemisphere until April 2014, the book is available now in Melbourne, Australia for around $50, e.g. www.ngvshop.ngv.vic.gov.au (the online search does not work) or http://metropolisbookshop.com.au

Accompanied by sources of information on how to construct such diagrams (e.g. http://www.flowingdata.com ) Lima’s new book will serve as an accessible and constant source of information on visualizing trees for new, as well as existing, ‘arborists’.

‘Velut arbor aevo’

‘May the Tree Thrive’!

Treemap software

http://www.cs.umd.edu/hcil/treemap/

http://www.treemap.com/

http://www.tableausoftware.com/

Tag: cluster analysis

Divisive Rules OK: Clustering #1

Visual Trees: The Book of Trees by Manuel Lima