How do I cluster my data?


I want to cluster my data but I don't know where to begin.


There are various functions for clustering and classifying data in the MATLAB Statistics Toolbox. Clustering and classification are different but related techniques, and we'll discuss them together here. If you've never done either before, however, it can be hard to know where to start. The purpose of this recipe is to help you orient yourself and get going. Bear in mind that clustering and classification form a big field, and there's no way a single web page can tell you everything there is to know. With that caveat in mind, let's get started.

These approaches are part of a family of techniques collectively known as "machine learning." This basically means the use of computer algorithms to extract underlying features of the data. For example, an e-mail spam filter qualifies as a machine learning approach since it searches for keywords to separate real messages from junk. There are two broad classes of machine learning techniques: supervised and unsupervised.

What is supervised learning?
The term "supervised" means that the algorithm "knows" into which groups the data fall. Let's clarify what that means with an example. Say that you have a factory that manufactures widgets and this factory produces supposedly identical widgets from 3 different production lines. You want to know if all the lines really are behaving in the same way. So you measure sizes, weights, and other properties of the widgets coming off the lines. Given that you also know which line each widget came from, you are in a position to explore whether there are significant differences between the widgets from the different lines. The simplest analysis you could do is to plot each parameter, such as weight, on a box plot, with one box per production line. Then you could look for obvious differences between the weights of widgets from the different production lines. However, you might have measured a dozen different parameters, and perhaps the difference between the production lines is only evident if you look at the variables in combination. Perhaps, say, widgets from Line 2 tend to be slightly lighter, slightly less wide, and a slightly different colour from the widgets in Lines 1 and 3. A supervised learning technique can help you identify complicated relationships of this sort in a way that simple independent comparisons might not.

[Figure: widget weight box plot from three production lines]
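As a sketch of how such a plot might be made in MATLAB (using simulated data; the variable names `prodLine` and `weight` are made up for illustration):

```matlab
% Per-line box plot of widget weights. In real use, "weight" would be your
% n-by-1 vector of measurements and "prodLine" the n-by-1 vector of
% production-line labels; here both are simulated.
rng(1)                                        % reproducible fake data
prodLine = randi(3, 300, 1);                  % which line each widget came from
weight   = 100 + randn(300,1) - (prodLine==2); % Line 2 widgets slightly lighter

boxplot(weight, prodLine)                     % one box per production line
xlabel('Production line')
ylabel('Widget weight (g)')
```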

What is unsupervised learning?
The term "unsupervised" means that the algorithm does not "know" into which groups the data fall. In our previous example, we had 3 production lines that could be used to segregate the widgets, and so we could actively search for differences based upon these lines. Making three box plots, with one box per line, constituted a simple "supervised" analysis. This is because we can organise our thinking based upon where we know the data (the widgets) came from. If we didn't have the benefit of the three production lines then no supervised analysis would be possible. Thus, an example of an unsupervised analysis scenario would be if our widget factory had only one production line and wanted to check whether the widgets coming off the line are uniform, or if they fall into two or more groups. Anything is possible: perhaps the production line screws up 10% of the time and so 10% of the widgets are slightly larger than they should be. Clearly we now have no way of dividing up the data before plotting, so all we can do is make a single histogram of the widget weights. Such a histogram would constitute a simple unsupervised data analysis.

[Figure: widget weight histogram from one production line]
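A sketch of the single-line scenario (again with simulated data; the 10% failure rate below is just the hypothetical from the text):

```matlab
% One production line, no group labels: all we can do is plot a histogram.
% Here 10% of the simulated widgets are heavier than they should be.
rng(1)
n        = 500;
weight   = 100 + randn(n,1);
oversize = rand(n,1) < 0.1;               % the line "screws up" 10% of the time
weight(oversize) = weight(oversize) + 4;  % those widgets come out heavier

histogram(weight, 30)                     % use hist(weight,30) on pre-R2014b releases
xlabel('Widget weight (g)')
ylabel('Count')
```

With a 10% sub-population shifted this far, the histogram shows a visible second bump; a subtler defect might not be obvious by eye, which is exactly when the formal tools discussed below become useful.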

But why are you showing me how to make simple plots?
Perhaps what appears to be a tutorial on box plots and histograms isn't what you expected when you came to a page looking for how to cluster and classify data? Think again. The key thing about these machine learning approaches is that they don't perform magic. They're just a formalised way of exploring structure in data. If your data have no interesting structure then they're not going to reveal something that isn't there. It can be worse than that: if misused, these algorithms may suggest the presence of structure that doesn't exist. You have to know what your data look like in order to evaluate the results of any machine learning approach. Thus, the first thing you should do is plot and explore the data. If your data are already divided in some way (e.g. you have a "multiple production line" scenario) then make sure you bring out the identity of the groups in the plots. For example, colour-code the points in a scatter plot. If you don't have this advantage, then you should focus on plots that showcase how the data from your single group are distributed.
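One way to colour-code a scatter plot by known group is `gscatter` from the Statistics Toolbox. A minimal sketch, again with simulated measurements (the variables `widgWidth`, `weight`, and `prodLine` are made up for illustration):

```matlab
% Scatter plot of two measured parameters, coloured by production line.
rng(1)
prodLine  = randi(3, 300, 1);
weight    = 100 + randn(300,1)     -     (prodLine==2);
widgWidth =  20 + 0.5*randn(300,1) - 0.5*(prodLine==2);

gscatter(widgWidth, weight, prodLine)   % one colour/symbol per group
xlabel('Widget width (mm)')
ylabel('Widget weight (g)')
```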

I've made my plots, now what?
Now that you have some idea what you're looking for, you're ready to start applying more complicated tools than histograms, scatter plots, and box plots. What tools? If you have an unsupervised scenario then you'll want to employ tools that help you search your data for structure and that objectively sub-divide your data. If you have a supervised scenario then you want to employ tools that actively search for differences between your groups and quantify how well separated they are. There's nothing stopping you from using unsupervised techniques on data that are amenable to a supervised algorithm, but you may not get the most out of your data by doing this.

Unsupervised techniques include a range of approaches, from data visualisation to objective clustering of data. Dendrograms are a type of plot that reveals the hierarchical relationship between data points: they show which data points are similar to each other and which are different. Dendrograms are a useful visualisation tool but are not a formal statistical test; that is, there is no formal way of judging whether the branches in a dendrogram are in some sense significant. Clustering approaches such as k-means are also unsupervised, because the data need not come from pre-defined groups; the job of the clustering algorithm is to divide the data into classes. The results of this depend on the algorithm you've chosen (there are many) and on what your data look like. Again, the fact that you've partitioned your data into clusters doesn't mean that those clusters are real. Additional work needs to be done to verify this. An example is provided on the k-means page of this site.
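Both ideas can be sketched in a few lines of MATLAB (Statistics Toolbox). The data below are simulated with two hidden groups, so we know the "right" answer in advance; with real data you would not:

```matlab
% Unsupervised sketch: a dendrogram and k-means on the same simulated data.
rng(1)
X = [randn(50,2); randn(50,2) + 3];   % 100 widgets, two hidden groups

% Hierarchical clustering: visualise structure (no significance test implied)
Z = linkage(X, 'ward');               % agglomerative clustering, Ward linkage
figure; dendrogram(Z)

% k-means: ask for two clusters, then inspect the assignments
idx = kmeans(X, 2);                   % idx is the cluster label for each row
figure; gscatter(X(:,1), X(:,2), idx)
```

Note that `kmeans` will happily return two clusters even when the data contain only one group; that is the sense in which the clusters need independent verification.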

Supervised techniques essentially boil down to classification algorithms such as the simple nearest neighbour approach, linear classifiers, or more elaborate techniques such as support vector machines. Unlike unsupervised approaches, with supervised techniques you can extract an objective measure of classification success, since you know both the true group (where your data actually came from) and the assigned group (where your algorithm thinks the data came from). The results of a classifier can be summarised as a confusion matrix (see confusionmat.m) or even simply as the proportion of correct classifications. A very important aspect of data classification is the use of cross-validation. Cross-validation is employed to avoid over-fitting. Going into this in detail is beyond the scope of this recipe. Briefly, a simple cross-validation approach would be to randomly select half of your data and train the classifier on this. Then use the model produced by the classifier to partition the remaining data into classes. The key point is that the classifier is "trained" and "tested" on independent subsets of the data. If you do not do this, you will get inflated estimates of classification success and your model will not generalise well to new data sets.
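The train-on-half, test-on-the-other-half approach described above can be sketched as follows. This assumes a reasonably recent Statistics Toolbox (`fitcknn` appeared around R2014a; older releases used functions such as `knnclassify` instead), and the data are simulated:

```matlab
% Supervised sketch: nearest-neighbour classification with a hold-out split.
rng(1)
X = [randn(100,2); randn(100,2) + 2];   % measurements from two groups
y = [ones(100,1); 2*ones(100,1)];       % true group labels

cv  = cvpartition(y, 'HoldOut', 0.5);   % random 50/50 train/test split
mdl = fitcknn(X(training(cv),:), y(training(cv)));  % train on one half only

predicted = predict(mdl, X(test(cv),:));            % classify the held-out half

C = confusionmat(y(test(cv)), predicted)            % rows: true, columns: assigned
accuracy = mean(predicted == y(test(cv)))           % proportion correct
```

Because the model never sees the test half during training, `accuracy` is an honest estimate; computing it on the training data instead would give the inflated figure the text warns about.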


That was a whirlwind tour of clustering and classification. The key take-home point is that these approaches are much like any other statistical test: if you can't see an effect in your data by eye then chances are there isn't anything going on there. To move forward, I recommend you read through the MATLAB help on these topics and try some of the examples. You can also read the page on this site about k-means clustering.

