recipes : Statistics : How do I perform k-Means clustering?

Problem

I have some data and want to try clustering it with the k-means algorithm. How do I go about doing that?

Solution

The MATLAB Statistics Toolbox has some nice functions for performing and interpreting k-means clustering analyses. However, before delving into these, we should establish what k-means clustering is and what it can and can't tell you.

k-means is a clustering algorithm. Clustering algorithms are unsupervised techniques for sub-dividing a larger dataset into smaller groups. The term "unsupervised" indicates that the data do not originate from clearly defined groups that can be used to label them a priori. For example, say that you're exploring a group of 300 hens in a barn and you've measured a bunch of parameters from each bird: height, weight, egg size, laying regularity, egg color, etc. Based on all these parameters, you want to figure out if the hens fall into a small number of distinct groups or if they constitute just a single homogeneous population. Note that this question has two features: 1) At the outset you don't know how many groups there are and therefore 2) you have no way of assigning a given hen to a given group. All hens are treated equally. For problems such as this, you can use k-means clustering to objectively assign each hen to a group. Using further techniques (described below) you can objectively test whether the assignments you've made are reasonable.

As you might imagine, if there are "unsupervised" techniques then there must also be "supervised" techniques. Supervised techniques are often also called "classification" techniques and are applied to data sets where the identities of the groups are known beforehand. We will not discuss these approaches here, but you can read about them in the introductory page on clustering and classification.

Delving deeper into k-means
The MATLAB help has a brief overview of different unsupervised clustering techniques, a nice example of how to apply k-means, and an explanation of how it works. The help page for kmeans provides another example. I suggest you check these out in addition to reading this page. Here we will provide a further example and also emphasise some different aspects of the clustering approach, such as validation of the final clustering. So let's kick off with our simple example:

%Let's make some fake data with two groups
n=75; %sample size per group (150 points in total)
x=[randn(n,1)+2; randn(n,1)+4.75];
y=[randn(n,1)+2; randn(n,1)+4.75];

%the true group identity
groups=[ones(n,1); ones(n,1)+1];

%plot the data, coloured by true group
scatter(x,y,50,groups,'filled')

We've made two groups of data that overlap somewhat. Because this is a toy example, we know from which group each data point originates. Remember that in real-world uses you won't have this information, but we're going to use it here to demonstrate the effectiveness of k-means. Since we know the group identities, let's see if k-means can figure out from which group each data point originates. The k-means algorithm must be supplied with two things: the raw data (obviously!) and the number of groups into which the data should be partitioned. That last point is important: it's your job to tell the algorithm how many groups there are. This isn't something it can determine for itself.

Just to hammer home the caveat: normally you would use a supervised classification technique (e.g. discriminant analysis) when you already know the identities of the groups. This is because a supervised technique will use the group identities to search for the boundary that best separates the groups; an unsupervised technique, such as k-means, ignores the group identities and simply asks whether the data naturally partition in some way. With that in mind, let's see how k-means divides the data.

data=[x,y];
IDX=kmeans(data,2); %Run k-means, asking for two groups

The variable IDX tells you the cluster to which k-means has assigned each data point. One caveat: k-means numbers its clusters arbitrarily, so cluster 1 won't necessarily correspond to group 1; if the labels come out swapped, you can flip them before comparing. We can then count how often k-means gets the group right by doing sum(groups==IDX). Here that returns 147, so k-means has classified 147 of the 150 data points correctly. We can verify this by re-doing the scatter plot, this time with the colours derived from the k-means results:
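Here's a minimal sketch of that check and the re-plot, using the variables defined above (the label flip is only needed because k-means numbers its clusters arbitrarily):

%flip the cluster labels if they came out swapped relative to "groups"
if sum(groups==IDX) < n
    IDX = 3-IDX;
end
sum(groups==IDX) %number of correctly assigned points (147 here)

%re-plot, colouring each point by its k-means cluster
scatter(x,y,50,IDX,'filled')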

Good, right? Can you spot where it made the mistakes? In this case we knew there were two groups and, remember, k-means will divide up data into as many groups as we like. If we had a unimodal distribution (just one group) we could still run k-means and divide it into two. Watch:

%Let's make some fake data with just one group
n=150; %both halves below are drawn from the same distribution
x=[randn(n,1)+3; randn(n,1)+3];
y=[randn(n,1)+3; randn(n,1)+3];

%plot the data
subplot(1,2,1)
plot(x,y,'ok','MarkerFaceColor','k')

%now divide into two using k-means and plot the results
data2=[x,y];
IDX=kmeans(data2,2);

%plot the k-means results
subplot(1,2,2)
scatter(x,y,50,IDX,'filled')

You can see that k-means has divided the distribution into two even though there is only one group. Blindly running k-means isn't enough: you also have to validate what you've done. You have to demonstrate that dividing the data into k groups is a reasonable thing to do. MATLAB comes to the rescue by allowing you to calculate a so-called silhouette value for each observation. Silhouette values can be calculated on any data that have been categorised into groups by a clustering algorithm. The silhouette value tells you both how tightly a data point is associated with its assigned cluster and how dissimilar it is from the other clusters. Each data point has a silhouette value, and the mean silhouette value over all points summarises the overall quality of the clustering.
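For example, here's a quick sketch with the two-group data from earlier; note that calling silhouette with an output argument returns the values without drawing MATLAB's silhouette plot:

%silhouette values for the two-group clustering
IDX = kmeans(data,2);
S = silhouette(data,IDX); %one silhouette value per observation
mean(S) %summarises the overall quality of the clustering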

Validating cluster identity
So how do we use k-means and the silhouette value to decide if the clustering we have produced is reasonable? The approach I'll show you here works as follows: run k-means a bunch of times, each time with a different k, and calculate the mean silhouette value for each k. If your data really fall into, say, 2 groups then you should see a peak in the silhouette values when k=2. Let's run this for the two cases shown above and see if this works.

%Run k-means for a range of k
silA=nan(1,6); %mean silhouette values for the two-group data
silB=nan(1,6); %mean silhouette values for the one-group data
for k=2:6
    IDX=kmeans(data,k);     %the data with two groups
    S=silhouette(data,IDX); %with an output argument, no plot is drawn
    silA(k)=mean(S);

    IDX=kmeans(data2,k);    %the data with one group
    S=silhouette(data2,IDX);
    silB(k)=mean(S);
end

%Plot the results
clf %clear the figure window
hold on
plot(1:6, silA,'ok-','MarkerFaceColor','k') %2 groups
plot(1:6, silB,'or-','MarkerFaceColor','r') %1 group
set(gca,'XTick',1:6)
xlabel('k')
ylabel('mean silhouette value')
hold off

The black line shows the mean silhouette values for the data with two groups; the red line shows those for the data with one group. The black line has a peak at k=2, which indicates that the distribution may indeed truly be partitioned into two. The red line has no peak, which suggests that the underlying distribution is unimodal. As ever, the devil's in the details. You would probably want to extend such an analysis by repeating it multiple times on randomly chosen sub-samples of the data in order to produce error bars for each point, as sketched below. You'd also want some way of assessing how good the clustering is at the value of k where you see a peak.
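Here's a minimal sketch of that sub-sampling idea, applied to the two-group data; the 20 repeats and the 80% sub-sample fraction are arbitrary choices, and randsample is part of the Statistics Toolbox:

%repeat the clustering on random sub-samples and collect the spread
nReps=20;                     %number of random sub-samples (arbitrary)
nSub=round(0.8*size(data,1)); %use 80% of the points each time
silSub=nan(nReps,6);
for rep=1:nReps
    ind=randsample(size(data,1),nSub); %subset without replacement
    for k=2:6
        IDX=kmeans(data(ind,:),k);
        silSub(rep,k)=mean(silhouette(data(ind,:),IDX));
    end
end

%plot the mean and standard deviation across sub-samples
clf
errorbar(1:6, mean(silSub), std(silSub), 'ok-', 'MarkerFaceColor','k')
set(gca,'XTick',1:6)
xlabel('k')
ylabel('mean silhouette value')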

Discussion

What we've covered here is just one way of doing k-means. You'll notice that I just went with the default options. MATLAB allows you to run the k-means algorithm in various ways; for example, you have several choices for how the algorithm picks its starting points and which distance measure it uses. It's beyond the scope of this article to go into all of these in detail. The best advice is to play with them and use a little common sense. For example, if you have no good reason to think that the cityblock distance is a good description of your data then don't use it. You might also want to check out the introductory page on clustering and classification.
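For instance, 'Distance' and 'Replicates' are name/value pairs accepted by kmeans; something like the following re-runs the clustering five times from different starting points and keeps the best solution, using the cityblock distance (the particular values here are just for illustration):

%cityblock distance with 5 restarts; kmeans keeps the solution
%with the lowest total within-cluster distance
IDX = kmeans(data, 2, 'Distance','cityblock', 'Replicates',5);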

 
