Visualising a distribution

Problem

I have recorded n observations of a phenomenon. How do I visualise its distribution?

Solution

So you have a list of n numbers and you want to understand more about these data. What to do? The first thing not to do is to bring up a spreadsheet and look at the numbers in the cells. Why not do this? Because the approach is very inefficient and makes it likely you will miss important trends or features in your data. Instead what you should be doing is thinking about the best plot to make. A good plot or two will show you everything you need to know.

If you have observations about one variable (so you have one string of numbers) there are two camps into which your data may fall. In the first camp, each observation is independent of the others: knowing the value of observation 10 tells you nothing about the value of, say, observation 11. In such a data set, values aren't correlated with each other. In the second camp, there are (or might be) correlations present. Independent, uncorrelated data are easier to deal with because many common statistical tools for calculating p-values and error bars assume independence. Let's look into some examples. We'll explore why it's important to plot data, what sort of plots to make, and take a brief look at non-independent data.

Why plotting is critical
Let's say we run a dog training class and have 30 dogs signed up for this week's lesson. For some obscure reason we're interested in the quantity of excrement each dog can produce in a week. So we ask owners ahead of time to store the doggy waste bags and report at the end of the week how many grams their pet produced. The numbers come in and here they are:

1192,1201,1186,1203,1247,1181,1249,1242,1221,1189,1255,1191,1189,1190,1213,1187,1247,1193,1206, 1239,1250,1252,1216,1239,1252,1244,1185,1206,1182,1197

Those are a bunch of weights in grams. You can scan those numbers for quite a while and all you really get out of it is the realisation that each dog produces about 1200g of stuff in a week. Watch what happens if we make a histogram, however:

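dogEggs = [1192 1201 1186 1203 1247 1181 1249 1242 1221 1189 1255 1191 1189 1190 1213 1187 1247 1193 1206 1239 1250 1252 1216 1239 1252 1244 1185 1206 1182 1197]; %The 30 masses listed above, in grams
hist(dogEggs) %Plot a histogram of variable dogEggs
xlabel('mass of dog egg (g)')
ylabel('number of dogs')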

Now we see something quite obvious: the data don't have a single peak but instead there are two peaks. Some dogs are churning out about 1250g whereas others are producing about 1190g. Not so many dogs are manufacturing gifts in the middle of that range. Of course a difference of about 60g doesn't seem very large and you may want to run your experiment for a few more weeks to get more data, but the point nonetheless stands that you quickly saw a trend in the graph that wasn't visible by simply looking at the raw numbers. Thus, in a very real sense, the act of plotting the histogram is in itself a basic statistical test.
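The two groups can also be checked with a quick count. What follows is only a sketch: the 10g bin width and the 1220g dividing line are my own choices, picked by eye from the histogram, and are not part of the recipe itself.

```matlab
% The 30 weekly masses from the list above, in grams
dogEggs = [1192 1201 1186 1203 1247 1181 1249 1242 1221 1189 ...
           1255 1191 1189 1190 1213 1187 1247 1193 1206 1239 ...
           1250 1252 1216 1239 1252 1244 1185 1206 1182 1197];

% Bin centres every 10g, so the bin width is chosen explicitly
% instead of relying on hist's default of 10 equally spaced bins
centres = 1180:10:1260;
counts = hist(dogEggs, centres);  % with an output argument, hist returns counts and draws nothing
bar(centres, counts)
xlabel('mass of dog egg (g)')
ylabel('number of dogs')

% Split the dogs at an ASSUMED threshold of 1220g (picked by eye)
threshold = 1220;
nLow  = sum(dogEggs <  threshold);
nHigh = sum(dogEggs >= threshold);
fprintf('%d dogs below %dg, %d dogs at or above\n', nLow, threshold, nHigh);
```

With these data the split comes out as 18 dogs in the lower group and 12 in the upper one. Try a few different bin widths: if the two peaks survive the change, they are less likely to be an artefact of the binning.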

Are there other, better, plots you could use? In this particular case I reckon not. You have one set of data and you want to explore it in detail. Thus, a histogram is a good approach. You may, however, have multiple variables which you want to plot alongside each other. A standard approach for doing this is to use box plots or, something I'm fond of, plotting all the jittered data. If you have only 2 or 3 variables, you may be able to overlay them all on one histogram. But we're not going to go into that here.
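For the curious, here is roughly what a jittered plot looks like when applied to the single dogEggs variable. Treat it as a sketch: the jitter width of 0.1 and the axis label 'all dogs' are arbitrary choices of mine. With several variables you would repeat the same idea at x = 2, 3 and so on.

```matlab
% The 30 weekly masses from the list above, in grams
dogEggs = [1192 1201 1186 1203 1247 1181 1249 1242 1221 1189 ...
           1255 1191 1189 1190 1213 1187 1247 1193 1206 1239 ...
           1250 1252 1216 1239 1252 1244 1185 1206 1182 1197];

rng(0)  % fix the random seed so the jitter is the same on every run

% Spread the points horizontally around x = 1 so that similar values don't overlap.
% The horizontal position carries no information; it only separates the markers.
jitterWidth = 0.1;  % ASSUMED width, adjust to taste
xJitter = 1 + jitterWidth * (rand(size(dogEggs)) - 0.5);

plot(xJitter, dogEggs, 'ok')
xlim([0 2])
set(gca, 'XTick', 1, 'XTickLabel', {'all dogs'})
ylabel('mass of dog egg (g)')
```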

Dealing with non-independence
In our example above, each number provided a new piece of information that was independent of all the other numbers. So each of our thirty numbers contributes one thirtieth to the total knowledge we have. Because of independence we can happily use "standard" statistical tests, such as t-tests. Also, because of independence, the histogram shows everything we need to know about our data. That last point is important, as you will now see.

To see why independence matters, we will do a new experiment. Let's say we instead want to follow the output of a single dog over one month. The owner reports back the daily dog-egg mass for the month, so we again have 30 values. I won't waste your time with the list of numbers; let's go straight to the histogram:

There is nothing very unusual about this histogram. It looks slightly bimodal, but nothing particularly crazy. Also, there are only 30 values, so maybe the bimodality is just chance. However, might the histogram be hiding something? The data are gathered daily, so there's a possibility that there are correlations across days. How to explore that? Easy, we make a line plot or a scatter plot:

plot(dog,'-or','MarkerFaceColor','r')
xlabel('day')
ylabel('dog-egg mass')

figure %Open a new window so the line plot is not overwritten
hist(dog) %Histogram of the same daily values, for comparison with the line plot
xlabel('dog-egg mass')

Ah. The data are clearly not independent: we see regular peaks. These peaks are largely hidden by the histogram (they underlie the bimodality we suspected we saw), but when you plot mass as a function of day they become very apparent. So what's going on? It's too early to tell, but it looks like the peaks are occurring at 7-day intervals. So maybe the dog is over-eating on Friday and Saturday nights, which produces larger eggs on the following days. What are the wider implications of this? It depends on the question you're asking. Say the owner calls at some future time and says the dog laid an egg of 150g. Should he be worried? You now are aware that knowing the day on which the egg was laid is important in determining whether the mass is unusually large. Making the right plot helps you to answer his question.
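If you want a number to back up what your eye sees, one option is to compute the autocorrelation of the series at a range of lags. The sketch below is not part of the original recipe, and since the real daily values aren't listed here it uses a made-up series (a flat baseline with a bump every 7th day) purely to show the mechanics. The variable name dog matches the plotting code above; everything else is an assumption.

```matlab
% SYNTHETIC series, invented for this sketch (NOT the data plotted above):
% a baseline of 150g with an extra 40g on every 7th day
day = 1:30;
dog = 150 + 40 * (mod(day, 7) == 0);

% Sample autocorrelation at lags 1..10, computed by hand so no toolbox is needed
maxLag = 10;
d = dog - mean(dog);  % remove the mean before correlating
acf = zeros(1, maxLag);
for k = 1:maxLag
    % correlate the series with a copy of itself shifted by k days
    acf(k) = sum(d(1:end-k) .* d(1+k:end)) / sum(d.^2);
end

stem(1:maxLag, acf, 'filled')
xlabel('lag (days)')
ylabel('autocorrelation')

[~, bestLag] = max(acf);  % lag with the strongest positive correlation
fprintf('Strongest correlation at a lag of %d days\n', bestLag);
```

For this artificial series the strongest correlation is at a lag of 7 days, as you'd expect. For independent data the values at every lag should hover near zero, although with only 30 points they will be noisy.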

Discussion

You've had a quick introduction to handling data obtained from a single variable. Remember to always plot it. Remember to explore it thoroughly: if you think a histogram might be hiding something then remember to make a line plot or scatter plot too. If you see trends in such a plot then you may need to dig deeper and figure out what is underlying these trends.

 
