Plotting less simple bar charts

Problem

How do I pimp my bar chart? I want SEMs, overlaid raw data, and those little stars everywhere.

Solution

You've made a simple bar chart and want to step up in the world. Thankfully, this isn't too tough. First of all, we'll replace the standard deviation we plotted in the previous recipe with an error bar based on the standard error of the mean (SEM). The idea is that if the 95% confidence intervals of two groups don't overlap then you likely have a significant difference on your hands. I say "likely" because in practice this may depend on other things, such as multiple comparisons. We'll use the SEM_calc.m function introduced in the SEM recipe.
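SEM_calc.m itself isn't reproduced in this recipe. As a rough stand-in (an assumption about what such a function computes, not the actual SEM_calc code), the plain SEM is the standard deviation divided by the square root of the sample size, and a 95% interval half-width can be obtained by scaling that with a t critical value. The names semFun and ci95Fun are made up for illustration, and tinv needs the stats toolbox.

```matlab
%Hypothetical stand-ins for SEM_calc (assumed behaviour, not the real function)
semFun  = @(x) std(x)/sqrt(numel(x));              %plain standard error of the mean
ci95Fun = @(x) tinv(0.975,numel(x)-1)*semFun(x);   %95% CI half-width (needs stats toolbox)

x = [1,2,3,4,5];
s = semFun(x);    %std(x) is sqrt(2.5) and n is 5, so s is sqrt(0.5)
c = ci95Fun(x);   %wider than s because the t critical value exceeds 1
```

If you only need a quick look at the data, either of these could be dropped into the loop below in place of SEM_calc.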

%Same data as before
data.bob=randn(1,12)+0.66; 
data.alice=randn(1,15)+1.2; 
data.rufus=randn(1,8)-0.8; 
data.uma=randn(1,21)+1.4;
data.bozo=randn(1,10)+5;

%Calculate means and SEM in the same manner as previously
f=fieldnames(data); %the group names

for ii=1:length(f)
    mu(ii)=mean( data.(f{ii}) ); 
    sem(ii)=SEM_calc( data.(f{ii}) ); 
end



%Plot all this in a pretty way
H=bar(mu); 
set(H,'EdgeColor','b','FaceColor',[0.5,0.5,1],'LineWidth',1.5) 
set(gca,'XTickLabel',f)
ylabel('Truffles per cubit') 

hold on
for ii=1:length(f)
  plot([ii,ii],[mu(ii)-sem(ii),mu(ii)+sem(ii)],'-k','LineWidth',4)
end
hold off
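
As an aside, the loop above draws each error bar by hand, which gives full control over how it looks. MATLAB's built-in errorbar function can do a similar job in one call. A minimal sketch with made-up numbers (the values of mu and err here are illustrative only, not output from the data above):

```matlab
%Illustrative numbers only; in the recipe these come from the loop over groups
mu  = [0.7, 1.2, -0.8, 1.4, 5.0];
err = [0.3, 0.25, 0.4, 0.2, 0.35];

bar(mu,'EdgeColor','b','FaceColor',[0.5,0.5,1],'LineWidth',1.5)
hold on
%'LineStyle','none' stops errorbar from joining the group means with a line
E=errorbar(1:length(mu),mu,err,'k','LineStyle','none','LineWidth',2);
hold off
```

The built-in version adds caps to the bars; the hand-drawn lines are plainer and easier to make very thick.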

Ok! We've blasted through all the stuff in the last recipe, but this time we've used the SEM instead of the SD. Let's overlay the raw data now. It's good practice to overlay raw data when it is reasonable to do so. If you have vast quantities of data (large sample sizes and dozens of groups) then overlaying the raw data points may make the plot impossible to read. In most cases, however, overlaying the data is possible. It's always worth doing because this way you are showing yourself (and your reader) all of the information.

hold on
for ii=1:length(f)
    tmp=data.(f{ii}); %temporarily store data in variable "tmp"
    x = repmat(ii,1,length(tmp)); %the x axis location
    x = x+(rand(size(x))-0.5)*0.1; %add a little random "jitter" to aid visibility

    plot(x,tmp,'.r')
end
hold off

Despite overlaying all of the data, everything has remained readable and visible. Using smaller data points can help if you have a lot of data to plot. The error bars can be thicker if you want to emphasise them more. The red data points look rather aggressive but they do stand out. With a little creativity, this plot could look equally clear in gray-scale.
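
Here is one way the gray-scale idea might look, sketched for a single group with made-up data. The colours, marker size, and line width are just suggestions, and a plain SEM is assumed for the error bar to keep the example self-contained.

```matlab
%Made-up data for one group; styling choices are suggestions only
y   = [0.2, 0.9, 1.1, 1.6, 2.3];
mu  = mean(y);
err = std(y)/sqrt(length(y)); %plain SEM, assumed here for simplicity

bar(1,mu,'FaceColor',[0.85,0.85,0.85],'EdgeColor','k','LineWidth',1.5)
hold on
plot([1,1],[mu-err,mu+err],'-k','LineWidth',6) %a thicker error bar
x = 1+(rand(size(y))-0.5)*0.1; %same jitter as in the recipe
plot(x,y,'o','MarkerSize',4,'MarkerEdgeColor',[0.3,0.3,0.3],'MarkerFaceColor','w')
hold off
```

Open circles with a white fill stay visible against both the light gray bar and the black error bar.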

You are amazed at how many truffles per cubit Bozo can eat so you want to submit your study for publication. Let's say that the most important comparison is between the groups "uma" and "bozo." Their 95% confidence intervals are miles apart and don't overlap at all (neither do the raw data). Clearly the difference between their means is significant. However, peer reviewers like p-values so let's add some to get the graph ready for submission.

%Un-paired t-test (needs stats toolbox)
%The first output is discarded so we don't overwrite the bar handle, H
[~,P,CI,STATS]=ttest2(data.bozo,data.uma);

%Let's add the t-test result as text to the top left corner of the plot
str=sprintf('t(%d)=%0.2f, p<0.001',STATS.df,STATS.tstat);
ylim([-3,8])
text(0.25,7,str,'FontWeight','bold')

%Finally, we manually add the line and the star
hold on
p=plot([4,4,5,5],[6.8,7,7,6.8],'-k','LineWidth',2);
hold off
text(4.5,7,'***','backgroundcolor','w','horizontalalignment','center')

We now have everything one could wish for (if you can stomach the colour of the points, of course). There are all the raw data, 95% confidence intervals, labels, a stats test, and even the stars indicating which groups are different. The p-value was very small (about 10^-12) and in this case I've chosen to simply state that it's smaller than 1 in 1000. You might choose to list the value more precisely. Either is fine, particularly given that we're showing all the data.
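
If you do want the exact value in the label, or want the number of stars to follow the p-value rather than typing them by hand, something along these lines could work. The numbers below are placeholders standing in for the P and STATS returned by ttest2, and the star thresholds (0.05, 0.01, 0.001) are a common convention that I'm assuming here; check what your journal expects.

```matlab
%Placeholder values standing in for the ttest2 outputs
P     = 1e-12; %hypothetical p-value
df    = 29;    %21 + 10 - 2 for the uma and bozo groups
tstat = 10.5;  %hypothetical t statistic

%Choose the stars from the p-value (assumed convention)
if P<0.001
    stars='***';
elseif P<0.01
    stars='**';
elseif P<0.05
    stars='*';
else
    stars='n.s.';
end

%Report the p-value in scientific notation instead of "p<0.001"
str=sprintf('t(%d)=%0.2f, p=%0.1e',df,tstat,P);
```

The strings stars and str can then be passed to the two text commands above.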

Discussion

What's really great about MATLAB is that it's possible to make quite elaborate graphs with relatively few lines of code. Once you've laid out your code nicely in an m-file, it can easily be re-used. Your m-files are a self-documenting log of how to make your plots. You can turn them into functions so that you can very easily make new plots based on different data. I've turned the code above (minus that for the stats test) into an example function. Whenever I make a plot for publication I make a directory in which I keep the plot and all of the m-files that are required to make it. For a more general-purpose way of adding the significance bars, see sigstar.
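
To give a flavour of what wrapping the plotting code in a function might look like, here is a rough sketch. This is not the example function mentioned above: the name plotBarWithPoints is made up, and it uses a plain SEM (standard deviation over root n) so that it runs without SEM_calc.

```matlab
function [mu,err]=plotBarWithPoints(data)
%Hypothetical helper: bar chart with error bars and jittered raw data
%data is a structure with one field per group, each holding a vector

f=fieldnames(data);
mu=zeros(1,length(f));
err=zeros(1,length(f));

for ii=1:length(f)
    tmp=data.(f{ii});
    mu(ii)=mean(tmp);
    err(ii)=std(tmp)/sqrt(length(tmp)); %plain SEM (assumption); swap in SEM_calc if you have it
end

bar(mu,'EdgeColor','b','FaceColor',[0.5,0.5,1],'LineWidth',1.5)
set(gca,'XTick',1:length(f),'XTickLabel',f)

hold on
for ii=1:length(f)
    tmp=data.(f{ii});
    plot([ii,ii],[mu(ii)-err(ii),mu(ii)+err(ii)],'-k','LineWidth',4) %error bar
    x=ii+(rand(size(tmp))-0.5)*0.1; %jittered x positions
    plot(x,tmp,'.r') %raw data
end
hold off
end
```

Saved as plotBarWithPoints.m, it could be called with the data structure from this recipe as [mu,err]=plotBarWithPoints(data);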

There is one general point worth making about the last graph: the overlaid data points are incredibly useful to have. If you can plot them (and almost always you can), you should be doing so. Those overlaid points are your data, so show them! A mean and an error bar are useful summary statistics but the real purpose of a graph is to showcase data. You can't do that if you're withholding it and replacing it with an error bar, which on its own tells you little about outliers or the underlying shape of the data. With notBoxPlot it's easy to show all your raw data, even if you have a lot of it.

 
