"If you run after two hares, you will catch neither." We probably don't chase hares and probably have no intention to do so, but probably all of us have tried to do two things at a time. Data visualization is an area in which it's tempting to do just that, but if we are not careful, we end up achieving nothing.
Recently, there have been many public discussions about data visualization in science. For example, bar plots have been called out as not transparent, information sparse, and misleading (e.g. Weissgerber et al. 2015, #barbarplots, and read [here] and [here] why the bar plot might still be a fine plot for certain communication situations).
Other plot types such as the popular boxplot are sometimes advocated instead. Boxplots represent the median and other indicators of distributional properties such as the upper and lower quartiles. They are able to reveal some aspects of the underlying distributions that simpler plots such as bar plots can’t capture. For full disclosure, boxplots are the darling of my scientific field, seemingly ruling the data visualization scene in speech sciences. There must be good reasons for that, right? Let’s look at what the boxplot brings to the table and why I think the boxplot is actually a terrible plot for communication purposes.
Boxplots are misleading.
Yes, a boxplot provides more distributional landmarks than a simpler plot such as a bar plot, but it still abstracts away from actual distributional properties. One and the same boxplot can result from a variety of different distributions. Justin Matejka and George Fitzmaurice beautifully demonstrate these issues here, as shown in the following illustration.
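The same point can be sketched numerically. The following is a minimal illustration, not the method used by Matejka and Fitzmaurice: the three data sets are made up for this example, and the helper `five_number_summary` is a hypothetical function that computes quartiles as the medians of the lower and upper halves of the sorted data (one common convention among several; plotting libraries may use a different one).

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: quartiles are the medians of the lower and upper halves
    of the sorted data, excluding the overall median when n is odd.
    Requires at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Three made-up samples with very different shapes.
uniform = list(range(1, 12))                    # evenly spread
bimodal = [1, 3, 3, 3, 3, 6, 9, 9, 9, 9, 11]    # two clusters, at 3 and 9
peaked = [1, 3, 3, 6, 6, 6, 6, 6, 9, 9, 11]     # one cluster, at 6

for name, sample in [("uniform", uniform), ("bimodal", bimodal), ("peaked", peaked)]:
    print(name, five_number_summary(sample))
# Each line prints the same summary: (1, 3, 6, 9, 11)
```

All three samples would be drawn as the same box with the same whiskers, although one is flat, one has two clusters, and one has a single central cluster.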
Don’t get me wrong, a boxplot is a quick and dirty way to roughly eyeball distributions. It’s a useful tool for data exploration, but it clearly falls short when compared to more raw data oriented plots. That being said, can we at least use boxplots to illustrate differences?
Boxplots don't communicate differences well.
When comparing two boxplots, the distributional overlap between categories is visually dominant. The visual assessment of the difference, however, is limited to the comparison between only one out of five pairs of short lines (the median, not the quartiles or whiskers). This is at odds with what we often want to communicate. It is differences that we often talk about, partly because the inferential frameworks that we use are centered around testing differences.
I would like to invite the community to think outside the boxplot and consider our communicative intentions behind plotting data. If we want to communicate a simple relationship between categories (“X is greater than Y”), let’s use a simple plot like the bar plot or a point plot, which are graphs that are processed easily and quickly. If we want to talk about the nature of the empirical distributions, let's use more raw data oriented plots such as density plots or jittered scatterplots (see Rousselet et al. 2017, who discuss many valuable alternatives).
The boxplot neither communicates differences well nor is it very informative for exploratory purposes. Using a boxplot is the #dataViz equivalent of running after two hares.
Don't do it.
Boxplots, I mean.
Numbers don't speak for themselves. Well, to be honest, they might speak, but more often than not, they speak unintelligible gibberish which nobody understands. We, the people who process, summarize and filter these numbers, have the opportunity to let our data speak in a way that our audience can actually understand. Creating graphs that communicate what’s behind our data requires us to be aware of the needs and resources of our audience in a given communication situation.
Recently there have been several public discussions regarding good and bad types of data visualization. Take for example the most widely used plot type in contemporary science: the bar plot. A bar plot is a very simple plot which depicts a group mean represented by the length of a bar and usually some kind of error margin represented by a whisker. It is a very abstract representation of data and has drawn some criticism (e.g. Weissgerber et al. 2015).
For example, the #barbarplots project has charmingly raised awareness that in many circumstances bar plots are devoid of transparency and lack important information. Bar plots can even be misleading because plotting only the arithmetic mean misses important information about the underlying distribution of the data. These concerns are justified, so I won’t argue with that. When the underlying distribution of our data is not normal or when there are influential outliers, the mean is not a good measure of the central tendency anyway. But for the sake of the argument, let’s assume our data is eligible to be represented by mean values.
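To make the caveat about outliers concrete, here is a small sketch with made-up numbers (the variable names and values are purely illustrative, think of them as hypothetical durations in milliseconds). It shows how a single extreme value pulls the mean, and therefore the height of a bar, far away from where most of the data sit, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical measurements, invented for this illustration.
clean = [98, 101, 99, 102, 100]
with_outlier = clean + [400]  # one influential outlier added

print(mean(clean), median(clean))                # 100 100
print(mean(with_outlier), median(with_outlier))  # 150 100.5
```

A bar drawn at 150 would suggest a typical value that none of the six observations is close to, which is exactly the situation in which the mean should not be plotted on its own.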
Should we abandon bar plots? No, we certainly should not.
A bar plot is an easy-to-process abstraction of complex relationships boiled down to simple differences in length. The bar plot is an invaluable plot type which efficiently expresses relationships such as “X is larger than Y”. If we want to communicate this simple relationship – and we often rhetorically want to do just that – a bar plot is a fine choice. The question, though, is when we should use bar plots and when we should choose a different plot type.
One driving force behind choosing a good plot type can be the communicative value of a plot. However, communicative value isn’t simply the absolute amount of information of a plot. It’s the amount of information that can be processed in a given communication context; and the communication contexts that we scientists encounter can be very diverse.
Studying a plot in a manuscript and briefly looking at a plot during a presentation are two entirely different situations. In the first scenario, the readers have plenty of time to study the plot with all its elements. They can easily compare it to relevant information in the preceding and following text. So, yes, for this specific communication situation, we can enrich our plot with everything we feel is necessary for our readers. Creating an informationally dense plot including distributional information of the data is appropriate in this case because the reader is in a luxurious comprehension situation defined by the lack of time pressure.
However, this isn’t necessarily the case during a talk or a lecture. We as the speaker offer a linear narrative. The audience can’t access preceding or following information because they can’t jump back and forth in time. They can only look at a plot for as long as we allow them to. If the people in the audience are lucky, they might get one or two minutes to process the plot. But they have to process it while we constantly fire verbal information at them. On top of that, usually both the speaker and the listener are non-native speakers of the language that is used. This is far from an ideal reception situation. The audience has only limited cognitive capacities to take in the visual information and verbal information. Complex visual relationships take time to both process and integrate with the content. If the audience doesn’t have the time to do just that, our precious plots don’t do what they have been crafted for: communicating information.
Scientific communication is not unidimensional. It comes in diverse forms. An argument against using a particular visualization technique in one context might be an argument for using this very technique in another context (see also Michael Frank's post for a similar discussion).
Bar plots are simple, therefore they hide many aspects of the data, yes. But bar plots are simple, therefore they allow the audience to quickly understand the main point.
Talks and lectures are communication situations in which we need to make it easy for our audience to digest the information. The audience will forget a lot of the stuff we are talking about anyway, so let’s try to make it as easy as possible for them to remember the cornerstones of our story. Simple data visualizations like bar plots are one way to achieve this goal.
Don't bar them.