In this blog we’re going to use Flourish with more MusicBrainz data to plot the length of Miles Davis songs on a range of vinyl releases. This type of data often suggests the use of a scatter plot with an x-y axis to best visualize the information. For instance, we could place record labels on the x-axis, and the length of each song (in seconds) on the y-axis. However, with record labels being a categorical variable (i.e.- discrete values such as Sony, Columbia, etc.) there are better options for understanding the data versus a true scatter plot.
The first of these is a boxplot, which provides the ability to see the distribution of data (song lengths) by record label. Let’s take a look at this data in Flourish:
Here we have limited the data display to a single label (showing all was quite messy!). Select CBS or Columbia to see labels with many Miles Davis releases. We now see the median length of a recording, as well as the 25th percentile (bottom of the box) and the 75th percentile (top of the box). It’s also easy to see individual songs that lie below or above the typical range; in statistical terms, these are called outliers. On our plot, they represent songs that are either much shorter than normal (below the extended line) or longer than normal (above the extended line).
This is all useful information, but presents some limitations. Boxplots are very good at doing the aggregations for us while obscuring the individual data values, especially values that lie inside the box. To improve our ability to see those values we turn to a violin plot, which excels at showing the shape of a distribution, rather than the fixed shape provided by the boxplot. We have also combined a beeswarm plot with the violin plot so we can see every individual value:
I’ll be using Flourish to interrogate the MusicBrainz data further in future posts, but that’s it for now. Thanks for reading!