Overlapping Distribution Plots

On Twitter, Phil Cohen asked how he might make a plot showing overlapping distributions:

Hey, here's a question. Does this work for showing the inequality between two distributions? Also, the data, if you have a better idea: pic.twitter.com/ViXuMfw1Qt
— Philip N Cohen (@familyunequal) May 7, 2018

I think that he was on the right track using transparency, but I am not sure that the color was exactly right. The plot reminded me of what Mike Bostock (my generation's Edward Tufte) did to make a population pyramid.

Phil was also working at another disadvantage: he was using Microsoft Excel. Excel (like all Microsoft Office products, actually) renders the alpha channel terribly. The alpha channel tells a program how transparent a color is and, therefore, how to blend overlapping layers. Microsoft's limited handling of color means that the overlapping plots don't come out well.
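To make the idea of blending concrete, here is a minimal sketch in base R of the standard "over" compositing rule, in which each channel of the result is alpha times the top color plus (1 - alpha) times whatever is underneath. The helper name `blend_over` and the white background are assumptions made for illustration; the two hex codes and the alpha of .25 are the ones used in the plots later in this post.

```r
# Composite a semi-transparent color over an opaque color ("over" rule).
# blend_over is a hypothetical helper, not part of any package.
blend_over <- function(src_hex, alpha, dst_rgb) {
    src_rgb <- col2rgb(src_hex)[, 1]   # named vector: red, green, blue (0-255)
    alpha * src_rgb + (1 - alpha) * dst_rgb
}

# Assumption: a plain white background
background <- c(red = 255, green = 255, blue = 255)

# Red layer first, then the blue layer on top of it
red_only <- blend_over("#ff1212", 0.25, background)
overlap  <- blend_over("#1212ff", 0.25, red_only)

round(overlap)   # red 196, green 151, blue 211: a light purple
```

A program that blends layers this way produces the purple overlap automatically, with no need to pick a third color by hand.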

To help Phil out, and to provide a demonstration of how to visualize these same plots in R very simply, I wrote up the following code.

First, I need to create the data (which Phil provided in his post):

library(ggplot2)

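## Data: one value per score (2 through 11) for each group
scores <- factor(2:11)
white <- c(1, 2, 3, 6, 9, 15, 20, 18, 18, 9)
black <- c(2, 6, 6, 9, 12, 18, 21, 14, 9, 3)

## Create stacked (long) data frame: one row per score and race
df <- data.frame(
    score = rep(scores, times = 2),   # repeat the scores once per group
    count = c(white, black),
    race  = rep(c("White", "Black"), c(length(white), length(black)))
)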
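A quick check, sketched here in base R, that the stacked data frame has the shape the plots expect. The block repeats the data so that it runs on its own, and the name `totals` is only illustrative.

```r
scores <- factor(2:11)
white <- c(1, 2, 3, 6, 9, 15, 20, 18, 18, 9)
black <- c(2, 6, 6, 9, 12, 18, 21, 14, 9, 3)
df <- data.frame(
    score = scores,
    count = c(white, black),
    race  = rep(c("White", "Black"), c(length(white), length(black)))
)

# One row per score and group: 10 scores x 2 groups
nrow(df)            # 20
table(df$race)      # 10 rows for each group

# Sum of the values within each group
totals <- tapply(df$count, df$race, sum)
totals              # Black 100, White 101
```

The White column comes to 101 rather than 100, presumably because the posted figures were rounded.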

Next, I will create the bar plot that Phil initially rendered. I will use HTML hex color codes for the red and blue fills, and I will set the alpha (opacity) to .25 to make them very light. If I made them any darker, the overlapping (purple) part of the distribution would be difficult to see.

## Overlapping bar plot
ggplot(df, aes(x = score, y = count, fill = race)) +
    geom_bar(stat = "identity", position = "identity") +
    scale_fill_manual(values = alpha(c("#ff1212", "#1212ff"), .25)) +
    labs(
        title = "Chart for Phil",
        y = "Percent", x = "Score"
    )

Figure: Overlapping bar plot
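The purple region is, at each score, the smaller of the two bars, so its total size can be sketched with `pmin()` in base R. This is an illustrative calculation on the data above, not something taken from Phil's original chart, and the name `shared` is hypothetical.

```r
white <- c(1, 2, 3, 6, 9, 15, 20, 18, 18, 9)
black <- c(2, 6, 6, 9, 12, 18, 21, 14, 9, 3)

# Height of the overlapping (purple) part at each score
shared <- pmin(white, black)
shared

# Total size of the overlap
sum(shared)   # 82
```

Because each group sums to roughly 100, this total can be read as roughly the percentage of each distribution that overlaps the other.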

However, I think the bars make this graph difficult to read. Although scores were binned into discrete integers from 2 to 11, I think that Phil actually wants to communicate that the scores are continuous. In that case, lines probably make more sense, since we want to visually connect adjacent values, for example, 2 to 3 and 3 to 4. To do that, we can use geom_area() to fill the area between the baseline ($y=0$) and the value.

## Overlapping area plot
ggplot(df, aes(x = score, y = count, group = race)) +
    geom_area(aes(fill = race), position = "identity") +
    scale_fill_manual(values = alpha(c("#ff1212", "#1212ff"), .25)) +
    labs(
        title = "Chart for Phil",
        y = "Percent", x = "Score"
    )

Now it becomes much easier to see where and how much the distributions overlap with each other.
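If the scores are to be treated as continuous, one variant is to keep score numeric rather than a factor, so that ggplot2 uses a continuous x axis. The sketch below assumes that change, names the fill values explicitly so the color assigned to each group is visible in the code, and adds an outline with geom_line(). The object names `df_num` and `p` are illustrative.

```r
library(ggplot2)

white <- c(1, 2, 3, 6, 9, 15, 20, 18, 18, 9)
black <- c(2, 6, 6, 9, 12, 18, 21, 14, 9, 3)

# Same stacked layout as before, but score stays numeric
df_num <- data.frame(
    score = rep(2:11, times = 2),
    count = c(white, black),
    race  = rep(c("White", "Black"), each = 10)
)

# Named vector: each group is tied to its color by name
group_colors <- c(Black = "#ff1212", White = "#1212ff")

p <- ggplot(df_num, aes(x = score, y = count, group = race)) +
    geom_area(aes(fill = race), position = "identity", alpha = 0.25) +
    geom_line(aes(colour = race)) +
    scale_fill_manual(values = group_colors) +
    scale_colour_manual(values = group_colors) +
    scale_x_continuous(breaks = 2:11) +   # one tick per integer score
    labs(title = "Chart for Phil", y = "Percent", x = "Score")

p
```

Setting alpha inside geom_area() keeps the fills light while the lines stay fully opaque, which makes the outline of each distribution easier to follow.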

mike bader
