Three ways to plot logistic regressions

r
data-visualization
Published

August 19, 2020

If your data is just 1’s and 0’s, it can be difficult to visualize alongside a best-fit line from a logistic regression.

Even with transparency, the overplotted data points just turn into a smear on the top and bottom of your plot, adding little information. Here are three ways to get more information out of those points and produce more informative plots. But first, a quick introduction to the data.

The data

I simulated some data on survival as a function of size. Survival is binary (1 = survived, 0 = died).

head(df)
# A tibble: 6 × 2
   size  surv
  <dbl> <int>
1  4.78     0
2  4.40     1
3  5.02     1
4  5.32     1
5  4.61     0
6  4.81     1
nrow(df)
[1] 1000

We can fit a logistic regression…

m <- glm(surv ~ size, family = binomial, data = df)

…and extract fitted values using broom::augment()

plot_df <- augment(m, type.predict = "response")
head(plot_df)
# A tibble: 6 × 8
   surv  size .fitted .resid    .hat .sigma  .cooksd .std.resid
  <int> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
1     0  4.78   0.746 -1.66  0.00125  0.970 0.00184      -1.66 
2     1  4.40   0.601  1.01  0.00287  0.971 0.000958      1.01 
3     1  5.02   0.819  0.633 0.00122  0.972 0.000136      0.633
4     1  5.32   0.884  0.496 0.00160  0.972 0.000105      0.496
5     0  4.61   0.685 -1.52  0.00169  0.971 0.00184      -1.52 
6     1  4.81   0.756  0.748 0.00122  0.971 0.000197      0.748

These are the data I used for the plot above with the points corresponding to surv and the best-fit line corresponding to .fitted

base <-
  ggplot(plotdf, aes(x = size)) +
  geom_line(aes(y = .fitted), color = "blue") +
  labs(x = "Size", y = "Survival")

base + geom_point(aes(y = surv), alpha = 0.2)

1. Rug plot

Turning those points into a “rug” is a common way of dealing with overplotting in logistic regression plots. ggplot2 provides geom_rug(), but getting that rug to correspond to dead plants on the bottom and live plants on the top requires a little data manipulation. First, we’ll create separate columns for dead and alive plants where the values of size only if the plant is dead or alive, respectively, and otherwise NA.

plot_df <-
  plot_df %>% 
  mutate(survived = ifelse(surv == 1, size, NA),
         died     = ifelse(surv == 0, size, NA))

Then, we can plot these as separate layers.

base <-
  ggplot(plot_df, aes(x = size)) +
  geom_line(aes(y = .fitted), color = "blue") +
  labs(x = "Size", y = "Survival")

base + 
  geom_rug(aes(x = died), sides = "b", alpha = 0.2) +
  geom_rug(aes(x = survived), sides = "t", alpha = 0.2)

Honestly, this is not a huge improvment. The overplotting is less of an issue and you can start to see the density of points a bit better, but it’s still not great.

2. Binned points

I discovered this plot in Data-driven Modeling of Structured Populations by Ellner, Childs, and Rees. Their plot used base R graphics, but I’ll use ggplot2 and stat_summary_bin() to get a mean survival value for binned size classes and plot those as points.

base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv))

I think this is fabulous! It definitely needs an explanation in a figure caption though, because what those points represent is not immediately obvious. Also, how close the points fit to the line has more to do with bin size than with model fit, so this one might be better for inspecting patterns than for evaluating fit.

base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv)) + labs(title = "bins = 30") |
base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv), bins = 60) + labs(title = "bins = 60")

3. Histograms

This option takes the ideas of binning values from #2 and showing distributions in the margins from #1 and combines them. I discovered this in a paper from my postdoc adviser, Emilio Bruna.

A function to make this third type of plot with base R graphics is available in the popbio package.

library(popbio)
logi.hist.plot(size, surv, boxp = FALSE, type = "hist", col = "gray")

Re-creating this with ggplot2 requires some hacks, and I’m still not all the way there.

base +
  geom_histogram(aes(x = died, y = stat(count)/1000), bins = 30, na.rm = TRUE) +
  geom_histogram(aes(x = survived, y = -1*stat(count/1000)), bins = 30, na.rm = TRUE, position = position_nudge(y = 1))
Warning: `stat(count)` was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.

There are at least two “hacks” going on here. First, I’m using stat() to extract the bar heights automatically calculated by stat_bin()/geom_histogram() to scale the histogram down. Second, to get the histogram for survivors to be at the top I need to flip it upside down (by multiplying by -1) and move it to the top of the plot with position_nudge(). The downside to this plot is that there are technically three y-axes—the survival probability and the number or proportion in each size class for dead and alive individuals (with 0 at the bottom and top, respectively). You can add a second y-axis to a ggplot, but I’m not sure about a third y-axis.

If you know of another cool way to visualize logistic regressions, or know of some package that does all this for you, please let me know in the comments!