Speeding up DLNMs with bam()

Categories: DLNMs, GAMs, r

Published: January 19, 2021

Note

This is part of a series about distributed lag non-linear models. Please read the first post for an introduction and a disclaimer.

DLNMs themselves may not be that computationally expensive, but combined with random effects, other smoothers, and a large-ish dataset, I've noticed gam() being painfully slow. "Slow" is relative, of course; I'm really only talking about a couple of minutes for a model to run.

bam() in the mgcv package promises to speed up fitting and prediction for GAMs on big datasets, partly by taking advantage of parallelization through the parallel package. I'm going to try to get that working and see how much it really speeds things up.

library(mgcv)
library(dlnm)
library(parallel)
library(tictoc) #for simple benchmarking

Standard gam() DLNM

This is like the DLNM I've been fitting for the last couple of blog posts, except now the size covariate is fit as a smooth (s(log_size)) and there is a random effect of plot.
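
If you're jumping in without the earlier posts, here's a rough sketch of the structure ha needs for the crossbasis smooth to work. This is purely illustrative fake data (ha_sketch is my made-up name, and the 24 lags are just for show); the real ha is built in the earlier posts of this series.

#fake data, just to show the structure the model formula expects
n <- 100
ha_sketch <- data.frame(
  log_size_next = rnorm(n),
  log_size      = rnorm(n),
  plot          = factor(sample(1:10, n, replace = TRUE))
)
#matrix columns: each row holds one observation's exposure history
ha_sketch$spei_history <- matrix(rnorm(n * 24), nrow = n) #one column per lag
ha_sketch$L            <- matrix(rep(1:24, each = n), nrow = n) #matching matrix of lag values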

tic()
growth <-
  gam(log_size_next ~ 
        s(log_size) +
        s(plot, bs = "re") + #random effect
        s(spei_history, L, #crossbasis function
          bs = "cb", 
          k = c(3, 24), 
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
toc()
0.744 sec elapsed

Remember, this is just a subset of the dataset I'm working with. The same model on the full dataset takes about 90 seconds to run, and if I add year as a second covariate, it takes about 380 seconds.

Set up parallelization

parallel works by running code in multiple R sessions simultaneously. Read the documentation before messing with this, because I think if you ask for too many worker processes, you could crash your computer.

cl <- makeForkCluster() #defaults to getOption("mc.cores", 2L) worker processes
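
If you'd rather be explicit about the number of workers, parallel::detectCores() can help pick a value. A minimal sketch (n_workers is just a name I made up; leaving a core free is a habit, not a requirement):

#leave one core free so the rest of the machine stays responsive
n_workers <- max(1, detectCores() - 1)
cl <- makeForkCluster(nnodes = n_workers)
#fork clusters are Unix-only; on Windows use a socket cluster instead:
#cl <- makePSOCKcluster(n_workers)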

Now, I think all I have to do is re-run the same model, just with bam() instead of gam(), and include the cluster argument.

tic()
growth_bam <-
  bam(log_size_next ~ 
        s(log_size) +
        s(plot, bs = "re") + #random effect
        s(spei_history, L, #crossbasis function
          bs = "cb", 
          k = c(3, 24), 
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      cluster = cl,
      data = ha)
toc()
0.636 sec elapsed

Hmm, that's barely any faster. The help file for bam() seems to indicate that it might not speed things up if a computationally "expensive basis" is used. So with this small dataset, maybe the parallelization overhead mostly cancels out the gains?
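
As a sanity check, it's worth confirming that the two functions land on essentially the same fit. The coefficients won't necessarily match to machine precision, but the difference should be tiny:

#largest absolute difference between the gam() and bam() coefficients
max(abs(coef(growth) - coef(growth_bam)))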

When I switch to bam() for the model using the entire dataset (~20,000 rows), I go from 380 seconds to 41 seconds, a substantial improvement!
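
One last bit of housekeeping: shut the worker processes down once you're done fitting models.

stopCluster(cl)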