I used the `glue` package to convert that data into something that looked more like an application.
I can’t share the original form or data, so for the sake of this blog post, I made a simple example form.

Form: https://forms.gle/yJjME2yZMZPzw3p28

Responses: https://docs.google.com/spreadsheets/d/1SUy92T7I3ZoEyZjTxLP7F5pAup58c0xKl0QbvElIfxA/edit?usp=sharing

`googlesheets4` is the package to use to read in the data. We’ll need the sheet ID bit of the URL above to access it.

```
library(googlesheets4)
applicants_raw <- read_sheet("1SUy92T7I3ZoEyZjTxLP7F5pAup58c0xKl0QbvElIfxA")
```

`applicants_raw`

```
# A tibble: 3 × 11
Timestamp Name Email Depar…¹ Caree…² How c…³ How c…⁴ How c…⁵
<dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2023-08-08 11:33:44 Eric Scott eric… Biology Staff Very c… Very c… Very c…
2 2023-08-08 11:35:39 Perrald Mas… perr… Art Hi… Grad S… Not co… Somewh… Not co…
3 2023-08-08 11:37:27 BMO BMO@… Physics Underg… Not co… Not co… Not co…
# … with 3 more variables:
# `How comfortable are you with the following? [Using git commands]` <chr>,
# `How comfortable are you with the following? [Using Quarto or RMarkdown]` <chr>,
# `Why do you want to take this course?` <chr>, and abbreviated variable
# names ¹Department, ²`Career Stage`,
# ³`How comfortable are you with the following? [Using shell commands]`,
# ⁴`How comfortable are you with the following? [Writing for loops in R]`, …
```

You’ll notice that the column headings are long and unruly, but also have important information. I’ll save the originals and then clean them up in the dataframe with `janitor::clean_names()`. I’ll remove the repetitive question “How comfortable are you with the following?” from the matrix question answer columns with some regex and `stringr`.

`library(janitor)`

```
Attaching package: 'janitor'
```

```
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
```

```
library(stringr)
questions <-
  colnames(applicants_raw) |>
  str_replace("How comfortable are you with the following\\? \\[(.+)\\]", "\\1")
applicants <-
  applicants_raw |>
  clean_names()
questions
```

```
[1] "Timestamp"
[2] "Name"
[3] "Email"
[4] "Department"
[5] "Career Stage"
[6] "Using shell commands"
[7] "Writing for loops in R"
[8] "Data wrangling in R"
[9] "Using git commands"
[10] "Using Quarto or RMarkdown"
[11] "Why do you want to take this course?"
```

`applicants`

```
# A tibble: 3 × 11
timestamp name email depar…¹ caree…² how_c…³ how_c…⁴ how_c…⁵
<dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2023-08-08 11:33:44 Eric Scott eric… Biology Staff Very c… Very c… Very c…
2 2023-08-08 11:35:39 Perrald Mas… perr… Art Hi… Grad S… Not co… Somewh… Not co…
3 2023-08-08 11:37:27 BMO BMO@… Physics Underg… Not co… Not co… Not co…
# … with 3 more variables:
# how_comfortable_are_you_with_the_following_using_git_commands <chr>,
# how_comfortable_are_you_with_the_following_using_quarto_or_r_markdown <chr>,
# why_do_you_want_to_take_this_course <chr>, and abbreviated variable names
# ¹department, ²career_stage,
# ³how_comfortable_are_you_with_the_following_using_shell_commands,
# ⁴how_comfortable_are_you_with_the_following_writing_for_loops_in_r, …
```

The “trick” here lies in the fact that you can use the chunk option `output: asis` in Quarto (and RMarkdown) to treat the output of a code chunk as markdown. So we can use the `glue` package to programmatically create markdown, and because `glue::glue()` is vectorized, we only have to generate a “template” of sorts and it will apply it to every response to our form (i.e. every row of the `applicants` tibble). You can see below this chunk how the “applications” get formatted by this template.

````
```{r}
#| output: asis
library(glue)
glue("### {applicants$name}

{applicants$career_stage} | {applicants$department} | <{applicants$email}>

#### How comfortable are you with the following?

|                |                    |
|----------------|--------------------|
|{questions[6]}  | {applicants[[6]]}  |
|{questions[7]}  | {applicants[[7]]}  |
|{questions[8]}  | {applicants[[8]]}  |
|{questions[9]}  | {applicants[[9]]}  |
|{questions[10]} | {applicants[[10]]} |

#### {questions[11]}

{applicants[[11]]}
")
```
````

Staff | Biology | ericrscott@arizona.edu

|                           |                      |
|---------------------------|----------------------|
| Using shell commands      | Very comfortable     |
| Writing for loops in R    | Very comfortable     |
| Data wrangling in R       | Very comfortable     |
| Using git commands        | Very comfortable     |
| Using Quarto or RMarkdown | Very comfortable     |

I love learning

Grad Student | Art History | perry@notreal.org

|                           |                      |
|---------------------------|----------------------|
| Using shell commands      | Not comfortable      |
| Writing for loops in R    | Somewhat comfortable |
| Data wrangling in R       | Not comfortable      |
| Using git commands        | Not comfortable      |
| Using Quarto or RMarkdown | Very comfortable     |

While my background predominantly lies in the field of Art History, I believe that this course presents a unique and valuable opportunity for me to expand my horizons and develop essential skills that can greatly enhance my academic and professional pursuits.

Undergraduate | Physics | BMO@mo.com

|                           |                      |
|---------------------------|----------------------|
| Using shell commands      | Not comfortable      |
| Writing for loops in R    | Not comfortable      |
| Data wrangling in R       | Not comfortable      |
| Using git commands        | Not comfortable      |
| Using Quarto or RMarkdown | Not comfortable      |

I want to go to school so I can learn all kinds of sweet coding tricks to impress Finn and Jake and also Football.

For the cherry on top, you can enable a table of contents and annotation with hypothes.is to allow easy navigation between applicants and allow you to take notes. Just add the following to the Quarto YAML header:

```
toc: true
comments:
  hypothesis: true
```

I put the whole example together in a repo where you can see the .Qmd source code and the rendered HTML.

The problem was with `list.files()`. When you give it a vector of paths (which you totally can do), it sorts the combined output alphabetically, but I assumed (a sane assumption, I think) that the output would be in the same order as the input. In this case, I knew there was only one file per path, but I think I would have assumed this even if it was returning more than one file.
```
tmp <- tempdir()
dir.create(file.path(tmp, "A"))
dir.create(file.path(tmp, "B"))
dir.create(file.path(tmp, "C"))
file.create(file.path(tmp, "A", "A.txt"))
```

`[1] TRUE`

`file.create(file.path(tmp, "B", "B.txt"))`

`[1] TRUE`

`file.create(file.path(tmp, "C", "C.txt"))`

`[1] TRUE`

```
file_list <- file.path(tmp, c("C", "A", "B"))
file_list #in order C, A, B
```

```
[1] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/C"
[2] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/A"
[3] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/B"
```

`list.files(file_list, full.names = TRUE) #in order A, B, C!`

```
[1] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/A/A.txt"
[2] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/B/B.txt"
[3] "/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T//RtmpYfrDkf/C/C.txt"
```

So I was wrong, and it made all the work I did for the past several months somewhat wrong, but the good news is there is an easy fix. The `fs` package is the ‘tidy’ solution to working with files and file paths in R. The `fs` alternative to `list.files()` is `dir_ls()`, and like many tidyverse equivalents of base R functions, it is better because it does *less*. It won’t re-order the outputs, and it always assumes you want the full paths (not just the file names, as is the default with `list.files()`).

```
library(fs)
fs::dir_ls(file_list) #in correct order C, A, B
```

```
/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T/RtmpYfrDkf/C/C.txt
/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T/RtmpYfrDkf/A/A.txt
/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T/RtmpYfrDkf/B/B.txt
```

Needless to say, I’ll be switching over to `fs::dir_ls()` for this project. I’ll also be spending some more time exploring the `fs` package and likely using it for all my file exploring and manipulation needs from now on.

The “Scientific Programmer” part refers to the research half of the job. I’ll be collaborating with faculty, grad students, and postdocs in the College of Agriculture and Life Sciences on research; offering my skills in statistics, data science, data visualization, data management, reproducibility, etc.

The “Educator” part refers to my role as a data science trainer where I will develop trainings, workshops, and tutorials to improve the data science capacity of researchers at University of Arizona.

For both roles, I have a lot of flexibility so I’m looking forward to continuing to work on multivariate analyses, ecological modeling, demography, and reproducible research methods. I’m also looking forward to learning new things based on where my interests and my trainees’ and collaborators’ interests take me!

I followed a fairly traditional academic path—masters degree, PhD, postdoc—although with some breaks and other jobs in between (see my CV). During my PhD, I developed an interest and expertise in statistics and programming in R. Even if it wasn’t required for my research, I made excuses to learn new R packages and to stay up-to-date with the latest tools for reproducible data analysis. I contributed to an R package that I used in my research, and I developed an original R package or two. I also gained experience teaching statistics and R from TAing for Biostatistics and teaching Ecological Statistics and Data as an instructor of record in my last semester.

As a postdoc, I collaborated with my advisor remotely using GitHub on a number of projects and manuscripts. As a side-project, I worked on a collaborative manuscript extolling the virtues of GitHub for collaborative research in ecology and evolutionary biology.

At some point in my job search (which was almost entirely for traditional faculty positions) I came across a job title I hadn’t heard before: Research Software Engineer (RSE). According to the United States RSE Association, an RSE is:

We like an inclusive definition of Research Software Engineers to encompass those who regularly use expertise in programming to advance research. This includes researchers who spend a significant amount of time programming, full-time software engineers writing code to solve research problems, and those somewhere in-between. We aspire to apply the skills and practices of software development to research to create more robust, manageable, and sustainable research software.

I read that definition and thought, **“Hey! That’s me! I do that!”**. It turns out I had been an RSE as a PhD student and a postdoc and I was really excited to find a new community that I belonged in and a title that I liked a lot better than “data scientist”. I joined the US-RSE Slack, which is where I think I first saw the job announcement (although it was definitely cross-posted in other Slacks I’m in).

I think the most common ways people in my field (ecology) choose an R package or function to learn to solve a particular problem are:

- It’s what they were exposed to in a course or training
- It’s widely used in their field (e.g. cited often)
- It’s written or maintained by a “big-shot” in the field

After gaining experience in R package development, I’ve begun to place less importance on these factors and more importance on other factors that I’ll discuss briefly below.

It’s really important to me that a package I’m using is being actively developed. There are a number of reasons to choose a package with active development over one that is not updated often. One obvious reason is that if you encounter bugs, you can be more confident that they’ll get fixed if you report them. Even mature, well-established packages need active maintenance to ensure they remain functional as their dependencies get updated. I also like choosing packages with active development because I may have an opportunity to help improve the package through my feedback and suggestions.

So, how to assess if a package is being actively developed?

- **Check CRAN release date.** How recent was the latest stable version published to CRAN?
- **Check bug reports.** How many issues are open and how many have been closed? How many really old bug reports are there, if any? Has the package author at least responded to bug reports made over a month ago?
- **Check GitHub development activity.** Have there been somewhat recent commits or pull requests? Is there a NEWS file documenting changes in the development version?
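
As one hedged way to automate the first check, you can read the `Date/Publication` field that CRAN stamps into an installed package’s DESCRIPTION (the helper name `cran_pub_date` is mine; the field is absent for base packages and source installs):

```r
# Sketch: return the CRAN publication date recorded in an installed
# package's DESCRIPTION, or NULL when the field is absent (base packages,
# GitHub/source installs) -- in which case check CRAN or GitHub directly.
cran_pub_date <- function(pkg) {
  utils::packageDescription(pkg)[["Date/Publication"]]
}

cran_pub_date("stats")  # base package, so this returns NULL
```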

Where an R package or function is in its lifecycle can help me make a decision about whether to use it over an alternative. You might have noticed lifecycle badges in tidyverse package help files (for example, the “superseded” badge on the help file for `gather()` from the `tidyr` package) or on README files on GitHub (e.g. the “maturing” badge on the `ipmr` package I’ve been learning lately).

I generally want to avoid learning anything that has been superseded (no longer being developed) and I definitely don’t want to learn anything that’s been deprecated. Instead I focus my efforts on learning functions or packages that are in the maturing or stable stages. If a package is still in the experimental stage, I may choose to learn it, but probably only if I’m interested in contributing to the package development and it’s really the best option for what I’m trying to do.

These specific lifecycle stages are not always used in all packages, and sometimes the stage must be sussed out through some exploration. For example, if you check the website for the `raster` package, you’ll see that it has been superseded by the `terra` package, although the lifecycle badges are not used by these packages.

Unit tests are code that is written to check for the correct behavior of functions under a range of possible inputs from users. Tests can help catch bugs, check that users get informative error messages, and check for correctness. Most ecologists probably don’t know to look for tests when choosing to use an R package. I think there is the assumption that if it’s on CRAN and has been cited before, then it’s legit. That simply is not true. For example, the `spi` package was on CRAN (it’s now been archived) and had been cited in papers, but didn’t calculate anything *even close* to what it claimed to calculate. If the `spi` package had unit tests, perhaps this error would have been caught before it went to CRAN. The easiest way to check if a package has tests is to find the GitHub repository (often linked to from the CRAN page). If there is a “tests” folder, and it has some code in it, then that’s a start. You can also look for a badge in the README that describes the test coverage (how much of the package code is tested), such as the codecov badge here. If test coverage is > 75% you’re in really good shape.
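
For a concrete (if minimal) illustration of what to look for, a package test can be as simple as a script under `tests/` whose assertions error on failure; here `stats::sd()` stands in for the function under test, and a real package would more likely use a framework like testthat:

```r
# Minimal sketch of a tests/ script: R CMD check sources this file and
# fails the check if any assertion throws an error.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# correctness: the sample sd of x is sqrt(32/7)
stopifnot(isTRUE(all.equal(sd(x), sqrt(32 / 7))))

# edge case: sd of a single observation is undefined (NA)
stopifnot(is.na(sd(1)))
```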

As Noam Ross pointed out in the collaborative notes for the call, having a big user base *can* catch bugs without formal tests, but “tests are the ‘ratchet’, though—they make sure you don’t go backwards, introducing old bugs again when you fix new ones”.

Note

This is part of a series about distributed lag non-linear models. Please read the first post for an introduction and a disclaimer.

Since I’ve been working so much with GAMs for this project, I decided to read sections of Simon Wood’s book, Generalized Additive Models: An Introduction with R, more thoroughly. In Chapter 5: Smoothers, there is an example of a **tensor product** smooth (more on what that means later) fitting a distributed lag model. When I saw this, I started to question *everything*. What was the `dlnm` package even doing if I could fit a DLNM with just `mgcv`? Was it doing something *wrong*? Was I interpreting it wrong? Am I going to have to change EVERYTHING?

I also found this wonderful paper by Nothdurft and Vospernik that fits a DLNM for yearly tree-ring data explained by lagged, non-linear, monthly weather data. They also used only the `mgcv` package and tensor product smooths to fit this model. So what is a tensor product, and what is the `dlnm` package doing differently?

A tensor product smooth is a two (or more) dimensional smooth such that the shape of one dimension varies smoothly over the other dimension. Tensor products are constructed from two (or more) so-called marginal smooths. Imagine a basket weave of wood strips. The strips can be different widths and be more or less flexible in each dimension. The flexibility of the strips roughly corresponds to a smoothing penalty (stiffer = smoother) and the number and width of strips roughly corresponds to number of knots. You can bend the first strip on one dimension, but you can’t really bend the adjacent strip in the completely opposite direction. The shape of the strips in one dimension is forced to vary smoothly across the other dimension.

The tensor product for a DLNM has the environmental predictor on one dimension (SPEI, in our example) and lag time on the other dimension. So SPEI can have a non-linear effect on plant growth, but the shape of that relationship at lag = 0 is constrained to be similar to the shape at lag = 1 (the adjacent strip of wood). The change in the shape of the SPEI effect varies smoothly with lag time. Here’s a pure `mgcv` implementation of a DLNM.

`library(mgcv)`

`Loading required package: nlme`

```
Attaching package: 'nlme'
```

```
The following object is masked from 'package:dplyr':

    collapse
```

`This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.`

```
growth_te <-
  gam(log_size_next ~
        s(log_size) +
        te(spei_history, L,
           bs = "cr",
           k = c(5, 15)),
      family = gaussian(link = "identity"),
      method = "REML",
      select = TRUE,
      data = ha)
```

After re-reading Gasparrini’s papers for the billionth time and reading more of Simon Wood’s book, I realized the difference between the pure `mgcv` approach and the `dlnm` approach had to do with “identifiability constraints”. Basically, because a GAM is a function that has covariates which themselves are functions, those smooth covariates are usually constrained to sum to 0. For example, the smooth term for `s(log_size)` looks essentially centered around 0 on the y-axis when plotted.

```
library(gratia)
draw(growth_te, select = 1)
```
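
You can verify this constraint numerically on a toy model (simulated data here, since only a subset of the real `ha` data appears in these posts): the smooth’s contribution, evaluated at the training data, averages to zero by construction.

```r
library(mgcv)

# Simulate a nonlinear relationship and fit a penalized smooth
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.2)
m <- gam(y ~ s(x, bs = "cr"), data = d, method = "REML")

# The sum-to-zero identifiability constraint: the smooth term evaluated
# at the observed covariate values averages to (numerically) zero
mean(predict(m, type = "terms")[, "s(x)"])
```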

Tensor product smooths in `mgcv` have this constraint as well, but not for each marginal function. The entire *surface* sums to zero.

`draw(growth_te, select = 2, dist = Inf)`

The `dlnm` package constructs a “crossbasis” function, but it does this by using **tensor products** from `mgcv`. So what is the difference? Well, the major difference is in how it does identifiability constraints. For `te()`, the entire surface must sum to zero. For `s(..., bs = "cb")`, the predictor-response dimension is constrained to sum to zero. That means that every slice for any value of lag must sum to zero. It also removes the intercept from that dimension, so the resulting smooth ends up having fewer knots and fewer maximum degrees of freedom.

Here’s the `dlnm` version:

`library(dlnm)`

`This is dlnm 2.4.7. For details: help(dlnm) and vignette('dlnmOverview').`

```
growth_cb <-
  gam(log_size_next ~
        s(log_size) +
        s(spei_history, L, #crossbasis function
          bs = "cb",
          k = c(5, 15),
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      select = TRUE,
      data = ha)
```

`draw(growth_cb, select = 2, dist = Inf)`

Notice how it looks much more symmetrical left to right. This is clearer if we plot all the slices through lag time on top of each other, kind of like holding the surface up with the x-axis at eye level and looking down the y-axis in the middle.

`eval_cb <- evaluate_smooth(growth_cb, "spei_history", dist = Inf)`

```
Warning: `evaluate_smooth()` was deprecated in gratia 0.7.0.
ℹ Please use `smooth_estimates()` instead.
```

`eval_te <- evaluate_smooth(growth_te, "spei_history", dist = Inf)`

```
eval_cb %>%
  #just take beginning and end
  # filter(L == min(L) | L == max(L)) %>%
  mutate(L = as.factor(L)) %>%
  ggplot(aes(x = spei_history, y = est, group = L)) +
  geom_line(alpha = 0.1) +
  labs(title = "crossbasis from {dlnm}") +
  coord_cartesian(xlim = c(-0.5, 0.5), ylim = c(-0.02, 0.02)) +
eval_te %>%
  #just take beginning and end
  # filter(L == min(L) | L == max(L)) %>%
  mutate(L = as.factor(L)) %>%
  ggplot(aes(x = spei_history, y = est, group = L)) +
  geom_line(alpha = 0.1) +
  labs(title = "tensor product from {mgcv}") +
  coord_cartesian(xlim = c(-0.5, 0.5), ylim = c(-0.02, 0.02))
```

The crossbasis function intercepts are more tightly aligned, I think because of the sum-to-zero constraint along the `spei_history` axis.

Oddly, the slices in the `dlnm` version still don’t all sum to zero, so maybe I’m still not totally explaining this right.

Ultimately, there is little difference between these approaches for these data. If we use the `cb_margeff()` function I mentioned earlier in this series to get fitted values of y (for the whole GAM, not just the smooth), then the two models look nearly identical.

```
pred_cb <- cb_margeff(growth_cb, spei_history, L)
pred_te <- cb_margeff(growth_te, spei_history, L)
```

```
ggplot(pred_cb, aes(x = x, y = lag, fill = fitted)) +
  geom_raster() +
  geom_contour(aes(z = fitted), binwidth = 0.01, color = "black", alpha = 0.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "{dlnm} crossbasis") +
ggplot(pred_te, aes(x = x, y = lag, fill = fitted)) +
  geom_raster() +
  geom_contour(aes(z = fitted), binwidth = 0.01, color = "black", alpha = 0.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "{mgcv} tensor product")
```

```
Warning: The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
```

And the summary statistics are very similar as well.

`anova(growth_cb)`

```
Family: gaussian 
Link function: identity 

Formula:
log_size_next ~ s(log_size) + s(spei_history, L, bs = "cb", k = c(5, 
    15), xt = list(bs = "cr"))

Approximate significance of smooth terms:
                    edf Ref.df       F  p-value
s(log_size)       1.363  9.000 261.921  < 2e-16
s(spei_history,L) 4.570 23.000   1.077 4.01e-05
```

`anova(growth_te)`

```
Family: gaussian 
Link function: identity 

Formula:
log_size_next ~ s(log_size) + te(spei_history, L, bs = "cr", 
    k = c(5, 15))

Approximate significance of smooth terms:
                     edf Ref.df       F  p-value
s(log_size)        1.364  9.000 261.905  < 2e-16
te(spei_history,L) 4.618 26.000   0.955 4.02e-05
```

The major difference, you’ll notice, is that the reference degrees of freedom are larger for the tensor product version. Again, that is because the crossbasis function constrains the SPEI dimension to sum to zero and the intercept along that dimension is removed.

So the advantages of the `dlnm` package for fitting DLNMs are probably mostly evident when you consider cases besides GAMs. For example, if I wanted to constrain the lag dimension to be a switchpoint function, or if I wanted the SPEI dimension to be strictly a quadratic function, I could do that with `dlnm`. If you’re interested in interpreting your results in terms of relative risk ratios, then `dlnm` offers some OK visualization options for that. When using smooth functions for both marginal bases, the differences between using a straight tensor product with `te()` and using a crossbasis function start to fade away. The `dlnm` version is still a little easier to interpret, I think, because you can more easily compare slices through lag time with the plot of the smooth itself.
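
As a sketch of that kind of flexibility (using the `chicagoNMMAPS` example dataset that ships with `dlnm`; the lag length and breaks here are arbitrary choices of mine), `dlnm::crossbasis()` lets you mix marginal function types outside the GAM interface:

```r
library(dlnm)

# A crossbasis with a strictly quadratic dose-response dimension and a
# step-function (strata) lag dimension -- a combination te() can't express
cb <- crossbasis(
  chicagoNMMAPS$temp, lag = 30,
  argvar = list(fun = "poly", degree = 2),
  arglag = list(fun = "strata", breaks = c(10, 20))
)
summary(cb)
```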

Note

A major goal of my postdoc project is to determine whether drought has an effect on plant vital rates (growth, survival, reproduction, recruitment). Getting some measure of statistical significance of drought history in these models is therefore really important for me. Even with simple linear models, there are multiple ways of doing hypothesis testing, some more “correct” than others. For example, this recent Twitter discussion about the default behavior of `anova()` usually being inappropriate:

EEB folks: when did you realize that #rstats uses Type I sums of squares as a default? via @DanielBolnick

“before” means that you ran analyses without realizing but then changed them (e.g., to Type III) before publishing.

“after” means you published with it & later realized— Andrew Hendry (@EcoEvoEvoEco) January 31, 2021

This StackExchange answer does a really good job of explaining hypothesis testing with GAMs, and I think this extends to DLNMs fit as GAMs. Unlike `anova.lm()` or `summary.lm()`, which are generally **not** the ones you want, the p-values in `anova.gam()` and `summary.gam()` are generally safe to interpret (also, they are exactly the same). Simon Wood (the author of `mgcv`) has given a lot of thought to and published multiple papers on the calculation of these p-values.^{1} ^{2} For an ordinary penalized smooth (like the default thin plate regression splines, or the `"cr"` cubic regression spline basis), the actual wiggliness is lower than the maximum wiggliness (defined by the number of knots). This shrinkage toward a straight line (or a plane, in the case of a crossbasis function) is expressed by the estimated degrees of freedom (edf). For example, if edf ≈ 1, then the smooth is approaching a straight line. Let’s look at an example.

`library(mgcv)`

`Loading required package: nlme`

```
Attaching package: 'nlme'
```

```
The following object is masked from 'package:dplyr':

    collapse
```

`This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.`

`library(dlnm)`

`This is dlnm 2.4.7. For details: help(dlnm) and vignette('dlnmOverview').`

```
growth <-
  gam(log_size_next ~
        s(log_size) +
        s(spei_history, L, #crossbasis function
          bs = "cb",
          k = c(3, 24),
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
```

`anova(growth)`

```
Family: gaussian 
Link function: identity 

Formula:
log_size_next ~ s(log_size) + s(spei_history, L, bs = "cb", k = c(3, 
    24), xt = list(bs = "cr"))

Approximate significance of smooth terms:
                    edf Ref.df        F p-value
s(log_size)       1.363  1.644 1441.842  <2e-16
s(spei_history,L) 7.400  8.847    2.906  0.0019
```

The `edf` for `s(log_size)` is fairly close to 1, indicating that it might be better modeled as a parametric term (i.e. just a slope). The edf for the crossbasis function is higher, indicating a more complex surface. The reference degrees of freedom (`Ref.df`) are, I think, another way of calculating the edf, but honestly, the explanation in the help file and associated paper is beyond my understanding. The test is a modification of a Wald test that can take fractional degrees of freedom (the edf). The help file indicates that p-values “may be somewhat too low when smoothing parameters are highly uncertain. High uncertainty happens in particular when smoothing parameters are poorly identified, which can occur with nested smooths or highly correlated covariates (high concurvity)”. This sounds worrying, but I actually don’t think it’s that different from the situation with a linear model. Highly correlated covariates will *also* give you untrustworthy p-values in an ordinary linear regression, so I’m not sure there’s anything super different here.

In the GAM I fit above, the most a term can be penalized to is linear, i.e. edf = 1 (ignore the random effect of plot as it is different). If I set `select = TRUE` in the `gam()` call, it adds a second penalty on the “null space” and allows the edf to go to 0, effectively dropping the term out of the model entirely. According to the StackExchange answer, this is currently the best way to get p-values for GAMs.

```
growth_shrink <-
  gam(log_size_next ~
        s(log_size) +
        s(spei_history, L, #crossbasis function
          bs = "cb",
          k = c(3, 24),
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      select = TRUE,
      data = ha)
```

`anova(growth_shrink)`

```
Family: gaussian 
Link function: identity 

Formula:
log_size_next ~ s(log_size) + s(spei_history, L, bs = "cb", k = c(3, 
    24), xt = list(bs = "cr"))

Approximate significance of smooth terms:
                    edf Ref.df       F  p-value
s(log_size)       1.348  9.000 262.525  < 2e-16
s(spei_history,L) 4.558 23.000   1.132 2.33e-05
```

All of the edf are smaller, but the `Ref.df` have gone up and are now whole numbers. This is to correct for having done variable selection, I think. Usually it is a bad idea to do variable selection and then run `Anova()` on the final model: the p-values will be biased since you’ve already pulled terms out of your model. So instead of getting estimated reference degrees of freedom, we now get something like the number of knots minus 1 (although that’s not exactly what it is for the crossbasis function).

The test is still a null hypothesis test, but now terms are allowed to be dropped from the model completely if they are not supported by the data.

I’m going to use the `gratia` package to plot the smooths from the shrinkage and non-shrinkage versions.

```
library(gratia)
draw(growth)
```

`draw(growth_shrink)`

There’s really no difference in the shape of `s(log_size)`, meaning that the relationship really is log-linear but that the term *should* stay in the model. The surface for the lagged drought effect is similar in shape, but slightly *flatter* in the shrinkage-penalized version, just as we’d expect from the lower edf.

Wood SN (2013) On p-values for smooth components of an extended generalized additive model. Biometrika 100:221–228. https://doi.org/10.1093/biomet/ass048↩︎

Marra G, Wood SN (2011) Practical variable selection for generalized additive models. Computational Statistics & Data Analysis 55:2372–2387. https://doi.org/10.1016/j.csda.2011.02.004↩︎

Note

DLNMs themselves may not be *that* computationally expensive, but when combined with random effects and other smoothers, and a large-ish dataset, I’ve noticed `gam()` being painfully slow. “Slow” is of course relative, and I’m really only talking about a couple minutes for a model to run.

`bam()` in the `mgcv` package promises to speed up fitting and predicting for GAMs on big datasets by taking advantage of parallelization through the `parallel` package. I’m going to try to get that working and see how much it really speeds things up.

```
library(mgcv)
library(dlnm)
library(parallel)
library(tictoc) #for simple benchmarking
```

This is like the DLNM I’ve been fitting for the last couple of blog posts, except now the size covariate is fit as a smooth (`s(log_size)`) and there is a random effect of plot.

```
tic()
growth <-
  gam(log_size_next ~
        s(log_size) +
        s(plot, bs = "re") + #random effect
        s(spei_history, L, #crossbasis function
          bs = "cb",
          k = c(3, 24),
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
toc()
```

`0.744 sec elapsed`

Remember, this is just a subset of the dataset I’m working with. This same model with the full dataset takes about 90 seconds to run, and if I add a second covariate of year, it takes about 380 seconds.

`parallel` works by running code on multiple R sessions simultaneously. Read the documentation before messing with this, because I think if you set the number of clusters too high, you will crash your computer.
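
One hedged way to stay on the safe side is to size the cluster from `parallel::detectCores()` and leave a core free (the minus-one margin is my convention, not from the documentation; `makeForkCluster()` is Unix-only, so Windows users would use `makeCluster()` instead):

```r
library(parallel)

# Leave one core free so the machine stays responsive; never go below 1.
# detectCores() can return NA on some platforms, hence na.rm = TRUE.
n_workers <- max(1L, detectCores() - 1L, na.rm = TRUE)
cl <- makeForkCluster(n_workers)

# ... fit models with cluster = cl ...

stopCluster(cl)
```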

`cl <- makeForkCluster()`

Now, I think all I have to do is re-run the same model with `bam()` instead of `gam()` and include the `cluster` argument.

```
tic()
growth_bam <-
  bam(log_size_next ~
        s(log_size) +
        s(plot, bs = "re") + #random effect
        s(spei_history, L, #crossbasis function
          bs = "cb",
          k = c(3, 24),
          xt = list(bs = "cr")),
      family = gaussian(link = "identity"),
      method = "REML",
      cluster = cl,
      data = ha)
toc()
```

`0.636 sec elapsed`

Hmm… that was barely any faster. The help file for `bam()` seems to indicate that it might not speed things up when a computationally “expensive basis” is used. So with this small dataset, maybe the overhead of parallelization is eating up most of the gains?

When I switch to `bam()` for the model using the entire dataset (~20,000 rows), I go from 380 seconds to 41 seconds—a significant improvement!

Note

According to Gasparrini et al. (2017), a crossbasis function is a bi-dimensional dose-lag-response function composed of two marginal functions: the standard dose-response function and an additional lag-response function that models the lag structure. Each dimension can be described by a different type of function. The default for the `dlnm` package is a type of smoother called a P-spline, but it can be changed to other types of splines or even something like a step function. The marginal functions can also be mixed and matched, e.g., a P-spline for the lag dimension and a step function for the dose-response dimension.

I’d like to use penalized splines for both bases since they are flexible—that is, they can take nearly any functional shape, including a perfectly straight line.

So far I’ve been using penalized cubic regression splines for both the lag and dose-response dimensions of my DLNMs, but to be perfectly honest, I think I’m only doing this because Teller et al. (2016) use a similar spline basis. However, they aren’t even using DLNMs! I should at least be able to justify my choice of basis function.

`library(mgcv) #for gam()`

`Loading required package: nlme`

```
Attaching package: 'nlme'
```

```
The following object is masked from 'package:dplyr':
collapse
```

`This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.`

`library(dlnm) #for the "cb" basis`

`This is dlnm 2.4.7. For details: help(dlnm) and vignette('dlnmOverview').`

```
#with cubic regression splines for both dimensions
growth_cr <-
  gam(log_size_next ~
        log_size +
        s(spei_history, L,        # <- the two dimensions
          bs = "cb",              # <- fit as crossbasis
          k = c(4, 24),           # <- knots for each dimension
          xt = list(bs = "cr")),  # <- what basis to use for each dimension
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
```

Note: for P-splines, the number of knots, `k`, must be 2 greater than the order of the basis (default 2, i.e. cubic), so I’m using the minimum (4) for the dose-response dimension.

```
#with default P-splines for both dimensions
growth_ps <-
  gam(log_size_next ~
        log_size +
        s(spei_history, L,   # <- the two dimensions
          bs = "cb",         # <- fit as crossbasis
          k = c(4, 24)),     # <- knots for each dimension
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
```

`growth_cr`

```
Family: gaussian
Link function: identity
Formula:
log_size_next ~ log_size + s(spei_history, L, bs = "cb", k = c(4,
24), xt = list(bs = "cr"))
Estimated degrees of freedom:
8.37 total = 10.37
REML score: 675.5565
```

`growth_ps`

```
Family: gaussian
Link function: identity
Formula:
log_size_next ~ log_size + s(spei_history, L, bs = "cb", k = c(4,
24))
Estimated degrees of freedom:
7.63 total = 9.63
REML score: 673.1247
```

The REML score is slightly higher for the `"cr"` basis. Note that in `mgcv` the reported REML score is the negative restricted log-likelihood (the criterion that the fitting algorithm minimizes), so the slightly lower score for the `"ps"` model actually indicates slightly better support by this criterion.

`AIC(growth_cr, growth_ps)`

```
                df      AIC
growth_cr 13.16403 1331.719
growth_ps 12.00639 1332.001
```

AIC, on the other hand, is slightly lower for the `"cr"` basis.

I’m going to use the trick I “discovered” in the previous blog post to plot the crossbasis function from each model.

```
growth_cr$smooth[[1]]$plot.me <- TRUE
growth_ps$smooth[[1]]$plot.me <- TRUE
```

```
par(mfrow = c(1,2))
plot(growth_cr, scheme = 2)
plot(growth_ps, scheme = 2)
```

The minima and maxima are in the same places, which is very reassuring. The wiggliness is different, which is also indicated by the estimated degrees of freedom (8.37 for the “cr” model and 7.63 for the “ps” model).

I’m going to stick with the cubic regression spline basis (`bs = "cr"`) because it seems to result in a *slightly* better fit to data (its AIC is slightly lower) than the P-spline smoothers. In addition, Simon Wood says “However, in regular use, splines with derivative based penalties (e.g. "tp" or "cr" bases) tend to result in slightly better MSE performance” (see `?smooth.construct.ps.smooth.spec`).

Note

The `dlnm` package offers two ways of fitting crossbasis functions: an “internal” and an “external” method. The “external” method involves fitting the crossbasis function outside of a model, using some functions in the `dlnm` package, then including the results as a predictor in a model such as a generalized linear model (GLM). I’m going to focus entirely on the “internal” method, which fits the crossbasis function in the context of a generalized additive model (GAM) to take advantage of the penalization and other features the `mgcv` package offers.

Throughout this series, I’m going to use a subset of data from my postdoc project on *Heliconia acuminata*. In this subset, 100 plants were tracked over a decade. Every year in February their size was recorded as height and number of shoots, and it was recorded whether or not they flowered. Any dead plants were marked as such. The goal is to determine how drought impacted growth, survival, and flowering probability with a potentially delayed and/or non-linear relationship. To that end, I’ve calculated SPEI, a measure of drought, where more negative numbers represent more severe drought. SPEI is monthly while the demography data is yearly. For every observation of a plant, there is an entire history of SPEI for the past 36 months from that observation.

`head(ha)`

```
# A tibble: 6 × 11
plot ha_id_nu…¹ year size size_…² log_s…³ log_s…⁴ flwr surv spei_…⁵ L[,1]
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 2107 107 1998 66 112 4.19 4.72 0 1 -1.33 0
2 2107 107 1999 112 102 4.72 4.62 0 1 1.35 0
3 2107 107 2000 102 68 4.62 4.22 0 1 0.169 0
4 2107 107 2001 68 96 4.22 4.56 0 1 -0.0884 0
5 2107 107 2002 96 164 4.56 5.10 0 1 -0.357 0
6 2107 107 2003 164 114 5.10 4.74 0 1 -1.40 0
# … with 2 more variables: spei_history[2:37] <dbl>, L[2:37] <int>, and
# abbreviated variable names ¹ha_id_number, ²size_next, ³log_size,
# ⁴log_size_next, ⁵spei_history[,1]
```

- `plot` (factor): a plot ID
- `ha_id_number` (factor): a unique plant ID
- `year` (numeric): year of census
- `size` (numeric): number of shoots × height in cm
- `size_next` (numeric): size in the next year
- `log_size` (numeric): log-transformed size
- `log_size_next` (numeric): log-transformed size next year
- `flwr` (numeric): did a plant flower? 1 = yes, 0 = no
- `surv` (numeric): did a plant survive? 1 = yes, 0 = no
- `spei_history` (matrix-column): the drought history starting in the current month (`spei_history[,1]` = February) and going back 36 months (`spei_history[,37]` = February three years ago)
- `L` (matrix-column): the lag structure of `spei_history`; literally just `0:36` for every row

```
library(mgcv) #for gam()
library(dlnm) #for the "cb" basis
```

```
growth <-
  gam(log_size_next ~
        log_size +
        s(spei_history, L,        # <- the two dimensions
          bs = "cb",              # <- fit as crossbasis
          k = c(3, 24),           # <- knots for each dimension
          xt = list(bs = "cr")),  # <- what basis to use for each dimension
      family = gaussian(link = "identity"),
      method = "REML",
      data = ha)
```

Above is a simple DLNM with growth (log size in the next year) modeled as a function of log size and the crossbasis function of SPEI over the past 36 months. `log_size` is a fixed effect (i.e. not a smooth, but fit as a straight line), and the crossbasis is defined in `s(spei_history, L, …)`. `spei_history` and `L` are the two dimensions of the crossbasis function, and `bs = "cb"` tells `gam()` that this is a crossbasis function from the `dlnm` package (it calls `dlnm::smooth.construct.cb.smooth.spec` behind the scenes). `xt = list(bs = "cr")` tells it to use a cubic regression spline as the basis for both dimensions of the crossbasis function (but you can also mix and match marginal basis functions by providing a length-2 vector here).

Unfortunately `plot.gam()` does not work with these crossbasis functions.

`plot.gam(growth)`

`Error in plot.gam(growth): No terms to plot - nothing for plot.gam() to do.`

The `dlnm` package provides some functions for visualizing the results of a DLNM, though I don’t like them much.

First you use `crosspred()` to get predicted values for the DLNM.

`pred_dat <- crosspred("spei_history", growth)`

`centering value unspecified. Automatically set to 0`

Then you plot those with `plot.crosspred()`. The default is a 3D plot.

`plot(pred_dat)`

I prefer a heatmap, although the one produced here has some issues.

`plot(pred_dat, ptype = "contour", xlab = "SPEI", ylab = "lag(months)")`

The first obvious problem is the colors: the range is the same for red and blue, despite a different number of breaks. Second, the units are not what I’d expect. For a marginal effects plot, these should be the size of an average plant in year t+1, all else being equal. Instead, this is plotting the size relative to the size at an average value of SPEI, which is a weird thing to think about. That’s because the package was built with epidemiology and relative risk in mind. Here is the plot relative to SPEI = 1.5:

```
pred_dat <- crosspred("spei_history", growth, cen = 1.5)
plot(pred_dat, ptype = "contour", xlab = "SPEI", ylab = "lag(months)")
```

So, I spent a lot of time writing a complicated function, `cb_margeff()`, to create data for a marginal effects plot. It creates a `newdata` data frame to be passed to `predict()` and loops across different matrices with all columns of `spei_history` set to average except for one, representing a range of possible SPEI values.

```
plotdata <- cb_margeff(growth, spei_history, L)
ggplot(plotdata, aes(x = x, y = lag, fill = fitted)) +
  geom_raster() +
  scale_fill_viridis_c("size in year t+1", option = "A") +
  scale_x_continuous("SPEI", expand = c(0, 0)) +
  scale_y_continuous("lag (months)", expand = c(0, 0))
```

Yeah, this is looking better.

The interpretation of this type of plot (which I would describe as a marginal effects plot, but correct me if I’m wrong) makes more sense to me. If there was drought (low SPEI) about 8 months prior to the census, that’s bad for growth. Drought 20 months prior is good for growth though.

**BUT WAIT**

I poked around in `plot.gam` with `debug()` and it turns out the reason the plotting doesn’t work is only because the author of `dlnm`, Gasparrini, didn’t want it to work.

I can change a simple flag inside the `growth` model, and then it produces something very similar (identical?) to what I have above:

`growth$smooth[[1]]$plot.me`

`[1] FALSE`

```
growth$smooth[[1]]$plot.me <- TRUE
plot.gam(growth, scheme = 2)
```

Why is this default plot not available? It’s literally **exactly** what I wanted, and I’m pretty sure there’s nothing incorrect about it, but it worries me that the author of `dlnm` didn’t want me to make it.

I think I better understand what is going on here now. `plot.gam()`, and the `ggplot2` implementation of it, `gratia::draw()`, plot the smooth itself, not the predicted values. By “the smooth itself”, I mean the function that is acting sort of like one of the coefficients in a GLM: instead of y = β₀ + β₁x, we have y = β₀ + f(x). To further clarify, look at the options for `predict.gam()`. To get predicted values, you can use `type = "link"` or `type = "response"`. But if you just want the values of f(x), then you can use `type = "terms"`. The plot above looks like the one I want, but the scale is actually not in units of plant size. See the `gratia` version, which includes a scale bar:

`gratia::draw(growth, select = 1)`

So my efforts in creating `cb_margeff()` weren’t for nothing after all, and are not in conflict with the views of the `dlnm` package authors. Some day I should probably figure out how to “manually” calculate values of f(x) from the GAM coefficients, but today is not that day.

This is the beginning of a series of blog posts where I publicly stumble my way through figuring out some confusing, complicated, and, frankly, cutting-edge modeling and statistics. The models in question are called distributed lag non-linear models (DLNMs) and they are useful for modeling potentially delayed effects of, say, weather on some outcome like plant growth or survival. I’m learning this stuff out loud and out in the open as a way to keep my thoughts organized, so that I don’t repeatedly question my decisions, so that others can learn from my mistakes, and so that maybe you, kind reader, will help me understand. So, take everything in this series with a grain of salt. I’m not going to go back and edit posts unless I find out I wrote something really egregiously wrong.

Ok, now on to the show…

Before we get into DLNMs, I’ll describe a related and important concept—generalized additive models (GAMs). Briefly, GAMs are a way to fit wiggly lines to data, where the wiggliness is penalized such that it tends toward a straight line unless wiggles are supported by the data. GAMs are really useful for modeling non-linear patterns where forcing a linear (or quadratic) relationship doesn’t make sense. I’m not going to talk a whole lot about GAMs, so if you’re interested in this series and GAMs are new to you, I’d recommend checking out Noam Ross’s course, GAMs in R. It’s free and really, really good.
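As a toy illustration of that penalization (my own example, not from the course): with a strong non-linear signal the smooth keeps its wiggles, while with pure noise the penalty shrinks it back toward a (nearly) straight line.

```r
library(mgcv)  # ships with R

set.seed(1)
x <- runif(200)
y_signal <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # wiggly truth
y_noise  <- rnorm(200, sd = 0.3)                    # no relationship at all

m_wiggly <- gam(y_signal ~ s(x), method = "REML")
m_flat   <- gam(y_noise  ~ s(x), method = "REML")

# Total effective degrees of freedom: high when wiggles are supported,
# near the minimum when they are penalized away
sum(m_wiggly$edf)
sum(m_flat$edf)
```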

My journey to learning DLNMs started with a paper by Brittany Teller and colleagues that uses functional linear models (FLMs) to model the potentially delayed effects of climate on plant growth. An FLM is called that because at least one covariate in the model is itself a continuous function. In this case, rather than including, for example, the temperature 1 month ago, 2 months ago, 3 months ago, etc. as separate covariates, the model includes a continuous function of temperature history as a covariate. They did this in the context of a GAM to allow for a non-linear relationship through lag time. I was looking for a method that would allow me to model delayed effects of temperature and precipitation on leafhopper densities in a tea field. However, their method did not allow for a non-linear climate relationship (which seems likely), and, honestly, I had a hard time understanding their code. So I went searching for something else that could model lagged effects.

I ended up finding a paper describing DLNMs by Gasparrini et al. and an R package, `dlnm`, by the same authors. DLNMs are functional linear models with a special 2-dimensional smooth function, called a “crossbasis” function, as a covariate. The crossbasis function fits a non-linear relationship between the response and the intensity of exposure to some environmental condition (e.g. temperature) on one dimension, and a non-linear effect of lag on the other dimension.
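In rough notation (my paraphrase of the formulation in Gasparrini et al., so treat the symbols as illustrative rather than the authors’ exact definitions), the crossbasis smooth is built as a tensor product of the two marginal bases:

$$
s(x, \ell) = \sum_{j=1}^{J} \sum_{k=1}^{K} r_{jk} \, a_j(x) \, c_k(\ell)
$$

where the $a_j$ are basis functions over the exposure dimension, the $c_k$ are basis functions over the lag dimension, and the $r_{jk}$ are coefficients estimated from the data.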

For example, in the figure above the relative risk of getting some disease is most strongly impacted by the temperature from 0–10 days prior to diagnosis, with increased relative risk (RR) at both low and very high temperatures (part of Figure 3 from Gasparrini et al. 2017).

The great thing about the `dlnm` package is that it allows you to fit these crossbasis functions in the context of a GAM and take advantage of a lot of the great stuff the `mgcv` package offers. The bad thing is that all of this is very new. And when I say “all of this” I mean DLNMs, DLNMs fit using GAMs, and even GAMs themselves. Heck, it was only in 2016 that Simon Wood and colleagues answered the question “How to get unbiased AIC values for GAMs?”.

So in this blog series I’m going to document challenges and confusions I come up against while trying to figure out how to implement and interpret these models. Don’t expect tutorials or great insights or even necessarily complete thoughts. But please, PLEASE, chime in in the comments if you have questions or answers.

The `tibble` package in R allows for the construction of “tibbles”—a sort of “enhanced” data frame. Most of these enhancements are fairly mundane, such as better printing in the console and not modifying column names. One of the unique features of tibbles is the ability to have a column that is a list. List-columns have been written about fairly extensively, as they are a very cool way of working with data in the tidyverse. A less commonly known feature is that matrix-columns are also possible in a tibble. A matrix-column is a column of a tibble that is itself a matrix. Because a matrix-column is simultaneously a single column (of the tibble) and many columns (of the matrix), there are some quirks to working with them.

Data frames and tibbles handle matrix inputs differently. `data.frame()` adds a matrix as separate columns of a data frame, while `tibble()` creates a matrix-column.

`my_matrix <- matrix(rnorm(100), nrow = 10)`

No matrix-column. Just regular columns named `mat_col.1` through `mat_col.10`:

```
df <- data.frame(x = letters[1:10], mat_col = my_matrix)
dim(df)
```

`[1] 10 11`

`colnames(df)`

```
[1] "x" "mat_col.1" "mat_col.2" "mat_col.3" "mat_col.4"
[6] "mat_col.5" "mat_col.6" "mat_col.7" "mat_col.8" "mat_col.9"
[11] "mat_col.10"
```

Creating a matrix-column requires using `tibble()` instead of `data.frame()`:

```
library(tibble)
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
dim(tbl)
```

`[1] 10 2`

`colnames(tbl)`

`[1] "x" "mat_col"`

You can also “group” columns of a data frame or tibble into a matrix-column using `dplyr`.

```
library(dplyr)
df_mat_col <-
  df %>%
  mutate(matrix_column = as.matrix(select(., starts_with("mat_col.")))) %>%
  #then remove the originals
  select(-starts_with("mat_col."))
```

This creates a matrix-column, and the column names of the matrix itself come from the original data frame (i.e. `df`).

`colnames(df_mat_col)`

`[1] "x" "matrix_column"`

`colnames(df_mat_col$matrix_column)`

```
[1] "mat_col.1" "mat_col.2" "mat_col.3" "mat_col.4" "mat_col.5"
[6] "mat_col.6" "mat_col.7" "mat_col.8" "mat_col.9" "mat_col.10"
```

Matrix-columns are sometimes useful in modeling, when a predictor or covariate is not just a single variable but a vector for every observation. For example, in multivariate analyses, certain packages (e.g. `ropls`) require a matrix as an input. Functional models are another example; these fit continuous functions of some variable (e.g. over time) as a covariate. (One specific example is distributed lag non-linear models, which I hope to start blogging about soon.)

```
pca <- prcomp(~ mat_col, data = tbl)
summary(pca)
```

```
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8022 1.6779 1.5645 1.3203 1.02222 0.77201 0.51162
Proportion of Variance 0.2647 0.2295 0.1995 0.1421 0.08517 0.04858 0.02134
Cumulative Proportion 0.2647 0.4942 0.6937 0.8358 0.92096 0.96954 0.99087
PC8 PC9 PC10
Standard deviation 0.31635 0.10918 5.838e-18
Proportion of Variance 0.00816 0.00097 0.000e+00
Cumulative Proportion 0.99903 1.00000 1.000e+00
```

Matrix-columns are… weird, and as such they have some quirks in how they are printed in RStudio. Some of these may be bugs, but as far as I know, there aren’t any issues related to matrix-columns at the time of writing this post. If you are using paged printing of data frames in R Markdown documents, a tibble with a matrix column will simply not appear in-line. Instead you get an empty viewer box like so.

You can turn off paged printing for a single code chunk with the `paged.print` chunk option, and you’ll see something more like this:

```
```{r}
#| paged.print: false
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
tbl
```
```

```
# A tibble: 10 × 2
x mat_col[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.464 -1.12 -1.01 1.73 0.531 2.10 1.44 0.836 0.369
2 b 1.82 -0.239 0.749 1.57 -0.256 -1.41 -0.951 -1.71 -1.77
3 c 0.190 -0.785 1.27 -1.43 -1.82 0.715 -0.593 2.07 -0.228
4 d -1.18 0.271 1.52 0.135 -0.169 -1.23 0.522 -0.410 1.23
5 e -0.509 -0.944 0.108 -1.03 0.407 -0.953 -0.415 -1.25 -0.621
6 f 1.67 0.185 -0.807 0.149 0.114 0.240 -0.791 0.418 -2.13
7 g -2.04 -2.38 0.786 0.660 -0.114 -0.935 0.519 -1.32 -0.627
8 h -0.0686 0.166 -0.0905 -1.18 0.217 -0.695 -1.53 -0.554 -0.610
9 i -1.65 0.0525 -0.501 -1.64 -0.599 -1.04 0.143 -1.83 -0.626
10 j -0.623 -0.290 -0.430 -0.0352 0.937 -3.33 2.32 1.10 -0.503
# … with 1 more variable: mat_col[10] <dbl>
```

Also note that `View()` only renders the first column of a matrix-column, with no indication that there is more to see.

Important

The behavior of `View()` has been fixed since the original publication of this post.

Despite the printing and viewing issues, matrix-columns are surprisingly easy to use. The usual sort of indexing works as expected. You can select the matrix-column by name with `[` or `dplyr::select()`, and you can extract the matrix-column using the `$` operator, `[[`, or `dplyr::pull()`.

```
#a tibble with only the matrix-column
tbl["mat_col"]
select(tbl, mat_col)
#the matrix itself:
tbl$mat_col
tbl[["mat_col"]]
pull(tbl, "mat_col")
```

Indexing rows works with no problem too.

`tbl[3, ]`

```
# A tibble: 1 × 2
x mat_col[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 c 0.190 -0.785 1.27 -1.43 -1.82 0.715 -0.593 2.07 -0.228 2.15
```

```
#dplyr::filter works too
filter(tbl, x %in% c("a", "f", "i"))
```

```
# A tibble: 3 × 2
x mat_co…¹ [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.464 -1.12 -1.01 1.73 0.531 2.10 1.44 0.836 0.369 -1.50
2 f 1.67 0.185 -0.807 0.149 0.114 0.240 -0.791 0.418 -2.13 -0.422
3 i -1.65 0.0525 -0.501 -1.64 -0.599 -1.04 0.143 -1.83 -0.626 0.376
# … with abbreviated variable name ¹mat_col[,1]
```

And as we saw above, using matrix-columns in model formulas seems to work consistently as long as the input is expected or allowed to be a matrix.
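For instance, here’s a minimal base-R sketch (my own example) of a matrix-column used directly in a model formula; `lm()` expands the matrix into one coefficient per column:

```r
set.seed(1)
d <- data.frame(y = rnorm(10))
d$mat_col <- matrix(rnorm(30), nrow = 10)  # assigning after creation keeps the matrix intact

m <- lm(y ~ mat_col, data = d)
length(coef(m))  # intercept plus one coefficient per column of the matrix
```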

Ordinary data frames and tibbles (i.e. without list-columns or matrix-columns) can usually be reliably saved as .csv files.

A tibble with a list-column will throw an error if you try to write it to a .csv file.

```
library(readr)
df_list_col <- tibble(x = 1:10, y = list(1:10))
write_csv(df_list_col, "list_df.csv")
```

`read_csv("list_df.csv")`

```
Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
lgl (1): y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# A tibble: 10 × 2
x y
<dbl> <lgl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
```

Tibbles with matrix-columns don’t throw the same error, but unfortunately this is not because they work correctly.

`write_csv(tbl, "mat_df.csv")`

```
Error in `cli_block()`:
! `x` must not contain list or matrix columns:
✖ invalid columns at index(s): 2
```

`read_csv("mat_df")`

`Error: 'mat_df' does not exist in current working directory ('/Users/ericscott/Documents/GitHub/website-quarto/posts/2020-12-11-matrix-columns').`

As you can see, only the first column of the matrix was saved to the csv file. If you want to use matrix-columns in your work, you should either create them in the same document as your analysis, or save them as .rds files.

Important

Since the publication of this post, these errors have actually switched! Now `write_csv()` seems to not complain when writing tibbles with list-columns (although those columns are written empty), and it errors on the second example with a matrix-column!
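The .rds route preserves matrix-columns exactly; a quick base-R sketch (my own):

```r
d <- data.frame(x = letters[1:10])
d$mat_col <- matrix(rnorm(100), nrow = 10)

f <- tempfile(fileext = ".rds")
saveRDS(d, f)      # serializes the full structure, matrix-column included
d2 <- readRDS(f)

identical(d, d2)   # the round trip loses nothing
```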

That’s all for now, but please let me know in the comments if you’ve used matrix-columns in your work!

I’ll work through an example with simulated data to show you what I mean. Let’s say you’ve applied fertilizer at 3 different levels to 15 replicate corn fields (5 fields per fertilizer treatment). The treatments are 100, 200, and 300 kg N / ha. We measure yield and standardize it to percent of maximum yield.

I’m going to analyze this both as an ANOVA type design, treating fertilizer as categorical, and as a regression. For the sake of demonstration, I’ll use post-hoc power analysis to get statistical power for each test (something you probably shouldn’t do in practice, because post-hoc power is completely determined by the p-value you already computed).
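The simulation code isn’t shown in the post, so here’s a hypothetical version with invented parameters (the names `fert`, `fert_factor`, `true`, and `yield` are chosen to match the models below, but this won’t reproduce the exact tables):

```r
set.seed(42)
fert <- rep(c(100, 200, 300), each = 5)   # intended N treatment (kg N / ha)
true <- fert + rnorm(15, sd = 30)         # realized soil N, imperfectly correlated
yield <- -10 + 0.7 * true - 0.0013 * true^2 + rnorm(15, sd = 5)
yield <- 100 * yield / max(yield)         # standardize to percent of maximum yield
df <- data.frame(fert, fert_factor = factor(fert), true, yield)
```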

Here’s the ANOVA model in R:

`m <- aov(yield ~ fert_factor, data = df)`

```
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
fert_factor 2 333.19 166.595 5.0086 0.02621 *
Residuals 12 399.14 33.262
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

According to the ANOVA, there is a significant effect of fertilizer on yield (p = 0.026)

Our results look like this:

Our sample size, n, is 5. Statistical power, 1 − β, is 0.39.²

But why not treat those concentrations as a continuous variable and instead fit a quadratic regression? A quadratic regression fits a line described by a quadratic function (a curve) through the relationship between fertilizer concentration and growth. Here’s what this model looks like:

yield = β₀ + β₁·fert + β₂·fert² + ε

This method is flexible. The relationship could be concave, convex, or increasing with a varying slope. If the true relationship is linear, then β₂ will be zero, and we’ll be left with the equation for a line.

There are two ways to write this model as R code. The first form is useful because the default behavior of `anova()` gives a single p-value for the effect of fertilizer.

```
m1a <- lm(yield ~ poly(fert, 2, raw = TRUE), data = df)
anova(m1a)
```

```
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
poly(fert, 2, raw = TRUE) 2 333.19 166.595 5.0086 0.02621 *
Residuals 12 399.14 33.262
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The second form is useful because it tells you if the quadratic term is significant (if it’s not, you might try just fitting a straight line). `I()` means “literally multiply; don’t fit an interaction term”.

```
m1b <- lm(yield ~ fert + I(fert * fert), data = df)
anova(m1b)
```

```
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
fert 1 141.80 141.805 4.2633 0.06125 .
I(fert * fert) 1 191.39 191.385 5.7539 0.03360 *
Residuals 12 399.14 33.262
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Either way, there is still a significant effect of fertilizer on yield.

Now n = 15, and our power, 1 − β, has gone up to 0.8. The power is doubled compared to the ANOVA design because of the greater effective sample size in the regression model.

But wait, there’s more! I haven’t told you something about the data I simulated. I generated data so that growth has a quadratic response to nitrogen concentration **in the soil**, but soil nitrogen isn’t perfectly correlated with the nitrogen applied. Your intended treatment is rarely what a plant is actually experiencing. So let’s say we can do even better than including the *intended* treatment as a continuous variable—let’s get the soil tested for nitrogen content and use **that** as an independent variable.

```
m2 <- lm(yield ~ true + I(true * true), data = df)
anova(m2)
```

```
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
true 1 219.48 219.479 10.186 0.007754 **
I(true * true) 1 254.28 254.278 11.801 0.004938 **
Residuals 12 258.57 21.548
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Now our power, 1 − β, is 0.99. It’s probably very often worth it to try to measure whatever latent variable mediates the effect of your treatment. In fact, that increased spread of your data is a *good thing* if you want to better describe the shape of the relationship between treatment and response.

1. Well, technically ANOVA *is* a regression.↩︎

2. Statistical power was estimated using the `pwr` package. Effect size was calculated using Cohen’s suggestions.↩︎

Even with transparency, the overplotted data points just turn into a smear on the top and bottom of your plot, adding little information. Here are three ways to get more information out of those points and produce more informative plots. But first, a quick introduction to the data.

I simulated some data on survival as a function of size. Survival is binary (1 = survived, 0 = died).

`head(df)`

```
# A tibble: 6 × 2
size surv
<dbl> <int>
1 4.78 0
2 4.40 1
3 5.02 1
4 5.32 1
5 4.61 0
6 4.81 1
```

`nrow(df)`

`[1] 1000`
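The simulation itself isn’t shown; a hypothetical way to generate similar data (my parameters, so the output above won’t match exactly) is:

```r
set.seed(1)
size <- rnorm(1000, mean = 5, sd = 0.5)  # plant sizes
p <- plogis(-8 + 2 * size)               # survival probability increases with size
surv <- rbinom(1000, 1, p)               # binary outcome: 1 = survived, 0 = died
df <- data.frame(size, surv)
```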

We can fit a logistic regression…

`m <- glm(surv ~ size, family = binomial, data = df)`

…and extract fitted values using `broom::augment()`

```
library(broom)
plot_df <- augment(m, type.predict = "response")
head(plot_df)
```

```
# A tibble: 6 × 8
surv size .fitted .resid .hat .sigma .cooksd .std.resid
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 4.78 0.746 -1.66 0.00125 0.970 0.00184 -1.66
2 1 4.40 0.601 1.01 0.00287 0.971 0.000958 1.01
3 1 5.02 0.819 0.633 0.00122 0.972 0.000136 0.633
4 1 5.32 0.884 0.496 0.00160 0.972 0.000105 0.496
5 0 4.61 0.685 -1.52 0.00169 0.971 0.00184 -1.52
6 1 4.81 0.756 0.748 0.00122 0.971 0.000197 0.748
```

These are the data I used for the plot above, with the points corresponding to `surv` and the best-fit line corresponding to `.fitted`:
```
library(ggplot2)
base <-
  ggplot(plot_df, aes(x = size)) +
  geom_line(aes(y = .fitted), color = "blue") +
  labs(x = "Size", y = "Survival")
base + geom_point(aes(y = surv), alpha = 0.2)
```

Turning those points into a “rug” is a common way of dealing with overplotting in logistic regression plots. `ggplot2` provides `geom_rug()`, but getting that rug to correspond to dead plants on the bottom and live plants on the top requires a little data manipulation. First, we’ll create separate columns for dead and alive plants that take the value of size only if the plant died or survived, respectively, and are otherwise `NA`.

```
plot_df <-
  plot_df %>%
  mutate(survived = ifelse(surv == 1, size, NA),
         died = ifelse(surv == 0, size, NA))
```

Then, we can plot these as separate layers.

```
base <-
  ggplot(plot_df, aes(x = size)) +
  geom_line(aes(y = .fitted), color = "blue") +
  labs(x = "Size", y = "Survival")
base +
  geom_rug(aes(x = died), sides = "b", alpha = 0.2) +
  geom_rug(aes(x = survived), sides = "t", alpha = 0.2)
```

Honestly, this is not a huge improvement. The overplotting is less of an issue and you can start to see the density of points a bit better, but it’s still not great.

I discovered this plot in Data-driven Modeling of Structured Populations by Ellner, Childs, and Rees. Their plot used base R graphics, but I’ll use `ggplot2` and `stat_summary_bin()` to get a mean survival value for binned size classes and plot those as points.

`base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv))`

I think this is fabulous! It definitely needs an explanation in a figure caption though, because what those points represent is not immediately obvious. Also, how close the points fit to the line has more to do with bin size than with model fit, so this one might be better for inspecting patterns than for evaluating fit.

```
library(patchwork) #for combining plots with |
base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv)) + labs(title = "bins = 30") |
  base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv), bins = 60) + labs(title = "bins = 60")
```

This option takes the ideas of binning values from #2 and showing distributions in the margins from #1 and combines them. I discovered this in a paper from my postdoc adviser, Emilio Bruna.

A function to make this third type of plot with base R graphics is available in the `popbio` package.

```
library(popbio)
logi.hist.plot(df$size, df$surv, boxp = FALSE, type = "hist", col = "gray")
```

Re-creating this with ggplot2 requires some hacks, and I’m still not all the way there.

```
base +
  geom_histogram(aes(x = died, y = stat(count) / 1000), bins = 30, na.rm = TRUE) +
  geom_histogram(aes(x = survived, y = -1 * stat(count / 1000)), bins = 30,
                 na.rm = TRUE, position = position_nudge(y = 1))
```

```
Warning: `stat(count)` was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
```

There are at least two “hacks” going on here. First, I’m using `stat()` to extract the bar heights automatically calculated by `stat_bin()`/`geom_histogram()` to scale the histogram down. Second, to get the histogram for survivors to be at the top I need to flip it upside down (by multiplying by -1) and move it to the top of the plot with `position_nudge()`. The downside to this plot is that there are technically **three** y-axes—the survival probability and the number or proportion in each size class for dead and alive individuals (with 0 at the bottom and top, respectively). You can add a second y-axis to a ggplot, but I’m not sure about a third y-axis.
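As a rough sketch of how that second axis could work (using built-in data rather than the post’s `plot_df`, so the variable names here are illustrative), `sec_axis()` defines the secondary scale as a transformation of the primary one, which is how a count axis can ride along with the 0–1 probability axis:

```
library(ggplot2)

# Sketch: a secondary y-axis in ggplot2 (illustrative data, not the post's plot_df).
# If histogram counts were scaled down by 1/1000 to fit the probability axis,
# the secondary axis undoes that scaling so it reads in raw counts.
p <- ggplot(mtcars, aes(x = wt, y = am)) +
  geom_point() +
  scale_y_continuous(
    name = "Survival probability",
    sec.axis = sec_axis(~ . * 1000, name = "Count")
  )
```

Note that `scale_y_continuous()` accepts only one `sec.axis`, which is why a third axis isn’t straightforward.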

If you know of another cool way to visualize logistic regressions, or know of some package that does all this for you, please let me know in the comments!

The `debug()` function in R and RStudio is a great way to do this, but it can be really intimidating. I made a short demo to show you how to get started with using `debug()` to peer into the inner workings of R functions.
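As a minimal sketch of the workflow (using a toy function, not one from the demo):

```
# Toy function to practice on
f <- function(x) {
  y <- x^2
  y + 1
}

debug(f)   # flag f: the next call to f() opens the interactive browser
# f(2)     # at the Browse prompt: `n` steps line by line, `ls()` shows locals, `c` continues
undebug(f) # unflag f when you're done
```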
Up to that point, I had assigned almost weekly homework assignments, but only given one exam so far. There was one more exam and a cumulative final exam on the syllabus at that point. I was also ahead of schedule in my lectures, so I was originally planning on doing exam 2, then adding some “special topics” lectures for the last couple of weeks before the final. But after COVID-19, none of that made sense anymore. I eventually decided to cancel exam 2 and forget about the extra lectures and just give the final and make exams a smaller percentage of the total grade.

I quickly realized I didn’t know how to do an online final exam, and that there wasn’t a previous year’s final exam that I could base mine off of. That’s because in previous years this course (taught by Elizabeth Crone) had had a final **project**—something I’d never assigned in any class before. I was skeptical that it would really replace a comprehensive final, but in the interest of being kind to my stressed students (and myself, who didn’t want to write an exam from scratch), I assigned the final project.

The project involved analysis of data provided by the Massachusetts Butterfly Club. Students would choose a butterfly species, and use the data for their species in a series of fairly well-defined analyses to determine if their species was increasing or decreasing in abundance and if its phenology might be shifting. So they would use the remaining weeks of class time to work on this analysis and ask questions. But it didn’t make sense to have everyone show up to a Zoom meeting and just work quietly every week. My TA, Avalon Owens, had the wonderful idea to split the students into 3 smaller working groups that we met with weekly. When someone had a problem or issue that was pertinent to the whole class, I would make a video about it and post it for the class—just like saying “Listen up everyone, Joe had a question I think you’ll all want to hear the answer to” in a *real* classroom.

The thing that surprised me most about this was how engaged the students were. The final project consisted of relatively structured questions with detailed instructions, like a long homework assignment, but by adding that tiny element of student choice (of butterfly species), I think it really transformed the attitude people took toward it. Student choice also had the added benefit of demonstrating a **variety** of real-world problems with data analysis that we wouldn’t have seen if everyone was working on the same data—quite a few of which stumped me. And, to a limited extent (limited because of quarantine) it did promote peer learning/teaching, with a couple of students sharing code to deal with convergence errors that were common with very rare butterfly species.

Not every student used every statistical tool we learned in that course, but taking all the final projects together, I think it represented just about the entirety of the curriculum! And students had a final report (produced in R Markdown) that was a real world data analysis project that they could use in a portfolio in future job interviews or graduate school applications. All this is to say that I’m a convert. Final projects can be an amazing assessment tool with so many benefits you wouldn’t get from a traditional exam **and** the added benefit of less stress for your students. Would this have worked as well in a class that was less about practical, applied skills? Maybe not, but I’m encouraged enough by this semester to at least consider a final project in every class I teach from now on.

I’ll be using data on the Northern Rocky Mountain grey wolf population. You can read more about the history of these wolves here.

```
library(tidyverse)
wolves <- read_csv("NRMwolves.csv") %>%
  mutate(year_post = year - 1982)
head(wolves)
```

```
# A tibble: 6 × 8
year num.wolves MT.wolves WY.wolves ID.wolves OR.wolves WA.wolves year_post
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1982 8 8 NA NA NA NA 0
2 1983 6 6 NA NA NA NA 1
3 1984 6 6 NA NA NA NA 2
4 1985 13 13 NA NA NA NA 3
5 1986 15 15 NA NA NA NA 4
6 1987 10 10 NA NA NA NA 5
```

Exponential growth describes unregulated reproduction and is given by the equation:

$$N_t = N_0 \lambda^t$$

where $\lambda$ is the population growth rate, $t$ is a number of time steps (e.g. years) and $N_0$ is the population at some initial time.

We can take advantage of a log-link to linearize this equation:

$$\ln(N_t) = \ln(N_0) + \ln(\lambda) \, t$$

Compare to a generic GLM equation with a log-link:

$$\ln(\hat{y}) = \beta_0 + \beta_1 x$$

Here’s the glm for an exponential growth model fit to the wolf data:

```
m_exp <- glm(num.wolves ~ year_post,
             family = poisson(link = "log"), data = wolves)
exp(coef(m_exp))
```

```
(Intercept) year_post
19.615307 1.176583
```

The backtransformed intercept is the estimate for $N_0$, the estimated number of wolves at `year_post` = 0.

The backtransformed coefficient for `year_post` is the estimate for $\lambda$, the population growth rate.
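To make the backtransformation concrete, here’s a quick check (coefficient values copied from the output above) that on the response scale the fitted model is just exponential growth:

```
# exp() of the model coefficients, copied from the output above
N0     <- 19.615307 # estimated wolves at year_post = 0
lambda <- 1.176583  # estimated yearly growth rate

# Predicted population 10 years after 1982: N_10 = N0 * lambda^10
N0 * lambda^10
```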

The exponential growth model fit above is an observation error model. It assumes variation from predicted values is due to inaccuracies in estimating the number of wolves.

A process error model estimates population growth that depends on current population size. This can be modeled as a rate, $\lambda$:

$$N_t = \lambda N_{t-1}$$

Again, we can use a log-link to linearize this:

$$\ln(N_t) = \ln(\lambda) + \ln(N_{t-1})$$

The $\ln(N_{t-1})$ term, which has no coefficient associated with it, is an **offset**. We can hack a glm to fit this model like so:

```
wolves2 <- wolves %>%
  mutate(num.prev = lag(num.wolves)) %>% # create a column of lagged wolf numbers
  filter(!is.na(num.prev))
m_process <- glm(num.wolves ~ 1, offset = log(num.prev),
                 family = poisson(link = "log"), data = wolves2)
```

The backtransformed intercept is the yearly rate of increase ($\lambda$):

`exp(coef(m_process))`

```
(Intercept)
1.108238
```

So, if there are 13 wolves in 1985, how many would it predict in 1986?

`1.108238 * 13`

`[1] 14.40709`

Finally, the most complicated, possibly mind blowing example of hacking a GLM. This one took me quite a while to wrap my head around.

A Ricker model takes carrying capacity into account and allows growth rate to change as the population increases. It approximates logistic growth.

$$N_t = N_{t-1} \, e^{r \left(1 - \frac{N_{t-1}}{K}\right)}$$

where $r = \ln(\lambda)$ and $K$ is the carrying capacity.

Linearizing using a log-link (please tell me if I got the math wrong in the comments):

$$\ln(N_t) = \ln(N_{t-1}) + r - \frac{r}{K} N_{t-1}$$

We can model this with the following GLM:

```
m_rick <- glm(num.wolves ~ num.prev, offset = log(num.prev),
              family = poisson, data = wolves2)
```

`coef(m_rick)[1]`

```
(Intercept)
0.3119624
```

`coef(m_rick)[2]`

```
num.prev
-0.0001683389
```

Which means that the estimated carrying capacity is $K = -\beta_0 / \beta_1$:

`-coef(m_rick)[1]/coef(m_rick)[2]`

```
(Intercept)
1853.18
```

🤯
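To convince yourself the algebra works, here’s a sketch that plugs the fitted coefficients (copied from the output above) back into the Ricker equation:

```
# Coefficients copied from the model output above
r <- 0.3119624     # intercept = r
b <- -0.0001683389 # slope on num.prev = -r/K
K <- -r / b        # carrying capacity, ~1853 wolves

# One-step Ricker prediction: N_t = N_{t-1} * exp(r + b * N_{t-1})
ricker <- function(N_prev) N_prev * exp(r + b * N_prev)

ricker(13) # near-exponential growth at low N
ricker(K)  # at carrying capacity the population exactly replaces itself (returns K)
```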

Tea Science Tuesdays are Instagram live streams where I’ll talk informally about some aspect of tea science while enjoying some tea. Each week, there will be a topic, a suggested tea if you want to drink along, and a suggested “reading” (sometimes a video).

Live streams will be at 9:00 AM eastern time @leafyeric. I know that time is probably not good for many people, but don’t worry, the streams will be saved and pinned to my Instagram profile and uploaded to a YouTube playlist so you can watch them later.

Here’s the “syllabus” for the next few weeks:

| Date | Topic | Tea | Reading |
|---|---|---|---|
| 10-Sep | Caffeine | White tea | [tinyurl.com/n7vdbfr](http://tinyurl.com/n7vdbfr) |
| 17-Sep | Aftertaste | Raw puer or any green tea | [tinyurl.com/y626ey8x](http://tinyurl.com/y626ey8x)* |
| 24-Sep | Aroma | Any puer or oolong | [youtu.be/uQMAwdARADM](http://youtu.be/uQMAwdARADM) |
| 1-Oct | ~~Bug-bitten tea~~ | ~~Eastern Beauty oolong~~ | [tinyurl.com/y6yvnmrn](http://tinyurl.com/y6yvnmrn) |
| 8-Oct | Climate change and tea | Your choice! | [tinyurl.com/yy6be5lb](http://tinyurl.com/yy6be5lb) |
| 15-Oct | Theanine | Gyokuro or sencha | [tinyurl.com/y5722po4](http://tinyurl.com/y5722po4) |
| 22-Oct | Fermented tea | Ripe puer or any dark tea | [tinyurl.com/yxhpmn95](http://tinyurl.com/yxhpmn95) |
| 29-Oct | Aged tea | Something old | NA |
| 5-Nov | Water for tea | Your choice! | NA |

* The full article is behind a paywall. DM me if you need help finding it.

My first R package, `holodeck`, was on its way to CRAN! It’s a humble package, providing a framework for quickly slapping together test data with different degrees of correlation between variables and differentiation among levels of a categorical variable.

```
# Example use of holodeck
library(holodeck)
library(dplyr)
df <-
  # make a categorical variable with 10 observations and 3 groups
  sim_cat(n_obs = 10, n_groups = 3, name = "Treatment") %>%
  # add 3 variables that covary
  sim_covar(n_vars = 3, var = 1, cov = 0.5) %>%
  # add 10 variables that don't covary, but discriminate levels of Treatment
  group_by(Treatment) %>%
  sim_discr(n_vars = 10, var = 1, cov = 0, group_means = c(-1, 0, 1)) %>%
  # sprinkle in some NAs
  sim_missing(prop = 0.02)
```

“First package” isn’t entirely correct. The functions in `holodeck` got their start in another package that’s really just for me. While working on a manuscript I ended up writing functions for simulating multivariate data. From the beginning, I planned to share code related to the manuscript when it (hopefully) is published, but my analysis code loaded my personal package that was only on my computer and included a bunch of other stuff that was probably only useful to me. At rstudio::conf19, I asked several attendees who worked in academic positions what I should do. The answer I heard was as long as my functions *might* be useful to others, I should publish my package to CRAN, then just cite the published package in my manuscript.

So I pulled the relevant functions into their own standalone package, which is now called `holodeck`, and began working on refining, documenting, and testing those functions to get the package ready for CRAN submission. The process of creating an R package and readying it for CRAN submission was more painless than I imagined! Here are some of the resources I used:

- The usethis package provides great tools for automating many things involved in package creation.
- The R Packages book by Hadley Wickham was a great guideline.
- Writing tests with the testthat package.
- I also had to learn a bit about tidyeval, because the functions I wrote were meant to work with `dplyr::group_by()`.
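As a hedged illustration of the tidyeval pattern (a hypothetical helper for this post, not a function from `holodeck`), the `{{ }}` “embrace” operator lets a wrapper function pass bare column names through to `dplyr` verbs:

```
library(dplyr)

# Hypothetical helper: summarize any numeric column within groups.
# {{ }} forwards the bare column names the caller supplies.
summarize_by <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarize(mean = mean({{ var }}), .groups = "drop")
}

summarize_by(iris, Species, Petal.Length)
```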

I hope that others find my package useful, but even if no one else uses it, I’m happy I went through the process. It was a great learning experience, and I’m excited about the possibility of publishing other packages in the future!

The data and code to repeat this analysis are available on GitHub. This is by no means a complete analysis of this dataset and I encourage others to use it. I think the concept of recipes as observations and ingredients as variables is a helpful metaphor for multivariate statistics in general.

Multivariate data means data with many things measured on the same samples or observations. In this example, recipes are the observations and the variables are the ingredients measured in US cups per serving. One common problem associated with multivariate data is that usually many of the variables are correlated.

For example, baking powder, salt, baking soda, oil, milk, spice, and fruit are all strongly correlated with each other. This is called “multicollinearity”. Multicollinearity causes problems for statistical techniques that assume variables are independent, like multiple regression.
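Here is a tiny simulated sketch (not the recipe data) of what multicollinearity looks like numerically—several variables all tracking one shared underlying quantity:

```
# Simulated sketch: three "ingredients" that all track a shared underlying amount
set.seed(42)
shared <- rnorm(30)
ingredients <- data.frame(
  baking_powder = shared + rnorm(30, sd = 0.2),
  salt          = shared + rnorm(30, sd = 0.2),
  milk          = shared + rnorm(30, sd = 0.2)
)
round(cor(ingredients), 2) # off-diagonal correlations are close to 1
```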

Other common difficulties presented by ecological multivariate data include the “curse of dimensionality” (more variables than observations), and missing values.

**Principal component analysis** (PCA) is a multivariate technique that aims to explain the variation in the ingredient amounts, but is **unsupervised**. That is, it’s totally agnostic to whether recipes are muffins or cupcakes. Imagine a cloud of points in 3D space. PCA is aiming to draw a line through the spread of that cloud of points. That line explains most of the variation in the data. That line is then rotated and called “principal component 1”. Perpendicular to that, principal component 2 is drawn to explain the second greatest amount of variation in the points. You could then project your points onto this new coordinate space and do some statistical test to determine if your groups (e.g. cupcake or muffin) are different along one or both of these principal components.
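For intuition, here’s a minimal PCA with base R’s `prcomp()` on built-in data (the post itself uses `ropls::opls()` for the actual analysis):

```
# Minimal PCA sketch on built-in data; variables are standardized first,
# analogous to the "standard scaling of predictors" reported by ropls
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by PC1 and PC2
summary(pca)$importance["Proportion of Variance", 1:2]
```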

Ecologists use unsupervised analyses like PCA all the time, for example to reduce the complexity or “dimensionality” of multivariate datasets like community composition or traits of organisms. But this strategy does not tell you if cupcakes are different from muffins. It tells you: 1) what ingredients vary the most among all cupcake and muffin recipes, and 2) do cupcakes and muffins differ in the amounts of **those** ingredients, which isn’t exactly the question we are trying to answer.

**Partial least squares regression** (PLS) and its discriminant analysis extension (PLS-DA) are **supervised** multivariate statistical techniques. That is, PLS knows about the Y variable (type of recipe) and instead of making a line through the spread in that cloud of points, PLS draws a line that explains the *difference* between cupcakes and muffins. This **actually** answers the question “are muffins and cupcakes different?” and tells you which ingredients are most responsible for that difference.

To date, supervised analyses like PLS are uncommon in ecology, even though this may often be the kind of question ecologists want to answer. Additionally, PLS is built to handle multicollinearity, the curse of dimensionality, and missing values, which makes it an excellent tool for analyzing ecological data!

For this blog post, I’m using a subset of the dataset with all frosting ingredients removed (because obviously cupcakes have frosting and muffins don’t). The reason I’m using a subset of only 30 recipes is to more accurately replicate the “curse of dimensionality” that is common in ecological data.

```
nofrosting.raw <-
  read_rds("nofrosting_wide.rds")
# can be found at github.com/Aariq/cupcakes-vs-muffins
set.seed(888)
nofrosting <-
  nofrosting.raw %>%
  sample_n(30) %>%
  # puts factor names in title case for prettier plots
  mutate(type = fct_relabel(type, tools::toTitleCase))
nofrosting
```

```
# A tibble: 30 × 42
type recipe_id agave baking…¹ bakin…² bran butter butte…³ cheese choco…⁴
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cupcake 145206 0 0.00174 8.68e-4 0 0 0 0 0
2 Cupcake 240140 0 0.000868 6.94e-4 0 0.0167 0 0 0
3 Cupcake 161019 0 0.000868 1.74e-3 0 0 0.0417 0 0.0339
4 Muffin 16945 0 0 0 0 0.0741 0 0 0
5 Muffin 228562 0 0.00116 2.31e-3 0 0.00694 0 0 0
6 Cupcake 215375 0 0.00130 6.51e-4 0 0 0 0 0
7 Cupcake 242474 0 0.00208 0 0 0 0 0 0
8 Cupcake 233538 0 0.00312 5.21e-4 0 0.05 0 0 0
9 Cupcake 155534 0 0.00231 5.79e-4 0 0 0 0 0
10 Muffin 6753 0 0.00694 0 0 0 0 0 0.0625
# … with 20 more rows, 32 more variables: cornmeal <dbl>, cream <dbl>,
# `cream cheese` <dbl>, eggs <dbl>, flour <dbl>, frosting <dbl>, fruit <dbl>,
# `fruit juice` <dbl>, honey <dbl>, `low-cal sweetener` <dbl>,
# margarine <dbl>, mayonnaise <dbl>, milk <dbl>, molasses <dbl>, nut <dbl>,
# oats <dbl>, oil <dbl>, other <dbl>, salt <dbl>, shortening <dbl>,
# `sour cream` <dbl>, spice <dbl>, starch <dbl>, sugar <dbl>, syrup <dbl>,
# unitless <dbl>, vanilla <dbl>, vegetable <dbl>, vinegar <dbl>, …
```

I’ll be using the `ropls` package to do both PCA and PLS-DA. See the documentation for that package for more info on how to use it.

`library(ropls)`

PCA, an unsupervised analysis, answers the question “what ingredients vary among all muffin and cupcake recipes?”

```
baked.pca <-
  opls(
    dplyr::select(nofrosting, -type, -recipe_id), # the data
    fig.pdfC = "none" # suppresses default plot
  )
```

```
PCA
30 samples x 29 variables
standard scaling of predictors
11 excluded variables (near zero variance)
R2X(cum) pre ort
Total 0.513 5 0
```

A few ingredients get dropped because none of the recipes in my random sample of 30 have those ingredients. Notice that “type” is excluded in the PCA. PCA is totally agnostic to whether a recipe is for muffins or cupcakes.

Principal component 1 (PC1) represents a spectrum of leavening system. PC1 is negatively correlated with baking soda and some acidic ingredients like yogurt, sour cream, and cream cheese. PC1 is positively correlated with baking powder and milk. If you’re a baker, this makes sense because baking powder is just baking soda plus some powdered acid. If you have an acidic batter, then you can use baking soda.

Principal component 2 is a “healthiness” axis going from savory/healthy at the top to sweet/unhealthy at the bottom.

There is **no separation** between muffins and cupcakes along PC1 (leavening system) even though that’s where the most variation is. There is *slight* separation along the healthiness axis with muffins tending to be a little more healthy than cupcakes.

*BUT* this doesn’t answer the question of whether cupcakes and muffins are different. It answers a slightly different question: “Do cupcakes and muffins differ in the ingredients that vary the most among all the recipes combined?”

PLS-DA looks for a combination of ingredients that best explains categorization as cupcake or muffin. For this dataset the `opls()` function finds a single significant predictive axis. For the sake of plotting something, I ask it to do orthogonal PLS-DA, which creates a second axis that represents variation **not** related to the type of baked good.

```
baked.plsda <-
  opls(
    dplyr::select(nofrosting, -type, -recipe_id), # X data
    nofrosting$type, # Y data
    fig.pdfC = "none", # suppresses default plotting
    predI = 1, # make one predictive axis
    orthoI = 1, # and one orthogonal axis
    permI = 200 # use 200 permutations to generate a p-value
  )
```

```
OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
11 excluded variables (near zero variance)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.189 0.832 0.58 0.214 1 1 0.005 0.005
```

This output gives us some important properties of the model. `R2X(cum)` is the proportion of variation in the data explained by the predictive axes. `R2Y(cum)`, on the other hand, is the proportion of variation in **baked good type** explained by the model. The predictive axis of the PLS-DA explains only 9.44% of the total variation in ingredients, but the model explains **83%** of the difference between cupcakes and muffins! `Q2(cum)` is calculated through cross-validation and can be thought of as the predictive power of the model. `Q2(cum)` is always smaller than `R2Y(cum)`, but the larger it is, and the closer it is to `R2Y(cum)`, the better. A large value indicates strong predictive power. `RMSEE` is the root mean squared error of estimation, a measure of error in the same units as the Y variable, which is not super useful in this case since our Y variable is categorical. `pre` and `ort` are just how many predictive and orthogonal components were used. Finally, the two p-values are generated through permutation—the data labels (muffin or cupcake) are shuffled randomly and the PLS-DA is re-fit. These p-values are the proportion of those 200 random datasets that generate `R2Y` and `Q2` values as good or better than the real data.

So, we can conclude that cupcakes **are** different from muffins (p = 0.005)!

Let’s see what ingredients contribute most to this difference.

Clearly, the more vanilla there is in a recipe, the more likely it is to be a cupcake. Conversely, the more fruit, flour and salt there is in a recipe, the more likely it is to be a muffin.

PCA and PLS-DA give different results because they are answering different questions. In this case, the ingredients that vary the most among baked goods are not the same variables that best distinguish muffins from cupcakes. If you want to know what ingredients vary the most among all the recipes, use an unsupervised analysis like PCA. If you want to know what makes cupcakes different from muffins, use a supervised analysis like PLS-DA.

In ecology, we often measure multiple traits of organisms and expect high levels of variation among individuals in a population. The most highly variable traits are not necessarily ones that correlate with some Y variable such as elevation, genotype, or some experimental treatment imposed by researchers. Therefore, it doesn’t make sense to expect PCA to find relationships with that Y variable. If you’re asking a question about multivariate relationships to some Y variable (e.g. how plant metabolites change with elevation), it makes sense to use PLS.

Thanks to Elizabeth Crone for comments on a draft of this post and for encouraging me to do *serious science* using muffin and cupcake recipes!

Shiny apps are interactive web apps that run on R code, and there was a big focus on Shiny development at the conference this year. Almost everyone I talked to was using Shiny in their jobs, including creating dashboards, interactive exploratory data analysis, guiding industry researchers through statistical analyses, and teaching-focused apps built on the `learnr` package. There was also a lot of focus on scaling Shiny apps so many users could access apps simultaneously without significant slowdown.

I’ve been toying with the idea of creating a Shiny app to help with my own work in doing some data quality checks on GC/MS data, and this gave me the inspiration to commit to doing it!

I’ve taught an intro to R for Biostatistics course twice now, and both times the first day of class feels like 80% fixing package installation errors. RStudio Cloud allows students to access RStudio through a web interface, without downloading or installing anything. It also lets instructors set up project spaces with all the necessary packages **already installed**. This allows you to start the first day off with fun stuff, like data visualization, and save the lessons about CRAN and troubleshooting package installations for later. Not only can you set up environments for students to work in, you can also peek into their environments. That means no more “I can maybe help you if you send me your code” emails!

“I wish I’d left this code across scattered .R files instead of combining it into a package” said no one ever #rstats http://t.co/udeNH4T67H

— David Robinson (@drob) June 19, 2015

I had already taken this advice and built a package for myself with all the functions that I’ve written and used in multiple projects. I called it `chemhelper` and put it up on GitHub, just in case someone else would find it useful. Now I’m working on a manuscript that uses some of the functions in this package, and I needed advice on what to do to make my analysis reproducible and archivable upon submitting it. You see, `webchem` is very much a work in progress, so if I were to archive analysis code that relied on it, it would likely be broken very quickly and therefore not reproducible. One option is submitting my package to CRAN and then recording version information in the analysis code or using something like packrat. The advice I got over and over was **if your package is potentially useful to people other than you, put it on CRAN**. I’ve already started pulling out the functions that are useful to others and plan on submitting a package to CRAN before submitting my manuscript!

Finally, the keynote by @felienne was **phenomenal**! You should watch it regardless of your area of interest—it’s *that* kind of talk. With hand-drawn slides and an incredible stage presence, Dr. Felienne Hermans explored the weirdness of how we teach programming. For example, you wouldn’t just hand someone a guitar and say “the best way to learn is to just try changing something and see what happens!” and you also wouldn’t tell a child riding a bike with training wheels “that’s not real biking!”, but we do both of these things regularly when interacting with beginner programmers.

Most importantly, we know empirically that reading out loud (phonics) is a good way to learn languages, and that *should include* programming languages. I realized that part of the value in teaching tidyverse first is that you can and *should* read tidyverse code out loud. I’m definitely going to make classrooms read code out loud in the future.

```
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarize(mean_petal_length = mean(Petal.Length),
            sd_petal_length = sd(Petal.Length))
```

*Say it with me, class:*

Take the iris dataset, then

group it by Species, then

summarize it by taking the mean and standard deviation of petal length

**What was your biggest takeaway from RStudio::conf 2019? Let me know in the comments!**