Cupcakes vs Muffins: Round 2

data-science
multivariate-statistics
Published

March 21, 2019

Have you ever pondered whether a muffin is really a breakfast food and not just an excuse to eat cake first thing in the morning? Well, you’ve come to the right blog post! In a previous post, I explained how I created a dataset of the ingredients of 269 cupcake and muffin recipes. In this installment, I’m going to use that dataset to demonstrate some of the important properties of multivariate statistics, specifically the difference between principal component analysis (PCA) and partial least squares regression (PLS).

The data and code to repeat this analysis is available on GitHub. This is by no-means a complete analysis of this dataset and I encourage others to use it. I think the concept of recipes as observations and ingredients as variables is a helpful metaphor for multivariate statistics in general.

Multivariate data

Multivariate data means data with many things measured on the same samples or observations. In this example, recipes are the observations and the variables are the ingredients measured in US cups per serving. One common problem associated with multivariate data is that usually many of the variables are correlated.

Correlation plot made using the corrr package

For example, baking powder, salt, baking soda, oil, milk, spice, and fruit are all strongly correlated with each other. This is called “multicollinearity”. Multicollinearity causes problems for statistical techniques that assume variables are independent, like multiple regression.

Other common difficulties presented by ecological multivariate data include the “curse of dimensionality” (more variables than observations), and missing values.

Unsupervised and Supervised analyses

Principal component analysis (PCA) is a multivariate technique that aims to explain the variation in the ingredient amounts, but is unsupervised. That is, it’s totally agnostic to whether recipes are muffins or cupcakes. Imagine a cloud of points in 3D space. PCA is aiming to draw a line through the spread of that cloud of points. That line explains most of the variation in the data. That line is then rotated and called “principal component 1”. Perpendicular to that, principal component 2 is drawn to explain the second greatest amount of variation in the points. You could then project your points onto this new coordinate space and do some statistical test to determine if your groups (e.g. cupcake or muffin) are different along one or both of these principal components.

Left panel: a 3D graph with three axes labeled variable 1, 2, and 3.  There are red and blue points that cluster in an oval region with the blue and red points mostly split down the middle of the oval.  A dotted line is drawn diagonally through the cloud of points along the longest part of the point cluster.  On the right, a 2D plot of the first 2 principal component axes.  The red and blue points partially overlap on this graph, but are partially separated vertically.

Conceptual diagram of PCA

Ecologists use unsupervised analyses like PCA all the time, for example to reduce the complexity or “dimensionality” of multivariate datasets like community composition or traits of organisms. But this strategy does not tell you if cupcakes are different from muffins. It tells you: 1) what ingredients vary the most among all cupcake and muffin recipes, and 2) do cupcakes and muffins differ in the amounts of those ingredients, which isn’t exactly the question we are trying to answer.

Partial least squares regression (PLS) and its discriminant analysis extension (PLS-DA) are supervised multivariate statistical techniques. That is, PLS knows about the Y variable (type of recipe) and instead of making a line through the spread in that cloud of points, PLS draws a line that explains the difference between cupcakes and muffins. This actually answers the question “are muffins and cupcakes different?” and tells you which ingredients are most responsible for that difference.

To date, supervised analyses like PLS are uncommon in ecology, even though this may often be the kind of question ecologists want to answer. Additionally, PLS is built to handle multicollinearity, the curse of dimensionality, and missing values, which makes it an excellent tool for analyzing ecological data!

On the left, the same cloud of red and blue points in a 3D graph as shown in the previous image, but this time the dotted line is drawn so that it crosses the shortest path across the point cloud with the blue points mostly at one and of the line and the red points mostly at the other end.  On the right, the blue and red points are now totally separated horizontally along an axis labeled "P1"

Conceptual diagram of PLS-DA

The Data

For this blog post, I’m using a subset of the dataset with all frosting ingredients removed (because obviously cupcakes have frosting and muffins don’t). The reason I’m using a subset of only 30 recipes is to more accurately replicate the “curse of dimensionality” that is common in ecological data.

nofrosting.raw <-
  read_rds("nofrosting_wide.rds")
#can be found at github.com/Aariq/cupcakes-vs-muffins

set.seed(888)
nofrosting <-
  nofrosting.raw %>%
  sample_n(30) %>%  
  #puts factor names in title case for prettier plots
  mutate(type = fct_relabel(type, tools::toTitleCase))
nofrosting
# A tibble: 30 × 42
   type    recipe_id agave baking…¹ bakin…²  bran  butter butte…³ cheese choco…⁴
   <fct>   <chr>     <dbl>    <dbl>   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
 1 Cupcake 145206        0 0.00174  8.68e-4     0 0        0           0  0     
 2 Cupcake 240140        0 0.000868 6.94e-4     0 0.0167   0           0  0     
 3 Cupcake 161019        0 0.000868 1.74e-3     0 0        0.0417      0  0.0339
 4 Muffin  16945         0 0        0           0 0.0741   0           0  0     
 5 Muffin  228562        0 0.00116  2.31e-3     0 0.00694  0           0  0     
 6 Cupcake 215375        0 0.00130  6.51e-4     0 0        0           0  0     
 7 Cupcake 242474        0 0.00208  0           0 0        0           0  0     
 8 Cupcake 233538        0 0.00312  5.21e-4     0 0.05     0           0  0     
 9 Cupcake 155534        0 0.00231  5.79e-4     0 0        0           0  0     
10 Muffin  6753          0 0.00694  0           0 0        0           0  0.0625
# … with 20 more rows, 32 more variables: cornmeal <dbl>, cream <dbl>,
#   `cream cheese` <dbl>, eggs <dbl>, flour <dbl>, frosting <dbl>, fruit <dbl>,
#   `fruit juice` <dbl>, honey <dbl>, `low-cal sweetener` <dbl>,
#   margarine <dbl>, mayonnaise <dbl>, milk <dbl>, molasses <dbl>, nut <dbl>,
#   oats <dbl>, oil <dbl>, other <dbl>, salt <dbl>, shortening <dbl>,
#   `sour cream` <dbl>, spice <dbl>, starch <dbl>, sugar <dbl>, syrup <dbl>,
#   unitless <dbl>, vanilla <dbl>, vegetable <dbl>, vinegar <dbl>, …

I’ll be using the ropls package to do both PCA and PLS-DA. See the documentation for that package for more info on how to use it.

library(ropls)

PCA: What ingredients vary the most among all recipes combined?

PCA, an unsupervised analysis, answers the question “what ingredients vary among all muffin and cupcake recipes?”

baked.pca <-
  opls(
    dplyr::select(nofrosting, -type, -recipe_id), #the data
       fig.pdfC = "none" #suppresses default plot
    )
PCA
30 samples x 29 variables
standard scaling of predictors
11 excluded variables (near zero variance)
      R2X(cum) pre ort
Total    0.513   5   0

A few ingredients get dropped because none of the recipes in my random sample of 30 have those ingredients. Notice that “type” is excluded in the PCA. PCA is totally agnostic to whether a recipe is for muffins or cupcakes.

PCA score and loading plot. Muffins recipes (brown) separate only slightly from cupcake recipes (red) along PC2

Principal component 1 (PC1) represents a spectrum of leavening system. PC1 is negatively correlated with baking soda and some acidic ingredients like yogurt, sour cream, and cream cheese. PC1 is positively correlated with baking powder and milk. If you’re a baker, this makes sense because baking powder is just baking soda plus some powdered acid. If you have an acidic batter, then you can use baking soda.

Principal component 2 is a “healthiness” axis going from savory/healthy at the top to sweet/unhealthy at the bottom.

There is no separation between muffins and cupcakes along PC1 (leavening system) even though that’s where the most variation is. There is slight separation along the healthiness axis with muffins tending to be a little more healthy than cupcakes.

BUT this doesn’t answer the question of whether cupcakes and muffins are different. It answers a slightly different question: “Do cupcakes and muffins differ in the ingredients that vary the most among all the recipes combined?”

PLS-DA: Are cupcakes different from muffins?

PLS-DA looks for a combination of ingredients that best explains categorization as cupcake or muffin. For this dataset the opls() function finds a single significant predictive axis. For the sake of plotting something, I ask it to do orthogonal PLS-DA, which creates a second axis that represents variation not related to the type of baked good.

baked.plsda <-
  opls(
    dplyr::select(nofrosting, -type, -recipe_id), #X data
    nofrosting$type, #Y data
    fig.pdfC = "none", #suppresses default plotting
    predI = 1, #make one predictive axis
    orthoI = 1, #and one orthogonal axis
    permI = 200) #use 200 permutations to generate a p-value
OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
11 excluded variables (near zero variance)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort  pR2Y   pQ2
Total    0.189    0.832    0.58 0.214   1   1 0.005 0.005

This output gives us some important properties of the model. R2X(cum) is the proportion of variation in the data explained by the predictive axes. R2Y(cum), on the other hand, is the proportion of variation in baked good type explained by the model. PLS-DA only explains 9.44% of total variation, but explains 83% of the difference between cupcakes and muffins! Q2(cum) is calculated through cross-validation and can be thought of as the predictive power of the model. Q2(cum) is always smaller than R2Y(cum), but the larger it is, and the closer it is to R2Y(cum), the better. A large \(Q^2\) value indicates strong predictive power. RMSEE is the root mean squared error of estimation, a measure of error in the same units as the Y variable, which is not super useful in this case since our Y variable is categorical. pre and ort are just how many predictive and orthogonal components were used. Finally, the two p-values are generated through permutation—the data labels (muffin or cupcake) are shuffled randomly and the PLS-DA is re-fit. These p-values are the proportion of those 200 random datasets that generate \(R^2_Y\) and \(Q^2\) values as good or better than the real data.

So, we can conclude that cupcakes are different than muffins (p < 0.005)!

Let’s see what ingredients contribute most to this difference.

PLS-DA score and loading plot. Muffins (brown) and cupcakes (red) are significantly different!

Clearly, the more vanilla there is in a recipe, the more likely it is to be a cupcake. Conversely, the more fruit, flour and salt there is in a recipe, the more likely it is to be a muffin.

Use the right tools for the job!

PCA and PLS-DA give different results because they are answering different questions. In this case, the ingredients that vary the most among baked goods are not the same variables that best distinguish muffins from cupcakes. If you want to know what ingredients vary the most among all the recipes, use an unsupervised analysis like PCA. If you want to know what makes cupcakes different from muffins, use a supervised analysis like PLS-DA

In ecology, we often measure multiple traits of organisms and expect high levels of variation among individuals in a population. The most highly variable traits are not necessarily ones that correlate with some Y variable such as elevation, genotype, or some experimental treatment imposed by researchers. Therefore, it doesn’t make sense to expect PCA to find relationships with that Y variable. If you’re asking a question about multivariate relationships to some Y variable (e.g. how plant metabolites change with elevation), it makes sense to use PLS.

Acknowledgments

Thanks to Elizabeth Crone for comments on a draft of this post and for encouraging me to do serious science using muffin and cupcake recipes!