corrr
packageCupcakes vs Muffins: Round 2
Have you ever pondered whether a muffin is really a breakfast food and not just an excuse to eat cake first thing in the morning? Well, you’ve come to the right blog post! In a previous post, I explained how I created a dataset of the ingredients of 269 cupcake and muffin recipes. In this installment, I’m going to use that dataset to demonstrate some of the important properties of multivariate statistics, specifically the difference between principal component analysis (PCA) and partial least squares regression (PLS).
The data and code to repeat this analysis is available on GitHub. This is by no-means a complete analysis of this dataset and I encourage others to use it. I think the concept of recipes as observations and ingredients as variables is a helpful metaphor for multivariate statistics in general.
Multivariate data
Multivariate data means data with many things measured on the same samples or observations. In this example, recipes are the observations and the variables are the ingredients measured in US cups per serving. One common problem associated with multivariate data is that usually many of the variables are correlated.
For example, baking powder, salt, baking soda, oil, milk, spice, and fruit are all strongly correlated with each other. This is called “multicollinearity”. Multicollinearity causes problems for statistical techniques that assume variables are independent, like multiple regression.
Other common difficulties presented by ecological multivariate data include the “curse of dimensionality” (more variables than observations), and missing values.
Unsupervised and Supervised analyses
Principal component analysis (PCA) is a multivariate technique that aims to explain the variation in the ingredient amounts, but is unsupervised. That is, it’s totally agnostic to whether recipes are muffins or cupcakes. Imagine a cloud of points in 3D space. PCA is aiming to draw a line through the spread of that cloud of points. That line explains most of the variation in the data. That line is then rotated and called “principal component 1”. Perpendicular to that, principal component 2 is drawn to explain the second greatest amount of variation in the points. You could then project your points onto this new coordinate space and do some statistical test to determine if your groups (e.g. cupcake or muffin) are different along one or both of these principal components.
Ecologists use unsupervised analyses like PCA all the time, for example to reduce the complexity or “dimensionality” of multivariate datasets like community composition or traits of organisms. But this strategy does not tell you if cupcakes are different from muffins. It tells you: 1) what ingredients vary the most among all cupcake and muffin recipes, and 2) do cupcakes and muffins differ in the amounts of those ingredients, which isn’t exactly the question we are trying to answer.
Partial least squares regression (PLS) and its discriminant analysis extension (PLS-DA) are supervised multivariate statistical techniques. That is, PLS knows about the Y variable (type of recipe) and instead of making a line through the spread in that cloud of points, PLS draws a line that explains the difference between cupcakes and muffins. This actually answers the question “are muffins and cupcakes different?” and tells you which ingredients are most responsible for that difference.
To date, supervised analyses like PLS are uncommon in ecology, even though this may often be the kind of question ecologists want to answer. Additionally, PLS is built to handle multicollinearity, the curse of dimensionality, and missing values, which makes it an excellent tool for analyzing ecological data!
The Data
For this blog post, I’m using a subset of the dataset with all frosting ingredients removed (because obviously cupcakes have frosting and muffins don’t). The reason I’m using a subset of only 30 recipes is to more accurately replicate the “curse of dimensionality” that is common in ecological data.
<-
nofrosting.raw read_rds("nofrosting_wide.rds")
#can be found at github.com/Aariq/cupcakes-vs-muffins
set.seed(888)
<-
nofrosting %>%
nofrosting.raw sample_n(30) %>%
#puts factor names in title case for prettier plots
mutate(type = fct_relabel(type, tools::toTitleCase))
nofrosting
# A tibble: 30 × 42
type recipe_id agave baking…¹ bakin…² bran butter butte…³ cheese choco…⁴
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cupcake 145206 0 0.00174 8.68e-4 0 0 0 0 0
2 Cupcake 240140 0 0.000868 6.94e-4 0 0.0167 0 0 0
3 Cupcake 161019 0 0.000868 1.74e-3 0 0 0.0417 0 0.0339
4 Muffin 16945 0 0 0 0 0.0741 0 0 0
5 Muffin 228562 0 0.00116 2.31e-3 0 0.00694 0 0 0
6 Cupcake 215375 0 0.00130 6.51e-4 0 0 0 0 0
7 Cupcake 242474 0 0.00208 0 0 0 0 0 0
8 Cupcake 233538 0 0.00312 5.21e-4 0 0.05 0 0 0
9 Cupcake 155534 0 0.00231 5.79e-4 0 0 0 0 0
10 Muffin 6753 0 0.00694 0 0 0 0 0 0.0625
# … with 20 more rows, 32 more variables: cornmeal <dbl>, cream <dbl>,
# `cream cheese` <dbl>, eggs <dbl>, flour <dbl>, frosting <dbl>, fruit <dbl>,
# `fruit juice` <dbl>, honey <dbl>, `low-cal sweetener` <dbl>,
# margarine <dbl>, mayonnaise <dbl>, milk <dbl>, molasses <dbl>, nut <dbl>,
# oats <dbl>, oil <dbl>, other <dbl>, salt <dbl>, shortening <dbl>,
# `sour cream` <dbl>, spice <dbl>, starch <dbl>, sugar <dbl>, syrup <dbl>,
# unitless <dbl>, vanilla <dbl>, vegetable <dbl>, vinegar <dbl>, …
I’ll be using the ropls
package to do both PCA and PLS-DA. See the documentation for that package for more info on how to use it.
library(ropls)
PCA: What ingredients vary the most among all recipes combined?
PCA, an unsupervised analysis, answers the question “what ingredients vary among all muffin and cupcake recipes?”
<-
baked.pca opls(
::select(nofrosting, -type, -recipe_id), #the data
dplyrfig.pdfC = "none" #suppresses default plot
)
PCA
30 samples x 29 variables
standard scaling of predictors
11 excluded variables (near zero variance)
R2X(cum) pre ort
Total 0.513 5 0
A few ingredients get dropped because none of the recipes in my random sample of 30 have those ingredients. Notice that “type” is excluded in the PCA. PCA is totally agnostic to whether a recipe is for muffins or cupcakes.
Principal component 1 (PC1) represents a spectrum of leavening system. PC1 is negatively correlated with baking soda and some acidic ingredients like yogurt, sour cream, and cream cheese. PC1 is positively correlated with baking powder and milk. If you’re a baker, this makes sense because baking powder is just baking soda plus some powdered acid. If you have an acidic batter, then you can use baking soda.
Principal component 2 is a “healthiness” axis going from savory/healthy at the top to sweet/unhealthy at the bottom.
There is no separation between muffins and cupcakes along PC1 (leavening system) even though that’s where the most variation is. There is slight separation along the healthiness axis with muffins tending to be a little more healthy than cupcakes.
BUT this doesn’t answer the question of whether cupcakes and muffins are different. It answers a slightly different question: “Do cupcakes and muffins differ in the ingredients that vary the most among all the recipes combined?”
PLS-DA: Are cupcakes different from muffins?
PLS-DA looks for a combination of ingredients that best explains categorization as cupcake or muffin. For this dataset the opls()
function finds a single significant predictive axis. For the sake of plotting something, I ask it to do orthogonal PLS-DA, which creates a second axis that represents variation not related to the type of baked good.
<-
baked.plsda opls(
::select(nofrosting, -type, -recipe_id), #X data
dplyr$type, #Y data
nofrostingfig.pdfC = "none", #suppresses default plotting
predI = 1, #make one predictive axis
orthoI = 1, #and one orthogonal axis
permI = 200) #use 200 permutations to generate a p-value
OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
11 excluded variables (near zero variance)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.189 0.832 0.58 0.214 1 1 0.005 0.005
This output gives us some important properties of the model. R2X(cum)
is the proportion of variation in the data explained by the predictive axes. R2Y(cum)
, on the other hand, is the proportion of variation in baked good type explained by the model. PLS-DA only explains 9.44% of total variation, but explains 83% of the difference between cupcakes and muffins! Q2(cum)
is calculated through cross-validation and can be thought of as the predictive power of the model. Q2(cum)
is always smaller than R2Y(cum)
, but the larger it is, and the closer it is to R2Y(cum)
, the better. A large \(Q^2\) value indicates strong predictive power. RMSEE
is the root mean squared error of estimation, a measure of error in the same units as the Y variable, which is not super useful in this case since our Y variable is categorical. pre
and ort
are just how many predictive and orthogonal components were used. Finally, the two p-values are generated through permutation—the data labels (muffin or cupcake) are shuffled randomly and the PLS-DA is re-fit. These p-values are the proportion of those 200 random datasets that generate \(R^2_Y\) and \(Q^2\) values as good or better than the real data.
So, we can conclude that cupcakes are different than muffins (p < 0.005)!
Let’s see what ingredients contribute most to this difference.
Clearly, the more vanilla there is in a recipe, the more likely it is to be a cupcake. Conversely, the more fruit, flour and salt there is in a recipe, the more likely it is to be a muffin.
Use the right tools for the job!
PCA and PLS-DA give different results because they are answering different questions. In this case, the ingredients that vary the most among baked goods are not the same variables that best distinguish muffins from cupcakes. If you want to know what ingredients vary the most among all the recipes, use an unsupervised analysis like PCA. If you want to know what makes cupcakes different from muffins, use a supervised analysis like PLS-DA
In ecology, we often measure multiple traits of organisms and expect high levels of variation among individuals in a population. The most highly variable traits are not necessarily ones that correlate with some Y variable such as elevation, genotype, or some experimental treatment imposed by researchers. Therefore, it doesn’t make sense to expect PCA to find relationships with that Y variable. If you’re asking a question about multivariate relationships to some Y variable (e.g. how plant metabolites change with elevation), it makes sense to use PLS.
Acknowledgments
Thanks to Elizabeth Crone for comments on a draft of this post and for encouraging me to do serious science using muffin and cupcake recipes!