library(tidyverse)
my_matrix <- matrix(rnorm(100), nrow = 10)
Working with matrix-columns in tibbles
What’s a matrix-column?
The tibble package in R allows for the construction of “tibbles”, a sort of “enhanced” data frame. Most of these enhancements are fairly mundane, such as better printing in the console and not modifying column names. One of the unique features of tibbles is the ability to have a column that is a list. List-columns have been written about fairly extensively, as they are a very cool way of working with data in the tidyverse. A less commonly known feature is that matrix-columns are also possible in a tibble. A matrix-column is a column of a tibble that is itself an \(n \times m\) matrix. Because a matrix-column is simultaneously a single column (of a tibble) and \(m\) columns (of the matrix), there are some quirks to working with them.
Creating a matrix-column
Data frames and tibbles handle matrix inputs differently. data.frame() adds an \(n \times m\) matrix as \(m\) separate columns of a data frame, while tibble() creates a single matrix-column.
With data.frame() there is no matrix-column, just regular columns named mat_col.1, mat_col.2, and so on:
df <- data.frame(x = letters[1:10], mat_col = my_matrix)
dim(df)
[1] 10 11
colnames(df)
[1] "x" "mat_col.1" "mat_col.2" "mat_col.3" "mat_col.4"
[6] "mat_col.5" "mat_col.6" "mat_col.7" "mat_col.8" "mat_col.9"
[11] "mat_col.10"
Creating a matrix-column requires using tibble() instead of data.frame():
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
dim(tbl)
[1] 10 2
colnames(tbl)
[1] "x" "mat_col"
You can also “group” columns of a data frame or tibble into a matrix-column using dplyr.
df_mat_col <-
  df %>%
  mutate(matrix_column = as.matrix(select(., starts_with("mat_col.")))) %>%
  # then remove the originals
  select(-starts_with("mat_col."))
This creates a matrix-column, and the column names of the matrix itself come from the original data frame (i.e. df).
colnames(df_mat_col)
[1] "x" "matrix_column"
colnames(df_mat_col$matrix_column)
[1] "mat_col.1" "mat_col.2" "mat_col.3" "mat_col.4" "mat_col.5"
[6] "mat_col.6" "mat_col.7" "mat_col.8" "mat_col.9" "mat_col.10"
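On newer versions of dplyr, the same grouping can be written without the magrittr dot placeholder. This is only a minimal sketch, assuming dplyr 1.1.0 or later (where pick() is available). The object names mirror the ones used above, but the data are regenerated so the example stands alone.

```r
library(dplyr)

# Recreate a data frame like `df` above: x plus columns mat_col.1 ... mat_col.10
my_matrix <- matrix(rnorm(100), nrow = 10)
df <- data.frame(x = letters[1:10], mat_col = my_matrix)

df_mat_col <- df %>%
  # pick() selects columns with tidyselect helpers inside mutate()
  mutate(matrix_column = as.matrix(pick(starts_with("mat_col.")))) %>%
  # then remove the originals
  select(!starts_with("mat_col."))

colnames(df_mat_col)
dim(df_mat_col$matrix_column)
```

The result should have the same shape as before: one x column and one matrix-column whose own column names come from the original data frame.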
When do you need a matrix-column?
Matrix-columns are sometimes useful in modeling, when a predictor or covariate is not just a single variable, but a vector for every observation. For example, in multivariate analyses, certain packages (e.g. ropls) require a matrix as an input. Functional models, which fit continuous functions of some variable (e.g. over time) as a covariate, are another example. (One specific example is distributed lag non-linear models, which I hope to start blogging about soon.) For instance, a matrix-column can be passed directly to prcomp() through a formula:
pca <- prcomp(~ mat_col, data = tbl)
summary(pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.8022 1.6779 1.5645 1.3203 1.02222 0.77201 0.51162
Proportion of Variance 0.2647 0.2295 0.1995 0.1421 0.08517 0.04858 0.02134
Cumulative Proportion  0.2647 0.4942 0.6937 0.8358 0.92096 0.96954 0.99087
                           PC8     PC9      PC10
Standard deviation     0.31635 0.10918 5.838e-18
Proportion of Variance 0.00816 0.00097 0.000e+00
Cumulative Proportion  0.99903 1.00000 1.000e+00
Viewing and using matrix-columns
Matrix-columns are… weird, and as such they have some quirks in how they are printed in RStudio. Some of these may be bugs, but as far as I know, there aren’t any reported issues related to matrix-columns at the time of writing this post. If you are using paged printing of data frames in R Markdown documents, a tibble with a matrix-column will simply not appear in-line. Instead you get an empty viewer box.
You can turn off paged printing for a single code chunk with the paged.print chunk option, and you’ll see something more like this:
```{r}
#| paged.print: false
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
tbl
```
# A tibble: 10 × 2
x mat_col[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.464 -1.12 -1.01 1.73 0.531 2.10 1.44 0.836 0.369
2 b 1.82 -0.239 0.749 1.57 -0.256 -1.41 -0.951 -1.71 -1.77
3 c 0.190 -0.785 1.27 -1.43 -1.82 0.715 -0.593 2.07 -0.228
4 d -1.18 0.271 1.52 0.135 -0.169 -1.23 0.522 -0.410 1.23
5 e -0.509 -0.944 0.108 -1.03 0.407 -0.953 -0.415 -1.25 -0.621
6 f 1.67 0.185 -0.807 0.149 0.114 0.240 -0.791 0.418 -2.13
7 g -2.04 -2.38 0.786 0.660 -0.114 -0.935 0.519 -1.32 -0.627
8 h -0.0686 0.166 -0.0905 -1.18 0.217 -0.695 -1.53 -0.554 -0.610
9 i -1.65 0.0525 -0.501 -1.64 -0.599 -1.04 0.143 -1.83 -0.626
10 j -0.623 -0.290 -0.430 -0.0352 0.937 -3.33 2.32 1.10 -0.503
# … with 1 more variable: mat_col[10] <dbl>
Also note that View() only renders the first column of a matrix-column, with no indication that there is more to see.
Figure: View()ing a tibble with a matrix-column only shows the first column of the matrix.
Update: the behavior of View() has been fixed since the original publication of this post.
Despite the printing and viewing issues, matrix-columns are surprisingly easy to use. The usual sort of indexing works as expected. You can select the matrix-column by name with [ or dplyr::select(), and you can extract the matrix itself using the $ operator, [[, or dplyr::pull().
# a tibble with only the matrix-column
tbl["mat_col"]
select(tbl, mat_col)
# the matrix itself:
tbl$mat_col
tbl[["mat_col"]]
pull(tbl, "mat_col")
Indexing rows works with no problem too.
tbl[3, ]
# A tibble: 1 × 2
x mat_col[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 c 0.190 -0.785 1.27 -1.43 -1.82 0.715 -0.593 2.07 -0.228 2.15
#dplyr::filter works too
filter(tbl, x %in% c("a", "f", "i"))
# A tibble: 3 × 2
x mat_co…¹ [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.464 -1.12 -1.01 1.73 0.531 2.10 1.44 0.836 0.369 -1.50
2 f 1.67 0.185 -0.807 0.149 0.114 0.240 -0.791 0.418 -2.13 -0.422
3 i -1.65 0.0525 -0.501 -1.64 -0.599 -1.04 0.143 -1.83 -0.626 0.376
# … with abbreviated variable name ¹mat_col[,1]
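Because the extracted column is an ordinary matrix, you can also index inside it, or hand it to matrix functions within mutate(). The sketch below rebuilds tbl so that it runs on its own; the row_mean column is a made-up name used only for illustration.

```r
library(tibble)
library(dplyr)

my_matrix <- matrix(rnorm(100), nrow = 10)
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)

# columns 1 to 3 of the matrix, for every row of the tibble
dim(tbl$mat_col[, 1:3])

# the third row of the matrix, returned as a plain numeric vector
length(tbl$mat_col[3, ])

# matrix functions can be applied to the matrix-column inside mutate()
tbl_means <- mutate(tbl, row_mean = rowMeans(mat_col))
colnames(tbl_means)
```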
And as we saw above, using matrix-columns in model formulas seems to work consistently as long as the input is expected or allowed to be a matrix.
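As a further illustration, here is a minimal sketch of a linear model with a matrix-column as the predictor. The data are simulated, and the names dat, predictors, and fit are hypothetical; lm() is used only because it accepts a matrix on the right-hand side of a formula.

```r
library(tibble)

set.seed(42)
# hypothetical data: 10 observations, a response y and a 3-column predictor matrix
predictors <- matrix(rnorm(30), nrow = 10,
                     dimnames = list(NULL, c("a", "b", "c")))
dat <- tibble(y = rnorm(10), predictors = predictors)

# the matrix-column enters the formula as a single term
fit <- lm(y ~ predictors, data = dat)

# one coefficient per matrix column, plus the intercept
names(coef(fit))
```

The coefficient names are built from the term name followed by the matrix's own column names, so naming the columns of the matrix makes the model output easier to read.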
Saving matrix-columns to disk
Ordinary data frames and tibbles (i.e. without list-columns or matrix-columns) can usually be reliably saved as .csv files.
When this post was first published, a tibble with a list-column would throw an error if you tried to write it to a .csv file:
df_list_col <- tibble(x = 1:10, y = list(1:10))
write_csv(df_list_col, "list_df.csv")
read_csv("list_df.csv")
Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
lgl (1): y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 2
x y
<dbl> <lgl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
At the time, tibbles with matrix-columns didn’t throw the same error, but unfortunately this was not because they worked correctly.
write_csv(tbl, "mat_df.csv")
Error in `cli_block()`:
! `x` must not contain list or matrix columns:
✖ invalid columns at index(s): 2
read_csv("mat_df")
Error: 'mat_df' does not exist in current working directory ('/Users/ericscott/Documents/GitHub/website-quarto/posts/2020-12-11-matrix-columns').
When this post was first written, only the first column of the matrix was saved to the .csv file. If you want to use matrix-columns in your work, you should either create them in the same document as your analysis, or save them as .rds files.
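A round trip through an .rds file can be sketched with base R's saveRDS() and readRDS(). The temporary file path is an assumption made so that the example leaves nothing behind; in practice you would pick your own file name.

```r
library(tibble)

my_matrix <- matrix(rnorm(100), nrow = 10)
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)

# write to a temporary file so the example leaves nothing behind
rds_path <- tempfile(fileext = ".rds")
saveRDS(tbl, rds_path)

# read it back and check that the matrix-column survived intact
restored <- readRDS(rds_path)
is.matrix(restored$mat_col)
dim(restored$mat_col)
identical(tbl$mat_col, restored$mat_col)
```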
Since the publication of this post, these behaviors have actually switched! write_csv() now seems not to complain when writing tibbles with list-columns, although these columns come back empty. It does throw an error in the second example, the one with a matrix-column!
That’s all for now, but please let me know in the comments if you’ve used matrix-columns in your work!