Retrieve chemical retention indices from NIST with {webchem}!

data-science
data-wrangling
r
webscraping
GC/MS
Published

June 28, 2018

My PhD has involved learning a lot more than I expected about analytical chemistry, and as I’ve been learning, I’ve been trying my best to make my life easier by writing R functions to help me out. Some of those functions have found a loving home in the webchem package, part of rOpenSci.

Papers that use gas chromatography to separate and measure chemicals often include a table of the compounds they found along with experimental retention indices and literature retention indices. A retention index is basically a corrected retention time—the time it took for the particular chemical to make it through the gas chromatograph, an instrument designed to separate chemicals, to the detector used to identify the compound (e.g. an FID or mass spectrometer). While the retention time for a particular compound might vary from run to run or beetween labs, the retention index should be comparable. Therefore, they are often used to help identify compounds and NIST maintains a database of retention indeces for researchers to refer to.

A data table containing columns for compound name, literature RI, experimental RI, Spring, Monsoon, and % Difference. Table caption reads: "Relative tea metabolite concentrations in spring and monsoon harvests."

An example table including literature retention indices from Kowalsick, et al. 2014

Producing such a table of literature retention indices for potentially hundreds of metabolites by hand can be really tedious!

Enter nist_ri(), a handy function I wrote to scrape retention index tables from NIST. Below, I work through an example of how you might use it. First, you need to install the latest version of webchem. My function isn’t in the latest CRAN release at the time of writing this blog post, but you can install from github like so:

devtools::install_github("ropensci/webchem")

To look up a retention index, you need a CAS identifier number for the chemical (For now, at least. Other search methods may be implemented in the future). If you don’t already have CAS numbers, you can get them using other functions in webchem from chemical names or other identifier numbers.

CASs <- c("83-34-1", "119-36-8", "123-35-3", "19700-21-1")

Load the package and take a look at the help file. You’ll see that we need to choose what type of retention index to scrape, what polarity of column, and what kind of temperature program. If you browse one of the NIST sites for a compound, this will make more sense.

library(webchem)
?nist_ri

Let’s get Van Den Dool & Kratz (AKA “linear”) retention indexes for non-polar columns using a temperature ramp. This might take a while, depending on your internet connection and how many CAS numbers you request data for. If a certain type of retention index doesn’t exist for a compound, the function will return NA for all columns but the CAS number.

RIs <-
  nist_ri(
    query = CASs,
    from = "cas",
    type = "linear",
    polarity = "non-polar",
    temp_prog = "ramp"
  )
head(RIs)
# A tibble: 6 × 16
  query      RI type  phase length gas   subst…¹ diame…² thick…³ temp_…⁴ temp_…⁵
  <chr>   <dbl> <chr> <chr>  <dbl> <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 83-34-1  1410 Capi… SPB-5     60 <NA>  <NA>       0.32    1         40     230
2 83-34-1  1380 Capi… DB-5…     30 Heli… <NA>       0.25    0.25      35     225
3 83-34-1  1410 Capi… SE-54     50 Heli… <NA>       0.32   NA         40     240
4 83-34-1  1381 Capi… DB-5      30 Hydr… <NA>       0.25    0.25      35     270
5 83-34-1  1389 Capi… DB-5      30 Nitr… <NA>       0.25    0.25      30     250
6 83-34-1  1399 Capi… DB-5…     30 <NA>  <NA>       0.25    0.25      40     200
# … with 5 more variables: temp_rate <dbl>, hold_start <dbl>, hold_end <dbl>,
#   reference <chr>, comment <chr>, and abbreviated variable names ¹​substrate,
#   ²​diameter, ³​thickness, ⁴​temp_start, ⁵​temp_end

You can see there are multiple retention indexes (RI) for each CAS number. Let’s filter this down some more using some functions from dplyr and stringr.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
RIs_filtered <-
  RIs %>%
  filter(gas == "Helium",
         between(length, 20, 30),
         str_detect(phase, "5"),
         diameter < 0.3,
         thickness == 0.25) %>% 
  rename(CAS = query)
head(RIs_filtered)
# A tibble: 6 × 16
  CAS        RI type  phase length gas   subst…¹ diame…² thick…³ temp_…⁴ temp_…⁵
  <chr>   <dbl> <chr> <chr>  <dbl> <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 83-34-1 1380  Capi… DB-5…     30 Heli… <NA>       0.25    0.25      35     225
2 83-34-1 1396  Capi… DB-5…     30 Heli… <NA>       0.25    0.25      35     200
3 83-34-1 1391  Capi… DB-5      30 Heli… <NA>       0.26    0.25      50     300
4 119-36… 1201  Capi… DB-5      25 Heli… <NA>       0.25    0.25      60     200
5 119-36… 1201. Capi… HP-5…     30 Heli… <NA>       0.25    0.25      80     300
6 119-36… 1190  Capi… HP-5…     30 Heli… <NA>       0.25    0.25      60     280
# … with 5 more variables: temp_rate <dbl>, hold_start <dbl>, hold_end <dbl>,
#   reference <chr>, comment <chr>, and abbreviated variable names ¹​substrate,
#   ²​diameter, ³​thickness, ⁴​temp_start, ⁵​temp_end

Now we could summarise to get an average of all the database entries…

RIs_filtered %>% 
  group_by(CAS) %>% 
  summarise(mean_RI = mean(RI))
# A tibble: 4 × 2
  CAS        mean_RI
  <chr>        <dbl>
1 119-36-8     1193.
2 123-35-3      990.
3 19700-21-1   1430 
4 83-34-1      1389 

Or if we wanted to pick a single entry for each CAS number with the median RI, we could do that as well.

best_RIs <-
  RIs_filtered %>%
  group_by(CAS) %>% 
  filter(RI == median(RI)) %>% 
  filter(row_number() == 1) %>% 
  select(CAS, RI, reference)
best_RIs
# A tibble: 4 × 3
# Groups:   CAS [4]
  CAS           RI reference                             
  <chr>      <dbl> <chr>                                 
1 83-34-1     1391 Rostad and Pereira, 1986              
2 119-36-8    1191 Aligiannis, Kalpoutzakis, et al., 2004
3 123-35-3     991 Maccioni, Baldini, et al., 2007       
4 19700-21-1  1430 Dickschat, Wenzel, et al., 2004       

You could then easily take this table and *_join() it to your GC/MS data, if you have a column for CAS#, and select the RI and reference columns, for example.

set.seed(888)
fake.data <-
  data.frame(CAS = CASs,
             #Name = cts_convert(CASs, from = "CAS", to = "Chemical Name", first = TRUE),
             Name = c("skatole", "methyl salicylate", "beta-myrcene", "geosmin"),
             group_1_conc = round(abs(rnorm(4)), 3),
             group_2_conc = round(abs(rnorm(4)), 3))

left_join(fake.data, best_RIs) %>%
  select(CAS, Name, RI, everything()) %>% 
  arrange(RI) %>%
  knitr::kable()
CAS Name RI group_1_conc group_2_conc reference
123-35-3 beta-myrcene 991 0.730 2.166 Maccioni, Baldini, et al., 2007
119-36-8 methyl salicylate 1191 1.544 0.251 Aligiannis, Kalpoutzakis, et al., 2004
83-34-1 skatole 1391 1.951 1.656 Rostad and Pereira, 1986
19700-21-1 geosmin 1430 0.278 0.587 Dickschat, Wenzel, et al., 2004