::install_github("ropensci/webchem") devtools
Retrieve chemical retention indices from NIST with {webchem}!
My PhD has involved learning a lot more than I expected about analytical chemistry, and as I’ve been learning, I’ve been trying my best to make my life easier by writing R functions to help me out. Some of those functions have found a loving home in the webchem
package, part of rOpenSci.
Papers that use gas chromatography to separate and measure chemicals often include a table of the compounds they found along with experimental retention indices and literature retention indices. A retention index is basically a corrected retention time—the time it took for the particular chemical to make it through the gas chromatograph, an instrument designed to separate chemicals, to the detector used to identify the compound (e.g. an FID or mass spectrometer). While the retention time for a particular compound might vary from run to run or beetween labs, the retention index should be comparable. Therefore, they are often used to help identify compounds and NIST maintains a database of retention indeces for researchers to refer to.
Producing such a table of literature retention indices for potentially hundreds of metabolites by hand can be really tedious!
Enter nist_ri()
, a handy function I wrote to scrape retention index tables from NIST. Below, I work through an example of how you might use it. First, you need to install the latest version of webchem
. My function isn’t in the latest CRAN release at the time of writing this blog post, but you can install from github like so:
To look up a retention index, you need a CAS identifier number for the chemical (For now, at least. Other search methods may be implemented in the future). If you don’t already have CAS numbers, you can get them using other functions in webchem
from chemical names or other identifier numbers.
<- c("83-34-1", "119-36-8", "123-35-3", "19700-21-1") CASs
Load the package and take a look at the help file. You’ll see that we need to choose what type of retention index to scrape, what polarity of column, and what kind of temperature program. If you browse one of the NIST sites for a compound, this will make more sense.
library(webchem)
?nist_ri
Let’s get Van Den Dool & Kratz (AKA “linear”) retention indexes for non-polar columns using a temperature ramp. This might take a while, depending on your internet connection and how many CAS numbers you request data for. If a certain type of retention index doesn’t exist for a compound, the function will return NA
for all columns but the CAS number.
<-
RIs nist_ri(
query = CASs,
from = "cas",
type = "linear",
polarity = "non-polar",
temp_prog = "ramp"
)head(RIs)
# A tibble: 6 × 16
query RI type phase length gas subst…¹ diame…² thick…³ temp_…⁴ temp_…⁵
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 83-34-1 1410 Capi… SPB-5 60 <NA> <NA> 0.32 1 40 230
2 83-34-1 1380 Capi… DB-5… 30 Heli… <NA> 0.25 0.25 35 225
3 83-34-1 1410 Capi… SE-54 50 Heli… <NA> 0.32 NA 40 240
4 83-34-1 1381 Capi… DB-5 30 Hydr… <NA> 0.25 0.25 35 270
5 83-34-1 1389 Capi… DB-5 30 Nitr… <NA> 0.25 0.25 30 250
6 83-34-1 1399 Capi… DB-5… 30 <NA> <NA> 0.25 0.25 40 200
# … with 5 more variables: temp_rate <dbl>, hold_start <dbl>, hold_end <dbl>,
# reference <chr>, comment <chr>, and abbreviated variable names ¹substrate,
# ²diameter, ³thickness, ⁴temp_start, ⁵temp_end
You can see there are multiple retention indexes (RI
) for each CAS number. Let’s filter this down some more using some functions from dplyr
and stringr
.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stringr)
<-
RIs_filtered %>%
RIs filter(gas == "Helium",
between(length, 20, 30),
str_detect(phase, "5"),
< 0.3,
diameter == 0.25) %>%
thickness rename(CAS = query)
head(RIs_filtered)
# A tibble: 6 × 16
CAS RI type phase length gas subst…¹ diame…² thick…³ temp_…⁴ temp_…⁵
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 83-34-1 1380 Capi… DB-5… 30 Heli… <NA> 0.25 0.25 35 225
2 83-34-1 1396 Capi… DB-5… 30 Heli… <NA> 0.25 0.25 35 200
3 83-34-1 1391 Capi… DB-5 30 Heli… <NA> 0.26 0.25 50 300
4 119-36… 1201 Capi… DB-5 25 Heli… <NA> 0.25 0.25 60 200
5 119-36… 1201. Capi… HP-5… 30 Heli… <NA> 0.25 0.25 80 300
6 119-36… 1190 Capi… HP-5… 30 Heli… <NA> 0.25 0.25 60 280
# … with 5 more variables: temp_rate <dbl>, hold_start <dbl>, hold_end <dbl>,
# reference <chr>, comment <chr>, and abbreviated variable names ¹substrate,
# ²diameter, ³thickness, ⁴temp_start, ⁵temp_end
Now we could summarise
to get an average of all the database entries…
%>%
RIs_filtered group_by(CAS) %>%
summarise(mean_RI = mean(RI))
# A tibble: 4 × 2
CAS mean_RI
<chr> <dbl>
1 119-36-8 1193.
2 123-35-3 990.
3 19700-21-1 1430
4 83-34-1 1389
Or if we wanted to pick a single entry for each CAS number with the median RI, we could do that as well.
<-
best_RIs %>%
RIs_filtered group_by(CAS) %>%
filter(RI == median(RI)) %>%
filter(row_number() == 1) %>%
select(CAS, RI, reference)
best_RIs
# A tibble: 4 × 3
# Groups: CAS [4]
CAS RI reference
<chr> <dbl> <chr>
1 83-34-1 1391 Rostad and Pereira, 1986
2 119-36-8 1191 Aligiannis, Kalpoutzakis, et al., 2004
3 123-35-3 991 Maccioni, Baldini, et al., 2007
4 19700-21-1 1430 Dickschat, Wenzel, et al., 2004
You could then easily take this table and *_join()
it to your GC/MS data, if you have a column for CAS#, and select the RI
and reference
columns, for example.
set.seed(888)
<-
fake.data data.frame(CAS = CASs,
#Name = cts_convert(CASs, from = "CAS", to = "Chemical Name", first = TRUE),
Name = c("skatole", "methyl salicylate", "beta-myrcene", "geosmin"),
group_1_conc = round(abs(rnorm(4)), 3),
group_2_conc = round(abs(rnorm(4)), 3))
left_join(fake.data, best_RIs) %>%
select(CAS, Name, RI, everything()) %>%
arrange(RI) %>%
::kable() knitr
CAS | Name | RI | group_1_conc | group_2_conc | reference |
---|---|---|---|---|---|
123-35-3 | beta-myrcene | 991 | 0.730 | 2.166 | Maccioni, Baldini, et al., 2007 |
119-36-8 | methyl salicylate | 1191 | 1.544 | 0.251 | Aligiannis, Kalpoutzakis, et al., 2004 |
83-34-1 | skatole | 1391 | 1.951 | 1.656 | Rostad and Pereira, 1986 |
19700-21-1 | geosmin | 1430 | 0.278 | 0.587 | Dickschat, Wenzel, et al., 2004 |