Importing data from a LI-COR photosynthesis meter into R

data-science
r
data-wrangling
regexp
Published

January 17, 2018

The LI-6400XT is a portable device used to measure photosynthesis in plant leaves. As you take measurements by pressing a button on the device, they are recorded into memory. In order to keep track of which measurments go with which plants (or experimental treatments), there is an “add remark” option where you can enter sample information before taking measurements.

When the data are exported, you get a series of .xls files and a plain text file. Both of these have some problems that you’ll have to deal with if you want to read the data into R and use it for statistical analysis or generating reports.

On the left, a messy looking text file with xml tags taking up the top two thirds of the screen.  On the right, an excel spreadsheet with metadata in the top 7 rows, two rows of header columns, and "remark" rows interspersed with the actual data

Excel nightmare or text nightmare? Pick your poison.

Both file types create some problems for easily getting the data into R:

  1. Header information interrupts the data table format. Fortunately, it’s mostly just information about the instrument configuration that we don’t need.
  2. Untidy handling of remarks. Instead of remarks being in their own column, they appear in the HHMMSS column in the .xls files and in the Obs column in the .txt file! And to indicate that the row is a remark, instead of giving it an observation number in Obs, it just says “Remark =”.
  3. Column headers are spread over two rows. There is a (somewhat mysterious to me) row of “in”s and “out”s under the column headers in the .xls file.
  4. Another problem that you can’t see in Fig. 1 is that I’ve done my measurements in several bouts. This produced two .xls files and a text file with header text in-between my two sets of measurements.

At this point I had to choose between reading in the .xls files with read_xls() from the readxl package and doing some wrangling from there, or to deal with the text file, which would surely include some regular expression headaches.

For some unknown reason, read_xls() didn’t work on these files, and I had to open them in Excel, then save them as .xlsx files and use read_xlsx() to get them into R. For the sake of full automatedness, I’m going to work through the text file example here.

Tidying up raw text

My approach to wrangling text files generated by the LI-6400XT

My approach is to read in the raw text, tidy it up, then use read_tsv() to get a list of data frames. After that, I planned to combine them into one big data frame and do some more tidying to extract the sample IDs from the remarks. I’ll be using functions from stringr to do the text tidying, and functions from various tidyverse packages to bring it all together into a coherent data frame.

library(tidyverse)
library(stringr)
library(here)
text.raw <- read_file("licor.txt")

Scrolling through the text a little reveals that, conveniently, the line "$STARTOFDATA$" appears between the header information and the start of the actual data. The headers themselves always begin with "OPEN" followed by a date. I created regular expression patterns for these and used them to split the raw text file first into separate bouts of measurements, then into headers and data, discarding the headers.

header_pattern <- "\"OPEN \\d\\.\\d\\.\\d"
data_pattern <- "\\$STARTOFDATA\\$"

#splits into individual bouts
raw_split <- str_split(text.raw, header_pattern, simplify = TRUE)

#splits further to separate headers from actual data
raw_split2 <- str_split(raw_split, data_pattern, simplify = FALSE)

str(raw_split2)
List of 3
 $ : chr ""
 $ : chr [1:2] "\"\n\"Fri Aug 25 2017 08:29:30\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__
 $ : chr [1:2] "\"\n\"Fri Aug 25 2017 10:32:44\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__

It’s a little hard to see here, but now there is a list of 3 elements. The first element contains nothing (because there is nothing before the first header_pattern), the other elements contain two strings—one is the header, the other is the data. Let’s get rid of the headers and the empty list element.

#extract just the second element, the actual data
raw_split3 <- raw_split2 %>%
  map(`[`, 2) %>% #equivalent to doing raw_split2[[i]][2] for every element "i"
  flatten_chr() #converts to a vector

#remove empty elements
raw_split3 <- raw_split3[!is.na(raw_split3)]

Reading in our cleaned text file

Then we can finally read in our cleaned text as a tab-separated (.tsv) file. Here I make use of map() from the purrr package to apply read_tsv() to every string in our raw text vector. skip = 1 gets rid of that weird line of “in”s and “out”s.

input <- raw_split3 %>%
  map( ~ read_tsv(
    .x,
    col_types = cols(
      .default = col_double(),
      Obs = col_character(),
      HHMMSS = col_time(format = "")
    )
  ), skip = 1)

# input.all <- bind_rows(input) #not working. Says cols are different types, but that can't be true!
input.all <- rbind(input[[1]], input[[2]])
head(input.all, 10)
# A tibble: 10 × 40
   Obs      HHMMSS   FTime `EBal?` Photo    Cond    Ci Trmmol  VpdL CTleaf  Area
   <chr>    <time>   <dbl>   <dbl> <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
 1 08:30:3…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 2 08:32:3…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 3 08:40:2…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 4 08:40:2…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 5 08:40:2…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 6 08:40:2…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 7 08:42:1…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 8 08:43:1…       NA    NA      NA NA    NA         NA NA     NA      NA      NA
 9 1        08:43:53   932       0  6.62  0.0589   202  0.724  1.25   22.3     6
10 2        08:43:58   937       0  6.99  0.0594   193  0.731  1.25   22.3     6
# … with 29 more variables: BLC_1 <dbl>, StmRat <dbl>, BLCond <dbl>,
#   Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>, H2OR <dbl>,
#   H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>, PARo <dbl>,
#   Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>,
#   CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
#   BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>

Extracting useful remarks

The first step is moving remarks into a remark column while keeping the observation numbers in the Obs column. I’m sure there is a more elegant way to do this, but I had recently learned about the safely() function from purrr which allows you to capture errors. I figured I could try converting elements of the Obs column to integers and if it failed, I could use that as a criteria for moving to a new column.

#create a "safe" version of as.integer() that returns a list of a result and error
safe_as.int <- safely(as.integer)
#returns error for text remarks, returns value for integer observation numbers

input.all <- input.all %>% 
  mutate(#create a comment column to indicate if an "Obs" is actually a remark
       comment = is.na(safe_as.int(Obs)$result), 
       #copy those remarks to the remark column
       remark = ifelse(comment == TRUE, Obs, NA),
       #remove remarks from Obs column
       Obs = ifelse(comment == FALSE, Obs, NA)) %>% 
#move the remark column the the begining
select(remark, everything()) %>% 
#remove the comment column.  We're done with it
select(-comment)
head(input.all, 10)
# A tibble: 10 × 41
   remark   Obs   HHMMSS   FTime `EBal?` Photo    Cond    Ci Trmmol  VpdL CTleaf
   <chr>    <chr> <time>   <dbl>   <dbl> <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1 08:30:3… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 2 08:32:3… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 3 08:40:2… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 4 08:40:2… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 5 08:40:2… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 6 08:40:2… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 7 08:42:1… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 8 08:43:1… <NA>        NA    NA      NA NA    NA         NA NA     NA      NA  
 9 <NA>     1     08:43:53   932       0  6.62  0.0589   202  0.724  1.25   22.3
10 <NA>     2     08:43:58   937       0  6.99  0.0594   193  0.731  1.25   22.3
# … with 30 more variables: Area <dbl>, BLC_1 <dbl>, StmRat <dbl>,
#   BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>,
#   H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>,
#   PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>,
#   HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
#   BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>

In the data frame above, you’ll notice that some of the remarks are just me changing parameters of the device, while others are sample IDs (e.g. “08:43:13 c 4 a” is plot c, plant 4, leaf a). I got lucky in my sample naming convention in that the sample IDs are relatively easily distinguishable from other remarks using regular expressions.

#you must replace NA with the literal string "NA" so str_* functions from stringr can deal with it
input.all <- input.all %>% mutate(remark = str_replace_na(remark))

IDpattern <- "[:lower:][:blank:]\\d+[:blank:][:lower:]"
str_view(input.all$remark[1:10], IDpattern)
[8] │ 08:43:13 <c 4 a>

Now that I’ve figure out a pattern that matches the ID’s I can use str_extract() to move them to a new sampleID column.

input.all <- input.all %>%
  mutate(sampleID = str_extract(remark, IDpattern)) %>% 
  select(sampleID, everything())
head(input.all, 10)
# A tibble: 10 × 42
   sampleID remark Obs   HHMMSS   FTime `EBal?` Photo    Cond    Ci Trmmol  VpdL
   <chr>    <chr>  <chr> <time>   <dbl>   <dbl> <dbl>   <dbl> <dbl>  <dbl> <dbl>
 1 <NA>     08:30… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 2 <NA>     08:32… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 3 <NA>     08:40… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 4 <NA>     08:40… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 5 <NA>     08:40… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 6 <NA>     08:40… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 7 <NA>     08:42… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 8 c 4 a    08:43… <NA>        NA    NA      NA NA    NA         NA NA     NA   
 9 <NA>     NA     1     08:43:53   932       0  6.62  0.0589   202  0.724  1.25
10 <NA>     NA     2     08:43:58   937       0  6.99  0.0594   193  0.731  1.25
# … with 31 more variables: CTleaf <dbl>, Area <dbl>, BLC_1 <dbl>,
#   StmRat <dbl>, BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>,
#   CO2R <dbl>, CO2S <dbl>, H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>,
#   Flow <dbl>, PARi <dbl>, PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>,
#   CsMchSD <dbl>, HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>,
#   BLCslope <dbl>, BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>,
#   alphaK <dbl>, Status <dbl>

Fill down

Now, if this were Excel, you could highlight that “c 4 a” and drag the corner down to fill in all the NA’s. In R, you can do exactly this with the fill() function from tidyr.

#get rid of other remarks and fill down the sample ID column
output <- input.all %>% 
  filter(!xor(remark == "NA" , is.na(sampleID))) %>%
  fill(sampleID) %>% 
  #get rid of the rest of the remark rows
  filter(complete.cases(.)) %>% 
  #get rid of the remark column
  select(-remark)
head(output, 10)
# A tibble: 10 × 41
   sampleID Obs   HHMMSS   FTime `EBal?` Photo   Cond    Ci Trmmol  VpdL CTleaf
   <chr>    <chr> <time>   <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1 c 4 a    1     08:43:53   932       0  6.62 0.0589   202  0.724  1.25   22.3
 2 c 4 a    2     08:43:58   937       0  6.99 0.0594   193  0.731  1.25   22.3
 3 c 4 a    3     08:44:03   942       0  6.96 0.0596   195  0.731  1.25   22.3
 4 c 4 b    4     08:44:39   978       0  7.08 0.0783   237  0.941  1.23   22.4
 5 c 4 b    5     08:44:46   985       0  6.9  0.0791   242  0.949  1.23   22.4
 6 c 4 b    6     08:44:51   990       0  6.78 0.0801   246  0.959  1.22   22.4
 7 c 4 c    7     08:45:31  1030       0  6.34 0.0654   228  0.805  1.25   22.4
 8 c 4 c    8     08:45:38  1037       0  6.48 0.0648   222  0.798  1.26   22.5
 9 c 4 c    9     08:45:45  1044       0  6.51 0.0664   226  0.819  1.26   22.5
10 c 4 c    10    08:45:51  1050       0  6.68 0.0678   225  0.836  1.26   22.5
# … with 30 more variables: Area <dbl>, BLC_1 <dbl>, StmRat <dbl>,
#   BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>,
#   H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>,
#   PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>,
#   HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
#   BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>

And finally, we have a cleaned data frame ready for use in analyses! You could go on to separate plot ID, plant ID and leaf ID using separate() from tidyr, and then do any necessary calculations, visualizations, and modeling with the resulting data frame.