My approach to wrangling text files generated by the LI-6400XT
Importing data from a LI-COR photosynthesis meter into R
The LI-6400XT is a portable device used to measure photosynthesis in plant leaves. As you take measurements by pressing a button on the device, they are recorded into memory. In order to keep track of which measurments go with which plants (or experimental treatments), there is an “add remark” option where you can enter sample information before taking measurements.
When the data are exported, you get a series of .xls files and a plain text file. Both of these have some problems that you’ll have to deal with if you want to read the data into R and use it for statistical analysis or generating reports.
Both file types create some problems for easily getting the data into R:
- Header information interrupts the data table format. Fortunately, it’s mostly just information about the instrument configuration that we don’t need.
- Untidy handling of remarks. Instead of remarks being in their own column, they appear in the
HHMMSS
column in the .xls files and in theObs
column in the .txt file! And to indicate that the row is a remark, instead of giving it an observation number inObs
, it just says “Remark =”. - Column headers are spread over two rows. There is a (somewhat mysterious to me) row of “in”s and “out”s under the column headers in the .xls file.
- Another problem that you can’t see in Fig. 1 is that I’ve done my measurements in several bouts. This produced two .xls files and a text file with header text in-between my two sets of measurements.
At this point I had to choose between reading in the .xls files with read_xls()
from the readxl
package and doing some wrangling from there, or to deal with the text file, which would surely include some regular expression headaches.
For some unknown reason, read_xls()
didn’t work on these files, and I had to open them in Excel, then save them as .xlsx files and use read_xlsx()
to get them into R. For the sake of full automatedness, I’m going to work through the text file example here.
Tidying up raw text
My approach is to read in the raw text, tidy it up, then use read_tsv()
to get a list of data frames. After that, I planned to combine them into one big data frame and do some more tidying to extract the sample IDs from the remarks. I’ll be using functions from stringr
to do the text tidying, and functions from various tidyverse
packages to bring it all together into a coherent data frame.
library(tidyverse)
library(stringr)
library(here)
<- read_file("licor.txt") text.raw
Scrolling through the text a little reveals that, conveniently, the line "$STARTOFDATA$"
appears between the header information and the start of the actual data. The headers themselves always begin with "OPEN"
followed by a date. I created regular expression patterns for these and used them to split the raw text file first into separate bouts of measurements, then into headers and data, discarding the headers.
<- "\"OPEN \\d\\.\\d\\.\\d"
header_pattern <- "\\$STARTOFDATA\\$"
data_pattern
#splits into individual bouts
<- str_split(text.raw, header_pattern, simplify = TRUE)
raw_split
#splits further to separate headers from actual data
<- str_split(raw_split, data_pattern, simplify = FALSE)
raw_split2
str(raw_split2)
List of 3
$ : chr ""
$ : chr [1:2] "\"\n\"Fri Aug 25 2017 08:29:30\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__
$ : chr [1:2] "\"\n\"Fri Aug 25 2017 10:32:44\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__
It’s a little hard to see here, but now there is a list of 3 elements. The first element contains nothing (because there is nothing before the first header_pattern
), the other elements contain two strings—one is the header, the other is the data. Let’s get rid of the headers and the empty list element.
#extract just the second element, the actual data
<- raw_split2 %>%
raw_split3 map(`[`, 2) %>% #equivalent to doing raw_split2[[i]][2] for every element "i"
flatten_chr() #converts to a vector
#remove empty elements
<- raw_split3[!is.na(raw_split3)] raw_split3
Reading in our cleaned text file
Then we can finally read in our cleaned text as a tab-separated (.tsv) file. Here I make use of map()
from the purrr
package to apply read_tsv()
to every string in our raw text vector. skip = 1
gets rid of that weird line of “in”s and “out”s.
<- raw_split3 %>%
input map( ~ read_tsv(
.x,col_types = cols(
.default = col_double(),
Obs = col_character(),
HHMMSS = col_time(format = "")
)skip = 1)
),
# input.all <- bind_rows(input) #not working. Says cols are different types, but that can't be true!
<- rbind(input[[1]], input[[2]])
input.all head(input.all, 10)
# A tibble: 10 × 40
Obs HHMMSS FTime `EBal?` Photo Cond Ci Trmmol VpdL CTleaf Area
<chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08:30:3… NA NA NA NA NA NA NA NA NA NA
2 08:32:3… NA NA NA NA NA NA NA NA NA NA
3 08:40:2… NA NA NA NA NA NA NA NA NA NA
4 08:40:2… NA NA NA NA NA NA NA NA NA NA
5 08:40:2… NA NA NA NA NA NA NA NA NA NA
6 08:40:2… NA NA NA NA NA NA NA NA NA NA
7 08:42:1… NA NA NA NA NA NA NA NA NA NA
8 08:43:1… NA NA NA NA NA NA NA NA NA NA
9 1 08:43:53 932 0 6.62 0.0589 202 0.724 1.25 22.3 6
10 2 08:43:58 937 0 6.99 0.0594 193 0.731 1.25 22.3 6
# … with 29 more variables: BLC_1 <dbl>, StmRat <dbl>, BLCond <dbl>,
# Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>, H2OR <dbl>,
# H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>, PARo <dbl>,
# Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>,
# CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
# BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>
Extracting useful remarks
The first step is moving remarks into a remark
column while keeping the observation numbers in the Obs
column. I’m sure there is a more elegant way to do this, but I had recently learned about the safely()
function from purrr
which allows you to capture errors. I figured I could try converting elements of the Obs
column to integers and if it failed, I could use that as a criteria for moving to a new column.
#create a "safe" version of as.integer() that returns a list of a result and error
<- safely(as.integer)
safe_as.int #returns error for text remarks, returns value for integer observation numbers
<- input.all %>%
input.all mutate(#create a comment column to indicate if an "Obs" is actually a remark
comment = is.na(safe_as.int(Obs)$result),
#copy those remarks to the remark column
remark = ifelse(comment == TRUE, Obs, NA),
#remove remarks from Obs column
Obs = ifelse(comment == FALSE, Obs, NA)) %>%
#move the remark column the the begining
select(remark, everything()) %>%
#remove the comment column. We're done with it
select(-comment)
head(input.all, 10)
# A tibble: 10 × 41
remark Obs HHMMSS FTime `EBal?` Photo Cond Ci Trmmol VpdL CTleaf
<chr> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08:30:3… <NA> NA NA NA NA NA NA NA NA NA
2 08:32:3… <NA> NA NA NA NA NA NA NA NA NA
3 08:40:2… <NA> NA NA NA NA NA NA NA NA NA
4 08:40:2… <NA> NA NA NA NA NA NA NA NA NA
5 08:40:2… <NA> NA NA NA NA NA NA NA NA NA
6 08:40:2… <NA> NA NA NA NA NA NA NA NA NA
7 08:42:1… <NA> NA NA NA NA NA NA NA NA NA
8 08:43:1… <NA> NA NA NA NA NA NA NA NA NA
9 <NA> 1 08:43:53 932 0 6.62 0.0589 202 0.724 1.25 22.3
10 <NA> 2 08:43:58 937 0 6.99 0.0594 193 0.731 1.25 22.3
# … with 30 more variables: Area <dbl>, BLC_1 <dbl>, StmRat <dbl>,
# BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>,
# H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>,
# PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>,
# HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
# BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>
In the data frame above, you’ll notice that some of the remarks are just me changing parameters of the device, while others are sample IDs (e.g. “08:43:13 c 4 a” is plot c, plant 4, leaf a). I got lucky in my sample naming convention in that the sample IDs are relatively easily distinguishable from other remarks using regular expressions.
#you must replace NA with the literal string "NA" so str_* functions from stringr can deal with it
<- input.all %>% mutate(remark = str_replace_na(remark))
input.all
<- "[:lower:][:blank:]\\d+[:blank:][:lower:]"
IDpattern str_view(input.all$remark[1:10], IDpattern)
[8] │ 08:43:13 <c 4 a>
Now that I’ve figure out a pattern that matches the ID’s I can use str_extract()
to move them to a new sampleID
column.
<- input.all %>%
input.all mutate(sampleID = str_extract(remark, IDpattern)) %>%
select(sampleID, everything())
head(input.all, 10)
# A tibble: 10 × 42
sampleID remark Obs HHMMSS FTime `EBal?` Photo Cond Ci Trmmol VpdL
<chr> <chr> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> 08:30… <NA> NA NA NA NA NA NA NA NA
2 <NA> 08:32… <NA> NA NA NA NA NA NA NA NA
3 <NA> 08:40… <NA> NA NA NA NA NA NA NA NA
4 <NA> 08:40… <NA> NA NA NA NA NA NA NA NA
5 <NA> 08:40… <NA> NA NA NA NA NA NA NA NA
6 <NA> 08:40… <NA> NA NA NA NA NA NA NA NA
7 <NA> 08:42… <NA> NA NA NA NA NA NA NA NA
8 c 4 a 08:43… <NA> NA NA NA NA NA NA NA NA
9 <NA> NA 1 08:43:53 932 0 6.62 0.0589 202 0.724 1.25
10 <NA> NA 2 08:43:58 937 0 6.99 0.0594 193 0.731 1.25
# … with 31 more variables: CTleaf <dbl>, Area <dbl>, BLC_1 <dbl>,
# StmRat <dbl>, BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>,
# CO2R <dbl>, CO2S <dbl>, H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>,
# Flow <dbl>, PARi <dbl>, PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>,
# CsMchSD <dbl>, HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>,
# BLCslope <dbl>, BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>,
# alphaK <dbl>, Status <dbl>
Fill down
Now, if this were Excel, you could highlight that “c 4 a” and drag the corner down to fill in all the NA’s. In R, you can do exactly this with the fill()
function from tidyr
.
#get rid of other remarks and fill down the sample ID column
<- input.all %>%
output filter(!xor(remark == "NA" , is.na(sampleID))) %>%
fill(sampleID) %>%
#get rid of the rest of the remark rows
filter(complete.cases(.)) %>%
#get rid of the remark column
select(-remark)
head(output, 10)
# A tibble: 10 × 41
sampleID Obs HHMMSS FTime `EBal?` Photo Cond Ci Trmmol VpdL CTleaf
<chr> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 c 4 a 1 08:43:53 932 0 6.62 0.0589 202 0.724 1.25 22.3
2 c 4 a 2 08:43:58 937 0 6.99 0.0594 193 0.731 1.25 22.3
3 c 4 a 3 08:44:03 942 0 6.96 0.0596 195 0.731 1.25 22.3
4 c 4 b 4 08:44:39 978 0 7.08 0.0783 237 0.941 1.23 22.4
5 c 4 b 5 08:44:46 985 0 6.9 0.0791 242 0.949 1.23 22.4
6 c 4 b 6 08:44:51 990 0 6.78 0.0801 246 0.959 1.22 22.4
7 c 4 c 7 08:45:31 1030 0 6.34 0.0654 228 0.805 1.25 22.4
8 c 4 c 8 08:45:38 1037 0 6.48 0.0648 222 0.798 1.26 22.5
9 c 4 c 9 08:45:45 1044 0 6.51 0.0664 226 0.819 1.26 22.5
10 c 4 c 10 08:45:51 1050 0 6.68 0.0678 225 0.836 1.26 22.5
# … with 30 more variables: Area <dbl>, BLC_1 <dbl>, StmRat <dbl>,
# BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>, CO2S <dbl>,
# H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <dbl>,
# PARo <dbl>, Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>,
# HsMchSD <dbl>, CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
# BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <dbl>
And finally, we have a cleaned data frame ready for use in analyses! You could go on to separate plot ID, plant ID and leaf ID using separate()
from tidyr
, and then do any necessary calculations, visualizations, and modeling with the resulting data frame.