I - 4 Aggregating and Analyzing Data with dplyr

Licensed under CC-BY 4.0 and OSI-approved licenses, see licensing.

Overview

Teaching: 40 min
Exercises: 15 min
Questions
  • How can I manipulate dataframes without repeating myself?

Objectives
  • Describe the purpose of the dplyr and tidyr packages.

  • Select certain columns in a data frame with the dplyr function select.

  • Extract certain rows in a data frame according to logical (boolean) conditions with the dplyr function filter.

  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.

  • Use the split-apply-combine concept for data analysis.

  • Use summarize, group_by, and count to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.

  • Export a data frame to a .csv file.

Data manipulation using dplyr and tidyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. dplyr is a package for making tabular data manipulation easier. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

The tidyverse package is an “umbrella-package” that installs tidyr, dplyr, and several other packages useful for data analysis, such as ggplot2, tibble, etc.

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

  1. The results from a base R function sometimes depend on the type of data.
  2. R expressions are sometimes used in a non-standard way, which can be confusing for new learners.
  3. Some functions have hidden arguments with default behavior that new learners are not aware of.

You should already have installed and loaded the tidyverse package. If you haven’t done so, type install.packages("tidyverse") straight into the console. Then, to load the package, type library(tidyverse).

What are dplyr and tidyr?

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove this limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
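As a rough illustration of the database workflow described above, the sketch below uses an in-memory SQLite database as a stand-in for a large external database. It assumes the DBI, RSQLite and dbplyr packages are installed (none of them is required elsewhere in this episode), and the table name "samples" is chosen here for illustration. The four rows are copied from the samples data shown later in this episode.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed. The in-memory SQLite
# database stands in for a real external database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Four rows copied from the samples data used later in this episode
toy <- data.frame(
  patient_id = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  sex        = c("female", "male", "male", "female"),
  ct         = c(41.5, 15.3, 25.3, 27)
)
DBI::dbWriteTable(con, "samples", toy)

# tbl() creates a reference to the remote table; no rows are loaded yet
remote <- tbl(con, "samples")

# The filter is translated to SQL and run inside the database;
# collect() pulls only the query result into R as a tibble
high_ct <- remote %>%
  filter(ct >= 35) %>%
  collect()

DBI::dbDisconnect(con)
```

Only the rows that satisfy the query end up in memory, which is what makes this approach usable for databases much larger than the available RAM.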

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups (e.g., a time period, an experimental unit like a plot or a batch number). Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.
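The two layouts described above can be sketched with the tidyr functions pivot_wider() and pivot_longer(). The data below are hypothetical and made up purely for illustration (plots "A" and "B" with height and weight measurements); they are not part of the episode's dataset.

```r
library(tidyr)

# Hypothetical long-format data: one row per measurement
long <- tibble::tibble(
  plot        = c("A", "A", "B", "B"),
  measurement = c("height", "weight", "height", "weight"),
  value       = c(10, 2.5, 12, 3.1)
)

# Wide format: one row per plot, one column per measurement type
wide <- pivot_wider(long, names_from = measurement, values_from = value)

# Back to long format: one row per measurement again
long_again <- pivot_longer(wide, cols = c(height, weight),
                           names_to = "measurement", values_to = "value")
```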

To learn more about dplyr and tidyr after the workshop, you may want to check out the handy data transformation with dplyr cheatsheet and the one about tidyr.

In this episode, we will use the SARS-CoV-2 samples dataset that was introduced in the previous episode. You can read the data using the read_csv() function from the tidyverse package readr.

download.file(
  url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/covid_samples.csv",
  destfile = "data_raw/covid_samples.csv")
## load the tidyverse packages, incl. dplyr
library(tidyverse)

We can then read the data into memory:

samples <- read_csv("data_raw/covid_samples.csv")
Rows: 29 Columns: 8
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): patient_id, country, region, disease_outcome, sex
dbl  (2): age, ct
date (1): collection_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Like in the previous episode, we transform the columns disease_outcome and sex into factors:

samples$disease_outcome <- factor(samples$disease_outcome)
samples$sex <- factor(samples$sex)
## inspect the data
str(samples)
spec_tbl_df [29 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ patient_id     : chr [1:29] "OAS-29_1" "OAS-29_10" "OAS-29_11" "OAS-29_12" ...
 $ collection_date: Date[1:29], format: "2020-03-31" "2020-03-31" ...
 $ country        : chr [1:29] "Italy" "Italy" "Italy" "Italy" ...
 $ region         : chr [1:29] "Turin" "Turin" "Turin" "Turin" ...
 $ age            : num [1:29] 48 35 59 60 83 21 44 55 81 63 ...
 $ disease_outcome: Factor w/ 2 levels "dead","recovered": 1 NA 2 2 1 1 2 2 1 2 ...
 $ sex            : Factor w/ 2 levels "female","male": 1 2 2 1 1 2 1 2 1 1 ...
 $ ct             : num [1:29] 41.5 15.3 25.3 27 25.3 ...
 - attr(*, "spec")=
  .. cols(
  ..   patient_id = col_character(),
  ..   collection_date = col_date(format = ""),
  ..   country = col_character(),
  ..   region = col_character(),
  ..   age = col_double(),
  ..   disease_outcome = col_character(),
  ..   sex = col_character(),
  ..   ct = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
## preview the data
view(samples)

Next, we’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values
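Of the functions listed above, mutate() is the only one not demonstrated in the rest of this section, so here is a minimal sketch. It uses four rows copied from the str() output above; the column name high_ct is hypothetical, and the threshold of 35 is the one used for "high Ct" later in the episode.

```r
library(dplyr)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  patient_id = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  sex        = c("female", "male", "male", "female"),
  ct         = c(41.5, 15.3, 25.3, 27)
)

# mutate() adds a new column computed from existing ones;
# here a logical flag for samples with a Ct value of 35 or more
toy_flagged <- toy %>%
  mutate(high_ct = ct >= 35)
```

The original columns are kept unchanged; mutate() only appends (or overwrites) the columns named in the call.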

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (samples), and the subsequent arguments are the columns to keep.

select(samples, patient_id, sex, ct)

To select all columns except certain ones, put a “-” in front of the variable to exclude it.

select(samples, -collection_date, -country)

This will select all the variables in samples except collection_date and country.

To choose rows based on a specific criterion, use filter():

filter(samples, sex == "female")
# A tibble: 16 × 8
   patient_id collection_date country region   age disease_outcome sex       ct
   <chr>      <date>          <chr>   <chr>  <dbl> <fct>           <fct>  <dbl>
 1 OAS-29_1   2020-03-31      Italy   Turin     48 dead            female  41.5
 2 OAS-29_12  2020-03-31      Italy   Turin     60 recovered       female  27  
 3 OAS-29_13  2020-03-31      Italy   Turin     83 dead            female  25.3
 4 OAS-29_15  2020-04-01      Italy   Turin     44 recovered       female  33.7
 5 OAS-29_17  2020-03-31      Italy   Turin     81 dead            female  35.7
 6 OAS-29_18  2020-04-01      Italy   Turin     63 recovered       female  19.3
 7 OAS-29_19  2020-04-01      Italy   Turin     78 dead            female  26.7
 8 OAS-29_2   2020-03-31      Italy   Turin     24 dead            female  37  
 9 OAS-29_22  2020-04-08      Italy   Turin     56 recovered       female  28.3
10 OAS-29_25  2020-04-08      Italy   Turin     80 recovered       female  30.7
11 OAS-29_26  2020-04-08      Italy   Turin     19 <NA>            female  36.7
12 OAS-29_28  2020-04-07      Italy   Turin     30 recovered       female  37.5
13 OAS-29_3   2020-03-31      Italy   Turin     41 dead            female  39  
14 OAS-29_6   2020-03-31      Italy   Turin     59 dead            female  30  
15 OAS-29_8   2020-03-31      Italy   Turin     76 dead            female  30  
16 OAS-29_9   2020-03-31      Italy   Turin     49 dead            female  18.3
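The output above still contains a row with a missing disease_outcome, because the condition only tests sex. One detail worth knowing is that filter() keeps only the rows where the condition evaluates to TRUE, so a comparison against a column containing NA silently drops those rows. The sketch below illustrates this with four rows copied from the samples data; the object names are chosen here for illustration.

```r
library(dplyr)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  patient_id      = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  disease_outcome = c("dead", NA, "recovered", "recovered"),
  sex             = c("female", "male", "male", "female"),
  ct              = c(41.5, 15.3, 25.3, 27)
)

# NA == "recovered" evaluates to NA, so OAS-29_10 is dropped
recovered <- filter(toy, disease_outcome == "recovered")

# Conditions separated by commas are combined with a logical AND
recovered_female <- filter(toy, disease_outcome == "recovered", sex == "female")

# is.na() selects the rows with a missing outcome explicitly
missing_outcome <- filter(toy, is.na(disease_outcome))
```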

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

samples_female <- filter(samples, sex == "female")
samples_female_sml <- select(samples_female, patient_id, sex, ct)

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

samples_female <- select(
  filter(samples, sex == "female"), patient_id, sex, ct)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, the pipe, is a more recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

samples %>%
  filter(sex == "female") %>%
  select(patient_id, sex, ct)
# A tibble: 16 × 3
   patient_id sex       ct
   <chr>      <fct>  <dbl>
 1 OAS-29_1   female  41.5
 2 OAS-29_12  female  27  
 3 OAS-29_13  female  25.3
 4 OAS-29_15  female  33.7
 5 OAS-29_17  female  35.7
 6 OAS-29_18  female  19.3
 7 OAS-29_19  female  26.7
 8 OAS-29_2   female  37  
 9 OAS-29_22  female  28.3
10 OAS-29_25  female  30.7
11 OAS-29_26  female  36.7
12 OAS-29_28  female  37.5
13 OAS-29_3   female  39  
14 OAS-29_6   female  30  
15 OAS-29_8   female  30  
16 OAS-29_9   female  18.3

In the above code, we use the pipe to send the samples dataset first through filter() to keep rows where sex equals "female", then through select() to keep only the patient_id, sex, and ct columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame samples, then we filtered for rows with sex == "female", then we selected columns patient_id, sex, and ct. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.
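To make the "first argument" rule concrete, the following sketch checks that a piped expression and its nested equivalent give the same result. It uses four rows copied from the samples data; the object names piped and nested are chosen here for illustration.

```r
library(dplyr)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  patient_id = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  sex        = c("female", "male", "male", "female"),
  ct         = c(41.5, 15.3, 25.3, 27)
)

# Piped version: read as "take toy, then filter, then select"
piped <- toy %>%
  filter(sex == "female") %>%
  select(patient_id, ct)

# Nested version: the same calls, written inside out
nested <- select(filter(toy, sex == "female"), patient_id, ct)

# Both approaches produce the same data frame
all.equal(piped, nested)
```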

If we want to create a new object with this smaller version of the data, we can assign it a new name:

samples_female <- samples %>%
  filter(sex == "female") %>%
  select(patient_id, sex, ct)

samples_female
# A tibble: 16 × 3
   patient_id sex       ct
   <chr>      <fct>  <dbl>
 1 OAS-29_1   female  41.5
 2 OAS-29_12  female  27  
 3 OAS-29_13  female  25.3
 4 OAS-29_15  female  33.7
 5 OAS-29_17  female  35.7
 6 OAS-29_18  female  19.3
 7 OAS-29_19  female  26.7
 8 OAS-29_2   female  37  
 9 OAS-29_22  female  28.3
10 OAS-29_25  female  30.7
11 OAS-29_26  female  36.7
12 OAS-29_28  female  37.5
13 OAS-29_3   female  39  
14 OAS-29_6   female  30  
15 OAS-29_8   female  30  
16 OAS-29_9   female  18.3

Note that the name of the final data frame, samples_female, is the leftmost part of this expression.

Challenge 4.1

Using pipes, subset the samples data to include only males with Ct values (column ct) greater than or equal to 35, and retain only the columns patient_id and disease_outcome.

Solution

samples %>%
 filter(sex == "male" & ct >= 35) %>%
 select(patient_id, disease_outcome)
# A tibble: 4 × 2
  patient_id disease_outcome
  <chr>      <fct>          
1 OAS-29_16  recovered      
2 OAS-29_20  <NA>           
3 OAS-29_27  dead           
4 OAS-29_29  recovered      

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean Ct value by disease outcome:

samples %>%
  group_by(disease_outcome) %>%
  summarize(mean = mean(ct))
# A tibble: 3 × 2
  disease_outcome  mean
  <fct>           <dbl>
1 dead             28.9
2 recovered        28.6
3 <NA>             31.7

You can also group by multiple columns:

samples %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct))
# A tibble: 6 × 3
# Groups:   disease_outcome [3]
  disease_outcome sex     mean
  <fct>           <fct>  <dbl>
1 dead            female  31.5
2 dead            male    25.1
3 recovered       female  29.4
4 recovered       male    27.7
5 <NA>            female  36.7
6 <NA>            male    29.2
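The "# Groups: disease_outcome [3]" line in the output above indicates that the result is still grouped: by default, summarize() removes only the last grouping variable (here sex). The sketch below, which assumes dplyr 1.0.0 or later for the .groups argument, shows how to return an ungrouped result instead. It uses four rows copied from the samples data.

```r
library(dplyr)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  patient_id      = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  disease_outcome = c("dead", NA, "recovered", "recovered"),
  sex             = c("female", "male", "male", "female"),
  ct              = c(41.5, 15.3, 25.3, 27)
)

# Default behavior: the result stays grouped by disease_outcome
by_both <- toy %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct))
group_vars(by_both)

# .groups = "drop" returns a plain, ungrouped tibble
flat <- toy %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct), .groups = "drop")
group_vars(flat)
```

Dropping the grouping is a reasonable default when the summary is the end of the analysis, since leftover groups can change how later verbs behave.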

Once the data are grouped, you can also compute multiple summaries at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum Ct value for each disease outcome for each sex:

samples %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct),
            min = min(ct))
# A tibble: 6 × 4
# Groups:   disease_outcome [3]
  disease_outcome sex     mean   min
  <fct>           <fct>  <dbl> <dbl>
1 dead            female  31.5  18.3
2 dead            male    25.1  16  
3 recovered       female  29.4  19.3
4 recovered       male    27.7  16.3
5 <NA>            female  36.7  36.7
6 <NA>            male    29.2  15.3

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min to put the lowest numbers first:

samples %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct),
            min = min(ct)) %>%
  arrange(min)
# A tibble: 6 × 4
# Groups:   disease_outcome [3]
  disease_outcome sex     mean   min
  <fct>           <fct>  <dbl> <dbl>
1 <NA>            male    29.2  15.3
2 dead            male    25.1  16  
3 recovered       male    27.7  16.3
4 dead            female  31.5  18.3
5 recovered       female  29.4  19.3
6 <NA>            female  36.7  36.7

To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of minimum Ct:

samples %>%
  group_by(disease_outcome, sex) %>%
  summarize(mean = mean(ct),
            min = min(ct)) %>%
  arrange(desc(min))
# A tibble: 6 × 4
# Groups:   disease_outcome [3]
  disease_outcome sex     mean   min
  <fct>           <fct>  <dbl> <dbl>
1 <NA>            female  36.7  36.7
2 recovered       female  29.4  19.3
3 dead            female  31.5  18.3
4 recovered       male    27.7  16.3
5 dead            male    25.1  16  
6 <NA>            male    29.2  15.3

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each disease outcome, we would do:

samples %>%
  count(disease_outcome) 
# A tibble: 3 × 2
  disease_outcome     n
  <fct>           <int>
1 dead               15
2 recovered          11
3 <NA>                3

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, samples %>% count(disease_outcome) is equivalent to:

samples %>%
  group_by(disease_outcome) %>%
  summarize(n = n())
# A tibble: 3 × 2
  disease_outcome     n
  <fct>           <int>
1 dead               15
2 recovered          11
3 <NA>                3

We can also combine count() with other functions such as filter(). Here we will count the disease outcomes for only the samples with high Ct values, i.e. with a Ct value greater than or equal to 35.

samples %>%
  filter(ct >= 35) %>%
  count(disease_outcome) 
# A tibble: 3 × 2
  disease_outcome     n
  <fct>           <int>
1 dead                5
2 recovered           3
3 <NA>                2

The example above shows the use of count() to count the number of rows/observations for one factor (i.e., disease_outcome). If we wanted to count a combination of factors, such as disease_outcome and sex, we would specify the first and the second factor as the arguments of count():

samples %>%
  filter(ct >= 35) %>%
  count(disease_outcome, sex) 
# A tibble: 6 × 3
  disease_outcome sex        n
  <fct>           <fct>  <int>
1 dead            female     4
2 dead            male       1
3 recovered       female     1
4 recovered       male       2
5 <NA>            female     1
6 <NA>            male       1

Building on the code above, we can use arrange() to sort the table by several criteria to make comparisons easier. For instance, we might want to sort the table above (i) by the levels of sex in alphabetical order and (ii) within each sex, by count in descending order:

samples %>%
  filter(ct >= 35) %>%
  count(disease_outcome, sex)  %>%
  arrange(sex, desc(n))
# A tibble: 6 × 3
  disease_outcome sex        n
  <fct>           <fct>  <int>
1 dead            female     4
2 recovered       female     1
3 <NA>            female     1
4 recovered       male       2
5 dead            male       1
6 <NA>            male       1

From the table above, we may learn that, for instance, there is one female and one male for whom the disease outcome is not specified (i.e. NA).
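When the only goal is to put the largest counts first, count() also accepts a sort argument, which saves the separate arrange() step. A minimal sketch, using four rows copied from the samples data (the object name sorted is chosen here for illustration):

```r
library(dplyr)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  patient_id      = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  disease_outcome = c("dead", NA, "recovered", "recovered"),
  sex             = c("female", "male", "male", "female")
)

# sort = TRUE orders the result by n, largest count first
sorted <- toy %>%
  count(disease_outcome, sort = TRUE)
sorted
```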

Challenge 4.2

  • For each collection date in the samples data frame, how many samples have a Ct value greater than or equal to 35?

Solution

samples %>%
 filter(ct >= 35) %>%
 count(collection_date)
# A tibble: 4 × 2
  collection_date     n
  <date>          <int>
1 2020-03-31          5
2 2020-04-01          1
3 2020-04-07          2
4 2020-04-08          2

  • Use group_by() and summarize() to find the mean and standard deviation of the Ct value for each disease outcome and sex.

    Hint: calculate the standard deviation with the sd() function.

Solution

samples %>%
    group_by(disease_outcome, sex) %>%
    summarize(mean = mean(ct),
              stdev = sd(ct))
# A tibble: 6 × 4
# Groups:   disease_outcome [3]
  disease_outcome sex     mean stdev
  <fct>           <fct>  <dbl> <dbl>
1 dead            female  31.5  7.44
2 dead            male    25.1  6.87
3 recovered       female  29.4  6.22
4 recovered       male    27.7 11.6 
5 <NA>            female  36.7 NA   
6 <NA>            male    29.2 19.6 
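The NA in the stdev column is expected: the standard deviation of a single value is undefined, and sd() returns NA for a vector of length one. Adding n() to the summary makes such cases easy to spot. The sketch below uses four rows copied from the samples data and groups by disease_outcome only, to keep the example small.

```r
library(dplyr)

# sd() of a single value is NA
sd(36.7)

# Four rows copied from the samples data shown above
toy <- tibble::tibble(
  disease_outcome = c("dead", NA, "recovered", "recovered"),
  ct              = c(41.5, 15.3, 25.3, 27)
)

# Reporting the group size alongside the statistics shows
# which groups are too small for a standard deviation
stats <- toy %>%
  group_by(disease_outcome) %>%
  summarize(n = n(),
            mean = mean(ct),
            stdev = sd(ct))
stats
```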

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data_raw folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data directory, so even if the files it contains are deleted, we can always re-generate them.
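The folder can be created from within R, which keeps the whole workflow in the script. A minimal sketch using base R, assuming the working directory is the project folder that also contains data_raw:

```r
# Create the output folder if it does not exist yet.
# showWarnings = FALSE suppresses the warning issued when
# the folder is already there, so the script can be re-run safely.
dir.create("data", showWarnings = FALSE)

# Confirm that the folder is now available
dir.exists("data")
```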

We will conclude this episode by generating a CSV file with a small dataset that contains only samples with a Ct value greater than or equal to 35:

# Keep only samples with high Ct values (ct >= 35)
samples_high_ct <- samples %>%
  filter(ct >= 35)

# Write data frame to CSV
write_csv(samples_high_ct, file = "data/samples_high_ct.csv")
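One thing to keep in mind when sharing the exported file: CSV stores factor columns as plain text, so they come back as character columns when the file is read again and need to be converted as at the start of this episode. The sketch below shows the round trip on four rows copied from the samples data. It writes to a temporary folder (an assumption made here so the example does not touch the project folders) and uses show_col_types = FALSE to quiet the column specification message.

```r
library(dplyr)
library(readr)

# Four rows copied from the samples data, with sex stored as a factor
toy <- tibble::tibble(
  patient_id = c("OAS-29_1", "OAS-29_10", "OAS-29_11", "OAS-29_12"),
  sex        = factor(c("female", "male", "male", "female")),
  ct         = c(41.5, 15.3, 25.3, 27)
)

# Keep the high-Ct rows and write them to a temporary file
out_file <- file.path(tempdir(), "samples_high_ct.csv")
toy %>%
  filter(ct >= 35) %>%
  write_csv(file = out_file)

# Read the file back: sex is now a character column
back <- read_csv(out_file, show_col_types = FALSE)
class(back$sex)

# Convert it to a factor again before further analysis
back$sex <- factor(back$sex)
```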

Key Points

  • Use the dplyr package to manipulate dataframes.

  • Use select() to choose variables from a dataframe.

  • Use filter() to choose data based on values.

  • Use group_by() and summarize() to work with subsets of data.