Overview
Teaching: 0 min
Exercises: 0 minQuestions
Objectives
1. Introducing R and RStudio IDE
Discussion: Your experience and expectations
- What has motivated you to learn R? Have you had a research question for which spreadsheet programs such as Excel have proven difficult to use, or where the size of the data set created issues?
- In what situations do you imagine that using R will be most valuable to you?
2. R Basics
Challenge 2.1
What are the values after each statement in the following?
mass <- 47.5 # mass? age <- 122 # age? mass <- mass * 2.0 # mass? age <- age - 20 # age? mass_index <- mass/age # mass_index?
Challenge 2.2
Write a function that returns the sum of two arguments given to it.
Solution
# Calculate the sum of x and y add_two_args <- function(x, y) { x + y } # Call the function add_two_args(9, 7)
[1] 16
Challenge 2.3
- We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
Answer
R implicitly converts them all to be the same type.
What will happen in each of these examples? (hint: use
class()
to check the data type of your objects):num_char <- c(1, 2, 3, "a") num_logical <- c(1, 2, 3, TRUE) char_logical <- c("a", "b", "c", TRUE) tricky <- c(1, 2, 3, "4")
Why do you think it happens?
Answer
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
How many values in
combined_logical
are"TRUE"
(as a character) in the following example (reusing the 2..._logical
s from above):combined_logical <- c(num_logical, char_logical)
Answer
Only one. There is no memory of past data types, and the coercion happens the first time the vector is evaluated. Therefore, the
TRUE
innum_logical
gets converted into a1
before it gets converted into"1"
incombined_logical
.
- You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
Answer
logical → numeric → character ← logical
Challenge 2.4 (optional)
- Can you figure out why
"four" > "five"
returnsTRUE
?Answer
When using “>” or “<” on strings, R compares their alphabetical order. Here “four” comes after “five”, and therefore is “greater than” it.
Challenge 2.5
Using this vector of heights in inches, create a new vector,
heights_no_na
, with the NAs removed.heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
Use the function
median()
to calculate the median of theheights
vector.Use R to figure out how many people in the set are taller than 67 inches.
Solution
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) # 1. heights_no_na <- heights[!is.na(heights)] # or heights_no_na <- na.omit(heights) # or heights_no_na <- heights[complete.cases(heights)] # 2. median(heights, na.rm = TRUE)
[1] 64
# 3. heights_above_67 <- heights_no_na[heights_no_na > 67] length(heights_above_67)
[1] 6
3. R Basics continued - factors and data frames
Loading the samples data
download.file(
url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/covid_samples.csv",
destfile = "data_raw/covid_samples.csv")
## load the tidyverse packages, incl. dplyr
library(tidyverse)
samples <- read_csv("data_raw/covid_samples.csv")
Inspecting data frames
- Size:
dim(samples)
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(samples)
- returns the number of rowsncol(samples)
- returns the number of columns
- Content:
head(samples)
- shows the first 6 rowstail(samples)
- shows the last 6 rows
- Names:
names(samples)
- returns the column names (synonym ofcolnames()
fordata.frame
objects)rownames(samples)
- returns the row names
- Summary:
str(samples)
- structure of the object and information about the class, length and content of each columnsummary(samples)
- summary statistics for each column
Challenges
Challenge 3.1
Based on the output of
str(samples)
, can you answer the following questions?
- What is the class of the object
samples
?- How many rows and how many columns are in this object?
Solution
str(samples)
spec_tbl_df [29 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ patient_id : chr [1:29] "OAS-29_1" "OAS-29_10" "OAS-29_11" "OAS-29_12" ... $ collection_date: Date[1:29], format: "2020-03-31" "2020-03-31" ... $ country : chr [1:29] "Italy" "Italy" "Italy" "Italy" ... $ region : chr [1:29] "Turin" "Turin" "Turin" "Turin" ... $ age : num [1:29] 48 35 59 60 83 21 44 55 81 63 ... $ disease_outcome: chr [1:29] "dead" NA "recovered" "recovered" ... $ sex : chr [1:29] "female" "male" "male" "female" ... $ ct : num [1:29] 41.5 15.3 25.3 27 25.3 ... - attr(*, "spec")= .. cols( .. patient_id = col_character(), .. collection_date = col_date(format = ""), .. country = col_character(), .. region = col_character(), .. age = col_double(), .. disease_outcome = col_character(), .. sex = col_character(), .. ct = col_double() .. ) - attr(*, "problems")=<externalptr>
- The object
samples
is of classdata.frame
, or more specifically atibble
(spec_tbl_df/tbl_df/tbl/data.frame
)- Rows and columns: 29 rows and 8 columns
Challenge 3.2
Create a
data.frame
(samples_20
) containing only the data in row 20 of thesamples
dataset.Notice how
nrow()
gave you the number of rows in adata.frame
?
- Use that number to pull out just that last row in the data frame.
- Compare that with what you see as the last row using
tail()
to make sure it’s meeting expectations.- Pull out that last row using
nrow()
instead of the row number.- Create a new data frame (
samples_last
) from that last row.Combine
nrow()
with the-
notation above to reproduce the behavior ofhead(samples)
, keeping just the first 6 rows of the samples dataset.Solution
## 1. samples_20 <- samples[20, ] ## 2. # Saving `n_rows` to improve readability and reduce duplication n_rows <- nrow(samples) samples_last <- samples[n_rows, ] ## 3. samples_head<- samples[-(7:n_rows), ]
Challenge 3.3
Change the columns
disease_outcome
andsex
in thesamples
data frame into factors.Using the functions you have learnt so far, can you find out…
- How many levels are there in the
sex
column?- How many individuals are listed as “dead” in the
disease_outcome
column?Solution
samples$disease_outcome <- factor(samples$disease_outcome) samples$sex <- factor(samples$sex) nlevels(samples$sex)
[1] 2
summary(samples$disease_outcome)
dead recovered NA's 15 11 3
- How many levels in the
sex
column? There are 2 levels.- How many are listed as dead? There are 15 individuals listed as “dead” in the
disease_outcome
column. column.
Challenge 3.4
- Store a copy of the factor column
sex
to a new object namedsex
.- In the new object, rename “female” and “male” to “F” and “M” respectively.
- Reorder the factor levels so that “M” comes before “F”.
- Create a bar plot of the factor.
Solution
sex = samples$sex levels(sex)[1:2] <- c("F", "M") sex <- factor(sex, levels = c("M", "F")) plot(sex)
Challenge 3.5 (optional)
We have seen how data frames are created when using
read_csv()
, but they can also be created by hand with thedata.frame()
function. There are a few mistakes in this hand-crafteddata.frame
. Can you spot and fix them? Don’t hesitate to experiment!animal_data <- data.frame( animal = c(dog, cat, sea cucumber, sea urchin), feel = c("furry", "squishy", "spiny"), weight = c(45, 8 1.1, 0.8) )
Can you predict the class for each of the columns in the following example? Check your guesses using
str(country_climate)
:
- Are they what you expected? Why? Why not?
- What would you need to change to ensure that each column had the accurate data type?
country_climate <- data.frame( country = c("Canada", "Panama", "South Africa", "Australia"), climate = c("cold", "hot", "temperate", "hot/temperate"), temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) )
Solution
- missing quotations around the names of the animals
- missing one entry in the
feel
column (probably for one of the furry animals)- missing one comma in the
weight
columncountry
,climate
,temperature
, andnorthern_hemisphere
are characters;has_kangaroo
is numeric- using
factor()
one could replace character columns with factors columns- removing the quotes in
temperature
andnorthern_hemisphere
and replacing 1 by TRUE in thehas_kangaroo
column would give what was probably intended
4. Aggregating and Analyzing Data with dplyr
Loading the samples data
download.file(
url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/covid_samples.csv",
destfile = "data_raw/covid_samples.csv")
## load the tidyverse packages, incl. dplyr
library(tidyverse)
samples <- read_csv("data_raw/covid_samples.csv")
samples$disease_outcome <- factor(samples$disease_outcome)
samples$sex <- factor(samples$sex)
Useful dplyr
functions
select()
: subset columnsfilter()
: subset rows on conditionsgroup_by()
andsummarize()
: create summary statistics on grouped dataarrange()
: sort resultscount()
: count discrete values
Challenges
Challenge 4.1
Using pipes, subset the
samples
data to include only males with Ct values (columnct
) greater than or equal to 35, and retain only the columnspatient_id
anddisease_outcome
.Solution
samples %>% filter(sex == "male" & ct >= 35) %>% select(patient_id, disease_outcome)
# A tibble: 4 × 2 patient_id disease_outcome <chr> <fct> 1 OAS-29_16 recovered 2 OAS-29_20 <NA> 3 OAS-29_27 dead 4 OAS-29_29 recovered
Challenge 4.2
- For each collecting date in the
samples
data frame, how many samples have a Ct value greater than or equal to 35?Solution
samples %>% filter(ct >= 35) %>% count(collection_date)
# A tibble: 4 × 2 collection_date n <date> <int> 1 2020-03-31 5 2 2020-04-01 1 3 2020-04-07 2 4 2020-04-08 2
Use
group_by()
andsummarize()
to find the mean and standard deviation of the Ct value for each disease outcome and sex.Hint: calculate the standard deviation with the
sd()
function.Solution
samples %>% group_by(disease_outcome, sex) %>% summarize(mean = mean(ct), stdev = sd(ct))
# A tibble: 6 × 4 # Groups: disease_outcome [3] disease_outcome sex mean stdev <fct> <fct> <dbl> <dbl> 1 dead female 31.5 7.44 2 dead male 25.1 6.87 3 recovered female 29.4 6.22 4 recovered male 27.7 11.6 5 <NA> female 36.7 NA 6 <NA> male 29.2 19.6
5. Data Visualization with ggplot2
Loading the samples data
download.file(
url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/covid_samples.csv",
destfile = "data_raw/covid_samples.csv")
## load the tidyverse packages, incl. dplyr
library(tidyverse)
samples <- read_csv("data_raw/covid_samples.csv")
samples$disease_outcome <- factor(samples$disease_outcome)
samples$sex <- factor(samples$sex)
Loading the sequencing data
download.file(
url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/covseq_miseq.csv",
destfile = "data_raw/covseq_miseq.csv")
sequencing <- read_csv("data_raw/covseq_miseq.csv")
sequencing$pangolin_lineage <- factor(sequencing$pangolin_lineage)
sequencing$nextclade_clade <- factor(sequencing$nextclade_clade)
Joining the sequencing
and samples
data frames
covseq <- sequencing %>%
inner_join(samples, by = "patient_id")
Challenges
Challenge 5.1
Create a scatter plot with the Ct value plotted against the total number of sequence reads.
Solution
ggplot(data = covseq, mapping = aes(x = ct, y = total_reads)) + geom_point()
Challenge 5.2
Use what you just learned to create a scatter plot of the sex against the cycle threshold (Ct). Is this a good way to show this type of data?
Solution
ggplot(data = covseq, mapping = aes(x = sex, y = ct)) + geom_point(aes(color = sex))
Challenges 5.3
Boxplots are useful summaries, but hide the shape of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn. Replace the box plot with a violin plot; see
geom_violin()
. Modify the code below to show a violin plot instead.ggplot(data = covseq, mapping = aes(x = sex, y = ct)) + geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.5, color = "tomato")
Solution
ggplot(data = covseq, mapping = aes(x = sex, y = ct)) + geom_violin(alpha = 0) + geom_jitter(alpha = 0.5, color = "tomato")
In many types of data, it is important to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components.
- Modify the code below so that the number of reads are shown on a log 10 scale; see
scale_x_log10()
.ggplot(data = covseq, mapping = aes(x = total_reads, y = coverage_pct_10x)) + geom_point(alpha = 0)
Solution
ggplot(data = covseq, mapping = aes(x = total_reads, y = coverage_pct_10x)) + geom_point() + scale_x_log10()
- Add color to the data points on your plot according to the disease outcome.
Solution
ggplot(data = covseq, mapping = aes(x = total_reads, y = coverage_pct_10x)) + geom_point(aes(color = disease_outcome)) + scale_y_log10()
- Replace “NA” with “unknown” (hint: use the
addNA()
andlevels()
functions).Solution
# Add NAs to the factor covseq$disease_outcome <- addNA(covseq$disease_outcome) # Rename the level levels(covseq$disease_outcome)[3] <- "unknown" # Now create the plot in the same way as before ggplot(data = covseq, mapping = aes(x = total_reads, y = coverage_pct_10x)) + geom_point(aes(color = disease_outcome)) + scale_y_log10()
Challenge 5.4
Use what you just learned to create a plot that shows how counts of PANGO lineages (
pangolin_lineage
) differ between disease outcomes.Solution
covseq %>% count(pangolin_lineage, disease_outcome) %>% ggplot(aes(x = disease_outcome, y = n, fill = disease_outcome)) + geom_bar(stat = "identity") + facet_wrap(vars(pangolin_lineage))
Challenge 5.5
With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio
ggplot2
cheat sheet Links to an external site. for inspiration.Here are some ideas:
- See if you can change the thickness of the lines.
- Can you find a way to change the name of the legend? What about its labels?
- Try using a different color palette (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ Links to an external site.).
Key Points