This worksheet relies on the teachingtools and tidyverse packages. To make sure that teachingtools and tidyverse are loaded, run the following code in your console.
If you need to install teachingtools then follow the instructions in the introduction.
xfun::pkg_attach2("teachingtools","tidyverse")
This worksheet covers the material from Intermediate data wrangling with tidyverse. All these questions should be answered using tidyverse style syntax using the pipe operator (%>%
) where appropriate.
The following questions rely on the data set num_dist
. This data has the results of a numerical distance task. The column id
is the participant ID, num_dist
is the condition label, and rt
is the response time. The data is organised in long format.
num_dist
# A tibble: 20 x 3
id num_dist rt
<chr> <chr> <dbl>
1 s01 close 649.
2 s01 far 891.
3 s02 close 703.
4 s02 far 809.
5 s03 close 620.
6 s03 far 804.
7 s04 close 780.
8 s04 far 826.
9 s05 close 768.
10 s05 far 849.
11 s06 close 685.
12 s06 far 701.
13 s07 close 709.
14 s07 far 883.
15 s08 close 749.
16 s08 far 820.
17 s09 close 709.
18 s09 far 831.
19 s10 close 642.
20 s10 far 869.
Pivot the num_dist
dataset so that it is in wide format. It should have three columns. One for id
, one for the close
condition, and one for the far
condition. Make sure you save the result in a variable called num_dist_wide
, because you’ll need it at the next step.
num_dist_wide <- num_dist %>%
pivot_wider(id_cols = id,
names_from = num_dist,
values_from = rt)
num_dist_wide
# A tibble: 10 x 3
id close far
<chr> <dbl> <dbl>
1 s01 649. 891.
2 s02 703. 809.
3 s03 620. 804.
4 s04 780. 826.
5 s05 768. 849.
6 s06 685. 701.
7 s07 709. 883.
8 s08 749. 820.
9 s09 709. 831.
10 s10 642. 869.
Using the output from the previous question, pivot it to long form. The output should have the same column names as the original (i.e., num_dist
and rt
)
num_dist_wide %>% pivot_longer(cols = -id, names_to = "num_dist", values_to="rt")
# A tibble: 20 x 3
id num_dist rt
<chr> <chr> <dbl>
1 s01 close 649.
2 s01 far 891.
3 s02 close 703.
4 s02 far 809.
5 s03 close 620.
6 s03 far 804.
7 s04 close 780.
8 s04 far 826.
9 s05 close 768.
10 s05 far 849.
11 s06 close 685.
12 s06 far 701.
13 s07 close 709.
14 s07 far 883.
15 s08 close 749.
16 s08 far 820.
17 s09 close 709.
18 s09 far 831.
19 s10 close 642.
20 s10 far 869.
The following questions rely on the starwars2
data set. This data set has information about the height, weight, and species of Human and Gungan Star Wars characters.
starwars2
# A tibble: 24 x 4
name height mass species
<chr> <dbl> <dbl> <chr>
1 Luke Skywalker 1.72 77 Human
2 Darth Vader 2.02 136 Human
3 Leia Organa 1.5 49 Human
4 Owen Lars 1.78 120 Human
5 Beru Whitesun lars 1.65 75 Human
6 Biggs Darklighter 1.83 84 Human
7 Obi-Wan Kenobi 1.82 77 Human
8 Anakin Skywalker 1.88 84 Human
9 Han Solo 1.8 80 Human
10 Wedge Antilles 1.7 77 Human
# … with 14 more rows
You’re interested in the population health of Humans and Gungans in the Star Wars universe. Using their height and weight, calculate the BMI of each character. BMI is calculated as: \[ \mathrm{bmi} = \frac{\mathrm{mass}}{\mathrm{height}^2} \]
mutate()
function.
starwars_bmi <- starwars2 %>% mutate(bmi = mass / (height^2))
starwars_bmi
# A tibble: 24 x 5
name height mass species bmi
<chr> <dbl> <dbl> <chr> <dbl>
1 Luke Skywalker 1.72 77 Human 26.0
2 Darth Vader 2.02 136 Human 33.3
3 Leia Organa 1.5 49 Human 21.8
4 Owen Lars 1.78 120 Human 37.9
5 Beru Whitesun lars 1.65 75 Human 27.5
6 Biggs Darklighter 1.83 84 Human 25.1
7 Obi-Wan Kenobi 1.82 77 Human 23.2
8 Anakin Skywalker 1.88 84 Human 23.8
9 Han Solo 1.8 80 Human 24.7
10 Wedge Antilles 1.7 77 Human 26.6
# … with 14 more rows
You may want to save the output into a variable, so you can use it on the next step.
Now that you know the BMI of all the characters, you’re interested in which characters may have exceeded the evil empire’s healthy weight guidelines. For Humans, this is defined as a BMI > 30. However, for Gungans this is defined as a BMI > 15. Add a new column to your data table that has a TRUE
if the character is beyond recommended BMI for their species or otherwise has a FALSE
.
case_when()
function.
starwars_bmi %>%
mutate(excess = case_when(bmi > 30 & species == "Human" ~ TRUE,
bmi > 15 & species == "Gungan" ~ TRUE,
TRUE ~ FALSE))
# A tibble: 24 x 6
name height mass species bmi excess
<chr> <dbl> <dbl> <chr> <dbl> <lgl>
1 Luke Skywalker 1.72 77 Human 26.0 FALSE
2 Darth Vader 2.02 136 Human 33.3 TRUE
3 Leia Organa 1.5 49 Human 21.8 FALSE
4 Owen Lars 1.78 120 Human 37.9 TRUE
5 Beru Whitesun lars 1.65 75 Human 27.5 FALSE
6 Biggs Darklighter 1.83 84 Human 25.1 FALSE
7 Obi-Wan Kenobi 1.82 77 Human 23.2 FALSE
8 Anakin Skywalker 1.88 84 Human 23.8 FALSE
9 Han Solo 1.8 80 Human 24.7 FALSE
10 Wedge Antilles 1.7 77 Human 26.6 FALSE
# … with 14 more rows
The stroop
data set contains data from 10 people performing the Stroop task in two languages (English and Tamil). It is organised in wide format.
stroop
# A tibble: 10 x 5
id english_congruent english_incongruent tamil_congruent tamil_incongruent
<chr> <dbl> <dbl> <dbl> <dbl>
1 s01 644. 507. 845. 1197.
2 s02 675. 530. 833. 803.
3 s03 652. 647. 897. 894.
4 s04 677. 500. 871. 1141.
5 s05 591. 578. 864. 985.
6 s06 533. 592. 839. 981.
7 s07 565. 578. 870. 1234.
8 s08 602. 580. 854. 1252.
9 s09 646. 536. 823. 1109.
10 s10 698. 690. 848. 867.
Transform the stroop
data into long format. The new data table should have a column for the condition labels (congruent or incongruent) and one for the language (whether it was in English or in Tamil)
_
stroop %>%
pivot_longer(cols = -id,
names_to = c("language","condition"),
names_sep = "_",
values_to = "RT")
# A tibble: 40 x 4
id language condition RT
<chr> <chr> <chr> <dbl>
1 s01 english congruent 644.
2 s01 english incongruent 507.
3 s01 tamil congruent 845.
4 s01 tamil incongruent 1197.
5 s02 english congruent 675.
6 s02 english incongruent 530.
7 s02 tamil congruent 833.
8 s02 tamil incongruent 803.
9 s03 english congruent 652.
10 s03 english incongruent 647.
# … with 30 more rows
Create a summary table with the average response time for the trials performed in English and the trials performed in Tamil. Answer this question by extending your previous answer using the pipe %>%
operator
stroop %>%
pivot_longer(cols = -id,
names_to = c("language","condition"),
names_sep = "_",
values_to = "RT")%>%
group_by(language) %>%
summarise(RT = mean(RT))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
language RT
<chr> <dbl>
1 english 601.
2 tamil 950.
Create a vector that contains the ID codes for any participants that have a response time of greater than 1200 ms in any one of the (four) conditions.
max()
will tell you the maxium value in a group
stroop %>%
pivot_longer(cols = -id,
names_to = c("language","condition"),
names_sep = "_", values_to = "RT") %>%
group_by(id) %>%
summarise(RT = max(RT), .groups = 'drop') %>%
filter(RT > 1200) %>%
pull(id)
[1] "s07" "s08"
# More advanced tidyverse syntax might employ this instead
stroop %>% filter_at(vars(-id), any_vars(. > 1200)) %>% pull(id)
[1] "s07" "s08"