×

This worksheet relies on the teachingtools and tidyverse packages. To make sure that teachingtools and tidyverse are loaded, run the following code in your console.

If you need to install teachingtools then follow the instructions in the introduction.

xfun::pkg_attach2("teachingtools","tidyverse")

This worksheet covers the material from Intermediate data wrangling with tidyverse. All these questions should be answered using tidyverse style syntax using the pipe operator (%>%) where appropriate.

The numerical distance data set

The following questions rely on the data set num_dist. This data has the results of a numerical distance task. The column id is the participant ID, num_dist is the condition label, and rt is the response time. The data is organised in long format.

num_dist
# A tibble: 20 x 3
   id    num_dist    rt
   <chr> <chr>    <dbl>
 1 s01   close     649.
 2 s01   far       891.
 3 s02   close     703.
 4 s02   far       809.
 5 s03   close     620.
 6 s03   far       804.
 7 s04   close     780.
 8 s04   far       826.
 9 s05   close     768.
10 s05   far       849.
11 s06   close     685.
12 s06   far       701.
13 s07   close     709.
14 s07   far       883.
15 s08   close     749.
16 s08   far       820.
17 s09   close     709.
18 s09   far       831.
19 s10   close     642.
20 s10   far       869.

Problem 5.1

Pivot the num_dist dataset so that it is in wide format. It should have three columns. One for id, one for the close condition, and one for the far condition. Make sure you save the result in a variable called num_dist_wide, because you’ll need it at the next step.

Get a hint
  • for pivoting to wider to work, you’ll need to tell the function where to get the names from and where to get the values from
Solution
num_dist_wide <- num_dist %>% 
  pivot_wider(id_cols = id, 
              names_from = num_dist, 
              values_from = rt)
num_dist_wide
# A tibble: 10 x 3
   id    close   far
   <chr> <dbl> <dbl>
 1 s01    649.  891.
 2 s02    703.  809.
 3 s03    620.  804.
 4 s04    780.  826.
 5 s05    768.  849.
 6 s06    685.  701.
 7 s07    709.  883.
 8 s08    749.  820.
 9 s09    709.  831.
10 s10    642.  869.

Problem 5.2

Using the output from the previous question, pivot it to long form. The output should have the same column names as the original (i.e., num_dist and rt)

Get a hint
  • when you pivot to long, the names column and the values column will be given default names unless you specifiy the labels the names should go to and the values should go to
Solution
num_dist_wide %>% pivot_longer(cols = -id, names_to = "num_dist", values_to="rt")
# A tibble: 20 x 3
   id    num_dist    rt
   <chr> <chr>    <dbl>
 1 s01   close     649.
 2 s01   far       891.
 3 s02   close     703.
 4 s02   far       809.
 5 s03   close     620.
 6 s03   far       804.
 7 s04   close     780.
 8 s04   far       826.
 9 s05   close     768.
10 s05   far       849.
11 s06   close     685.
12 s06   far       701.
13 s07   close     709.
14 s07   far       883.
15 s08   close     749.
16 s08   far       820.
17 s09   close     709.
18 s09   far       831.
19 s10   close     642.
20 s10   far       869.

The star wars data set

The following questions rely on the starwars2 data set. This data set has information about the height, weight, and species of Human and Gungan Star Wars characters.

starwars2
# A tibble: 24 x 4
   name               height  mass species
   <chr>               <dbl> <dbl> <chr>  
 1 Luke Skywalker       1.72    77 Human  
 2 Darth Vader          2.02   136 Human  
 3 Leia Organa          1.5     49 Human  
 4 Owen Lars            1.78   120 Human  
 5 Beru Whitesun lars   1.65    75 Human  
 6 Biggs Darklighter    1.83    84 Human  
 7 Obi-Wan Kenobi       1.82    77 Human  
 8 Anakin Skywalker     1.88    84 Human  
 9 Han Solo             1.8     80 Human  
10 Wedge Antilles       1.7     77 Human  
# … with 14 more rows

Problem 5.3

You’re interested in the population health of Humans and Gungans in the Star Wars universe. Using their height and weight, calculate the BMI of each character. BMI is calculated as: \[ \mathrm{bmi} = \frac{\mathrm{mass}}{\mathrm{height}^2} \]

Get a hint
  • You’ll need to use the mutate() function.
Solution
starwars_bmi <- starwars2 %>% mutate(bmi = mass / (height^2))
starwars_bmi
# A tibble: 24 x 5
   name               height  mass species   bmi
   <chr>               <dbl> <dbl> <chr>   <dbl>
 1 Luke Skywalker       1.72    77 Human    26.0
 2 Darth Vader          2.02   136 Human    33.3
 3 Leia Organa          1.5     49 Human    21.8
 4 Owen Lars            1.78   120 Human    37.9
 5 Beru Whitesun lars   1.65    75 Human    27.5
 6 Biggs Darklighter    1.83    84 Human    25.1
 7 Obi-Wan Kenobi       1.82    77 Human    23.2
 8 Anakin Skywalker     1.88    84 Human    23.8
 9 Han Solo             1.8     80 Human    24.7
10 Wedge Antilles       1.7     77 Human    26.6
# … with 14 more rows

You may want to save the output into a variable, so you can use it on the next step.

Problem 5.4

Now that you know the BMI of all the characters, you’re interested in which characters may have exceeded the evil empire’s healthy weight guidelines. For Humans, this is defined as a BMI > 30. However, for Gungans this is defined as a BMI > 15. Add a new column to your data table that has a TRUE if the character is beyond recommended BMI for their species or otherwise has a FALSE.

Get a hint
  • For complex conditionals like this, you’ll need to use the case_when() function.
  • The conditional also relies on two checks (species and bmi) so you’ll need to join these
Solution
starwars_bmi %>% 
  mutate(excess = case_when(bmi > 30 & species == "Human" ~ TRUE, 
                            bmi > 15 & species == "Gungan" ~ TRUE, 
                            TRUE ~ FALSE)) 
# A tibble: 24 x 6
   name               height  mass species   bmi excess
   <chr>               <dbl> <dbl> <chr>   <dbl> <lgl> 
 1 Luke Skywalker       1.72    77 Human    26.0 FALSE 
 2 Darth Vader          2.02   136 Human    33.3 TRUE  
 3 Leia Organa          1.5     49 Human    21.8 FALSE 
 4 Owen Lars            1.78   120 Human    37.9 TRUE  
 5 Beru Whitesun lars   1.65    75 Human    27.5 FALSE 
 6 Biggs Darklighter    1.83    84 Human    25.1 FALSE 
 7 Obi-Wan Kenobi       1.82    77 Human    23.2 FALSE 
 8 Anakin Skywalker     1.88    84 Human    23.8 FALSE 
 9 Han Solo             1.8     80 Human    24.7 FALSE 
10 Wedge Antilles       1.7     77 Human    26.6 FALSE 
# … with 14 more rows

The stroop dataset

The stroop data set contains data from 10 people performing the Stroop task in two languages (English and Tamil). It is organised in wide format.

stroop
# A tibble: 10 x 5
   id    english_congruent english_incongruent tamil_congruent tamil_incongruent
   <chr>             <dbl>               <dbl>           <dbl>             <dbl>
 1 s01                644.                507.            845.             1197.
 2 s02                675.                530.            833.              803.
 3 s03                652.                647.            897.              894.
 4 s04                677.                500.            871.             1141.
 5 s05                591.                578.            864.              985.
 6 s06                533.                592.            839.              981.
 7 s07                565.                578.            870.             1234.
 8 s08                602.                580.            854.             1252.
 9 s09                646.                536.            823.             1109.
10 s10                698.                690.            848.              867.

Problem 5.5

Transform the stroop data into long format. The new data table should have a column for the condition labels (congruent or incongruent) and one for the language (whether it was in English or in Tamil)

Get a hint
  • the factor names are seperated with a _
Solution
stroop %>% 
  pivot_longer(cols = -id, 
               names_to = c("language","condition"),
               names_sep = "_", 
               values_to = "RT")
# A tibble: 40 x 4
   id    language condition      RT
   <chr> <chr>    <chr>       <dbl>
 1 s01   english  congruent    644.
 2 s01   english  incongruent  507.
 3 s01   tamil    congruent    845.
 4 s01   tamil    incongruent 1197.
 5 s02   english  congruent    675.
 6 s02   english  incongruent  530.
 7 s02   tamil    congruent    833.
 8 s02   tamil    incongruent  803.
 9 s03   english  congruent    652.
10 s03   english  incongruent  647.
# … with 30 more rows

Problem 5.6

Create a summary table with the average response time for the trials performed in English and the trials performed in Tamil. Answer this question by extending your previous answer using the pipe %>% operator

Get a hint
  • Your final result should only have two values—one for each language. So make sure your group variable only has two values.
Solution
stroop %>% 
  pivot_longer(cols = -id, 
               names_to = c("language","condition"),
               names_sep = "_", 
               values_to = "RT")%>%
  group_by(language) %>% 
  summarise(RT = mean(RT))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  language    RT
  <chr>    <dbl>
1 english   601.
2 tamil     950.

Problem 5.7

Create a vector that contains the ID codes for any participants that have a response time of greater than 1200 ms in any one of the (four) conditions.

Get a hint
  • the function max() will tell you the maxium value in a group
  • Your answer might require mulitple includuing pivoting, grouping, summarising, filtering, and finally pulling.
Solution
stroop %>% 
    pivot_longer(cols = -id, 
                 names_to = c("language","condition"),
                 names_sep = "_", values_to = "RT") %>% 
  group_by(id) %>% 
  summarise(RT = max(RT), .groups = 'drop') %>% 
  filter(RT > 1200) %>% 
  pull(id)
[1] "s07" "s08"
Solution (more advanced syntax)
#  More advanced tidyverse syntax might employ this instead
stroop %>% filter_at(vars(-id), any_vars(. > 1200)) %>% pull(id)
[1] "s07" "s08"

CC-BY-NC-SA-4.0Lincoln J Colling