×

The following packages are used in this section: dplyr, tibble, tidyr and teachingtools. dplyr, tibble, and tidyr are part of the tidyverse family.

To make sure that the packages used in this section are installed and loaded, run the following code in your console.

xfun::pkg_attach2("teachingtools","tidyverse")

Pivoting data tables

One thing that makes R different to SPSS is that most statistical analyses in R require data in long format. SPSS, in contrast, require data in wide format or a mix of wide and long format depending on whether you’re doing a between subjects, within subjects, or mixed analysis (really, the distinction between wide and long only makes sense in the context of within-subject and mixed analysis). If none of the this makes sense, we’ll look at a few examples.

If you’re coming from SPSS you’re probably familiar with wide format, where each participant gets their own row. In the example below we have the data from a 6 item scale for 10 participants.

This is in contrast to long format, where each measurement gets their own row. Since each of participants has 6 measurements, each participant would get 6 rows. The same data in long format would look like the following.

Our example so far only includes one variable called item. To add additional variables we add new columns. In the example below we’ve just added a new variable called age

Pivoting data from wide to long

Because of this reliance on long data, one of the most basic tasks you’ll need to do in R is transforming data from wide to long. The tidyr package provides a function called pivot_longer() for just this purpose.

Let’s try this out with the item_all data set. First let’s take a look at it in wide format.

item_all
# A tibble: 10 x 7
   id    item1 item2 item3 item4 item5 item6
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 s001      1     1     1     1     1     1
 2 s002      3     3     3     3     3     3
 3 s003      6     6     6     6     6     6
 4 s004      7     7     7     7     7     7
 5 s005      0     0     0     0     0     0
 6 s006      5     5     5     5     5     5
 7 s007      2     2     2     2     2     2
 8 s008      0     0     0     0     0     0
 9 s009      2     2     2     2     2     2
10 s010      1     1     1     1     1     1

Now we’ll transform it to long. At a most basic level pivot_longer() needs to know the name of your data table and the name of the columns you want to pivot

tidyr::pivot_longer(data = item_all, # our data
                    cols = c(item1,item2,item3,item4,item5,item6) # cols to pivot
                    )
# A tibble: 60 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 50 more rows

Writing all the column names out can be take a long time, so we can use a helper function called starts_with() which will just select the columns that start with a particular string.

tidyr::pivot_longer(data = item_all, # our data
                    cols = starts_with("item") # all the columns that start with item
                    )
# A tibble: 60 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 50 more rows

In addition to starts_with(), we can also use the ends_with(), and contains() helper. Or, if we want to pivot all the columns, except for one (or more) columns, then we can use negative selection.

tidyr::pivot_longer(data = item_all, # our data
                    cols = -id, # everything but ID
                    )
# A tibble: 60 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 50 more rows

You’ll notice that our pivoted table now has a column called name and value. If we don’t like those names then we can set them when we call pivot_longer().

tidyr::pivot_longer(data = item_all, # our data
                    cols = starts_with("item"), # all the columns that start with item
                    names_to = "item", # call our names col item
                    values_to = "response", # call our values col response
                    ) 
# A tibble: 60 x 3
   id    item  response
   <chr> <chr>    <dbl>
 1 s001  item1        1
 2 s001  item2        1
 3 s001  item3        1
 4 s001  item4        1
 5 s001  item5        1
 6 s001  item6        1
 7 s002  item1        3
 8 s002  item2        3
 9 s002  item3        3
10 s002  item4        3
# … with 50 more rows

In these example, we’ve only been dealing with one factor. But what if have data with multiple factors. The data table aov_data_repeated_wide contains data from a 2 \(\times\) 4 \(\times\) 2 mixed design. The column names indicate which level of factor 1 and which level of factor 2 the measurement comes from, with the factors separated by an underscore _. For example, the heading happy_congruent, means that that column has measurements from the happy level of factor 1 and the congruent level of factor 2. This is how you would organise your data for SPSS.

aov_data_repeated_wide
# A tibble: 60 x 10
   subjectID group happy_congruent sad_congruent digust_congruent
   <chr>     <chr>           <dbl>         <dbl>            <dbl>
 1 s001      Grou…             506           502              490
 2 s002      Grou…             479           493              516
 3 s003      Grou…             459           509              493
 4 s004      Grou…             493           491              490
 5 s005      Grou…             489           497              522
 6 s006      Grou…             492           503              502
 7 s007      Grou…             507           480              478
 8 s008      Grou…             502           509              524
 9 s009      Grou…             515           460              490
10 s010      Grou…             518           494              489
# … with 50 more rows, and 5 more variables: anger_congruent <dbl>,
#   happy_incongruent <dbl>, sad_incongruent <dbl>, digust_incongruent <dbl>,
#   anger_incongruent <dbl>

To pivot this data to wide format, we’ll need to provide additional information. Specifically, we’ll need to tell R that label on each side of the _ represent different names. To do this, we just use the names_sep argument.

tidyr::pivot_longer(data = aov_data_repeated_wide, # our data
                    c(-subjectID,-group), # pivot everything but subjectId and group
                    names_to = c("emotion","congruency"), # our two factors
                    names_sep = "_" # the token separating the names
                    )
# A tibble: 480 x 5
   subjectID group   emotion congruency  value
   <chr>     <chr>   <chr>   <chr>       <dbl>
 1 s001      Group 1 happy   congruent     506
 2 s001      Group 1 sad     congruent     502
 3 s001      Group 1 digust  congruent     490
 4 s001      Group 1 anger   congruent     494
 5 s001      Group 1 happy   incongruent   508
 6 s001      Group 1 sad     incongruent   514
 7 s001      Group 1 digust  incongruent   518
 8 s001      Group 1 anger   incongruent   512
 9 s002      Group 1 happy   congruent     479
10 s002      Group 1 sad     congruent     493
# … with 470 more rows

Pivoting from long to wide

Occasionally we’ll want to pivot from long to wide. To do this we can just use a function called pivot_wider()

The data table item_all_long is just our 6 item repeated measures data organised in long format.

item_all_long
# A tibble: 60 x 3
   id    item  response
   <chr> <chr>    <dbl>
 1 s001  item1        1
 2 s001  item2        1
 3 s001  item3        1
 4 s001  item4        1
 5 s001  item5        1
 6 s001  item6        1
 7 s002  item1        3
 8 s002  item2        3
 9 s002  item3        3
10 s002  item4        3
# … with 50 more rows

To use pivot_wider(), we need to specify the column that holds the values that will become our new column names (names_from) and the the column that holds the values that will become the values in these columns (values_from).

tidyr::pivot_wider(data = item_all_long, # our data
                   names_from = item,  # col that holds the names
                   values_from = response # col that holds values
                   )
# A tibble: 10 x 7
   id    item1 item2 item3 item4 item5 item6
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 s001      1     1     1     1     1     1
 2 s002      3     3     3     3     3     3
 3 s003      6     6     6     6     6     6
 4 s004      7     7     7     7     7     7
 5 s005      0     0     0     0     0     0
 6 s006      5     5     5     5     5     5
 7 s007      2     2     2     2     2     2
 8 s008      0     0     0     0     0     0
 9 s009      2     2     2     2     2     2
10 s010      1     1     1     1     1     1

When dealing with multiple factors, we can just list multiple columns in names_from.

The data aov_data_repeated is our 2 \(\times\) 4 \(\times\) 2 data organised in long format.

aov_data_repeated
# A tibble: 480 x 5
   subjectID    RT emotion congruency group  
   <chr>     <dbl> <fct>   <fct>      <chr>  
 1 s001        506 happy   congruent  Group 1
 2 s002        479 happy   congruent  Group 1
 3 s003        459 happy   congruent  Group 1
 4 s004        493 happy   congruent  Group 1
 5 s005        489 happy   congruent  Group 1
 6 s006        492 happy   congruent  Group 1
 7 s007        507 happy   congruent  Group 1
 8 s008        502 happy   congruent  Group 1
 9 s009        515 happy   congruent  Group 1
10 s010        518 happy   congruent  Group 1
# … with 470 more rows

To turn this into wide format, we just specify that the names should come from the emotion column and the congruency column.

tidyr::pivot_wider(data = aov_data_repeated, # our data
            names_from = c(emotion,congruency),  # our two name cols
            values_from = RT # the col that holds the values
            )
# A tibble: 60 x 10
   subjectID group happy_congruent sad_congruent digust_congruent
   <chr>     <chr>           <dbl>         <dbl>            <dbl>
 1 s001      Grou…             506           502              490
 2 s002      Grou…             479           493              516
 3 s003      Grou…             459           509              493
 4 s004      Grou…             493           491              490
 5 s005      Grou…             489           497              522
 6 s006      Grou…             492           503              502
 7 s007      Grou…             507           480              478
 8 s008      Grou…             502           509              524
 9 s009      Grou…             515           460              490
10 s010      Grou…             518           494              489
# … with 50 more rows, and 5 more variables: anger_congruent <dbl>,
#   happy_incongruent <dbl>, sad_incongruent <dbl>, digust_incongruent <dbl>,
#   anger_incongruent <dbl>

By default, pivot_wider() will use an underscore _ to separate the names; however, if you don’t like underscores then you can override this by setting the names_sep argument.

At Sussex, we teach the undergraduates to organise data in long format, and it’s almost always better to organise data in long format because it provides greater flexibility in working with the data.

The pipe operator (%>%)

So far when we’ve been using tidyverse functions like dplyr::filter(), dplyr::select(), dplyr::pull(), tidyr::pivot_wider(), and tidyr::pivot_longer() we’ve been using them like ordinary R functions. That is, all these functions have a .data or data argument which we’ve pointed at our data table.

For example, for filter() we wrote something like this:

dplyr::filter(.data = item_all, # our data
              id != "s006" # don't include subject 6
              )
# A tibble: 9 x 7
  id    item1 item2 item3 item4 item5 item6
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s001      1     1     1     1     1     1
2 s002      3     3     3     3     3     3
3 s003      6     6     6     6     6     6
4 s004      7     7     7     7     7     7
5 s005      0     0     0     0     0     0
6 s007      2     2     2     2     2     2
7 s008      0     0     0     0     0     0
8 s009      2     2     2     2     2     2
9 s010      1     1     1     1     1     1

And for pivot_longer() we wrote something like this:

tidyr::pivot_longer(data = item_all, # our data
                    cols = -id, # pivot everything but id
)
# A tibble: 60 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 50 more rows

But what happens if we want to filter something and then pivot it.

Well, we could try nest the functions inside each other:

pivot_longer(data = 
               filter(
                 .data = item_all, 
                 id != "s006"), 
             cols = -id)
# A tibble: 54 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 44 more rows

Or we could try break it down into steps:

step1_output <- dplyr::filter(.data = item_all, # our data
              id != "s006" # don't include subject 6
              )

tidyr::pivot_longer(data = step1_output, # our data
                    cols = -id, # pivot everything but id
)
# A tibble: 54 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 44 more rows

Neither of these solutions is really ideal. Nesting makes it very hard to read and follow the code. And breaking it down into steps means that we’ll be writing out lots of temporary variables. All these variables might be difficult to keep straight in our head and they’ll just clutter our workspace. However, there is a better way.

Flowing between steps through pipes

When we break our problem down into steps we see that the output from step 1 (filter_data) just serves as the input for step 2 (the call to pivot_longer()). In situations like this, where we have a multi-step operation where the output of one step serves as the input for the next step then we can use the pipe operator %>% to “pipe” or transfer the output from step to step.

We’ll look at an example so it’ll make more sense. Just to refresh:

  1. at step 1: we called step1_output <- dplyr::filter(.data = item_all, id != "s006")

  2. and we wanted the output from step 1 as the input for step 2 like so tidyr::pivot_longer(data = step1_output, cols = -id)

We can just link the two steps with the %>%. For the second step, because we’re not saving the output to a variable, we’ll just use a . to indicate where the output should go.

dplyr::filter(.data = item_all, id != "s006")  %>% # step 1
 tidyr::pivot_longer(data = ., cols = -id) # step 2 
# A tibble: 54 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 44 more rows

We’re not just limited to two steps. We can string arbitrarily long sequences together (although it’s probably a good idea to keep your sequences short).

In our example above, we’ve explicitly indicated at step 2 that the output from step 1 must go to the data argument. This is not always something you’ll see. By default, the output from step n will go to the first argument of the function at step n+1. For pivot_longer() this is the data argument. This means, that we can actually leave it out.

# Does the same thing as the example above
dplyr::filter(.data = item_all, id != "s006")  %>% # step 1
 tidyr::pivot_longer(cols = -id) # step 2 
# A tibble: 54 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 44 more rows

For all the functions from the tidyverse family the first argument is called .data() or data() and it takes a data table. So this means you can just leave out the first argument name. In fact, you don’t even need to start with a function call. It’s more typical that you’ll see people start with the name of the data table.

# Again does the same thing as the example above.

item_all %>% #our data
  dplyr::filter(id != "s006")  %>% # step 1
 tidyr::pivot_longer(cols = -id) # step 2 
# A tibble: 54 x 3
   id    name  value
   <chr> <chr> <dbl>
 1 s001  item1     1
 2 s001  item2     1
 3 s001  item3     1
 4 s001  item4     1
 5 s001  item5     1
 6 s001  item6     1
 7 s002  item1     3
 8 s002  item2     3
 9 s002  item3     3
10 s002  item4     3
# … with 44 more rows

Modifying data tables

So far we’ve covered subsetting parts of our data tables using select(), filter(), and pull(). And we’ve covered transforming the shape of our data tables using pivot_wider and pivot_longer(). But sometimes we might want to change the actual values in our data table. For example, we might have data that represents the summed scores of the items on a scale we might instead want the average score (i.e., we just want to divide the value we have by the number of items on the scale).

For example, take a look at the data table item_sum_long:

item_sum_long
# A tibble: 10 x 2
   id    total_response
   <chr>          <dbl>
 1 s001              24
 2 s002              48
 3 s003              36
 4 s004               0
 5 s005              12
 6 s006              12
 7 s007              36
 8 s008              12
 9 s009              48
10 s010               6

This data table has column named total_response which is just a sum score. But we want an average. How can we do this? One was is just to take the values from total_response and divide them by 6 (there were 6 items).

So we could do something like this:

item_sum_long$total_response / 6
 [1] 4 8 6 0 2 2 6 2 8 1

or even:

select(item_sum_long,total_response) / 6
   total_response
1               4
2               8
3               6
4               0
5               2
6               2
7               6
8               2
9               8
10              1

But neither of these is exactly what we want. We still want the original data table, but just with a new column with the averages instead of the sums. To do this we can use another dplyr verb called mutate(). For mutate(), we just need to specify the name of our new column and the calculation that will produce it

Here is an example:

dplyr::mutate(.data = item_sum_long, # our data table
       average = total_response  / 6 # our formula for our new variable
)
# A tibble: 10 x 3
   id    total_response average
   <chr>          <dbl>   <dbl>
 1 s001              24       4
 2 s002              48       8
 3 s003              36       6
 4 s004               0       0
 5 s005              12       2
 6 s006              12       2
 7 s007              36       6
 8 s008              12       2
 9 s009              48       8
10 s010               6       1

Or if we want to do it pipe-style

item_sum_long %>% dplyr::mutate(average  = total_response / 6)
# A tibble: 10 x 3
   id    total_response average
   <chr>          <dbl>   <dbl>
 1 s001              24       4
 2 s002              48       8
 3 s003              36       6
 4 s004               0       0
 5 s005              12       2
 6 s006              12       2
 7 s007              36       6
 8 s008              12       2
 9 s009              48       8
10 s010               6       1

mutate() is more capable that doing simple arithmetic. You can include conditionals using if_else(). For example, let’s say that we want to change all scores above 30 to be equal to exactly 30. We can use an if_else() inside mutate() to do this.

item_sum_long %>% # our data
  dplyr::mutate(total_response = 
          dplyr::if_else(condition = total_response > 30, # total_response > 30
                  true = 30, # set it to 30
                  false = total_response # or just leave it what it was
                  ))
# A tibble: 10 x 2
   id    total_response
   <chr>          <dbl>
 1 s001              24
 2 s002              30
 3 s003              30
 4 s004               0
 5 s005              12
 6 s006              12
 7 s007              30
 8 s008              12
 9 s009              30
10 s010               6

Recoding values

Another useful function that can be used inside mutate() is the case_when() function. The case_when() applies a series of conditionals1.

summer_packing_list contains my packing list for my summer holidays. It has a column called items which has the items I want to pack.

summer_packing_list$items
[1] "jandals"    "scuba gear" "sun screen" "hat"       

You’ll see the first item is jandals, which is what we call flip-flops in New Zealand. Let’s say I want to recode that so that it says flip-flops instead. I’ve also listed scuba gear, but let’s say that I actually want my surf board instead. We’ll use the case_when() function to do it.

Let’s see an example:

summer_packing_list %>% 
  dplyr::mutate(items = # overwrite the existing col
           dplyr::case_when(items == "jandals" ~ "flip flops", # when item is jandals make it flip flips
                     items == "scuba gear" ~ "surf board", # when item is scuba gear make it surf board
                     TRUE ~ items # for everything else just keep the current value
                     ))
# A tibble: 4 x 1
  items     
  <chr>     
1 flip flops
2 surf board
3 sun screen
4 hat       

You can also use case_when() with numeric values. Let’s say that you have a colleague that codes all missing values as 99. You want to to change these to missing otherwise you might accidentally calcualte some weird stats! We can use case_when() to do this.

Let’s take a look at the data table amas_data

amas_data
# A tibble: 4 x 2
  subject score
  <chr>   <dbl>
1 s001       14
2 s002       43
3 s003       20
4 s004       99

Now let’s use case_when() to transform that 99 to a NaN

amas_data %>% 
  dplyr::mutate(
    score = dplyr::case_when(score == 99 ~ NaN, # make 99s Not A Number
              TRUE ~ score)
  )
# A tibble: 4 x 2
  subject score
  <chr>   <dbl>
1 s001       14
2 s002       43
3 s003       20
4 s004      NaN

Let’s say that we want to go the other way round and recode the NaN to 99. We can just use the is.na() function to check whether the value is an NaN. Let’s take a look at the data table called amas_data_2

amas_data_2
# A tibble: 4 x 2
  subject score
  <chr>   <dbl>
1 s001       14
2 s002       43
3 s003       20
4 s004      NaN

Now we’ll use case_when() to change the NaN to a 99.

amas_data_2 %>% dplyr::mutate(
  score = dplyr::case_when(is.na(score) ~ 99,
            TRUE ~ score)
)
# A tibble: 4 x 2
  subject score
  <chr>   <dbl>
1 s001       14
2 s002       43
3 s003       20
4 s004       99

Summarising data tables

The last set of dplyr verbs we’ll be looking at are used for creating data summaries. Creating summaries is useful for when we want to, for example, work out some descriptive statistics or if we have single trial or item level data and we want participant level data.

Let’s look at an example using our item_all_long data. In item_all_long we have a value for each participant and for each item on the scale. But in many cases we might just want something like a the participant average or the sum. We can use the dplyr summarising verbs to help us create this.

item_all_long
# A tibble: 60 x 3
   id    item  response
   <chr> <chr>    <dbl>
 1 s001  item1        1
 2 s001  item2        1
 3 s001  item3        1
 4 s001  item4        1
 5 s001  item5        1
 6 s001  item6        1
 7 s002  item1        3
 8 s002  item2        3
 9 s002  item3        3
10 s002  item4        3
# … with 50 more rows

Grouping data

In most cases where we’re creating summaries we’ll first want to group together specific values. For example, in item_all_long, if we want to average together all the values from a particular participant that means we have to group those values together. dplyr provides a function called group_by() for doing just this.

Let’s take a look at an example:

item_all_long %>% # our data table 
  group_by(id) # group rows together according to the id col
# A tibble: 60 x 3
# Groups:   id [10]
   id    item  response
   <chr> <chr>    <dbl>
 1 s001  item1        1
 2 s001  item2        1
 3 s001  item3        1
 4 s001  item4        1
 5 s001  item5        1
 6 s001  item6        1
 7 s002  item1        3
 8 s002  item2        3
 9 s002  item3        3
10 s002  item4        3
# … with 50 more rows

The output doesn’t look that different, but there is an extra row in the header that says # Groups: id [10]. This just indicates that our data is now grouped, that it’s grouped by id, and that there’s 10 of these groups.

Summarising with grouped data

Grouping by itself might not be super useful, but when we have grouped data then we can apply functions by group. One dplyr verb that comes in handy here is summarise(). All summarise does is collapses down data to a single value.

Let’s take a look at an example. First, with ordinary un-grouped data.

item_all_long %>%  # our data
  dplyr::summarise(M = mean(response) # create a summary variable called M as the mean of response
            )
# A tibble: 1 x 1
      M
  <dbl>
1   2.7

Now let’s try it with grouped data. We’ll group it by id, which means we’ll have 10 groups.

item_all_long %>% # our data
  dplyr::group_by(id) %>% # group by the id col
  dplyr::summarise(M = mean(response) # create a summary variable called M as the mean of response
            )
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
   id        M
   <chr> <dbl>
 1 s001      1
 2 s002      3
 3 s003      6
 4 s004      7
 5 s005      0
 6 s006      5
 7 s007      2
 8 s008      0
 9 s009      2
10 s010      1

Now instead of having only one value we have one value for each group (i.e., each id).

We don’t have to do just one summary. Instead we can put a series of commands inside summary(). For example, we could work out an average (mean()), a sum (sum()), or standard deviation (sd()). dplyr also has a special helper function called n() which we can use to count the number of values in each group.

starwars2 %>% # our data
  dplyr::group_by(species) %>% # group by species
  dplyr::summarise( # create a summary
    MEAN = mean(mass), # create mean
    STDEV = sd(mass), # a standard deviation
    SUM = sum(mass),  # a sum
    N = n() # and count the number of values in each group
    ) 
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
  species  MEAN STDEV   SUM     N
  <chr>   <dbl> <dbl> <dbl> <int>
1 Gungan   74    11.3  148      2
2 Human    82.8  19.4 1821.    22

Let’s try it with a slighlty more complex data table. Let’s take the starwars data and try to work out the average mass for each species and each homeworld. To do this, we just specify multiple grouping variables.

starwars %>% 
  dplyr::group_by(species, homeworld) %>% 
  dplyr::summarise(mean_mass = mean(mass, na.rm = TRUE))
`summarise()` regrouping output by 'species' (override with `.groups` argument)
# A tibble: 58 x 3
# Groups:   species [38]
   species  homeworld   mean_mass
   <chr>    <chr>           <dbl>
 1 Aleena   Aleen Minor      15  
 2 Besalisk Ojom            102  
 3 Cerean   Cerea            82  
 4 Chagrian Champala        NaN  
 5 Clawdite Zolan            55  
 6 Droid    Naboo            32  
 7 Droid    Tatooine         53.5
 8 Droid    <NA>            140  
 9 Dug      Malastare        40  
10 Ewok     Endor            20  
# … with 48 more rows

If you’ve noticed the little warning messages you’ll notice that they’ve been saying something about ungrouping or regrouping the data. These messages are to inform you about the grouping structure of the final output. These warning were recently added, because not everyone expected the final output to still having a grouping structure, and having data that is grouped when you don’t expect it to be grouped can have unexpected results. So a good habit is to ungroup it when we’re done with it. We can just use the dplyr verb ungroup().

starwars %>% 
  dplyr::group_by(species, homeworld) %>% 
  dplyr::summarise(mean_mass = mean(mass)) %>% 
  dplyr::ungroup()
`summarise()` regrouping output by 'species' (override with `.groups` argument)
# A tibble: 58 x 3
   species  homeworld   mean_mass
   <chr>    <chr>           <dbl>
 1 Aleena   Aleen Minor      15  
 2 Besalisk Ojom            102  
 3 Cerean   Cerea            82  
 4 Chagrian Champala         NA  
 5 Clawdite Zolan            55  
 6 Droid    Naboo            32  
 7 Droid    Tatooine         53.5
 8 Droid    <NA>             NA  
 9 Dug      Malastare        40  
10 Ewok     Endor            20  
# … with 48 more rows

We still get the warning. But because we ungroup at the next step, we can safely ignore the warning because we know our final output will have no grouping. However, the message is kind of annoying. So another option is to specify in the summarise() call that we want the output to be ungrouped.

starwars %>% 
  dplyr::group_by(species, homeworld) %>% 
  dplyr::summarise(mean_mass = mean(mass), .groups = 'drop') 
# A tibble: 58 x 3
   species  homeworld   mean_mass
   <chr>    <chr>           <dbl>
 1 Aleena   Aleen Minor      15  
 2 Besalisk Ojom            102  
 3 Cerean   Cerea            82  
 4 Chagrian Champala         NA  
 5 Clawdite Zolan            55  
 6 Droid    Naboo            32  
 7 Droid    Tatooine         53.5
 8 Droid    <NA>             NA  
 9 Dug      Malastare        40  
10 Ewok     Endor            20  
# … with 48 more rows

One thing about group_by() and summarise() is that the output will only contain the columns specified in group_by() and the ones created in summarise(). But sometimes we want to keep other columns too. Take our data item_all_long2.

item_all_long2
# A tibble: 60 x 4
   id      age item  response
   <chr> <dbl> <chr>    <dbl>
 1 s001     18 item1        4
 2 s001     18 item2        4
 3 s001     18 item3        4
 4 s001     18 item4        4
 5 s001     18 item5        4
 6 s001     18 item6        4
 7 s002     35 item1        8
 8 s002     35 item2        8
 9 s002     35 item3        8
10 s002     35 item4        8
# … with 50 more rows

This data also has an age column and we might want to keep that even though we’re not going to do any summarising with it. Let’s see what happens when we just group by id and then work out the mean of response.

item_all_long2 %>% 
  dplyr::group_by(id) %>% 
  dplyr::summarise(mean_response = mean(response))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
   id    mean_response
   <chr>         <dbl>
 1 s001              4
 2 s002              8
 3 s003              6
 4 s004              0
 5 s005              2
 6 s006              2
 7 s007              6
 8 s008              2
 9 s009              8
10 s010              1

The age column has disappeared. What can we do if we want to keep it? One option is just to add it to our group_by() call. Since each person will only have one unique age this won’t change the number or nature of the groups.

item_all_long2 %>% 
  dplyr::group_by(id,age) %>% 
  dplyr::summarise(mean_response = mean(response))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 10 x 3
# Groups:   id [10]
   id      age mean_response
   <chr> <dbl>         <dbl>
 1 s001     18             4
 2 s002     35             8
 3 s003     16             6
 4 s004     27             0
 5 s005     23             2
 6 s006     20             2
 7 s007     37             6
 8 s008     28             2
 9 s009     25             8
10 s010     36             1

Mutating grouped data

We don’t only have to use summarise() with group_by(). We can also use mutate() with group_by to do group-wise calculations. The most obvious use-case of this is to work out something like a \(z\) score for grouped data.

Let’s take the data dev_data:

dev_data
# A tibble: 20 x 3
   id    age   score
   <chr> <chr> <dbl>
 1 s01   6      9.65
 2 s02   6      9.91
 3 s03   6     10.5 
 4 s04   6      8.56
 5 s05   6      9.81
 6 s06   6     11.5 
 7 s07   6      8.87
 8 s08   6     12.2 
 9 s09   6     11.8 
10 s10   6      8.68
11 s11   9     10.8 
12 s12   9     13.4 
13 s13   9     13.1 
14 s14   9     10.3 
15 s15   9     11.9 
16 s16   9     13.2 
17 s17   9     12.3 
18 s18   9     11.3 
19 s19   9     13.5 
20 s20   9     10.5 

In this data table we have scores on the maths test for two age groups. Let’s say we want to transform the scores to \(z\) scores but we want to do it for each age group separately. This means what we’ll do is:

  1. group our data by age group

  2. work out the average score in age group

  3. work out the sd for each group

  4. Subtract the average from the participant score and divide by the sd

Now let’s try put those step those steps together.

dev_data %>% # our data
  dplyr::group_by(age) %>% # group by age 
  dplyr::mutate(m = mean(score)) %>% # work out the mean for each group but don't collapse down
  dplyr::mutate(s = sd(score)) %>% # do the same for sd
  dplyr::mutate(z = (score - m) / s) # z score it
# A tibble: 20 x 6
# Groups:   age [2]
   id    age   score     m     s      z
   <chr> <chr> <dbl> <dbl> <dbl>  <dbl>
 1 s01   6      9.65  10.2  1.32 -0.379
 2 s02   6      9.91  10.2  1.32 -0.181
 3 s03   6     10.5   10.2  1.32  0.284
 4 s04   6      8.56  10.2  1.32 -1.21 
 5 s05   6      9.81  10.2  1.32 -0.258
 6 s06   6     11.5   10.2  1.32  1.01 
 7 s07   6      8.87  10.2  1.32 -0.969
 8 s08   6     12.2   10.2  1.32  1.57 
 9 s09   6     11.8   10.2  1.32  1.25 
10 s10   6      8.68  10.2  1.32 -1.12 
11 s11   9     10.8   12.0  1.25 -1.00 
12 s12   9     13.4   12.0  1.25  1.09 
13 s13   9     13.1   12.0  1.25  0.827
14 s14   9     10.3   12.0  1.25 -1.40 
15 s15   9     11.9   12.0  1.25 -0.130
16 s16   9     13.2   12.0  1.25  0.956
17 s17   9     12.3   12.0  1.25  0.213
18 s18   9     11.3   12.0  1.25 -0.552
19 s19   9     13.5   12.0  1.25  1.20 
20 s20   9     10.5   12.0  1.25 -1.21 

  1. There’s another function called recode() which you can also use for recoding variables. However, I think this is a badly written function and the developers seem to agree, so it’s not clear how long recode() will still be part of the dplyr package.↩︎

CC-BY-NC-SA-4.0Lincoln J Colling