The following packages are used in this section: dplyr, tibble, tidyr and teachingtools. dplyr, tibble, and tidyr are part of the tidyverse family.
To make sure that the packages used in this section are installed and loaded, run the following code in your console.
xfun::pkg_attach2("teachingtools","tidyverse")
One thing that makes R different from SPSS is that most statistical analyses in R require data in long format. SPSS, in contrast, requires data in wide format or a mix of wide and long format depending on whether you’re doing a between-subjects, within-subjects, or mixed analysis (really, the distinction between wide and long only makes sense in the context of within-subjects and mixed analyses). If none of this makes sense, we’ll look at a few examples.
If you’re coming from SPSS you’re probably familiar with wide format, where each participant gets their own row. For a 6 item scale with 10 participants, for instance, we’d have 10 rows, each with 6 item columns.
This is in contrast to long format, where each measurement gets its own row. Since each participant has 6 measurements, each participant would get 6 rows.
Our example so far only includes one variable, called item. To add additional variables we add new columns; for example, we could add a new variable called age.
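As a minimal sketch of the two layouts (with made-up values, trimmed to two participants and two items), the same data could be written either way:

```r
library(tibble)

# Wide format: one row per participant, one column per item,
# plus an extra column for the (hypothetical) age variable
wide <- tribble(
  ~id,    ~age, ~item1, ~item2,
  "s001",   18,      1,      2,
  "s002",   35,      3,      4
)

# Long format: one row per measurement, so each participant
# gets one row per item; age just repeats down the rows
long <- tribble(
  ~id,    ~age, ~item,   ~response,
  "s001",   18, "item1",         1,
  "s001",   18, "item2",         2,
  "s002",   35, "item1",         3,
  "s002",   35, "item2",         4
)
```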
Because of this reliance on long data, one of the most basic tasks you’ll need to do in R is transforming data from wide to long. The tidyr package provides a function called pivot_longer() for just this purpose.
Let’s try this out with the item_all data set. First let’s take a look at it in wide format.
item_all
# A tibble: 10 x 7
id item1 item2 item3 item4 item5 item6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s001 1 1 1 1 1 1
2 s002 3 3 3 3 3 3
3 s003 6 6 6 6 6 6
4 s004 7 7 7 7 7 7
5 s005 0 0 0 0 0 0
6 s006 5 5 5 5 5 5
7 s007 2 2 2 2 2 2
8 s008 0 0 0 0 0 0
9 s009 2 2 2 2 2 2
10 s010 1 1 1 1 1 1
Now we’ll transform it to long. At a most basic level, pivot_longer() needs to know the name of your data table and the names of the columns you want to pivot.
tidyr::pivot_longer(data = item_all, # our data
cols = c(item1,item2,item3,item4,item5,item6) # cols to pivot
)
# A tibble: 60 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
Writing all the column names out can take a long time, so we can use a helper function called starts_with(), which will select just the columns that start with a particular string.
tidyr::pivot_longer(data = item_all, # our data
cols = starts_with("item") # all the columns that start with item
)
# A tibble: 60 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
In addition to starts_with(), we can also use the ends_with() and contains() helpers. Or, if we want to pivot all the columns except for one (or more), we can use negative selection.
tidyr::pivot_longer(data = item_all, # our data
cols = -id, # everything but ID
)
# A tibble: 60 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
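For instance, contains() matches any column whose name contains a given string anywhere, not just at the start. A sketch using a cut-down, made-up version of item_all:

```r
library(tibble)

item_all_mini <- tribble( # stand-in for the item_all table
  ~id,    ~item1, ~item2,
  "s001",      1,      1,
  "s002",      3,      3
)

tidyr::pivot_longer(data = item_all_mini,
                    cols = contains("item")) # any column containing "item"
```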
You’ll notice that our pivoted table now has columns called name and value. If we don’t like those names then we can set them when we call pivot_longer().
tidyr::pivot_longer(data = item_all, # our data
cols = starts_with("item"), # all the columns that start with item
names_to = "item", # call our names col item
values_to = "response", # call our values col response
)
# A tibble: 60 x 3
id item response
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
In these examples, we’ve only been dealing with one factor. But what if we have data with multiple factors? The data table aov_data_repeated_wide contains data from a 2 \(\times\) 4 \(\times\) 2 mixed design. The column names indicate which level of factor 1 and which level of factor 2 the measurement comes from, with the factors separated by an underscore (_). For example, the heading happy_congruent means that that column holds measurements from the happy level of factor 1 and the congruent level of factor 2. This is how you would organise your data for SPSS.
aov_data_repeated_wide
# A tibble: 60 x 10
subjectID group happy_congruent sad_congruent digust_congruent
<chr> <chr> <dbl> <dbl> <dbl>
1 s001 Grou… 506 502 490
2 s002 Grou… 479 493 516
3 s003 Grou… 459 509 493
4 s004 Grou… 493 491 490
5 s005 Grou… 489 497 522
6 s006 Grou… 492 503 502
7 s007 Grou… 507 480 478
8 s008 Grou… 502 509 524
9 s009 Grou… 515 460 490
10 s010 Grou… 518 494 489
# … with 50 more rows, and 5 more variables: anger_congruent <dbl>,
# happy_incongruent <dbl>, sad_incongruent <dbl>, digust_incongruent <dbl>,
# anger_incongruent <dbl>
To pivot this data to long format, we’ll need to provide additional information. Specifically, we’ll need to tell R that the labels on each side of the _ represent different names. To do this, we just use the names_sep argument.
tidyr::pivot_longer(data = aov_data_repeated_wide, # our data
c(-subjectID,-group), # pivot everything but subjectId and group
names_to = c("emotion","congruency"), # our two factors
names_sep = "_" # the token separating the names
)
# A tibble: 480 x 5
subjectID group emotion congruency value
<chr> <chr> <chr> <chr> <dbl>
1 s001 Group 1 happy congruent 506
2 s001 Group 1 sad congruent 502
3 s001 Group 1 digust congruent 490
4 s001 Group 1 anger congruent 494
5 s001 Group 1 happy incongruent 508
6 s001 Group 1 sad incongruent 514
7 s001 Group 1 digust incongruent 518
8 s001 Group 1 anger incongruent 512
9 s002 Group 1 happy congruent 479
10 s002 Group 1 sad congruent 493
# … with 470 more rows
Occasionally we’ll want to pivot from long to wide. To do this we can use a function called pivot_wider(). The data table item_all_long is just our 6 item repeated measures data organised in long format.
item_all_long
# A tibble: 60 x 3
id item response
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
To use pivot_wider(), we need to specify the column that holds the values that will become our new column names (names_from) and the column that holds the values that will fill those columns (values_from).
tidyr::pivot_wider(data = item_all_long, # our data
names_from = item, # col that holds the names
values_from = response # col that holds values
)
# A tibble: 10 x 7
id item1 item2 item3 item4 item5 item6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s001 1 1 1 1 1 1
2 s002 3 3 3 3 3 3
3 s003 6 6 6 6 6 6
4 s004 7 7 7 7 7 7
5 s005 0 0 0 0 0 0
6 s006 5 5 5 5 5 5
7 s007 2 2 2 2 2 2
8 s008 0 0 0 0 0 0
9 s009 2 2 2 2 2 2
10 s010 1 1 1 1 1 1
When dealing with multiple factors, we can just list multiple columns in names_from. The data table aov_data_repeated is our 2 \(\times\) 4 \(\times\) 2 data organised in long format.
aov_data_repeated
# A tibble: 480 x 5
subjectID RT emotion congruency group
<chr> <dbl> <fct> <fct> <chr>
1 s001 506 happy congruent Group 1
2 s002 479 happy congruent Group 1
3 s003 459 happy congruent Group 1
4 s004 493 happy congruent Group 1
5 s005 489 happy congruent Group 1
6 s006 492 happy congruent Group 1
7 s007 507 happy congruent Group 1
8 s008 502 happy congruent Group 1
9 s009 515 happy congruent Group 1
10 s010 518 happy congruent Group 1
# … with 470 more rows
To turn this into wide format, we just specify that the names should come from the emotion column and the congruency column.
tidyr::pivot_wider(data = aov_data_repeated, # our data
names_from = c(emotion,congruency), # our two name cols
values_from = RT # the col that holds the values
)
# A tibble: 60 x 10
subjectID group happy_congruent sad_congruent digust_congruent
<chr> <chr> <dbl> <dbl> <dbl>
1 s001 Grou… 506 502 490
2 s002 Grou… 479 493 516
3 s003 Grou… 459 509 493
4 s004 Grou… 493 491 490
5 s005 Grou… 489 497 522
6 s006 Grou… 492 503 502
7 s007 Grou… 507 480 478
8 s008 Grou… 502 509 524
9 s009 Grou… 515 460 490
10 s010 Grou… 518 494 489
# … with 50 more rows, and 5 more variables: anger_congruent <dbl>,
# happy_incongruent <dbl>, sad_incongruent <dbl>, digust_incongruent <dbl>,
# anger_incongruent <dbl>
By default, pivot_wider() will use an underscore (_) to separate the names; however, if you don’t like underscores, you can override this by setting the names_sep argument.
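For example (sketched here with a made-up two-row slice of long-format reaction-time data), setting names_sep = "." produces column names like happy.congruent instead of happy_congruent:

```r
library(tibble)

rt_long <- tribble( # made-up slice of long-format RT data
  ~subjectID, ~emotion, ~congruency,   ~RT,
  "s001",     "happy",  "congruent",   506,
  "s001",     "happy",  "incongruent", 508
)

tidyr::pivot_wider(data = rt_long,
                   names_from = c(emotion, congruency),
                   values_from = RT,
                   names_sep = ".") # use "." rather than the default "_"
```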
At Sussex, we teach the undergraduates to organise data in long format, and it’s almost always better to organise data that way, because long format provides greater flexibility in working with the data.
The pipe (%>%)
So far, when we’ve been using tidyverse functions like dplyr::filter(), dplyr::select(), dplyr::pull(), tidyr::pivot_wider(), and tidyr::pivot_longer(), we’ve been using them like ordinary R functions. That is, all these functions have a .data or data argument which we’ve pointed at our data table.
For example, for filter() we wrote something like this:
dplyr::filter(.data = item_all, # our data
id != "s006" # don't include subject 6
)
# A tibble: 9 x 7
id item1 item2 item3 item4 item5 item6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s001 1 1 1 1 1 1
2 s002 3 3 3 3 3 3
3 s003 6 6 6 6 6 6
4 s004 7 7 7 7 7 7
5 s005 0 0 0 0 0 0
6 s007 2 2 2 2 2 2
7 s008 0 0 0 0 0 0
8 s009 2 2 2 2 2 2
9 s010 1 1 1 1 1 1
And for pivot_longer() we wrote something like this:
tidyr::pivot_longer(data = item_all, # our data
cols = -id, # pivot everything but id
)
# A tibble: 60 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
But what happens if we want to filter something and then pivot it?
Well, we could try nesting the functions inside each other:
tidyr::pivot_longer(data = dplyr::filter(.data = item_all,
                                         id != "s006"),
                    cols = -id)
# A tibble: 54 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 44 more rows
Or we could try breaking it down into steps:
step1_output <- dplyr::filter(.data = item_all, # our data
id != "s006" # don't include subject 6
)
tidyr::pivot_longer(data = step1_output, # our data
cols = -id, # pivot everything but id
)
# A tibble: 54 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 44 more rows
Neither of these solutions is really ideal. Nesting makes the code very hard to read and follow. And breaking it down into steps means writing out lots of temporary variables, which can be difficult to keep straight in our heads and which clutter our workspace. However, there is a better way.
When we break our problem down into steps, we see that the output from step 1 (step1_output) just serves as the input for step 2 (the call to pivot_longer()). In situations like this, where we have a multi-step operation in which the output of one step serves as the input for the next, we can use the pipe operator %>% to “pipe” (transfer) the output from step to step.
We’ll look at an example so it makes more sense. Just to refresh:
at step 1 we called step1_output <- dplyr::filter(.data = item_all, id != "s006")
and we wanted the output from step 1 as the input for step 2, like so: tidyr::pivot_longer(data = step1_output, cols = -id)
We can just link the two steps with the %>%. For the second step, because we’re not saving the output to a variable, we’ll use a . to indicate where the output should go.
dplyr::filter(.data = item_all, id != "s006") %>% # step 1
tidyr::pivot_longer(data = ., cols = -id) # step 2
# A tibble: 54 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 44 more rows
We’re not just limited to two steps. We can string together arbitrarily long sequences (although it’s probably a good idea to keep your sequences short).
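For example, a three-step chain (sketched with a cut-down, made-up version of item_all) just links more steps with %>%:

```r
library(dplyr) # attaches %>% as well as filter()
library(tibble)

item_all_mini <- tribble( # stand-in for the item_all table
  ~id,    ~item1, ~item2,
  "s001",      1,      1,
  "s006",      5,      5,
  "s010",      1,      1
)

dplyr::filter(.data = item_all_mini, id != "s006") %>% # step 1: drop subject 6
  dplyr::filter(.data = ., id != "s010") %>%           # step 2: drop subject 10
  tidyr::pivot_longer(data = ., cols = -id)            # step 3: pivot to long
```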
In our example above, we explicitly indicated at step 2 that the output from step 1 must go to the data argument. You won’t always see this done explicitly. By default, the output from step n goes to the first argument of the function at step n+1. For pivot_longer() this is the data argument. This means that we can actually leave it out.
# Does the same thing as the example above
dplyr::filter(.data = item_all, id != "s006") %>% # step 1
tidyr::pivot_longer(cols = -id) # step 2
# A tibble: 54 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 44 more rows
For all the functions from the tidyverse family, the first argument is called .data or data and it takes a data table. This means you can just leave out the first argument name. In fact, you don’t even need to start with a function call. It’s more typical to see people start with the name of the data table.
# Again does the same thing as the example above.
item_all %>% #our data
dplyr::filter(id != "s006") %>% # step 1
tidyr::pivot_longer(cols = -id) # step 2
# A tibble: 54 x 3
id name value
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 44 more rows
So far we’ve covered subsetting parts of our data tables using select(), filter(), and pull(). And we’ve covered transforming the shape of our data tables using pivot_wider() and pivot_longer(). But sometimes we might want to change the actual values in our data table. For example, we might have data that represents the summed scores of the items on a scale when we instead want the average score (i.e., we just want to divide the value we have by the number of items on the scale).
For example, take a look at the data table item_sum_long:
item_sum_long
# A tibble: 10 x 2
id total_response
<chr> <dbl>
1 s001 24
2 s002 48
3 s003 36
4 s004 0
5 s005 12
6 s006 12
7 s007 36
8 s008 12
9 s009 48
10 s010 6
This data table has a column named total_response, which is just a sum score. But we want an average. How can we do this? One way is just to take the values from total_response and divide them by 6 (there were 6 items).
So we could do something like this:
item_sum_long$total_response / 6
[1] 4 8 6 0 2 2 6 2 8 1
or even:
select(item_sum_long,total_response) / 6
total_response
1 4
2 8
3 6
4 0
5 2
6 2
7 6
8 2
9 8
10 1
But neither of these is exactly what we want. We still want the original data table, just with a new column containing the averages added to it. To do this we can use another dplyr verb called mutate(). For mutate(), we just need to specify the name of our new column and the calculation that will produce it.
Here is an example:
dplyr::mutate(.data = item_sum_long, # our data table
average = total_response / 6 # our formula for our new variable
)
# A tibble: 10 x 3
id total_response average
<chr> <dbl> <dbl>
1 s001 24 4
2 s002 48 8
3 s003 36 6
4 s004 0 0
5 s005 12 2
6 s006 12 2
7 s007 36 6
8 s008 12 2
9 s009 48 8
10 s010 6 1
Or, if we want to do it pipe-style:
item_sum_long %>% dplyr::mutate(average = total_response / 6)
# A tibble: 10 x 3
id total_response average
<chr> <dbl> <dbl>
1 s001 24 4
2 s002 48 8
3 s003 36 6
4 s004 0 0
5 s005 12 2
6 s006 12 2
7 s007 36 6
8 s008 12 2
9 s009 48 8
10 s010 6 1
mutate() is capable of more than simple arithmetic. You can include conditionals using if_else(). For example, let’s say that we want to change all scores above 30 to be exactly 30. We can use an if_else() inside mutate() to do this.
item_sum_long %>% # our data
dplyr::mutate(total_response =
dplyr::if_else(condition = total_response > 30, # total_response > 30
true = 30, # set it to 30
false = total_response # or just leave it as it was
))
# A tibble: 10 x 2
id total_response
<chr> <dbl>
1 s001 24
2 s002 30
3 s003 30
4 s004 0
5 s005 12
6 s006 12
7 s007 30
8 s008 12
9 s009 30
10 s010 6
Another useful function that can be used inside mutate() is the case_when() function. case_when() applies a series of conditionals.¹
summer_packing_list contains my packing list for my summer holidays. It has a column called items, which holds the items I want to pack.
summer_packing_list$items
[1] "jandals" "scuba gear" "sun screen" "hat"
You’ll see the first item is jandals, which is what we call flip flops in New Zealand. Let’s say I want to recode that so it says flip flops instead. I’ve also listed scuba gear, but let’s say that I actually want my surf board instead. We’ll use the case_when() function to do this.
Let’s see an example:
summer_packing_list %>%
dplyr::mutate(items = # overwrite the existing col
dplyr::case_when(items == "jandals" ~ "flip flops", # when item is jandals make it flip flops
items == "scuba gear" ~ "surf board", # when item is scuba gear make it surf board
TRUE ~ items # for everything else just keep the current value
))
# A tibble: 4 x 1
items
<chr>
1 flip flops
2 surf board
3 sun screen
4 hat
You can also use case_when() with numeric values. Let’s say that you have a colleague who codes all missing values as 99. You want to change these to missing, otherwise you might accidentally calculate some weird stats! We can use case_when() to do this.
Let’s take a look at the data table amas_data:
amas_data
# A tibble: 4 x 2
subject score
<chr> <dbl>
1 s001 14
2 s002 43
3 s003 20
4 s004 99
Now let’s use case_when() to transform that 99 to a NaN:
amas_data %>%
dplyr::mutate(
score = dplyr::case_when(score == 99 ~ NaN, # make 99s Not A Number
TRUE ~ score)
)
# A tibble: 4 x 2
subject score
<chr> <dbl>
1 s001 14
2 s002 43
3 s003 20
4 s004 NaN
Let’s say that we want to go the other way around and recode the NaN to 99. We can just use the is.na() function to check whether the value is NaN. Let’s take a look at the data table called amas_data_2:
amas_data_2
# A tibble: 4 x 2
subject score
<chr> <dbl>
1 s001 14
2 s002 43
3 s003 20
4 s004 NaN
Now we’ll use case_when() to change the NaN to a 99.
amas_data_2 %>% dplyr::mutate(
score = dplyr::case_when(is.na(score) ~ 99,
TRUE ~ score)
)
# A tibble: 4 x 2
subject score
<chr> <dbl>
1 s001 14
2 s002 43
3 s003 20
4 s004 99
The last set of dplyr verbs we’ll be looking at are used for creating data summaries. Creating summaries is useful when, for example, we want to work out some descriptive statistics, or when we have single-trial or item-level data and we want participant-level data.
Let’s look at an example using our item_all_long data. In item_all_long we have a value for each participant and for each item on the scale. But in many cases we might just want something like the participant average or the sum. We can use the dplyr summarising verbs to help us create this.
item_all_long
# A tibble: 60 x 3
id item response
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
In most cases where we’re creating summaries we’ll first want to group together specific values. For example, in item_all_long, if we want to average together all the values from a particular participant, that means we have to group those values together. dplyr provides a function called group_by() for doing just this.
Let’s take a look at an example:
item_all_long %>% # our data table
dplyr::group_by(id) # group rows together according to the id col
# A tibble: 60 x 3
# Groups: id [10]
id item response
<chr> <chr> <dbl>
1 s001 item1 1
2 s001 item2 1
3 s001 item3 1
4 s001 item4 1
5 s001 item5 1
6 s001 item6 1
7 s002 item1 3
8 s002 item2 3
9 s002 item3 3
10 s002 item4 3
# … with 50 more rows
The output doesn’t look that different, but there is an extra row in the header that says # Groups: id [10]. This just indicates that our data is now grouped, that it’s grouped by id, and that there are 10 of these groups.
Grouping by itself might not be super useful, but once we have grouped data we can apply functions by group. One dplyr verb that comes in handy here is summarise(). All summarise() does is collapse data down to a single value.
Let’s take a look at an example. First, with ordinary un-grouped data.
item_all_long %>% # our data
dplyr::summarise(M = mean(response) # create a summary variable called M as the mean of response
)
# A tibble: 1 x 1
M
<dbl>
1 2.7
Now let’s try it with grouped data. We’ll group it by id, which means we’ll have 10 groups.
item_all_long %>% # our data
dplyr::group_by(id) %>% # group by the id col
dplyr::summarise(M = mean(response) # create a summary variable called M as the mean of response
)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
id M
<chr> <dbl>
1 s001 1
2 s002 3
3 s003 6
4 s004 7
5 s005 0
6 s006 5
7 s007 2
8 s008 0
9 s009 2
10 s010 1
Now instead of having only one value, we have one value for each group (i.e., each id).
We don’t have to compute just one summary. Instead we can put a series of commands inside summarise(). For example, we could work out an average (mean()), a sum (sum()), or a standard deviation (sd()). dplyr also has a special helper function called n() which we can use to count the number of values in each group.
starwars2 %>% # our data
dplyr::group_by(species) %>% # group by species
dplyr::summarise( # create a summary
MEAN = mean(mass), # create mean
STDEV = sd(mass), # a standard deviation
SUM = sum(mass), # a sum
N = n() # and count the number of values in each group
)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
species MEAN STDEV SUM N
<chr> <dbl> <dbl> <dbl> <int>
1 Gungan 74 11.3 148 2
2 Human 82.8 19.4 1821. 22
Let’s try it with a slightly more complex data table. Let’s take the starwars data and try to work out the average mass for each species and each homeworld. To do this, we just specify multiple grouping variables.
starwars %>%
dplyr::group_by(species, homeworld) %>%
dplyr::summarise(mean_mass = mean(mass, na.rm = TRUE))
`summarise()` regrouping output by 'species' (override with `.groups` argument)
# A tibble: 58 x 3
# Groups: species [38]
species homeworld mean_mass
<chr> <chr> <dbl>
1 Aleena Aleen Minor 15
2 Besalisk Ojom 102
3 Cerean Cerea 82
4 Chagrian Champala NaN
5 Clawdite Zolan 55
6 Droid Naboo 32
7 Droid Tatooine 53.5
8 Droid <NA> 140
9 Dug Malastare 40
10 Ewok Endor 20
# … with 48 more rows
If you’ve been reading the little messages, you’ll see that they say something about ungrouping or regrouping the data. These messages inform you about the grouping structure of the final output. They were added recently, because not everyone expected the final output to still have a grouping structure, and data that is grouped when you don’t expect it to be can produce unexpected results. So a good habit is to ungroup our data when we’re done with it. We can just use the dplyr verb ungroup().
starwars %>%
dplyr::group_by(species, homeworld) %>%
dplyr::summarise(mean_mass = mean(mass)) %>%
dplyr::ungroup()
`summarise()` regrouping output by 'species' (override with `.groups` argument)
# A tibble: 58 x 3
species homeworld mean_mass
<chr> <chr> <dbl>
1 Aleena Aleen Minor 15
2 Besalisk Ojom 102
3 Cerean Cerea 82
4 Chagrian Champala NA
5 Clawdite Zolan 55
6 Droid Naboo 32
7 Droid Tatooine 53.5
8 Droid <NA> NA
9 Dug Malastare 40
10 Ewok Endor 20
# … with 48 more rows
We still get the message, but because we ungroup at the next step we can safely ignore it: we know our final output will have no grouping. However, the message is kind of annoying, so another option is to specify in the summarise() call that we want the output to be ungrouped.
starwars %>%
dplyr::group_by(species, homeworld) %>%
dplyr::summarise(mean_mass = mean(mass), .groups = 'drop')
# A tibble: 58 x 3
species homeworld mean_mass
<chr> <chr> <dbl>
1 Aleena Aleen Minor 15
2 Besalisk Ojom 102
3 Cerean Cerea 82
4 Chagrian Champala NA
5 Clawdite Zolan 55
6 Droid Naboo 32
7 Droid Tatooine 53.5
8 Droid <NA> NA
9 Dug Malastare 40
10 Ewok Endor 20
# … with 48 more rows
One thing about group_by() and summarise() is that the output will only contain the columns specified in group_by() and the ones created in summarise(). But sometimes we want to keep other columns too. Take our data item_all_long2.
item_all_long2
# A tibble: 60 x 4
id age item response
<chr> <dbl> <chr> <dbl>
1 s001 18 item1 4
2 s001 18 item2 4
3 s001 18 item3 4
4 s001 18 item4 4
5 s001 18 item5 4
6 s001 18 item6 4
7 s002 35 item1 8
8 s002 35 item2 8
9 s002 35 item3 8
10 s002 35 item4 8
# … with 50 more rows
This data also has an age column, and we might want to keep that even though we’re not going to do any summarising with it. Let’s see what happens when we just group by id and then work out the mean of response.
item_all_long2 %>%
dplyr::group_by(id) %>%
dplyr::summarise(mean_response = mean(response))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
id mean_response
<chr> <dbl>
1 s001 4
2 s002 8
3 s003 6
4 s004 0
5 s005 2
6 s006 2
7 s007 6
8 s008 2
9 s009 8
10 s010 1
The age column has disappeared. What can we do if we want to keep it? One option is just to add it to our group_by() call. Since each person has only one unique age, this won’t change the number or nature of the groups.
item_all_long2 %>%
dplyr::group_by(id,age) %>%
dplyr::summarise(mean_response = mean(response))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 10 x 3
# Groups: id [10]
id age mean_response
<chr> <dbl> <dbl>
1 s001 18 4
2 s002 35 8
3 s003 16 6
4 s004 27 0
5 s005 23 2
6 s006 20 2
7 s007 37 6
8 s008 28 2
9 s009 25 8
10 s010 36 1
We don’t only have to use summarise() with group_by(). We can also use mutate() with group_by() to do group-wise calculations. The most obvious use case of this is to work out something like a \(z\) score for grouped data.
Let’s take the data dev_data
:
dev_data
# A tibble: 20 x 3
id age score
<chr> <chr> <dbl>
1 s01 6 9.65
2 s02 6 9.91
3 s03 6 10.5
4 s04 6 8.56
5 s05 6 9.81
6 s06 6 11.5
7 s07 6 8.87
8 s08 6 12.2
9 s09 6 11.8
10 s10 6 8.68
11 s11 9 10.8
12 s12 9 13.4
13 s13 9 13.1
14 s14 9 10.3
15 s15 9 11.9
16 s16 9 13.2
17 s17 9 12.3
18 s18 9 11.3
19 s19 9 13.5
20 s20 9 10.5
In this data table we have scores on a maths test for two age groups. Let’s say we want to transform the scores to \(z\) scores, but we want to do it for each age group separately. This means what we’ll do is:
1. group our data by age group
2. work out the average score for each age group
3. work out the sd for each age group
4. subtract the group average from each participant’s score and divide by the group sd
Now let’s put those steps together.
dev_data %>% # our data
dplyr::group_by(age) %>% # group by age
dplyr::mutate(m = mean(score)) %>% # work out the mean for each group but don't collapse down
dplyr::mutate(s = sd(score)) %>% # do the same for sd
dplyr::mutate(z = (score - m) / s) # z score it
# A tibble: 20 x 6
# Groups: age [2]
id age score m s z
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 s01 6 9.65 10.2 1.32 -0.379
2 s02 6 9.91 10.2 1.32 -0.181
3 s03 6 10.5 10.2 1.32 0.284
4 s04 6 8.56 10.2 1.32 -1.21
5 s05 6 9.81 10.2 1.32 -0.258
6 s06 6 11.5 10.2 1.32 1.01
7 s07 6 8.87 10.2 1.32 -0.969
8 s08 6 12.2 10.2 1.32 1.57
9 s09 6 11.8 10.2 1.32 1.25
10 s10 6 8.68 10.2 1.32 -1.12
11 s11 9 10.8 12.0 1.25 -1.00
12 s12 9 13.4 12.0 1.25 1.09
13 s13 9 13.1 12.0 1.25 0.827
14 s14 9 10.3 12.0 1.25 -1.40
15 s15 9 11.9 12.0 1.25 -0.130
16 s16 9 13.2 12.0 1.25 0.956
17 s17 9 12.3 12.0 1.25 0.213
18 s18 9 11.3 12.0 1.25 -0.552
19 s19 9 13.5 12.0 1.25 1.20
20 s20 9 10.5 12.0 1.25 -1.21
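As a design note, those three mutate() calls can also be collapsed into one: within a single mutate(), later calculations can refer to columns created earlier in the same call. A sketch with made-up scores:

```r
library(dplyr) # attaches %>%
library(tibble)

dev_data_mini <- tribble( # made-up stand-in for dev_data
  ~id,   ~age, ~score,
  "s01", "6",     9.0,
  "s02", "6",    11.0,
  "s03", "9",    10.0,
  "s04", "9",    14.0
)

dev_data_mini %>%
  dplyr::group_by(age) %>%
  dplyr::mutate(m = mean(score),      # group mean
                s = sd(score),        # group sd
                z = (score - m) / s)  # z uses m and s from this same call
```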
¹ There’s another function called recode() which you can also use for recoding variables. However, I think it’s a badly written function, and the developers seem to agree, so it’s not clear how long recode() will remain part of the dplyr package.
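For reference, a recode() version of the packing-list example might look like the sketch below; old-value = new-value pairs are matched against the vector, and unmatched values pass through unchanged.

```r
items <- c("jandals", "scuba gear", "sun screen", "hat")

dplyr::recode(items,
              jandals = "flip flops",       # old value = new value
              `scuba gear` = "surf board")  # backticks for names with spaces
#> [1] "flip flops" "surf board" "sun screen" "hat"
```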