Subsetting allows us to select specific elements within a vector of items.
We can select elements by position (or index) in the vector. Note that R indexes by position (like Matlab) and not by offset (like Python and JavaScript), so the first element is 1 (not 0).
x[1]
Select the first element
x[4]
Select the fourth element
x[-4]
Select everything but the fourth element
x[2:5]
Select elements two to five
x[-(2:5)]
Select everything but elements two to five
x[c(1, 3)]
Select elements one and three
Trying it out!
Let us see how some of these operations will look in R. First, we’ll create a vector with some values in it, and then we’ll perform some operations on it.
x <- c(1, 9, 1, 18, 4, 1, 3, 8, 9, 13) # create a vector filled with random integers
x # print out the values to terminal
[1] 1 9 1 18 4 1 3 8 9 13
x[4] # select the fourth element
[1] 18
x[-4] # select everything but the fourth element
[1] 1 9 1 4 1 3 8 9 13
x[2:5] # select elements two to five
[1] 9 1 18 4
x[-(2:5)] # select everything but elements two to five
[1] 1 1 3 8 9 13
x[c(1, 3)] # select elements one and three
[1] 1 1
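As a small additional sketch (the vector v and the result names are illustrative choices, not part of the examples above), positions can also be computed rather than typed in. Here length() is used to reach the last element, and one position is requested twice:

```r
v <- c(1, 9, 1, 18, 4)   # a short example vector with five elements

last <- v[length(v)]     # length(v) is 5, so this selects the fifth (last) element
last

twice <- v[c(2, 2)]      # the same position can be requested more than once
twice

rest <- v[-1]            # a negative index drops that position, here the first
rest
```

Because R counts positions from 1, length(v) is always the position of the last element.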
To select elements that have a specific value, we’ll first ask R to tell us the index of those items. To do this we’ll just use one of the logical operators. Some examples of logical operators are:
== for equal to
!= for not equal to
> for greater than
< for less than
<= for less than or equal to
>= for greater than or equal to
To use these logical operators we need our vector, the operator, and the value to compare it to.
For example, x >= 1 means “which elements in x are greater than or equal to 1?”
When we perform a logical operation on a vector like this, R will tell us which elements match the logical rule and which don’t. It’ll print out TRUE for those that match and FALSE for those that don’t.
Trying it out!
Let us see what some of these operations will look like in R. We’ll again create a vector with some values in it, and then we’ll perform some operations on it.
x <- c(2, 8, 11, 10, -1) # create a vector with some numbers
x # print out the values to terminal
[1] 2 8 11 10 -1
x >= 1 # which elements are greater than or equal to 1
[1] TRUE TRUE TRUE TRUE FALSE
x == 2 # which elements are equal to 2
[1] TRUE FALSE FALSE FALSE FALSE
x != 2 # which elements are NOT equal to 2
[1] FALSE TRUE TRUE TRUE TRUE
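The operators not shown above work the same way. The sketch below uses a hypothetical vector called scores, and also relies on the fact that sum() treats TRUE as 1 and FALSE as 0, which gives a quick count of the matching elements:

```r
scores <- c(2, 8, 11, 10, -1)   # illustrative values

above <- scores > 8             # TRUE where the value is greater than 8
above

at_most <- scores <= 2          # TRUE where the value is less than or equal to 2
at_most

n_above <- sum(above)           # counts the TRUE values in 'above'
n_above
```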
We can also use logical operators to find which elements in a vector are a member of a set.
%in% for is a member of a set
To use %in% we’ll also need another vector (our comparison set).
Trying it out!
x # print out x just in case we forgot what was in it!
[1] 2 8 11 10 -1
x %in% c(1, 8, -1) # elements a member of the set {1, 8, -1}
[1] FALSE TRUE FALSE FALSE TRUE
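A membership test can be reversed by negating it with !. The following is a minimal sketch using the same values as above; the names in_set and not_in_set are just illustrative:

```r
x <- c(2, 8, 11, 10, -1)

in_set <- x %in% c(1, 8, -1)         # TRUE for members of the set {1, 8, -1}
in_set

not_in_set <- !(x %in% c(1, 8, -1))  # ! flips every TRUE to FALSE and vice versa
not_in_set
```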
Finally, the function is.na() can be used to test whether an element is a missing value. is.na() is used, for example, when checking your data for missing values that you might want to impute.
Trying it out!
x <- c(1,2,3,NA) # create a vector with a missing value
x
[1] 1 2 3 NA
is.na(x) # check which values are missing
[1] FALSE FALSE FALSE TRUE
One thing to note about missing values is that they can’t be tested using regular logical operations like ==, >, and the like. If you test a value that is NA using these operations, it will evaluate to NA, and not to FALSE.
x <- c(1, 2, 4, NA)
x == 4
[1] FALSE FALSE TRUE NA
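The same rule applies when NA is compared with itself, which is why is.na() is needed. As a hedged sketch (the result names are illustrative), the base R functions sum() and anyNA() can summarise how many values are missing:

```r
x <- c(1, 2, 4, NA)

na_test <- NA == NA         # comparing NA with == gives NA, not TRUE
na_test

n_missing <- sum(is.na(x))  # number of missing values in x
n_missing

has_missing <- anyNA(x)     # TRUE if at least one value is missing
has_missing
```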
In addition to using logical operations to check numeric values, we can also use logical operations to test the values of strings. This works in just the same way as it does for numeric values, except for the tests <, <=, >, and >=. Applied to strings, these compare alphabetical (collation) order rather than numeric size, so they are rarely what you want for text.
x <- "dog"
x == "cat"
[1] FALSE
x == "dog"
[1] TRUE
x != "cat"
[1] TRUE
x %in% c("dog", "cat", "rabbit")
[1] TRUE
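These tests also work element by element on a vector of strings. A minimal sketch, assuming a made-up vector called pets:

```r
pets <- c("dog", "cat", "rabbit", "dog")   # illustrative values

is_dog <- pets == "dog"                    # TRUE wherever the string is exactly "dog"
is_dog

is_common <- pets %in% c("dog", "cat")     # TRUE for strings found in the comparison set
is_common
```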
We can also do some more advanced logical subsetting by combining logical operations, using & (AND) and | (OR). Two operations joined by & evaluate to TRUE if both conditions are true, and to FALSE if either condition is false. Two operations joined by | evaluate to TRUE if either condition is true, and to FALSE if both conditions are false.
x > 5 & x < 10 will evaluate to TRUE only if x is a number between 5 and 10 (not including 5 and 10).
x > 5 | x < 10 will evaluate to TRUE for any number.
Trying it out!
x <- 6
x > 5 & x < 10
[1] TRUE
x <- 11
x > 5 & x < 10
[1] FALSE
x <- 6
x > 5 | x < 10
[1] TRUE
x <- 11
x > 5 | x < 10
[1] TRUE
In the examples above, x only contains a single number (it is a one-element vector). When combining logical operators and testing vectors with multiple elements, the tests are evaluated element by element, just as you would expect.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x > 5 & x < 10
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x > 5 | x < 10
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
If we have a vector of logical results, we can use the any() and all() functions to test whether any are true or all are true.
any(x > 5 & x < 10)
[1] TRUE
any(x > 5 | x < 10)
[1] TRUE
all(x > 5 & x < 10)
[1] FALSE
all(x > 5 | x < 10)
[1] TRUE
Combining logical operations with is.na() can be useful if your vector contains NAs. For example, you might want to know which elements match a particular condition, but also want elements that contain an NA to evaluate to FALSE (and not NA).
x <- c(1, 2, 3, 5, NA, 6)
x == 5 & !is.na(x) # elements equal to 5 evaluate as TRUE and NA as FALSE
[1] FALSE FALSE FALSE TRUE FALSE FALSE
Compare this to what happens when the & !is.na(x) is omitted.
x <- c(1, 2, 3, 5, NA, 6)
x == 5 # elements equal to 5 evaluate as TRUE and NA as NA
[1] FALSE FALSE FALSE TRUE NA FALSE
Once we know which elements of a vector match a logical rule, there are a couple of things we can do with this information: 1) we can ask for the element indexes, or 2) we can ask for the element values.
To get the positions in a vector that match a particular rule, we just wrap our logical operation in the which() function.
For example, which(x > 5) asks “what are the positions in x that have a value greater than 5?”
Trying it out!
x <- c(2, 8, 11, 10, -1) # create a vector with some numbers
x # print out x
[1] 2 8 11 10 -1
which(x == 11) # which position is equal to 11
[1] 3
# the next line will return an empty vector because no values are greater than 15
which(x > 15) # which position is greater than 15
integer(0)
# we can save the output of our logical operation to a variable and then use that as input for the which() function
matches <- x == 11
which(matches)
[1] 3
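The positions returned by which() are ordinary integers, so they can be counted or used as indexes. A short sketch with a combined condition (the names pos, vals, and n_matches are illustrative):

```r
x <- c(2, 8, 11, 10, -1)

pos <- which(x > 5 & x < 11)   # positions of values strictly between 5 and 11
pos

vals <- x[pos]                 # use the positions to pull out the values themselves
vals

n_matches <- length(pos)       # how many elements matched
n_matches
```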
We saw that when we used a logical operator on a vector, it returned a vector of TRUE and FALSE values. We can use these TRUE and FALSE values to select elements in a vector.
x[x > 0]
Select all elements that have a value greater than 0
x[x %in% c(1, 6, 8)]
Select all elements that are a member of the set {1, 6, 8}
Trying it out!
Let us see how some of these operations will look in R. We’ll again create a vector with some values in it, and then we’ll perform some operations on it.
x <- c(10, 10, 10, 11, -1) # create a vector with some numbers
x # print out the values to terminal
[1] 10 10 10 11 -1
x[x == 10] # all the elements with value 10
[1] 10 10 10
matches <- x == 10
x[matches] # note that there's no '' around matches because it's a variable name
[1] 10 10 10
Be careful: if any elements evaluate to NA, then NAs will be returned when you subset the vector.
x <- c(1, 2, 3, 5, NA, 6)
x[x == 5]
[1] 5 NA
One way to get around this is to ask for the elements that match the condition and aren’t NA.
x[x == 5 & !is.na(x)]
[1] 5
Or to wrap the logic in which().
x[which(x == 5)]
[1] 5
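Logical subsetting can also appear on the left-hand side of an assignment, which is one way to replace missing values. The sketch below fills the NA with the mean of the observed values. This is only one possible imputation choice, made here for illustration, and the name filled is hypothetical:

```r
x <- c(1, 2, 3, 5, NA, 6)

filled <- x                                     # work on a copy so x is left unchanged
filled[is.na(filled)] <- mean(x, na.rm = TRUE)  # replace each NA with the mean of the non-missing values
filled
```

The argument na.rm = TRUE tells mean() to ignore the missing value. Without it, the mean itself would be NA.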
When we have named vectors we can select elements using the name of that element.
x["arms"]
Select the element named arms
x[c("arms","legs")]
Select the elements named arms and legs
Trying it out!
x <- c(arms = 2, legs = 2, eyes = 2, heads = 1)
x
arms legs eyes heads
2 2 2 1
x["arms"]
arms
2
x[c("arms","legs")]
arms legs
2 2
When we subset a named vector, the result will also be a named vector. The fact that it’s named makes very little difference. Named and unnamed vectors behave identically in almost every situation. But if you don’t want the name, then you can just double up on the [].
x[["arms"]]
Select the element named arms and discard the name.
x[["arms"]]
[1] 2
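Because [[]] only accepts a single element, it cannot drop the names from several elements at once. As a hedged sketch, the base R functions names() and unname() can be used to inspect or remove the names instead (the result names are illustrative):

```r
x <- c(arms = 2, legs = 2, eyes = 2, heads = 1)

part_names <- names(x)                    # the names attached to the vector
part_names

no_names <- unname(x[c("arms", "legs")])  # select two elements, then strip their names
no_names

single <- x[["heads"]]                    # a single element without its name
single
```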
We can also ask R to give us subsets of data tables and lists. There are a lot of similarities between data tables and lists when it comes to subsetting. There are also a few differences, but the similarities make it worth dealing with them together.
First, let’s create a list and a data table (we’ll create a data table of type data.frame). Like vectors, lists can have named and unnamed elements, so we’ll create a list with two named elements and one unnamed element. Data tables, in contrast, are tabular, and therefore each column needs a name (if you don’t set one, R will set it for you, so it’s best to do it yourself!).
# our list
our_list <- list(el1 = c(1,2,3),
                 el2 = c("a","b","c"),
                 c("x","y"))
our_list
$el1
[1] 1 2 3
$el2
[1] "a" "b" "c"
[[3]]
[1] "x" "y"
# our data table
our_dt <- data.frame(col1 = c(1,2,3),
                     col2 = c("a","b","c"),
                     col3 = c("x","y",NA_character_))
our_dt
col1 col2 col3
1 1 a x
2 2 b y
3 3 c <NA>
To get named elements from lists and tables, there are two general approaches. The first is the same as you would use for a named vector: using [] and the name of the element. This works for both lists and data tables. The output for a list is also a list, and for a data table the output is also a data table. However, each only contains the selected element.
# extract the named element el1
our_list["el1"]
$el1
[1] 1 2 3
our_dt["col1"]
col1
1 1
2 2
3 3
We can also ask for multiple elements
our_list[c("el1","el2")]
$el1
[1] 1 2 3
$el2
[1] "a" "b" "c"
our_dt[c("col1","col2")]
col1 col2
1 1 a
2 2 b
3 3 c
While using single [] returns an object of the same type (a list for a list and a data table for a data table), sometimes we might want to access the data inside the list element, or the data inside the column. To get the data inside, you simply use [[]] (see the note at the end of this section). When using [[]], however, you can only select a single element, because the returned data will no longer be organised inside the list or data table.
our_list[["el1"]]
[1] 1 2 3
our_dt[["col2"]]
[1] "a" "b" "c"
In addition to getting elements using the [] and [[]] syntax, it’s also possible to get named elements using $. Using $ with a name gives the same result as using [[]] with that name.
our_list$el1
[1] 1 2 3
our_dt$col1
[1] 1 2 3
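One way to see the difference between these forms is to check the class of what each one returns. The sketch below recreates a smaller version of the list and data table (without the unnamed element and third column) so that it runs on its own:

```r
our_list <- list(el1 = c(1, 2, 3), el2 = c("a", "b", "c"))
our_dt <- data.frame(col1 = c(1, 2, 3), col2 = c("a", "b", "c"))

class(our_list["el1"])    # single [] keeps the container: "list"
class(our_list[["el1"]])  # [[]] returns the contents: "numeric"
class(our_dt["col1"])     # single [] keeps the container: "data.frame"
class(our_dt[["col1"]])   # [[]] returns the column contents: "numeric"

same <- identical(our_list$el1, our_list[["el1"]])  # $ and [[]] give the same result here
same
```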
You can also subset data tables and lists using indexes the same way that you would for a vector.
our_dt[c(1,3)] # the first and third column
col1 col3
1 1 x
2 2 y
3 3 <NA>
our_list[c(1,3)] # first and third list element
$el1
[1] 1 2 3
[[2]]
[1] "x" "y"
our_list[[1]] # content *inside* element 1
[1] 1 2 3
our_dt[[3]] # content *inside* column 3
[1] "x" "y" NA
Note that when the content inside the element is a vector or a list, then we can subset it further without having to save the intermediate result to a variable. To do this, we just add further [] to the end.
our_list[[2]][3]
[1] "c"
our_dt[[2]][[3]]
[1] "c"
Unlike lists, data tables are tabular, which means you can access specific cells inside the table. To do this, we just employ matrix notation. This is less cumbersome than using [[]][[]] style notation. To demonstrate this, we’ll create a new data table where each cell contains information about where in the table it’s located (its row and its column position).
our_dt2 <- data.frame(col1 = c("r1_c1","r2_c1","r3_c1"),
                      col2 = c("r1_c2","r2_c2","r3_c2"))
our_dt2
col1 col2
1 r1_c1 r1_c2
2 r2_c1 r2_c2
3 r3_c1 r3_c2
We can now request the cell by position using [row_num,col_num] (apart from using [] instead of (), this is the same as it would be done in Matlab).
our_dt2[3,2] # row 3 column 2
[1] "r3_c2"
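Leaving the row or the column position empty selects everything along that dimension, and column names can be used in place of column numbers. A minimal sketch that recreates the table so it runs on its own (stringsAsFactors = FALSE is passed explicitly as an assumption, so the columns are character vectors regardless of R version):

```r
our_dt2 <- data.frame(col1 = c("r1_c1", "r2_c1", "r3_c1"),
                      col2 = c("r1_c2", "r2_c2", "r3_c2"),
                      stringsAsFactors = FALSE)

row3 <- our_dt2[3, ]        # all columns of row 3 (still a data table)
row3

col2 <- our_dt2[, 2]        # all rows of column 2 (returned as a plain vector)
col2

cell <- our_dt2[1, "col2"]  # row 1 of the column named col2
cell
```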
Finally, rows, columns, or single cells can also be accessed using logic. As with logical subsetting of vectors, we just use some logic to generate our indexes and then use the indexes to subset.
Logical subsetting of a data table is usually used for filtering rows. To see this in action, we’ll first find which cells in col2 match a particular condition.
# The elements in col2 that match the condition of being equal to "r2_c2"
our_dt2$col2 == "r2_c2"
[1] FALSE TRUE FALSE
The output tells us that element 2 matches the condition. We can now use this condition inside [] to request row 2 of the data table. Because we want all the columns, we just don’t write anything after the , (in Matlab this would be done by using : after the ,).
our_dt2[our_dt2$col2 == "r2_c2",]
col1 col2
2 r2_c1 r2_c2
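Row filters can combine several conditions with & and |, and is.na() can guard against missing values, just as for vectors. The following sketch uses a table like the earlier our_dt example; the names keep, filtered, and filtered_cols are illustrative:

```r
our_dt <- data.frame(col1 = c(1, 2, 3),
                     col2 = c("a", "b", "c"),
                     col3 = c("x", "y", NA_character_),
                     stringsAsFactors = FALSE)

# rows where col1 is at least 2 AND col3 is not missing
keep <- our_dt$col1 >= 2 & !is.na(our_dt$col3)
keep

filtered <- our_dt[keep, ]                        # matching rows, all columns
filtered

filtered_cols <- our_dt[keep, c("col1", "col2")]  # matching rows, two named columns
filtered_cols
```

Row 3 satisfies the first condition but is dropped because its col3 value is NA.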
Note on [[]]: if you use Matlab, you can think of the distinction between an element and the contents of that element as the distinction between using () and {} to access elements in a cell array.