Creating vectors

We’ll create three types of vectors: numeric, character, and logical. Here are examples of each type:

# Numeric vectors 
n1 <- 20
n2 <- c(20, 25, 60, 55)

# Character vectors
c1 <- "Blue"
c2 <- c("Red", "Green", "Purple")

# Logical vectors
l1 <- TRUE
l2 <- c(TRUE, FALSE, TRUE)

Note that vectors can consist of one or many elements. Three common ways to create vectors with more than one element is to use c(), seq(), or rep().

c()

As illustrated above, one very common way to create vectors with more than one element is to use c() (“concatenate”), which simply combines whatever values you specify in the parentheses.

seq()

seq() applies to numeric vectors only:

n1 <- seq(from = 0, to = 10, by = 2)   #using 'by'
n1

## [1]  0  2  4  6  8 10

n2 <- seq(from = 0, to = 10, length.out = 5)   #using 'length.out'
n2

## [1]  0.0  2.5  5.0  7.5 10.0

n3 <- seq(1, 2, 0.1)   #no argument names specified; automatically uses 'from', 'to', 'by'
n3

##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

n4 <- 1:5    #shortcut for integer sequence; same as 'seq(1, 5, 1)'
n4

## [1] 1 2 3 4 5

seq() by default takes three parameters: starting value, end value, and a value that specifies how elements will be incremented (“by”), which can be substituted with “length.out”. Integer sequences can be created using a colon.

General note about argument names: A function’s argument names need not be specified, as illustrated when we created n3. If they are not specified, R uses arguments based on a default order. One way to learn about this order is to use ? (e.g., ?seq). If you specify the argument names, the order doesn’t matter. Putting all this together, hopefully it’s obvious why seq(by = -1, from = 10, to = 2) is the same as seq(10, 2, -1).

rep()

Vectors can also be created using rep(). As the name implies, this function is useful if you want to repeat an element or elements.

rep(1, 5)

## [1] 1 1 1 1 1

rep("blue", 3)

## [1] "blue" "blue" "blue"

rep(TRUE, 4)

## [1] TRUE TRUE TRUE TRUE

As should be obvious, the first parameter in the function specifies the element to repeat, and the second the number of times to repeat it.

Using more than one function

Perhaps the most powerful use of these functions comes from combining them. Here are a two examples:

rep(c("blue", "red"), 3)

## [1] "blue" "red"  "blue" "red"  "blue" "red"

c(rep(seq(0, 6, 2), 2), 4:1)

##  [1] 0 2 4 6 0 2 4 6 4 3 2 1

The second example is somewhat hard to follow, and is probably at the limit of complexity in terms of how many functions we want to combine. Separating a task into multiple lines of codes can help.

s <- rep(seq(0, 6, 2), 2)
c(s, 4:1)

##  [1] 0 2 4 6 0 2 4 6 4 3 2 1

Subsetting vectors

Extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process — whether applied to a vector or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.

In R, there are three ways to filter a vector: using a separate logical vector, using indexing, and using names. I tend to use the first method most, but all three are useful.

Subsetting with logicals

Let’s jump right into an example. Say we have a character vector with only two elements (“apple” and “banana”). Subsetting it to “apple” could be done like so:

fruits <- c("apple", "banana")
fruits[c(TRUE, FALSE)]

## [1] "apple"

Note the use of brackets, [] — this is common when filtering. Within these brackets is a vector with the same number of logical elements as there are elements in the vector you want to subset. Elements across the two vectors are matched by order: elements that match with TRUE are kept while elements that match with FALSE are dropped.

This process is extremely useful when combined with a logical operation. Please familiarize yourself with the logical operations listed here. For example, using a logical operation we can filter a large vector of oranges, apples and bananas:

# Create a vector with 30 fruits 
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits

##  [1] "orange" "apple"  "banana" "orange" "apple"  "banana" "orange"
##  [8] "apple"  "banana" "orange" "apple"  "banana" "orange" "apple"
## [15] "banana" "orange" "apple"  "banana" "orange" "apple"  "banana"
## [22] "orange" "apple"  "banana" "orange" "apple"  "banana" "orange"
## [29] "apple"  "banana"

# Create a logical vector for dropping bananas
# Note: I'm creating the exact same logical vector three times (overriding it each time)
# This is for illustrative purposes; using one of these is sufficient
lv <- fruits == "orange" | fruits == "apple"
lv <- fruits != "banana"
lv <- fruits %in% c("orange", "apple")
lv

##  [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [12] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## [23]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# Carry out the subset
fruits[lv]

##  [1] "orange" "apple"  "orange" "apple"  "orange" "apple"  "orange"
##  [8] "apple"  "orange" "apple"  "orange" "apple"  "orange" "apple"
## [15] "orange" "apple"  "orange" "apple"  "orange" "apple"

We applied the same logic as above: We have a vector (fruits) that we want to subset. We do so using a logical vector (lv), where elements that match with TRUE are kept. The only difference here is that we create the logical vector with a logical operation. The logical operators (e.g., !=, |) used here are discussed in the link above, with the exception of %in%.

General note about %in%: This operator is extremely useful as an alternative for repeated “or” (|) statements. For example, say you have a vector with 10 types of fruits and you want to keep elements that are equal to “orange”, “apple”, “mango”, “mandarin”, or “kiwi”. You could accomplish this by creating a logical vector like so: lv <- fruits == "orange" | fruits == "apple" | fruits == "mango" | fruits == "mandarin" | fruits == "kiwi".
What a nighmarishly long statement compared to the %in% option that accomplishes the exact same thing: lv <- fruits %in% c("orange", "apple", "mango", "mandarin", "kiwi").

Of course, subsetting using logicals can also be done on numeric vectors. Here are a few examples:

# Create a numeric vector
numbers <- seq(0, 100, by = 10)
numbers

##  [1]   0  10  20  30  40  50  60  70  80  90 100

# Illustrate three different filters
numbers[numbers <= 50 & numbers != 30]

## [1]  0 10 20 40 50

numbers[numbers == 0 | numbers == 100]

## [1]   0 100

numbers[numbers > 100] #returns an empty vector

## numeric(0)

Note that I didn’t create logical objects to carry out the subsets here, as opposed to above where we explicitly defined lv. I find it more compact and intuitive to take subsets without first creating a logical vector.

Subsetting using indexing

A different way to subset a vector is to specify the index or indeces you want to keep, again using brackets. Here are a few examples:

fruits <- c("apple", "banana")
fruits[1]

## [1] "apple"

fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[c(10, 20)]

## [1] "orange" "apple"

fruits[seq(1, 30, by = 5)]

## [1] "orange" "banana" "apple"  "orange" "banana" "apple"

I sometimes use this when I want to inspect or modify an element that I know occurs at a specific index in the vector, a more manual approach than using logical statements.

Subsetting using indexing can also be used in random sampling, which has many important applications — for example, in experiments and when you want to test-run code on a representative subset of your data. So, let’s introduce the sample() function:

# Draw 10 elements at random from 1 to 100 
sample(1:100, size = 10)

##  [1]  32  60  48  58  90  50  72 100  62  46

The function takes a vector of values (often successive integer values) and an argument that specifies how many values to draw at random from this vector. We can use the resulting values as indeces to subset another vector:

fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[sample(1:30, size = 5)]

## [1] "orange" "apple"  "orange" "orange" "orange"

Here, we’re drawing a random sample of five elements from the vector fruits. Why did I specify 1:30? Well, fruits consists of 30 elements, so specifying something like 1:100 likely would have resulted in sampled values outside the bounds of the vector (e.g., fruits[35] doesn’t exist). Specifying 1:30 gives every element in fruits an equal chance of being included in the sample.

Subsetting using names

Lastly, we can assign names to each element in a vector and take a subset based on the names.

age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
age #note that values now have names

##     mom     dad grandpa
##      50      55      80

age[c("dad", "grandpa")] #subset

##     dad grandpa
##      55      80

That is, we have a vector representing the age of three family members. We assign names to each value, and then keep the values associated with two of the family members.

Modifying vectors

The subsetting logic from above can be used to modify vectors. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered grandpa’s age above? We can fix it using indexing, a logical statement, or naming.

# Recreate vector with age values from above
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")

# Three ways of changing grandpa's age
# Note: you'd only need to use one of these
age[age == 80] <- 82 #using a logical statement
age[3] <- 82 #using indexing
age["grandpa"] <- 82 #using naming
age

##     mom     dad grandpa
##      50      55      82

A logical statement is most efficient when we need to change a lot of elements.

fruits <- rep(c("orange", "apple", "bamama"), 5)
fruits #bamamas anyone?

##  [1] "orange" "apple"  "bamama" "orange" "apple"  "bamama" "orange"
##  [8] "apple"  "bamama" "orange" "apple"  "bamama" "orange" "apple"
## [15] "bamama"

# Let's fix the misspelled element
fruits[fruits == "bamama"] <- "banana"
fruits

##  [1] "orange" "apple"  "banana" "orange" "apple"  "banana" "orange"
##  [8] "apple"  "banana" "orange" "apple"  "banana" "orange" "apple"
## [15] "banana"

Vector arithmetics

We can modify or create new numeric vectors using arithmetic operations. Three common types of operations involve:

A vector with more than one element and a vector with only one element.
Two vectors with the same number of elements. Elements are matched based on index.
A vector modified by a function.

In all cases, we can modify all elements of a vector or only a subset of elements using the bracket notation we learned above.

numbers <- 1:10
numbers

##  [1]  1  2  3  4  5  6  7  8  9 10

# One value modifying all values in a vector 
numbers <- numbers / 10
numbers

##  [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

# One value modifying a subset of a vector 
numbers[numbers > 0.5] <- numbers[numbers > 0.5] * 100
numbers

##  [1]   0.1   0.2   0.3   0.4   0.5  60.0  70.0  80.0  90.0 100.0

# Two vectors with the same number of elements 
numbers1 <- 1:10
numbers2 <- 10:1
numbers3 <- numbers2 - numbers1
numbers3

##  [1]  9  7  5  3  1 -1 -3 -5 -7 -9

# Replacing a subset of a vector using another vector
numbers <- 1:10
numbers[numbers > 5] <- 5:1
numbers

##  [1] 1 2 3 4 5 5 4 3 2 1

# Modify a vector (or a subset of a vector) using a function
numbers <- 1:10
sqrt(numbers) #square root

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

exp(numbers) #exponentiate

##  [1]     2.718282     7.389056    20.085537    54.598150   148.413159
##  [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795

log(numbers[c(1, 5, 10)]) #natural log

## [1] 0.000000 1.609438 2.302585

Vector arithmetics can also be carried out in R on two multi-value vectors with different number of elements. Such operations use the recycling rule.

Summarizing vectors

We often want to get summary statistics from a vector — that is, learn something general about it by looking beyond its constituent elements. If we have a vector in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for vectors:

numbers <- sample(1:1000, 10)
numbers

##  [1] 301 797 560 949 335 472 556 566 990 358

class(numbers) #check the class

## [1] "integer"

length(numbers) #number of elements

## [1] 10

max(numbers) #maximum value

## [1] 990

min(numbers) #minimum value

## [1] 301

sum(numbers) #sum of all values in the vector

## [1] 5884

mean(numbers) #mean

## [1] 588.4

median(numbers) #median

## [1] 558

var(numbers) #variance

## [1] 61181.16

sd(numbers) #standard deviation

## [1] 247.3482

quantile(numbers) #percentiles in intervals of .25

##     0%    25%    50%    75%   100%
## 301.00 386.50 558.00 739.25 990.00

quantile(numbers, probs = seq(0, 1, 0.1)) #percentiles in invervals of 0.1

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
## 301.0 331.6 353.4 437.8 522.4 558.0 562.4 635.3 827.4 953.1 990.0

summary(numbers) #function that contains many summary stats from above

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   301.0   386.5   558.0   588.4   739.2   990.0

If you forget one of these functions or if I haven’t included one here that you need, google almost surely has the answer for you. Also note that some of the operations above — most notably class() and length() — apply to non-numeric vectors.

Code generalization

We want our code to be as general as possible so that it can be reapplied to a different coding task or if the data change. Commands that summarize vectors can be useful to accomplish this.

Remember above when we found a random sample of fruits? Here is more or less the code we used:

fruits <- rep(c("orange", "apple", "banana"), 10)
length(fruits)

## [1] 30

random_sample <- fruits[sample(1:30, size = 5)]
random_sample

## [1] "orange" "banana" "apple"  "orange" "apple"

The third line, where we create random_sample, is not very general. Why? In this case, fruits has 30 elements. What if it instead had 50 elements? Then the third line would not give us a random sample. Or more precisely, this line would give us a random sample of the 30 first elements of fruits — the last 20 elements would not have a chance of being included. We could modify the third line to read random_sample <- fruits[sample(1:50, size = 5)]. But if we then modified fruits to have a different number of elements again we’d end up with the same problem.

Here’s the solution: find the number of elements of fruits using length() and then input this as an argument in the sample() function.

fruits <- rep(c("orange", "apple", "banana"), 100)
n <- length(fruits) #store the result of length() in an object 
n

## [1] 300

random_sample <- fruits[sample(1:n, size = 5)] #now use 'n' in the sample() function
random_sample

## [1] "orange" "orange" "orange" "banana" "orange"

# Or we could have used length() directly in the sample() function
# Note: Accomplishes the same thing as first creating 'n'
random_sample <- fruits[sample(1:length(fruits), size = 5)]

Exercises

Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.

What is the mean age of the people in your vector? Find out in two ways, with and without using the mean() command.
How old is the youngest person in your vector? (Use an R command to find out.)
What is the age gap between the youngest person and the oldest person in your vector? (Again use R to find out, and try to be as general as possible in the sense that your code should work even if the elements in your vector, or their order, change.)
How many people in your vector are above age 25? (Again, try to make your code work even in the case that your vector changes.)
Replace the age of the oldest person in your vector with the age of someone else you know.
Create a new vector that indicates how old each person in your vector will be in 10 years.
Create a new vector that indicates what year each person in your vector will turn 100 years old.
Create a new vector with a random sample of 3 individuals from your original vector. What is the mean age of the people in this new vector?