1 Introduction – Why Use R?

R is an incredibly powerful tool for data management and analysis.

Data management involves

Data analysis helps us assess what the data we’ve compiled tell us about the world. To do so, we can use descriptive statistics, graphs, or multivariate statistical tests.

As an example of data analysis, here’s a graph generated in R (using the ggvis package). It shows how the weight, number of cylinders, and miles per gallon of 32 different car models are related.

You’ll be able to create similar graphs soon. But first, let’s jump into the basics of R.


2 Data Structures

Four common object types that store data are:

  1. Scalars: store a single numeric value.

  2. Strings: store a set of one or more characters.

  3. Vectors: store several scalar or string elements.

  4. Data Frames: store several vectors (meaning that they contain several rows and columns).

To create any object in R, we use the assignment operator <-.


2.1 Scalars & Strings

The following line of code creates a scalar named a that stores the value 9:

a <- 9

This scalar can then be used to create a different scalar b:

b <- a + 1
b
## [1] 10

Any type of object can be overwritten. What value does b contain after running the following command?

b <- b - a


Objects need not be numeric. The following code creates an object c of mode character rather than numeric:

c <- "Hello world"

We call the object c a “string”. To check the class of an object, use the class() command.

class(a)
## [1] "numeric"
class(c)
## [1] "character"

2.2 Vectors

Vectors have several elements (whether numeric or non-numeric), and are created in the following way:

v <- c(1, 2, 3, 4)

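c stands for “combine” (you will also see it described as “concatenate”). R still finds the function c() even though we created a string object named c above. The object v now contains the vector {1, 2, 3, 4}, as can be seen if you call the object: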

v 
## [1] 1 2 3 4

A shortcut for creating a vector with an integer sequence is:

v <- 1:4
v
## [1] 1 2 3 4

Non-integer sequences can be created using the seq() command:

v <- seq(from = 0, to = 0.5, by = 0.1)
v
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
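
As a small aside, seq() can also be told how many values to produce instead of how large each step should be. The sketch below uses the length.out argument for this; the object names v_by and v_len are only illustrative.

```r
# Two ways to build the same sequence from 0 to 0.5
v_by  <- seq(from = 0, to = 0.5, by = 0.1)          #fix the step size
v_len <- seq(from = 0, to = 0.5, length.out = 6)    #fix the number of elements

length(v_len)              #number of elements we asked for
all.equal(v_by, v_len)     #compare, allowing for floating-point rounding
```

Using all.equal() rather than == is a deliberate choice here, since decimal sequences can differ by tiny rounding errors.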

You can also use scalar objects to create vectors:

a <- 0
b <- 22
v <- c(b, 1:4, a)
v
## [1] 22  1  2  3  4  0


Use the length() and mean() commands to find out how many elements a vector contains and the mean of a vector, respectively:

v <- c(1, 2, 3, 4)
length(v)   
## [1] 4
mean(v)     
## [1] 2.5

Of course, these values can themselves be stored as scalar objects:

length_v <- length(v)
mean_v <- mean(v)
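
A few other base R functions work on vectors in the same spirit as length() and mean(). The following is a minimal sketch; the object names (total_v, range_v, shifted, doubled) are made up for illustration.

```r
v <- c(1, 2, 3, 4)

total_v <- sum(v)      #sum of all elements
range_v <- range(v)    #smallest and largest element
shifted <- v + 10      #adds 10 to every element
doubled <- v * 2       #multiplies every element by 2

total_v
## [1] 10
shifted
## [1] 11 12 13 14
```

Arithmetic on a vector is applied element by element, which is why v + 10 returns a vector of the same length as v.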


Again, vectors need not contain numerical values. The following line of code creates a vector of strings:

v_colors <- c("blue", "yellow", "light green")
v_colors
## [1] "blue"        "yellow"      "light green"


Getting a particular element or elements of a vector can be very useful. This is done using brackets.

v_colors[2]         #get the second element of v_colors
## [1] "yellow"
v_colors[c(1, 3)]   #get the first and third elements
## [1] "blue"        "light green"

You can also reassign an element or elements of a vector:

v_colors[2:3]  <- c("red", "purple")
v_colors 
## [1] "blue"   "red"    "purple"
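
Brackets accept more than positions. As a rough sketch (the vector v_num is a made-up example), a negative index drops elements and a logical condition keeps only the elements for which it is TRUE:

```r
v_num <- c(5, 12, 8, 20)

dropped_first <- v_num[-1]           #everything except the first element
above_ten     <- v_num[v_num > 10]   #only elements greater than 10
positions     <- which(v_num > 10)   #positions of those elements

above_ten
## [1] 12 20
positions
## [1] 2 4
```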

2.3 Data Frames

Data frames are extremely useful. Think of them as datasets where each column represents a variable and each row represents a unit of observation.

To create a data frame, it is useful to go through two steps:

  1. Create the vectors (variables) that you want the data frame to contain.

  2. Piece these vectors together using the data.frame() command.

For example, say I wanted to create a data frame containing information about students’ names, heights (in centimeters), and GPAs:

name <- c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort")
height <- c(176, 175, 167, 230, 180)
gpa <- c(3.4, 2.8, 4.0, 2.2, 3.4)
df_students <- data.frame(name, height, gpa)   #piecing vectors together

The output in the console if we execute df_students now is:

df_students
##        name height gpa
## 1     Harry    176 3.4
## 2       Ron    175 2.8
## 3  Hermione    167 4.0
## 4    Hagrid    230 2.2
## 5 Voldemort    180 3.4

df_students is a data frame with three vectors. Put differently, df_students is a dataset with three variables: name (a nominal variable), height (a continuous variable), and gpa (a continuous variable), where the unit of observation is a student/individual.

We can create data frames without first creating vectors. The following creates the same data frame (df_students) as above:

df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
                          height = c(176, 175, 167, 230, 180),
                          gpa = c(3.4, 2.8, 4.0, 2.2, 3.4))
df_students
##        name height gpa
## 1     Harry    176 3.4
## 2       Ron    175 2.8
## 3  Hermione    167 4.0
## 4    Hagrid    230 2.2
## 5 Voldemort    180 3.4

Say we wanted to add a dummy variable (also called an indicator variable) that equals 1 if the individual is good and 0 if he or she is evil. We can do this using the $ operator:

df_students$good <- c(1, 1, 1, 1, 0)
df_students
##        name height gpa good
## 1     Harry    176 3.4    1
## 2       Ron    175 2.8    1
## 3  Hermione    167 4.0    1
## 4    Hagrid    230 2.2    1
## 5 Voldemort    180 3.4    0


To get the dimensions of a data frame, use the dim() command:

dim(df_students)  
## [1] 5 4

The data frame df_students has 5 rows and 4 columns.
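
If you only need one of the two dimensions, or want a quick overview of what a data frame contains, a few related base R functions may help. This sketch recreates df_students so that it runs on its own; n_rows, n_cols, and var_names are illustrative names.

```r
df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
                          height = c(176, 175, 167, 230, 180),
                          gpa = c(3.4, 2.8, 4.0, 2.2, 3.4),
                          good = c(1, 1, 1, 1, 0))

n_rows    <- nrow(df_students)    #number of rows
n_cols    <- ncol(df_students)    #number of columns
var_names <- names(df_students)   #variable (column) names
str(df_students)                  #compact overview: class and first values of each variable
```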

Again, we can get a particular element or set of elements of a data frame using brackets. The first number indicates the row and the second the column of the data frame:

df_students[2, 3]      #Ron's GPA
## [1] 2.8

You could also use

df_students$gpa[2]     #Ron's GPA
## [1] 2.8

We can get a full row or set of rows by leaving out the column number:

df_students[5, ]   
##        name height gpa good
## 5 Voldemort    180 3.4    0
df_students[3:5, ]   
##        name height gpa good
## 3  Hermione    167 4.0    1
## 4    Hagrid    230 2.2    1
## 5 Voldemort    180 3.4    0

Likewise, we can get a full column or set of columns:

df_students[, 2]  
## [1] 176 175 167 230 180
df_students$height   
## [1] 176 175 167 230 180
df_students[, 1:3]   
##        name height gpa
## 1     Harry    176 3.4
## 2       Ron    175 2.8
## 3  Hermione    167 4.0
## 4    Hagrid    230 2.2
## 5 Voldemort    180 3.4
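
Positions are not the only option inside the brackets. As a sketch (again recreating df_students so the block is self-contained), columns can be picked by name and rows by a condition; high_gpa, name_gpa, and tall are illustrative object names.

```r
df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
                          height = c(176, 175, 167, 230, 180),
                          gpa = c(3.4, 2.8, 4.0, 2.2, 3.4),
                          good = c(1, 1, 1, 1, 0))

name_gpa <- df_students[, c("name", "gpa")]            #columns selected by name
high_gpa <- df_students[df_students$gpa > 3, ]         #rows where gpa is above 3
tall     <- df_students$name[df_students$height > 176] #names of students taller than 176 cm
```

Selecting columns by name tends to be safer than selecting by position, because the code keeps working if the column order changes.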

As with vectors, we can reassign a given element or elements:

df_students[4, 2] <- 255       #reassign Hagrid's height
df_students$height[4] <- 255   #same thing as above
df_students
##        name height gpa good
## 1     Harry    176 3.4    1
## 2       Ron    175 2.8    1
## 3  Hermione    167 4.0    1
## 4    Hagrid    255 2.2    1
## 5 Voldemort    180 3.4    0

2.4 Exercises

  1. Create four scalar objects. Each should contain the age of a different family member. You can name these objects whatever you want. Then, using R:
    1. Find the age difference between the oldest and youngest family member.
    2. Find the total age of your family members.
    3. Find the mean age.
  2. Create a vector that contains the four objects you created in Exercise 1.
    1. Find the mean of the vector. (Of course, you should get the same answer as in Exercise 1.3.)
    2. How old will each of your family members be in 10 years? Hint: If your vector is called v, then you can add a number c to each element of that vector using v + c.
  3. Create a data frame that contains three variables: the name of each of your family members, their age, and their gender.
    1. Use R to find the class of each of these variables.
    2. (Conceptual) What type of variable (nominal, ordinal, continuous) is each of these variables? (No need to use R to answer this.)
    3. Add a variable to the data frame that indicates what year each of your family members will turn 100 years old. What is the mean of this variable?
    4. What is the mean age of your male family members? Of your female family members? Hint: use brackets to get particular elements of the data frame, then find the mean.

3 Reading Data

Note: For the next part of the tutorial we’ll be working with a dataset called world_small.csv, which you can download here.

You now know how to create some simple but common data objects in R. Often, however, we’ll want to read an existing dataset into R. Datasets come in many formats, e.g., .csv, .txt, .dta, .RData, and online data structures (HTML tables). R can read most data formats as is, but sometimes it may be necessary to manually reformat some elements in the file or even to convert the whole file to a different format (e.g., using Stat/Transfer). For now, we’ll assume that the file is in a readable format.

To read a file you need to

  1. Specify where the file is located on your computer. This is referred to as setting your working directory.

  2. Execute a command that will read the file from your working directory.


3.1 Setting the Working Directory

You can set your working directory manually. In RStudio, go to Session –> Set Working Directory –> Choose Directory… and find the folder in which your file is located.

While this works, it is good coding practice to also include a line of code near the top of your .R file that sets the working directory whenever the script reads a file. To do this, use the command setwd(path-to-dir), where path-to-dir is the path to the folder in which the file is located. One way to find this path is to set your working directory manually first; the path to the directory then shows up in the R console.

To set my working directory for world_small.csv, I include the following line somewhere in the beginning of my .R file:

setwd("~/dropbox/155/tutorial1")

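Note that the path to your working directory may look different from mine. On Windows, paths are usually displayed with backslashes; inside an R string, write them with forward slashes or doubled backslashes (\\), because a single backslash is treated as an escape character.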
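
To check which directory R is currently using, getwd() is the counterpart of setwd(). The sketch below switches to R’s temporary folder (tempdir()) purely as a stand-in for your own project folder, then switches back; old_dir and demo_dir are illustrative names.

```r
old_dir  <- getwd()      #remember the current working directory
demo_dir <- tempdir()    #stand-in folder; replace with the path to your own folder

setwd(demo_dir)                   #change the working directory
getwd()                           #confirm where we are now
file.exists("world_small.csv")    #is the data file visible from here?

setwd(old_dir)                    #return to where we started
```

Running file.exists() before trying to read a file is a cheap way to find out whether the working directory is set correctly.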


3.2 Reading the File

Now that we’ve told R where to look for the file we want to read, it’s time to actually read the file. Different commands are used to read different types of files. This is the syntax used for reading a .csv file:

world <- read.csv("world_small.csv", stringsAsFactors = TRUE)   #read text columns as factors (the default before R 4.0)

I’m reading the file from the working directory and assigning it to an object called world, which is of class “data.frame”:

class(world)
## [1] "data.frame"

Let’s check if the file was read correctly, using dim() (returns the dimensions), head() (returns the top six rows), and summary() (returns summary information about each variable):

dim(world)
## [1] 145   4
head(world)   #same as: world[1:6, ]
##     country       region gdppcap08 polityIV
## 1   Albania   C&E Europe      7715     17.8
## 2   Algeria       Africa      8033     10.0
## 3    Angola       Africa      5899      8.0
## 4 Argentina   S. America     14333     18.0
## 5   Armenia   C&E Europe      6070     15.0
## 6 Australia Asia-Pacific     35677     20.0
summary(world)
##       country             region     gdppcap08        polityIV
##  Albania  :  1   Africa      :42   Min.   :  188   Min.   : 0.00
##  Algeria  :  1   C&E Europe  :25   1st Qu.: 2153   1st Qu.: 7.67
##  Angola   :  1   Asia-Pacific:24   Median : 7271   Median :16.00
##  Argentina:  1   S. America  :19   Mean   :13252   Mean   :13.41
##  Armenia  :  1   Middle East :16   3rd Qu.:19330   3rd Qu.:19.00
##  Australia:  1   W. Europe   :12   Max.   :85868   Max.   :20.00
##  (Other)  :139   (Other)     : 7

Everything looks as we would have hoped.
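
If you would like to try the read-and-check routine without the course file, here is a minimal sketch that writes a tiny made-up .csv file to a temporary folder and reads it back. The data frame toy, its values, and the file name toy_world.csv are all hypothetical.

```r
# Made-up three-row dataset, written to a temporary .csv file
toy <- data.frame(country = c("A", "B", "C"),
                  region  = c("Africa", "Africa", "Europe"),
                  score   = c(10, 15, 20))
toy_path <- file.path(tempdir(), "toy_world.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read it back and run the same kinds of checks as above
toy_read <- read.csv(toy_path, stringsAsFactors = TRUE)
dim(toy_read)
str(toy_read)
class(toy_read$region)
```

With stringsAsFactors = TRUE, text columns such as region come back as factors; without it, R 4.0 and later reads them as character vectors.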


3.3 Exercises

  1. Read the world_small.csv file from a directory on your computer. Put it in a directory that will allow you to keep your files organized throughout the quarter (not on your Desktop).

4 Installing & Loading Packages

R is open-source, meaning that anyone can write a package that extends its functionality. We’ll make use of many packages in this class. To use a package you must

  1. Install it. You only need to do this once.
  2. Load it. Loaded packages do not carry over between sessions, so you need to load a package again each time you restart R.

To install packages plyr, dplyr, and ggplot2, run

install.packages(c("plyr", "dplyr", "ggplot2"), dependencies = TRUE)

To load these packages:

library(plyr)      #library() stops with an error if the package is not installed
library(dplyr)
library(ggplot2)

A more compact alternative for loading several packages at once:

sapply(c("plyr", "dplyr", "ggplot2"), require, character.only = TRUE)
##    plyr   dplyr ggplot2
##    TRUE    TRUE    TRUE
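
Since installing only needs to happen once, it can be handy to check which packages are still missing before calling install.packages(). The following is one possible sketch, not the only way to do it; pkgs, installed, and missing_pkgs are illustrative names, and the install line is left commented out so the block can be run without downloading anything.

```r
pkgs <- c("plyr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed    <- sapply(pkgs, requireNamespace, quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs, dependencies = TRUE)   #uncomment to install them
}
```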

4.1 Exercises

  1. Install packages plyr, dplyr, and ggplot2.

5 Basic Data Manipulation

5.1 Subsets

There are many ways to subset data frames in R. Here are three ways to subset the world data frame to countries in Africa:

setwd("~/dropbox/155/tutorial1")
world <- read.csv("world_small.csv", stringsAsFactors = TRUE)   #keep text columns as factors
afr1 <- world[world$region == "Africa", ]   #option 1: use brackets
dim(afr1)
## [1] 42  4
afr1 <- subset(world, region == "Africa")   #option 2: use subset()
dim(afr1)
## [1] 42  4
require(dplyr)
afr1 <- filter(world, region == "Africa")   #option 3: use filter() from package dplyr
dim(afr1)
## [1] 42  4

Subset to African countries with a polity score of at least 15:

afr2 <- world[world$region == "Africa" & world$polityIV >= 15, ]          #option 1
afr2 <- subset(world, region == "Africa" & polityIV >= 15)                #option 2
afr2 <- filter(world, region == "Africa", polityIV >= 15)                 #option 3

Same as above, keeping only the variables “country” and “polityIV”:

afr3 <- world[world$region == "Africa" & world$polityIV >= 15, c(1, 4)]   #option 1
afr3 <- subset(world, region == "Africa" & polityIV >= 15,                #option 2
               select = c("country", "polityIV"))
afr3 <- filter(world, region == "Africa", polityIV >= 15) %>%             #option 3
        select(country, polityIV)

Notes about how we produced these subsets:

  • We used the comparison operators == and >= and the logical operator &.
  • The distinction between a double equal sign (==) and a single equal sign (=) is important. At the top level of a script, a single equal sign acts as an assignment operator, so a <- 3 and a = 3 do the same thing. (Inside a function call, = instead matches a value to an argument name, as in seq(from = 0, to = 0.5, by = 0.1).) On the other hand, a == 3 tests whether a is equal to 3 and returns either TRUE or FALSE, provided that a has been defined.
  • Read more about logical operators.
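
To get a feel for these operators outside of a subsetting call, here is a small sketch using a scalar and a made-up character vector (region_toy is a hypothetical name, not part of the world data):

```r
a <- 3
a == 3              #TRUE: a is equal to 3
a != 3              #FALSE: "not equal to"
a > 1 & a < 5       #TRUE: both conditions hold
a < 1 | a > 2       #TRUE: at least one condition holds

region_toy <- c("Africa", "Europe", "Africa", "Asia")
is_africa  <- region_toy == "Africa"                #one TRUE/FALSE per element
in_set     <- region_toy %in% c("Africa", "Asia")   #membership in a set of values
n_africa   <- sum(is_africa)                        #TRUE counts as 1, so this counts matches
```

Applied to a vector, a comparison returns one TRUE or FALSE per element, which is exactly what the bracket-based subsetting above relies on.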

5.2 Creating New Variables

The most common way to add a variable to a dataset is to use the $ operator followed by a new variable name:

world$gdp_log <- log(world$gdppcap08)              #add logged gdp per cap variable
world$democ <- ifelse(world$polityIV > 10, 1, 0)   #create democracy dummy variable
head(world)
##     country       region gdppcap08 polityIV gdp_log democ
## 1   Albania   C&E Europe      7715     17.8   8.951     1
## 2   Algeria       Africa      8033     10.0   8.991     0
## 3    Angola       Africa      5899      8.0   8.683     0
## 4 Argentina   S. America     14333     18.0   9.570     1
## 5   Armenia   C&E Europe      6070     15.0   8.711     1
## 6 Australia Asia-Pacific     35677     20.0  10.482     1

We used ifelse() to create the variable democ, a dummy variable that equals 1 if a country has a Polity IV score above 10 and 0 otherwise.

ifelse works in the following way:

  • It takes three arguments (separated by commas).
  • The first argument is a “test”—a conditional statement that can be either TRUE or FALSE.
  • The second argument tells R what to do when the test is TRUE. In this case, R assigns a ‘1’ to the variable democ.
  • The third argument tells R what to do when the test is FALSE. In this case, R assigns a ‘0’ to the variable democ.
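
The three arguments can be seen at work on a short made-up vector (polity_toy and democ_toy are illustrative names). Note two details in this sketch: a score of exactly 10 does not pass the test polity_toy > 10, and a missing score stays missing.

```r
polity_toy <- c(17.8, 10, 8, NA)

democ_toy <- ifelse(polity_toy > 10, 1, 0)   #test, value if TRUE, value if FALSE
democ_toy
## [1]  1  0  0 NA
```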

mutate() from package dplyr is another way to create new variables. Using the data management functions included in dplyr has many advantages (see below).

The following code accomplishes the same thing as the code above using mutate():

world <- mutate(world, gdp_log = log(gdppcap08),
                       democ = ifelse(polityIV > 10, 1, 0))
head(world)
##     country       region gdppcap08 polityIV gdp_log democ
## 1   Albania   C&E Europe      7715     17.8   8.951     1
## 2   Algeria       Africa      8033     10.0   8.991     0
## 3    Angola       Africa      5899      8.0   8.683     0
## 4 Argentina   S. America     14333     18.0   9.570     1
## 5   Armenia   C&E Europe      6070     15.0   8.711     1
## 6 Australia Asia-Pacific     35677     20.0  10.482     1


In world, the variable region is a factor variable:

class(world$region)
## [1] "factor"

Factor variables are an important class of variables. They are simply categorical variables. Here are the categories (called “levels” in R) of region:

levels(world$region)
## [1] "Africa"       "Asia-Pacific" "C&E Europe"   "Middle East"
## [5] "N. America"   "S. America"   "Scandinavia"  "W. Europe"

While this variable could have been stored in character mode, storing it as a factor makes life easier. For example, note that region splits European countries into three categories: “C&E Europe”, “Scandinavia”, and “W. Europe”. Factor variables are easy to recode. Let’s create a new region variable that groups all European countries together:

world$region2 <- world$region         #create new region variable identical to 'region'
levels(world$region2) <- c("Africa", "Asia-Pacific", "Europe", "Middle East",
                           "N. America", "S. America", "Europe", "Europe")     #relevel
table(world$region)                   #number of countries by region (original variable) 
##
##       Africa Asia-Pacific   C&E Europe  Middle East   N. America
##           42           24           25           16            3
##   S. America  Scandinavia    W. Europe
##           19            4           12
table(world$region2)                  #number of countries by region (recoded variable)
##
##       Africa Asia-Pacific       Europe  Middle East   N. America
##           42           24           41           16            3
##   S. America
##           19

Note that the number of European countries in region2 is equal to the combined number of European countries (“C&E Europe”, “Scandinavia”, and “W. Europe”) in region.
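
The same recoding idea can be tried on a tiny made-up factor (region_toy and region_toy2 are hypothetical names). The point of this sketch is that the new labels must be given in the same order as the existing levels, and that repeating a label merges the corresponding levels.

```r
region_toy <- factor(c("W. Europe", "Africa", "Scandinavia", "Africa"))
levels(region_toy)     #existing levels, in sorted order
## [1] "Africa"      "Scandinavia" "W. Europe"

region_toy2 <- region_toy
levels(region_toy2) <- c("Africa", "Europe", "Europe")   #one new label per old level
table(region_toy2)     #number of observations per recoded level
```

It is worth calling levels() first, as above, so that you know the order in which to supply the new labels.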


5.3 Sorting

The easiest way to re-order a data frame is to use arrange() from dplyr. The function takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

head(world)                                  #original order
##     country       region gdppcap08 polityIV gdp_log democ      region2
## 1   Albania   C&E Europe      7715     17.8   8.951     1       Europe
## 2   Algeria       Africa      8033     10.0   8.991     0       Africa
## 3    Angola       Africa      5899      8.0   8.683     0       Africa
## 4 Argentina   S. America     14333     18.0   9.570     1   S. America
## 5   Armenia   C&E Europe      6070     15.0   8.711     1       Europe
## 6 Australia Asia-Pacific     35677     20.0  10.482     1 Asia-Pacific
world <- arrange(world, gdppcap08)           #order by gdp per cap
head(world)                                  
##          country region gdppcap08 polityIV gdp_log democ region2
## 1       Zimbabwe Africa       188     6.00   5.236     0  Africa
## 2 Congo Kinshasa Africa       321    15.00   5.771     1  Africa
## 3        Liberia Africa       388    10.00   5.961     0  Africa
## 4  Guinea-Bissau Africa       538    11.00   6.288     1  Africa
## 5        Eritrea Africa       632     3.00   6.449     0  Africa
## 6          Niger Africa       684    15.33   6.528     1  Africa
world <- arrange(world, desc(gdppcap08))     #order by gdp per cap (descending)
head(world)
##         country       region gdppcap08 polityIV gdp_log democ      region2
## 1         Qatar  Middle East     85868        0   11.36     0  Middle East
## 2        Norway  Scandinavia     58138       20   10.97     1       Europe
## 3     Singapore Asia-Pacific     49284        8   10.81     0 Asia-Pacific
## 4 United States   N. America     46716       20   10.75     1   N. America
## 5       Ireland    W. Europe     44200       20   10.70     1       Europe
## 6   Switzerland    W. Europe     42536       20   10.66     1       Europe
world <- arrange(world, region, country)     #order by region, then country
head(world)
##        country region gdppcap08 polityIV gdp_log democ region2
## 1      Algeria Africa      8033     10.0   8.991     0  Africa
## 2       Angola Africa      5899      8.0   8.683     0  Africa
## 3        Benin Africa      1468     16.2   7.292     1  Africa
## 4     Botswana Africa     13392     19.0   9.502     1  Africa
## 5 Burkina Faso Africa      1161     10.0   7.057     0  Africa
## 6     Cameroon Africa      2215      6.0   7.703     0  Africa
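
For comparison, sorting can also be done in base R with order() inside brackets. This is only a sketch on made-up data (toy and its values are hypothetical), not a replacement for arrange():

```r
toy <- data.frame(country = c("C", "A", "B", "D"),
                  region  = c("Y", "X", "X", "Y"),
                  gdp     = c(300, 100, 200, 50))

by_gdp      <- toy[order(toy$gdp), ]                   #ascending gdp
by_gdp_desc <- toy[order(-toy$gdp), ]                  #descending gdp
by_region   <- toy[order(toy$region, toy$country), ]   #by region, then country
```

order() returns the row positions in sorted order, and the brackets then rearrange the rows accordingly.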

5.4 Combining Tasks with Piping

dplyr makes it possible to write beautiful, fast code that combines different data management tasks using the %>% (piping) operator. Say we wanted to

  • subset world to South American countries with a polity score above 10,
  • create two new variables (logged gdp and a democracy dummy),
  • keep only some of the variables, and
  • re-order the data frame based on logged gdp (descending order).

With dplyr, our code might look something like this:

samr <- world %>%
        filter(region == "S. America", polityIV > 10) %>%    #subset
        mutate(gdp_log = log(gdppcap08),                     #create new variables
               democ = ifelse(polityIV > 10, 1, 0)) %>%
        select(country, gdppcap08, gdp_log, democ) %>%       #keep four variables
        arrange(desc(gdp_log))                               #sort based on logged gdp
samr
##        country gdppcap08 gdp_log democ
## 1        Chile     14465   9.579     1
## 2    Argentina     14333   9.570     1
## 3    Venezuela     12804   9.458     1
## 4      Uruguay     12734   9.452     1
## 5   Costa Rica     11241   9.327     1
## 6       Brazil     10296   9.240     1
## 7     Colombia      8885   9.092     1
## 8         Peru      8507   9.049     1
## 9      Ecuador      8009   8.988     1
## 10     Jamaica      7705   8.950     1
## 11 El Salvador      6794   8.824     1
## 12   Guatemala      4760   8.468     1
## 13    Paraguay      4709   8.457     1
## 14     Bolivia      4278   8.361     1
## 15    Honduras      3965   8.285     1
## 16   Nicaragua      2682   7.894     1
## 17      Guyana      2542   7.841     1

Here’s a short explanation:

  • Since we specify world on the first line, it is the data frame that enters the chain.
  • The %>% operator pieces the different functions together: each function receives the result of the previous step as its first argument, so the steps are executed from top to bottom. world itself is not modified.
  • The result is saved to the new data frame samr.

I recommend using this piping functionality whenever you can.
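
To see what a pipe actually does, it can help to compare a piped chain with the same steps written out using intermediate objects. The sketch below assumes R 4.1 or later, which provides a built-in pipe |> that behaves much like %>% for simple chains; it uses base R functions and made-up data (toy, step1, step2, and piped are hypothetical names) so that it runs without dplyr.

```r
# Made-up toy data; names and values are purely illustrative
toy <- data.frame(country = c("A", "B", "C", "D"),
                  region  = c("X", "X", "Y", "X"),
                  gdp     = c(100, 400, 900, 250))

# Without a pipe: each step is stored in an intermediate object
step1 <- subset(toy, region == "X")
step2 <- transform(step1, gdp_log = log(gdp))

# With the native pipe: the left-hand result becomes the first argument on the right
piped <- toy |>
  subset(region == "X") |>
  transform(gdp_log = log(gdp))

identical(step2, piped)   #both routes give the same data frame
names(toy)                #toy itself still has its original three columns
```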


5.5 Exercises

  1. Add two variables to world:
    1. A variable that recodes polityIV from its 0 to 20 scale to a -10 to 10 scale.
    2. A variable that categorizes a country as “rich” or “poor” based on some cutoff of gdppcap08 you think is reasonable.
  2. Subset world to European countries.
  3. Drop the region variable (keep the rest).
  4. Sort the data frame based on Polity IV.
  5. Repeat Exercises 1-4 using dplyr’s piping functionality.
  6. How many countries in Europe are “rich” according to your coding? How many are poor? What percentage have Polity IV scores of at least 18?

6 Troubleshooting

Many, many times when coding you’ll have an idea of what you want to do but won’t know how to do it in R. This happens even for experienced coders. With the right strategies, you’ll be able to solve a majority of issues you run into yourself. Not having to ask someone else every time you run into a problem will save you a lot of time.

When you’re stuck, consult class material (handouts, textbook, etc.). Perhaps more efficiently, google what you’re trying to do. For example, if you want to find the mean of a variable, try googling “how to find mean in R” and there likely will be tons of explanations of how to do this.

R also has a nifty help feature that is called using the following syntax: ?commandname, where commandname is the name of the command that you need help with. For example, ?mean will bring up a help dialog box with information about how to use R’s mean() command.
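
Besides opening the full help page, you can ask R directly which arguments a command accepts. This is a small sketch using read.csv() as the example command; arg_names is an illustrative name, and the help page is only opened when R is running interactively.

```r
# Opening a help page only makes sense in an interactive session
if (interactive()) {
  help("read.csv")    #same page as ?read.csv
}

args(read.csv)                          #print the command's arguments and their defaults
arg_names <- names(formals(read.csv))   #the argument names as a character vector
"header" %in% arg_names                 #check whether a given argument exists
```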


6.1 Exercises

  1. Use an R command to find the mean of the vector x defined below, ignoring NA values. If you try mean(x), R will return NA, and we don’t want this. Do not change the state of the vector in any way. (Use google and/or ?mean to figure this one out.)
x <- sample(c(rep(NA, 200), runif(800)), 500)


Link to .R file with code used in this tutorial (with minimal commenting)