1 Preliminaries

In this tutorial we’ll use the ANES cumulative file available here and on Coursework. Before we begin, let’s take care of some preliminary tasks, including loading the packages we need and importing this file.

# Load packages (install before if haven't already)
libs <- c("foreign", "plyr", "dplyr", "ggplot2")
sapply(libs, require, character.only = TRUE)
## foreign    plyr   dplyr ggplot2 
##    TRUE    TRUE    TRUE    TRUE
# Import the anescum.dta file
setwd("~/dropbox/155/tutorial2")
system("unzip ANEScumul_data_codebook.zip")                #unzip (using batch command)
anes <- read.dta("anescum.dta", warn.missing.labels = F)   #read file
system("rm -r anescum* __M*")                              #delete unzipped files (using batch command)

# Keep only some variables and rename them 
# see http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
# for the combined select/rename function used here    
anes <- anes %>%
        select(year = VCF0004,
               age = VCF0101,
               gender = VCF0104,
               race = VCF0106,
               edu = VCF0140,
               south = VCF0113,
               income = VCF0114,
               partyid = VCF0301,
               interest = VCF0310,
               govtrust = VCF0604,
               abortion = VCF0838,
               demtherm = VCF0218)

To learn more about the variables we kept, consult the ANES codebook included in the .zip file above.

You can learn more about using batch commands with system() here. These commands may work slightly different in Windows.


2 Basic Data Manipulation (Cont.)

Tutorial 1 focused heavily on data manipulation. It is worth spending a little more time on this. In particular, let’s go over how to recode factor variables into numeric variables.

The abortion variable is a factor variable with 6 categories, 4 of which are meaningful for our purposes:

class(anes$abortion)
## [1] "factor"
summary(anes$abortion)
##                    0. NA; no Post IW; version NEW (2008) 
##                                                     1594 
##            1. By law, abortion should never be permitted 
##                                                     2757 
##  2. The law should permit abortion only in case of rape, 
##                                                     6726 
## 3. The law should permit abortion for reasons other than 
##                                                     3710 
##    4. By law, a woman should always be able to obtain an 
##                                                     8841 
##                                             9. DK; other 
##                                                      473 
##                                                     NA's 
##                                                    25659

Say we want to create a dummy variable that equals 1 if the respondent believes women should always be able to obtain abortion, and 0 otherwise (coding categories 0 and 9 as NA). Here are three ways of doing this:

# Option 1 (using ifelse)
anes$abortionD <- with(anes,
   ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
   ifelse(abortion == "1. By law, abortion should never be permitted" |
          abortion == "2. The law should permit abortion only in case of rape," |
          abortion == "3. The law should permit abortion for reasons other than", 0, NA)))
table(anes$abortionD)
## 
##     0     1 
## 13193  8841
# Option 2 (using ifelse and %in%)
anes$abortionD <- with(anes,
   ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
   ifelse(abortion %in% c("1. By law, abortion should never be permitted",
                          "2. The law should permit abortion only in case of rape,",
                          "3. The law should permit abortion for reasons other than"), 0, NA)))
table(anes$abortionD)
## 
##     0     1 
## 13193  8841
# Option 3 (recode factor levels, then convert to numeric)
anes$abortionD <- anes$abortion                               #clone variable 
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)               #recode levels
anes$abortionD <- as.numeric(as.character(anes$abortionD))    #convert to numeric
table(anes$abortionD)
## 
##     0     1 
## 13193  8841

Options 1 and 2 both use ifelse(), which turns out to be quite cumbersome here because of the long labels. Option 3 requires less code, and may therefore be preferred.

However, when using Option 3, make sure you correctly convert the variable from factor to numeric. If you want to convert numbers stored as factor levels in R to numeric values, you need to first convert them to character, then convert them to numeric values. Above, this is done using as.numeric(as.character(anes$abortionD)).

If we were to convert the variable from factor to numeric without the intermediate step of converting them to character the following happens:

anes$abortionD <- anes$abortion                               #clone variable 
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)               #recode levels
anes$abortionD <- as.numeric(anes$abortionD)                  #FAIL
table(anes$abortionD)
## 
##     1     2 
## 13193  8841

R converts the numbers to ‘1’ and ‘2’ instead of ‘0’ and ‘1’. This has to do with how R stores factor levels internally. If you want to convert a factor variable to numeric, always remember to convert factors using as.numeric(as.character(var)) where var is your variable of interest.


2.1 Exercises

  1. Create a new variable called incomeD which recodes income in the anes data frame into a (numeric) dummy variable that equals 1 if the respondent’s income is in the 68th percentile or above and 0 otherwise.

3 Univariate Descriptive Statistics

Let’s now begin some basic data analysis. As part of this, it is important to get a sense for what variables look like separately (their univariate distributions). The univariate descriptive statistics you produce will depend on whether the variable is categorical or numeric.


3.1 Categorical variables

Let’s first recode the 7-point party ID variable to a 3-point party ID variable:

table(anes$partyid)
## 
## 0. DK; NA; other; refused to answer; no Pre IW 
##                                            968 
##                             1. Strong Democrat 
##                                           9320 
##                               2. Weak Democrat 
##                                          10390 
##                      3. Independent - Democrat 
##                                           5649 
##                   4. Independent - Independent 
##                                           5617 
##                    5. Independent - Republican 
##                                           4772 
##                             6. Weak Republican 
##                                           6790 
##                           7. Strong Republican 
##                                           5592
anes$partyid3 <- anes$partyid
levels(anes$partyid3) <- c(NA, "Democrat", "Democrat", "Democrat",
                           "Independent", "Republican", "Republican",
                           "Republican") 

table(anes$partyid3)
## 
##    Democrat Independent  Republican 
##       25359        5617       17154

As is already apparent, table() is a useful function for getting the distribution of a categorical variable. To get the proportion of respondents in each category, you can add prop.table():

prop.table(table(anes$partyid3))
## 
##    Democrat Independent  Republican 
##      0.5269      0.1167      0.3564


We’ll use the package ggplot2 for most of our graphing (see below). It produces high-quality, customizable graphs for publication. However, if you want to quickly examine the distribution of a categorical variable, you can use barplot() in combination with table() and prop.table():

barplot(table(anes$partyid3))

plot of chunk unnamed-chunk-7

barplot(prop.table(table(anes$partyid3)))

plot of chunk unnamed-chunk-7


3.2 Numeric variables

There are many commands that will help you learn about the distribution of a variable—e.g., mean(), median(), min(), max(), and sd(). Remember to include na.rm = T if the variable includes NA values (though also beware of the biases missing data may introduce). Here are some examples, using the demtherm variable (a feeling thermometer for the democratic party).

# First recode the demtherm variable (values 98 and 99 indicate missing data)
anes$demtherm <- ifelse(anes$demtherm %in% 98:99, NA, identity(anes$demtherm))

# Find a few different descriptive statistics
mean(anes$demtherm, na.rm = T)
## [1] 60.34
median(anes$demtherm, na.rm = T)
## [1] 60
sd(anes$demtherm, na.rm = T)
## [1] 23.07

summary() provides many of these statistics in one command:

summary(anes$demtherm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      50      60      60      75      97   24599


To find specific percentiles of a numeric variable, use quantile():

quantile(anes$demtherm, probs = c(0.1, 0.5, 0.9), na.rm = T)        #10th, 50th, and 90th percentiles 
## 10% 50% 90% 
##  30  60  95
quantile(anes$demtherm, probs = seq(0, 1, by = 0.05), na.rm = T)    #percentiles in increments of 5
##   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70% 
##    0   15   30   40   40   50   50   50   50   60   60   60   70   70   70 
##  75%  80%  85%  90%  95% 100% 
##   75   85   85   95   97   97

Specify the percentiles you want in a vector using the option probs in the function.

We can quickly get a sense for the distribution of a continuous variable by plotting a histogram of it (though use ggplot for your papers):

hist(anes$demtherm)

plot of chunk unnamed-chunk-11

3.2.1 Dummy variables

Dummy variables are a special case of numeric, categorical variables. Because they are so useful, it is worth going over what taking the mean of a dummy variable means. For example, interpret the mean of the abortion dummy variable that we created above:

anes$abortionD <- anes$abortion                               
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)              
anes$abortionD <- as.numeric(as.character(anes$abortionD))  
mean(anes$abortionD, na.rm = T)
## [1] 0.4012

3.3 Exercises

  1. Find the distribution of the interest variable (a categorical variable measuring interest in politics). More specifically, find the number and proportion of observations for each category of the variable.
  2. Find the following descriptive statistics of a numeric variable of your choice (though not demtherm or abortionD, which we examined above): mean, median, standard deviation, min and max value, and the 5th and 95th percentiles.

4 Bivariate Descriptive Statistics

Examining the relationship between two (or more) variables is usually more interesting. Basic bivariate descriptive statistics come in the form of conditional means, crosstabs, and graphing. (Graphing is covered in a different section below.) What type of descriptive statistics you provide depends on the nature of your variables.


4.1 Categorical IV and DV

Let’s examine the relationship between gender and party identification. First clean the gender variable:

levels(anes$gender) <- c(NA, "Male", "Female")
table(anes$gender)
## 
##   Male Female 
##  22017  27640

Here’s a simple crosstab:

table(anes$gender, anes$partyid3)      
##         
##          Democrat Independent Republican
##   Male      10803        2596       8039
##   Female    14556        3021       9115
with(anes, table(gender, partyid3))    #same as above
##         partyid3
## gender   Democrat Independent Republican
##   Male      10803        2596       8039
##   Female    14556        3021       9115

prop.table() with the correct margin specification is more interesting:

t <- with(anes, table(gender, partyid3))
prop.table(t, margin = 1)
##         partyid3
## gender   Democrat Independent Republican
##   Male     0.5039      0.1211     0.3750
##   Female   0.5453      0.1132     0.3415

Compare with:

prop.table(t)
##         partyid3
## gender   Democrat Independent Republican
##   Male    0.22445     0.05394    0.16703
##   Female  0.30243     0.06277    0.18938
prop.table(t, margin = 2)
##         partyid3
## gender   Democrat Independent Republican
##   Male     0.4260      0.4622     0.4686
##   Female   0.5740      0.5378     0.5314

What is the difference between prop.table(t), prop.table(t, margin = 1), and prop.table(t, margin = 2)? Make sure you can interpret the output.


An alternative way to find bivariate relationships when you have two categorical variables is to recode your proposed dependent variable to a dummy variable and use ddply() from package plyr. (I personally use this approach much more frequently than prop.table().)

anes$democrat <- anes$partyid3
levels(anes$democrat) <- c(1, 0, 0)
anes$democrat <- as.numeric(as.character(anes$democrat))

ddply(anes, .(gender), summarise, mean = mean(democrat, na.rm = T))
##   gender   mean
## 1   Male 0.5039
## 2 Female 0.5453
## 3   <NA>    NaN

In the code above I first created a dummy variable called democrat that equals 1 if the respondent (weakly) identifies with the democratic party and 0 otherwise. I then found the mean of this variable by gender using ddply().

Notes about ddply, a very useful function:

  • It takes four (or more) arguments.
  • The first argument is the data frame, in this case anes.
  • The second argument specifies the independent variable, or the variable we want to split the data by.
  • The third argument is always summarise when we want to find a summary statistic.
  • The fourth argument takes a function, in this case mean(). We take the mean of the dependent variable.

We can specify more than one function in the last argument:

ddply(anes, .(gender), summarise,
         mean = mean(democrat, na.rm = T),
         median = median(democrat, na.rm = T))
##   gender   mean median
## 1   Male 0.5039      1
## 2 Female 0.5453      1
## 3   <NA>    NaN     NA

We can also find the mean by more than one variable. For example, we can find mean democratic identification by gender and year (say for years 1970-2000 to reduce the output):

anes_sub <- subset(anes, year %in% 1970:1990)
ddply(anes_sub, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
##    year gender   mean
## 1  1970   Male 0.5437
## 2  1970 Female 0.5447
## 3  1972   Male 0.4845
## 4  1972 Female 0.5336
## 5  1974   Male 0.4970
## 6  1974 Female 0.5278
## 7  1976   Male 0.4941
## 8  1976 Female 0.5230
## 9  1978   Male 0.5215
## 10 1978 Female 0.5457
## 11 1980   Male 0.4870
## 12 1980 Female 0.5479
## 13 1982   Male 0.5032
## 14 1982 Female 0.5912
## 15 1984   Male 0.4549
## 16 1984 Female 0.4960
## 17 1986   Male 0.4730
## 18 1986 Female 0.5288
## 19 1988   Male 0.4263
## 20 1988 Female 0.5043
## 21 1990   Male 0.4966
## 22 1990 Female 0.5322

4.2 Categorical IV, Continuous DV

When we have a categorical independent variable and a continuous dependent variable, finding conditional means using ddply() again is useful. Take for example the relationship between income and the democratic feeling thermometer:

ddply(anes, .(income), summarise, mean = mean(demtherm, na.rm = T),
                                  median = median(demtherm, na.rm = T))
##                                    income  mean median
## 1 0. DK; NA; refused to answer; no Pre IW 60.17     60
## 2                   1. 0 to 16 percentile 67.67     70
## 3                  2. 17 to 33 percentile 64.58     65
## 4                  3. 34 to 67 percentile 59.88     60
## 5                  4. 68 to 95 percentile 55.92     60
## 6                 5. 96 to 100 percentile 50.84     50
## 7                                    <NA>   NaN     NA

4.3 Continuous IV and DV

The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below). As a complement, you may want to find the Pearson correlation between the two variables. Let’s find the correlation between age and demtherm (after fixing age):

sort(unique(anes$age))            #age variable has 0-year olds 
##  [1]  0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [24] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [47] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [70] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
anes$age[anes$age == 0] <- NA     #get rid of 0-year olds 

# Pearson correlation 
cor(anes$age, anes$demtherm, use = "complete.obs")
## [1] 0.08352

4.4 Exercises

  1. Create a crosstab of south and the 3-point party ID variable, separately for 1952 and 2008. (Recode the 7-point party ID variable if you haven’t already.) The crosstab should be able to tell you to what extent people in the south are Democrat, Independent, and Republican (and likewise for the non-south). How did this change between 1952 and 2008?

  2. Now create a Democrat dummy variable from the party ID variable. The variable should equal 1 if the respondent (weakly) identifies with the Democratic party and 0 if the respondent is Republican or (purely) Independent. Find the mean of this variable for people in the south and non-south using ddply(), again for years 1952 and 2008. Compare with the results from Exercise 1.

  3. Use ddply() to find mean Democratic identification, by South and year—that is, by whether the respondent lives in the south for every year in the data set. Use the dummy variable from Exercise 2.

  4. Use ddply() to find both mean Democratic identification and the standard deviation of this variable, by South and year. Use ddply() only once.


5 Graphing with ggplot

Let’s get started with ggplot2, a very powerful graphic package for R. What follows is not a comprehensive survey of every possible graphing technique with ggplot2—rather it provides three examples aimed to get you started using the package. See the end of this section for more resources.


5.1 Mean Democratic Identification, by Gender and Year

Say we wanted to plot the proportion of ANES respondents who say they identify with the Democratic party, separately for men and women and for each year in the dataset.

First, let’s use ddply(), where the variables to split the data by are year and gender, and where the democrat dummy variable we created above is the dependent variable of interest:

t <- ddply(anes, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
tail(t)
##    year gender   mean
## 55 2002   Male 0.4501
## 56 2002 Female 0.5025
## 57 2004   Male 0.4462
## 58 2004 Female 0.5369
## 59 2008   Male 0.5520
## 60 2008 Female 0.6336

Now create a very basic plot with year on the x-axis, proportion who identify with the democratic party on the y-axis, and a color scheme that differentiates women from men:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() 

plot of chunk unnamed-chunk-23

The code above consists of two functions added together using +:

  1. ggplot() specifies
    • the data frame we want to use for plotting (in this case t),
    • the aesthetic attributes of the plot (in this case x, y, and color) specified inside aes(). Note that these attributes reference different column names in t.
  2. geom_point() tells R to add points.

The plot can be improved by adding lines:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() +
    geom_line()

plot of chunk unnamed-chunk-24

Change x- and y-labels and add a black and white theme:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() +
    geom_line() +
    xlab("") +
    ylab("Proportion Identifying as Democrat") +
    theme_bw()