In this tutorial weâ€™ll use the ANES cumulative file available here and on Coursework. Before we begin, letâ€™s take care of some preliminary tasks, including loading the packages we need and importing this file.

```
# Load packages (install before if haven't already)
libs <- c("foreign", "plyr", "dplyr", "ggplot2")
sapply(libs, require, character.only = TRUE)
```

```
## foreign plyr dplyr ggplot2
## TRUE TRUE TRUE TRUE
```

```
# Import the anescum.dta file
setwd("~/dropbox/155/tutorial2")
system("unzip ANEScumul_data_codebook.zip") #unzip (using batch command)
anes <- read.dta("anescum.dta", warn.missing.labels = F) #read file
system("rm -r anescum* __M*") #delete unzipped files (using batch command)
# Keep only some variables and rename them
# see http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
# for the combined select/rename function used here
anes <- anes %>%
select(year = VCF0004,
age = VCF0101,
gender = VCF0104,
race = VCF0106,
edu = VCF0140,
south = VCF0113,
income = VCF0114,
partyid = VCF0301,
interest = VCF0310,
govtrust = VCF0604,
abortion = VCF0838,
demtherm = VCF0218)
```

To learn more about the variables we kept, consult the ANES codebook included in the .zip file above.

You can learn more about using batch commands with `system()`

here. These commands may work slightly different in Windows.

Tutorial 1 focused heavily on data manipulation. It is worth spending a little more time on this. In particular, letâ€™s go over how to recode factor variables into numeric variables.

The abortion variable is a factor variable with 6 categories, 4 of which are meaningful for our purposes:

`class(anes$abortion)`

`## [1] "factor"`

`summary(anes$abortion)`

```
## 0. NA; no Post IW; version NEW (2008)
## 1594
## 1. By law, abortion should never be permitted
## 2757
## 2. The law should permit abortion only in case of rape,
## 6726
## 3. The law should permit abortion for reasons other than
## 3710
## 4. By law, a woman should always be able to obtain an
## 8841
## 9. DK; other
## 473
## NA's
## 25659
```

Say we want to create a dummy variable that equals 1 if the respondent believes women should always be able to obtain abortion, and 0 otherwise (coding categories 0 and 9 as NA). Here are three ways of doing this:

```
# Option 1 (using ifelse)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion == "1. By law, abortion should never be permitted" |
abortion == "2. The law should permit abortion only in case of rape," |
abortion == "3. The law should permit abortion for reasons other than", 0, NA)))
table(anes$abortionD)
```

```
##
## 0 1
## 13193 8841
```

```
# Option 2 (using ifelse and %in%)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion %in% c("1. By law, abortion should never be permitted",
"2. The law should permit abortion only in case of rape,",
"3. The law should permit abortion for reasons other than"), 0, NA)))
table(anes$abortionD)
```

```
##
## 0 1
## 13193 8841
```

```
# Option 3 (recode factor levels, then convert to numeric)
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(as.character(anes$abortionD)) #convert to numeric
table(anes$abortionD)
```

```
##
## 0 1
## 13193 8841
```

Options 1 and 2 both use `ifelse()`

, which turns out to be quite cumbersome here because of the long labels. Option 3 requires less code, and may therefore be preferred.

However, when using Option 3, **make sure you correctly convert the variable from factor to numeric**. If you want to convert numbers stored as factor levels in `R`

to numeric values, you need to first convert them to character, then convert them to numeric values. Above, this is done using `as.numeric(as.character(anes$abortionD))`

.

If we were to convert the variable from factor to numeric without the intermediate step of converting them to character the following happens:

```
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(anes$abortionD) #FAIL
table(anes$abortionD)
```

```
##
## 1 2
## 13193 8841
```

`R`

converts the numbers to â€˜1â€™ and â€˜2â€™ instead of â€˜0â€™ and â€˜1â€™. This has to do with how `R`

stores factor levels internally. If you want to convert a factor variable to numeric, **always remember to convert factors using as.numeric(as.character(var))** where

`var`

is your variable of interest.- Create a new variable called
`incomeD`

which recodes`income`

in the`anes`

data frame into a (numeric) dummy variable that equals 1 if the respondentâ€™s income is in the 68th percentile or above and 0 otherwise.

Letâ€™s now begin some basic data analysis. As part of this, it is important to get a sense for what variables look like separately (their univariate distributions). The univariate descriptive statistics you produce will depend on whether the variable is categorical or numeric.

Letâ€™s first recode the 7-point party ID variable to a 3-point party ID variable:

`table(anes$partyid)`

```
##
## 0. DK; NA; other; refused to answer; no Pre IW
## 968
## 1. Strong Democrat
## 9320
## 2. Weak Democrat
## 10390
## 3. Independent - Democrat
## 5649
## 4. Independent - Independent
## 5617
## 5. Independent - Republican
## 4772
## 6. Weak Republican
## 6790
## 7. Strong Republican
## 5592
```

```
anes$partyid3 <- anes$partyid
levels(anes$partyid3) <- c(NA, "Democrat", "Democrat", "Democrat",
"Independent", "Republican", "Republican",
"Republican")
table(anes$partyid3)
```

```
##
## Democrat Independent Republican
## 25359 5617 17154
```

As is already apparent, `table()`

is a useful function for getting the distribution of a categorical variable. To get the proportion of respondents in each category, you can add `prop.table()`

:

`prop.table(table(anes$partyid3))`

```
##
## Democrat Independent Republican
## 0.5269 0.1167 0.3564
```

Weâ€™ll use the package `ggplot2`

for most of our graphing (see below). It produces high-quality, customizable graphs for publication. However, if you want to quickly examine the distribution of a categorical variable, you can use `barplot()`

in combination with `table()`

and `prop.table()`

:

`barplot(table(anes$partyid3))`

`barplot(prop.table(table(anes$partyid3)))`

There are many commands that will help you learn about the distribution of a variableâ€”e.g., `mean()`

, `median()`

, `min()`

, `max()`

, and `sd()`

. Remember to include `na.rm = T`

if the variable includes `NA`

values (though also beware of the biases missing data may introduce). Here are some examples, using the `demtherm`

variable (a feeling thermometer for the democratic party).

```
# First recode the demtherm variable (values 98 and 99 indicate missing data)
anes$demtherm <- ifelse(anes$demtherm %in% 98:99, NA, identity(anes$demtherm))
# Find a few different descriptive statistics
mean(anes$demtherm, na.rm = T)
```

`## [1] 60.34`

`median(anes$demtherm, na.rm = T)`

`## [1] 60`

`sd(anes$demtherm, na.rm = T)`

`## [1] 23.07`

`summary()`

provides many of these statistics in one command:

`summary(anes$demtherm)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 50 60 60 75 97 24599
```

To find specific percentiles of a numeric variable, use `quantile()`

:

`quantile(anes$demtherm, probs = c(0.1, 0.5, 0.9), na.rm = T) #10th, 50th, and 90th percentiles `

```
## 10% 50% 90%
## 30 60 95
```

`quantile(anes$demtherm, probs = seq(0, 1, by = 0.05), na.rm = T) #percentiles in increments of 5`

```
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70%
## 0 15 30 40 40 50 50 50 50 60 60 60 70 70 70
## 75% 80% 85% 90% 95% 100%
## 75 85 85 95 97 97
```

Specify the percentiles you want in a vector using the option `probs`

in the function.

We can quickly get a sense for the distribution of a continuous variable by plotting a histogram of it (though use `ggplot`

for your papers):

`hist(anes$demtherm)`

Dummy variables are a special case of numeric, categorical variables. Because they are so useful, it is worth going over what taking the mean of a dummy variable means. For example, interpret the mean of the abortion dummy variable that we created above:

```
anes$abortionD <- anes$abortion
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)
anes$abortionD <- as.numeric(as.character(anes$abortionD))
mean(anes$abortionD, na.rm = T)
```

`## [1] 0.4012`

- Find the distribution of the
`interest`

variable (a categorical variable measuring interest in politics). More specifically, find the number and proportion of observations for each category of the variable. - Find the following descriptive statistics of a numeric variable of your choice (though not
`demtherm`

or`abortionD`

, which we examined above): mean, median, standard deviation, min and max value, and the 5th and 95th percentiles.

Examining the relationship between two (or more) variables is usually more interesting. Basic bivariate descriptive statistics come in the form of conditional means, crosstabs, and graphing. (Graphing is covered in a different section below.) What type of descriptive statistics you provide depends on the nature of your variables.

Letâ€™s examine the relationship between gender and party identification. First clean the gender variable:

```
levels(anes$gender) <- c(NA, "Male", "Female")
table(anes$gender)
```

```
##
## Male Female
## 22017 27640
```

Hereâ€™s a simple crosstab:

`table(anes$gender, anes$partyid3) `

```
##
## Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
```

`with(anes, table(gender, partyid3)) #same as above`

```
## partyid3
## gender Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
```

`prop.table()`

with the correct margin specification is more interesting:

```
t <- with(anes, table(gender, partyid3))
prop.table(t, margin = 1)
```

```
## partyid3
## gender Democrat Independent Republican
## Male 0.5039 0.1211 0.3750
## Female 0.5453 0.1132 0.3415
```

Compare with:

`prop.table(t)`

```
## partyid3
## gender Democrat Independent Republican
## Male 0.22445 0.05394 0.16703
## Female 0.30243 0.06277 0.18938
```

`prop.table(t, margin = 2)`

```
## partyid3
## gender Democrat Independent Republican
## Male 0.4260 0.4622 0.4686
## Female 0.5740 0.5378 0.5314
```

What is the difference between `prop.table(t)`

, `prop.table(t, margin = 1)`

, and `prop.table(t, margin = 2)`

? Make sure you can interpret the output.

An alternative way to find bivariate relationships when you have two categorical variables is to recode your proposed dependent variable to a dummy variable and use `ddply()`

from package `plyr`

. (I personally use this approach much more frequently than `prop.table()`

.)

```
anes$democrat <- anes$partyid3
levels(anes$democrat) <- c(1, 0, 0)
anes$democrat <- as.numeric(as.character(anes$democrat))
ddply(anes, .(gender), summarise, mean = mean(democrat, na.rm = T))
```

```
## gender mean
## 1 Male 0.5039
## 2 Female 0.5453
## 3 <NA> NaN
```

In the code above I first created a dummy variable called `democrat`

that equals 1 if the respondent (weakly) identifies with the democratic party and 0 otherwise. I then found the mean of this variable by gender using `ddply()`

.

Notes about `ddply`

, a very useful function:

- It takes four (or more) arguments.
- The first argument is the data frame, in this case
`anes`

. - The second argument specifies the independent variable, or the variable we want to split the data by.
- The third argument is always
`summarise`

when we want to find a summary statistic. - The fourth argument takes a function, in this case
`mean()`

. We take the mean of the dependent variable.

We can specify more than one function in the last argument:

```
ddply(anes, .(gender), summarise,
mean = mean(democrat, na.rm = T),
median = median(democrat, na.rm = T))
```

```
## gender mean median
## 1 Male 0.5039 1
## 2 Female 0.5453 1
## 3 <NA> NaN NA
```

We can also find the mean by more than one variable. For example, we can find mean democratic identification by gender *and* year (say for years 1970-2000 to reduce the output):

```
anes_sub <- subset(anes, year %in% 1970:1990)
ddply(anes_sub, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
```

```
## year gender mean
## 1 1970 Male 0.5437
## 2 1970 Female 0.5447
## 3 1972 Male 0.4845
## 4 1972 Female 0.5336
## 5 1974 Male 0.4970
## 6 1974 Female 0.5278
## 7 1976 Male 0.4941
## 8 1976 Female 0.5230
## 9 1978 Male 0.5215
## 10 1978 Female 0.5457
## 11 1980 Male 0.4870
## 12 1980 Female 0.5479
## 13 1982 Male 0.5032
## 14 1982 Female 0.5912
## 15 1984 Male 0.4549
## 16 1984 Female 0.4960
## 17 1986 Male 0.4730
## 18 1986 Female 0.5288
## 19 1988 Male 0.4263
## 20 1988 Female 0.5043
## 21 1990 Male 0.4966
## 22 1990 Female 0.5322
```

When we have a categorical independent variable and a continuous dependent variable, finding conditional means using `ddply()`

again is useful. Take for example the relationship between income and the democratic feeling thermometer:

```
ddply(anes, .(income), summarise, mean = mean(demtherm, na.rm = T),
median = median(demtherm, na.rm = T))
```

```
## income mean median
## 1 0. DK; NA; refused to answer; no Pre IW 60.17 60
## 2 1. 0 to 16 percentile 67.67 70
## 3 2. 17 to 33 percentile 64.58 65
## 4 3. 34 to 67 percentile 59.88 60
## 5 4. 68 to 95 percentile 55.92 60
## 6 5. 96 to 100 percentile 50.84 50
## 7 <NA> NaN NA
```

The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below). As a complement, you may want to find the Pearson correlation between the two variables. Letâ€™s find the correlation between `age`

and `demtherm`

(after fixing `age`

):

`sort(unique(anes$age)) #age variable has 0-year olds `

```
## [1] 0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [24] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [47] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [70] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
```

```
anes$age[anes$age == 0] <- NA #get rid of 0-year olds
# Pearson correlation
cor(anes$age, anes$demtherm, use = "complete.obs")
```

`## [1] 0.08352`

Create a crosstab of

`south`

and the 3-point party ID variable, separately for 1952 and 2008. (Recode the 7-point party ID variable if you havenâ€™t already.) The crosstab should be able to tell you to what extent people in the south are Democrat, Independent, and Republican (and likewise for the non-south). How did this change between 1952 and 2008?Now create a Democrat dummy variable from the party ID variable. The variable should equal 1 if the respondent (weakly) identifies with the Democratic party and 0 if the respondent is Republican or (purely) Independent. Find the mean of this variable for people in the south and non-south using

`ddply()`

, again for years 1952 and 2008. Compare with the results from Exercise 1.Use

`ddply()`

to find mean Democratic identification, by South and yearâ€”that is, by whether the respondent lives in the south for*every*year in the data set. Use the dummy variable from Exercise 2.Use

`ddply()`

to find both mean Democratic identification and the standard deviation of this variable, by South and year. Use`ddply()`

only once.

Letâ€™s get started with `ggplot2`

, a very powerful graphic package for `R`

. What follows is not a comprehensive survey of every possible graphing technique with `ggplot2`

â€”rather it provides three examples aimed to get you started using the package. See the end of this section for more resources.

Say we wanted to plot the proportion of ANES respondents who say they identify with the Democratic party, separately for men and women and for each year in the dataset.

First, letâ€™s use `ddply()`

, where the variables to split the data by are `year`

and `gender`

, and where the `democrat`

dummy variable we created above is the dependent variable of interest:

```
t <- ddply(anes, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
tail(t)
```

```
## year gender mean
## 55 2002 Male 0.4501
## 56 2002 Female 0.5025
## 57 2004 Male 0.4462
## 58 2004 Female 0.5369
## 59 2008 Male 0.5520
## 60 2008 Female 0.6336
```

Now create a very basic plot with year on the x-axis, proportion who identify with the democratic party on the y-axis, and a color scheme that differentiates women from men:

```
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point()
```

The code above consists of two functions added together using `+`

:

`ggplot()`

specifies- the data frame we want to use for plotting (in this case
`t`

), - the aesthetic attributes of the plot (in this case
`x`

,`y`

, and`color`

) specified inside`aes()`

. Note that these attributes reference different column names in`t`

.

- the data frame we want to use for plotting (in this case
`geom_point()`

tells`R`

to add points.

The plot can be improved by adding lines:

```
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line()
```

Change x- and y-labels and add a black and white theme:

```
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line() +
xlab("") +
ylab("Proportion Identifying as Democrat") +
theme_bw()
```