In this tutorial we’ll use the ANES data we used in Tutorial 2. Read in the file and clean it like we did in the Preliminaries section of that tutorial.

You should then have:

dim(anes)
## [1] 49760    12

Also make sure you load the following packages:

libs <- c("plyr", "dplyr", "ggplot2", "arm")
sapply(libs, require, character.only = TRUE)
##    plyr   dplyr ggplot2     arm
##    TRUE    TRUE    TRUE    TRUE

The only package we haven’t used before is arm. It allows us to display OLS output more cleanly and to extract standard errors.


1 Interpreting and Graphing OLS results

Make note of two graphing approaches of OLS results below:

  1. Graphing the coefficient estimates themselves with confidence intervals
  2. Graphing predicted/fitted values with confidence intervals

1.1 Bivariate Model

Let’s start by a simple model that predicts democratic feeling ratings given the respondent’s gender.

levels(anes$gender) <- c(NA, "Male", "Female")

lm1 <- lm(demtherm ~ gender, data = anes)

require(arm)    #for display function
display(lm1)
## lm(formula = demtherm ~ gender, data = anes)
##              coef.est coef.se
## (Intercept)  58.99     0.22
## genderFemale  5.65     0.29
## ---
## n = 26405, k = 2
## residual sd = 23.74, R-Squared = 0.01

How do we interpret the output? Compare the output with:

ddply(anes, .(gender), summarise, mean(demtherm, na.rm = T))
##   gender mean(demtherm, na.rm = T)
## 1   Male                     58.99
## 2 Female                     64.64
## 3   <NA>                       NaN

Does this make sense?

Graphing can aid in the interpretation of OLS estimates. The benefits to graphing aren’t obvious in this simple model, but let’s use it for illustrative purposes. We’ll use two approaches.

1.1.1 Graphing Approach 1

We can extract the coefficients from the model using coef() and the standard errors using se.coef(). Using these we can calculate confidence intervals and graph the results. We can also use confint() to find the confidence intervals directly.

ci1 <- coef(lm1)[2] + c(-1, 1) * se.coef(lm1)[2] * 1.96   #or equivalently:
ci1 <- confint(lm1, level = 0.95)[2, ]

est1 <- data.frame(est = coef(lm1)[2],
                   lb = ci1[1],
                   ub = ci1[2],
                   model = "Model 1")

ggplot(est1, aes(x = model, y = est)) +
    geom_point() +
    geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
    geom_hline(yintercept = 0, lty = 2, color = "red") +
    xlab("") +
    ylab("Female Democrat Thermometer Estimate (Relative to Males)")

plot of chunk unnamed-chunk-6

1.1.2 Graphing Approach 2

Here’s a different approach using R’s predict() function. Before you use this command, however, ask yourself the following:

  • Given the estimates from the model, what is the “predicted” feeling rating among women?
  • Given the estimates from the model, what is the “predicted” feeling rating among men?
  • What is the predicted gap between women and men and how does that relate to the graph above?
pred1 <- predict(lm1,
                 newdata = data.frame(gender = c("Male", "Female")),
                 se.fit = T,
                 interval = "confidence")

# The result is a list with estimates, 95% CIs, and other info:
pred1
## $fit
##     fit   lwr   upr
## 1 58.99 58.56 59.42
## 2 64.64 64.25 65.02
##
## $se.fit
##      1      2
## 0.2190 0.1961
##
## $df
## [1] 26403
##
## $residual.scale
## [1] 23.74
# Extract part of the list with estimates and confidence interval, and add gender info:
pred1 <- data.frame(pred1$fit, gender = c("Male", "Female"))

ggplot(pred1, aes(x = gender, y = fit, color = gender)) +
    geom_point() +
    geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
    xlab("") +
    ylab("Predicted Democrat Thermometer")

plot of chunk unnamed-chunk-7

The newdata argument of the predict() function lets you supply the values for which you want a prediction. You have to use the same variable name(s) as those of the variable(s) included in lm().


1.2 Multivariate model

Let’s add income and age to the model. Income in particular may be a confounder here, since women have lower income and income is negatively correlated with support for the democratic party.

levels(anes$income)[1] <- NA
levels(anes$income) <- with(anes, substr(levels(income), 4, nchar(levels(income))))
anes$age[anes$age == 0] <- NA

lm2 <- lm(demtherm ~ gender + age + income, data = anes)
display(lm2)
## lm(formula = demtherm ~ gender + age + income, data = anes)
##                            coef.est coef.se
## (Intercept)                 64.00     0.59
## genderFemale                 3.89     0.30
## age                          0.08     0.01
## income17 to 33 percentile   -3.99     0.51
## income34 to 67 percentile   -8.50     0.45
## income68 to 95 percentile  -12.48     0.47
## income96 to 100 percentile -17.94     0.74
## ---
## n = 23686, k = 7
## residual sd = 22.92, R-Squared = 0.06

** Exercise**: Interpret each of the coefficient estimates from this model.

1.2.1 Graphing Approach 1

Again, we can graph the estimate on female. For comparison, let’s put it next to the estimate from the bivariate model above.

ci2 <- confint(lm2, level = 0.95)[2, ]

est2 <- data.frame(est = coef(lm2)[2],
                   lb = ci2[1],
                   ub = ci2[2],
                   model = "Model 2")

est <- rbind(est1, est2)   #combine with estimates from first model

ggplot(est, aes(x = model, y = est)) +
    geom_point() +
    geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
    geom_hline(yintercept = 0, lty = 2, color = "red") +
    xlab("") +
    ylab("Female Democrat Thermometer Estimate (Relative to Males)")

plot of chunk unnamed-chunk-9

1.2.2 Graphing Approach 2

We can also graph the predicted values for males and females for “average” values of the covariates.

Mean age is 46 years:

mean(anes$age, na.rm = T) 
## [1] 45.84

The most common income category is “34 to 67 percentile”:

prop.table(table(anes$income))
##
##   0 to 16 percentile  17 to 33 percentile  34 to 67 percentile
##              0.17108              0.16785              0.32710
##  68 to 95 percentile 96 to 100 percentile
##              0.27925              0.05473

So let’s use these (constant) values for each of our covariates, while varying gender.


** Exercise**: Before you use the predict() function, calculate by hand the predicted values for males and females at the average values of age and income.


Now let’s graph the predicted values:

newdta <- data.frame(gender = c("Male", "Female"),
                     age = rep(mean(anes$age, na.rm = T), 2),
                     income = rep("34 to 67 percentile", 2))

pred2 <- predict(lm2,
                 newdata = newdta,
                 se.fit = T,
                 interval = "confidence")

pred2 <- data.frame(pred2$fit, gender = c("Male", "Female"))

ggplot(pred2, aes(x = gender, y = fit, color = gender)) +
    geom_point() +
    geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
    xlab("\nNote: Estimates condition on income and age") +
    ylab("Predicted Democrat Thermometer")

plot of chunk unnamed-chunk-12


1.3 Relationship over time

The previous models group respondents from all ANES surveys starting in 1978. (The democratic rating was not asked before 1978, and lm() automatically excludes missing observations.)

Let’s examine the relationship between gender and the democratic rating over time, while still conditioning on age and income.

First run a separate regression for each year starting in 1978. I use dlply() here: it takes a data frame, splits it by year, and returns a list with the model estimates for each year:

anes_sub <- subset(anes, year >= 1978 & year != 2002)

lm3 <- dlply(anes_sub, .(year), function(x) {
    lm(demtherm ~ gender + age + income, data = x)
})

display(lm3$"1990")   #estimates for year 1990
## lm(formula = demtherm ~ gender + age + income, data = x)
##                            coef.est coef.se
## (Intercept)                 61.10     1.98
## genderFemale                 4.54     1.05
## age                          0.11     0.03
## income17 to 33 percentile   -1.16     1.76
## income34 to 67 percentile   -4.33     1.55
## income68 to 95 percentile   -8.66     1.56
## income96 to 100 percentile -20.20     2.73
## ---
## n = 1792, k = 7
## residual sd = 21.68, R-Squared = 0.07

Now grab the estimate on gender as well as the associated 95% confidence interval for each year. Now I use ldply: it takes a list and returns a data frame:

est_time <- ldply(lm3, function(x) {
    c(coef(x)[2], confint(x)[2, ])
})

names(est_time) <- c("year", "est", "lb", "ub")
est_time
##    year   est      lb    ub
## 1  1978 3.485  1.5872 5.382
## 2  1980 1.668 -0.7725 4.108
## 3  1982 3.407  1.0232 5.792
## 4  1984 2.177  0.2555 4.098
## 5  1986 3.763  1.7877 5.739
## 6  1988 4.140  1.9031 6.376
## 7  1990 4.544  2.4844 6.603
## 8  1992 4.107  2.2672 5.947
## 9  1994 3.367  1.1190 5.615
## 10 1996 4.329  1.9330 6.725
## 11 1998 5.192  2.5626 7.822
## 12 2000 4.728  2.2240 7.231
## 13 2004 3.246  0.3541 6.138
## 14 2008 5.180  3.0823 7.278

Now graph these estimates:

ggplot(est_time, aes(x = year, y = est)) +
    geom_point() +
    geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.3) +
    geom_hline(yintercept = 0, lty = 2, color = "red") +
    xlab("") +
    ylab("Female Democrat Thermometer Estimate (Relative to Males)")

plot of chunk unnamed-chunk-15

Again, we can graph the predicted values instead:

pred_time <- ldply(lm3, function(x) {
    predict(x, newdata = newdta, se.fit = T, interval = "confidence")$fit
})

pred_time$gender <- rep(c("Male", "Female"), length(pred_time)/2)


ggplot(pred_time, aes(x = year, y = fit, color = gender)) +
    geom_point() +
    geom_line() +
    geom_ribbon(aes(ymax = upr, ymin = lwr, fill = gender),
                    alpha = 0.15, color = NA) +
    xlab("\nNote: Income and age held at average values") +
    ylab("Predicted Democrat Thermometer")

plot of chunk unnamed-chunk-16


1.4 Interactions

Graphing can dramatically improve the interpretability of interaction terms in OLS models. Say, for example, we interacted gender and age to study whether democratic rating by gender changes with age:

lm4 <- lm(demtherm ~ gender + age + income + age:gender, data = anes)
display(lm4)
## lm(formula = demtherm ~ gender + age + income + age:gender, data = anes)
##                            coef.est coef.se
## (Intercept)                 62.17     0.73
## genderFemale                 7.26     0.84
## age                          0.12     0.01
## income17 to 33 percentile   -4.05     0.51
## income34 to 67 percentile   -8.59     0.45
## income68 to 95 percentile  -12.59     0.47
## income96 to 100 percentile -18.07     0.74
## genderFemale:age            -0.08     0.02
## ---
## n = 23686, k = 8
## residual sd = 22.91, R-Squared = 0.06

Try interpreting the interaction term here?

Graphing makes this simpler:

agerange <- 18:85
newdta <- data.frame(expand.grid(gender = c("Male", "Female"),
                                 age = agerange),
                     income = "34 to 67 percentile")

pred4 <- predict(lm4,
                 newdata = newdta,
                 se.fit = T,
                 interval = "confidence")

pred4 <- data.frame(cbind(pred4$fit, newdta))

head(pred4)
##     fit   lwr   upr gender age              income
## 1 55.80 54.90 56.69   Male  18 34 to 67 percentile
## 2 61.70 60.88 62.52 Female  18 34 to 67 percentile
## 3 55.92 55.04 56.80   Male  19 34 to 67 percentile
## 4 61.75 60.94 62.55 Female  19 34 to 67 percentile
## 5 56.04 55.19 56.90   Male  20 34 to 67 percentile
## 6 61.79 61.00 62.58 Female  20 34 to 67 percentile
ggplot(pred4, aes(x = age, y = fit, color = gender)) +
    geom_point() +
    geom_line(aes(y = lwr), lty = 3) +
    geom_line(aes(y = upr), lty = 3) +
    ylab("Democrat Thermometer Ratings") +
    xlab("Age") +
    theme_bw()

plot of chunk unnamed-chunk-18

We make a pretty strong linearity assumption on age (by gender) here. We can relax this by creating a factor variable that groups respondents into smaller age categories:

anes_sub <- subset(anes, age %in% 20:70)
anes_sub$age_cat <- cut(anes_sub$age, breaks = seq(20, 70, by = 5))

with(anes_sub, head(data.frame(age, age_cat), 10))
##    age age_cat
## 1   25 (20,25]
## 2   33 (30,35]
## 3   26 (25,30]
## 4   63 (60,65]
## 5   66 (65,70]
## 6   48 (45,50]
## 7   70 (65,70]
## 8   25 (20,25]
## 9   30 (25,30]
## 10  35 (30,35]
lm5 <- lm(demtherm ~ gender + income + age_cat + + age_cat:gender,
          data = anes_sub)

display(lm5)
## lm(formula = demtherm ~ gender + income + age_cat + +age_cat:gender,
##     data = anes_sub)
##                             coef.est coef.se
## (Intercept)                  64.70     0.81
## genderFemale                  3.96     0.97
## income17 to 33 percentile    -3.38     0.58
## income34 to 67 percentile    -8.42     0.51
## income68 to 95 percentile   -12.59     0.53
## income96 to 100 percentile  -18.53     0.79
## age_cat(25,30]               -0.60     0.98
## age_cat(30,35]                2.25     0.96
## age_cat(35,40]                2.04     0.97
## age_cat(40,45]                1.44     1.01
## age_cat(45,50]                3.66     1.06
## age_cat(50,55]                4.51     1.08
## age_cat(55,60]                6.03     1.11
## age_cat(60,65]                5.98     1.14
## age_cat(65,70]                4.93     1.23
## genderFemale:age_cat(25,30]   2.54     1.32
## genderFemale:age_cat(30,35]   0.06     1.29
## genderFemale:age_cat(35,40]   0.13     1.31
## genderFemale:age_cat(40,45]   2.51     1.38
## genderFemale:age_cat(45,50]   0.64     1.44
## genderFemale:age_cat(50,55]  -1.11     1.47
## genderFemale:age_cat(55,60]  -0.65     1.50
## genderFemale:age_cat(60,65]  -1.48     1.54
## genderFemale:age_cat(65,70]  -1.44     1.63
## ---
## n = 20426, k = 24
## residual sd = 22.71, R-Squared = 0.06

Let’s again graph the predicted values (holding income constant):

agerange <- sort(as.character(unique(na.omit(anes_sub$age_cat))))
newdta <- data.frame(expand.grid(gender = c("Male", "Female"),
                                 age_cat = agerange),
                     income = "34 to 67 percentile")

pred5 <- predict(lm5,
                 newdata = newdta,
                 se.fit = T,
                 interval = "confidence")

pred5 <- data.frame(cbind(pred5$fit, newdta))

ggplot(pred5, aes(x = age_cat, y = fit, color = gender)) +
    geom_point() +
    geom_line(aes(y = lwr, group = gender), lty = 3) +
    geom_line(aes(y = upr, group = gender), lty = 3) +
    geom_line(aes(y = fit, group = gender), lty = 1, alpha = 0.5) +
    ylab("Democrat Thermometer Ratings") +
    xlab("Age Category") +
    theme_bw()

plot of chunk unnamed-chunk-20




The code used in this tutorial is available here.