In this tutorial we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.

UPDATE: Many people have asked for the data used in this tutorial. To get the dataset used below (ecls.csv), please follow the instructions outlined here.

To examine the effect of going to Catholic school (“Treated”) versus public school (“Control”) on student achievement using matching we will go through the following steps:

  1. Estimate the propensity score (the probability of being Treated given a set of pre-treatment covariates).
  2. Examine the region of common support.
  3. Choose and execute a matching algorithm. In this tutorial we’ll use nearest neighbor propensity score matching.
  4. Examine covariate balance after matching.
  5. Estimate treatment effects.

In addition, before we implement a matching method, we’ll conduct the following analyses using the non-matched data:

Before we start, load a few packages and read in ecls.csv:

library(MatchIt)
library(dplyr)
library(ggplot2)

ecls <- read.csv("ecls.csv")

1 Pre-analysis using non-matched data

1.1 Difference-in-means: outcome variable

Here is some basic information about public and catholic school students in terms of math achievement. Note that we’re using students’ standardized math score (c5r2mtsc_std) – with a mean of 0 and standard deviation of 1 – as the outcome variable of interest. The independent variable of interest is catholic (1 = student went to catholic school; 0 = student went to public school).

ecls %>%
  group_by(catholic) %>%
  summarise(n_students = n(),
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n_students))
## # A tibble: 2 × 4
##   catholic n_students   mean_math  std_error
##      <int>      <int>       <dbl>      <dbl>
## 1        0       9568 -0.03059583 0.01038536
## 2        1       1510  0.19386817 0.02235282

Note that the outcome variable has been standardized (mean = 0, sd = 1). This is common in education research. The summary table above indicates that 3rd grade Catholic school students’ average math score is more than 20% of a standard deviation higher than that of public school students. This could have been calculated using the non-standardized outcome variable as follows:

ecls %>%
  mutate(test = (c5r2mtsc - mean(c5r2mtsc)) / sd(c5r2mtsc)) %>% #this is how the math score is standardized
  group_by(catholic) %>%
  summarise(mean_math = mean(test))
## # A tibble: 2 × 2
##   catholic   mean_math
##      <int>       <dbl>
## 1        0 -0.03059583
## 2        1  0.19386817

The difference-in-means is statistically significant at conventional levels of confidence (as is also evident from the small standard error above):

with(ecls, t.test(c5r2mtsc_std ~ catholic))
## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = -9.1069, df = 2214.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2727988 -0.1761292
## sample estimates:
## mean in group 0 mean in group 1 
##     -0.03059583      0.19386817

1.2 Difference-in-means: pre-treatment covariates

We’ll work with the following covariates for now:

  • race_white: Is the student white (1) or not (0)?
  • p5hmage: Mother’s age
  • w3income: Family income
  • p5numpla: Number of places the student has lived for at least 4 months
  • w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?

Let’s calculate the mean for each covariate by the treatment status:

ecls_cov <- c('race_white', 'p5hmage', 'w3income', 'p5numpla', 'w3momed_hsb')
ecls %>%
  group_by(catholic) %>%
  select(one_of(ecls_cov)) %>%
  summarise_all(funs(mean(., na.rm = T)))
## Adding missing grouping variables: `catholic`
## # A tibble: 2 × 6
##   catholic race_white  p5hmage w3income p5numpla w3momed_hsb
##      <int>      <dbl>    <dbl>    <dbl>    <dbl>       <dbl>
## 1        0  0.5561246 37.56097 54889.16 1.132669   0.4640918
## 2        1  0.7251656 39.57516 82074.30 1.092701   0.2272069

What do you see? Take a moment to reflect on what these differences suggest for the relationship of interest (that between Catholic schooling and student achievement).

We can carry out t-tests to evaluate whether these means are statistically distinguishable:

lapply(ecls_cov, function(v) {
    t.test(ecls[, v] ~ ecls[, 'catholic'])
})
## [[1]]
## 
##  Welch Two Sample t-test
## 
## data:  ecls[, v] by ecls[, "catholic"]
## t = -13.453, df = 2143.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1936817 -0.1444003
## sample estimates:
## mean in group 0 mean in group 1 
##       0.5561246       0.7251656 
## 
## 
## [[2]]
## 
##  Welch Two Sample t-test
## 
## data:  ecls[, v] by ecls[, "catholic"]
## t = -12.665, df = 2186.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.326071 -1.702317
## sample estimates:
## mean in group 0 mean in group 1 
##        37.56097        39.57516 
## 
## 
## [[3]]
## 
##  Welch Two Sample t-test
## 
## data:  ecls[, v] by ecls[, "catholic"]
## t = -20.25, df = 1825.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29818.10 -24552.18
## sample estimates:
## mean in group 0 mean in group 1 
##        54889.16        82074.30 
## 
## 
## [[4]]
## 
##  Welch Two Sample t-test
## 
## data:  ecls[, v] by ecls[, "catholic"]
## t = 4.2458, df = 2233.7, p-value = 2.267e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02150833 0.05842896
## sample estimates:
## mean in group 0 mean in group 1 
##        1.132669        1.092701 
## 
## 
## [[5]]
## 
##  Welch Two Sample t-test
## 
## data:  ecls[, v] by ecls[, "catholic"]
## t = 18.855, df = 2107.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2122471 0.2615226
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4640918       0.2272069

If you don’t understand the code for the t-tests above, you can also use:

with(ecls, t.test(race_white ~ catholic))  #(repeat for each covariate)
## 
##  Welch Two Sample t-test
## 
## data:  race_white by catholic
## t = -13.453, df = 2143.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1936817 -0.1444003
## sample estimates:
## mean in group 0 mean in group 1 
##       0.5561246       0.7251656

2 Propensity score estimation

We estimate the propensity score by running a logit model (probit also works) where the outcome variable is a binary variable indicating treatment status. What covariates should you include? For the matching to give you a causal estimate in the end, you need to include any covariate that is related to both the treatment assignment and potential outcomes. I choose just a few covariates below—they are unlikely to capture all covariates that should be included. You’ll be asked to come up with a potentially better model on your own later.

ecls <- ecls %>% mutate(w3income_1k = w3income / 1000)
m_ps <- glm(catholic ~ race_white + w3income_1k + p5hmage + p5numpla + w3momed_hsb,
            family = binomial(), data = ecls)
summary(m_ps)
## 
## Call:
## glm(formula = catholic ~ race_white + w3income_1k + p5hmage + 
##     p5numpla + w3momed_hsb, family = binomial(), data = ecls)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1883  -0.6140  -0.4508  -0.3336   2.5659  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.2125519  0.2379826 -13.499  < 2e-16 ***
## race_white   0.3145014  0.0700895   4.487 7.22e-06 ***
## w3income_1k  0.0073038  0.0006495  11.245  < 2e-16 ***
## p5hmage      0.0292168  0.0050771   5.755 8.69e-09 ***
## p5numpla    -0.1439392  0.0912255  -1.578    0.115    
## w3momed_hsb -0.6935868  0.0743207  -9.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7701.3  on 9266  degrees of freedom
## Residual deviance: 7168.8  on 9261  degrees of freedom
##   (1811 observations deleted due to missingness)
## AIC: 7180.8
## 
## Number of Fisher Scoring iterations: 5

Using this model, we can now calculate the propensity score for each student. It is simply the student’s predicted probability of being Treated, given the estimates from the logit model. Below, I calculate this propensity score using predict() and create a dataframe that has the propensity score as well as the student’s actual treatment status.

prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
                     catholic = m_ps$model$catholic)
head(prs_df)
##    pr_score catholic
## 1 0.2292928        0
## 2 0.1801360        0
## 4 0.2092957        0
## 5 0.2154022        1
## 6 0.3604931        0
## 7 0.1080608        0

2.1 Examining the region of common support

After estimating the propensity score, it is useful to plot histograms of the estimated propensity scores by treatment status:

labs <- paste("Actual school type attended:", c("Catholic", "Public"))
prs_df %>%
  mutate(catholic = ifelse(catholic == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white") +
  facet_wrap(~catholic) +
  xlab("Probability of going to Catholic school") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


3 Executing a matching algorithm

A simple method for estimating the treatment effect of Catholic schooling is to restrict the sample to observations within the region of common support, and then to divide the sample within the region of common support into 5 quintiles, based on the estimated propensity score. Within each of these 5 quintiles, we can then estimate the mean difference in student achievement by treatment status. Rubin and others have argued that this is sufficient to eliminate 95% of the bias due to confounding of treatment status with a covariate.

However, most matching algorithms adopt slightly more complex methods. The method we use below is to find pairs of observations that have very similar propensity scores, but that differ in their treatment status. We use the package MatchIt for this. This package estimates the propensity score in the background and then matches observations based on the method of choice (“nearest” in this case).

ecls_nomiss <- ecls %>%  # MatchIt does not allow missing values
  select(c5r2mtsc_std, catholic, one_of(ecls_cov)) %>%
  na.omit()

mod_match <- matchit(catholic ~ race_white + w3income + p5hmage + p5numpla + w3momed_hsb,
                     method = "nearest", data = ecls_nomiss)

We can get some information about how successful the matching was using summary(mod_match) and plot(mod_match) (try this yourself).

To create a dataframe containing only the matched observations, use the match.data() function:

dta_m <- match.data(mod_match)
dim(dta_m)
## [1] 2704    9

Note that the final dataset is smaller than the original: it contains 2704 observations, meaning that 1352 pairs of treated and control observations were matched. Also note that the final dataset contains a variable called distance, which is the propensity score.


4 Examining covariate balance in the matched sample

We’ll do three things to assess covariate balance in the matched sample:

  1. visual inspection
  2. t-tests of difference-in-means
  3. computation of the average absolute standardized difference (“standardized imbalance”)

4.1 Visual inspection

It is useful to plot the mean of each covariate against the estimated propensity score, separately by treatment status. If matching is done well, the treatment and control groups will have (near) identical means of each covariate at each value of the propensity score.

Below is an example using the four covariates in our model. Here I use a loess smoother to estimate the mean of each covariate, by treatment status, at each value of the propensity score.

fn_bal <- function(dta, variable) {
  dta$variable <- dta[, variable]
  if (variable == 'w3income') dta$variable <- dta$variable / 10^3
  dta$catholic <- as.factor(dta$catholic)
  support <- c(min(dta$variable), max(dta$variable))
  ggplot(dta, aes(x = distance, y = variable, color = catholic)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}

library(gridExtra)
grid.arrange(
   fn_bal(dta_m, "w3income"),
   fn_bal(dta_m, "p5numpla") + theme(legend.position = "none"),
   fn_bal(dta_m, "p5hmage"),
   fn_bal(dta_m, "w3momed_hsb") + theme(legend.position = "none"),
   fn_bal(dta_m, "race_white"),
   nrow = 3, widths = c(1, 0.8)
)