1 More on For Loops

In Tutorial 2 we learned to run for loops of the following type:

v1 <- c(175, 182, 150, 187, 165)
v2 <- rep(NA, 5)
for(i in 1:5) {
    v2[i] <- v1[i] / 100
}
v2                         
## [1] 1.75 1.82 1.50 1.87 1.65

Here, we stored the output into a new vector v2. I’ve included examples of more involved for loops with vector outputs in the exercises below.

While storing the output of for loops in vectors works for many purposes, sometimes we want to use lists instead of vectors. Lists are more flexible than vectors in two ways:

  1. List elements can store almost any type of object: scalars, strings, vectors, data frames, graphs, and other lists.
  2. You don’t need to specify how many elements you want the list to have before you run the for loop.

In the examples below, I use ANES data with only three variables, available here.

dat <- read.csv("nesexample2.csv")     #(after setting my working directory)
head(dat)
##   therm.unions therm.dems year
## 1           97         97 1980
## 2           85         60 1980
## 3           NA         85 1980
## 4           NA         70 1980
## 5           NA         60 1980
## 6           60         85 1980

Say we wanted to create a separate data frame for every year in this dataset. First find the years in the dataset (sorted) and store them into a vector:

years <- sort(unique(dat$year))
years
##  [1] 1980 1984 1986 1988 1990 1992 1994 1996 1998 2000 2004 2008

We can then run a loop that stores a data frame for each year in a list:

l <- list()          #create an empty list
n <- length(years)   #define number of iterations of loop
for(i in 1:n){
    l[[i]] <- subset(dat, year == years[i])
}

Many things here are similar to how we run for loops with vector outputs. Notice two differences:

  1. Most importantly, we have to use double brackets ([[ ]]) to call a specific element of a list. With vectors, we use single brackets ([ ]).
  2. When we create the empty list, we don’t have to store the number of elements we want the list to contain—it’s sufficient to say l <- list(). (Of course, you can call the list something other than l.)

Let’s look at the first few rows of a few elements in l:

head(l[[1]])
##   therm.unions therm.dems year
## 1           97         97 1980
## 2           85         60 1980
## 3           NA         85 1980
## 4           NA         70 1980
## 5           NA         60 1980
## 6           60         85 1980
head(l[[5]])
##      therm.unions therm.dems year
## 7235           60         40 1990
## 7236           NA         85 1990
## 7237           97         70 1990
## 7238           85         70 1990
## 7239           NA         50 1990
## 7240           97         50 1990
head(l[[12]])
##       therm.unions therm.dems year
## 18661           15         40 2008
## 18662           40         30 2008
## 18663           50         85 2008
## 18664           40         30 2008
## 18665           40         50 2008
## 18666           15         30 2008

The function lapply() conveniently lets us operate on each element of the list. For example, we can get the dimensions of each data frame in l like this:

lapply(l, dim)     
## [[1]]
## [1] 1386    3
##
## [[2]]
## [1] 1917    3
##
## [[3]]
## [1] 2163    3
##
## [[4]]
## [1] 1768    3
##
## [[5]]
## [1] 1965    3
##
## [[6]]
## [1] 2246    3
##
## [[7]]
## [1] 1791    3
##
## [[8]]
## [1] 1533    3
##
## [[9]]
## [1] 1278    3
##
## [[10]]
## [1] 1552    3
##
## [[11]]
## [1] 1061    3
##
## [[12]]
## [1] 2092    3

1.1 Exercises

  1. Using a for loop, create a new variable in dat called demtherm_scaled that is a transformation of therm.dems—it should range between 0 and 0.97 instead of 0 and 97. As a reminder to yourself, also use a vectorized function (no for loop) to create this variable. Using vectorized functions when you can is preferred; the for loop is just for practice. (No need to use a list here.)

  2. Use a for loop to grab every other element of therm.unions in dat. Store the output into a new vector called thu2. That is, if the first 6 elements of therm.unions are 97 85 NA NA NA 60, then the first 3 elements of thu2 should be 97 NA NA. Output the first 6 elements of thu2 and confirm that it has half the number of elements as dat$therm.unions.

  3. Using a for loop, store (in a list) one scatterplot of therm.unions versus therm.dems for each year in dat. Each scatter plot should include a LOESS smoother. Graph the first, fourth, and tenth element of the list. (In your write-up, include these graphs.)


2 Sample Means and Coverage Probability

Let’s begin by a class exercise based on Tuesday’s lecture. The exercise is meant to illustrate how well we do when we estimate a population mean with a sample mean. Let’s first introduce some notation.

Let \(\mu\) represent the population mean and \(\bar{x}\) the sample mean of a variable \(X\). We calculate the sample mean using

\(\frac{1}{n} \displaystyle \sum_{i=1}^n x_i\)

where \(x_i\) is each value in variable we sampled (that is, \(X\)).

We calculate confidence intervals around the sample mean using

\(\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\)

where \(s\) is the standard deviation of the sampled variable of interest, \(n\) is the number of observations in the sample, and \(z\) is the z-score associated with the area under the standard-normal distribution we want to use for a given level of confidence. The standard deviation is calculated using

\(s = \sqrt{\frac{\sum^n_{n=1}(x_i - \bar{x})^2}{n}}\)

The z-score depends on what level of confidence we want. The standard level of confidence in the social sciences is 95%. We can find the z-score for this level of confidence in R:

qnorm(0.975)
## [1] 1.96

Note that we have to input the proportion of the standard-normal that we want to cover on either side of the mean. For 95% confidence, we use 0.975 in qnorm(). For 90% confidence we would use:

qnorm(0.95)
## [1] 1.645

Note that \(\frac{s}{\sqrt{n}}\) above is usually referred to as the standard error of the sample mean.


2.1 Class Exercise

We’ll use the same ANES data as above.

dat <- read.csv("nesexample2.csv")

Let’s work with the democratic feeling thermometer.

summary(dat$therm.dems)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    50.0    60.0    59.8    75.0    97.0     707
thd <- na.omit(dat$therm.dems)     #omit NA's in and store variable in vector
length(thd)                        #20,045 valid answers 
## [1] 20045

To illustrate how well we do when we estimate a population mean with a sample mean, we will pretend that the mean of thd is the population mean. This is clearly not true and only used for illustrative purposes.

mean(thd)
## [1] 59.84

Now let’s draw a random sample of 900 from this variable, and calculate the sample mean and confidence interval associated with the sample mean. For illustrative purposes, we’ll construct 75% confidence intervals.

n <- 900

# Generate a sample and calculate the sample mean
smpl <- sample(thd, n)
mean(smpl)                    
## [1] 60.87
# Generate 75% confidence intervals
std.err <- function(x) sd(x) / sqrt(length(x))
z <- qnorm(0.875)                        #find z-score
mean(smpl) - z * std.err(smpl)           #lower CI
## [1] 59.96
mean(smpl) + z * std.err(smpl)           #upper CI
## [1] 61.78



  1. If each of you drew a different random sample, estimated the mean, and calculated 75% confidence intervals, how many of you would have confidence intervals that cover the population mean (59.8379)? Let’s find out what happens in practice!

  2. (Exercise to complete on your own) Now run a Monte Carlo simulation repeating the process we went through above 10,000 times. Use 95% confidence intervals. Calculate the proportion of confidence intervals that cover the population mean.


3 Writing Functions

Functions come in handy for both data management and analysis. In addition, using functions will make your code cleaner. For example, above we created a function called std.err that calculates the standard error of a sample mean, which we then used to generate confidence intervals.

thd <- na.omit(dat$therm.dems)
n <- 900
smpl <- sample(thd, n)

std.err <- function(x) sd(x) / sqrt(length(x))
z <- qnorm(0.875)                        #find z-score
mean(smpl) - z * std.err(smpl)           #lower CI
## [1] 59.5
mean(smpl) + z * std.err(smpl)           #upper CI
## [1] 61.3

The last two lines equivalently could have been written:

mean(smpl) - z * (sd(smpl) / sqrt(length(smpl)))         
## [1] 59.5
mean(smpl) + z * (sd(smpl) / sqrt(length(smpl)))         
## [1] 61.3

Although the real gains come when you have to execute more complex operations, the advantage of using a function is clear even here. The last two lines are longer and harder to interpret than when we used the std.err function.

Further, the nice thing about this function is that it works for any vector that we feed the function, provided that the vector doesn’t have missing values and that it has more than one observation. To make the function work for missing values we could specify it as follows:

std.err2 <- function(x) sd(x, na.rm = T) / sqrt(na.omit(length(x)))
std.err2(dat$therm.dems)   
## [1] 0.1623
# Compare with original function from above:
std.err <- function(x) sd(x) / sqrt(length(x))
std.err(dat$therm.dems)                         #no meaningful output for vector with NAs
## [1] NA
std.err(na.omit(dat$therm.dems))                #this works
## [1] 0.1651

To write a function in R:

  1. Start by thinking about what you want the input(s) and output(s) of the function to be. In the std.err and std.err2 functions, the input is a vector x and the output is a scalar (the standard error of the sample mean).

  2. Specify the input in general terms. Above we called it x. We didn’t call it something specific, like dat$demtherm. That’s because we want the function to work across different vectors, not just dat$demtherm.

  3. The inputs go in the parentheses in function() (e.g., function(x)). We then specify what we want to do to the inputs. We can have more than one input. For example, here’s a function that estimates the covariance between two variables (which must have the same number of observations):

covar <- function(x, y) {
    df <- na.omit(data.frame(x, y))
    X <- df[, 1]
    Y <- df[, 2]
    covariance <- (1 / (length(X) - 1)) * sum((X - mean(X)) * (Y - mean(Y)))
    return(covariance)
}

This example also illustrates that a function can have more than one line, and that, if it does, we have to specify what we want the function to return. Also note that I’m defining new objects in the function, for example df. These objects only exist within the function, and are not saved to the workspace in R.

Let’s double check that the function works, comparing the output with the output from R’s built-in function (something we can’t always do):

covar(dat$therm.unions, dat$therm.dems)                   #our user-written function
## [1] 207.8
cov(dat$therm.unions, dat$therm.dems, use = "complete")   #R's built in function
## [1] 207.8

3.1 Exercises

  1. Write your own function for finding the mean of any variable. Call it mean2. The function should work even if the variable has missing values (NA’s). Confirm that your function works by applying it to therm.unions in dat. Compare the output of your function to the output of Rs mean() function.

  2. Write your own function for finding the standard deviation of any variable. Call it sd2. The function should work even if the variable has missing values (NA’s). Confirm that your function works by applying it to therm.unions in dat. Compare the output of your function to the output of Rs sd() function.



The code used in this tutorial is available here.