Motivation

To illustrate the value of functions, let’s start with a thought experiment: say R didn’t provide a function for finding the median of a numeric vector. (Of course this is not true: R has a built-in function called median() for this purpose.) In this annoying scenario, it would still be possible to find the median using a few lines of code.

# Create a numeric vector 
v <- c(2, 5, 8, 0, 10)

# Find the number of elements in v  
n <- length(v)

# Is n odd?
n %% 2  #use mod to find remainder after dividing by 2; if remainder is 1 --> odd

## [1] 1

# Cool, it's odd so let's find the mid-value after sorting v
v_sort <- sort(v)
v_sort[(n + 1) / 2] #this is the median (the middle position when n is odd)

## [1] 5

Ok, we found the median, but what a nightmare! Imagine if we had to go through these steps every time we wanted to find the median. Plus, the code above isn’t general enough to account for scenarios in which the numeric vector has an even number of elements. In scenarios like this, it’s therefore extremely useful to write a function. Here’s one way of doing so for finding the median:

median2 <- function(vec) {
    n <- length(vec)
    odd <- n %% 2 == 1
    vec_sort <- sort(vec)

    if(odd) {
        result <- vec_sort[(n + 1) / 2]
    } else {
        result <- (vec_sort[n / 2] + vec_sort[n / 2 + 1]) / 2
    }

    return(result)
}

Let’s test if it works on two vectors, one with an odd number of elements and the other with an even number:

v1 <- c(2, 5, 8, 0, 10)
median2(v1)

## [1] 5

v2 <- c(2, 5, 8, 0, 10, 12)
median2(v2)

## [1] 6.5

This motivating example shows that writing functions can save us many lines of code and help us avoid the mistakes that inevitably creep in when we rely too heavily on copying and pasting code.
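One way to gain confidence in a hand-written function like median2 is to compare it against the built-in median() on a handful of inputs. The sketch below restates median2 compactly so that the block runs on its own; the test vectors are made up for illustration.

```r
# A compact restatement of median2 so this block runs on its own
median2 <- function(vec) {
    n <- length(vec)
    vec_sort <- sort(vec)
    if (n %% 2 == 1) {
        result <- vec_sort[(n + 1) / 2]
    } else {
        result <- (vec_sort[n / 2] + vec_sort[n / 2 + 1]) / 2
    }
    return(result)
}

# Made-up test vectors: odd length, even length, and a single value
test_vectors <- list(c(2, 5, 8, 0, 10), c(2, 5, 8, 0, 10, 12), 7)

# TRUE for every vector where median2() agrees with the built-in median()
agrees <- sapply(test_vectors, function(v) median2(v) == median(v))
agrees
```

If any element of agrees were FALSE, that would point to an input the function does not yet handle correctly.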

Building blocks

Remind yourself of a basic mathematical principle: a function takes some input, transforms it, and outputs the transformation. For example, the function f(x) = 2x takes a vector x and transforms each element to twice its original value. Functions in R (and other languages) do the same thing. For example:

doubleval <- function(x) 2 * x #write a function that doubles x
doubleval(c(3, 5, 7)) #test the function on a vector

## [1]  6 10 14

Here are other, equivalent, ways of writing this function:

doubleval <- function(x) return(2 * x)
doubleval <- function(x) {
    transformation <- 2 * x
    transformation
}
doubleval <- function(x) {
    transformation <- 2 * x
    return(transformation)
}

Observe the following:

Functions include some input or, more technically, one or many parameters. The function doubleval has one parameter called x; median2 also has one parameter (vec). The names of parameters are arbitrary: you can call them whatever you want as long as you reference the same name within the function. Note that functions often have more than one parameter.
Functions include a line that specifies the output of the function. For clarity, it is useful to use the return() statement for indicating what the function is outputting, although this is not necessary: without it, R returns the value of the last expression evaluated in the function.
If a function includes several operations, those operations should be written on separate lines and be surrounded by curly brackets ({}). Very simple functions can be written on one line, omitting the curly brackets.
Objects created within functions do not exist in the global variable space. For example, vec_sort in the function median2 (and other objects created within the function) cannot be accessed outside the function. This relates to an important feature of programming called scope.
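The points about parameters, return values, and scope can be seen in a small sketch. The function scale_values and its parameter names are made up for this illustration. The second parameter has a default value, and the object created inside the function is not visible afterwards (assuming no object called scaled_tmp already exists in your session).

```r
# Hypothetical function with two parameters; 'multiplier' defaults to 2
scale_values <- function(x, multiplier = 2) {
    scaled_tmp <- x * multiplier  # exists only while the function runs
    scaled_tmp                    # value of the last expression is returned
}

scale_values(c(3, 5, 7))                   # uses the default multiplier
scale_values(c(3, 5, 7), multiplier = 10)  # overrides the default

# The local object is not available outside the function
exists("scaled_tmp")
```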

Applications

Functions can be used in a wide variety of scenarios. Here are two applications, which I go over in detail below:

A function that reads and manipulates a .csv file. Use it with lapply() or in a for loop to iterate over several files with a similar structure. Then combine the resulting data frames into one data frame.
A function that carries out a regression or graphing analysis on a select number of variables or on a subset of the data.

Reading several files

Begin by downloading a .zip file with service request data from NYC. The zip file contains six files for years 2004-2009, each with 10,000 observations. The data are originally from NYC’s Open Data portal, which hosts datasets with millions of service requests filed by residents through the city’s 311 program. For the purpose of this example, I have taken a random sample of 10,000 requests for each year.

Here’s what the 2004 file looks like (the other years have the same structure).

# Read the 311 data for 2004 (after setting the working directory)
nyc04 <- read.csv("nyc-311-2004-sample.csv")
head(nyc04)

##   Unique.Key        Created.Date         Closed.Date
## 1    4735434 01/23/2004 12:00 AM 02/02/2004 12:00 AM
## 2    7547062 06/04/2004 12:00 AM 06/09/2004 12:00 AM
## 3    5050661 08/04/2004 12:00 AM 08/06/2004 12:00 AM
## 4    7281795 11/26/2004 12:00 AM 12/10/2004 12:00 AM
## 5    1443894 08/22/2004 12:00 AM 08/22/2004 12:00 AM
## 6    3244577 12/02/2004 12:00 AM 12/15/2004 12:00 AM
##                  Complaint.Type                                 Location
## 1                       Boilers  (40.71511134258793, -73.98998982667266)
## 2                       HEATING (40.871781348425515, -73.88238262118011)
## 3 General Construction/Plumbing  (40.59418801428136, -73.80082145383885)
## 4                      PLUMBING  (40.85911979460089, -73.90605127158484)
## 5       Noise - Street/Sidewalk  (40.54800892371052, -74.17041676351323)
## 6                         Noise

The variables in the data are as follows:

Unique.Key: An id number unique to each request.
Created.Date: The date the request was filed in the 311 system.
Closed.Date: The date the request was resolved by city workers (NA implies that it was never resolved).
Complaint.Type: The subject of the complaint.
Location: Coordinates that give the location of the service issue.

Our goal with the function is to read the file and clean it. In particular, we want to convert the Created.Date and Closed.Date variables so that R recognizes them as dates. From these variables, we can then calculate measures of government responsiveness: (1) how many days it took city workers to resolve a request, and (2) whether or not a request was resolved within a week.

# Load required packages
require(dplyr)
require(lubridate) #to work with dates

# Create a function that reads and cleans a service request file.
# The input is the name of a service request file and the
# output is a data frame with cleaned variables.  
clean_dta <- function(file_name) {

    # Read the file and save it to an object called 'dta'
    dta <- read.csv(file_name)

    # Clean the dates in the dta file and generate responsiveness measures
    dta <- dta %>%
        mutate(opened = mdy(substring(Created.Date, 1, 10)),
               closed = mdy(substring(Closed.Date, 1, 10)),
               resptime = as.numeric(difftime(closed, opened, units = "days")),
               resptime = ifelse(resptime >= 0, resptime, NA),
               solvedin7 = ifelse(resptime <= 7, 1, 0))

    # Return the cleaned data 
    return(dta)
}

Let’s test the function on the 2004 data:

# Execute function on the 2004 data 
nyc04 <- clean_dta("nyc-311-2004-sample.csv")
head(nyc04)

##   Unique.Key        Created.Date         Closed.Date
## 1    4735434 01/23/2004 12:00 AM 02/02/2004 12:00 AM
## 2    7547062 06/04/2004 12:00 AM 06/09/2004 12:00 AM
## 3    5050661 08/04/2004 12:00 AM 08/06/2004 12:00 AM
## 4    7281795 11/26/2004 12:00 AM 12/10/2004 12:00 AM
## 5    1443894 08/22/2004 12:00 AM 08/22/2004 12:00 AM
## 6    3244577 12/02/2004 12:00 AM 12/15/2004 12:00 AM
##                  Complaint.Type                                 Location
## 1                       Boilers  (40.71511134258793, -73.98998982667266)
## 2                       HEATING (40.871781348425515, -73.88238262118011)
## 3 General Construction/Plumbing  (40.59418801428136, -73.80082145383885)
## 4                      PLUMBING  (40.85911979460089, -73.90605127158484)
## 5       Noise - Street/Sidewalk  (40.54800892371052, -74.17041676351323)
## 6                         Noise
##       opened     closed resptime solvedin7
## 1 2004-01-23 2004-02-02       10         0
## 2 2004-06-04 2004-06-09        5         1
## 3 2004-08-04 2004-08-06        2         1
## 4 2004-11-26 2004-12-10       14         0
## 5 2004-08-22 2004-08-22        0         1
## 6 2004-12-02 2004-12-15       13         0

The cleaned dataset has four new variables:

opened: The date the request was filed in date format.
closed: The date the request was resolved in date format.
resptime: The number of days it took to resolve the request (closed - opened).
solvedin7: A dummy variable equal to 1 if the request was solved within a week and 0 otherwise.
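To trace how these variables are derived, the following sketch repeats the steps on a few made-up date strings using base R only (as.Date() in place of lubridate’s mdy()). It is an illustration of the logic under that substitution, not the code used in clean_dta.

```r
# Made-up date strings in the same format as the 311 files
created <- c("01/23/2004 12:00 AM", "06/04/2004 12:00 AM", "03/01/2004 12:00 AM")
closed_raw <- c("02/02/2004 12:00 AM", "06/09/2004 12:00 AM", NA)

# Keep the first 10 characters (MM/DD/YYYY) and convert them to Date objects
opened <- as.Date(substring(created, 1, 10), format = "%m/%d/%Y")
closed <- as.Date(substring(closed_raw, 1, 10), format = "%m/%d/%Y")

# Days between filing and resolution; NA when the request was never closed
resptime <- as.numeric(difftime(closed, opened, units = "days"))

# 1 if resolved within a week, 0 otherwise, NA if the response time is unknown
solvedin7 <- ifelse(resptime <= 7, 1, 0)

data.frame(opened, closed, resptime, solvedin7)
```

The third request has no closing date, so both resptime and solvedin7 are NA for it.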

We can now use this function on all six files using lapply(), saving each cleaned data frame into a list. (Run ?lapply to read more about lapply(). Of course, you can also use a for loop.)

# First create a vector with the names of the files we want to read
file_names <- paste0("nyc-311-", 2004:2009, "-sample.csv")
file_names

## [1] "nyc-311-2004-sample.csv" "nyc-311-2005-sample.csv"
## [3] "nyc-311-2006-sample.csv" "nyc-311-2007-sample.csv"
## [5] "nyc-311-2008-sample.csv" "nyc-311-2009-sample.csv"

# Now use the vector of file names and the 'clean_dta' function in lapply()
nyc_all <- lapply(file_names, clean_dta)

The list nyc_all now has six elements, consisting of cleaned data for each of the years in 2004-2009. For example, here are the first and second elements with the 2004 and 2005 data:

head(nyc_all[[1]]) #cleaned data for 2004

##   Unique.Key        Created.Date         Closed.Date
## 1    4735434 01/23/2004 12:00 AM 02/02/2004 12:00 AM
## 2    7547062 06/04/2004 12:00 AM 06/09/2004 12:00 AM
## 3    5050661 08/04/2004 12:00 AM 08/06/2004 12:00 AM
## 4    7281795 11/26/2004 12:00 AM 12/10/2004 12:00 AM
## 5    1443894 08/22/2004 12:00 AM 08/22/2004 12:00 AM
## 6    3244577 12/02/2004 12:00 AM 12/15/2004 12:00 AM
##                  Complaint.Type                                 Location
## 1                       Boilers  (40.71511134258793, -73.98998982667266)
## 2                       HEATING (40.871781348425515, -73.88238262118011)
## 3 General Construction/Plumbing  (40.59418801428136, -73.80082145383885)
## 4                      PLUMBING  (40.85911979460089, -73.90605127158484)
## 5       Noise - Street/Sidewalk  (40.54800892371052, -74.17041676351323)
## 6                         Noise
##       opened     closed resptime solvedin7
## 1 2004-01-23 2004-02-02       10         0
## 2 2004-06-04 2004-06-09        5         1
## 3 2004-08-04 2004-08-06        2         1
## 4 2004-11-26 2004-12-10       14         0
## 5 2004-08-22 2004-08-22        0         1
## 6 2004-12-02 2004-12-15       13         0

head(nyc_all[[2]]) #cleaned data for 2005

##   Unique.Key        Created.Date         Closed.Date       Complaint.Type
## 1    7998176 12/07/2005 12:00 AM 01/12/2006 12:00 AM             PLUMBING
## 2    6007505 11/18/2005 12:00 AM 11/18/2005 12:00 AM Sanitation Condition
## 3    3112357 06/06/2005 12:00 AM 06/08/2005 12:00 AM                Sewer
## 4    6833210 07/13/2005 12:00 AM 08/04/2005 12:00 AM             ELECTRIC
## 5    2551810 08/18/2005 12:00 AM 12/06/2005 12:00 AM   Indoor Air Quality
## 6    8275913 12/29/2005 12:00 AM 01/04/2006 12:00 AM              HEATING
##                                   Location     opened     closed resptime
## 1  (40.70071218913509, -73.90866200407376) 2005-12-07 2006-01-12       36
## 2 (40.769875671896564, -73.91746638294454) 2005-11-18 2005-11-18        0
## 3  (40.762006544550786, -73.7884685704754) 2005-06-06 2005-06-08        2
## 4   (40.6007819975723, -73.98167403896822) 2005-07-13 2005-08-04       22
## 5  (40.73281603654728, -73.86048419316799) 2005-08-18 2005-12-06      110
## 6 (40.643464799063885, -74.00610775671524) 2005-12-29 2006-01-04        6
##   solvedin7
## 1         0
## 2         1
## 3         1
## 4         0
## 5         0
## 6         1

Here’s the same task using a for loop instead. (In reality, you’d either use lapply() or a for loop, not both — this is just for illustrative purposes. As you’ll see, lapply() is more compact and elegant in this case, but a for loop is probably more intuitive.)

nyc_all <- list()
for(i in seq_along(file_names)) { #seq_along() is also safe if the vector is empty
    nyc_all[[i]] <- clean_dta(file_names[i])
}
head(nyc_all[[1]]) #take a look at the 2004 data

##   Unique.Key        Created.Date         Closed.Date
## 1    4735434 01/23/2004 12:00 AM 02/02/2004 12:00 AM
## 2    7547062 06/04/2004 12:00 AM 06/09/2004 12:00 AM
## 3    5050661 08/04/2004 12:00 AM 08/06/2004 12:00 AM
## 4    7281795 11/26/2004 12:00 AM 12/10/2004 12:00 AM
## 5    1443894 08/22/2004 12:00 AM 08/22/2004 12:00 AM
## 6    3244577 12/02/2004 12:00 AM 12/15/2004 12:00 AM
##                  Complaint.Type                                 Location
## 1                       Boilers  (40.71511134258793, -73.98998982667266)
## 2                       HEATING (40.871781348425515, -73.88238262118011)
## 3 General Construction/Plumbing  (40.59418801428136, -73.80082145383885)
## 4                      PLUMBING  (40.85911979460089, -73.90605127158484)
## 5       Noise - Street/Sidewalk  (40.54800892371052, -74.17041676351323)
## 6                         Noise
##       opened     closed resptime solvedin7
## 1 2004-01-23 2004-02-02       10         0
## 2 2004-06-04 2004-06-09        5         1
## 3 2004-08-04 2004-08-06        2         1
## 4 2004-11-26 2004-12-10       14         0
## 5 2004-08-22 2004-08-22        0         1
## 6 2004-12-02 2004-12-15       13         0
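That the two approaches produce the same list can be checked on a toy example that needs no files. The function make_row below is a made-up stand-in for clean_dta: it takes one input and returns one data frame.

```r
# Toy stand-in for clean_dta(): takes one input, returns one data frame
make_row <- function(yr) {
    data.frame(year = yr, label = paste0("file-", yr))
}

years <- 2004:2006

# Version 1: lapply() applies make_row() to each year and returns a list
rows_lapply <- lapply(years, make_row)

# Version 2: a for loop that fills a list one element at a time
rows_loop <- list()
for (i in seq_along(years)) {
    rows_loop[[i]] <- make_row(years[i])
}

identical(rows_lapply, rows_loop) # compares the two lists element by element
```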

Finally, let’s append the data frames stored in the nyc_all list into one data frame. This is easy using do.call() and rbind().

# List of data frames --> one data frame
nyc_all <- do.call(rbind, nyc_all)
class(nyc_all) #nyc_all is now a data frame

## [1] "data.frame"

dim(nyc_all) #nyc_all now has 60,000 observations

## [1] 60000     9

summary(nyc_all$opened) #opened contains all years in 2004-2009

##                  Min.               1st Qu.                Median
## "2004-01-01 00:00:00" "2005-07-10 00:00:00" "2006-12-31 12:00:00"
##                  Mean               3rd Qu.                  Max.
## "2007-01-09 08:50:12" "2008-07-04 06:00:00" "2009-12-31 00:00:00"
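To see what do.call(rbind, ...) does, consider a sketch with two small made-up data frames. do.call() passes the elements of the list as separate arguments to rbind(), which is the same as writing out the call by hand.

```r
# Two made-up data frames with the same columns
parts <- list(
    data.frame(id = 1:2, year = 2004),
    data.frame(id = 3:4, year = 2005)
)

# do.call() passes the list elements as separate arguments to rbind() ...
combined <- do.call(rbind, parts)

# ... so it is the same as writing the call out by hand
by_hand <- rbind(parts[[1]], parts[[2]])

identical(combined, by_hand)
dim(combined) # 4 rows, 2 columns
```

The advantage of do.call() is that it works for a list of any length, so the code does not change if more files are added.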

Complex analyses

Functions can also be used when you have to carry out a bunch of analyses in a flexible way. Let’s use the nyc_all dataset that we just created above to test the hypothesis that it takes city workers in NYC a longer time to resolve requests that are filed during the winter (December-February), presumably because of tougher weather conditions.

First let’s add a dummy variable equal to 1 if a request was filed during December-February.

nyc_all <- nyc_all %>% mutate(winter = ifelse(month(opened) %in% c(1, 2, 12), 1, 0))
head(select(nyc_all, opened, winter)) #'winter' equals 1 if request opened in Dec-Feb

##       opened winter
## 1 2004-01-23      1
## 2 2004-06-04      0
## 3 2004-08-04      0
## 4 2004-11-26      0
## 5 2004-08-22      0
## 6 2004-12-02      1

Now let’s write a function that allows us to test our hypothesis in a few different ways. The function has three parameters:

dta: the data frame to use in the analyses (probably nyc_all).
model: a regression model, specified in a formula() call.
method: the method by which to carry out the analysis (either “OLS” or “logit”).

The output of the function will be a regression table (either OLS or logit).

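nyc_analysis <- function(dta, model, method) {

    if (method == "OLS") {
        m <- lm(model, data = dta)
    } else if (method == "logit") {
        m <- glm(model, data = dta, family = binomial)
    } else {
        # Fail early instead of reaching summary(m) with 'm' undefined
        stop("method must be either \"OLS\" or \"logit\"")
    }

    return(summary(m))

}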

# Run OLS and logit models
nyc_analysis(nyc_all, formula(resptime ~ winter), "OLS")

##
## Call:
## lm(formula = model, data = dta)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -29.23  -28.23  -23.40  -13.40 2715.77
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  29.2323     0.5682  51.451  < 2e-16 ***
## winter       -4.8351     1.1418  -4.234 2.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.3 on 57611 degrees of freedom
##   (2387 observations deleted due to missingness)
## Multiple R-squared:  0.0003111,  Adjusted R-squared:  0.0002938
## F-statistic: 17.93 on 1 and 57611 DF,  p-value: 2.295e-05

nyc_analysis(nyc_all, formula(solvedin7 ~ winter), "logit")

##
## Call:
## glm(formula = model, family = binomial, data = dta)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4510  -1.4330   0.9265   0.9418   0.9418
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.58325    0.01002   58.22   <2e-16 ***
## winter       0.04023    0.02022    1.99   0.0466 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 75015  on 57612  degrees of freedom
## Residual deviance: 75011  on 57611  degrees of freedom
##   (2387 observations deleted due to missingness)
## AIC: 75015
##
## Number of Fisher Scoring iterations: 4

It actually appears that, on average, city workers take less time (about 5 days less) to resolve service requests filed during the winter (OLS model). The logit model corroborates this: it shows a higher likelihood of requests being resolved within a week during the winter.
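A side note on the method argument: nyc_analysis compares it against fixed strings. A common base R idiom for arguments of this kind is match.arg(), which stops with an error when the value does not match one of the allowed choices. The sketch below is a variation under that assumption, not the function used above; fit_model is a made-up name, and the built-in mtcars data stands in for nyc_all so that the block runs on its own.

```r
# Hypothetical variant of nyc_analysis() that validates 'method'
fit_model <- function(dta, model, method = c("OLS", "logit")) {
    method <- match.arg(method) # error unless method matches "OLS" or "logit"

    if (method == "OLS") {
        m <- lm(model, data = dta)
    } else {
        m <- glm(model, data = dta, family = binomial)
    }

    return(m)
}

# mtcars ships with R: mpg is continuous, am is a 0/1 variable
ols <- fit_model(mtcars, mpg ~ wt, "OLS")
logit <- fit_model(mtcars, am ~ wt, "logit")
coef(ols)
coef(logit)

# An unsupported method fails; here the error message is captured as a string
res <- tryCatch(fit_model(mtcars, mpg ~ wt, "probit"),
                error = function(e) conditionMessage(e))
res
```

Listing the allowed values in the function signature also documents them for anyone reading the code.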

Say we settle for the OLS model and want to graph the OLS coefficient for each year in the data (to look at over-time changes). We can then write a function that gets the OLS coefficient on winter for a desired year as well as lower and upper 95% confidence bounds on this estimate.

nyc_ols <- function(dta, model, year) {

    # Filter the data to the desired year
    sub <- dta %>% filter(year(opened) == year)

    # Run OLS model
    m <- lm(model, data = sub)

    # Get the coefficient estimate, standard error, and confidence bounds
    coef <- coef(m)[2]
    se <- sqrt(diag(vcov(m)))[2]
    lb <- coef - se * 1.96
    ub <- coef + se * 1.96

    # Create a data frame with this information (as well as the year)
    # The data frame will have only one row
    result <- data.frame(year, coef, se, lb, ub, row.names = NULL)

    return(result)

}

# Test that the function works for 2004
nyc_ols(nyc_all, formula(resptime ~ winter), 2004)

##   year     coef       se       lb       ub
## 1 2004 8.461265 3.563702 1.476408 15.44612

# Confirm using regular approach
# Coefficient and SE should be the same as above
summary(lm(resptime ~ winter, data = nyc_all, subset = year(opened) == 2004))

##
## Call:
## lm(formula = resptime ~ winter, data = nyc_all, subset = year(opened) ==
##     2004)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -39.66  -31.20  -29.20  -18.20 2340.80
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   31.195      1.349  23.125   <2e-16 ***
## winter         8.461      3.564   2.374   0.0176 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.1 on 9873 degrees of freedom
##   (125 observations deleted due to missingness)
## Multiple R-squared:  0.0005707,  Adjusted R-squared:  0.0004694
## F-statistic: 5.637 on 1 and 9873 DF,  p-value: 0.0176
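The bounds in nyc_ols use 1.96, the normal-approximation multiplier for a 95% interval. With thousands of observations per year this is very close to the t-based interval that confint() reports for lm fits. The sketch below compares the two on the built-in mtcars data (a stand-in chosen so the block runs without the 311 files), where the small sample makes the difference visible.

```r
# Small built-in dataset, so the t and normal multipliers differ noticeably
m <- lm(mpg ~ wt, data = mtcars)

est <- coef(m)["wt"]
se <- sqrt(diag(vcov(m)))["wt"]

# Normal approximation, as in nyc_ols()
normal_ci <- c(est - 1.96 * se, est + 1.96 * se)

# t-based interval using the residual degrees of freedom
t_mult <- qt(0.975, df = df.residual(m))
t_ci <- c(est - t_mult * se, est + t_mult * se)

normal_ci
t_ci
confint(m)["wt", ] # should agree with t_ci
```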

Now that we can run this model for a given year, we can iterate over all the years in the dataset, again using lapply() (which creates a list of data frames). We then create one data frame from this list and graph the results.

f <- formula(resptime ~ winter)
nyc_results <- lapply(2004:2009, nyc_ols, dta = nyc_all, model = f)
nyc_results <- do.call(rbind, nyc_results) #list --> data.frame
nyc_results

##   year      coef       se         lb          ub
## 1 2004  8.461265 3.563702   1.476408 15.44612206
## 2 2005 -7.297124 3.790361 -14.726231  0.13198332
## 3 2006 -5.151039 3.302475 -11.623889  1.32181119
## 4 2007 -9.093511 2.493186 -13.980156 -4.20686516
## 5 2008 -5.366320 1.824958  -8.943237 -1.78940229
## 6 2009 -2.390624 1.266309  -4.872590  0.09134128
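The lapply() call above relies on how R matches arguments: dta and model are supplied by name, so each element of 2004:2009 is passed to the one remaining parameter, year. A toy sketch of the same mechanism, using a made-up function called power:

```r
# Hypothetical two-parameter function
power <- function(base, exponent) base^exponent

# 'base' is fixed by name, so each element of 1:3 fills 'exponent'
unlist(lapply(1:3, power, base = 2))

# Equivalent, with the varying argument spelled out
unlist(lapply(1:3, function(e) power(base = 2, exponent = e)))
```

Both calls compute 2 raised to the powers 1, 2, and 3.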

# Graph the results
require(ggplot2)

ggplot(nyc_results, aes(x = year, y = coef)) +
    geom_point() +
    geom_errorbar(aes(ymin = lb, ymax = ub), width = 0) +
    geom_hline(yintercept = 0) + #reference line at zero (no winter difference)
    theme_bw() +
    ylab("response time during winter compared to rest of year (days)") + 
    ggtitle("Response time during winter compared to rest of year (in days)")

Note that negative values indicate how many fewer days, on average, it takes city workers to resolve requests filed during the winter as compared to the rest of the year (the winter dummy compares December-February with all other months, not only the summer). If this analysis is correct, it seems like it takes the city less time to respond to service requests during the winter (between 2 and 9 days less) for all years except 2004, although the confidence intervals for 2005, 2006, and 2009 include zero.

We can also run the analyses with controls. Most importantly, maybe a different type of complaint is filed during the winter than during other periods of the year. We can adjust for such potential confounding by introducing complaint type as a covariate in the analysis:

f <- formula(resptime ~ winter + factor(Complaint.Type))
nyc_results <- lapply(2004:2009, nyc_ols, dta = nyc_all, model = f)
nyc_results <- do.call(rbind, nyc_results) #list --> data.frame
nyc_results

##   year       coef       se        lb        ub
## 1 2004  8.6437735 3.519323  1.745901 15.541646
## 2 2005 12.5461785 3.037074  6.593514 18.498843
## 3 2006  6.8321752 2.712781  1.515124 12.149226
## 4 2007 -0.9162294 2.265190 -5.356002  3.523543
## 5 2008 -0.3561900 1.670338 -3.630052  2.917672
## 6 2009 -0.2771271 1.239851 -2.707236  2.152981

# Graph the results
require(ggplot2)

ggplot(nyc_results, aes(x = year, y = coef)) +
    geom_point() +
    geom_errorbar(aes(ymin = lb, ymax = ub), width = 0) +
    geom_hline(yintercept = 0) + #reference line at zero (no winter difference)
    theme_bw() +
    ylab("response time during winter compared to rest of year (days)") +
    ggtitle("Response time, winter v. rest of year (controlling for complaint type)")

Now it indeed seems like it takes longer to resolve service requests during the winter (at least between 2004 and 2006).

To summarize, in the applications above, a function was created to allow for easy and flexible completion of a task. Not creating a function for these tasks would work, though it would also result in more verbose code (e.g., copying and pasting, changing only some aspects of the code). Functions minimize potential mistakes that may result from such manual iteration of code. They are also useful for carrying out a range of analyses and graphing the results, as the last application makes clear.

Exercises

Write a function called second_largest that finds the second largest value in a vector of numeric values. That is, the input should be a numeric vector and the output should be the second largest value in the vector. You can assume that the input vector has at least two values. Test your function on the following two vectors:

v1 <- 1:10
v2 <- c(15, 1000, 2, 3, 8)

Modify the second_largest function so that it accounts for two special cases: (1) when the user inputs a numeric vector with only one value, the function should return the message “Invalid input: at least two values required”; (2) when the user inputs a vector that is not numeric, the function should return the message “Invalid input: only numeric vectors accepted”. Test your new function on the following vectors:

v1 <- 1:10
v2 <- 2
v3 <- c("banana", "apple", "orange")
v4 <- as.factor(1:10)

Using the nyc_all data frame created above (it should have 60,000 observations and have observations from 2004 to 2009), write a function called no_obs that finds the number of observations for a given complaint type in a given year. The function should have three parameters: dta (the data frame of interest), type (the complaint type category as a string), and year (the year the request was opened). The output of the function should be a data frame with one row with the name of the complaint type, the year, and a value with the number of observations. The function should be indifferent to whether the complaint type is in upper or lower case or capitalized (e.g., “HEATING”, “Heating”, and “heating” should be counted as one complaint type). You can assume the input data frame (dta) always has a variable called Complaint.Type. Test your function by ensuring that the results for the complaint types “Sewer”, “sewer”, and “heating” for various years look as follows:

no_obs(dta = nyc_all, type = "Sewer", year = 2004)

##   complaint.type year   n
## 1          sewer 2004 642

no_obs(dta = nyc_all, type = "sewer", year = 2004)

##   complaint.type year   n
## 1          sewer 2004 642

no_obs(dta = nyc_all, type = "heating", year = 2004)

##   complaint.type year    n
## 1        heating 2004 1187

no_obs(dta = nyc_all, type = "heating", year = 2009)

##   complaint.type year    n
## 1        heating 2009 1356