Simon Ejdemyr December, 2015

Merging & Appending


Contents
Summary This tutorial explains how to combine datasets by merging or appending them. Merging means transferring columns from one dataset to another, while appending means transferring rows from one dataset to another. These are two important tasks since datasets only very rarely come with all the information we need.

Merging

Say you’re interested in how economic development affects democracy across countries, but all you have are two separate datasets on these two factors. What you need to do is to combine the two. When you’re done with this, you also need information on what world region each country is located in. These tasks can be accomplished with one-to-one merging and one-to-many merging.

One-to-one merging

To illustrate, say you had two datasets that look like this:

# Generate toy dataset 1: 
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo
##   country democracy_score
## 1     USA              19
## 2 Albania              16
## 3  Turkey              16
## 4   China               5
## 5   Sudan              10
# Generate toy dataset 2: 
econ <- data.frame(
    country = c("China", "Albania", "Turkey", "USA", "Sudan"),
    gdp_pc = c(12000, 10000, 9000, 20000, 500)
    )
econ
##   country gdp_pc
## 1   China  12000
## 2 Albania  10000
## 3  Turkey   9000
## 4     USA  20000
## 5   Sudan    500

We can merge these datasets using R’s merge() command, where by specifies the variable the two datasets has in common (usually called an “ID” variable):

df <- merge(demo, econ, by = "country")
df
##   country democracy_score gdp_pc
## 1 Albania              16  10000
## 2   China               5  12000
## 3   Sudan              10    500
## 4  Turkey              16   9000
## 5     USA              19  20000

You can merge on more than one variable. Say you had two datasets that look like this:

# Generate toy dataset 1: 
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo
##   country year democracy_score
## 1     USA 1994               9
## 2   China 1994              10
## 3   Sudan 1994               4
## 4     USA 1995              19
## 5   China 1995               7
## 6   Sudan 1995              17
## 7     USA 1996              16
## 8   China 1996              13
## 9   Sudan 1996               5
# Generate toy dataset 2: 
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ
##   year country gdp_pc
## 1 1994     USA  10230
## 2 1995     USA  11714
## 3 1996     USA  11904
## 4 1994   China   9735
## 5 1995   China   2908
## 6 1996   China   9009
## 7 1994   Sudan   8489
## 8 1995   Sudan  17027
## 9 1996   Sudan  19729
merge(demo, econ, by = c("country", "year"))
##   country year democracy_score gdp_pc
## 1   China 1994              10   9735
## 2   China 1995               7   2908
## 3   China 1996              13   9009
## 4   Sudan 1994               4   8489
## 5   Sudan 1995              17  17027
## 6   Sudan 1996               5  19729
## 7     USA 1994               9  10230
## 8     USA 1995              19  11714
## 9     USA 1996              16  11904

Here, we’re merging on both country and year.

If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x, all.y, or all:

econ <- econ[-c(5, 9), ] #delete obs. 5 & 9 for illustrative purposes
econ
##   year country gdp_pc
## 1 1994     USA  10230
## 2 1995     USA  11714
## 3 1996     USA  11904
## 4 1994   China   9735
## 6 1996   China   9009
## 7 1994   Sudan   8489
## 8 1995   Sudan  17027
dim(demo) #different number of observations (rows) than 'econ'
## [1] 9 3
dim(econ) #different number of observations (rows) than 'demo'          
## [1] 7 3
merge(demo, econ, by = c("country", "year")) #keep only matching observations 
##   country year democracy_score gdp_pc
## 1   China 1994              10   9735
## 2   China 1996              13   9009
## 3   Sudan 1994               4   8489
## 4   Sudan 1995              17  17027
## 5     USA 1994               9  10230
## 6     USA 1995              19  11714
## 7     USA 1996              16  11904
merge(demo, econ, by = c("country", "year"), all.x = TRUE) #keep all observations in 'demo'
##   country year democracy_score gdp_pc
## 1     USA 1994               9  10230
## 2     USA 1995              19  11714
## 3     USA 1996              16  11904
## 4   China 1994              10   9735
## 5   China 1995               7     NA
## 6   China 1996              13   9009
## 7   Sudan 1994               4   8489
## 8   Sudan 1995              17  17027
## 9   Sudan 1996               5     NA

One-to-many merging

In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.

Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:

demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo
##   country year democracy_score
## 1     USA 1994               7
## 2   China 1994              16
## 3   Sudan 1994               6
## 4     USA 1995               4
## 5   China 1995               5
## 6   Sudan 1995              15
## 7     USA 1996              16
## 8   China 1996               5
## 9   Sudan 1996              14
region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa")
    )
region
##   country  region
## 1     USA America
## 2   China    Asia
## 3   Sudan  Africa

A one-to-many merge of these data frames would look like this:

merge(demo, region, by = "country")
##   country year democracy_score  region
## 1   China 1994              16    Asia
## 2   China 1995               5    Asia
## 3   China 1996               5    Asia
## 4   Sudan 1995              15  Africa
## 5   Sudan 1994               6  Africa
## 6   Sudan 1996              14  Africa
## 7     USA 1994               7 America
## 8     USA 1996              16 America
## 9     USA 1995               4 America

Appending

Appending means matching datasets vertically. We can do this in R using rbind(). The two dataframes you’re appending must have identical variable names. Here’s an example:

df1 <- data.frame(
    year = rep(1990:1995, 2),
    country = c(rep("country1", 6), rep("country2", 6))
    )
df2 <- data.frame(
    year = rep(1996:2000, 2),
    country = c(rep("country1", 5), rep("country2", 5))
    )

df <- rbind(df1, df2) #append

require(dplyr) #to display output in certain order
arrange(df, country, year)
##    year  country
## 1  1990 country1
## 2  1991 country1
## 3  1992 country1
## 4  1993 country1
## 5  1994 country1
## 6  1995 country1
## 7  1996 country1
## 8  1997 country1
## 9  1998 country1
## 10 1999 country1
## 11 2000 country1
## 12 1990 country2
## 13 1991 country2
## 14 1992 country2
## 15 1993 country2
## 16 1994 country2
## 17 1995 country2
## 18 1996 country2
## 19 1997 country2
## 20 1998 country2
## 21 1999 country2
## 22 2000 country2

Exercises

Warm-up: Merge the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.

df1 <- data.frame(
    name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
             "Bjorn", "Oden", "Dennis"),
    treatment_gr = c(rep(c(1, 2, 3), each = 3)),
    weight_p1 = round(runif(9, 100, 200), 0)
    )
df2 <- data.frame(
    name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
             "Oden", "Bjorn"),
    weight_p2 = round(runif(8, 100, 200), 0)
    )
df3 <- data.frame(
    treatment_gr = c(1, 2, 3),
    type = c("dog-lovers", "cat-lovers", "all-lovers")
    )

More extensive: A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981). Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.

  1. Download data on inequality and tax collection (links below). Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
  • Dataset with six measures of inequality (originally from Mark W. Frank).
  • Tax data that can be used to measure fiscal capacity/redistribution. Use the “State Government Tax Collections” link.
  1. Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012.

  2. Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.

  3. The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint: gsub().