Wide versus long data

Illustrating the difference between wide and long datasets is easiest using an example. Here are two datasets with the exact same information represented in wide and long form respectively (imagine that avgtemp represents average temperature in Celsius):

# Create long dataset
country_long <- data.frame(
    expand.grid(country = c("Sweden", "Denmark", "Norway"), year = 1994:1996),
    avgtemp = round(runif(9, 3, 12), 0)
    )
country_long

   country year avgtemp
 1  Sweden 1994       6
 2 Denmark 1994       6
 3  Norway 1994       3
 4  Sweden 1995       5
 5 Denmark 1995       8
 6  Norway 1995      11
 7  Sweden 1996       7
 8 Denmark 1996       8
 9  Norway 1996       7

# Create wide dataset
country_wide <- data.frame(
    country = c("Sweden", "Denmark", "Norway"),
    avgtemp.1994 = country_long$avgtemp[1:3],
    avgtemp.1995 = country_long$avgtemp[4:6],
    avgtemp.1996 = country_long$avgtemp[7:9])
country_wide

   country avgtemp.1994 avgtemp.1995 avgtemp.1996
 1  Sweden            6            5            7
 2 Denmark            6            8            8
 3  Norway            3           11            7

As is obvious, the long dataset separates the unit of analysis (country-year) into two separate variables. The wide dataset combines one of the keys (year) with the value variable (avgtemp).

A case for long data

There are many reasons to prefer datasets structured in long form. Repeating some of the points made in Hadley Wickham’s excellent paper on the topic, here are three reasons why you should attempt to structure your data in long form:

If you have many value variables, it is difficult to summarize wide-form datasets at a glance (which in turn makes it hard to identify mistakes in the data). For example, imagine we have a dataset with 50 years and 10 value variables of interest — this would result in 500 columns in wide form. Summarizing each column to look for strange observations, or simply understanding which variables are included in the dataset, becomes difficult in this case.
Structuring data as key-value pairs — as is done in long-form datasets — facilitates conceptual clarity. For example, in country_long above, it is clear that the unit of analysis is country-year — or, put differently, that the variables country and year jointly constitute the key in the dataset. In wide-form datasets, one of the variables that constitutes the unit of analysis is mixed with a variable that holds values. (Read more about this in Hadley’s paper referenced above.)
Long-form datasets are often required for advanced statistical analysis and graphing. For example, if you wanted to run a regression with year and/or country fixed effects, you would have to structure your data in long form. Furthermore, many graphing packages, including ggplot, rely on your data being in long form.

Wide-to-long conversion

To illustrate how to convert a dataset from wide to long format, we’ll use a UNICEF dataset on under-five child mortality across 196 countries. Download the dataset here; it is based on data that can be found at www.childmortality.org. The under-five mortality rate is expressed as the number of under-five deaths per 1,000 live births.

Set your working directory and read the file:

u5mr <- read.csv("unicef-u5mr.csv")

This dataset has 196 rows, one for each country, and 67 variables:

dim(u5mr) #dimensions of the data frame

 [1] 196  67

names(u5mr) #the variable names

  [1] "CountryName" "U5MR.1950"   "U5MR.1951"   "U5MR.1952"   "U5MR.1953"
  [6] "U5MR.1954"   "U5MR.1955"   "U5MR.1956"   "U5MR.1957"   "U5MR.1958"
 [11] "U5MR.1959"   "U5MR.1960"   "U5MR.1961"   "U5MR.1962"   "U5MR.1963"
 [16] "U5MR.1964"   "U5MR.1965"   "U5MR.1966"   "U5MR.1967"   "U5MR.1968"
 [21] "U5MR.1969"   "U5MR.1970"   "U5MR.1971"   "U5MR.1972"   "U5MR.1973"
 [26] "U5MR.1974"   "U5MR.1975"   "U5MR.1976"   "U5MR.1977"   "U5MR.1978"
 [31] "U5MR.1979"   "U5MR.1980"   "U5MR.1981"   "U5MR.1982"   "U5MR.1983"
 [36] "U5MR.1984"   "U5MR.1985"   "U5MR.1986"   "U5MR.1987"   "U5MR.1988"
 [41] "U5MR.1989"   "U5MR.1990"   "U5MR.1991"   "U5MR.1992"   "U5MR.1993"
 [46] "U5MR.1994"   "U5MR.1995"   "U5MR.1996"   "U5MR.1997"   "U5MR.1998"
 [51] "U5MR.1999"   "U5MR.2000"   "U5MR.2001"   "U5MR.2002"   "U5MR.2003"
 [56] "U5MR.2004"   "U5MR.2005"   "U5MR.2006"   "U5MR.2007"   "U5MR.2008"
 [61] "U5MR.2009"   "U5MR.2010"   "U5MR.2011"   "U5MR.2012"   "U5MR.2013"
 [66] "U5MR.2014"   "U5MR.2015"

Let’s convert it to long format, where the unit of analysis is country-year. That is, we’ll have three variables indicating country, year, and U5MR. This can be done using gather() in the tidyr package.

require(tidyr)

 Loading required package: tidyr

u5mr_long <- u5mr %>% gather(year, u5mr, U5MR.1950:U5MR.2015)
tail(u5mr_long)

       CountryName      year u5mr
 12931     Uruguay U5MR.2015 10.1
 12932  Uzbekistan U5MR.2015 39.1
 12933   Venezuela U5MR.2015 14.9
 12934       Samoa U5MR.2015 17.5
 12935       Yemen U5MR.2015 41.9
 12936      Zambia U5MR.2015 64.0

gather() takes three arguments. The first two specify a key-value pair: year is the key and u5mr the value. The third argument specifies which variables in the original data to convert into the key-value combination (in this case, all variables from U5MR.1950 to U5MR.2015).

Note that it would be better if year indicated years as numeric values. This only requires one more line of code, which uses mutate() from the dplyr package, gsub() to remove instances of “U5MR.”, and as.numeric() to convert the variable from character to numeric.

require(dplyr)
u5mr_long <- u5mr %>%
    gather(year, u5mr, U5MR.1950:U5MR.2015) %>%
    mutate(year = as.numeric(gsub("U5MR.", "", year)))
tail(u5mr_long)

       CountryName year u5mr
 12931     Uruguay 2015 10.1
 12932  Uzbekistan 2015 39.1
 12933   Venezuela 2015 14.9
 12934       Samoa 2015 17.5
 12935       Yemen 2015 41.9
 12936      Zambia 2015 64.0

Learn more about gather() and other excellent tidyr functions that facilitate tidy data here and here.

Exercises

Go to www.childmortality.org and download the dataset under “Estimates for under-five, infant and neonatal mortality”. (You can also download it here.) The dataset is in wide form and contains six value variables of interest: under-five (0-4 years) mortality, infant (0-1 years) mortality, neonatal (0-1 month) mortality, as well as the number of under-five, infant, and neonatal deaths.

Read the dataset into R. Note that it is in .xlsx format — there are R packages that will allow you to read such files, or you can convert it to .csv using software such as Excel and then use read.csv(). Note that you don’t need the first few rows of the dataset.
Subset the dataset to the median estimate for each country (i.e., drop rows representing lower and upper uncertainty bounds).
Convert the dataset to long form. The final dataset should have four variables:

country (a character variable with the country name)
year (a numeric variable with the year)
type (a character or factor variable with six categories: “U5MR”, “IMR”, “NMR”, “Under five deaths”, “Infant deaths”, and “Neonatal deaths”).
value (a numeric variable with the value for the given country, year, and type)

For example, here are what 12 observations for Afghanistan (displaying only years 1990-91) should look like:

        country year              type    value
 1  Afghanistan 1990              U5MR    181.0
 2  Afghanistan 1990               IMR    122.5
 3  Afghanistan 1990               NMR     52.8
 4  Afghanistan 1990 Under five deaths 100437.0
 5  Afghanistan 1990     Infant deaths  68718.0
 6  Afghanistan 1990   Neonatal deaths  29658.0
 7  Afghanistan 1991              U5MR    174.2
 8  Afghanistan 1991               IMR    118.3
 9  Afghanistan 1991               NMR     51.9
 10 Afghanistan 1991 Under five deaths 101417.0
 11 Afghanistan 1991     Infant deaths  69482.0
 12 Afghanistan 1991   Neonatal deaths  30537.0

How many under-five, infant, and neonatal deaths occurred in total in the world in years 1990, 1995, 2000, 2005, and 2015? (Hint: see the tutorial on collapsing data.)