Simon Ejdemyr December, 2015

Dataset Basics


Contents
Summary This tutorial introduces datasets — “data frames” in R. If you’ve completed the tutorial on vectors, you’ll soon see that datasets can be thought of as a collection of vectors stored as columns in a dataset. We’ll talk about how to create datasets and how to read them from file. We’ll also talk more conceptually about how datasets should be structured.

Creating datasets

Let’s start by learning how to create a dataset in R. This turns out to be very simple — just combine vectors using the data.frame() command.

# Create three vectors 
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# Create data frame 
children <- data.frame(name, age, hair)
children
##    name age  hair
## 1    al   6 brown
## 2   bea   7 green
## 3 carol   4 blond
# Creating a data frame can also be done without first saving vectors 
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
    )
children
##    name age  hair
## 1    al   6 brown
## 2   bea   7 green
## 3 carol   4 blond

We created a dataset called children, which has 3 rows and 3 columns. We used two approaches that differ in whether they first save vectors to R’s memory.

Dataset structure

More important than learning the mechanics of creating a dataset in R is to understand their general structure:

  1. Each column should consist of a vector that gives some fact about the world (e.g., age in years). We usually refer to these columns as variables.
  2. At least one column should identify who or what the information in the data is about. Such a variable is called an “id” variable or “key”. In the children dataset above this variable is name. The remaining variables have the facts or measurements that we care about. For example, we gather from the dataset that Al is 6 years old (one fact) and that Al has brown hair (a second fact).

To better understand the proper structure of datasets, let’s create a second data frame. Suppose here that gdp_pc is a measure of a country’s GDP per capita in a given year. (Use ?expand.grid and ?runif to learn more about these functions, though that is not a priority right now.)

countries <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
countries
##   country year gdp_pc
## 1     USA 1994  16454
## 2   China 1994  13753
## 3   Sudan 1994  16899
## 4     USA 1995  16964
## 5   China 1995   2358
## 6   Sudan 1995   4262
## 7     USA 1996   5995
## 8   China 1996   8699
## 9   Sudan 1996   3242

This time around our dataset has two id variables: country and year. Why two and not one? One way to think about it is that country by itself wouldn’t be sufficient to uniquely identify a row, because there are three rows for each country (and likewise with year). Combined, however, country and year uniquely identify each row. In other words, GDP per capita (the only fact or measurement in this dataset) describes a given country in a given year.

We can say that the unit of analysis in the dataset countries is country-year. This means that two id variables (country and year) are required to uniquely identify each row. In the children dataset above the unit of analysis is “child” or “person”.

Basic commands

Here are some commands that are useful for getting to know your data and for understanding dataset structures in general.

Dimensions

The first is dim(), which gives the dimensions of a data frame. The number of rows are listed first, columns second.

dim(countries)
## [1] 9 3

Use nrow() and ncol() to to get the number of rows or columns separately. These commands are useful for code generalization.

nrow(countries)
## [1] 9
ncol(countries)
## [1] 3

Snapshots

Use head() and tail() to look at the first and last few rows of a dataset, respectively. This is more useful when we have datasets with many observations.

head(countries)
##   country year gdp_pc
## 1     USA 1994  16454
## 2   China 1994  13753
## 3   Sudan 1994  16899
## 4     USA 1995  16964
## 5   China 1995   2358
## 6   Sudan 1995   4262
tail(countries)
##   country year gdp_pc
## 4     USA 1995  16964
## 5   China 1995   2358
## 6   Sudan 1995   4262
## 7     USA 1996   5995
## 8   China 1996   8699
## 9   Sudan 1996   3242

Other useful commands to get to know variables better include summary(), table(), and prop.table().

# Get some summary information about each variable
summary(countries)
##   country       year          gdp_pc
##  USA  :3   Min.   :1994   Min.   : 2358
##  China:3   1st Qu.:1994   1st Qu.: 4262
##  Sudan:3   Median :1995   Median : 8699
##            Mean   :1995   Mean   : 9847
##            3rd Qu.:1996   3rd Qu.:16454
##            Max.   :1996   Max.   :16964
# Number of observations by country 
table(countries$country)
##
##   USA China Sudan
##     3     3     3
# Proportion of observations by country 
prop.table(table(countries$country))
##
##       USA     China     Sudan
## 0.3333333 0.3333333 0.3333333

Accessing specific rows and columns

Like with vectors, brackets ([]) can be used to access data in datasets. But unlike with vectors, we need to input two arguments — separated by a comma — into the brackets. The first argument always applies to rows while the second applies to columns.

countries <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
countries
##   country year gdp_pc
## 1     USA 1994   9592
## 2   China 1994   9890
## 3   Sudan 1994   9554
## 4     USA 1995   9159
## 5   China 1995   7681
## 6   Sudan 1995   7619
## 7     USA 1996   6121
## 8   China 1996   9116
## 9   Sudan 1996   2136
# Access row 2, col 3
countries[2, 3]
## [1] 9890
# Access entire row 5
countries[5, ] #note: leaving second argument blank
##   country year gdp_pc
## 5   China 1995   7681
# Access entire column 3
countries[, 3] #note: leaving first argument blank
## [1] 9592 9890 9554 9159 7681 7619 6121 9116 2136

In general, though, accessing rows and columns by index is bad for code generalization. It particularly causes problems when you add or delete rows/columns, because then the indexing will change (e.g., column 3 representing GDP per capita may now be in column 4).

For this reason, it’s better to access columns using column names.

# Access a column using column/variable name (two equivalent approaches)
countries$year
## [1] 1994 1994 1994 1995 1995 1995 1996 1996 1996
countries[, "year"]
## [1] 1994 1994 1994 1995 1995 1995 1996 1996 1996

Note that when we’re accessing a column this way, it’s just a vector and all the things we’ve learned about vectors apply. For example:

# Get mean gdp per cap
mean(countries$gdp_pc)
## [1] 7874.222

To access rows, it’s best to use a logical statement, which is covered in more detail in a separate tutorial on modifying data. But just to give an example, here’s how we can access a row using bracket notation and a logical statement:

countries[countries$year == 1995 & countries$country == "USA", ]
##   country year gdp_pc
## 4     USA 1995   9159

Reading data

Note: In this section we’ll be working with a dataset called world-small.csv, which you can download here.

So far we’ve created datasets ourselves. Oftentimes, however, we’ll want to read a dataset into R from file. Datasets come in many formats — e.g., .csv, .txt, .dta, and .RData. R can read most data formats as is, but sometimes it may be necessary to manually reformat some elements in the file or even to reconvert the whole file to a different format (e.g., using Stat/Transfer). For now, we’ll assume that the file is in a readable format.

To read a file you need to

  1. Specify where the file is located on your computer. This is referred to as setting your working directory.
  2. Execute a command that will read the file from your working directory.

Setting the working directory

You can set your working directory manually. In RStudio, go to Session –> Set Working Directory –> Choose Directory… and find the folder in which your file is located.

While this works, you should also set the working directory using code. Use setwd(path-to-dir) where path-to-dir is the the path to the folder in which the file is located. How can you find this path? Here are instructions for Windows and mac. If you’re still not sure how to do this, take a look at this tutorial.

To check that your working directory includes the file you want to read, use dir() without anything in the parentheses. This function outputs all the files in your working directory into the R console. So, if you want to read the world-small.csv file that you downloaded above, you should see this file listed when you execute dir().

Reading the file

Now that we’ve told R where to look for our file, it’s time to read it. Different commands are used to read different types of files. This is the syntax used for reading a .csv file:

world <- read.csv("world-small.csv") 

I’m reading the file from the working directory and assigning it to the object world, which becomes of class data.frame.

class(world)
## [1] "data.frame"

Let’s check if the file was read correctly, using dim() (returns the dimensions), head() (returns the top six rows), and summary() (returns summary information about each variable):

dim(world) #the number of rows and columns 
## [1] 145   4
head(world) #the first few rows of the dataset
##     country       region gdppcap08 polityIV
## 1   Albania   C&E Europe      7715     17.8
## 2   Algeria       Africa      8033     10.0
## 3    Angola       Africa      5899      8.0
## 4 Argentina   S. America     14333     18.0
## 5   Armenia   C&E Europe      6070     15.0
## 6 Australia Asia-Pacific     35677     20.0
summary(world) #a summary of the variables in the dataset
##       country             region     gdppcap08        polityIV
##  Albania  :  1   Africa      :42   Min.   :  188   Min.   : 0.000
##  Algeria  :  1   C&E Europe  :25   1st Qu.: 2153   1st Qu.: 7.667
##  Angola   :  1   Asia-Pacific:24   Median : 7271   Median :16.000
##  Argentina:  1   S. America  :19   Mean   :13252   Mean   :13.408
##  Armenia  :  1   Middle East :16   3rd Qu.:19330   3rd Qu.:19.000
##  Australia:  1   W. Europe   :12   Max.   :85868   Max.   :20.000
##  (Other)  :139   (Other)     : 7

Everything looks as we would have hoped.

Exercises

  1. Read the world-small.csv data into R and store it in an object called world. (Set your working directory using code first.)

  2. (Conceptual) What is the unit of analysis in the dataset? What’s the name of the dataset’s id variable?

  3. How many observations does world have? How many variables? Use an R command to find out.

  4. Use brackets and a logical statement to inspect all the values for Nigeria and United States. That is, your code should return two entire rows of the dataset.

  5. Use R to return China’s Polity IV score. As in question 4, use a logical statement and brackets, but don’t return the entire row. Rather, return a single value with the Polity IV score.

  6. What is the lowest GDP per capita in the dataset? (Use R to return only the value.)

  7. What country has the lowest GDP per capita? (Your code should return the country name and be general enough so that if the observations in the dataset — or their order — change, your code should still return the country with the lowest GDP per capita.)