1 Tables and Graphs for Professional Reports

Many of you will have to produce professional reports and presentations, where clean tables and graphs matter. The best way to automate this process is to use LaTeX and Beamer rather than Microsoft Word and PowerPoint. Alternatively, check out the RStudio team’s slick R Markdown package, which makes producing beautiful reports really simple.

However, here are some options if you still want to use Word.


1.1 Tables

You can produce tables quickly without having to copy and paste every number from R to Word.

To do so:

  1. Create a table or data.frame in R.
  2. Write this table to a comma-separated .txt file using write.table() with sep = ",".
  3. Copy and paste the text in the .txt file into Word.
  4. In Word,
    1. select the text you just pasted from the .txt file
    2. go to Table \(\rightarrow\) Convert \(\rightarrow\) Convert Text to Table…
    3. make sure “Commas” is selected under “Separate text at”, click OK

The text should now end up in a table that you can format in Word.

1.1.1 Example 1

Here’s an example of the first two steps using a dataset on U.S. states (codebook here).

# Read data
states <- read.csv("states.csv")

# (1) Create a table of Bush support by U.S. region in 2000 (South versus Non-South):
t <- with(states, table(south, gb_win00))
t <- prop.table(t, margin = 1)
t                                         #Bush won a large majority of Southern states:
##           gb_win00
## south      Bush win Gore win
##   Nonsouth   0.4706   0.5294
##   South      0.8750   0.1250
# (2) Write this table to a comma separated .txt file:
write.table(t, file = "bush_south.txt", sep = ",", quote = FALSE)

The .txt file will end up in your working directory. Now follow steps 3 and 4 to create a table in Word.
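
One detail is worth checking before you paste into Word. With its default settings, write.table() writes the row names but gives them no header field, so the header line has one field fewer than the data lines and the column titles can end up shifted in Word. Here is a minimal sketch of one way around this, using a small made-up table and a temporary file (toy_table.txt is a hypothetical name, not one of the tutorial's files). Setting col.names = NA adds an empty header cell above the row names.

```r
# Toy table of proportions (made-up numbers, not taken from states.csv)
tab <- matrix(c(0.4, 0.6,
                0.9, 0.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Nonsouth", "South"),
                              c("Bush win", "Gore win")))

# Hypothetical output file in R's temporary directory
out_file <- file.path(tempdir(), "toy_table.txt")

# col.names = NA writes an empty first header field above the row names;
# round() keeps the pasted numbers short
write.table(round(tab, 2), file = out_file, sep = ",", quote = FALSE,
            col.names = NA)

# Inspect what was written: every line should have the same number of fields
lines_written <- readLines(out_file)
print(lines_written)
```

If you are unsure where a file was saved, getwd() prints the current working directory.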

1.1.2 Example 2

Here’s another example that again uses the states.csv dataset. Say we wanted to create a table with summary statistics for five of the variables in this dataset:

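# Load packages (plyr before dplyr, so that plyr does not mask dplyr's functions)
library(plyr)                              #to access colwise function
library(dplyr)                             #to access select function

# Keep 5 variables in states dataset
states_sub <- select(states, blkpct, attend_pct, bush00, obama08, womleg)

# Find summary statistics for each variable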
means <- colwise(mean)(states_sub)
stdev <- colwise(sd)(states_sub)
mins <- colwise(min)(states_sub)
maxs <- colwise(max)(states_sub)

# Create df with summary statistics, putting variables in rows using transpose function t()
df <- data.frame(t(means),
                 t(stdev),
                 t(mins),
                 t(maxs))

# Clean column and row names
names(df) <- c("Mean", "SD", "Min", "Max")
row.names(df) <- c("Black (%)", "Attend Church (%)", "Bush -00 (%)",
                   "Obama -08 (%)", "Women in Legislature (%)")

# Restrict number of decimal points to 1
df <- round(df, 1)
df
##                          Mean  SD  Min  Max
## Black (%)                10.3 9.7  0.4 36.8
## Attend Church (%)        38.9 9.4 22.0 60.0
## Bush -00 (%)             50.4 8.7 31.9 67.8
## Obama -08 (%)            50.5 9.5 32.5 71.8
## Women in Legislature (%) 23.2 7.3  8.8 37.8
# Write data frame to .txt file
write.table(df, file = "sumstats.txt", sep = ",", quote = FALSE)
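
If you would rather not load plyr, a similar table can be built with base R alone. The following is only a sketch under assumptions: the data frame toy and its values are made up and stand in for states_sub, and sapply() is used in place of colwise().

```r
# Made-up data standing in for states_sub
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(10, 20, 30, 40))

# sapply() applies a function to every column and returns a named vector,
# so each statistic becomes one column and each variable one row
sumstats <- data.frame(Mean = sapply(toy, mean),
                       SD   = sapply(toy, sd),
                       Min  = sapply(toy, min),
                       Max  = sapply(toy, max))

# Restrict number of decimal points to 1
sumstats <- round(sumstats, 1)
print(sumstats)
```

If a variable contains missing values, these functions return NA unless you pass na.rm = TRUE, for example sapply(toy, mean, na.rm = TRUE).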

1.1.3 Exercises

  1. Create a table of summary statistics in Word for vep04_turnout, vep08_turnout, unemploy, urban, and hispanic. The table should include the number of observations (n), mean, median, 10th percentile, and 90th percentile of each of the variables. Put the variables in the rows of the table and the summary statistics in the columns, like we did in the example above. Format your table in Word to make it look similar to this table.

1.2 Graphs

In Tutorial 2, we covered graphing with the ggplot2 package. Let’s talk about how to ensure that the graphs you produce look good when you include them in your write-ups.

1.2.1 Saving images as .pdf

Saving images as .pdf is usually your best option. Because PDF is a vector format, images don’t pixelate when they are resized. (And you can insert .pdfs into Word like you do with other image file formats.)

To save a .pdf, call pdf() before the code that draws the image, and call dev.off() afterwards to close the file.

Here’s an example, again using the states.csv dataset:

states <- read.csv("states.csv")

library(ggplot2)
p <- ggplot(states, aes(x = attend_pct, y = bush00)) +
       geom_point() +
       geom_text(aes(label = stateid, y = bush00 - 0.7), size = 3) +
       geom_smooth(method = "loess", se = FALSE) +
       xlab("% in State Attending Religious Services") +
       ylab("% in State Voting for Bush in 2000")

# Save the image as a pdf:
pdf(file = "bush_religion.pdf", height = 6, width = 8)
print(p)                                  #explicit print() also works inside functions, loops, and source()
dev.off()
## pdf
##   2
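
ggplot2 also provides ggsave(), which opens and closes the graphics device for you. Here is a minimal sketch with made-up data (toy, p_toy, and the file name toy_plot.pdf are hypothetical and only serve the illustration):

```r
library(ggplot2)

# Made-up data, not from states.csv
toy <- data.frame(x = 1:10, y = (1:10)^2)

p_toy <- ggplot(toy, aes(x = x, y = y)) +
           geom_point()

# Hypothetical output file in R's temporary directory
out_file <- file.path(tempdir(), "toy_plot.pdf")

# The file extension determines the format; width and height are in inches by default
ggsave(out_file, plot = p_toy, width = 8, height = 6)

file.exists(out_file)
```

Passing the plot explicitly through the plot argument avoids relying on whichever plot was displayed last.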

1.2.2 Arranging images in columns and rows

Arranging graphs into a matrix of rows and columns, like we did on problem set 2, can be very useful for presentational purposes. There are two ways to do this using ggplot:

  1. Create each graph separately and then arrange them using the function grid.arrange() in the gridExtra package.
  2. Use facet_wrap(), like we did in this example in Tutorial 2. This second approach is useful if we want to display the same relationship across different groups or years.
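
As a reminder of the second approach, here is a minimal sketch of facet_wrap() on made-up data (the data frame toy and its columns are hypothetical). Note that faceting needs the grouping variable in a single column, so a dataset that stores years in separate columns, as states.csv does with vep04_turnout and vep08_turnout, would have to be reshaped first.

```r
library(ggplot2)

# Made-up data in "long" format: one row per unit and year
toy <- data.frame(turnout = c(50, 55, 60, 65, 52, 58, 63, 70),
                  vote    = c(45, 48, 52, 55, 47, 50, 49, 56),
                  year    = rep(c(2004, 2008), each = 4))

# The same relationship drawn once per year, with panels arranged in 2 columns
p_facet <- ggplot(toy, aes(x = turnout, y = vote)) +
             geom_point() +
             facet_wrap(~ year, ncol = 2)

print(p_facet)
```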

Here’s an example of the first approach:

p1 <- ggplot(states, aes(x = bush00, y = bush04)) +
        geom_point() +
        geom_text(aes(label = stateid, y = bush04 - 0.7), size = 3) +
        geom_smooth(method = "loess", se = FALSE) +
        xlab("% in State Voting for Bush in 2000") +
        ylab("% in State Voting for Bush in 2004")

p2 <- ggplot(states, aes(x = bush04, y = obama08)) +
        geom_point() +
        geom_text(aes(label = stateid, y = obama08 - 0.7), size = 3) +
        geom_smooth(method = "loess", se = FALSE) +
        xlab("% in State Voting for Bush in 2004") +
        ylab("% in State Voting for Obama in 2008")

p3 <- ggplot(states, aes(x = vep04_turnout, y = bush04)) +
        geom_point() +
        geom_text(aes(label = stateid, y = bush04 - 0.7), size = 3) +
        geom_smooth(method = "loess", se = FALSE) +
        xlab("Turnout among Voting Eligible Population (2004)") +
        ylab("% in State Voting for Bush in 2004")

p4 <- ggplot(states, aes(x = vep08_turnout, y = obama08)) +
        geom_point() +
        geom_text(aes(label = stateid, y = obama08 - 0.7), size = 3) +
        geom_smooth(method = "loess", se = FALSE) +
        xlab("Turnout among Voting Eligible Population (2008)") +
        ylab("% in State Voting for Obama in 2008")

library(gridExtra)
grid.arrange(p1, p2, p3, p4,     #specify the graphs to include
             ncol = 2)           #specify the number of columns we want   

[Figure: scatterplots p1, p2, p3, and p4 arranged in a grid with two columns]

Of course, you could save this graph using the pdf() function from above.
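
A minimal sketch of what that could look like, assuming two made-up plots (g1, g2) and a hypothetical file name: the call to grid.arrange() simply goes between pdf() and dev.off().

```r
library(ggplot2)
library(gridExtra)

# Made-up data and plots, standing in for p1 to p4
toy <- data.frame(x = 1:10, y = (1:10)^2)
g1 <- ggplot(toy, aes(x = x, y = y)) + geom_point()
g2 <- ggplot(toy, aes(x = y, y = x)) + geom_point()

# Hypothetical output file in R's temporary directory
out_file <- file.path(tempdir(), "toy_grid.pdf")

# Open the pdf device, draw the arranged plots into it, then close the file
pdf(file = out_file, height = 4, width = 8)
grid.arrange(g1, g2, ncol = 2)
dev.off()
```

With several panels side by side, a wider page (here 8 by 4 inches) usually keeps the individual plots from looking squeezed.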

1.2.3 Exercises

  1. Using ggplot and gridExtra, create four scatterplots of your choice (not the same as in the examples above) and arrange them into 2 rows and 2 columns.
  2. Save this image using pdf() and dev.off(), specifying an appropriate width and height, and insert this image into Word.

2 Hypothesis Testing in R

For Problem Set 3, you will need to carry out one- and two-sample hypothesis tests. Refer to the lecture notes for the theory behind these tests. What follows is a brief discussion of how to implement these tests in R. Let’s keep working with the states.csv dataset.


2.1 Chi-Squared Tests

Given a cross-tab, a chi-squared test assesses whether there is a “relation between the rows and columns”. The null hypothesis is that the row and column variables are statistically independent, in which case the expected cell counts follow from the marginal distributions of the rows and columns.

states <- read.csv("states.csv")

with(states, table(gb_win00, states$gay_policy))
##
## gb_win00   Conservative Liberal Most conservative Most liberal
##   Bush win            7       2                20            1
##   Gore win            3      12                 0            5
# Rearrange the order of the gay policy scale
states$gay_policy <- factor(states$gay_policy,
                            levels = c("Most liberal", "Liberal",
                                       "Conservative", "Most conservative"))

with(states, table(gb_win00, states$gay_policy))
##
## gb_win00   Most liberal Liberal Conservative Most conservative
##   Bush win            1       2            7                20
##   Gore win            5      12            3                 0

Class Exercise: What would this distribution look like if the cell values were approximately proportional to the marginal distributions?


Let’s do a chi-squared test on the actual distribution:

t <- with(states, table(gb_win00, states$gay_policy))
chisq.test(t)
##
##  Pearson's Chi-squared test
##
## data:  t
## X-squared = 30.63, df = 3, p-value = 1.015e-06

How do we interpret this output?
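
One way to see where the test statistic comes from is to rebuild it by hand. The sketch below retypes the observed counts from the cross-tab above into a matrix (the object names obs, expected, and x2 are just for illustration), computes the counts we would expect under independence, and sums the squared, scaled deviations.

```r
# Observed counts copied from the cross-tab above
obs <- matrix(c(1,  2, 7, 20,
                5, 12, 3,  0),
              nrow = 2, byrow = TRUE,
              dimnames = list(gb_win00   = c("Bush win", "Gore win"),
                              gay_policy = c("Most liberal", "Liberal",
                                             "Conservative", "Most conservative")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson's chi-squared statistic and its degrees of freedom
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# p-value: probability of a statistic at least this large under the null
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(round(expected, 1))
print(c(x2 = x2, df = df, p_value = p_value))
```

The statistic and degrees of freedom should agree with the chisq.test() output above. Some of the expected counts are below 5, a situation in which the chi-squared approximation can be rough and in which chisq.test() typically issues a warning.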


2.2 One-Sample t-Tests

In one-sample t-tests, we test whether an estimated mean can be statistically distinguished from a posited “true” population mean \(\mu_0\). Let’s test whether mean per capita income across states, estimated at 31951, actually is 30000. So, the null hypothesis defines \(\mu_0 =\) 30000. How weird would it be to see a sample mean at least as far from 30000 as 31951, given that \(\mu_0\) actually is 30000?

mean(states$prcapinc)
## [1] 31951
t.test(states$prcapinc, mu = 30000)
##
##  One Sample t-test
##
## data:  states$prcapinc
## t = 3.101, df = 49, p-value = 0.003193
## alternative hypothesis: true mean is not equal to 30000
## 95 percent confidence interval:
##  30687 33215
## sample estimates:
## mean of x
##     31951

How do we interpret this output?
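
To connect the output to the formula from lecture, the t statistic is the distance between the sample mean and \(\mu_0\), measured in standard errors. Here is a minimal sketch on simulated data (the vector x is made up and is not prcapinc), checked against t.test():

```r
set.seed(1)

# Made-up sample of 50 values; NOT the prcapinc variable
x   <- rnorm(50, mean = 32000, sd = 4500)
mu0 <- 30000
n   <- length(x)

# t statistic: (sample mean - hypothesized mean) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Compare with R's built-in test
res <- t.test(x, mu = mu0)
print(c(by_hand = t_stat,  t_test = unname(res$statistic)))
print(c(by_hand = p_value, t_test = res$p.value))
```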


2.3 Two-Sample t-Tests

In two-sample t-tests, we want to test whether two samples (or groups) come from populations with different means. For example, say we wanted to test whether the percentage of women in state legislatures differs between Southern and non-Southern states. Before we carry out this test, what is the null hypothesis? What is the alternative hypothesis?

The following carries out a Welch test, which doesn’t assume that the two groups have the same variance and uses the Welch-Satterthwaite adjustment of the degrees of freedom (usually resulting in non-integer degrees of freedom):

with(states, t.test(womleg ~ south))
##
##  Welch Two Sample t-test
##
## data:  womleg by south
## t = 3.301, df = 28.7, p-value = 0.002583
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.563 10.924
## sample estimates:
## mean in group Nonsouth    mean in group South
##                  25.41                  18.66

How do we interpret this output?
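
The non-integer degrees of freedom can be reproduced by hand. The following sketch uses simulated data (the data frame toy and its values are made up, not the womleg variable) to compute the Welch statistic and the adjusted degrees of freedom, and compares them with t.test():

```r
set.seed(2)

# Made-up outcome for two groups of unequal size and unequal variance
toy <- data.frame(y     = c(rnorm(34, mean = 25, sd = 7),
                            rnorm(16, mean = 19, sd = 5)),
                  group = rep(c("Nonsouth", "South"), times = c(34, 16)))

y1 <- toy$y[toy$group == "Nonsouth"]
y2 <- toy$y[toy$group == "South"]

# Squared standard error of each group mean
v1 <- var(y1) / length(y1)
v2 <- var(y2) / length(y2)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(y1) - mean(y2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(y1) - 1) + v2^2 / (length(y2) - 1))

# Compare with R's built-in Welch test (the default for two samples)
res <- t.test(y ~ group, data = toy)
print(c(by_hand = t_stat,   t_test = unname(res$statistic)))
print(c(by_hand = df_welch, t_test = unname(res$parameter)))
```

With the formula interface, t.test() takes the difference as the first factor level minus the second, which is why the confidence interval in the output above refers to Nonsouth minus South.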



The code used in this tutorial is available here.