psychRstats: Learning Statistics for Psychology in R: Correlation and Regression (Lab 09) Exercises, Completed

Justin Dainer-Best

This document is meant to be used to practice after you have completed the tutorial for today’s lab. Make sure to put your name as the author of the document, above!

If you intend to work on these exercises while referring to the tutorial, there are instructions on the wiki on how to do so. You may also want to refer to past labs. Don’t forget that previous labs are linked to on the main labs website.

Objectives

In the tutorial, we learned about using lm() and summary() for regressions, and cor() and cor.test() for correlations. You’ll use those functions, along with the plotting functions from library(ggplot2), to make further sense of the predictions data, including adding regression lines to your plots. You’ll also (briefly) practice filter() and a few other functions to clean up the data as provided.

You can find a completed version of these exercises at https://jdbest.github.io/psychRstats/answers.html

Don’t forget to (a) save and (b) knit the document frequently, so you’ll keep track of your work and also know where you run into errors.

Loading packages

As always, you must load packages if you intend to use their functions. Run the following code chunk to load necessary packages for these exercises.

library(tidyverse)

Importing data

As discussed in the tutorial, we’re using data from Beall, Hofer, & Schaller (2016).

Beall, A. T., Hofer, M. K., & Schaller, M. (2016). Infections and elections: Did an Ebola outbreak influence the 2014 U.S. federal elections (and if so, how)? Psychological Science, 27, 595-605. https://doi.org/10.1177/0956797616628861

Make sure you read the description of the study in the tutorial—it’s important for thinking about what we’re doing in these exercises.

In the tutorial, we used a “cleaned-up” version of the data. But let’s actually use the raw data here: that one is called beall_untidy.csv and should be in the same folder as this document.

The data was downloaded with this file. Load it using the read_csv() command—probably with the code below:

predictions <- read_csv("beall_untidy.csv")

Today’s task

For the questions below, create your own code chunks and insert all code into them.

Filter the data using the filter() function, then remove a column you won’t need. Be sure to assign the resulting data to itself or to a new data frame, so you can use it in the subsequent questions.
1. Remove the two lines at the very top for which there are NAs even in the Date and Month columns
2. Remove the column DJIA, either with select() (putting a - in front of the name will remove it) or by assigning NULL to predictions$DJIA
3. You should now have 65 observations and 14 variables in the Environment.

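# read_csv() already returns a tibble, so no conversion is needed
predictions <- predictions %>%
  filter(!is.na(Month)) %>%  # drop the two all-NA rows at the top
  select(-DJIA)              # remove the DJIA column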
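
The second option in step 2, assigning NULL to the column, can be sketched in base R as follows. The data frame toy and all of its values are made up for illustration; only the column names Month, Date, and DJIA come from the real data.

```r
# Toy stand-in for the raw data; every value here is made up for illustration.
toy <- data.frame(
  Month = c(NA, NA, 9, 9, 10),
  Date  = c(NA, NA, 29, 30, 1),
  DJIA  = c(NA, NA, 17000, 17100, 16900),
  Ebola.Search.Volume.Index = c(NA, NA, 12, 15, 40)
)

# Keep only rows where Month is present
# (base-R equivalent of filter(!is.na(Month))).
toy <- toy[!is.na(toy$Month), ]

# Assigning NULL to a column removes it, as an alternative to select(-DJIA).
toy$DJIA <- NULL

dim(toy)    # number of rows, then number of columns
names(toy)  # DJIA should no longer be listed
```

Calling dim() on your own predictions data after cleaning is a quick way to confirm the 65 observations and 14 variables mentioned in step 3.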

The authors report that “Across all days in the data set, [the Ebola-search-volume index] was very highly correlated with an index—computed from LexisNexis data—of the mean number of daily news stories about Ebola during the preceding week, r = .83, p < .001.” Calculate this correlation yourself using cor.test() and your predictions data. (You’ll use the columns Ebola.Search.Volume.Index and LexisNexisNewsVolumeWeek.) Then, briefly report the correlation. Is it significant?

cor.test(predictions$Ebola.Search.Volume.Index, predictions$LexisNexisNewsVolumeWeek)


    Pearson's product-moment correlation

data:  predictions$Ebola.Search.Volume.Index and predictions$LexisNexisNewsVolumeWeek
t = 11.759, df = 63, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7331684 0.8923563
sample estimates:
      cor 
0.8288528

There was a significant positive relationship between the Ebola-search-volume index and the LexisNexis index, $r(63) = .83$, 95% CI [.73, .89], $p < .001$.
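
Rather than retyping numbers from the printed output, you can pull the pieces of a cor.test() result out by name. The sketch below uses R’s built-in mtcars data as a stand-in (it is not the predictions data), and the layout of the report string is just one possible choice.

```r
# Stand-in example on the built-in mtcars data, not the Ebola data.
res <- cor.test(mtcars$mpg, mtcars$wt)

r    <- unname(res$estimate)    # Pearson's r
df   <- unname(res$parameter)   # degrees of freedom (n - 2)
ci   <- res$conf.int            # 95% confidence interval by default
pval <- res$p.value

# Assemble a short report string; "%%" prints a literal percent sign.
report <- sprintf("r(%d) = %.2f, 95%% CI [%.2f, %.2f], p %s",
                  as.integer(df), r, ci[1], ci[2],
                  ifelse(pval < .001, "< .001", sprintf("= %.3f", pval)))
report
```

The same element names (estimate, parameter, conf.int, p.value) should be available on the result of the cor.test() call for the predictions data.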

Plot that relationship using ggplot() + geom_point(). Add a theme and label the axes. Add a regression line using geom_smooth() or geom_abline() (you’ll get the data in the next question).

ggplot(predictions, aes(x = Ebola.Search.Volume.Index, 
                        y = LexisNexisNewsVolumeWeek)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  labs(x = "Ebola-search-volume index", y = "LexisNexis index")
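
geom_abline() is the other option mentioned in the question: it draws a line from an intercept and slope that you supply, typically taken from coef() on an lm() fit. A minimal sketch on the built-in mtcars data (a stand-in, not the predictions data) follows. Note that the model formula has to match the plot’s orientation (y-axis variable ~ x-axis variable), or the line will not fit the points.

```r
library(ggplot2)

# Stand-in example on built-in data; the formula is y-axis ~ x-axis.
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)   # named vector: "(Intercept)" and "wt"

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["wt"]]) +
  theme_classic() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```

With method = "lm", geom_smooth() fits this same straight line for you, so the two approaches should draw the same line when the formula matches the axes.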

Use the lm() function to create a regression model of the same relationship. Then use summary() to get the results. Report them succinctly below. Also report what parallels exist between the numbers from this regression and the correlation.

model <- lm(Ebola.Search.Volume.Index ~ LexisNexisNewsVolumeWeek, 
            data = predictions)
summary(model)


Call:
lm(formula = Ebola.Search.Volume.Index ~ LexisNexisNewsVolumeWeek, 
    data = predictions)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.615  -7.050  -1.244   9.823  24.349 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -1.91116    2.73372  -0.699    0.487    
LexisNexisNewsVolumeWeek  0.15516    0.01319  11.759   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.88 on 63 degrees of freedom
Multiple R-squared:  0.687, Adjusted R-squared:  0.682 
F-statistic: 138.3 on 1 and 63 DF,  p-value: < 2.2e-16

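There was a statistically significant relationship between the two indexes, $b = 0.16$, $p < .001$, with an $R^2$ of .69, $F(1, 63) = 138.3$, $p < .001$. The regression parallels the correlation: the slope’s $t$ value (11.759 on 63 degrees of freedom) is identical to the $t$ from cor.test(), and $R^2 = .687$ is the square of $r = .829$.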
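
The link between a correlation and a one-predictor regression can be checked numerically. The sketch below uses the built-in mtcars data as a stand-in (not the predictions data) to show three identities, then repeats the first one with the values printed in the outputs above.

```r
# Stand-in example on built-in data, not the Ebola data.
ct  <- cor.test(mtcars$mpg, mtcars$wt)
fit <- lm(mpg ~ wt, data = mtcars)
sm  <- summary(fit)

r <- unname(ct$estimate)

# 1. Squaring r gives the regression's R-squared.
all.equal(r^2, sm$r.squared)

# 2. The t statistic for the slope equals the t statistic for r.
slope_t <- sm$coefficients["wt", "t value"]
all.equal(slope_t, unname(ct$statistic))

# 3. The slope is r rescaled by the two standard deviations.
slope <- unname(coef(fit)["wt"])
all.equal(slope, r * sd(mtcars$mpg) / sd(mtcars$wt))

# The same check with the r printed by cor.test() above:
round(0.8288528^2, 3)   # compare with "Multiple R-squared" in summary(model)
```

These identities hold only for a regression with a single predictor; with more predictors, $R^2$ is no longer the square of any one pairwise correlation.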

Use filter() to select only the scores from the two-week period including the last week of September and the first week of October. You could look at the Month and Date columns… but the third column might be more helpful. Don’t forget to assign this to a new data frame so we can use it.

highanxtime <- filter(predictions, Two.weeks.prior.to.outbreak.only==1)

On the full dataset, run the correlation analyses we did in the tutorial, for the association between Ebola search volume index and voter intention index.

With the filtered two-week data from the filtering step above, re-run the correlation analyses for the association between Ebola search volume index and voter intention index. Is the correlation higher or lower?

cor.test(highanxtime$Voter.Intention.Index, 
         highanxtime$Ebola.Search.Volume.Index)


    Pearson's product-moment correlation

data:  highanxtime$Voter.Intention.Index and highanxtime$Ebola.Search.Volume.Index
t = 15.975, df = 6, p-value = 3.821e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9351079 0.9979890
sample estimates:
      cor 
0.9884478

It’s much higher, although note that there are many fewer data points: with df = 6, the filtered data has only 8 observations, compared with 65 in the full data set.
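
Sample size also shapes the confidence intervals in the two cor.test() outputs. For Pearson’s r, cor.test() bases the interval on Fisher’s z transformation, which can be reproduced by hand. The helper fisher_ci() below is a hypothetical name written for this illustration, not a function from any package; the r values are copied from the outputs above, and n is df + 2.

```r
# Hypothetical helper (not part of any package): CI for Pearson's r
# via Fisher's z transformation.
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                    # Fisher's z
  se   <- 1 / sqrt(n - 3)             # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
  tanh(z + c(-1, 1) * crit * se)      # back-transform to the r scale
}

full_ci   <- fisher_ci(0.8288528, n = 65)  # full data: df = 63
subset_ci <- fisher_ci(0.9884478, n = 8)   # two-week subset: df = 6

round(full_ci, 2)
round(subset_ci, 2)

# Standard error on the z scale for each sample size
1 / sqrt(c(65, 8) - 3)
```

On the z scale, the subset’s standard error is more than three times larger than the full data’s. Its interval only looks narrow on the r scale because r is so close to 1.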

Plot that filtered data. Add a regression line. Label the axes and add a theme.

ggplot(highanxtime, 
       aes(x = Ebola.Search.Volume.Index, y = Voter.Intention.Index)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  labs(x = "Ebola-search-volume index", y = "Voter-intention index")
