Linear regression and correlation can be considered two sides of the same statistical analysis, one that uses two measurement variables.
They have three different uses:
As an example, you could use linear regression and correlation to compare the relationship between the height and diameter of a sample of trees, since both of these are measurement variables.
You may already be familiar with the visual representation of these tests, as they're typically visualized as a scatterplot with the best fitting regression line overlaid. Different types of regression have different definitions for what "best fit" means, but the most common is called an "ordinary least-squares regression". We're not going to dive too far into what this means, but in simple terms, the best fit line in this case is the one that minimizes the sum, across all the data points, of the squared vertical distances between the points and the line. So, essentially, this regression line goes through the middle of the data. The equation for this best fitting line is one of the things we're after when we conduct a linear regression, and typically looks something like this:
Ŷ = a + bX

where Ŷ is the predicted value of Y (i.e., measurement variable 1) for a given value of X (i.e., measurement variable 2), b is the slope of the line, and a is the intercept. We'll talk about this in more detail on the next page.
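To make the equation concrete, here's a minimal sketch of fitting a line in R. The data below are made up for illustration (this is not the treeGrowth dataset, even though the column names echo it); the point is just that `lm()` hands back the intercept and slope, which plug straight into the equation above.

```r
# Made-up illustration data (NOT the treeGrowth dataset)
set.seed(1)
height   <- runif(30, 5, 20)                           # X: measurement variable 2
diameter <- 0.8 + 0.3 * height + rnorm(30, sd = 0.5)   # Y: measurement variable 1

fit <- lm(diameter ~ height)   # ordinary least-squares fit
a <- unname(coef(fit)[1])      # intercept
b <- unname(coef(fit)[2])      # slope

# Predicted Y for a given X, straight from the equation...
y_hat <- a + b * 10
# ...which matches what predict() computes for the same X
predict(fit, newdata = data.frame(height = 10))
```

Note that `coef()` returns the intercept first and the slope second, matching the a and b in the equation.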
As an example, take a look at the two scatterplots shown below (from Navarro, 2019, chapter 15). The data show hours of sleep each night (x-axis) and level of grumpiness during the following day (y-axis), and the scatterplots show, first, the best fitting regression line, and, next, a not-so-well fitting regression line.
The strength of the relationship between your two variables essentially tells you how tightly your data points cluster around the best fit regression line (not quite accurate, but good enough for our purposes). There are two common ways of reporting this, and the one you use will largely depend on your discipline. One common way is to calculate and report r, the "correlation coefficient" (also called the Pearson correlation coefficient or Pearson's r). The other common way is to report r², the "coefficient of determination", which is just the correlation coefficient squared (there's also an adjusted r² you can report if necessary). The value of r ranges from -1 to 1, with a value of -1 indicating a perfect negative relationship, a value of 1 indicating a perfect positive relationship, and a value of 0 indicating no relationship. The value of r² ranges from 0 to 1, with a value of 0 indicating no relationship, and a value of 1 indicating a perfect relationship.
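You can see the relationship between r and r² directly in R. This sketch again uses made-up data (not the treeGrowth dataset): `cor()` gives Pearson's r, squaring it gives r², and that value matches the "Multiple R-squared" that `summary()` reports for the corresponding simple regression.

```r
# Made-up illustration data (NOT the treeGrowth dataset)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r  <- cor(x, y)   # Pearson's r: always between -1 and 1
r2 <- r^2         # coefficient of determination: always between 0 and 1

# For a simple (one-predictor) regression, r^2 equals the
# "Multiple R-squared" line in summary(lm())
summary(lm(y ~ x))$r.squared
```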
We'll be using the treeGrowth dataset (linked below) for the examples in the video.
(Don't hesitate to use the player controls to pause, rewind, or slow down the video as needed! A thorough understanding of the concepts is vastly preferable to just speeding through.)
Here's a quick reference table for the functions discussed in the video.
Function | Example | Notes
---|---|---
`lm(formula = [measurement variable 1] ~ [measurement variable 2])` | `lm(formula = treeGrowth$diameter ~ treeGrowth$height)` | call with `summary()` for the full stats table
`cor.test(x = [measurement variable 1], y = [measurement variable 2])` | `cor.test(x = treeGrowth$diameter, y = treeGrowth$height)` | reports r rather than r²
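Putting the two table rows together, here's a hedged sketch of the full workflow. Since the real treeGrowth dataset is linked rather than reproduced here, this example builds a stand-in data frame with the same column names; with the actual dataset loaded, the two calls at the bottom are the ones you'd run.

```r
# Stand-in data frame mimicking treeGrowth's columns
# (use the real treeGrowth dataset linked above for the quiz)
set.seed(7)
treeGrowth <- data.frame(height = runif(40, 3, 30))
treeGrowth$diameter <- 0.1 + 0.04 * treeGrowth$height + rnorm(40, sd = 0.05)

# Full regression table: coefficients (a and b), "Multiple R-squared" (r^2),
# and the p-value for the slope
summary(lm(formula = treeGrowth$diameter ~ treeGrowth$height))

# Correlation test: reports r (labelled "cor"), a confidence interval,
# and a p-value
cor.test(x = treeGrowth$diameter, y = treeGrowth$height)
```

For a simple regression like this, squaring the "cor" value from `cor.test()` reproduces the "Multiple R-squared" from `summary()`, which is why the two functions are two ways of reporting the same relationship.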
You will need the "treeGrowth" dataset (linked below) in order to complete this quiz.