Linear regression and correlation can be considered two sides of the same statistical analysis, one that uses two measurement variables.
They have three different uses:
As an example, you could use linear regression and correlation to compare the relationship between the height and diameter of a sample of trees, since both of these are measurement variables.
You may already be familiar with the visual representation of these tests, as they're typically visualized as a scatterplot with the best fitting regression line overlaid. Different types of regression have different definitions for what "best fit" means, but the most common is called an "ordinary least-squares regression". We're not going to dive too far into what this means, but in simple terms, the best fit line in this case is the one that minimizes the sum, across all the data points, of the squared vertical distances between the points and the line. So, essentially, this regression line goes through the middle of the data. The equation for this best fitting line is one of the things we're after when we conduct a linear regression, and typically looks something like this:
Ŷ = a + bX

where Ŷ is the predicted value of Y (i.e., measurement variable 1) for a given value of X (i.e., measurement variable 2), b is the slope of the line, and a is the intercept. We'll talk about this in more detail on the next page.
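To make the equation concrete, here's a minimal sketch of fitting a line in R. The data below are made up for illustration (this is not the treeGrowth dataset, even though the column names echo it); the point is just that `lm()` hands back the intercept and slope, which plug straight into the equation above.

```r
# Made-up illustration data (NOT the treeGrowth dataset)
set.seed(1)
height   <- runif(30, 5, 20)                           # X: measurement variable 2
diameter <- 0.8 + 0.3 * height + rnorm(30, sd = 0.5)   # Y: measurement variable 1

fit <- lm(diameter ~ height)   # ordinary least-squares fit
a <- unname(coef(fit)[1])      # intercept
b <- unname(coef(fit)[2])      # slope

# Predicted Y for a given X, straight from the equation...
y_hat <- a + b * 10
# ...which matches what predict() computes for the same X
predict(fit, newdata = data.frame(height = 10))
```

Note that `coef()` returns the intercept first and the slope second, matching the a and b in the equation.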
As an example, take a look at the two scatterplots shown below (from Navarro, 2019, chapter 15). The data show hours of sleep each night (x-axis) and level of grumpiness during the following day (y-axis), and the scatterplots show, first, the best fitting regression line, and, next, a not-so-well fitting regression line.
The strength of the relationship between your two variables essentially tells you how tightly your data points cluster around the best fit regression line (not quite accurate, but good enough for our purposes). There are two common ways of reporting this, and the one you use will largely depend on your discipline. One common way is to calculate and report r, the "correlation coefficient" (also called the Pearson correlation coefficient or Pearson's r). The other common way is to report r², the "coefficient of determination", which is just the correlation coefficient squared (there's also an adjusted r² you can report if necessary). The value of r ranges from -1 to 1, with a value of -1 indicating a perfect negative relationship, a value of 1 indicating a perfect positive relationship, and a value of 0 indicating no relationship. The value of r² ranges from 0 to 1, with a value of 0 indicating no relationship, and a value of 1 indicating a perfect relationship.
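You can see the relationship between r and r² directly in R. This sketch again uses made-up data (not the treeGrowth dataset): `cor()` gives Pearson's r, squaring it gives r², and that value matches the "Multiple R-squared" that `summary()` reports for the corresponding simple regression.

```r
# Made-up illustration data (NOT the treeGrowth dataset)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r  <- cor(x, y)   # Pearson's r: always between -1 and 1
r2 <- r^2         # coefficient of determination: always between 0 and 1

# For a simple (one-predictor) regression, r^2 equals the
# "Multiple R-squared" line in summary(lm())
summary(lm(y ~ x))$r.squared
```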
We'll be using the treeGrowth dataset (linked below) for the examples in the video.
(Don't hesitate to use the player controls to pause, rewind, or slow down the video as needed! A thorough understanding of the concepts is vastly preferable to just speeding through.)
Here's a quick reference table for the functions discussed in the video.
Function | Example | Notes
---|---|---
`lm(formula = [measurement variable 1] ~ [measurement variable 2])` | `lm(formula = treeGrowth$diameter ~ treeGrowth$height)` | call with `summary()` for the full stats table
`cor.test(x = [measurement variable 1], y = [measurement variable 2])` | `cor.test(x = treeGrowth$diameter, y = treeGrowth$height)` | reports r rather than r²
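Putting the two table rows together, here's a hedged sketch of the full workflow. Since the real treeGrowth dataset is linked rather than reproduced here, this example builds a stand-in data frame with the same column names; with the actual dataset loaded, the two calls at the bottom are the ones you'd run.

```r
# Stand-in data frame mimicking treeGrowth's columns
# (use the real treeGrowth dataset linked above for the quiz)
set.seed(7)
treeGrowth <- data.frame(height = runif(40, 3, 30))
treeGrowth$diameter <- 0.1 + 0.04 * treeGrowth$height + rnorm(40, sd = 0.05)

# Full regression table: coefficients (a and b), "Multiple R-squared" (r^2),
# and the p-value for the slope
summary(lm(formula = treeGrowth$diameter ~ treeGrowth$height))

# Correlation test: reports r (labelled "cor"), a confidence interval,
# and a p-value
cor.test(x = treeGrowth$diameter, y = treeGrowth$height)
```

For a simple regression like this, squaring the "cor" value from `cor.test()` reproduces the "Multiple R-squared" from `summary()`, which is why the two functions are two ways of reporting the same relationship.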
You will need the "treeGrowth" dataset (linked below) in order to complete this quiz.