Introduction to Data Analysis and R: Analysis of Variance

Analysis of Variance

The analysis of variance, or anova, is another commonly encountered inferential test. Like the t-test, it has several different forms. We will focus our efforts on one-way anovas, but you should at least finish the module with a basic understanding of two-way anovas as well.

One-way Anova

A one-way anova (also called one-factor, single-factor, or single-classification anova) is an analysis that uses one measurement variable and one nominal variable. This may sound familiar, because a two-sample t-test uses the same kinds of variables. In fact, when the nominal variable has only two possible values, a one-way anova is mathematically equivalent to a two-sample t-test (with equal variances assumed); the difference is that the nominal variable in a one-way anova can have more than two possible values. The statistical null hypothesis for this analysis is that the means of the measurement variable are all the same for the different categories of the nominal variable.
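The equivalence with the two-sample t-test can be checked numerically. The sketch below is only an illustration: it uses R's built-in chickwts dataset as a stand-in (an assumption, not the course dataset) and keeps two of its feed groups, so that both tests can be run on the same data.

```r
# chickwts ships with R (columns: weight, feed); used here as a stand-in dataset.
# Keep only two feed groups so that a t-test is applicable.
two <- droplevels(subset(chickwts, feed %in% c("casein", "soybean")))

# Two-sample t-test with equal variances assumed (Student's t-test)
tt <- t.test(weight ~ feed, data = two, var.equal = TRUE)

# One-way anova on the same two groups; [[1]] extracts the anova table
av <- summary(aov(weight ~ feed, data = two))[[1]]

t_squared <- unname(tt$statistic)^2   # squared t statistic
f_value   <- av[["F value"]][1]       # F statistic for feed
p_ttest   <- tt$p.value
p_anova   <- av[["Pr(>F)"]][1]

print(c(t_squared = t_squared, F = f_value))
print(c(p_ttest = p_ttest, p_anova = p_anova))
```

With two groups, the F statistic equals the square of the t statistic and the two p-values agree. Note that R's t.test() defaults to Welch's test (var.equal = FALSE), which would not match the anova exactly.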

As an example, let's say you have a large sample of chickens and are interested in seeing what effect several different types of food have on the animals. Each chicken is assigned to one group, and each group is fed a different food. Chicken weight is the measurement variable, and food type is the nominal variable. (If there were only two groups, you could use a two-sample t-test, but since we have several groups, we need to use an anova.)
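A minimal sketch of such an analysis in R follows. It assumes R's built-in chickwts dataset (chick weights for six feed types) as a stand-in for the example described here, and the variable name aovChick is chosen only for illustration.

```r
# chickwts ships with R: weight is the measurement variable,
# feed is the nominal variable (six feed types).
aovChick <- aov(weight ~ feed, data = chickwts)

# summary() prints the anova table; [[1]] extracts it as a data frame
aovTable <- summary(aovChick)[[1]]
print(summary(aovChick))

# p-value for the null hypothesis that all feed groups have the same mean weight
p_feed <- aovTable[["Pr(>F)"]][1]
print(p_feed)
```

The formula uses column names together with the data argument, which is a common alternative to writing the dataset name and $ in front of each variable.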

Post-hoc tests

An important thing to keep in mind: with anovas, your statistical null hypothesis is that the population means of your measurement variable are equal across all groups. For our chicken feed example above, let's say we are testing four different foods: A, B, C, and D. We could state our statistical null hypothesis like this:

$H_{0}:\mu_A=\mu_B=\mu_C=\mu_D$

An anova only tests whether this null hypothesis is plausible. If the null hypothesis is rejected, you know that at least one mean differs from the others, but not where the difference lies. Thus, if the results of your analysis indicate statistical significance, you will need to perform one or more post-hoc tests to investigate where the means differ.

It's possible to separate your comparisons into pairs (in this case, A vs. B; A vs. C; A vs. D; B vs. C; B vs. D; and C vs. D) and conduct a series of t-tests to try to determine whether any of these pairings show significant differences. However, each test carries its own chance of a false positive, and these chances accumulate across the six tests, so the probability of at least one false positive is considerably higher than for a single test. For that reason, this should not be your approach.
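To get a rough sense of the size of this problem, the sketch below assumes a significance level of 0.05 for each test and treats the six tests as independent. That is a simplification, since the pairwise comparisons share data, so the result is an approximation only.

```r
alpha    <- 0.05                   # assumed per-test significance level
n_groups <- 4                      # foods A, B, C, and D
n_pairs  <- choose(n_groups, 2)    # number of pairwise comparisons

# Probability of at least one false positive across all comparisons,
# if every null hypothesis is true and the tests are independent
fwer <- 1 - (1 - alpha)^n_pairs

print(n_pairs)
print(round(fwer, 3))
```

Under these assumptions, the chance of at least one false positive is about 26% rather than 5%.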

There are several more appropriate options for post-hoc analysis. One that is commonly encountered is Tukey's test. This test calculates a minimum significant difference (MSD) for each pair of means, which depends on the sample size of each group, the variation within the groups, and the total number of groups. It then compares the observed difference between each pair of means with the MSD; if the observed difference is greater than the MSD, that pair of means is significantly different. We'll look at this in more detail in RStudio.
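The comparison against the MSD can be sketched with TukeyHSD() from base R. The built-in chickwts dataset is again used as a stand-in (an assumption). Reading the MSD off as half the width of each 95% confidence interval is one way to recover it from the output, not necessarily how it is presented elsewhere in this module.

```r
# Fit the one-way anova, then run Tukey's test on it
aovChick <- aov(weight ~ feed, data = chickwts)
tukey    <- TukeyHSD(aovChick)   # default confidence level is 0.95

# One row per pair of feeds; columns: diff, lwr, upr, p adj
pairs <- tukey$feed
print(pairs)

# Half the width of each confidence interval acts as that pair's MSD
msd <- (pairs[, "upr"] - pairs[, "lwr"]) / 2

# A pair differs significantly when its observed difference exceeds its MSD
exceeds_msd <- abs(pairs[, "diff"]) > msd
print(rownames(pairs)[exceeds_msd])
```

The pairs flagged this way are the same pairs whose adjusted p-value ("p adj") is below 0.05.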

R - Anovas

We'll be using the chickenFeed dataset (linked below) for the examples in the video.

chickenFeed dataset

(Don't hesitate to use the player controls to pause, rewind, or slow down the video as needed! A thorough understanding of the concepts is vastly preferable to just speeding through.)

Recap

Here is a quick reference table with the functions discussed in the video.

Function Structure	Example	Notes
aov( formula = [measurement variable] ~ [nominal variable] )	aov( formula = chickenFeed$weight ~ chickenFeed$feed )	store anova as new variable (e.g., aovChick), and call with summary() to see full table
TukeyHSD( [anova] )	TukeyHSD( aovChick )	use to see full table of p-values
HSD.test( [anova], "[group names]")	HSD.test( aovChick, "chickenFeed$feed" )	found in agricolae package; call with print() to see streamlined comparison table

Anova practice

You will need the "insecticides" dataset (linked below) in order to complete this quiz.

insecticides

Two-way Anovas

A two-way anova (also called a factorial anova, with two factors) is an analysis that uses one measurement variable and two nominal variables (which are often called "factors" or "main effects"), where the nominal variables are found in all possible combinations. A common experimental design for this analysis is repeated measures, where multiple observations have been made on the same individual (e.g., measurements taken at different time points).

This analysis tests three different null hypotheses:

the means of the measurement variable are equal for different values of the first nominal variable;
the means of the measurement variable are equal for different values of the second nominal variable; and
there is no interaction between the two nominal variables, i.e., the effect of one nominal variable does not depend on the value of the other.

As an example, let's say you're conducting an experiment to see the effect that different doses of different supplements have on tooth growth in beavers. You're testing two different supplements, each at three different concentrations. Tooth growth here is the measurement variable, supplement type is one of the nominal variables, and supplement concentration is the other nominal variable.
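For orientation only, here is a minimal sketch of what a two-way anova looks like in R. It assumes R's built-in ToothGrowth dataset as a stand-in: that dataset records tooth length in guinea pigs (not beavers) for two supplement types at three doses, with each animal measured once. The variable names tg and aovTooth are chosen for illustration.

```r
# ToothGrowth ships with R: len (tooth length), supp (supplement type), dose
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # treat dose as a nominal variable, not a number

# supp * dose expands to both main effects plus their interaction
aovTooth <- aov(len ~ supp * dose, data = tg)

# One row per null hypothesis (supp, dose, supp:dose), plus the residuals
toothTable <- summary(aovTooth)[[1]]
print(summary(aovTooth))
```

Each of the first three rows of the table corresponds to one of the three null hypotheses listed above.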

Two-way anovas are more complex than the comparatively simple one-way anova, particularly when you start getting into the different kinds of post-hoc analyses that may be required. As such, a detailed treatment of two-way anovas is beyond the scope of this module. If your data require such an analysis, your best bet at this point is to consult a statistician.