One-way ANOVA


One-way ANOVA is a parametric test designed to compare the means of three or more groups. The null hypothesis states that the means of all groups to be tested are equal. As usual, the test returns a p-value, and you decide whether or not to reject the null hypothesis based on this p-value.

Assumptions are:

  • independence of observations (each individual is represented by 1 entry/measurement ONLY)
  • normality of distribution (to be tested for each group, for example with the Shapiro-Wilk test)
  • homogeneity of variance (to be tested with, for example, Levene’s test).

The function to use in R is lm() followed by anova(). This combination fits a linear model and works in virtually all cases. A second option involves the function aov(); however, be aware that this option is restricted to balanced designs (where groups have equal numbers of entries, i.e. the number of observations is the same for all groups).
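A quick way to check whether your design is balanced (just a convenience sketch; my.data and group are placeholder names standing for your own dataframe and grouping factor) is to count the observations per group:

# hypothetical objects: my.data is a dataframe, group is its grouping factor
table(my.data$group)   # equal counts in every group = balanced design, so aov() is an option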

 

Let’s take an example. Here, let’s say that we want to check whether the average size of blue ground beetles (Carabus intricatus) differs depending on their location. We consider 3 different locations, for example 3 forests beautifully named A, B and C. In each location, we measure the size (in millimeters) of 10 individuals.

In Excel, the table containing the data would look like this:

[Screenshot: the data table in Excel]

To create the corresponding dataframe in R, use the following code:

size <- c(25,22,28,24,26,24,22,21,23,25,26,30,25,24,21,27,28,23,25,24,20,22,24,23,22,24,20,19,21,22)
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))
my.dataframe <- data.frame(size,location)
my.dataframe

and the resulting dataframe is:

[Screenshot: the resulting dataframe printed in R]
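If you prefer a quick text check over a screenshot (optional), str() and head() summarise the same thing:

str(my.dataframe)    # 30 obs. of 2 variables: size (numeric) and location (factor with 3 levels)
head(my.dataframe)   # prints the first 6 rows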

 

It is always nice and useful to get an overview of the whole dataset, so let’s plot the data:

plot(size~location, data=my.dataframe)

[Screenshot: plot of size by location]
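Because location is a factor, plot() draws one box per location. If you also want the group means and standard deviations as numbers, one possibility (an optional extra using base R's aggregate()) is:

# mean and standard deviation of size for each location
aggregate(size ~ location, data = my.dataframe, FUN = function(x) c(mean = mean(x), sd = sd(x)))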

 

Now we need to check the assumptions of normality of distribution and homogeneity of variance. We thus run the Shapiro-Wilk test on each group and then Levene’s test (for which you will need to load/activate the package car via the command library(car)).

library(car)
shapiro.test(my.dataframe$size[location=="ForestA"])
shapiro.test(my.dataframe$size[location=="ForestB"])
shapiro.test(my.dataframe$size[location=="ForestC"])
leveneTest(size~location, data=my.dataframe, center=mean)

[Screenshot: output of the three Shapiro-Wilk tests and of Levene’s test]

So, each of the 3 groups (ForestA, ForestB and ForestC) is assumed to come from a normal distribution since the p-value of the Shapiro-Wilk test is greater than 0.05; additionally, variances are not significantly different according to Levene’s test (p-value greater than 0.05).

Note: if you are a bit confused about the way data/groups are retrieved for running the Shapiro-Wilk test, here is a quick explanation. Let’s consider the group ForestA: we need to tell the function to retrieve all size data located in the object my.dataframe (hence my.dataframe$size), but we need to restrict the selection to data matching the criterion ForestA only (hence [location=="ForestA"]). Putting everything together, we write my.dataframe$size[location=="ForestA"] inside shapiro.test().
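This works here because the vector location also exists on its own in the workspace. Two equivalent ways to write the same call that refer only to the dataframe (just alternatives, the version above works fine) are:

shapiro.test(my.dataframe$size[my.dataframe$location == "ForestA"])   # explicit column reference
shapiro.test(subset(my.dataframe, location == "ForestA")$size)        # same selection via subset()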

 

Let’s see how to run the ANOVA

We consider the first option using lm(). The syntax is lm(variable ~ groups, data=dataframe), where variable is the vector that contains the response variable, groups is the vector that contains the grouping variable or factor (which categorizes the observations) and dataframe is the name of the dataframe that contains the data. We first fit a linear model with lm(), store the result in the object results.lm, and then print the ANOVA table using anova():

results.lm <- lm(size~location, data=my.dataframe)
anova(results.lm)

 

[Screenshot: ANOVA table produced by anova(results.lm)]

This output provides you with the F-value (7.1101) and the corresponding p-value (0.003307). Since the p-value is below 0.05, the null hypothesis stating that the means of the groups are equal is to be rejected.
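If you prefer to extract these numbers programmatically rather than read them off the table (an optional extra, not part of the original walkthrough), the object returned by anova() is a data frame whose first row corresponds to location:

results.anova <- anova(results.lm)   # the ANOVA table, stored as a data frame
results.anova[["F value"]][1]        # F-value for location
results.anova[["Pr(>F)"]][1]         # corresponding p-value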

The second option is to run the ANOVA on the dataframe with aov(). The syntax is very similar to lm(). Here, we store the results in the object results, then print a summary of them using summary(results):

results <- aov(size~location, data=my.dataframe)
summary(results)

[Screenshot: output of summary(results)]

This output gives the value of the F statistic (here F = 7.11) and the p-value (0.00331), and you will quickly notice that these are very close to the results obtained with lm(), at least in this example. Here, the ANOVA tells us that the null hypothesis is to be rejected and that there is a significant difference between some of the groups, nothing more.

But this does not tell us which of the groups have significantly different means…

Indeed, the ANOVA needs to be followed by another test if we want to find out which of the groups differ from the others. For that we’ll need a post-hoc test, such as a pairwise t-test or Tukey’s HSD (a quick sketch follows).
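As a rough sketch of what this next step could look like for our dataframe (both functions are in base R's stats package; the Bonferroni correction is just one possible choice):

# option 1: pairwise t-tests with a correction for multiple comparisons
pairwise.t.test(my.dataframe$size, my.dataframe$location, p.adjust.method = "bonferroni")
# option 2: Tukey HSD, applied to the aov() fit stored in 'results' above
TukeyHSD(results)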

 

What to do if the assumption of normality is not met?

In this case you may simply apply the non-parametric Kruskal-Wallis test.

The syntax is the following:

kruskal.test(size~location, data=my.dataframe)

and the output looks like this:
[Screenshot: output of kruskal.test()]

Again, the test shows that the null hypothesis may be rejected: there is a significant difference between some of the groups.