For a simple exercise to understand one-way ANOVA, I will use the data set red.cell.folate
from the package ISwR
(see the book “Introductory Statistics with R” by Peter Dalgaard), but I will also generate my own data.
kbtools::libmgr("l", c("kbtools", "ISwR", "ggplot2", "reshape2"))
set.seed(117)
head(red.cell.folate)
## folate ventilation
## 1 243 N2O+O2,24h
## 2 251 N2O+O2,24h
## 3 275 N2O+O2,24h
## 4 291 N2O+O2,24h
## 5 347 N2O+O2,24h
## 6 354 N2O+O2,24h
mean(red.cell.folate$folate)
## [1] 283.2273
ggplot(red.cell.folate) +
geom_boxplot(aes(ventilation, folate), alpha=I(1/4), color="darkgray") +
geom_jitter(aes(ventilation, folate, color=ventilation), size=6) +
geom_segment(aes(x=0.5, xend=3.5, y=mean(folate), yend=mean(folate)), color="orange", size=1) +
labs(title="Orange line shows the average for ALL groups")

And now (drum roll) … it’s time to run the ANOVA
with(red.cell.folate, anova(lm(folate ~ ventilation)))
## Analysis of Variance Table
##
## Response: folate
## Df Sum Sq Mean Sq F value Pr(>F)
## ventilation 2 15516 7757.9 3.7113 0.04359 *
## Residuals 19 39716 2090.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s look at what this ANOVA table means.
A bit of background first. In a set of multiple groups, each observation can be linked to the mean of its own group, and that group mean can be linked to the mean of all observations (a.k.a. “the grand mean”), so that \[x_{ij}=(x_{ij}-\bar{x}_{i})+(\bar{x}_i-\bar{x})+\bar{x}\] which corresponds to the model \[x_{ij}=\mu + \alpha_i + \epsilon_{ij}\] where \(\mu\) is the grand mean, \(\alpha_i\) is the effect of group \(i\), and the error terms \(\epsilon_{ij}\) are independent and identically distributed, \[\epsilon_{ij}\sim N(0,\sigma^2)\]
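To make this decomposition concrete, here is a minimal sketch in R (using the red.cell.folate data loaded above) that splits each observation into the grand mean, its group effect, and its residual:
# Sketch: verify x_ij = grand mean + group effect + residual
with(red.cell.folate, {
  grand    <- mean(folate)               # grand mean, x-bar
  group_mu <- ave(folate, ventilation)   # each observation's own group mean, x-bar_i
  alpha    <- group_mu - grand           # group effect
  eps      <- folate - group_mu          # residual
  all.equal(folate, grand + alpha + eps) # TRUE -- the identity holds exactly
})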
The null hypothesis that all groups are the same implies that all \(\alpha_{i}\) are zero.
The total sum of squares (a.k.a. variation or SSD) equals the SSD “between” groups plus the SSD “within” groups: \[SSD_{total}=SSD_{B}+SSD_{W}\] The first row in the ANOVA table above is the SSD “between” groups, labeled by the “treatment” or “grouping” factor. The last row is the “within” groups SSD, also known as “residuals” or “errors.”
SSD component | name | ANOVA table row | df | a.k.a. |
---|---|---|---|---|
\(SSD_{B}\) | “between groups” | first | k-1 | treatment, grouping |
\(SSD_{W}\) | “within groups” | last | N-k | residuals, error |
The sums (SSDs) can and should be normalized by their respective degrees of freedom (df) to calculate mean squares. If N is the total number of observations and k is the number of groups, the df for \(SSD_{B}\) is k-1 and for \(SSD_{W}\) is N-k. Therefore
\[ \begin{aligned} MS_{B} &= \frac{SSD_{B}}{k-1} \\ MS_{W} &= \frac{SSD_{W}}{N-k} \end{aligned} \]
The \(MS_{W}\) is the pooled variance from combining the individual group variances and is thus an estimate of \(\sigma^2\). If there is no group effect, \(MS_{B}\) will be approximately equal to \(MS_{W}\) (and thus also provide an estimate of \(\sigma^2\)), but if there is a group effect, \(MS_{B}\) will tend to be larger.
The F statistic is \(\frac{MS_{B}}{MS_{W}}\). Under the null hypothesis, F should be close to 1, fluctuating around it only because of random variation (and note that the test is one-sided: only large values of F count against the null). In the case above, \(F=\frac{7758}{2090} = 3.71\).
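As a sanity check, the pieces of the table can be recomputed by hand. A quick sketch, again with red.cell.folate; the results should match the ANOVA table above:
# Sketch: rebuild the ANOVA table entries by hand
with(red.cell.folate, {
  N        <- length(folate)             # total number of observations (22)
  k        <- nlevels(ventilation)       # number of groups (3)
  grand    <- mean(folate)               # grand mean
  group_mu <- ave(folate, ventilation)   # each observation's group mean
  SSD_B <- sum((group_mu - grand)^2)     # "between" sum of squares, ~15516
  SSD_W <- sum((folate - group_mu)^2)    # "within" sum of squares, ~39716
  MS_B  <- SSD_B / (k - 1)               # ~7758
  MS_W  <- SSD_W / (N - k)               # ~2090, the pooled variance
  c(F = MS_B / MS_W)                     # ~3.71
})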
You reject the null if F is larger than the 95% quantile of an F distribution with k-1 and N-k degrees of freedom, referred to as \(F_{k-1,N-k}\) (in this case, k=3 and N=22).
Just because we can, let’s see what such an F distribution looks like
fdist <- data.frame(F = rf(1000, 2, 19))
ggplot(fdist) + geom_histogram(aes(F), binwidth=prettyb(fdist$F), fill = "steelblue") +
geom_segment(x = qf(.95,2,19), xend = qf(.95,2,19), y=0, yend=50, color="red", size=1) +
annotate("text", x=3.5, y=55, label="95th percentile", size=5, color="red") +
scale_x_continuous(limits = c(0, 8), breaks = seq(0, 8, 1))

Because \(F=3.71\) is larger than \(qf(.95, 2, 19) = 3.5218933\), we reject the null hypothesis, and conclude that the population means are likely different. Another way of putting this is that \(p = 1-pf(3.71, 2, 19) = 0.0435697\).
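The same decision, spelled out in R:
qf(0.95, df1 = 2, df2 = 19)      # critical value, ~3.52; reject if F exceeds it
1 - pf(3.71, df1 = 2, df2 = 19)  # p-value, ~0.044, as quoted above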
Let’s look now at an artificial data set
I love generating datasets that are vaguely medicine-related. Here is one:
n <- 10
age <- rnorm(n, 50, 10)                          # ages centered around 50
df <- data.frame(age = age,
                 a = 2 * age + rnorm(n, 0, 40),  # roughly linear in age, with noise
                 b = rnorm(n, 100, 20),          # unrelated to age
                 c = age * rbeta(n, 1, 5),       # age scaled by a right-skewed fraction
                 d = (age / 8)^2)                # deterministic function of age
#; head(df)
ggplot(df) +
geom_point(aes(age,a), color="red", size=6, alpha=I(1/1)) +
geom_point(aes(age,b), color="green", size=6, alpha=I(1/1)) +
geom_point(aes(age,c), color="blue", size=6, alpha=I(1/1)) +
geom_point(aes(age,d), color="yellow", size=6, alpha=I(1/1))

attach(df)
The data frame in this wide format is not suitable for an ANOVA: the data first have to be melted into long format (one value column plus a grouping factor, what reshape2 calls “molten” data). A common beginner temptation is to run
anova(lm(a~age))
## Analysis of Variance Table
##
## Response: a
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 170.7 170.71 0.0996 0.7603
## Residuals 8 13707.5 1713.44
anova(lm(b~age))
## Analysis of Variance Table
##
## Response: b
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 1944.7 1944.7 7.9021 0.02281 *
## Residuals 8 1968.8 246.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above two analyses do NOT describe a grouping of the data; they are linear regressions of the response (a or b) on the continuous predictor age. The df=1 for the predictor is a telltale sign! (Dalgaard, p. 130)
You need to melt the data frame
df2 <- melt(df, id = "age"); colnames(df2)[2] <- "subgroup"  #; head(df2); tail(df2)
ggplot(df2) +
geom_boxplot(aes(subgroup, value, color=subgroup), size=1, alpha=I(1/2), color="darkgray", fill="lightgray") +
geom_jitter(aes(subgroup, value, color=subgroup), size=8, alpha=I(3/4)) +
geom_segment(aes(x=0.5, xend=4.5, y=mean(value), yend=mean(value)), color="orange", size=2)+
labs(title="Orange line shows the average for all groups")

attach(df2)
anova(lm(value ~ subgroup)) # NOT age ~ subgroup
## Analysis of Variance Table
##
## Response: value
## Df Sum Sq Mean Sq F value Pr(>F)
## subgroup 3 48456 16151.9 29.211 9.416e-10 ***
## Residuals 36 19906 552.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is quite obviously significant, but if you insist on running the numbers yourself, \(p = 1-pf(\frac{MS_{B}}{MS_{W}}, k-1, N-k)\).
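For the record, plugging the mean squares from the table into that formula:
1 - pf(16151.9 / 552.9, df1 = 3, df2 = 36)   # ~9.4e-10, matching Pr(>F) above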
Pairwise comparisons and multiple comparisons
The F test suggests that the groups are not all the same, but it will not tell you where the difference lies. For that, one needs to do pairwise or multiple comparisons.
summary (lm(value ~ subgroup))
##
## Call:
## lm(formula = value ~ subgroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.363 -16.329 -2.756 12.677 85.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.010 7.436 11.432 1.55e-13 ***
## subgroupb 13.034 10.516 1.239 0.223215
## subgroupc -73.714 10.516 -7.010 3.19e-08 ***
## subgroupd -45.241 10.516 -4.302 0.000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.51 on 36 degrees of freedom
## Multiple R-squared: 0.7088, Adjusted R-squared: 0.6846
## F-statistic: 29.21 on 3 and 36 DF, p-value: 9.416e-10
The intercept is the mean of the first group (a), whereas the other coefficients describe the difference between the given group and the first one.
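A quick way to confirm this interpretation is to compare the coefficients against the group means; a short sketch, using df2 from above:
means <- tapply(df2$value, df2$subgroup, mean)  # per-group means
means["a"]                                      # equals the intercept above
means["b"] - means["a"]                         # equals the subgroupb coefficient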
The p values compare each group to the first group, but groups 2…k are not compared among themselves. To compare all pairs of groups, one needs to adjust for p-value inflation, for example with the Bonferroni correction.
pairwise.t.test(value, subgroup, p.adj="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: value and subgroup
##
## a b c
## b 1.00000 - -
## c 1.9e-07 4.9e-09 -
## d 0.00074 1.7e-05 0.06180
##
## P value adjustment method: bonferroni
Unequal variance
The typical one-way ANOVA relies on the assumption of equal variance in all groups. In the common scenario where that assumption is unlikely to hold, do not use the pooled standard deviation:
pairwise.t.test(value, subgroup, p.adj="bonferroni", pool.sd=F)
##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: value and subgroup
##
## a b c
## b 1.00000 - -
## c 0.00109 2.5e-07 -
## d 0.03220 1.1e-05 0.00014
##
## P value adjustment method: bonferroni
What if you are unsure about the variances in your groups?
Run the Bartlett test, which checks the homogeneity of variance (i.e., the homoscedasticity) of the samples.
bartlett.test(value~subgroup)
##
## Bartlett test of homogeneity of variances
##
## data: value by subgroup
## Bartlett's K-squared = 21.086, df = 3, p-value = 0.0001011
The null hypothesis is that the populations have equal variances. As suggested by the plots above, the Bartlett \(p<0.05\) confirms that the variances are likely different, so we were correct in not using the pooled standard deviation in the pairwise comparisons above. One problem with the Bartlett test is that it is not robust to departures from the assumption of normality.