One-way ANOVA

For a simple exercise to understand one-way ANOVA, I will use the data set red.cell.folate from the ISwR package (see the book “Introductory Statistics with R” by Peter Dalgaard), and later generate a data set of my own.

kbtools::libmgr("l", c("kbtools", "ISwR", "ggplot2", "reshape2"))
set.seed(117)
head(red.cell.folate)
##   folate ventilation
## 1    243  N2O+O2,24h
## 2    251  N2O+O2,24h
## 3    275  N2O+O2,24h
## 4    291  N2O+O2,24h
## 5    347  N2O+O2,24h
## 6    354  N2O+O2,24h
mean(red.cell.folate$folate)
## [1] 283.2273
ggplot(red.cell.folate) + 
  geom_boxplot(aes(ventilation, folate), alpha=I(1/4), color="darkgray") +
  geom_jitter(aes(ventilation, folate, color=ventilation), size=6) + 
  geom_segment(aes(x=0.5, xend=3.5, y=mean(folate), yend=mean(folate)), color="orange", size=1) + 
  labs(title="Orange line shows the average for ALL groups")

And now (drum roll) … it’s time to run the ANOVA

with(red.cell.folate, anova(lm(folate ~ ventilation)))
## Analysis of Variance Table
## 
## Response: folate
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## ventilation  2  15516  7757.9  3.7113 0.04359 *
## Residuals   19  39716  2090.3                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s look at what this ANOVA table means.

A bit of background first. With multiple groups, each observation can be split into its deviation from its own group mean, the deviation of that group mean from the mean of all observations (a.k.a. “the grand mean”), and the grand mean itself: \[x_{ij}=(x_{ij}-\bar{x}_{i})+(\bar{x}_i-\bar{x})+\bar{x}\] which corresponds to the model \[x_{ij}=\mu + \alpha_i + \epsilon_{ij}\] where the error terms \(\epsilon_{ij}\) are independent and identically distributed, \[\epsilon_{ij}\sim N(0,\sigma^2)\]
The null hypothesis that all groups are the same implies that all \(\alpha_{i}\) are zero.
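To make the decomposition concrete, here is a quick check on red.cell.folate (ave() returns each observation's own group mean):

grand <- mean(red.cell.folate$folate)                    # the grand mean
grp <- with(red.cell.folate, ave(folate, ventilation))   # each observation's group mean
all.equal(red.cell.folate$folate,
          (red.cell.folate$folate - grp) + (grp - grand) + grand)
## [1] TRUE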

The total sum of squares (a.k.a. variation, or SSD) equals the SSD “between” groups plus the SSD “within” groups \[SSD_{total}=SSD_{B}+SSD_{W}\] The first row in the ANOVA table above is the “between” groups SSD, labeled by the “treatment” or “grouping” factor. The last row is the “within” groups SSD, also known as “residuals” or “error.”

SSD component    Name                Row in ANOVA table    a.k.a.
\(SSD_{B}\)      “between groups”    first (df = k-1)      treatment, grouping
\(SSD_{W}\)      “within groups”     last (df = N-k)       residuals, error
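To tie the table to the ANOVA output, the two SSDs can be computed by hand; this quick sketch reproduces the numbers from the folate table above:

with(red.cell.folate, {
  grand <- mean(folate)
  grp <- ave(folate, ventilation)
  c(between = sum((grp - grand)^2),    # ~15516, the ventilation row
    within  = sum((folate - grp)^2))   # ~39716, the Residuals row
})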

The sums (SSDs) can and should be normalized by their respective degrees of freedom (df) to calculate mean squares. If N is the total number of observations and k is the number of groups, the df for \(SSD_{B}\) is k-1 and for \(SSD_{W}\) is N-k. Therefore

\[ \begin{aligned} MS_{B} &= \frac{SSD_{B}}{k-1} \\ MS_{W} &= \frac{SSD_{W}}{N-k} \end{aligned} \]

The \(MS_{W}\) is the pooled variance from combining the individual group variances and is thus an estimate of \(\sigma^2\). If there is no group effect, \(MS_{B}\) will be approximately equal to \(MS_{W}\) (and thus also provide an estimate of \(\sigma^2\)), but if there is a group effect, \(MS_{B}\) will tend to be larger.

The F statistic is \(\frac{MS_{B}}{MS_{W}}\). Under the null hypothesis, F will be close to 1 (it may exceed 1 slightly just by chance); only large values count as evidence against the null, which is why this test is one-sided. In the case above, \(F=\frac{7758}{2090} = 3.71\).
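Plugging the numbers from the folate table into these formulas:

k <- 3; N <- 22                # 3 ventilation groups, 22 observations
MS_B <- 15516 / (k - 1)        # 7758
MS_W <- 39716 / (N - k)        # 2090.3
MS_B / MS_W                    # F = 3.71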

You reject the null if F is larger than the 95% quantile of an F distribution with k-1 and N-k degrees of freedom, referred to as \(F_{k-1,N-k}\) (in this case, k=3 and N=22).

Just because we can, let’s see what such an F distribution looks like

fdist <- data.frame(F = rf(1000, 2, 19))
ggplot(fdist) +
  geom_histogram(aes(F), binwidth = prettyb(fdist$F), fill = "steelblue") +
  geom_segment(x = qf(.95, 2, 19), xend = qf(.95, 2, 19), y = 0, yend = 50, color = "red", size = 1) +
  annotate("text", x = 3.5, y = 55, label = "95th percentile", size = 5, color = "red") +
  scale_x_continuous(limits = c(0, 8), breaks = seq(0, 8, 1))
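prettyb() comes from the kbtools helper package loaded above. If you don't have it, a minimal equivalent with a fixed binwidth works just as well; this sketch also overlays the theoretical F density via stats::df, and assumes ggplot2 ≥ 3.3 for after_stat():

ggplot(fdist) +
  geom_histogram(aes(F, after_stat(density)), binwidth = 0.25, fill = "steelblue") +
  stat_function(fun = stats::df, args = list(df1 = 2, df2 = 19)) +
  geom_vline(xintercept = qf(.95, 2, 19), color = "red", size = 1)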

Because \(F=3.71\) is larger than \(qf(.95, 2, 19) = 3.5218933\), we reject the null hypothesis, and conclude that the population means are likely different. Another way of putting this is that \(p = 1-pf(3.71, 2, 19) = 0.0435697\).
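You can also pull the p value straight out of the ANOVA table object instead of computing it by hand:

fit <- lm(folate ~ ventilation, data = red.cell.folate)
anova(fit)["ventilation", "Pr(>F)"]   # 0.04359, matching the table above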

Let’s look now at an artificial data set

I love generating datasets that are vaguely medicine-related. Here is one:

n <- 10
age <- rnorm(n, 50, 10)
df <- data.frame(age = age,
                 a = 2 * age + rnorm(n, 0, 40),
                 b = rnorm(n, 100, 20),
                 c = age * rbeta(n, 1, 5),
                 d = (age/8)^2)
ggplot(df) +
  geom_point(aes(age, a), color = "red", size = 6) +
  geom_point(aes(age, b), color = "green", size = 6) +
  geom_point(aes(age, c), color = "blue", size = 6) +
  geom_point(aes(age, d), color = "yellow", size = 6)
attach(df)

The data frame presented this way is not suitable for an ANOVA: the data have to be reshaped into long (“molten”) format first (melting comes from the reshape2 package; tidyr’s pivot_longer is the modern equivalent). A beginner’s temptation is to run

anova(lm(a~age))
## Analysis of Variance Table
## 
## Response: a
##           Df  Sum Sq Mean Sq F value Pr(>F)
## age        1   170.7  170.71  0.0996 0.7603
## Residuals  8 13707.5 1713.44
anova(lm(b~age))
## Analysis of Variance Table
## 
## Response: b
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## age        1 1944.7  1944.7  7.9021 0.02281 *
## Residuals  8 1968.8   246.1                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above two analyses do NOT describe a grouping of the data but rather a linear regression of the response (a or b) on the predictor variable (age). The df = 1 for the predictor is the telltale sign! (Dalgaard, p. 130)
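A quick check makes the point: a numeric predictor always gets 1 df, because lm() fits a single slope, whereas a grouping factor with k levels uses k-1 df (compare the subgroup row with 3 df in the molten analysis below).

anova(lm(a ~ age))["age", "Df"]   # 1: a single regression slope, not k-1 group contrasts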

You need to melt the data frame

df2 <- melt(df, id = "age")
colnames(df2)[2] <- "subgroup"
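# Equivalently, with tidyr (assuming tidyr >= 1.0 is installed); this returns
# a tibble, which behaves the same downstream:
# df2 <- tidyr::pivot_longer(df, cols = a:d, names_to = "subgroup", values_to = "value")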
ggplot(df2) +
  geom_boxplot(aes(subgroup, value), size = 1, alpha = I(1/2), color = "darkgray", fill = "lightgray") +
  geom_jitter(aes(subgroup, value, color = subgroup), size = 8, alpha = I(3/4)) +
  geom_segment(aes(x = 0.5, xend = 4.5, y = mean(value), yend = mean(value)), color = "orange", size = 2) +
  labs(title = "Orange line shows the average for all groups")
attach(df2)
anova(lm(value ~ subgroup)) # NOT age ~ subgroup
## Analysis of Variance Table
## 
## Response: value
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## subgroup   3  48456 16151.9  29.211 9.416e-10 ***
## Residuals 36  19906   552.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is quite obviously significant, but if you want to run the numbers anyway, \(p = 1-pf(\frac{MS_{B}}{MS_{W}}, k-1, N-k)\).
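Plugging in the values from the table above:

1 - pf(16151.9 / 552.9, 3, 36)   # ~9.4e-10, matching Pr(>F) in the table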

Pairwise comparisons and multiple comparisons

The F test suggests that the groups are not all the same, but it does not tell you where the differences lie. For that, one needs pairwise or multiple comparisons.

summary(lm(value ~ subgroup))
## 
## Call:
## lm(formula = value ~ subgroup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.363 -16.329  -2.756  12.677  85.312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   85.010      7.436  11.432 1.55e-13 ***
## subgroupb     13.034     10.516   1.239 0.223215    
## subgroupc    -73.714     10.516  -7.010 3.19e-08 ***
## subgroupd    -45.241     10.516  -4.302 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.51 on 36 degrees of freedom
## Multiple R-squared:  0.7088, Adjusted R-squared:  0.6846 
## F-statistic: 29.21 on 3 and 36 DF,  p-value: 9.416e-10

The intercept is the mean of the first group, whereas the other three coefficients give the difference between each remaining group and the first one.
The p values compare each group to the first, but groups 2…k are never compared among themselves. To compare all pairs of groups, one needs to adjust for the inflated risk of false positives from multiple testing, e.g. with the Bonferroni correction.

pairwise.t.test(value, subgroup, p.adj="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  value and subgroup 
## 
##   a       b       c      
## b 1.00000 -       -      
## c 1.9e-07 4.9e-09 -      
## d 0.00074 1.7e-05 0.06180
## 
## P value adjustment method: bonferroni
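For intuition: the Bonferroni method simply multiplies each raw p value by the number of comparisons made, choose(4, 2) = 6 here, capping the result at 1. You can verify this against the unadjusted values:

pairwise.t.test(value, subgroup, p.adj = "none")   # raw p values; times 6 (capped at 1) gives the table above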

Unequal variance

The typical one-way ANOVA relies on an assumption of equal variance in all groups. When that assumption is unlikely to hold, do not use the pooled standard deviation:

pairwise.t.test(value, subgroup, p.adj="bonferroni", pool.sd=FALSE)
## 
##  Pairwise comparisons using t tests with non-pooled SD 
## 
## data:  value and subgroup 
## 
##   a       b       c      
## b 1.00000 -       -      
## c 0.00109 2.5e-07 -      
## d 0.03220 1.1e-05 0.00014
## 
## P value adjustment method: bonferroni
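For the overall F test under unequal variances (not covered above, but a standard base-R alternative), Welch's one-way test drops the equal-variance assumption entirely:

oneway.test(value ~ subgroup)   # var.equal = FALSE is the default, i.e. Welch's test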

What if you are unsure about the variances in your groups?

Run the Bartlett test, which checks for homogeneity of variance (i.e. homoscedasticity) across the samples.

bartlett.test(value~subgroup)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  value by subgroup
## Bartlett's K-squared = 21.086, df = 3, p-value = 0.0001011

The null hypothesis is that the populations have equal variances. As the plots above already suggested, the Bartlett \(p<0.05\) indicates that the variances likely differ, so we were right not to use the pooled standard deviations in the pairwise comparisons above. One problem with the Bartlett test is that it is not robust to departures from normality.
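If normality is in doubt, a more robust, rank-based option available in base R is the Fligner-Killeen test:

fligner.test(value ~ subgroup)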
