ANOVA

ANOVA (Analysis of Variance) is used to test for differences among two or more group means when there is a single quantitative response variable and a single categorical explanatory variable. The data for this example come from the iris dataset in R. We will use this dataset to investigate whether iris species have different average petal lengths.

Summary Statistics

The easiest way to calculate summary statistics for ANOVA is to use the favstats() function in the mosaic package. We are primarily interested in the means, standard deviations, and the sample sizes, but the other values are also interesting.

library(mosaic)
favstats(Petal.Length ~ Species, data = iris)

##      Species min  Q1 median    Q3 max  mean        sd  n missing
## 1     setosa 1.0 1.4   1.50 1.575 1.9 1.462 0.1736640 50       0
## 2 versicolor 3.0 4.0   4.35 4.600 5.1 4.260 0.4699110 50       0
## 3  virginica 4.5 5.1   5.55 5.875 6.9 5.552 0.5518947 50       0
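
If the mosaic package is not available, the three values of primary interest can be computed with base R alone. The following is a minimal sketch; the names by.species and base.stats are illustrative and not part of the original example.

```r
# Split petal lengths into one numeric vector per species
by.species <- split(iris$Petal.Length, iris$Species)

# For each species, compute the mean, standard deviation, and sample size;
# t() puts species in rows so the layout resembles the favstats() output
base.stats <- t(sapply(by.species, function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
}))
base.stats
```

The means, standard deviations, and sample sizes should agree with the corresponding columns of the favstats() output above.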

Plots

Side-by-side boxplots are the most common plots used in ANOVA. However, you may also want to produce additional plots for each level of the explanatory variable.
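
One possible way to draw these plots uses base R graphics. This is a sketch rather than the only option; the axis labels and titles are illustrative choices, and histograms are just one candidate for the additional per-level plots.

```r
# Side-by-side boxplots of petal length, one box per species
bp <- boxplot(Petal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Petal length (cm)",
              main = "Petal length by iris species")

# One histogram per level of the explanatory variable, on a shared x-axis
op <- par(mfrow = c(1, 3))            # three panels in one row
for (sp in levels(iris$Species)) {
  hist(iris$Petal.Length[iris$Species == sp],
       xlim = range(iris$Petal.Length),
       main = sp, xlab = "Petal length (cm)")
}
par(op)                               # restore the previous plot layout
```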

ANOVA Output

Once you have looked at the summary statistics and plots and are convinced that the criteria for ANOVA are met, you will want to fit a model to the data and look at the ANOVA table. The aov() function fits the model (lm() can also be used), and the summary() function displays the table.

my.anova <- aov(Petal.Length ~ Species, data = iris) # Fits the one-way ANOVA model
summary(my.anova) # Displays the ANOVA table

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  437.1  218.55    1180 <2e-16 ***
## Residuals   147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
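
The same table can also be obtained by fitting the model with lm() and passing the fit to anova(). A brief sketch, where my.lm and lm.table are illustrative names:

```r
# Fit the same one-way model with lm() instead of aov()
my.lm <- lm(Petal.Length ~ Species, data = iris)

# anova() on an lm fit returns the ANOVA table as a data frame
lm.table <- anova(my.lm)
lm.table
```

The degrees of freedom, sums of squares, and F value should match the summary() output above.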

The ANOVA table provides all the calculations necessary to summarize the results of the analysis: degrees of freedom, sums of squares, mean squares, the F statistic, and the p-value. In this particular example the p-value is very small (less than 2e-16). This suggests that at least one of the iris species has an average petal length different from the rest.
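
To see how the columns of the table relate to one another, the F statistic and p-value can be recomputed from the degrees of freedom and sums of squares. The sketch below uses the rounded values printed in the table, so the result will be close to, but not exactly, the reported F value. All variable names are illustrative.

```r
# Values copied from the ANOVA table (rounded as printed)
df.between <- 2       # Df for Species
df.within  <- 147     # Df for Residuals
ss.between <- 437.1   # Sum Sq for Species
ss.within  <- 27.2    # Sum Sq for Residuals

# Each mean square is a sum of squares divided by its degrees of freedom
ms.between <- ss.between / df.between
ms.within  <- ss.within / df.within

# The F statistic is the ratio of the two mean squares
f.stat <- ms.between / ms.within

# The p-value is the upper-tail area of the F distribution
p.value <- pf(f.stat, df.between, df.within, lower.tail = FALSE)
c(F = f.stat, p = p.value)
```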

Additional Analysis

In cases where the null hypothesis is rejected, additional analysis is required. Rejecting the null hypothesis means concluding that the average of at least one group differs from the rest. The next logical step is to find out which groups differ and which groups are similar. To do this you examine every pair of treatment groups. This can be done several ways, most of which are based on t tests with significance levels adjusted for the number of comparisons. Tukey's HSD and pairwise t tests are the most common techniques.

# Multiple Comparisons
TukeyHSD(my.anova, conf.level = 0.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
## 
## $Species
##                       diff     lwr     upr p adj
## versicolor-setosa    2.798 2.59422 3.00178     0
## virginica-setosa     4.090 3.88622 4.29378     0
## virginica-versicolor 1.292 1.08822 1.49578     0
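
The interval limits in this output can be reproduced by hand, which may help show where they come from. The sketch below assumes the standard Tukey formula for equal group sizes: the difference in means plus or minus q / sqrt(2) times the standard error of the difference, where q is a quantile of the studentized range distribution. The names fit, mse, q.crit, and half.width are illustrative.

```r
# Refit the model so this block is self-contained
fit <- aov(Petal.Length ~ Species, data = iris)

# Residual mean square (the Mean Sq entry for Residuals)
mse <- deviance(fit) / df.residual(fit)

n <- 50   # observations per species
k <- 3    # number of species

# 95% critical value from the studentized range distribution
q.crit <- qtukey(0.95, nmeans = k, df = df.residual(fit))

# Half-width shared by all three intervals (equal group sizes)
half.width <- q.crit / sqrt(2) * sqrt(mse * (1 / n + 1 / n))

# Interval for versicolor minus setosa, using the group means
diff.vs <- 4.260 - 1.462
c(diff = diff.vs, lwr = diff.vs - half.width, upr = diff.vs + half.width)
```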

The output from TukeyHSD() gives a confidence interval and an adjusted p-value for the difference in means of every pair of groups. None of the confidence intervals contains zero and every adjusted p-value is displayed as 0 (it is too small to show at the printed precision), so all three averages are significantly different from one another. This is somewhat evident in the side-by-side boxplots constructed at the beginning of the analysis.
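
Pairwise t tests, the other technique mentioned above, are available through pairwise.t.test() in base R. The following sketch assumes a Bonferroni adjustment, which is only one of several available methods (the function's default is "holm"); the name pw is illustrative.

```r
# Pairwise t tests using a pooled standard deviation;
# p-values are adjusted for the three comparisons (Bonferroni, by assumption)
pw <- pairwise.t.test(iris$Petal.Length, iris$Species,
                      p.adjust.method = "bonferroni")
pw

# The adjusted p-values are stored as a lower-triangular matrix
pw$p.value
```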

Mathematics, Computer Science, and Statistics Department, Gustavus Adolphus College