GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-learning/Chapter11-Analysis-of-Variance.ipynb
Kernel: Python 3

Chapter 11 - Analysis of Variance

# Useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.stats.weightstats as sms
from scipy.stats import t
from scipy.stats import f
from scipy.stats import norm
import math
from IPython.display import Math, display

One Factor Analysis of Variance

One Factor Layouts

  • Suppose we experiment on $k$ populations with unknown population means $\mu_1, \mu_2, \dots, \mu_k$

    • Observation $x_{ij}$ represents the $j$-th observation from the $i$-th population

  • The one factor analysis of variance methodology is appropriate for comparing three or more populations

  • Each population $i$ consists of $n_i$ observations

    • If the sample sizes $n_1, n_2, \dots, n_k$ are all equal, the data set is called balanced; otherwise it is called unbalanced

  • The total sample size is $n_T = n_1 + \cdots + n_k$

    • The total data set is called a one-way or one factor layout

    • The factor has $k$ levels corresponding to the $k$ populations

  • Completely randomized design: the experiment is performed by randomly allocating the $n_T$ units among the $k$ populations

  • Modeling assumption: $x_{ij} = \mu_i + \epsilon_{ij}$

    • where the error terms $\epsilon_{ij} \sim N(0, \sigma^2)$, so that $x_{ij} \sim N(\mu_i, \sigma^2)$

  • Point estimates of the unknown population means: $\hat{\mu}_i = \overline{x}_{i \cdot} = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i}, \quad 1 \leq i \leq k$

We test $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ vs $H_A$: not $H_0$

$\Longrightarrow$ Acceptance of the null hypothesis indicates that there is no evidence that any of the population means are unequal
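As a quick illustration of the point estimates $\hat{\mu}_i$, they are simply the sample means of each group. This sketch uses made-up numbers, not data from the chapter:

```python
import numpy as np

# Hypothetical samples from k = 3 populations (made-up data)
groups = [
    np.array([23.8, 24.0, 25.6, 25.1]),
    np.array([30.2, 29.9, 29.1]),
    np.array([27.0, 25.4, 25.6, 24.2, 24.8]),
]

# Point estimate of each population mean: mu_hat_i is the i-th sample mean
mu_hat = [g.mean() for g in groups]

# Total sample size n_T = n_1 + ... + n_k
n_T = sum(len(g) for g in groups)
```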

Partitioning the Total Sum of Squares

  • We call $SSTr$ the "sum of squares for treatments", supposing we are analyzing the effects of treatments

$$\begin{align}
SST &= \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \overline{x}_{\cdot \cdot})^2 \\
&= \sum_{i=1}^k n_i (\overline{x}_{i \cdot} - \overline{x}_{\cdot \cdot})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \overline{x}_{i \cdot})^2 \\
&= SSTr + SSE
\end{align}$$
  • $p$-value considerations: the plausibility of the null hypothesis that the factor level means are all equal depends on the size of the sum of squares for treatments, $SSTr$, relative to the sum of squares for error, $SSE$

The Analysis of Variance (ANOVA) Table

  • Mean square error (MSE): $MSE = \frac{SSE}{n_T - k}$, with $\frac{SSE}{\sigma^2} \sim \chi^2_{n_T - k}$ and $E(MSE) = \sigma^2$

  • Mean squares for treatments (MSTr): $MSTr = \frac{SSTr}{k-1}$

  • If the factor level means $\mu_i$ are all equal ($H_0$), then $E(MSTr) = \sigma^2$ and $\frac{SSTr}{\sigma^2} \sim \chi^2_{k-1}$ $\Longrightarrow$ under $H_0$ we have: $F = \frac{MSTr}{MSE} \sim F_{k-1, n_T - k}$
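A small simulation can sanity-check this null distribution: if all $k$ means really are equal, rejecting whenever the $F_{k-1, n_T-k}$ p-value falls below $\alpha$ should happen about a fraction $\alpha$ of the time. This sketch uses `scipy.stats.f_oneway`; the group sizes, seed, and normal parameters are arbitrary choices:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
alpha, reps, rejections = 0.05, 2000, 0

for _ in range(reps):
    # Three samples drawn from the SAME normal population, so H0 is true
    samples = [rng.normal(loc=10.0, scale=2.0, size=n) for n in (12, 10, 11)]
    _, p = f_oneway(*samples)
    rejections += (p < alpha)

# Should be close to alpha = 0.05, since the F-test has the stated size under H0
rejection_rate = rejections / reps
```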

Analysis of variance table for one factor layout

| Source | d.f. | Sum of Squares | Mean Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Treatments | $k-1$ | $SSTr$ | $MSTr = \frac{SSTr}{k-1}$ | $F = \frac{MSTr}{MSE}$ | $P(F_{k-1, n_T-k} \geq obs(F))$ |
| Error | $n_T-k$ | $SSE$ | $MSE = \frac{SSE}{n_T-k}$ | | |
| Total | $n_T-1$ | $SST$ | | | |
# Build a toy dataset
l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])
layout_list = [l1, l2, l3]

# Hard coded calculations
k = len(layout_list)
df_treatments = k - 1
# len() is safer than np.count_nonzero, which would silently skip zero-valued observations
nt = len(l1) + len(l2) + len(l3)
df_error = nt - k
df_total = nt - 1

# Sums of squares
stack = np.hstack((l1, l2, l3))
SST = ((stack - stack.mean())**2).sum()
SSTr = 0
for data in layout_list:
    SSTr += len(data) * (data.mean() - stack.mean())**2
SSE = SST - SSTr
display(Math(r'SSTr: {:.4f} \\SSE: {:.4f}\\SST: {:.4f}'.format(SSTr, SSE, SST)))

# Mean squares
MSTr = SSTr / df_treatments
MSE = SSE / df_error
display(Math(r'MSTr: {:.4f} \\MSE: {:.4f}'.format(MSTr, MSE)))

# F-statistic and p-value
F = MSTr / MSE
p = f.sf(F, df_treatments, df_error)
display(Math(r'F: {:.4f} \\ p-value: {:.4f}'.format(F, p)))

$SSTr: 121.2382 \quad SSE: 34.4170 \quad SST: 155.6552$

$MSTr: 60.6191 \quad MSE: 1.1472$

$F: 52.8395 \quad p\text{-value}: 0.0000$
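The hand-computed table above can be cross-checked against SciPy's built-in one-way ANOVA routine, which takes the raw samples and returns the same F-statistic and p-value:

```python
import numpy as np
from scipy.stats import f_oneway

# Same toy dataset as in the hard-coded calculation above
l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])

# f_oneway computes the same F = MSTr/MSE and its F_{k-1, nT-k} p-value
F, p = f_oneway(l1, l2, l3)
print(F, p)  # F should match the hard-coded calculation (about 52.84)
```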

Pairwise Comparisons of the Factor Level Means (T-Method: Tukey’s Multiple Comparisons Procedure)

  • When the null hypothesis is rejected, the experimenter can follow up the analysis with pairwise comparisons of the factor level means to discover which ones have been shown to be different and by how much

  • With $k$ factor levels there are $k(k-1)/2$ pairwise differences

  • A set of $1-\alpha$ level simultaneous confidence intervals for the differences $\mu_{i_1} - \mu_{i_2}$ is $$\left( \overline{x}_{i_1 \cdot} - \overline{x}_{i_2 \cdot} - \hat{\sigma} \frac{q_{\alpha, k, \nu}}{\sqrt{2}} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} }, \; \overline{x}_{i_1 \cdot} - \overline{x}_{i_2 \cdot} + \hat{\sigma} \frac{q_{\alpha, k, \nu}}{\sqrt{2}} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} } \right)$$

  • where $\hat{\sigma} = \sqrt{MSE}$

  • $q_{\alpha, k, \nu}$ is the upper $\alpha$ critical point of the Studentized range distribution with parameter $k$ and degrees of freedom $\nu = n_T - k$

    • Difference from the $t_{\alpha/2, \nu}$ critical point: individual $t$-intervals have an individual confidence level, whereas this set of simultaneous confidence intervals has an overall confidence level

  • If the confidence interval for the difference $\mu_{i_1} - \mu_{i_2}$ contains zero, then there is no evidence that the means at factor levels $i_1$ and $i_2$ are different

# Example from the previous dataset
# Return the left and right confidence interval endpoints
def confidence_interval(data1, data2, q):
    diff = data1.mean() - data2.mean()
    half_width = sigma * q / math.sqrt(2) * math.sqrt(1/len(data1) + 1/len(data2))
    return diff - half_width, diff + half_width

# sigma
sigma = math.sqrt(MSE)

# Critical value q_{0.05, 3, 30}
q = 3.49  # We obtain this from tables

# Print the three pairwise confidence intervals
left, right = confidence_interval(l1, l2, q)
display(Math(r'\mu_1 - \mu_2 \in ({:.4f}, {:.4f})'.format(left, right)))
left, right = confidence_interval(l1, l3, q)
display(Math(r'\mu_1 - \mu_3 \in ({:.4f}, {:.4f})'.format(left, right)))
left, right = confidence_interval(l2, l3, q)
display(Math(r'\mu_2 - \mu_3 \in ({:.4f}, {:.4f})'.format(left, right)))

$\mu_1 - \mu_2 \in (-5.1168, -2.8532)$

$\mu_1 - \mu_3 \in (-0.7420, 1.4647)$

$\mu_2 - \mu_3 \in (3.1915, 5.5013)$
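statsmodels provides Tukey's procedure directly, which avoids both the hand-rolled interval function and reading $q_{\alpha,k,\nu}$ from tables. A sketch on the same dataset (the group labels are arbitrary names):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])

# Stack the observations and label each one with its factor level
values = np.concatenate([l1, l2, l3])
labels = ["level1"] * len(l1) + ["level2"] * len(l2) + ["level3"] * len(l3)

# Simultaneous 95% confidence intervals for all pairwise differences
result = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(result)  # one row per pair, with CI bounds and a reject column
```

The `reject` column agrees with the intervals above: the level1–level3 interval contains zero, so only that pair shows no evidence of a difference.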

Sample Size Determination

  • The sensitivity of the analysis depends on the k sample sizes

  • The power of the test increases as the sample sizes increase

  • An increase in the sample sizes $\Longrightarrow$ a decrease in the lengths of the pairwise confidence intervals

  • If the sample sizes $n_i$ are unequal, then $L = \sqrt{2} \, \hat{\sigma} \, q_{\alpha,k,\nu} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} }$

  • If they are all equal to $n$, the expression becomes $L = 2 q_{\alpha, k, \nu} \sqrt{ \frac{s^2}{n} }$

    • If we want a maximum length of $L$, then we need to collect $n \geq \frac{4 s^2 q_{\alpha, k, \nu}^2}{L^2}$ observations from each population
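Plugging in numbers from the worked example above ($s^2 = MSE \approx 1.1472$, $k = 3$), a sketch of the sample size formula. The exact critical point comes from `scipy.stats.studentized_range` (available in SciPy 1.7+) instead of tables; holding $\nu$ fixed at 30 is an assumption for this sketch, since $\nu = n_T - k$ itself depends on $n$:

```python
import math
from scipy.stats import studentized_range

s2 = 1.1472   # pooled variance estimate (MSE from the example above)
k = 3         # number of factor levels
L = 2.0       # target maximum confidence interval length (arbitrary choice)
nu = 30       # error degrees of freedom, assumed fixed for this sketch

# Upper alpha = 0.05 critical point of the Studentized range distribution
q = studentized_range.ppf(1 - 0.05, k, nu)  # close to the tabled value 3.49

# n >= 4 s^2 q^2 / L^2, rounded up to the next whole observation
n = math.ceil(4 * s2 * q**2 / L**2)
```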

Model Assumptions

  • Observations are assumed to be independently distributed with normal distributions that share a common variance

  • The ANOVA is fairly robust to the distribution of the data, so it provides fairly accurate results as long as the distribution is not very far from a normal distribution

  • The equality of the variances for each of the k factor levels can be judged from a comparison of the sample variances or from a visual comparison of the lengths of boxplots of the observations at each factor level