GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-learning/Chapter11-Analysis-of-Variance.ipynb
Kernel: Python 3

Chapter 11 - Analysis of Variance

# Useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.stats.weightstats as sms
from scipy.stats import t
from scipy.stats import f
from scipy.stats import norm
import math
from IPython.display import Math, display

One Factor Analysis of Variance

One Factor Layouts

  • Suppose we experiment on $k$ populations with unknown population means $\mu_1, \mu_2, \dots, \mu_k$

    • Observation $x_{ij}$ represents the $j$-th observation from the $i$-th population

  • The one factor analysis of variance methodology is appropriate for comparing three or more populations

  • Each population $i$ consists of $n_i$ observations

    • If the sample sizes $n_1, n_2, \dots, n_k$ are all equal, the data set is called balanced; otherwise it is called unbalanced

  • The total sample size is $n_T = n_1 + \cdots + n_k$

    • The total data set is called a one-way or one factor layout

    • The factor has $k$ levels corresponding to the $k$ populations

  • Completely randomized design: the experiment is performed by randomly allocating the $n_T$ units among the $k$ populations

  • Modeling assumption: $x_{ij} = \mu_i + \epsilon_{ij}$

    • where the error terms $\epsilon_{ij} \sim N(0, \sigma^2)$, so that $x_{ij} \sim N(\mu_i, \sigma^2)$

  • Point estimates of the unknown population means: $\hat{\mu}_i = \overline{x}_{i \cdot} = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i}, \quad 1 \leq i \leq k$

We test $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ vs $H_A$: not $H_0$

$\Longrightarrow$ Acceptance of the null hypothesis indicates that there is no evidence that any of the population means are unequal
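As a quick illustration of the point estimates $\hat{\mu}_i$, they are simply the sample means of each group. This sketch uses made-up numbers, not data from the chapter:

```python
import numpy as np

# Hypothetical samples from k = 3 populations (made-up data)
groups = [
    np.array([23.8, 24.0, 25.6, 25.1]),
    np.array([30.2, 29.9, 29.1]),
    np.array([27.0, 25.4, 25.6, 24.2, 24.8]),
]

# Point estimate of each population mean: mu_hat_i is the i-th sample mean
mu_hat = [g.mean() for g in groups]

# Total sample size n_T = n_1 + ... + n_k
n_T = sum(len(g) for g in groups)
```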

Partitioning the Total Sum of Squares

  • We call $SSTr$ the "sum of squares for treatments", supposing we are analyzing the effects of treatments

$$\begin{align}
SST &= \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \overline{x}_{\cdot \cdot})^2 \\
&= \sum_{i=1}^k n_i (\overline{x}_{i \cdot} - \overline{x}_{\cdot \cdot})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \overline{x}_{i \cdot})^2 \\
&= SSTr + SSE
\end{align}$$
  • $p$-value considerations: the plausibility of the null hypothesis that the factor level means are all equal depends on the size of the sum of squares for treatments, $SSTr$, relative to the sum of squares for error, $SSE$

The Analysis of Variance (ANOVA) Table

  • Mean square error (MSE): $MSE = \frac{SSE}{n_T - k}$, with $\frac{SSE}{\sigma^2} \sim \chi^2_{n_T - k}$ and $E(MSE) = \sigma^2$

  • Mean squares for treatments (MSTr): $MSTr = \frac{SSTr}{k-1}$

  • If the factor level means $\mu_i$ are all equal ($H_0$), then $E(MSTr) = \sigma^2$ and $\frac{SSTr}{\sigma^2} \sim \chi^2_{k-1}$ $\Longrightarrow$ under $H_0$ we have: $F = \frac{MSTr}{MSE} \sim F_{k-1, n_T - k}$
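A small simulation can sanity-check this null distribution: if all $k$ means really are equal, rejecting whenever the $F_{k-1, n_T-k}$ p-value falls below $\alpha$ should happen about a fraction $\alpha$ of the time. This sketch uses `scipy.stats.f_oneway`; the group sizes, seed, and normal parameters are arbitrary choices:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
alpha, reps, rejections = 0.05, 2000, 0

for _ in range(reps):
    # Three samples drawn from the SAME normal population, so H0 is true
    samples = [rng.normal(loc=10.0, scale=2.0, size=n) for n in (12, 10, 11)]
    _, p = f_oneway(*samples)
    rejections += (p < alpha)

# Should be close to alpha = 0.05, since the F-test has the stated size under H0
rejection_rate = rejections / reps
```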

Analysis of variance table for one factor layout

| Source | d.f. | Sum of Squares | Mean Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Treatments | $k-1$ | $SSTr$ | $MSTr = \frac{SSTr}{k-1}$ | $F = \frac{MSTr}{MSE}$ | $P(F_{k-1, n_T-k} \geq obs(F))$ |
| Error | $n_T-k$ | $SSE$ | $MSE = \frac{SSE}{n_T-k}$ | | |
| Total | $n_T-1$ | $SST$ | | | |
# Build a toy dataset
l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])
layout_list = [l1, l2, l3]

# Hard coded calculations
k = len(layout_list)
df_treatments = k - 1
# len() is safer than np.count_nonzero, which would silently skip zero-valued observations
nt = len(l1) + len(l2) + len(l3)
df_error = nt - k
df_total = nt - 1

# Sums of squares
stack = np.hstack((l1, l2, l3))
SST = ((stack - stack.mean())**2).sum()
SSTr = 0
for data in layout_list:
    SSTr += len(data) * (data.mean() - stack.mean())**2
SSE = SST - SSTr
display(Math(r'SSTr: {:.4f} \\SSE: {:.4f}\\SST: {:.4f}'.format(SSTr, SSE, SST)))

# Mean squares
MSTr = SSTr / df_treatments
MSE = SSE / df_error
display(Math(r'MSTr: {:.4f} \\MSE: {:.4f}'.format(MSTr, MSE)))

# F-statistic and p-value
F = MSTr / MSE
p = f.sf(F, df_treatments, df_error)
display(Math(r'F: {:.4f} \\ p-value: {:.4f}'.format(F, p)))

$SSTr: 121.2382 \quad SSE: 34.4170 \quad SST: 155.6552$

$MSTr: 60.6191 \quad MSE: 1.1472$

$F: 52.8395 \quad p\text{-value}: 0.0000$
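The hand-computed table above can be cross-checked against SciPy's built-in one-way ANOVA routine, which takes the raw samples and returns the same F-statistic and p-value:

```python
import numpy as np
from scipy.stats import f_oneway

# Same toy dataset as in the hard-coded calculation above
l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])

# f_oneway computes the same F = MSTr/MSE and its F_{k-1, nT-k} p-value
F, p = f_oneway(l1, l2, l3)
print(F, p)  # F should match the hard-coded calculation (about 52.84)
```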

Pairwise Comparisons of the Factor Level Means (T-Method: Tukey’s Multiple Comparisons Procedure)

  • When the null hypothesis is rejected, the experimenter can follow up the analysis with pairwise comparisons of the factor level means to discover which ones have been shown to be different and by how much

  • With $k$ factor levels there are $k(k-1)/2$ pairwise differences

  • A set of $1-\alpha$ level simultaneous confidence intervals for the differences $\mu_{i_1} - \mu_{i_2}$ is $$\left( \overline{x}_{i_1 \cdot} - \overline{x}_{i_2 \cdot} - \hat{\sigma} \frac{q_{\alpha, k, \nu}}{\sqrt{2}} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} }, \; \overline{x}_{i_1 \cdot} - \overline{x}_{i_2 \cdot} + \hat{\sigma} \frac{q_{\alpha, k, \nu}}{\sqrt{2}} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} } \right)$$

  • where $\hat{\sigma} = \sqrt{MSE}$

  • $q_{\alpha, k, \nu}$ is the upper $\alpha$ critical point of the Studentized range distribution with parameter $k$ and degrees of freedom $\nu = n_T - k$

    • Difference from the $t_{\alpha/2, \nu}$ critical point: individual $t$-intervals have an individual confidence level, whereas this set of simultaneous confidence intervals has an overall confidence level

  • If the confidence interval for the difference $\mu_{i_1} - \mu_{i_2}$ contains zero, then there is no evidence that the means at factor levels $i_1$ and $i_2$ are different

# Example from the previous dataset
# Return the left and right confidence interval endpoints
def confidence_interval(data1, data2, q):
    diff = data1.mean() - data2.mean()
    half_width = sigma * q / math.sqrt(2) * math.sqrt(1/len(data1) + 1/len(data2))
    return diff - half_width, diff + half_width

# sigma
sigma = math.sqrt(MSE)

# Critical value q_{0.05, 3, 30}
q = 3.49  # We obtain this from tables

# Print the three pairwise confidence intervals
left, right = confidence_interval(l1, l2, q)
display(Math(r'\mu_1 - \mu_2 \in ({:.4f}, {:.4f})'.format(left, right)))
left, right = confidence_interval(l1, l3, q)
display(Math(r'\mu_1 - \mu_3 \in ({:.4f}, {:.4f})'.format(left, right)))
left, right = confidence_interval(l2, l3, q)
display(Math(r'\mu_2 - \mu_3 \in ({:.4f}, {:.4f})'.format(left, right)))

$\mu_1 - \mu_2 \in (-5.1168, -2.8532)$

$\mu_1 - \mu_3 \in (-0.7420, 1.4647)$

$\mu_2 - \mu_3 \in (3.1915, 5.5013)$
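statsmodels provides Tukey's procedure directly, which avoids both the hand-rolled interval function and reading $q_{\alpha,k,\nu}$ from tables. A sketch on the same dataset (the group labels are arbitrary names):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

l1 = np.array([23.8, 24, 25.6, 25.1, 25.5, 26.1, 23.8, 25.7, 24.3, 26, 24.6, 27])
l2 = np.array([30.2, 29.9, 29.1, 28.8, 29.1, 28.6, 28.3, 28.7, 27.9, 30.5])
l3 = np.array([27, 25.4, 25.6, 24.2, 24.8, 24, 25.5, 23.9, 22.6, 26, 23.4])

# Stack the observations and label each one with its factor level
values = np.concatenate([l1, l2, l3])
labels = ["level1"] * len(l1) + ["level2"] * len(l2) + ["level3"] * len(l3)

# Simultaneous 95% confidence intervals for all pairwise differences
result = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(result)  # one row per pair, with CI bounds and a reject column
```

The `reject` column agrees with the intervals above: the level1–level3 interval contains zero, so only that pair shows no evidence of a difference.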

Sample Size Determination

  • The sensitivity of the analysis depends on the k sample sizes

  • The power of the test increases as the sample sizes increase

  • An increase in the sample sizes $\Longrightarrow$ a decrease in the lengths of the pairwise confidence intervals

  • If the sample sizes $n_i$ are unequal, then $L = \sqrt{2} \, \hat{\sigma} \, q_{\alpha,k,\nu} \sqrt{ \frac{1}{n_{i_1}} + \frac{1}{n_{i_2}} }$

  • If they are all equal to $n$, the expression becomes $L = 2 q_{\alpha, k, \nu} \sqrt{ \frac{s^2}{n} }$

    • If we want a maximum length of $L$, then we need to collect $n \geq \frac{4 s^2 q_{\alpha, k, \nu}^2}{L^2}$ observations from each population
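Plugging in numbers from the worked example above ($s^2 = MSE \approx 1.1472$, $k = 3$), a sketch of the sample size formula. The exact critical point comes from `scipy.stats.studentized_range` (available in SciPy 1.7+) instead of tables; holding $\nu$ fixed at 30 is an assumption for this sketch, since $\nu = n_T - k$ itself depends on $n$:

```python
import math
from scipy.stats import studentized_range

s2 = 1.1472   # pooled variance estimate (MSE from the example above)
k = 3         # number of factor levels
L = 2.0       # target maximum confidence interval length (arbitrary choice)
nu = 30       # error degrees of freedom, assumed fixed for this sketch

# Upper alpha = 0.05 critical point of the Studentized range distribution
q = studentized_range.ppf(1 - 0.05, k, nu)  # close to the tabled value 3.49

# n >= 4 s^2 q^2 / L^2, rounded up to the next whole observation
n = math.ceil(4 * s2 * q**2 / L**2)
```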

Model Assumptions

  • Observations are assumed to be independently distributed with normal distributions that share a common variance

  • The ANOVA is fairly robust to the distribution of the data, so it provides fairly accurate results as long as the distribution is not very far from a normal distribution

  • The equality of the variances for each of the k factor levels can be judged from a comparison of the sample variances or from a visual comparison of the lengths of boxplots of the observations at each factor level