GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-learning/Chapter-9-Comparing-Two-Population-Means.ipynb
³⁸⁸ views

Kernel: Python 3

In [51]:

# Useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.stats.weightstats as sms
from scipy.stats import t
from scipy.stats import norm
import math
from IPython.display import Math, display

Chapter 9 - Comparing Two Population Means

Introduction

Two Sample Problems

Suppose we have:

Set of data observations $x_1, \dots, x_n$ from a population A with cumulative dist. $F_A(x)$
Set of data observations $y_1, \dots, y_m$ from a population B with cumulative dist. $F_B(x)$

$\Longrightarrow$ goal: compare the means $\mu_A$ and $\mu_B$

Common practices

For avoiding biases, we can use:

Randomization
Placebo, blind and double blind experiments

Testing

Consider testing $H_0: \mu = \mu_0$ versus $H_A: \mu \neq \mu_0$

$\Longrightarrow$ the $p$ -value can be found in the same way as for one-sample problems

Paired Samples Versus Independent Samples

Data from paired samples are of the form $(x_1, y_1), (x_2, y_2), \dots, x_n, y_n)$ from each $n$ experimental subjects (i.e. test heart rate reduction in each patient)
Blocking: keep out all unwanted influences
Comparison is done via the pairwise differences $z_i = x_i - y_i, 1 \leq i \leq n$

Analysis of Paired Samples

Methodology

Data: $(x_1, y_1), (x_2, y_2), \dots, x_n, y_n)$ $\Longrightarrow z_i = x_i - y_i, 1 \leq i \leq n$
$x_i = \mu_A + \gamma_i + \epsilon_I^A$
$y_i = \mu_B + \gamma_i + \epsilon_I^B$

where $\mu_A$ (or $\mu_B$ ) are the effects of the treatments A or B, $\gamma_i$ effects by subject I, $\epsilon_i^A \sim N(0, \sigma^2_A)$ are measurement errors for subject I under treatment i.e. A (similarly for B)

$\Longrightarrow z_i = \mu_A - \mu_B + \epsilon_i^{AB}$ are observations from a distribution with mean $\mu = \mu_A - \mu_B$

In [57]:

# Parameter
data_A = np.array([23.6, 27.9, 22.9, 21.8, 25.8, 30.7, 26.5, 25.4])
data_B = np.array([22.5, 25.6, 24.0, 20.4, 26.0, 26.6, 26.4, 22.1])
z = data_A - data_B

# Mean
z_bar = z.mean(); s = z.std(ddof=1) # Same as pandas
print('Mean = {:.4f}\nStandard deviation = {:.4f}'.format(z_bar, s))

""" We test
$H_0 : \mu \leq 0$
$H_A : \mu > 0$ """
n = len(z)
mu = 0
t_stat = math.sqrt(n)*(z_bar - mu)/s
print('t-statistics: {:.4f}'.format(t_stat))
print(("p-value = {:.4f}".format(t.sf(t_stat, n-1))))

Out[57]:

Mean = 1.3750
Standard deviation = 1.7847
t-statistics: 2.1792
p-value = 0.0329

Analysis of Independent Samples

Population	Samples	Size	Mean	Standard deviation
Population A	$x_1, x_2,\dots, x_n$	n	$\overline x$	$s_x$
Population B	$y_1, y_2,\dots, y_m$	m	$\overline y$	$s_y$

Point estimate $\mu_A - \mu_B = \overline x - \overline y$ Standard error $se(\overline x - \overline y) = \sqrt{\frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}}$

Assume $\sigma_A^2$ and $\sigma_B^2$ are unknown. Then we have:

General procedure: $se(\overline x - \overline y) = \sqrt{\frac{s^2_x}{n} + \frac{s^2_x}{m}}$
Pooled variance procedure: $se(\overline x - \overline y) = s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$ where $s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}$
When the variances are known, we use a two-sample z-test.

General Procedure (Smith-Satterthwaite test)

We use the statistics $T = \frac{ \bar{x} - \bar{y} - (\mu_A - \mu_B)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}$
This statistic follows approximately t-distribution with the d.f. $\nu$ as the largest integer not larger than $v^* = \frac{ (s_x^2 / n + s_y^2 /m ) ^2}{ s_x^4/ n^2(n-1) + s_y^4 / m^2(m-1) }$
Two sided $1-\alpha$ level of confidence iunterval for $\mu_A - \mu_B$ is given by $\overline x - \overline y \pm t_{\alpha/2, \nu} \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$

In [76]:

# Population A
n = 14; x_bar = 32.45; s_x = 4.30

# Population B
m = 14; y_bar = 41.45; s_y = 5.23

alpha = 0.01 # = 1 - confidence level

def degrees_of_freedom(n, m, s_x, s_y):
    nu =  ((s_x**2/n + s_y**2/m)**2) / ( s_x**4 / (n**2*(n-1)) + s_y**4 / (m**2*(m-1)))
    return math.floor(nu)

def wing_span(alpha, nu, n, m, s_x, s_y):
    return t.ppf(1-alpha/2, nu) * math.sqrt( s_x**2/n + s_y**2/m)

def print_confidence_interval(alpha, mu, wing_span):
    print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*100), mu - wing_span, mu + wing_span))

# Calculations
diff = x_bar - y_bar
nu = degrees_of_freedom( n, m, s_x, s_y)
wing_span = wing_span(alpha, nu, n, m, s_x, s_y)
display(Math('v = {}'.format(nu)))
print_confidence_interval(alpha, diff, wing_span)

Out[76]:

$\displaystyle v = 25$

Confidence interval with 99.00% confidence level: (-14.0440)(-3.9560)

For testing $H_0: \mu_A - \mu_B = \delta$ vs $H_A: \mu_A - \mu_B \neq \delta$ the t-statistic is $T = \frac{( \bar{x} - \bar{y} - \delta)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}$

In [77]:

"""Hypotheses:
- $H_0 : \mu_A = \mu_B$
- $H_A : \mu_A \neq \mu_B$"""

# Calculate the t-statistics general
def T_statistic_general(x_bar, y_bar, s_x, s_y, n, m, delta= 0):
    return ( (x_bar - y_bar - delta)  / math.sqrt(s_x**2/n + s_y**2/m))

t_stat = T_statistic_general(x_bar, y_bar, s_x, s_y, n, m, delta= 0)
print("|t| value: {:.4f} ".format(abs(t_stat)))
print("Critical point: {:.4f}".format(t.ppf(1- alpha/2, n + m -2 )))

Out[77]:

|t| value: 4.9736 
Critical point: 2.7787

Pooled Variance Procedure

Assume $\sigma_A^2 = \sigma_B^2 = \sigma^2$
Unbiased estimate of $\hat{\sigma}^2$ of $\sigma^2$ is given by: $s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}$
t-statistics becomes $T = \frac{ \overline x - \overline y - (\mu_A - \mu_B)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}} } \sim t_{n+m-2}$
A two-sided $1-\alpha$ level confidence interval for $\mu_A - \mu_B$ is given by the end-points $\overline x - \overline y \pm t_{\alpha/2, n+m-2} s_p \sqrt{ \frac{1}{n} + \frac{1}{m} }$

In [74]:

# Population A
n = 14; x_bar = 32.45; s_x = 4.30

# Population B
m = 14; y_bar = 41.45; s_y = 5.23

alpha = 0.01 # = 1 - confidence level

# Pooled variance
def pooled_variance(n, m, s_x, s_y):
    return ( ((n-1)*s_x**2 + (m-1)*s_y**2) / (n+m-2))

def wing_span(s_p, n, m, alpha):
    return t.ppf(1-alpha/2, n+m-2) * s_p*math.sqrt(1/n + 1/m)

def print_confidence_interval(alpha, mu, wing_span):
    print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*100), mu - wing_span, mu + wing_span))
    
diff = x_bar - y_bar
s_p = math.sqrt(pooled_variance(n, m, s_x, s_y))
wing_span = wing_span(s_p, n, m, alpha)
print_confidence_interval(alpha, diff, wing_span)

Out[74]:

Confidence interval with 99.00% confidence level: (-14.0282)(-3.9718)

For testing $H_0: \mu_A - \mu_B = \delta$ vs $H_A: \mu_A - \mu_B \neq \delta$ the t-statistic is $T = \frac{( \bar{x} - \bar{y} - \delta)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$

In [79]:

"""Hypotheses:
- $H_0 : \mu_A = \mu_B$
- $H_A : \mu_A \neq \mu_B$"""

# Calculate the t-statistics
def T_statistic_pooled(x_bar, y_bar, s_p, n, m, delta= 0):
    return ( (x_bar - y_bar - delta)  / (s_p * math.sqrt(1/n + 1/m)))

t_stat = T_statistic_pooled(x_bar, y_bar, s_p, n, m, delta= 0)
print("|t| value: {:.4f} ".format(abs(t_stat)))
print("Critical point: {:.4f}".format(t.ppf(1- alpha/2, n + m -2 )))

Out[79]:

|t| value: 4.9736 
Critical point: 2.7787

z-Procedure

When the population variances are known for the two samples, we can use a z-statistic instead of a t-statistic.
A two-sided 1 − α level confidence interval for $\mu_A - \mu_B$ is given by the end-points $\overline x - \overline y \pm z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} }$
For testing $H_0: \mu_A - \mu_B = \delta$ versus $H_A: \mu_A - \mu_B \neq \delta$ we use $Z = \frac{\overline x - \overline y - \delta}{ \sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} } } \sim N(0,1)$ under $H_0$

In [86]:

# Population A
n = 38; x_bar = 5.782; sigma_x= 2.0

# Population B
m = 40; y_bar = 6.443; sigma_y= 2.0

alpha = 0.01 # = 1 - confidence level

# Calculate the Z-statistics
def Z_statistic(x_bar, y_bar, s_x, s_y, n, m, delta = 0):
    return ( (x_bar - y_bar - delta)  / math.sqrt(s_x**2/n + s_y**2/m))

Z_stat = Z_statistic(x_bar, y_bar, sigma_x, sigma_y, n, m)
print("Z-statistics: {:.4f}".format(Z_stat))
print("p-value: {:.4f} ".format(norm.cdf(Z_stat)))

def wing_span(alpha, n, m, s_x, s_y):
    return norm.ppf(1-alpha) * math.sqrt( s_x**2/n + s_y**2/m)

def print_confidence_interval(alpha, mu, wing_span, lower_bound = False, upper_bound = False):
    if upper_bound:
        print(("Confidence interval with {:.2f}% confidence level: (-∞)({:.4f})").format(((1-alpha)*100), mu + wing_span))
        return
    if lower_bound:
        print(("Confidence interval with {:.2f}% confidence level: ({:.4f})(∞))").format(((1-alpha)*100), mu - wing_span))
        return
    # Default case: double bounded
    print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*1, mu - wing_span, mu + wing_span)))

"""Hypotheses:
- $H_0 : \mu_A > \mu_B$''
- $H_A : \mu_A >= \mu_B$"""
diff = x_bar - y_bar
wing_span = wing_span(alpha, n, m, sigma_x, sigma_y)
print_confidence_interval(alpha, diff, wing_span, upper_bound = True)

Out[86]:

Z-statistics: -1.4590
p-value: 0.0723 
Confidence interval with 99.00% confidence level: (-∞)(0.3930)

Interval length

The interval length (suppose we are using the general procedure) is: $L = 2 \times t_{\alpha / 2, \nu} \sqrt{ \frac{s_A^2}{n} + \frac{s_B^2}{m}}$
Then the minimum number of samples will be (supposing n = m) $n = m \geq \frac{4t_{\alpha / 2, \nu} (s^2_A + s^2_B)}{L_0^2}$