GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-reviewing/chapter_9_comparing_two_population_means.ipynb
Kernel: Python 3

In [2]:

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.stats.weightstats as sms

from scipy import stats

Chapter 9 Comparing Two Population Means

Paired Datasets

Two paired samples of equal size $n$:

$x_1, \dots, x_n$ from population A with mean $\mu_A$
$y_1, \dots, y_n$ from population B with mean $\mu_B$

To test the hypothesis:

$$H_0:\mu_A=\mu_B \quad \text{versus} \quad H_A:\mu_A\ne\mu_B$$

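Step 1. - Form the differences $z_i = x_i - y_i$, which have mean $\mu = \mu_A - \mu_B$

Step 2. - Restate the hypotheses as $H_0:\mu=0$ versus $H_A:\mu\ne 0$

Step 3. - Apply the one-sample t-procedure to $z_1, \dots, z_n$ to get the confidence interval and p-value. With $\bar{z}$ the sample mean and $s$ the sample standard deviation of the differences, the two-sided $1-\alpha$ level confidence interval for $\mu$ is:

$$\left(\bar{z} - \frac{t_{\alpha/2, n-1}\,s}{\sqrt{n}},\ \bar{z} + \frac{t_{\alpha/2, n-1}\,s}{\sqrt{n}}\right)$$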

In [7]:

# Confidence Interval

# Input
sample_a = np.array([23.6, 27.9, 22.9, 21.8, 25.8, 30.7, 26.5, 25.4])
sample_b = np.array([22.5, 25.6, 24.0, 20.4, 26.0, 26.6, 26.4, 22.1])
alpha = 0.1

# Calculate
n = len(sample_a)
z = sample_a - sample_b
z_bar = z.mean()
z_std = z.std(ddof=1)

t = stats.t.ppf(1 - alpha / 2, n - 1)
wing = t * z_std / math.sqrt(n)

# Two-sided p-value, matching H_A: mu != 0
t_stats = math.sqrt(n) * (z_bar - 0) / z_std
p_value = 2 * stats.t.sf(abs(t_stats), n - 1)

# Output
print('Confidence Interval\t({:.4f}, {:.4f})'.format(z_bar - wing, z_bar + wing))
print('P-Value\t\t\t{:.4f}'.format(p_value))

Out[7]:

Confidence Interval	(0.1796, 2.5704)
P-Value			0.0657
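
As a cross-check on the manual calculation above, SciPy can run the same paired procedure directly on the two samples. The cell below is a minimal sketch that reuses the data from the previous cell; the names `result`, `low`, and `high` are illustrative. Note that `stats.ttest_rel` reports a two-sided p-value by default, and `stats.sem` uses the sample standard deviation (`ddof=1`).

```python
import numpy as np
from scipy import stats

# Same paired data as in the cell above
sample_a = np.array([23.6, 27.9, 22.9, 21.8, 25.8, 30.7, 26.5, 25.4])
sample_b = np.array([22.5, 25.6, 24.0, 20.4, 26.0, 26.6, 26.4, 22.1])
alpha = 0.1

# Paired t-test of H0: mu_A = mu_B (two-sided by default)
result = stats.ttest_rel(sample_a, sample_b)

# 1 - alpha confidence interval for the mean of the differences
z = sample_a - sample_b
low, high = stats.t.interval(1 - alpha, len(z) - 1, loc=z.mean(), scale=stats.sem(z))

print('t statistic\t\t{:.4f}'.format(result.statistic))
print('Two-sided P-Value\t{:.4f}'.format(result.pvalue))
print('Confidence Interval\t({:.4f}, {:.4f})'.format(low, high))
```

The interval should agree with the one computed by hand, since both use the same $t_{\alpha/2, n-1}$ critical point and the same standard error $s/\sqrt{n}$.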

Unpaired Datasets

Independent Samples

General Method

Population	Samples	Size	Mean	Standard deviation
Population A	$x_1, x_2,\dots, x_n$	n	$\overline x$	$s_x$
Population B	$y_1, y_2,\dots, y_m$	m	$\overline y$	$s_y$

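Point estimate of $\mu_A - \mu_B$:

$\hat{\mu}_A - \hat{\mu}_B = \overline x - \overline y$

Standard error when the two population variances may differ (general method):

$se(\overline x - \overline y) = \sqrt{\frac{s^2_x}{n} + \frac{s^2_y}{m}}$

Standard error when the two population variances are assumed equal (pooled method):

$se(\overline x - \overline y) = s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$

where $s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}$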

Two-sided $1-\alpha$ level confidence interval for $\mu_A-\mu_B$ is:

$$\left(\bar{x} - \bar{y} - t_{\alpha/2, \nu}\sqrt{\frac{s_x^2}{n}+\frac{s_y^2}{m}},\ \bar{x} - \bar{y} + t_{\alpha/2, \nu}\sqrt{\frac{s_x^2}{n}+\frac{s_y^2}{m}}\right)$$

where the degrees of freedom $\nu$ (the variable `v` in the code) are:

$$\nu = \frac{\left(\frac{s_x^2}{n} + \frac{s_y^2}{m}\right)^2}{\frac{s_x^4}{n^2(n-1)} + \frac{s_y^4}{m^2(m-1)}}$$
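
The degrees-of-freedom formula above (the Welch-Satterthwaite approximation) can be wrapped in a small helper. The sketch below is illustrative only: `welch_df` is a hypothetical name, not a library function, and the inputs are the summary statistics used in this section's example cells. A handy sanity check is that the result always lies between $\min(n-1, m-1)$ and $n+m-2$.

```python
def welch_df(s_x, n, s_y, m):
    """Approximate degrees of freedom for the unequal-variance t-procedure."""
    a = s_x ** 2 / n  # estimated variance of x_bar
    b = s_y ** 2 / m  # estimated variance of y_bar
    return (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))


v = welch_df(s_x=4.3, n=14, s_y=5.23, m=14)

# The approximation is bounded by the smaller sample's df and the pooled df
assert min(14 - 1, 14 - 1) <= v <= 14 + 14 - 2
print('Degrees of freedom\t{:.4f}'.format(v))
```

The value is generally not an integer; `stats.t.ppf` and `stats.t.sf` accept fractional degrees of freedom, which is how the cells in this section use it.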

For the p-value of $H_0:\mu_A - \mu_B = \delta$ versus $H_A:\mu_A-\mu_B\ne\delta$, the test statistic is:

$$T = \frac{\bar{x} - \bar{y} - \delta}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}$$

In [14]:

# Calculate confidence interval

# Input
n = 14
m = 14

x_bar = 32.45
y_bar = 41.45

s_x = 4.3
s_y = 5.23

alpha = 0.01

# Calculate
v = (s_x ** 2 / n + s_y ** 2 / m) ** 2 / (s_x ** 4 / (n ** 2 * (n - 1)) + s_y ** 4 / (m ** 2 * (m - 1)))
t = stats.t.ppf(1 - alpha / 2, v)

wing = t * math.sqrt(s_x ** 2 / n + s_y ** 2 / m)

# Output
print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))

Out[14]:

Confidence Interval	(-14.0430, -3.9570)

In [12]:

# Calculate P-Value

# Input
n = 14
m = 14

x_bar = 32.45
y_bar = 41.45

s_x = 4.3
s_y = 5.23

delta = 0

# Calculate
v = (s_x ** 2 / n + s_y ** 2 / m) ** 2 / (s_x ** 4 / (n ** 2 * (n - 1)) + s_y ** 4 / (m ** 2 * (m - 1)))
t =  (x_bar - y_bar - delta) / math.sqrt(s_x ** 2 / n + s_y ** 2 / m)
p_value = 2 * stats.t.sf(abs(t), v)

# Output
print('|t|\t{:.4f}'.format(abs(t)))
print('P-Value\t{:.4f}'.format(p_value))

Out[12]:

|t|	4.9736
P-Value	0.0000
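
The general-method test above can be cross-checked with `scipy.stats.ttest_ind_from_stats`, which works from summary statistics rather than raw data. This is a minimal sketch under the assumption that $\delta = 0$, since that function only tests equality of the two means; passing `equal_var=False` selects the unequal-variance procedure with the approximate degrees of freedom.

```python
from scipy import stats

# Summary statistics from the cells above
x_bar, s_x, n = 32.45, 4.3, 14
y_bar, s_y, m = 41.45, 5.23, 14

# equal_var=False selects the unequal-variance (Welch) t-test
result = stats.ttest_ind_from_stats(
    mean1=x_bar, std1=s_x, nobs1=n,
    mean2=y_bar, std2=s_y, nobs2=m,
    equal_var=False,
)

print('|t|\t{:.4f}'.format(abs(result.statistic)))
print('P-Value\t{:.6f}'.format(result.pvalue))
```

Printing the p-value with six decimals instead of four makes it visible that it is small but not exactly zero.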

Pooled Method

When the two populations are assumed to share a common variance $\sigma^2$, its unbiased estimate $\hat{\sigma}^2$ is the pooled variance: $s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}$

Two-sided $1-\alpha$ level confidence interval for $\mu_A-\mu_B$ is:

$$\left(\bar{x} - \bar{y} - t_{\alpha/2, m+n-2}\,s_p\sqrt{\frac{1}{n}+\frac{1}{m}},\ \bar{x} - \bar{y} + t_{\alpha/2, m+n-2}\,s_p\sqrt{\frac{1}{n}+\frac{1}{m}}\right)$$

For the p-value of $H_0:\mu_A - \mu_B = \delta$ versus $H_A:\mu_A-\mu_B\ne\delta$, the test statistic is:

$$T = \frac{\bar{x} - \bar{y} - \delta}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$$

In [16]:

# Calculate confidence interval

# Input
n = 14
m = 14

x_bar = 32.45
y_bar = 41.45

s_x = 4.3
s_y = 5.23

alpha = 0.01

# Calculate
sp_square = ((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2)
t = stats.t.ppf(1 - alpha / 2, m + n - 2)

wing = t * math.sqrt(sp_square) * math.sqrt(1 / n + 1 / m)

# Output
print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))

Out[16]:

Confidence Interval	(-14.0282, -3.9718)

In [17]:

# Calculate P-Value

# Input
n = 14
m = 14

x_bar = 32.45
y_bar = 41.45

s_x = 4.3
s_y = 5.23

delta = 0

# Calculate
sp_square = ((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2)
t =  (x_bar - y_bar - delta) / (math.sqrt(sp_square) * math.sqrt(1 / n + 1 / m))
p_value = 2 * stats.t.sf(abs(t), n + m - 2)

# Output
print('|t|\t{:.4f}'.format(abs(t)))
print('P-Value\t{:.4f}'.format(p_value))

Out[17]:

|t|	4.9736
P-Value	0.0000
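
The pooled statistic here equals the general-method statistic for the same data, and that is not a coincidence: when $n = m$, $s_p^2\left(\frac{1}{n}+\frac{1}{n}\right) = \frac{s_x^2 + s_y^2}{n}$, so the two standard errors coincide and only the degrees of freedom differ. The sketch below checks this numerically and cross-checks the pooled test with `scipy.stats.ttest_ind_from_stats` (again assuming $\delta = 0$); the names `se_pooled` and `se_general` are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the cells above
x_bar, s_x, n = 32.45, 4.3, 14
y_bar, s_y, m = 41.45, 5.23, 14

# equal_var=True (the default) selects the pooled-variance t-test
pooled = stats.ttest_ind_from_stats(x_bar, s_x, n, y_bar, s_y, m, equal_var=True)

# With n == m the pooled and unpooled standard errors are algebraically equal
sp_square = ((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2)
se_pooled = math.sqrt(sp_square) * math.sqrt(1 / n + 1 / m)
se_general = math.sqrt(s_x ** 2 / n + s_y ** 2 / m)

print('|t|\t\t{:.4f}'.format(abs(pooled.statistic)))
print('P-Value\t\t{:.6f}'.format(pooled.pvalue))
print('se (pooled)\t{:.6f}'.format(se_pooled))
print('se (general)\t{:.6f}'.format(se_general))
```

With unequal sample sizes the two standard errors generally differ, and the choice between the methods then depends on whether the equal-variance assumption is reasonable.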

Z-test Method

Use this method when the two population variances $\sigma_A^2$ and $\sigma_B^2$ are known.

Two-sided $1-\alpha$ level confidence interval for $\mu_A-\mu_B$ is:

$$\left(\overline x - \overline y - z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}},\ \overline x - \overline y + z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}}\right)$$

For the p-value of $H_0:\mu_A - \mu_B = \delta$ versus $H_A:\mu_A-\mu_B\ne\delta$, the test statistic is:

$$Z = \frac{\overline x - \overline y - \delta}{ \sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} } } \sim N(0,1)$$

In [21]:

# Calculate confidence interval

# Input
n = 38
m = 40

x_bar = 5.782
y_bar = 6.443

sigma_x = 2
sigma_y = 2

alpha = 0.01

# Calculate
z = stats.norm.ppf(1 - alpha / 2)
wing = z * math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m)

# Output
print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))

Out[21]:

Confidence Interval	(-1.8280, 0.5060)

In [23]:

# Calculate P-Value

# Input
n = 38
m = 40

x_bar = 5.782
y_bar = 6.443

sigma_x = 2
sigma_y = 2

delta = 0

# Calculate
z = (x_bar - y_bar - delta) / math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m)
# Two-sided p-value, matching H_A: mu_A - mu_B != delta
p_value = 2 * stats.norm.sf(abs(z))

# Output
print('P-Value\t{:.4f}'.format(p_value))

Out[23]:

P-Value	0.1446
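
The two z-test cells above can be combined into one reusable function. This is a sketch under the stated assumption that both population standard deviations are known; `two_sample_z` and its parameter names are hypothetical, not part of SciPy or statsmodels. It returns the z statistic, the two-sided p-value, and the $1-\alpha$ confidence interval.

```python
import math
from scipy import stats


def two_sample_z(x_bar, y_bar, sigma_x, sigma_y, n, m, delta=0.0, alpha=0.05):
    """Two-sided z-test and 1 - alpha confidence interval with known sigmas."""
    se = math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m)
    z = (x_bar - y_bar - delta) / se
    p_value = 2 * stats.norm.sf(abs(z))  # both tails, matching H_A: !=
    wing = stats.norm.ppf(1 - alpha / 2) * se
    return z, p_value, (x_bar - y_bar - wing, x_bar - y_bar + wing)


# Same inputs as the cells above
z, p_value, (low, high) = two_sample_z(5.782, 6.443, 2, 2, 38, 40, alpha=0.01)

print('z\t\t\t{:.4f}'.format(z))
print('Two-sided P-Value\t{:.4f}'.format(p_value))
print('Confidence Interval\t({:.4f}, {:.4f})'.format(low, high))
```

Returning the interval and the p-value together makes their duality easy to check: the $1-\alpha$ interval contains $\delta$ exactly when the two-sided p-value is at least $\alpha$.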
