Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Probability-Statistics-Jupy…
GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-reviewing/chapter_9_comparing_two_population_means.ipynb
388 views
Kernel: Python 3
import math import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.stats.weightstats as sms from scipy import stats

Chapter 9 Comapring Two Population Means

Paired Datasets

Two paired sample sets:

  • x1,,xnx_1, \dots, x_n with μA\mu_A

  • y1,,yny_1, \dots, y_n with μB\mu_B

To test the hypothesis:

H0:μA=μB.versus.HA:μAμBH_0:\mu_A=\mu_B.versus. H_A:\mu_A\ne\mu_B

Step 1. - Create zi=xiyiz_i = x_i - y_i with μ\mu

Step 2. - Create new hypothesis H0:μ=0.versus.HA:μ0H_0:\mu=0.versus. H_A:\mu\ne 0

Step 3. - Use t-statistics method to get the confidence interval and p-value

(zˉtα/2,n1sn,zˉ+tα/2,n1sn)(\bar{z} - \frac{t_{\alpha/2, n-1}s}{\sqrt{n}}, \bar{z} + \frac{t_{\alpha/2, n-1}s}{\sqrt{n}})
# Confidence Interval # Input sample_a = np.array([23.6, 27.9, 22.9, 21.8, 25.8, 30.7, 26.5, 25.4]) sample_b = np.array([22.5, 25.6, 24.0, 20.4, 26.0, 26.6, 26.4, 22.1]) alpha = 0.1 # Calculate n = len(sample_a) z = sample_a - sample_b z_bar = z.mean() z_std = z.std(ddof=1) t = stats.t.ppf(1 - alpha / 2, n - 1) wing = t * z_std / math.sqrt(n) t_stats = math.sqrt(n) * (z_bar - 0) / z_std p_value = stats.t.sf(t_stats, n-1) # Output print('Confidence Interval\t({:.4f}, {:.4f})'.format(z_bar - wing, z_bar + wing)) print('P-Value\t\t\t{:.4f}'.format(p_value))
Confidence Interval (0.1796, 2.5704) P-Value 0.0329

Unpaired Datasets

Independent Samples

General Method

PopulationSamplesSizeMeanStandard deviation
Population Ax1,x2,,xnx_1, x_2,\dots, x_nnx\overline xsxs_x
Population By1,y2,,ymy_1, y_2,\dots, y_mmy\overline ysys_y

Point estimate:

μAμB=xy\mu_A - \mu_B = \overline x - \overline y

se(xy)=sx2n+sx2mse(\overline x - \overline y) = \sqrt{\frac{s^2_x}{n} + \frac{s^2_x}{m}}

se(xy)=sp1n+1mse(\overline x - \overline y) = s_p \sqrt{\frac{1}{n} + \frac{1}{m}}

where sp2=(n1)sx2+(m1)sy2n+m2s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}

Two-sided 1α1-\alpha level confidence interval for μAμB\mu_A-\mu_B is:

(xˉyˉtα/2,vsx2n+sy2m,xˉyˉ+tα/2,vsx2n+sy2m)(\bar{x} - \bar{y} - t_{\alpha/2, v}\sqrt{\frac{s_x^2}{n}+\frac{s_y^2}{m}}, \bar{x} - \bar{y} + t_{\alpha/2, v}\sqrt{\frac{s_x^2}{n}+\frac{s_y^2}{m}})

Where the degree of freedom:

v=(sx2/n+sy2/m)2sx4/n2(n1)+sy4/m2(m1)v^* = \frac{ (s_x^2 / n + s_y^2 /m ) ^2}{ s_x^4/ n^2(n-1) + s_y^4 / m^2(m-1) }

For PValueP-Value of H0:μAμB=δ.versus.HA:μAμBδH_0:\mu_A - \mu_B = \delta.versus.H_A:\mu_A-\mu_B\ne\delta:

T=(xˉyˉδ)sx2n+sy2mT = \frac{( \bar{x} - \bar{y} - \delta)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}
# Calcualte confidence interval # Input n = 14 m = 14 x_bar = 32.45 y_bar = 41.45 s_x = 4.3 s_y = 5.23 alpha = 0.01 # Calculate v = (s_x ** 2 / n + s_y ** 2 / m) ** 2 / (s_x ** 4 / (n ** 2 * (n - 1)) + s_y ** 4 / (m ** 2 * (m - 1))) t = stats.t.ppf(1 - alpha / 2, v) wing = t * math.sqrt(s_x ** 2 / n + s_y ** 2 / m) # Output print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))
Confidence Interval (-14.0430, -3.9570)
# Calculate P-Value # Input n = 14 m = 14 x_bar = 32.45 y_bar = 41.45 s_x = 4.3 s_y = 5.23 delta = 0 # Calculate v = (s_x ** 2 / n + s_y ** 2 / m) ** 2 / (s_x ** 4 / (n ** 2 * (n - 1)) + s_y ** 4 / (m ** 2 * (m - 1))) t = (x_bar - y_bar - delta) / math.sqrt(s_x ** 2 / n + s_y ** 2 / m) p_value = 2 * stats.t.sf(abs(t), v) # Output print('|t|\t{:.4f}'.format(abs(t))) print('P-Value\t{:.4f}'.format(p_value))
|t| 4.9736 P-Value 0.0000

Pool Method

Unbiased estimate of σ^2\hat{\sigma}^2 of σ2\sigma^2: sp2=(n1)sx2+(m1)sy2n+m2s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}

Two-sided 1α1-\alpha level confidence interval for μAμB\mu_A-\mu_B is:

(xˉyˉtα/2,m+n2sp1n+1m,xˉyˉ+tα/2,m+n2sp1n+1m)(\bar{x} - \bar{y} - t_{\alpha/2, m+n-2}s_p\sqrt{\frac{1}{n}+\frac{1}{m}}, \bar{x} - \bar{y} + t_{\alpha/2, m+n-2}s_p\sqrt{\frac{1}{n}+\frac{1}{m}})

For PValueP-Value of H0:μAμB=δ.versus.HA:μAμBδH_0:\mu_A - \mu_B = \delta.versus.H_A:\mu_A-\mu_B\ne\delta:

T=(xˉyˉδ)sp1n+1mT = \frac{( \bar{x} - \bar{y} - \delta)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}
# Calcualte confidence interval # Input n = 14 m = 14 x_bar = 32.45 y_bar = 41.45 s_x = 4.3 s_y = 5.23 alpha = 0.01 # Calculate sp_square = ((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2) t = stats.t.ppf(1 - alpha / 2, m + n - 2) wing = t * math.sqrt(sp_square) * math.sqrt(1 / n + 1 / m) # Output print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))
Confidence Interval (-14.0282, -3.9718)
# Calculate P-Value # Input n = 14 m = 14 x_bar = 32.45 y_bar = 41.45 s_x = 4.3 s_y = 5.23 delta = 0 # Calculate sp_square = ((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2) t = (x_bar - y_bar - delta) / (math.sqrt(sp_square) * math.sqrt(1 / n + 1 / m)) p_value = 2 * stats.t.sf(abs(t), n + m - 2) # Output print('|t|\t{:.4f}'.format(abs(t))) print('P-Value\t{:.4f}'.format(p_value))
|t| 4.9736 P-Value 0.0000

Z-test Method

If those two variances are known.

Two-sided 1α1-\alpha level confidence interval for μAμB\mu_A-\mu_B is:

(xyzα/2σA2n+σB2m,(xy+zα/2σA2n+σB2m)(\overline x - \overline y - z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}}, (\overline x - \overline y + z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}})

For PValueP-Value of H0:μAμB=δ.versus.HA:μAμBδH_0:\mu_A - \mu_B = \delta.versus.H_A:\mu_A-\mu_B\ne\delta:

Z=xyδσA2n+σB2mN(0,1)Z = \frac{\overline x - \overline y - \delta}{ \sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} } } \sim N(0,1)
# Calcualte confidence interval # Input n = 38 m = 40 x_bar = 5.782 y_bar = 6.443 sigma_x = 2 sigma_y = 2 alpha = 0.01 # Calculate z = stats.norm.ppf(1 - alpha / 2) wing = z * math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m) # Output print('Confidence Interval\t({:.4f}, {:.4f})'.format(x_bar - y_bar - wing, x_bar - y_bar + wing))
Confidence Interval (-1.7150, 0.3930)
# Calculate P-Value # Input n = 38 m = 40 x_bar = 5.782 y_bar = 6.443 sigma_x = 2 sigma_y = 2 delta = 0 # Calculate z = (x_bar - y_bar - delta) / math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m) p_value = stats.norm.cdf(z) # Output print('P-Value\t{:.4f}'.format(p_value))
P-Value 0.0723