Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Probability-Statistics-Jupy…
GitHub Repository: Probability-Statistics-Jupyter-Notebook/probability-statistics-notebook
Path: blob/master/notebook-for-learning/Chapter-9-Comparing-Two-Population-Means.ipynb
388 views
Kernel: Python 3
# Useful libraries import numpy as np import pandas as pd import matplotlib.pyplot as plt from scipy import stats import statsmodels.stats.weightstats as sms from scipy.stats import t from scipy.stats import norm import math from IPython.display import Math, display

Chapter 9 - Comparing Two Population Means

Introduction

Two Sample Problems

Suppose we have:

  • Set of data observations x1,…,xnx_1, \dots, x_n from a population A with cumulative dist. FA(x)F_A(x)

  • Set of data observations y1,…,ymy_1, \dots, y_m from a population B with cumulative dist. FB(x)F_B(x)

⟹\Longrightarrow goal: compare the means μA\mu_A and μB\mu_B

Common practices

For avoiding biases, we can use:

  • Randomization

  • Placebo, blind and double blind experiments

Testing

Consider testing H0:μ=μ0H_0: \mu = \mu_0 versus HA:μ≠μ0H_A: \mu \neq \mu_0

⟹\Longrightarrow the pp-value can be found in the same way as for one-sample problems

Paired Samples Versus Independent Samples

  • Data from paired samples are of the form (x1,y1),(x2,y2),…,xn,yn)(x_1, y_1), (x_2, y_2), \dots, x_n, y_n) from each nn experimental subjects (i.e. test heart rate reduction in each patient)

  • Blocking: keep out all unwanted influences

  • Comparison is done via the pairwise differences zi=xi−yi,1≤i≤nz_i = x_i - y_i, 1 \leq i \leq n

Analysis of Paired Samples

Methodology

  • Data: (x1,y1),(x2,y2),…,xn,yn)(x_1, y_1), (x_2, y_2), \dots, x_n, y_n) ⟹zi=xi−yi,1≤i≤n\Longrightarrow z_i = x_i - y_i, 1 \leq i \leq n

  • xi=μA+γi+ϵIAx_i = \mu_A + \gamma_i + \epsilon_I^A

  • yi=μB+γi+ϵIBy_i = \mu_B + \gamma_i + \epsilon_I^B

where μA\mu_A (or μB\mu_B) are the effects of the treatments A or B, γi\gamma_i effects by subject I, ϵiA∼N(0,σA2)\epsilon_i^A \sim N(0, \sigma^2_A) are measurement errors for subject I under treatment i.e. A (similarly for B)

  • ⟹zi=μA−μB+ϵiAB\Longrightarrow z_i = \mu_A - \mu_B + \epsilon_i^{AB} are observations from a distribution with mean μ=μA−μB\mu = \mu_A - \mu_B

# Parameter data_A = np.array([23.6, 27.9, 22.9, 21.8, 25.8, 30.7, 26.5, 25.4]) data_B = np.array([22.5, 25.6, 24.0, 20.4, 26.0, 26.6, 26.4, 22.1]) z = data_A - data_B # Mean z_bar = z.mean(); s = z.std(ddof=1) # Same as pandas print('Mean = {:.4f}\nStandard deviation = {:.4f}'.format(z_bar, s)) """ We test $H_0 : \mu \leq 0$ $H_A : \mu > 0$ """ n = len(z) mu = 0 t_stat = math.sqrt(n)*(z_bar - mu)/s print('t-statistics: {:.4f}'.format(t_stat)) print(("p-value = {:.4f}".format(t.sf(t_stat, n-1))))
Mean = 1.3750 Standard deviation = 1.7847 t-statistics: 2.1792 p-value = 0.0329

Analysis of Independent Samples

PopulationSamplesSizeMeanStandard deviation
Population Ax1,x2,…,xnx_1, x_2,\dots, x_nnx‾\overline xsxs_x
Population By1,y2,…,ymy_1, y_2,\dots, y_mmy‾\overline ysys_y

Point estimate μA−μB=x‾−y‾\mu_A - \mu_B = \overline x - \overline y Standard error se(x‾−y‾)=σA2n+σB2mse(\overline x - \overline y) = \sqrt{\frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m}}

Assume σA2\sigma_A^2 and σB2\sigma_B^2 are unknown. Then we have:

  • General procedure: se(x‾−y‾)=sx2n+sx2mse(\overline x - \overline y) = \sqrt{\frac{s^2_x}{n} + \frac{s^2_x}{m}}

  • Pooled variance procedure: se(x‾−y‾)=sp1n+1mse(\overline x - \overline y) = s_p \sqrt{\frac{1}{n} + \frac{1}{m}} where sp2=(n−1)sx2+(m−1)sy2n+m−2s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}

  • When the variances are known, we use a two-sample z-test.

General Procedure (Smith-Satterthwaite test)

  • We use the statistics T=xˉ−yˉ−(μA−μB)sx2n+sy2mT = \frac{ \bar{x} - \bar{y} - (\mu_A - \mu_B)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}

  • This statistic follows approximately t-distribution with the d.f. ν\nu as the largest integer not larger than v∗=(sx2/n+sy2/m)2sx4/n2(n−1)+sy4/m2(m−1)v^* = \frac{ (s_x^2 / n + s_y^2 /m ) ^2}{ s_x^4/ n^2(n-1) + s_y^4 / m^2(m-1) }

  • Two sided 1−α1-\alpha level of confidence iunterval for μA−μB\mu_A - \mu_B is given by x‾−y‾±tα/2,νsx2n+sy2m\overline x - \overline y \pm t_{\alpha/2, \nu} \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}

# Population A n = 14; x_bar = 32.45; s_x = 4.30 # Population B m = 14; y_bar = 41.45; s_y = 5.23 alpha = 0.01 # = 1 - confidence level def degrees_of_freedom(n, m, s_x, s_y): nu = ((s_x**2/n + s_y**2/m)**2) / ( s_x**4 / (n**2*(n-1)) + s_y**4 / (m**2*(m-1))) return math.floor(nu) def wing_span(alpha, nu, n, m, s_x, s_y): return t.ppf(1-alpha/2, nu) * math.sqrt( s_x**2/n + s_y**2/m) def print_confidence_interval(alpha, mu, wing_span): print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*100), mu - wing_span, mu + wing_span)) # Calculations diff = x_bar - y_bar nu = degrees_of_freedom( n, m, s_x, s_y) wing_span = wing_span(alpha, nu, n, m, s_x, s_y) display(Math('v = {}'.format(nu))) print_confidence_interval(alpha, diff, wing_span)

v=25\displaystyle v = 25

Confidence interval with 99.00% confidence level: (-14.0440)(-3.9560)
  • For testing H0:μA−μB=δH_0: \mu_A - \mu_B = \delta vs HA:μA−μB≠δH_A: \mu_A - \mu_B \neq \delta the t-statistic is T=(xˉ−yˉ−δ)sx2n+sy2mT = \frac{( \bar{x} - \bar{y} - \delta)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}

"""Hypotheses: - $H_0 : \mu_A = \mu_B$ - $H_A : \mu_A \neq \mu_B$""" # Calculate the t-statistics general def T_statistic_general(x_bar, y_bar, s_x, s_y, n, m, delta= 0): return ( (x_bar - y_bar - delta) / math.sqrt(s_x**2/n + s_y**2/m)) t_stat = T_statistic_general(x_bar, y_bar, s_x, s_y, n, m, delta= 0) print("|t| value: {:.4f} ".format(abs(t_stat))) print("Critical point: {:.4f}".format(t.ppf(1- alpha/2, n + m -2 )))
|t| value: 4.9736 Critical point: 2.7787

Pooled Variance Procedure

  • Assume σA2=σB2=σ2\sigma_A^2 = \sigma_B^2 = \sigma^2

  • Unbiased estimate of σ^2\hat{\sigma}^2 of σ2\sigma^2 is given by: sp2=(n−1)sx2+(m−1)sy2n+m−2s_p^2 = \frac{(n-1)s_x^2 + (m-1) s_y^2}{n+m-2}

  • t-statistics becomes T=x‾−y‾−(μA−μB)sp1n+1m∼tn+m−2T = \frac{ \overline x - \overline y - (\mu_A - \mu_B)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}} } \sim t_{n+m-2}

  • A two-sided 1−α1-\alpha level confidence interval for μA−μB\mu_A - \mu_B is given by the end-points x‾−y‾±tα/2,n+m−2sp1n+1m\overline x - \overline y \pm t_{\alpha/2, n+m-2} s_p \sqrt{ \frac{1}{n} + \frac{1}{m} }

# Population A n = 14; x_bar = 32.45; s_x = 4.30 # Population B m = 14; y_bar = 41.45; s_y = 5.23 alpha = 0.01 # = 1 - confidence level # Pooled variance def pooled_variance(n, m, s_x, s_y): return ( ((n-1)*s_x**2 + (m-1)*s_y**2) / (n+m-2)) def wing_span(s_p, n, m, alpha): return t.ppf(1-alpha/2, n+m-2) * s_p*math.sqrt(1/n + 1/m) def print_confidence_interval(alpha, mu, wing_span): print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*100), mu - wing_span, mu + wing_span)) diff = x_bar - y_bar s_p = math.sqrt(pooled_variance(n, m, s_x, s_y)) wing_span = wing_span(s_p, n, m, alpha) print_confidence_interval(alpha, diff, wing_span)
Confidence interval with 99.00% confidence level: (-14.0282)(-3.9718)
  • For testing H0:μA−μB=δH_0: \mu_A - \mu_B = \delta vs HA:μA−μB≠δH_A: \mu_A - \mu_B \neq \delta the t-statistic is T=(xˉ−yˉ−δ)sp1n+1mT = \frac{( \bar{x} - \bar{y} - \delta)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}

"""Hypotheses: - $H_0 : \mu_A = \mu_B$ - $H_A : \mu_A \neq \mu_B$""" # Calculate the t-statistics def T_statistic_pooled(x_bar, y_bar, s_p, n, m, delta= 0): return ( (x_bar - y_bar - delta) / (s_p * math.sqrt(1/n + 1/m))) t_stat = T_statistic_pooled(x_bar, y_bar, s_p, n, m, delta= 0) print("|t| value: {:.4f} ".format(abs(t_stat))) print("Critical point: {:.4f}".format(t.ppf(1- alpha/2, n + m -2 )))
|t| value: 4.9736 Critical point: 2.7787

z-Procedure

  • When the population variances are known for the two samples, we can use a z-statistic instead of a t-statistic.

  • A two-sided 1 − α level confidence interval for μA−μB\mu_A - \mu_B is given by the end-points x‾−y‾±zα/2σA2n+σB2m\overline x - \overline y \pm z_{\alpha/2}\sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} }

  • For testing H0:μA−μB=δH_0: \mu_A - \mu_B = \delta versus HA:μA−μB≠δH_A: \mu_A - \mu_B \neq \delta we use Z=x‾−y‾−δσA2n+σB2m∼N(0,1)Z = \frac{\overline x - \overline y - \delta}{ \sqrt{ \frac{\sigma^2_A}{n} + \frac{\sigma^2_B}{m} } } \sim N(0,1) under H0H_0

# Population A n = 38; x_bar = 5.782; sigma_x= 2.0 # Population B m = 40; y_bar = 6.443; sigma_y= 2.0 alpha = 0.01 # = 1 - confidence level # Calculate the Z-statistics def Z_statistic(x_bar, y_bar, s_x, s_y, n, m, delta = 0): return ( (x_bar - y_bar - delta) / math.sqrt(s_x**2/n + s_y**2/m)) Z_stat = Z_statistic(x_bar, y_bar, sigma_x, sigma_y, n, m) print("Z-statistics: {:.4f}".format(Z_stat)) print("p-value: {:.4f} ".format(norm.cdf(Z_stat))) def wing_span(alpha, n, m, s_x, s_y): return norm.ppf(1-alpha) * math.sqrt( s_x**2/n + s_y**2/m) def print_confidence_interval(alpha, mu, wing_span, lower_bound = False, upper_bound = False): if upper_bound: print(("Confidence interval with {:.2f}% confidence level: (-∞)({:.4f})").format(((1-alpha)*100), mu + wing_span)) return if lower_bound: print(("Confidence interval with {:.2f}% confidence level: ({:.4f})(∞))").format(((1-alpha)*100), mu - wing_span)) return # Default case: double bounded print(("Confidence interval with {:.2f}% confidence level: ({:.4f})({:.4f})").format(((1-alpha)*1, mu - wing_span, mu + wing_span))) """Hypotheses: - $H_0 : \mu_A > \mu_B$'' - $H_A : \mu_A >= \mu_B$""" diff = x_bar - y_bar wing_span = wing_span(alpha, n, m, sigma_x, sigma_y) print_confidence_interval(alpha, diff, wing_span, upper_bound = True)
Z-statistics: -1.4590 p-value: 0.0723 Confidence interval with 99.00% confidence level: (-∞)(0.3930)

Interval length

  • The interval length (suppose we are using the general procedure) is: L=2×tα/2,νsA2n+sB2mL = 2 \times t_{\alpha / 2, \nu} \sqrt{ \frac{s_A^2}{n} + \frac{s_B^2}{m}}

  • Then the minimum number of samples will be (supposing n = m) n=m≥4tα/2,ν(sA2+sB2)L02n = m \geq \frac{4t_{\alpha / 2, \nu} (s^2_A + s^2_B)}{L_0^2}