
Chapter 10 - Discrete Data Analysis

# Useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.stats.weightstats as sms
from scipy.stats import t
from scipy.stats import norm
import math
from IPython.display import Math, display

Inferences on a Population Proportion

If a proportion $p$ of the population has the characteristic of interest, then for a random sample of size $n$ we have:

  • With characteristic: cell probability $p$; cell frequency $x$

  • Without characteristic: cell probability $1-p$; cell frequency $n-x$

Given the sample proportion $\hat{p}$:

  • $X \sim B(n,p)$

  • $\hat{p} = \frac{x}{n}$

  • $E(\hat{p}) = p$ and $Var(\hat{p}) = \frac{p(1-p)}{n}$

For large $n$ we have

  • $\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \approx N\left(p, \frac{\hat{p}(1-\hat{p})}{n}\right)$
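This approximation can be checked empirically. Below is a minimal simulation sketch; the values of $n$, $p$, and the number of replications are arbitrary choices for illustration.

# Empirical check of the normal approximation for p_hat (illustrative sketch)
rng = np.random.default_rng(0)
n, p = 200, 0.3                                # arbitrary illustration values
p_hats = rng.binomial(n, p, size=100_000) / n  # simulated sample proportions
print('mean of p_hat: {:.4f} (theory: {:.4f})'.format(p_hats.mean(), p))
print('sd of p_hat:   {:.4f} (theory: {:.4f})'.format(p_hats.std(), math.sqrt(p*(1-p)/n)))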

Confidence Intervals for Population Proportions

  • Two-sided confidence interval for a population proportion: $\left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$

  • One-sided with a lower bound: $\left( \hat{p} - z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; 1 \right)$

  • One-sided with an upper bound: $\left( 0,\; \hat{p} + z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$

    • These results are reliable if both $x$ and $n-x$ are larger than 5

# Calculating z_alpha
alpha = 0.05  # 95% confidence interval

# Calculate via Gaussian distribution
display(Math(r'z_\alpha = {:.4f}'.format(norm.ppf(1 - alpha))))          # One-sided
display(Math(r'z_{{\alpha/2}} = {:.4f}'.format(norm.ppf(1 - alpha/2))))  # Two-sided

$\displaystyle z_\alpha = 1.6449$

$\displaystyle z_{\alpha/2} = 1.9600$
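Putting the critical value together with the interval formulas above, here is a minimal sketch of a two-sided confidence interval; the counts x = 23 and n = 100 are hypothetical, and the cross-check uses statsmodels' proportion_confint with the same normal approximation.

# Two-sided 95% confidence interval for a population proportion (hypothetical counts)
x, n, alpha = 23, 100, 0.05
p_hat = x / n
half_width = norm.ppf(1 - alpha/2) * math.sqrt(p_hat*(1 - p_hat)/n)
print('{:.0f}% CI: ({:.4f}, {:.4f})'.format((1 - alpha)*100, p_hat - half_width, p_hat + half_width))

# Cross-check with statsmodels (same normal-approximation interval)
from statsmodels.stats.proportion import proportion_confint
print(proportion_confint(x, n, alpha=alpha, method='normal'))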

Hypothesis Tests on a Population Proportion

  • Two-sided tests: $H_0: p = p_0$ vs $H_A: p \neq p_0$

$\Longrightarrow p\text{-value} = 2 \times \min[P(X \geq x),\, P(X \leq x)]$, where $X \sim B(n, p_0)$

  • When $np_0$ and $n(1-p_0)$ are both larger than 5, a normal approximation may be used to compute the $p$-value: $p\text{-value} = 2 \times \Phi(-|z|)$

where

$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0,1)$
  • A continuity correction can be used for a better approximation to the $p$-value.

  • A size $\alpha$ hypothesis test rejects $H_0$ when $|z| > z_{\alpha/2}$, or equivalently when the $p$-value $< \alpha$; a numerical sketch follows
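As a quick illustration of the two-sided test, here is a minimal sketch comparing the exact binomial p-value with its normal approximation; the values x = 14, n = 100, and p0 = 0.2 are hypothetical.

# Exact two-sided binomial p-value vs. the normal approximation (hypothetical data)
from scipy.stats import binom

x, n, p0 = 14, 100, 0.2
# Exact: 2 * min[P(X >= x), P(X <= x)] under X ~ B(n, p0), capped at 1
p_exact = min(2 * min(binom.sf(x - 1, n, p0), binom.cdf(x, n, p0)), 1)
# Normal approximation
z = (x/n - p0) / math.sqrt(p0*(1 - p0)/n)
p_approx = 2 * norm.cdf(-abs(z))
print('exact p-value  = {:.4f}'.format(p_exact))
print('approx p-value = {:.4f}'.format(p_approx))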

One-sided hypothesis tests for a population proportion

  • For testing $H_0: p \geq p_0$ vs $H_A: p < p_0$

$\Longrightarrow p\text{-value} = P(X \leq x)$, where $X \sim B(n, p_0)$

  • By the normal approximation: $p\text{-value} = \Phi(z)$, where $z = \frac{x + 0.5 - np_0}{\sqrt{np_0(1-p_0)}}$

  • The $+0.5$ is the continuity correction. For the opposite test ($H_0: p \leq p_0$ vs $H_A: p > p_0$), use $-0.5$ and $p\text{-value} = 1 - \Phi(z)$, as in the example below.

"""Example: test H_0: p <= p_0 = 0.05 """ # Parameters x = 13 n = 62 p0 = 0.05 alpha = 0.05 # Check x = 13 > 5 and n − x = 62 − 13 > 5. # Calculate the z statistics as z = (x - n*p0 -0.5)/math.sqrt( n*p0*(1-p0)) # P-value print('z-statistics = {:.4f}'.format(z)) print('p-value = {:.4f}'.format(1 - norm.cdf(5.48))) # 95% one sided confidence interval p_hat = x/n; crit = norm.ppf(1-alpha) wing_span = crit*math.sqrt(( p_hat*(1-p_hat)) /n) print('{:.0f}% Confidence interval: ({:.4f}, 1)'.format(((1-alpha)*100), p_hat - wing_span))
z-statistic = 5.4775
p-value = 0.0000
95% Confidence interval: (0.1246, 1)
# Other example
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import binom_test  # in newer SciPy, use scipy.stats.binomtest

zstat, pvalue = proportions_ztest(45, 100, 0.5)
print("1 sample proportions test \n Z = %.4f, p-value = %.4f" % (zstat, pvalue))

pvalue = binom_test(8, 20, 0.5)
print("Exact binomial test \n p-value = %.4f" % pvalue)
1 sample proportions test 
 Z = -1.0050, p-value = 0.3149
Exact binomial test 
 p-value = 0.5034

Sample size calculations

  • Consider the length of a two-sided $1-\alpha$ level confidence interval: $L = 2 z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

  • If $\hat{p}$ is not available, then since $\hat{p}(1-\hat{p}) \leq \frac{1}{4}$, $L = 2 z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \leq z_{\alpha/2} \sqrt{\frac{1}{n}}$

$\Longrightarrow$ Requiring a length of at most $L$ gives $n \geq \frac{4 z^2_{\alpha/2}\, \hat{p}(1-\hat{p})}{L^2}$ (or $n \geq \frac{z^2_{\alpha/2}}{L^2}$ when $\hat{p}$ is unavailable), as in the sketch below
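A minimal sketch of this sample-size calculation; the target length L = 0.1 and the worst-case planning value 0.5 are hypothetical choices.

# Sample size for a two-sided CI of length at most L (hypothetical targets)
L_target, alpha = 0.1, 0.05
p_plan = 0.5  # worst-case planning value, since p(1-p) <= 1/4
z = norm.ppf(1 - alpha/2)
n_req = math.ceil(4 * z**2 * p_plan*(1 - p_plan) / L_target**2)
print('required sample size: n >=', n_req)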

Comparing Two Population Proportions

Confidence Intervals for the Difference Between Two Population Proportions

  • Assume $X \sim B(n, p_A)$ and $Y \sim B(m, p_B)$, and that they are independent

  • $1-\alpha$ level confidence intervals for $p_A - p_B$:

    • Two-sided:

$\hat{p}_A - \hat{p}_B \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} }$

    • One-sided with a lower bound:

$\left( \hat{p}_A - \hat{p}_B - z_{\alpha} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} },\; 1 \right)$

    • One-sided with an upper bound:

$\left( -1,\; \hat{p}_A - \hat{p}_B + z_{\alpha} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} } \right)$
  • These approximations are reasonable as long as $x$, $n-x$, $y$, and $m-y$ are all larger than 5, as in the sketch below.
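A minimal sketch of the two-sided interval for $p_A - p_B$; all counts are hypothetical.

# Two-sided 95% CI for p_A - p_B (hypothetical counts)
x, n = 42, 100  # sample A: successes, size
y, m = 30, 100  # sample B: successes, size
alpha = 0.05
p_a, p_b = x/n, y/m
se = math.sqrt(p_a*(1 - p_a)/n + p_b*(1 - p_b)/m)
z = norm.ppf(1 - alpha/2)
print('{:.0f}% CI: ({:.4f}, {:.4f})'.format((1 - alpha)*100, p_a - p_b - z*se, p_a - p_b + z*se))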

Hypothesis Tests on the Difference Between Two Population Proportions

  • For testing $H_0: p_A = p_B$ vs $H_A: p_A \neq p_B$

    • $p$-value: $2 \times \Phi(-|z|)$, where $z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n} + \frac{1}{m}\right)}}$ and the pooled estimate is $\hat{p} = \frac{x+y}{n+m}$

  • For $H_0: p_A \geq p_B$ vs $H_A: p_A < p_B \Longrightarrow p\text{-value} = \Phi(z)$

  • For $H_0: p_A \leq p_B$ vs $H_A: p_A > p_B \Longrightarrow p\text{-value} = 1 - \Phi(z)$

  • Reject $H_0$ if the $p$-value is smaller than the significance level $\alpha$; otherwise accept $H_0$

# Example
x = 4
y = 10
n = 50
m = 50
p_a = x/n
p_b = y/m

# Pooled probability
p_hat = (x + y)/(n + m)

# Test statistic
z_test = (p_a - p_b)/math.sqrt(p_hat*(1 - p_hat)*(1/n + 1/m))
print('Z = {:.4f}'.format(z_test))
print('p-value = {:.4f}'.format(2*norm.cdf(-abs(z_test))))
Z = -1.7292
p-value = 0.0838
# Other example
from statsmodels.stats.proportion import proportions_ztest

stat, pvalue = proportions_ztest([12, 8], [34, 24])  # ([x1, x2], [n1, n2])
print('2 sample test for equality of proportions \n Z = %.4f, p-value = %.4f' % (stat, pvalue))
2 sample test for equality of proportions 
 Z = 0.1547, p-value = 0.8770

Goodness of Fit Tests for One-Way Contingency Tables

One-Way Classifications

Each of $n$ observations is classified into one of $k$ categories (cells)

  • Cell frequencies: $x_1, \dots, x_k$, with $\sum_{i=1}^k x_i = n$

  • Cell probabilities: $p_1, \dots, p_k$, with $\sum_{i=1}^k p_i = 1$

  • Test $H_0: p_i = p_i^*$ for $i = 1, \dots, k$ vs $H_A$: not $H_0$. Under $H_0$, the expected cell frequency in cell $i$ is $e_i = np_i^*$

There are two test statistics:

  1. Pearson's chi-square statistic: $X^2 = \sum_{i=1}^k \frac{(x_i - e_i)^2}{e_i}$

  2. Likelihood ratio chi-square statistic: $G^2 = 2 \sum_{i=1}^{k} x_i \ln\left(\frac{x_i}{e_i}\right)$

  • Both statistics, $X^2$ and $G^2$, asymptotically follow the $\chi^2_{k-1}$ distribution

    • This asymptotic result is reasonable as long as all the $e_i$ are larger than 5 $\Longrightarrow$ if not, it is appropriate to group cells so that each expected frequency is at least 5

  • $p$-value $= P(X^2 \geq \text{obs}(X^2))$

    • Conclusion: Reject $H_0$ if the $p$-value is smaller than the significance level $\alpha$; otherwise accept $H_0$

# Example: check for homogeneity
from scipy.stats import chi2

# Data (each cell represents a formulation)
x = np.array([225, 223, 152])  # Observed frequencies
e = np.array([200, 200, 200])  # Expected frequencies

def chi2_stat(x, e):
    stat = 0
    for i in range(len(x)):
        stat += (x[i] - e[i])**2 / e[i]
    return stat

stat = chi2_stat(x, e)

# Df = k - 1
print('p-value: {:.4f}'.format(chi2.sf(stat, len(x) - 1)))
p-value: 0.0002
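The likelihood ratio statistic $G^2$ defined above can be computed for the same data (assuming the x and e arrays from the cell above are still in scope); scipy's power_divergence with lambda_='log-likelihood' performs this G-test.

# Likelihood ratio chi-square statistic G^2 for the same observed/expected frequencies
from scipy.stats import power_divergence

g_stat, g_pvalue = power_divergence(x, e, lambda_='log-likelihood')  # df = k - 1
print('G^2 = {:.4f}, p-value = {:.4f}'.format(g_stat, g_pvalue))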

Testing for Independence in Two-Way Contingency Tables

Two-Way Classifications

  • Two-way $r \times c$ contingency table:

|           | Level 1       | Level 2       | $\cdots$ | Level $j$     | $\cdots$ | Level $c$     | Total                |
|-----------|---------------|---------------|----------|---------------|----------|---------------|----------------------|
| Level 1   | $x_{11}$      | $x_{12}$      | $\cdots$ | $x_{1j}$      | $\cdots$ | $x_{1c}$      | $x_{1 \cdot}$        |
| Level 2   | $x_{21}$      | $x_{22}$      | $\cdots$ | $x_{2j}$      | $\cdots$ | $x_{2c}$      | $x_{2 \cdot}$        |
| $\vdots$  |               |               |          | $\vdots$      |          |               | $\vdots$             |
| Level $i$ | $x_{i1}$      | $x_{i2}$      | $\cdots$ | $x_{ij}$      | $\cdots$ | $x_{ic}$      | $x_{i \cdot}$        |
| $\vdots$  |               |               |          | $\vdots$      |          |               | $\vdots$             |
| Level $r$ | $x_{r1}$      | $x_{r2}$      | $\cdots$ | $x_{rj}$      | $\cdots$ | $x_{rc}$      | $x_{r \cdot}$        |
| Total     | $x_{\cdot 1}$ | $x_{\cdot 2}$ | $\cdots$ | $x_{\cdot j}$ | $\cdots$ | $x_{\cdot c}$ | $n = x_{\cdot \cdot}$ |

Testing for Independence

  • Testing for independence in a two-way contingency table

    • $H_0$: the two factors are independent vs $H_A$: not $H_0$

We have the statistics: $X^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$ and $G^2 = 2 \sum_{i=1}^r \sum_{j=1}^c x_{ij} \ln\left(\frac{x_{ij}}{e_{ij}}\right)$

where $e_{ij} = \frac{x_{i \cdot}\, x_{\cdot j}}{n}$

  • These statistics follow $\chi^2_{df}$ with $df = (r-1)(c-1)$

  • The results are valid as long as all the $e_{ij}$ are larger than 5

    • $p$-value $= P(X^2 \geq \text{obs}(X^2))$

from scipy.stats import chi2_contingency

# Build contingency table
aptocc = np.array([[31, 17, 9], [36, 9, 4], [56, 19, 15]])
aptocc = pd.DataFrame(data=aptocc, index=['A', 'B', 'C'],
                      columns=['Minor Cracking', 'Medium cracking', 'Severe cracking'])
print("Observed cell frequencies:\n", aptocc)

chi, pvalue, dof, expctd = chi2_contingency(aptocc)
print("Pearson's Chi-squared test \nX-squared = %.4f, p-value = %.4f, df = %d" % (chi, pvalue, dof))
print("Expected cell frequencies:\n",
      pd.DataFrame(expctd, index=['A', 'B', 'C'],
                   columns=['Minor Cracking', 'Medium cracking', 'Severe cracking']))
Observed cell frequencies:
    Minor Cracking  Medium cracking  Severe cracking
A               31               17                9
B               36                9                4
C               56               19               15
Pearson's Chi-squared test 
X-squared = 5.0237, p-value = 0.2849, df = 4
Expected cell frequencies:
    Minor Cracking  Medium cracking  Severe cracking
A        35.770408        13.086735         8.142857
B        30.750000        11.250000         7.000000
C        56.479592        20.663265        12.857143

Simpson’s paradox

Suppose $P(A \mid B \cap C_i) < P(A \mid B' \cap C_i)$ for $i = 1, 2, \dots, k$, with $\sum_{i=1}^k P(C_i) = 1$

$\Longrightarrow$ It is possible that $P(A \mid B) > P(A \mid B')$

This can occur, for example, when the relative sizes of the strata $C_i$ differ between the two groups; a small numeric illustration follows.
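Below is a minimal numeric sketch of such a reversal; all counts are hypothetical (successes, total) pairs per stratum.

# Hypothetical counts illustrating Simpson's paradox: within each stratum C_i,
# group B has the lower success rate, yet B is higher in aggregate
b       = {'C1': (234, 270), 'C2': (55, 80)}    # group B:  (successes, total)
b_prime = {'C1': (81, 87),   'C2': (192, 263)}  # group B': (successes, total)

for ci in ('C1', 'C2'):
    print("{}: P(A|B) = {:.3f} < P(A|B') = {:.3f}".format(
        ci, b[ci][0]/b[ci][1], b_prime[ci][0]/b_prime[ci][1]))

agg_b  = sum(c for c, t in b.values()) / sum(t for c, t in b.values())
agg_bp = sum(c for c, t in b_prime.values()) / sum(t for c, t in b_prime.values())
print("aggregate: P(A|B) = {:.3f} > P(A|B') = {:.3f}".format(agg_b, agg_bp))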