
Chapter 10 - Discrete Data Analysis

# Useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.stats.weightstats as sms
from scipy.stats import t
from scipy.stats import norm
import math
from IPython.display import Math, display

Inferences on a Population Proportion

If a proportion $p$ of the population has the characteristic of interest, then for a random sample of size $n$ we have:

  • With characteristic: cell probability $p$; cell frequency $x$

  • Without characteristic: cell probability $1-p$; cell frequency $n-x$

Given the sample proportion $\hat{p}$:

  • $X \sim B(n,p)$

  • $\hat{p} = \frac{x}{n}$

  • $E(\hat{p}) = p$ and $Var(\hat{p}) = \frac{p(1-p)}{n}$

For large $n$ we have

  • $\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \approx N\left(p, \frac{\hat{p}(1-\hat{p})}{n}\right)$
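This approximation can be checked empirically. Below is a minimal simulation sketch; the values of $n$, $p$, and the number of replications are arbitrary choices for illustration.

# Empirical check of the normal approximation for p_hat (illustrative sketch)
rng = np.random.default_rng(0)
n, p = 200, 0.3                                # arbitrary illustration values
p_hats = rng.binomial(n, p, size=100_000) / n  # simulated sample proportions
print('mean of p_hat: {:.4f} (theory: {:.4f})'.format(p_hats.mean(), p))
print('sd of p_hat:   {:.4f} (theory: {:.4f})'.format(p_hats.std(), math.sqrt(p*(1-p)/n)))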

Confidence Intervals for Population Proportions

  • Two-sided confidence interval for a population proportion: $\left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$

  • One-sided with a lower bound: $\left( \hat{p} - z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; 1 \right)$

  • One-sided with an upper bound: $\left( 0,\; \hat{p} + z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$

    • These results are reliable if both $x$ and $n-x$ are larger than 5

# Calculating z_alpha
alpha = 0.05  # 95% confidence interval

# Calculate via Gaussian distribution
display(Math(r'z_\alpha = {:.4f}'.format(norm.ppf(1 - alpha))))          # One-sided
display(Math(r'z_{{\alpha/2}} = {:.4f}'.format(norm.ppf(1 - alpha/2))))  # Two-sided

$\displaystyle z_\alpha = 1.6449$

$\displaystyle z_{\alpha/2} = 1.9600$
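Putting the critical value together with the interval formulas above, here is a minimal sketch of a two-sided confidence interval; the counts x = 23 and n = 100 are hypothetical, and the cross-check uses statsmodels' proportion_confint with the same normal approximation.

# Two-sided 95% confidence interval for a population proportion (hypothetical counts)
x, n, alpha = 23, 100, 0.05
p_hat = x / n
half_width = norm.ppf(1 - alpha/2) * math.sqrt(p_hat*(1 - p_hat)/n)
print('{:.0f}% CI: ({:.4f}, {:.4f})'.format((1 - alpha)*100, p_hat - half_width, p_hat + half_width))

# Cross-check with statsmodels (same normal-approximation interval)
from statsmodels.stats.proportion import proportion_confint
print(proportion_confint(x, n, alpha=alpha, method='normal'))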

Hypothesis Tests on a Population Proportion

  • Two-sided tests: $H_0: p = p_0$ vs $H_A: p \neq p_0$

$\Longrightarrow p\text{-value} = 2 \times \min[P(X \geq x),\, P(X \leq x)]$, where $X \sim B(n, p_0)$

  • When $np_0$ and $n(1-p_0)$ are both larger than 5, a normal approximation may be used to compute the $p$-value: $p\text{-value} = 2 \times \Phi(-|z|)$

where

$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0,1)$
  • A continuity correction can be used for a better approximation to the $p$-value.

  • A size $\alpha$ hypothesis test rejects $H_0$ when $|z| > z_{\alpha/2}$, or equivalently when the $p$-value $< \alpha$; a numerical sketch follows
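As a quick illustration of the two-sided test, here is a minimal sketch comparing the exact binomial p-value with its normal approximation; the values x = 14, n = 100, and p0 = 0.2 are hypothetical.

# Exact two-sided binomial p-value vs. the normal approximation (hypothetical data)
from scipy.stats import binom

x, n, p0 = 14, 100, 0.2
# Exact: 2 * min[P(X >= x), P(X <= x)] under X ~ B(n, p0), capped at 1
p_exact = min(2 * min(binom.sf(x - 1, n, p0), binom.cdf(x, n, p0)), 1)
# Normal approximation
z = (x/n - p0) / math.sqrt(p0*(1 - p0)/n)
p_approx = 2 * norm.cdf(-abs(z))
print('exact p-value  = {:.4f}'.format(p_exact))
print('approx p-value = {:.4f}'.format(p_approx))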

One-sided hypothesis tests for a population proportion

  • For testing $H_0: p \geq p_0$ vs $H_A: p < p_0$

$\Longrightarrow p\text{-value} = P(X \leq x)$, where $X \sim B(n, p_0)$

  • By the normal approximation: $p\text{-value} = \Phi(z)$, where $z = \frac{x + 0.5 - np_0}{\sqrt{np_0(1-p_0)}}$

  • The $+0.5$ is the continuity correction. For the opposite test ($H_0: p \leq p_0$ vs $H_A: p > p_0$), use $-0.5$ and $p\text{-value} = 1 - \Phi(z)$, as in the example below.

"""Example: test H_0: p <= p_0 = 0.05 """ # Parameters x = 13 n = 62 p0 = 0.05 alpha = 0.05 # Check x = 13 > 5 and n − x = 62 − 13 > 5. # Calculate the z statistics as z = (x - n*p0 -0.5)/math.sqrt( n*p0*(1-p0)) # P-value print('z-statistics = {:.4f}'.format(z)) print('p-value = {:.4f}'.format(1 - norm.cdf(5.48))) # 95% one sided confidence interval p_hat = x/n; crit = norm.ppf(1-alpha) wing_span = crit*math.sqrt(( p_hat*(1-p_hat)) /n) print('{:.0f}% Confidence interval: ({:.4f}, 1)'.format(((1-alpha)*100), p_hat - wing_span))
z-statistic = 5.4775
p-value = 0.0000
95% Confidence interval: (0.1246, 1)
# Other example
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import binom_test  # in newer SciPy, use scipy.stats.binomtest

zstat, pvalue = proportions_ztest(45, 100, 0.5)
print("1 sample proportions test \n Z = %.4f, p-value = %.4f" % (zstat, pvalue))

pvalue = binom_test(8, 20, 0.5)
print("Exact binomial test \n p-value = %.4f" % pvalue)
1 sample proportions test 
 Z = -1.0050, p-value = 0.3149
Exact binomial test 
 p-value = 0.5034

Sample size calculations

  • Consider the length of a two-sided $1-\alpha$ level confidence interval: $L = 2 z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

  • If $\hat{p}$ is not available, then since $\hat{p}(1-\hat{p}) \leq \frac{1}{4}$, $L = 2 z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \leq z_{\alpha/2} \sqrt{\frac{1}{n}}$

$\Longrightarrow$ Requiring a length of at most $L$ gives $n \geq \frac{4 z^2_{\alpha/2}\, \hat{p}(1-\hat{p})}{L^2}$ (or $n \geq \frac{z^2_{\alpha/2}}{L^2}$ when $\hat{p}$ is unavailable), as in the sketch below
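A minimal sketch of this sample-size calculation; the target length L = 0.1 and the worst-case planning value 0.5 are hypothetical choices.

# Sample size for a two-sided CI of length at most L (hypothetical targets)
L_target, alpha = 0.1, 0.05
p_plan = 0.5  # worst-case planning value, since p(1-p) <= 1/4
z = norm.ppf(1 - alpha/2)
n_req = math.ceil(4 * z**2 * p_plan*(1 - p_plan) / L_target**2)
print('required sample size: n >=', n_req)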

Comparing Two Population Proportions

Confidence Intervals for the Difference Between Two Population Proportions

  • Assume $X \sim B(n, p_A)$ and $Y \sim B(m, p_B)$, and that they are independent

  • $1-\alpha$ level confidence intervals for $p_A - p_B$:

    • Two-sided:

$\hat{p}_A - \hat{p}_B \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} }$

    • One-sided with a lower bound:

$\left( \hat{p}_A - \hat{p}_B - z_{\alpha} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} },\; 1 \right)$

    • One-sided with an upper bound:

$\left( -1,\; \hat{p}_A - \hat{p}_B + z_{\alpha} \sqrt{ \frac{\hat{p}_A (1-\hat{p}_A)}{n} + \frac{\hat{p}_B (1-\hat{p}_B)}{m} } \right)$
  • These approximations are reasonable as long as $x$, $n-x$, $y$, and $m-y$ are all larger than 5, as in the sketch below.
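A minimal sketch of the two-sided interval for $p_A - p_B$; all counts are hypothetical.

# Two-sided 95% CI for p_A - p_B (hypothetical counts)
x, n = 42, 100  # sample A: successes, size
y, m = 30, 100  # sample B: successes, size
alpha = 0.05
p_a, p_b = x/n, y/m
se = math.sqrt(p_a*(1 - p_a)/n + p_b*(1 - p_b)/m)
z = norm.ppf(1 - alpha/2)
print('{:.0f}% CI: ({:.4f}, {:.4f})'.format((1 - alpha)*100, p_a - p_b - z*se, p_a - p_b + z*se))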

Hypothesis Tests on the Difference Between Two Population Proportions

  • For testing $H_0: p_A = p_B$ vs $H_A: p_A \neq p_B$

    • $p$-value: $2 \times \Phi(-|z|)$, where $z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n} + \frac{1}{m}\right)}}$ and the pooled estimate is $\hat{p} = \frac{x+y}{n+m}$

  • For $H_0: p_A \geq p_B$ vs $H_A: p_A < p_B \Longrightarrow p\text{-value} = \Phi(z)$

  • For $H_0: p_A \leq p_B$ vs $H_A: p_A > p_B \Longrightarrow p\text{-value} = 1 - \Phi(z)$

  • Reject $H_0$ if the $p$-value is smaller than the significance level $\alpha$; otherwise accept $H_0$

# Example
x = 4
y = 10
n = 50
m = 50
p_a = x/n
p_b = y/m

# Pooled probability
p_hat = (x + y)/(n + m)

# Test statistic
z_test = (p_a - p_b)/math.sqrt(p_hat*(1 - p_hat)*(1/n + 1/m))
print('Z = {:.4f}'.format(z_test))
print('p-value = {:.4f}'.format(2*norm.cdf(-abs(z_test))))
Z = -1.7292
p-value = 0.0838
# Other example
from statsmodels.stats.proportion import proportions_ztest

stat, pvalue = proportions_ztest([12, 8], [34, 24])  # ([x1, x2], [n1, n2])
print('2 sample test for equality of proportions \n Z = %.4f, p-value = %.4f' % (stat, pvalue))
2 sample test for equality of proportions 
 Z = 0.1547, p-value = 0.8770

Goodness of Fit Tests for One-Way Contingency Tables

One-Way Classifications

Each of $n$ observations is classified into one of $k$ categories (cells)

  • Cell frequencies: $x_1, \dots, x_k$, with $\sum_{i=1}^k x_i = n$

  • Cell probabilities: $p_1, \dots, p_k$, with $\sum_{i=1}^k p_i = 1$

  • Test $H_0: p_i = p_i^*$ for $i = 1, \dots, k$ vs $H_A$: not $H_0$. Under $H_0$, the expected cell frequency in cell $i$ is $e_i = np_i^*$

There are two test statistics:

  1. Pearson's chi-square statistic: $X^2 = \sum_{i=1}^k \frac{(x_i - e_i)^2}{e_i}$

  2. Likelihood ratio chi-square statistic: $G^2 = 2 \sum_{i=1}^{k} x_i \ln\left(\frac{x_i}{e_i}\right)$

  • Both statistics, $X^2$ and $G^2$, asymptotically follow the $\chi^2_{k-1}$ distribution

    • This asymptotic result is reasonable as long as all the $e_i$ are larger than 5 $\Longrightarrow$ if not, it is appropriate to group cells so that each expected frequency is at least 5

  • $p$-value $= P(X^2 \geq \text{obs}(X^2))$

    • Conclusion: Reject $H_0$ if the $p$-value is smaller than the significance level $\alpha$; otherwise accept $H_0$

# Example: check for homogeneity
from scipy.stats import chi2

# Data (each cell represents a formulation)
x = np.array([225, 223, 152])  # Observed frequencies
e = np.array([200, 200, 200])  # Expected frequencies

def chi2_stat(x, e):
    stat = 0
    for i in range(len(x)):
        stat += (x[i] - e[i])**2 / e[i]
    return stat

stat = chi2_stat(x, e)

# Df = k - 1
print('p-value: {:.4f}'.format(chi2.sf(stat, len(x) - 1)))
p-value: 0.0002
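The likelihood ratio statistic $G^2$ defined above can be computed for the same data (assuming the x and e arrays from the cell above are still in scope); scipy's power_divergence with lambda_='log-likelihood' performs this G-test.

# Likelihood ratio chi-square statistic G^2 for the same observed/expected frequencies
from scipy.stats import power_divergence

g_stat, g_pvalue = power_divergence(x, e, lambda_='log-likelihood')  # df = k - 1
print('G^2 = {:.4f}, p-value = {:.4f}'.format(g_stat, g_pvalue))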

Testing for Independence in Two-Way Contingency Tables

Two-Way Classifications

  • Two-way $r \times c$ contingency table:

|           | Level 1       | Level 2       | $\cdots$ | Level $j$     | $\cdots$ | Level $c$     | Total                |
|-----------|---------------|---------------|----------|---------------|----------|---------------|----------------------|
| Level 1   | $x_{11}$      | $x_{12}$      | $\cdots$ | $x_{1j}$      | $\cdots$ | $x_{1c}$      | $x_{1 \cdot}$        |
| Level 2   | $x_{21}$      | $x_{22}$      | $\cdots$ | $x_{2j}$      | $\cdots$ | $x_{2c}$      | $x_{2 \cdot}$        |
| $\vdots$  |               |               |          | $\vdots$      |          |               | $\vdots$             |
| Level $i$ | $x_{i1}$      | $x_{i2}$      | $\cdots$ | $x_{ij}$      | $\cdots$ | $x_{ic}$      | $x_{i \cdot}$        |
| $\vdots$  |               |               |          | $\vdots$      |          |               | $\vdots$             |
| Level $r$ | $x_{r1}$      | $x_{r2}$      | $\cdots$ | $x_{rj}$      | $\cdots$ | $x_{rc}$      | $x_{r \cdot}$        |
| Total     | $x_{\cdot 1}$ | $x_{\cdot 2}$ | $\cdots$ | $x_{\cdot j}$ | $\cdots$ | $x_{\cdot c}$ | $n = x_{\cdot \cdot}$ |

Testing for Independence

  • Testing for independence in a two-way contingency table

    • $H_0$: the two factors are independent vs $H_A$: not $H_0$

We have the statistics: $X^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$ and $G^2 = 2 \sum_{i=1}^r \sum_{j=1}^c x_{ij} \ln\left(\frac{x_{ij}}{e_{ij}}\right)$

where $e_{ij} = \frac{x_{i \cdot}\, x_{\cdot j}}{n}$

  • These statistics follow $\chi^2_{df}$ with $df = (r-1)(c-1)$

  • The results are valid as long as all the $e_{ij}$ are larger than 5

    • $p$-value $= P(X^2 \geq \text{obs}(X^2))$

from scipy.stats import chi2_contingency

# Build contingency table
aptocc = np.array([[31, 17, 9], [36, 9, 4], [56, 19, 15]])
aptocc = pd.DataFrame(data=aptocc, index=['A', 'B', 'C'],
                      columns=['Minor Cracking', 'Medium cracking', 'Severe cracking'])
print("Observed cell frequencies:\n", aptocc)

chi, pvalue, dof, expctd = chi2_contingency(aptocc)
print("Pearson's Chi-squared test \nX-squared = %.4f, p-value = %.4f, df = %d" % (chi, pvalue, dof))
print("Expected cell frequencies:\n",
      pd.DataFrame(expctd, index=['A', 'B', 'C'],
                   columns=['Minor Cracking', 'Medium cracking', 'Severe cracking']))
Observed cell frequencies:
    Minor Cracking  Medium cracking  Severe cracking
A               31               17                9
B               36                9                4
C               56               19               15
Pearson's Chi-squared test 
X-squared = 5.0237, p-value = 0.2849, df = 4
Expected cell frequencies:
    Minor Cracking  Medium cracking  Severe cracking
A        35.770408        13.086735         8.142857
B        30.750000        11.250000         7.000000
C        56.479592        20.663265        12.857143

Simpson’s paradox

Suppose $P(A \mid B \cap C_i) < P(A \mid B' \cap C_i)$ for $i = 1, 2, \dots, k$, with $\sum_{i=1}^k P(C_i) = 1$

$\Longrightarrow$ It is possible that $P(A \mid B) > P(A \mid B')$

This can occur, for example, when the relative sizes of the strata $C_i$ differ between the two groups; a small numeric illustration follows.
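Below is a minimal numeric sketch of such a reversal; all counts are hypothetical (successes, total) pairs per stratum.

# Hypothetical counts illustrating Simpson's paradox: within each stratum C_i,
# group B has the lower success rate, yet B is higher in aggregate
b       = {'C1': (234, 270), 'C2': (55, 80)}    # group B:  (successes, total)
b_prime = {'C1': (81, 87),   'C2': (192, 263)}  # group B': (successes, total)

for ci in ('C1', 'C2'):
    print("{}: P(A|B) = {:.3f} < P(A|B') = {:.3f}".format(
        ci, b[ci][0]/b[ci][1], b_prime[ci][0]/b_prime[ci][1]))

agg_b  = sum(c for c, t in b.values()) / sum(t for c, t in b.values())
agg_bp = sum(c for c, t in b_prime.values()) / sum(t for c, t in b_prime.values())
print("aggregate: P(A|B) = {:.3f} > P(A|B') = {:.3f}".format(agg_b, agg_bp))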