📚 The CoCalc Library - books, templates and other resources
License: OTHER
% LaTeX source for ``Think Stats:
% Exploratory data analysis in Python''
% Copyright 2014 Allen B. Downey.

% License: Creative Commons
% Attribution-NonCommercial-ShareAlike 4.0 International
% http://creativecommons.org/licenses/by-nc-sa/4.0/
%

%\documentclass[10pt,b5paper]{book}
\documentclass[12pt]{book}

%\usepackage[width=5.5in,height=8.5in,
%  hmarginratio=3:2,vmarginratio=1:1]{geometry}

% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)

%\usepackage[T1]{fontenc}
%\usepackage{textcomp}
%\usepackage{mathpazo}
%\usepackage{pslatex}

\usepackage{url}
\usepackage{hyperref}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{amsmath}
\usepackage{amsthm}
%\usepackage{amssymb}
\usepackage{makeidx}
\usepackage{setspace}
\usepackage{hevea}
\usepackage{upquote}

\title{Think Stats}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Stats}
\newcommand{\thesubtitle}{Exploratory Data Analysis in Python}
\newcommand{\theversion}{2.0.38}

% these styles get translated in CSS for the HTML version
\newstyle{a:link}{color:black;}
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
\newstyle{img}{border:0px}

% change the arrows in the HTML version
\setlinkstext
  {\imgsrc[ALT="Previous"]{back.png}}
  {\imgsrc[ALT="Up"]{up.png}}
  {\imgsrc[ALT="Next"]{next.png}}

\makeindex

\newif\ifplastex
\plastexfalse

\begin{document}

\frontmatter

\newcommand{\Erdos}{Erd\H{o}s}
\newcommand{\nhat}{\hat{N}}
\newcommand{\eps}{\varepsilon}
\newcommand{\slope}{\mathrm{slope}}
\newcommand{\inter}{\mathrm{inter}}
\newcommand{\xs}{\mathrm{xs}}
\newcommand{\ys}{\mathrm{ys}}
\newcommand{\res}{\mathrm{res}}
\newcommand{\xbar}{\bar{x}}
\newcommand{\ybar}{\bar{y}}
\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}
\newcommand{\Prob}{\mathrm{P}}
\newcommand{\Corr}{\mathrm{Corr}}
\newcommand{\normal}{\mathcal{N}}
\newcommand{\given}{|}
%\newcommand{\goodchi}{\protect\raisebox{2pt}{$\chi$}}
\newcommand{\goodchi}{\chi}

\ifplastex
\usepackage{localdef}
\maketitle

\newcount\anchorcnt
\newcommand*{\Anchor}[1]{%
  \@bsphack%
    \Hy@GlobalStepCount\anchorcnt%
    \edef\@currentHref{anchor.\the\anchorcnt}%
    \Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}%
    \M@gettitle{}\label{#1}%
  \@esphack%
}

\else

%%% EXERCISE

\newtheoremstyle{exercise}% name of the style to be used
  {\topsep}% measure of space to leave above the theorem. E.g.: 3pt
  {\topsep}% measure of space to leave below the theorem. E.g.: 3pt
  {}% name of font to use in the body of the theorem
  {}% measure of space to indent
  {\bfseries}% name of head font
  {}% punctuation between head and body
  { }% space after theorem head; " " = normal interword space
  {}% Manually specify head

\theoremstyle{exercise}
\newtheorem{exercise}{Exercise}[chapter]

%\newcounter{exercise}[chapter]
%\newcommand{\nextexercise}{\refstepcounter{exercise}}

%\newenvironment{exercise}{\nextexercise \noindent \textbf{Exercise \thechapter.\theexercise} \begin{itshape} \noindent}{\end{itshape}}

\input{latexonly}

\begin{latexonly}

\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}

%\blankpage
%\blankpage

% TITLE PAGES FOR LATEX VERSION

%-half title--------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle}\\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vfill

\end{flushright}

%--verso------------------------------------------------------

\blankpage
\blankpage
%\clearemptydoublepage
%\pagebreak
%\thispagestyle{empty}
%\vspace*{6in}

%--title page--------------------------------------------------
\pagebreak
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle}\\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vspace{1in}


{\Large
Allen B. Downey\\
}

\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

%\includegraphics[width=1in]{figs/logo1.eps}
\vfill

\end{flushright}


%--copyright--------------------------------------------------
\pagebreak
\thispagestyle{empty}

{\small
Copyright \copyright ~2014 Allen B. Downey.

\vspace{0.2in}

\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License, which
is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

The original form of this book is \LaTeX\ source code. Compiling this
code has the effect of generating a device-independent representation
of a textbook, which can be converted to other formats and printed.

The \LaTeX\ source for this book is available from
\url{http://thinkstats2.com}.

\vspace{0.2in}

} % end small

\end{latexonly}


% HTMLONLY

\begin{htmlonly}

% TITLE PAGE FOR HTML VERSION

{\Large \thetitle: \thesubtitle}

{\large Allen B. Downey}

Version \theversion

\vspace{0.25in}

Copyright 2014 Allen B. Downey

\vspace{0.25in}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License, which
is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

\fi
% END OF THE PART WE SKIP FOR PLASTEX

\chapter{Preface}
\label{preface}

This book is an introduction to the practical tools of exploratory data analysis.
The organization of the book follows the process I use when I start working with a dataset:

\begin{itemize}

\item Importing and cleaning: Whatever format the data is in, it usually takes some time and effort to read the data, clean and transform it, and check that everything made it through the translation process intact.
\index{cleaning}

\item Single variable explorations: I usually start by examining one variable at a time, finding out what the variables mean, looking at distributions of the values, and choosing appropriate summary statistics.
\index{distribution}

\item Pair-wise explorations: To identify possible relationships between variables, I look at tables and scatter plots, and compute correlations and linear fits.
\index{correlation}
\index{linear fit}

\item Multivariate analysis: If there are apparent relationships between variables, I use multiple regression to add control variables and investigate more complex relationships.
\index{multiple regression}
\index{control variable}

\item Estimation and hypothesis testing: When reporting statistical results, it is important to answer three questions: How big is the effect? How much variability should we expect if we run the same measurement again? Is it possible that the apparent effect is due to chance?
\index{estimation}
\index{hypothesis testing}

\item Visualization: During exploration, visualization is an important tool for finding possible relationships and effects. Then if an apparent effect holds up to scrutiny, visualization is an effective way to communicate results.
\index{visualization}

\end{itemize}

This book takes a computational approach, which has several advantages over mathematical approaches:
\index{computational methods}

\begin{itemize}

\item I present most ideas using Python code, rather than mathematical notation. In general, Python code is more readable; also, because it is executable, readers can download it, run it, and modify it.

\item Each chapter includes exercises readers can do to develop and solidify their learning. When you write programs, you express your understanding in code; while you are debugging the program, you are also correcting your understanding.
\index{debugging}

\item Some exercises involve experiments to test statistical behavior. For example, you can explore the Central Limit Theorem (CLT) by generating random samples and computing their sums. The resulting visualizations demonstrate why the CLT works and when it doesn't.
\index{Central Limit Theorem}
\index{CLT}

\item Some ideas that are hard to grasp mathematically are easy to understand by simulation. For example, we approximate p-values by running random simulations, which reinforces the meaning of the p-value (a short sketch follows this list).
\index{p-value}

\item Because the book is based on a general-purpose programming language (Python), readers can import data from almost any source. They are not limited to datasets that have been cleaned and formatted for a particular statistics tool.

\end{itemize}
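As a small illustration of the simulation idea, here is a minimal sketch, not code from the book, of estimating a p-value by random shuffling: it compares an observed difference in means between two groups to the differences produced when group labels are assigned at random. The function name and arguments are hypothetical.

\begin{verbatim}
import numpy as np

def SimulatePValue(group1, group2, iters=1000):
    """Estimates a two-sided p-value by permutation."""
    observed = abs(np.mean(group1) - np.mean(group2))
    pooled = np.concatenate([group1, group2])
    n = len(group1)

    count = 0
    for _ in range(iters):
        # shuffle the pooled values and split them into two
        # fake groups of the original sizes
        np.random.shuffle(pooled)
        diff = abs(np.mean(pooled[:n]) - np.mean(pooled[n:]))
        if diff >= observed:
            count += 1

    # fraction of shuffles at least as extreme as the observation
    return count / float(iters)
\end{verbatim}

Later chapters develop this idea carefully; the point here is only that a few lines of code can stand in for a derivation.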

The book lends itself to a project-based approach. In my class, students work on a semester-long project that requires them to pose a statistical question, find a dataset that can address it, and apply each of the techniques they learn to their own data.

To demonstrate my approach to statistical analysis, the book presents a case study that runs through all of the chapters. It uses data from two sources:

\begin{itemize}

\item The National Survey of Family Growth (NSFG), conducted by the U.S. Centers for Disease Control and Prevention (CDC) to gather ``information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health.'' (See \url{http://cdc.gov/nchs/nsfg.htm}.)

\item The Behavioral Risk Factor Surveillance System (BRFSS), conducted by the National Center for Chronic Disease Prevention and Health Promotion to ``track health conditions and risk behaviors in the United States.'' (See \url{http://cdc.gov/BRFSS/}.)

\end{itemize}

Other examples use data from the IRS, the U.S. Census, and the Boston Marathon.

This second edition of {\it Think Stats\/} includes the chapters from the first edition, many of them substantially revised, and new chapters on regression, time series analysis, survival analysis, and analytic methods. The previous edition did not use pandas, SciPy, or StatsModels, so all of that material is new.


\section{How I wrote this book}

When people write a new textbook, they usually start by reading a stack of old textbooks. As a result, most books contain the same material in pretty much the same order.

I did not do that. In fact, I used almost no printed material while I was writing this book, for several reasons:

\begin{itemize}

\item My goal was to explore a new approach to this material, so I didn't want much exposure to existing approaches.

\item Since I am making this book available under a free license, I wanted to make sure that no part of it was encumbered by copyright restrictions.

\item Many readers of my books don't have access to libraries of printed material, so I tried to make references to resources that are freely available on the Internet.

\item Some proponents of old media think that the exclusive use of electronic resources is lazy and unreliable. They might be right about the first part, but I think they are wrong about the second, so I wanted to test my theory.

% http://www.ala.org/ala/mgrps/rts/nmrt/news/footnotes/may2010/in_defense_of_wikipedia_bonnett.cfm

\end{itemize}

The resource I used more than any other is Wikipedia. In general, the articles I read on statistical topics were very good (although I made a few small changes along the way). I include references to Wikipedia pages throughout the book and I encourage you to follow those links; in many cases, the Wikipedia page picks up where my description leaves off.
The vocabulary and notation in this book are generally consistent with Wikipedia, unless I had a good reason to deviate. Other resources I found useful were Wolfram MathWorld and the Reddit statistics forum, \url{http://www.reddit.com/r/statistics}.


\section{Using the code}
\label{code}

The code and data used in this book are available from \url{https://github.com/AllenDowney/ThinkStats2}. Git is a version control system that allows you to keep track of the files that make up a project. A collection of files under Git's control is called a {\bf repository}. GitHub is a hosting service that provides storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to work with the code:

\begin{itemize}

\item You can create a copy of my repository on GitHub by pressing the {\sf Fork} button. If you don't already have a GitHub account, you'll need to create one. After forking, you'll have your own repository on GitHub that you can use to keep track of code you write while working on this book. Then you can clone the repo, which means that you make a copy of the files on your computer.
\index{fork}

\item Or you could clone my repository. You don't need a GitHub account to do this, but you won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files in a Zip file using the button in the lower-right corner of the GitHub page.

\end{itemize}

All of the code is written to work in both Python 2 and Python 3 with no translation.

I developed this book using Anaconda from Continuum Analytics, which is a free Python distribution that includes all the packages you'll need to run the code (and lots more). I found Anaconda easy to install. By default it does a user-level installation, not system-level, so you don't need administrative privileges. And it supports both Python 2 and Python 3. You can download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}

If you don't want to use Anaconda, you will need the following packages:

\begin{itemize}

\item pandas for representing and analyzing data, \url{http://pandas.pydata.org/};
\index{pandas}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation including statistics, \url{http://www.scipy.org/};
\index{SciPy}

\item StatsModels for regression and other statistical analysis, \url{http://statsmodels.sourceforge.net/}; and
\index{StatsModels}

\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}

\end{itemize}

Although these are commonly used packages, they are not included with all Python installations, and they can be hard to install in some environments. If you have trouble installing them, I strongly recommend using Anaconda or one of the other Python distributions that include these packages.
\index{installation}
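One quick way to check whether your environment has everything it needs is to import each package and print its version. This is a small sketch, not part of the book's code; the versions in your output will of course differ.

\begin{verbatim}
import pandas, numpy, scipy, statsmodels, matplotlib

for module in [pandas, numpy, scipy, statsmodels, matplotlib]:
    # each of these packages defines __version__
    print(module.__name__, module.__version__)
\end{verbatim}

If any of these imports fails, that package still needs to be installed.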

After you clone the repository or unzip the zip file, you should have a folder called {\tt ThinkStats2/code} with a file called {\tt nsfg.py}. If you run {\tt nsfg.py}, it should read a data file, run some tests, and print a message like, ``All tests passed.'' If you get import errors, it probably means there are packages you need to install.

Most exercises use Python scripts, but some also use the IPython notebook. If you have not used IPython notebook before, I suggest you start with the documentation at \url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.
\index{IPython}

I wrote this book assuming that the reader is familiar with core Python, including object-oriented features, but not pandas, NumPy, or SciPy. If you are already familiar with these modules, you can skip a few sections.

I assume that the reader knows basic mathematics, including logarithms and summations, for example. I refer to calculus concepts in a few places, but you don't have to do any calculus.

If you have never studied statistics, I think this book is a good place to start. And if you have taken a traditional statistics class, I hope this book will help repair the damage.


---

Allen B. Downey is a Professor of Computer Science at the Franklin W. Olin College of Engineering in Needham, MA.


\section*{Contributor List}

If you have a suggestion or correction, please send email to {\tt downey@allendowney.com}. If I make a change based on your feedback, I will add you to the contributor list (unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the error appears in, that makes it easy for me to search. Page and section numbers are fine, too, but not quite as easy to work with. Thanks!

\small

\begin{itemize}

\item Lisa Downey and June Downey read an early draft and made many corrections and suggestions.

\item Steven Zhang found several errors.

\item Andy Pethan and Molly Farison helped debug some of the solutions, and Molly spotted several typos.

\item Dr. Nikolas Akerblom knows how big a Hyracotherium is.

\item Alex Morrow clarified one of the code examples.

\item Jonathan Street caught an error in the nick of time.

\item Many thanks to Kevin Smith and Tim Arnold for their work on plasTeX, which I used to convert this book to DocBook.

\item George Caplan sent several suggestions for improving clarity.

\item Julian Ceipek found an error and a number of typos.

\item Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson found errors in the first print edition.

\item J\"{o}rg Beyer found typos in the book and made many corrections in the docstrings of the accompanying code.

\item Tommie Gannert sent a patch file with a number of corrections.

\item Christoph Lendenmann submitted several errata.

\item Michael Kearney sent me many excellent suggestions.

\item Alex Birch made a number of helpful suggestions.

\item Lindsey Vanderlyn, Griffin Tschurwald, and Ben Small read an early version of this book and found many errors.

\item John Roth, Carol Willing, and Carol Novitsky performed technical reviews of the book.
They found many errors and made many helpful suggestions.

\item David Palmer sent many helpful suggestions and corrections.

\item Erik Kulyk found many typos.

\item Nir Soffer sent several excellent pull requests for both the book and the supporting code.

\item GitHub user flothesof sent a number of corrections.

\item Toshiaki Kurokawa, who is working on the Japanese translation of this book, has sent many corrections and helpful suggestions.

\item Benjamin White suggested more idiomatic Pandas code.

\item Takashi Sato spotted a code error.

% ENDCONTRIB

\end{itemize}

Other people who found typos and similar errors are Andrew Heine,
G\'{a}bor Lipt\'{a}k,
Dan Kearney,
Alexander Gryzlov,
Martin Veillette,
Haitao Ma,
Jeff Pickhardt,
Rohit Deshpande,
Joanne Pratt,
Lucian Ursu,
Paul Glezen,
Ting-kuang Lin,
Scott Miller,
and Luigi Patruno.


\normalsize

\clearemptydoublepage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\clearemptydoublepage

\end{latexonly}

% START THE BOOK
\mainmatter


\chapter{Exploratory data analysis}
\label{intro}

The thesis of this book is that data combined with practical methods can answer questions and guide decisions under uncertainty.

As an example, I present a case study motivated by a question I heard when my wife and I were expecting our first child: do first babies tend to arrive late?
\index{first babies}

If you Google this question, you will find plenty of discussion. Some people claim it's true, others say it's a myth, and some people say it's the other way around: first babies come early.

In many of these discussions, people provide data to support their claims. I found many examples like these:

\begin{quote}

``My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.''

``My first one came 2 weeks late and now I think the second one is going to come out two weeks early!!''

``I don't think that can be true because my sister was my mother's first and she was early, as with many of my cousins.''

\end{quote}

Reports like these are called {\bf anecdotal evidence} because they are based on data that is unpublished and usually personal. In casual conversation, there is nothing wrong with anecdotes, so I don't mean to pick on the people I quoted.
\index{anecdotal evidence}

But we might want evidence that is more persuasive and an answer that is more reliable. By those standards, anecdotal evidence usually fails, because:

\begin{itemize}

\item Small number of observations: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
\index{pregnancy length}

\item Selection bias: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
\index{selection bias}
\index{bias!selection}

\item Confirmation bias: People who believe the claim might be more likely to contribute examples that confirm it.
People who doubt the claim are more likely to cite counterexamples.
\index{confirmation bias}
\index{bias!confirmation}

\item Inaccuracy: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.

\end{itemize}

So how can we do better?


\section{A statistical approach}

To address the limitations of anecdotes, we will use the tools of statistics, which include:

\begin{itemize}

\item Data collection: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
\index{data collection}

\item Descriptive statistics: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
\index{descriptive statistics}

\item Exploratory data analysis: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
\index{exploratory data analysis}

\item Estimation: We will use data from a sample to estimate characteristics of the general population.
\index{estimation}

\item Hypothesis testing: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.
\index{hypothesis testing}

\end{itemize}

By performing these steps with care to avoid pitfalls, we can reach conclusions that are more justifiable and more likely to be correct.


\section{The National Survey of Family Growth}
\label{nsfg}

Since 1973 the U.S. Centers for Disease Control and Prevention (CDC) have conducted the National Survey of Family Growth (NSFG), which is intended to gather ``information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health. The survey results are used \ldots to plan health services and health education programs, and to do statistical studies of families, fertility, and health.'' See \url{http://cdc.gov/nchs/nsfg.htm}.
\index{National Survey of Family Growth}
\index{NSFG}

We will use data collected by this survey to investigate whether first babies tend to come late, and other questions. In order to use this data effectively, we have to understand the design of the study.

The NSFG is a {\bf cross-sectional} study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a {\bf longitudinal} study, which observes a group repeatedly over a period of time.
\index{cross-sectional study}
\index{study!cross-sectional}
\index{longitudinal study}
\index{study!longitudinal}

The NSFG has been conducted seven times; each deployment is called a {\bf cycle}. We will use data from Cycle 6, which was conducted from January 2002 to March 2003. \index{cycle}

The goal of the survey is to draw conclusions about a {\bf population}; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that's seldom possible.
Instead we collect data from a subset of the population called a {\bf sample}. The people who participate in a survey are called {\bf respondents}.
\index{population}

In general, cross-sectional studies are meant to be {\bf representative}, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.
\index{respondent} \index{representative}

The NSFG is not representative; instead it is deliberately {\bf oversampled}. The designers of the study recruited three groups---Hispanics, African-Americans and teenagers---at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.
\index{oversampling}

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.

When working with this kind of data, it is important to be familiar with the {\bf codebook}, which documents the design of the study, the survey questions, and the encoding of the responses. The codebook and user's guide for the NSFG data are available from \url{http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm}.


\section{Importing the data}

The code and data used in this book are available from \url{https://github.com/AllenDowney/ThinkStats2}. For information about downloading and working with this code, see Section~\ref{code}.

Once you download the code, you should have a file called {\tt ThinkStats2/code/nsfg.py}. If you run it, it should read a data file, run some tests, and print a message like, ``All tests passed.''

Let's see what it does. Pregnancy data from Cycle 6 of the NSFG is in a file called {\tt 2002FemPreg.dat.gz}; it is a gzip-compressed data file in plain text (ASCII), with fixed width columns. Each line in the file is a {\bf record} that contains data about one pregnancy.

The format of the file is documented in {\tt 2002FemPreg.dct}, which is a Stata dictionary file. Stata is a statistical software system; a ``dictionary'' in this context is a list of variable names, types, and indices that identify where in each line to find each variable.

For example, here are a few lines from {\tt 2002FemPreg.dct}:
%
\begin{verbatim}
infile dictionary {
    _column(1)  str12   caseid    %12s  "RESPONDENT ID NUMBER"
    _column(13) byte    pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
}
\end{verbatim}

This dictionary describes two variables: {\tt caseid} is a 12-character string that represents the respondent ID; {\tt pregordr} is a one-byte integer that indicates which pregnancy this record describes for this respondent.

The code you downloaded includes {\tt thinkstats2.py}, which is a Python module that contains many classes and functions used in this book, including functions that read the Stata dictionary and the NSFG data file. Here's how they are used in {\tt nsfg.py}:

\begin{verbatim}
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df
\end{verbatim}

{\tt ReadStataDct} takes the name of the dictionary file and returns {\tt dct}, a {\tt FixedWidthVariables} object that contains the information from the dictionary file. {\tt dct} provides {\tt ReadFixedWidth}, which reads the data file.
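Under the hood, reading fixed-width data like this is something pandas can do directly. The following sketch is an illustration rather than the book's implementation; it assumes a reasonably recent version of pandas, and it takes the column positions and names from the two dictionary lines shown above.

\begin{verbatim}
import pandas

# column positions from the Stata dictionary
# (0-based, end-exclusive): caseid occupies columns 1-12,
# pregordr occupies columns 13-14
colspecs = [(0, 12), (12, 14)]
names = ['caseid', 'pregordr']

df = pandas.read_fwf('2002FemPreg.dat.gz',
                     colspecs=colspecs,
                     names=names,
                     compression='gzip')
\end{verbatim}

{\tt ReadFixedWidth} does something similar for all 244 columns, with the column specifications parsed from the dictionary file.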

\section{DataFrames}
\label{dataframe}

The result of {\tt ReadFixedWidth} is a DataFrame, the fundamental data structure provided by pandas, a Python data and statistics package we'll use throughout this book. A DataFrame contains a row for each record, in this case one row per pregnancy, and a column for each variable.
\index{pandas}
\index{DataFrame}

In addition to the data, a DataFrame also contains the variable names and their types, and it provides methods for accessing and modifying the data.

If you print {\tt df} you get a truncated view of the rows and columns, and the shape of the DataFrame, which is 13593 rows/records and 244 columns/variables.

\begin{verbatim}
>>> import nsfg
>>> df = nsfg.ReadFemPreg()
>>> df
...
[13593 rows x 244 columns]
\end{verbatim}

The DataFrame is too big to display, so the output is truncated. The last line reports the number of rows and columns.

The attribute {\tt columns} returns a sequence of column names as Unicode strings:

\begin{verbatim}
>>> df.columns
Index([u'caseid', u'pregordr', u'howpreg_n', u'howpreg_p', ... ])
\end{verbatim}

The result is an Index, which is another pandas data structure. We'll learn more about Index later, but for now we'll treat it like a list:
\index{pandas}
\index{Index}

\begin{verbatim}
>>> df.columns[1]
'pregordr'
\end{verbatim}

To access a column from a DataFrame, you can use the column name as a key:
\index{DataFrame}

\begin{verbatim}
>>> pregordr = df['pregordr']
>>> type(pregordr)
<class 'pandas.core.series.Series'>
\end{verbatim}

The result is a Series, yet another pandas data structure. A Series is like a Python list with some additional features. When you print a Series, you get the indices and the corresponding values:
\index{Series}

\begin{verbatim}
>>> pregordr
0        1
1        2
2        1
3        2
...
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64
\end{verbatim}

In this example the indices are integers from 0 to 13592, but in general they can be any sortable type. The elements are also integers, but they can be any type.

The last line includes the variable name, Series length, and data type; {\tt int64} is one of the types provided by NumPy.
If you run this example on a 32-bit machine you might see {\tt int32}.
\index{NumPy}

You can access the elements of a Series using integer indices and slices:

\begin{verbatim}
>>> pregordr[0]
1
>>> pregordr[2:5]
2    1
3    2
4    3
Name: pregordr, dtype: int64
\end{verbatim}

The result of the index operator is an {\tt int64}; the result of the slice is another Series.

You can also access the columns of a DataFrame using dot notation:
\index{DataFrame}

\begin{verbatim}
>>> pregordr = df.pregordr
\end{verbatim}

This notation only works if the column name is a valid Python identifier, so it has to begin with a letter, can't contain spaces, etc.


\section{Variables}

We have already seen two variables in the NSFG dataset, {\tt caseid} and {\tt pregordr}, and we have seen that there are 244 variables in total. For the explorations in this book, I use the following variables:

\begin{itemize}

\item {\tt caseid} is the integer ID of the respondent.

\item {\tt prglngth} is the integer duration of the pregnancy in weeks.
\index{pregnancy length}

\item {\tt outcome} is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.

\item {\tt pregordr} is a pregnancy serial number; for example, the code for a respondent's first pregnancy is 1, for the second pregnancy is 2, and so on.

\item {\tt birthord} is a serial number for live births; the code for a respondent's first child is 1, and so on. For outcomes other than live birth, this field is blank.

\item \verb"birthwgt_lb" and \verb"birthwgt_oz" contain the pounds and ounces parts of the birth weight of the baby.
\index{birth weight}
\index{weight!birth}

\item {\tt agepreg} is the mother's age at the end of the pregnancy.

\item {\tt finalwgt} is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.
\index{weight!sample}

\end{itemize}

If you read the codebook carefully, you will see that many of the variables are {\bf recodes}, which means that they are not part of the {\bf raw data} collected by the survey; they are calculated using the raw data. \index{recode} \index{raw data}

For example, {\tt prglngth} for live births is equal to the raw variable {\tt wksgest} (weeks of gestation) if it is available; otherwise it is estimated using {\tt mosgest * 4.33} (months of gestation times the average number of weeks in a month).

Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.
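As a concrete illustration of the {\tt prglngth} recode, the fallback logic can be expressed in pandas roughly like this. This is a sketch, not the survey's actual recode (which applies additional consistency checks), and it assumes the raw columns {\tt wksgest} and {\tt mosgest} are present in {\tt df}:

\begin{verbatim}
# use wksgest where it is available; otherwise estimate it from
# mosgest (months of gestation times average weeks per month)
prglngth = df.wksgest.fillna(df.mosgest * 4.33)
\end{verbatim}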

\section{Transformation}
\label{cleaning}

When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called {\bf data cleaning}.

{\tt nsfg.py} includes {\tt CleanFemPreg}, a function that cleans the variables I am planning to use.

\begin{verbatim}
def CleanFemPreg(df):
    df.agepreg /= 100.0

    na_vals = [97, 98, 99]
    df.birthwgt_lb.replace(na_vals, np.nan, inplace=True)
    df.birthwgt_oz.replace(na_vals, np.nan, inplace=True)

    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

{\tt agepreg} contains the mother's age at the end of the pregnancy. In the data file, {\tt agepreg} is encoded as an integer number of centiyears. So the first line divides each element of {\tt agepreg} by 100, yielding a floating-point value in years.

\verb"birthwgt_lb" and \verb"birthwgt_oz" contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth. In addition, these variables use several special codes:

\begin{verbatim}
97 NOT ASCERTAINED
98 REFUSED
99 DON'T KNOW
\end{verbatim}

Special values encoded as numbers are {\em dangerous\/} because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The {\tt replace} method replaces these values with {\tt np.nan}, a special floating-point value that represents ``not a number.'' The {\tt inplace} flag tells {\tt replace} to modify the existing Series rather than create a new one.
\index{NaN}

As part of the IEEE floating-point standard, all mathematical operations return {\tt nan} if either argument is {\tt nan}:

\begin{verbatim}
>>> import numpy as np
>>> np.nan / 100.0
nan
\end{verbatim}

So computations with {\tt nan} tend to do the right thing, and most pandas functions handle {\tt nan} appropriately. But dealing with missing data will be a recurring issue.
\index{pandas}
\index{missing values}

The last line of {\tt CleanFemPreg} creates a new column \verb"totalwgt_lb" that combines pounds and ounces into a single quantity, in pounds.

One important note: when you add a new column to a DataFrame, you must use dictionary syntax, like this:
\index{DataFrame}

\begin{verbatim}
# CORRECT
df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

Not dot notation, like this:

\begin{verbatim}
# WRONG!
df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

The version with dot notation adds an attribute to the DataFrame object, but that attribute is not treated as a new column.


\section{Validation}

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors.

One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable.
Here is the table for {\tt outcome}, which encodes the outcome of each pregnancy:

\begin{verbatim}
value label              Total
1     LIVE BIRTH          9148
2     INDUCED ABORTION    1862
3     STILLBIRTH           120
4     MISCARRIAGE         1921
5     ECTOPIC PREGNANCY    190
6     CURRENT PREGNANCY    352
\end{verbatim}

The Series class provides a method, \verb"value_counts", that counts the number of times each value appears. If we select the {\tt outcome} Series from the DataFrame, we can use \verb"value_counts" to compare with the published data:
\index{DataFrame}
\index{Series}

\begin{verbatim}
>>> df.outcome.value_counts().sort_index()
1    9148
2    1862
3     120
4    1921
5     190
6     352
\end{verbatim}

The result of \verb"value_counts" is a Series; \verb"sort_index()" sorts the Series by index, so the values appear in order.

Comparing the results with the published table, it looks like the values in {\tt outcome} are correct. Similarly, here is the published table for \verb"birthwgt_lb":

\begin{verbatim}
value label                  Total
.     INAPPLICABLE            4449
0-5   UNDER 6 POUNDS          1125
6     6 POUNDS                2223
7     7 POUNDS                3049
8     8 POUNDS                1889
9-95  9 POUNDS OR MORE         799
\end{verbatim}

And here are the value counts:

\begin{verbatim}
>>> df.birthwgt_lb.value_counts(sort=False)
0        8
1       40
2       53
3       98
4      229
5      697
6     2223
7     3049
8     1889
9      623
10     132
11      26
12      10
13       3
14       3
15       1
51       1
\end{verbatim}

The counts for 6, 7, and 8 pounds check out, and if you add up the counts for 0-5 and 9-95, they check out, too. But if you look more closely, you will notice one value that has to be an error, a 51 pound baby!

To deal with this error, I added a line to {\tt CleanFemPreg}:

\begin{verbatim}
df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
\end{verbatim}

This statement replaces invalid values with {\tt np.nan}. The attribute {\tt loc} provides several ways to select rows and columns from a DataFrame. In this example, the first expression in brackets is the row indexer; the second expression selects the column.
\index{loc indexer}
\index{indexer!loc}

The expression \verb"df.birthwgt_lb > 20" yields a Series of type {\tt bool}, where True indicates that the condition is true. When a boolean Series is used as an index, it selects only the elements that satisfy the condition.
\index{Series} \index{boolean} \index{NaN}


\section{Interpretation}

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.

As an example, let's look at the sequence of outcomes for a few respondents. Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent. Here's a function that does that:

\begin{verbatim}
def MakePregMap(df):
    d = defaultdict(list)
    for index, caseid in df.caseid.iteritems():
        d[caseid].append(index)
    return d
\end{verbatim}

{\tt df} is the DataFrame with pregnancy data. The {\tt iteritems} method enumerates the index (row number) and {\tt caseid} for each pregnancy.
\index{DataFrame}

{\tt d} is a dictionary that maps from each case ID to a list of indices.
If you are not familiar with {\tt defaultdict}, it is in the Python {\tt collections} module. Using {\tt d}, we can look up a respondent and get the indices of that respondent's pregnancies.

This example looks up one respondent and prints a list of outcomes for her pregnancies:

\begin{verbatim}
>>> caseid = 10229
>>> preg_map = nsfg.MakePregMap(df)
>>> indices = preg_map[caseid]
>>> df.outcome[indices].values
[4 4 4 4 4 4 1]
\end{verbatim}

{\tt indices} is the list of indices for pregnancies corresponding to respondent {\tt 10229}.

Using this list as an index into {\tt df.outcome} selects the indicated rows and yields a Series. Instead of printing the whole Series, I selected the {\tt values} attribute, which is a NumPy array.
\index{NumPy}
\index{Series}

The outcome code {\tt 1} indicates a live birth. Code {\tt 4} indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.
\index{ethics}


\section{Exercises}

\begin{exercise}
In the repository you downloaded, you should find a file named \verb"chap01ex.ipynb", which is an IPython notebook. You can launch the IPython notebook server from the command line like this:
\index{IPython}

\begin{verbatim}
$ ipython notebook &
\end{verbatim}

If IPython is installed, it should launch a server that runs in the background and open a browser window to view the notebook. If a window does not open, the startup message provides a URL you can load in a browser, usually \url{http://localhost:8888}. The new window should list the notebooks in the repository. If you are not familiar with IPython, I suggest you start with the documentation at \url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.

Open \verb"chap01ex.ipynb". Some cells are already filled in, and you should execute them. Other cells give you instructions for exercises you should try.

A solution to this exercise is in \verb"chap01soln.ipynb".
\end{exercise}


\begin{exercise}
In the repository you downloaded, you should find a file named \verb"chap01ex.py"; using this file as a starting place, write a function that reads the respondent file, {\tt 2002FemResp.dat.gz}.

The variable {\tt pregnum} is a recode that indicates how many times each respondent has been pregnant.
Print the value counts for this variable and compare them to the published results in the NSFG codebook.

You can also cross-validate the respondent and pregnancy files by comparing {\tt pregnum} for each respondent with the number of records in the pregnancy file.

You can use {\tt nsfg.MakePregMap} to make a dictionary that maps from each {\tt caseid} to a list of indices into the pregnancy DataFrame.
\index{DataFrame}

A solution to this exercise is in \verb"chap01soln.py".
\end{exercise}
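To get started on the reading part of this exercise, a minimal sketch might mirror {\tt ReadFemPreg}. This is not the book's solution (see \verb"chap01soln.py"), and it assumes the respondent dictionary file is named {\tt 2002FemResp.dct}:

\begin{verbatim}
import thinkstats2

def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz'):
    # same pattern as ReadFemPreg: parse the Stata dictionary,
    # then use it to read the fixed-width data file
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    return df
\end{verbatim}

From there, \verb"value_counts" gives the counts to compare with the codebook.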


\begin{exercise}
The best way to learn about statistics is to work on a project you are interested in. Is there a question like, ``Do first babies arrive late?'' that you want to investigate?

Think about questions you find personally interesting, or items of conventional wisdom, or controversial topics, or questions that have political consequences, and see if you can formulate a question that lends itself to statistical inquiry.

Look for data to help you address the question. Governments are good sources because data from public research is often freely available. Good places to start include \url{http://www.data.gov/} and \url{http://www.science.gov/}, and, in the United Kingdom, \url{http://data.gov.uk/}.

Two of my favorite data sets are the General Social Survey at \url{http://www3.norc.org/gss+website/}, and the European Social Survey at \url{http://www.europeansocialsurvey.org/}.

If it seems like someone has already answered your question, look closely to see whether the answer is justified. There might be flaws in the data or the analysis that make the conclusion unreliable. In that case you could perform a different analysis of the same data, or look for a better source of data.

If you find a published paper that addresses your question, you should be able to get the raw data. Many authors make their data available on the web, but for sensitive data you might have to write to the authors, provide information about how you plan to use the data, or agree to certain terms of use. Be persistent!

\end{exercise}


\section{Glossary}

\begin{itemize}

\item {\bf anecdotal evidence}: Evidence, often personal, that is collected casually rather than by a well-designed study.
\index{anecdotal evidence}

\item {\bf population}: A group we are interested in studying. ``Population'' often refers to a group of people, but the term is used for other subjects, too.
\index{population}

\item {\bf cross-sectional study}: A study that collects data about a population at a particular point in time.
\index{cross-sectional study}
\index{study!cross-sectional}

\item {\bf cycle}: In a repeated cross-sectional study, each repetition of the study is called a cycle.

\item {\bf longitudinal study}: A study that follows a population over time, collecting data from the same group repeatedly.
\index{longitudinal study}
\index{study!longitudinal}

\item {\bf record}: In a dataset, a collection of information about a single person or other subject.
\index{record}

\item {\bf respondent}: A person who responds to a survey.
\index{respondent}

\item {\bf sample}: The subset of a population used to collect data.
\index{sample}

\item {\bf representative}: A sample is representative if every member of the population has the same chance of being in the sample.
\index{representative}

\item {\bf oversampling}: The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.
\index{oversampling}

\item {\bf raw data}: Values collected and recorded with little or no checking, calculation or interpretation.
\index{raw data}

\item {\bf recode}: A value that is generated by calculation and other logic applied to raw data.
\index{recode}

\item {\bf data cleaning}: Processes that include validating data, identifying errors, translating between data types and representations, etc.

\end{itemize}



\chapter{Distributions}
\label{descriptive}


\section{Histograms}
\label{histograms}

One of the best ways to describe a variable is to report the values that appear in the dataset and how many times each value appears. This description is called the {\bf distribution} of the variable.
\index{distribution}

The most common representation of a distribution is a {\bf histogram}, which is a graph that shows the {\bf frequency} of each value. In this context, ``frequency'' means the number of times the value appears. \index{histogram} \index{frequency}
\index{dictionary}

In Python, an efficient way to compute frequencies is with a dictionary. Given a sequence of values, {\tt t}:
%
\begin{verbatim}
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
\end{verbatim}

The result is a dictionary that maps from values to frequencies. Alternatively, you could use the {\tt Counter} class defined in the {\tt collections} module:

\begin{verbatim}
from collections import Counter
counter = Counter(t)
\end{verbatim}

The result is a {\tt Counter} object, which is a subclass of dictionary.

Another option is to use the pandas method \verb"value_counts", which we saw in the previous chapter.
But for this book I created a class, Hist, that represents histograms and provides the methods that operate on them.
\index{pandas}


\section{Representing histograms}
\index{histogram}
\index{Hist}

The Hist constructor can take a sequence, dictionary, pandas Series, or another Hist. You can instantiate a Hist object like this:
%
\begin{verbatim}
>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist
Hist({1: 1, 2: 2, 3: 1, 5: 1})
\end{verbatim}

Hist objects provide {\tt Freq}, which takes a value and returns its frequency: \index{frequency}
%
\begin{verbatim}
>>> hist.Freq(2)
2
\end{verbatim}

The bracket operator does the same thing: \index{bracket operator}
%
\begin{verbatim}
>>> hist[2]
2
\end{verbatim}

If you look up a value that has never appeared, the frequency is 0.
%
\begin{verbatim}
>>> hist.Freq(4)
0
\end{verbatim}

{\tt Values} returns an unsorted list of the values in the Hist:
%
\begin{verbatim}
>>> hist.Values()
[1, 5, 3, 2]
\end{verbatim}

To loop through the values in order, you can use the built-in function {\tt sorted}:
%
\begin{verbatim}
for val in sorted(hist.Values()):
    print(val, hist.Freq(val))
\end{verbatim}

Or you can use {\tt Items} to iterate through value-frequency pairs: \index{frequency}
%
\begin{verbatim}
for val, freq in hist.Items():
    print(val, freq)
\end{verbatim}


\section{Plotting histograms}
\index{pyplot}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_lb_hist.pdf}}
\caption{Histogram of the pound part of birth weight.}
\label{first_wgt_lb_hist}
\end{figure}

For this book I wrote a module called {\tt thinkplot.py} that provides functions for plotting Hists and other objects defined in {\tt thinkstats2.py}. It is based on {\tt pyplot}, which is part of the {\tt matplotlib} package. See Section~\ref{code} for information about installing {\tt matplotlib}. \index{thinkplot}
\index{matplotlib}

To plot {\tt hist} with {\tt thinkplot}, try this:
\index{Hist}

\begin{verbatim}
>>> import thinkplot
>>> thinkplot.Hist(hist)
>>> thinkplot.Show(xlabel='value', ylabel='frequency')
\end{verbatim}

You can read the documentation for {\tt thinkplot} at \url{http://greenteapress.com/thinkstats2/thinkplot.html}.


\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_oz_hist.pdf}}
\caption{Histogram of the ounce part of birth weight.}
\label{first_wgt_oz_hist}
\end{figure}


\section{NSFG variables}

Now let's get back to the data from the NSFG.
The code in this chapter is in {\tt first.py}. For information about downloading and working with this code, see Section~\ref{code}.

When you start working with a new dataset, I suggest you explore the variables you are planning to use one at a time, and a good way to start is by looking at histograms.
\index{histogram}

In Section~\ref{cleaning} we transformed {\tt agepreg} from centiyears to years, and combined \verb"birthwgt_lb" and \verb"birthwgt_oz" into a single quantity, \verb"totalwgt_lb". In this section I use these variables to demonstrate some features of histograms.

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_agepreg_hist.pdf}}
\caption{Histogram of mother's age at end of pregnancy.}
\label{first_agepreg_hist}
\end{figure}

I'll start by reading the data and selecting records for live births:

\begin{verbatim}
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
\end{verbatim}

The expression in brackets is a boolean Series that selects rows from the DataFrame and returns a new DataFrame. Next I generate and plot the histogram of \verb"birthwgt_lb" for live births.
\index{DataFrame}
\index{Series}
\index{Hist}
\index{bracket operator}
\index{boolean}

\begin{verbatim}
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Show(xlabel='pounds', ylabel='frequency')
\end{verbatim}

When the argument passed to Hist is a pandas Series, any {\tt nan} values are dropped. {\tt label} is a string that appears in the legend when the Hist is plotted.
\index{pandas}
\index{Series}
\index{thinkplot}
\index{NaN}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_prglngth_hist.pdf}}
\caption{Histogram of pregnancy length in weeks.}
\label{first_prglngth_hist}
\end{figure}

Figure~\ref{first_wgt_lb_hist} shows the result. The most common value, called the {\bf mode}, is 7 pounds. The distribution is approximately bell-shaped, which is the shape of the {\bf normal} distribution, also called a {\bf Gaussian} distribution. But unlike a true normal distribution, this distribution is asymmetric; it has a {\bf tail} that extends farther to the left than to the right.

Figure~\ref{first_wgt_oz_hist} shows the histogram of \verb"birthwgt_oz", which is the ounces part of birth weight. In theory we expect this distribution to be {\bf uniform}; that is, all values should have the same frequency. In fact, 0 is more common than the other values, and 1 and 15 are less common, probably because respondents round off birth weights that are close to an integer value.
\index{birth weight}
\index{weight!birth}

Figure~\ref{first_agepreg_hist} shows the histogram of \verb"agepreg", the mother's age at the end of pregnancy. The mode is 21 years. The distribution is very roughly bell-shaped, but in this case the tail extends farther to the right than left; most mothers are in their 20s, fewer in their 30s.

Figure~\ref{first_prglngth_hist} shows the histogram of \verb"prglngth", the length of the pregnancy in weeks. By far the most common value is 39 weeks.
The left tail is longer than the right; early babies are common, but pregnancies seldom go past 43 weeks, and doctors often intervene if they do.
\index{pregnancy length}


\section{Outliers}

Looking at histograms, it is easy to identify the most common values and the shape of the distribution, but rare values are not always visible.
\index{histogram}

Before going on, it is a good idea to check for {\bf outliers}, which are extreme values that might be errors in measurement and recording, or might be accurate reports of rare events.
\index{outlier}

Hist provides methods {\tt Largest} and {\tt Smallest}, which take an integer {\tt n} and return the {\tt n} largest or smallest values from the histogram:
\index{Hist}

\begin{verbatim}
for weeks, freq in hist.Smallest(10):
    print(weeks, freq)
\end{verbatim}

In the list of pregnancy lengths for live births, the 10 lowest values are {\tt [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]}. Values below 10 weeks are certainly errors; the most likely explanation is that the outcome was not coded correctly. Values higher than 30 weeks are probably legitimate. Between 10 and 30 weeks, it is hard to be sure; some values are probably errors, but some represent premature babies.
\index{pregnancy length}

On the other end of the range, the highest values are:
%
\begin{verbatim}
weeks  count
43     148
44      46
45      10
46       1
47       1
48       7
50       2
\end{verbatim}

Most doctors recommend induced labor if a pregnancy exceeds 42 weeks, so some of the longer values are surprising. In particular, 50 weeks seems medically unlikely.

The best way to handle outliers depends on ``domain knowledge''; that is, information about where the data come from and what they mean. And it depends on what analysis you are planning to perform.
\index{outlier}

In this example, the motivating question is whether first babies tend to be early (or late). When people ask this question, they are usually interested in full-term pregnancies, so for this analysis I will focus on pregnancies longer than 27 weeks.


\section{First babies}

Now we can compare the distribution of pregnancy lengths for first babies and others. I divided the DataFrame of live births into two groups using {\tt birthord}, and computed a histogram for each group:
\index{DataFrame}
\index{Hist}
\index{pregnancy length}

\begin{verbatim}
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]

first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
other_hist = thinkstats2.Hist(others.prglngth, label='other')
\end{verbatim}

Then I plotted their histograms on the same axis:

\begin{verbatim}
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency',
               xlim=[27, 46])
\end{verbatim}

{\tt thinkplot.PrePlot} takes the number of histograms we are planning to plot; it uses this information to choose an appropriate collection of colors.
\index{thinkplot}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_nsfg_hist.pdf}}
\caption{Histogram of pregnancy lengths.}
\label{first_nsfg_hist}
\end{figure}

{\tt thinkplot.Hist} normally uses {\tt align='center'} so that each bar is centered over its value. For this figure, I use {\tt align='right'} and {\tt align='left'} to place corresponding bars on either side of the value.
\index{Hist}

With {\tt width=0.45}, the total width of the two bars is 0.9, leaving some space between each pair.

Finally, I adjust the axis to show only data between 27 and 46 weeks. Figure~\ref{first_nsfg_hist} shows the result.
\index{pregnancy length}
\index{length!pregnancy}

Histograms are useful because they make the most frequent values immediately apparent. But they are not the best choice for comparing two distributions. In this example, there are fewer ``first babies'' than ``others,'' so some of the apparent differences in the histograms are due to sample sizes. In the next chapter we address this problem using probability mass functions.


\section{Summarizing distributions}
\label{mean}

A histogram is a complete description of the distribution of a sample; that is, given a histogram, we could reconstruct the values in the sample (although not their order).

If the details of the distribution are important, it might be necessary to present a histogram. But often we want to summarize the distribution with a few descriptive statistics.

Some of the characteristics we might want to report are:

\begin{itemize}

\item central tendency: Do the values tend to cluster around a particular point?
\index{central tendency}

\item modes: Is there more than one cluster?
\index{mode}

\item spread: How much variability is there in the values?
\index{spread}

\item tails: How quickly do the probabilities drop off as we move away from the modes?
\index{tail}

\item outliers: Are there extreme values far from the modes?
\index{outlier}

\end{itemize}

Statistics designed to answer these questions are called {\bf summary statistics}. By far the most common summary statistic is the {\bf mean}, which is meant to describe the central tendency of the distribution.
\index{mean} \index{average} \index{summary statistic}

If you have a sample of {\tt n} values, $x_i$, the mean, $\xbar$, is the sum of the values divided by the number of values; in other words
%
\[ \xbar = \frac{1}{n} \sum_i x_i \]
%
The words ``mean'' and ``average'' are sometimes used interchangeably, but I make this distinction:

\begin{itemize}

\item The ``mean'' of a sample is the summary statistic computed with the previous formula.

\item An ``average'' is one of several summary statistics you might choose to describe a central tendency.
\index{central tendency}

\end{itemize}

Sometimes the mean is a good description of a set of values. For example, apples are all pretty much the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is 3 pounds, it would be a reasonable summary to say they are about a half pound each.
\index{weight!pumpkin}

But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one Atlantic Giant\textregistered~pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told you ``The average pumpkin in my garden is 100 pounds,'' that would be misleading. In this example, there is no meaningful average because there is no typical pumpkin.
\index{pumpkin}
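A quick check of that arithmetic, as a sketch rather than code from the book:

\begin{verbatim}
import numpy as np

# three 1-pound decorative pumpkins, two 3-pound pie pumpkins,
# and one 591-pound Atlantic Giant
weights = [1, 1, 1, 3, 3, 591]
print(np.mean(weights))    # 100.0, the misleading "average pumpkin"
\end{verbatim}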
I divided the DataFrame of live births using1780{\tt birthord}, and computed their histograms:1781\index{DataFrame}1782\index{Hist}1783\index{pregnancy length}17841785\begin{verbatim}1786firsts = live[live.birthord == 1]1787others = live[live.birthord != 1]17881789first_hist = thinkstats2.Hist(firsts.prglngth, label='first')1790other_hist = thinkstats2.Hist(others.prglngth, label='other')1791\end{verbatim}17921793Then I plotted their histograms on the same axis:17941795\begin{verbatim}1796width = 0.451797thinkplot.PrePlot(2)1798thinkplot.Hist(first_hist, align='right', width=width)1799thinkplot.Hist(other_hist, align='left', width=width)1800thinkplot.Show(xlabel='weeks', ylabel='frequency',1801xlim=[27, 46])1802\end{verbatim}18031804{\tt thinkplot.PrePlot} takes the number of histograms1805we are planning to plot; it uses this information to choose1806an appropriate collection of colors.1807\index{thinkplot}18081809\begin{figure}1810% first.py1811\centerline{\includegraphics[height=2.5in]{figs/first_nsfg_hist.pdf}}1812\caption{Histogram of pregnancy lengths.}1813\label{first_nsfg_hist}1814\end{figure}18151816{\tt thinkplot.Hist} normally uses {\tt align='center'} so that1817each bar is centered over its value. For this figure, I use1818{\tt align='right'} and {\tt align='left'} to place1819corresponding bars on either side of the value.1820\index{Hist}18211822With {\tt width=0.45}, the total width of the two bars is 0.9,1823leaving some space between each pair.18241825Finally, I adjust the axis to show only data between 27 and 46 weeks.1826Figure~\ref{first_nsfg_hist} shows the result.1827\index{pregnancy length}1828\index{length!pregnancy}18291830Histograms are useful because they make the most frequent values1831immediately apparent. But they are not the best choice for comparing1832two distributions. In this example, there are fewer ``first babies''1833than ``others,'' so some of the apparent differences in the histograms1834are due to sample sizes. In the next chapter we address this problem1835using probability mass functions.183618371838\section{Summarizing distributions}1839\label{mean}18401841A histogram is a complete description of the distribution of a sample;1842that is, given a histogram, we could reconstruct the values in the1843sample (although not their order).18441845If the details of the distribution are important, it might be1846necessary to present a histogram. But often we want to1847summarize the distribution with a few descriptive statistics.18481849Some of the characteristics we might want to report are:18501851\begin{itemize}18521853\item central tendency: Do the values tend to cluster around1854a particular point?1855\index{central tendency}18561857\item modes: Is there more than one cluster?1858\index{mode}18591860\item spread: How much variability is there in the values?1861\index{spread}18621863\item tails: How quickly do the probabilities drop off as we1864move away from the modes?1865\index{tail}18661867\item outliers: Are there extreme values far from the modes?1868\index{outlier}18691870\end{itemize}18711872Statistics designed to answer these questions are called {\bf summary1873statistics}. By far the most common summary statistic is the {\bf1874mean}, which is meant to describe the central tendency of the1875distribution. 
\index{mean} \index{average} \index{summary statistic}

If you have a sample of {\tt n} values, $x_i$, the mean, $\xbar$, is
the sum of the values divided by the number of values; in other words
%
\[ \xbar = \frac{1}{n} \sum_i x_i \]
%
The words ``mean'' and ``average'' are sometimes used interchangeably,
but I make this distinction:

\begin{itemize}

\item The ``mean'' of a sample is the summary statistic computed with
the previous formula.

\item An ``average'' is one of several summary statistics you might
choose to describe a central tendency.
\index{central tendency}

\end{itemize}

Sometimes the mean is a good description of a set of values. For
example, apples are all pretty much the same size (at least the ones
sold in supermarkets). So if I buy 6 apples and the total weight is 3
pounds, it would be a reasonable summary to say they are about a half
pound each.
\index{weight!pumpkin}

But pumpkins are more diverse. Suppose I grow several varieties in my
garden, and one day I harvest three decorative pumpkins that are 1
pound each, two pie pumpkins that are 3 pounds each, and one Atlantic
Giant\textregistered~pumpkin that weighs 591 pounds. The mean of this
sample is 100 pounds, but if I told you ``The average pumpkin in my
garden is 100 pounds,'' that would be misleading. In this example,
there is no meaningful average because there is no typical pumpkin.
\index{pumpkin}


\section{Variance}
\index{variance}

If there is no single number that summarizes pumpkin weights,
we can do a little better with two numbers: mean and {\bf variance}.

Variance is a summary statistic intended to describe the variability
or spread of a distribution. The variance of a set of values is
%
\[ S^2 = \frac{1}{n} \sum_i (x_i - \xbar)^2 \]
%
The term $x_i - \xbar$ is called the ``deviation from the mean,'' so
variance is the mean squared deviation. The square root of variance,
$S$, is the {\bf standard deviation}.
\index{deviation}
\index{standard deviation}

If you have prior experience, you might have seen a formula for
variance with $n-1$ in the denominator, rather than {\tt n}. This
statistic is used to estimate the variance in a population using a
sample. We will come back to this in Chapter~\ref{estimation}.
\index{sample variance}

Pandas data structures provide methods to compute mean, variance and
standard deviation:
\index{pandas}

\begin{verbatim}
mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()
\end{verbatim}

For all live births, the mean pregnancy length is 38.6 weeks and the
standard deviation is 2.7 weeks, which means we should expect
deviations of 2--3 weeks to be common.
\index{pregnancy length}

Variance of pregnancy length is 7.3, which is hard to interpret,
especially since the units are weeks$^2$, or ``square weeks.''
Variance is useful in some calculations, but it is not
a good summary statistic.


\section{Effect size}
\index{effect size}

An {\bf effect size} is a summary statistic intended to describe (wait
for it) the size of an effect. For example, to describe the
difference between two groups, one obvious choice is the difference in
the means.
\index{effect size}

Mean pregnancy length for first babies is 38.601; for
other babies it is 38.523.
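In code, this comparison is just a difference of Series means; here is a
minimal sketch using the {\tt firsts} and {\tt others} DataFrames defined
earlier:

\begin{verbatim}
mean1 = firsts.prglngth.mean()
mean2 = others.prglngth.mean()
diff = mean1 - mean2    # difference in means, in weeks
\end{verbatim}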
The difference is 0.078 weeks, which works1968out to 13 hours. As a fraction of the typical pregnancy length, this1969difference is about 0.2\%.1970\index{pregnancy length}19711972If we assume this estimate is accurate, such a difference1973would have no practical consequences. In fact, without1974observing a large number of pregnancies, it is unlikely that anyone1975would notice this difference at all.1976\index{effect size}19771978Another way to convey the size of the effect is to compare the1979difference between groups to the variability within groups.1980Cohen's $d$ is a statistic intended to do that; it is defined1981%1982\[ d = \frac{\bar{x_1} - \bar{x_2}}{s} \]1983%1984where $\bar{x_1}$ and $\bar{x_2}$ are the means of the groups and1985$s$ is the ``pooled standard deviation''. Here's the Python1986code that computes Cohen's $d$:1987\index{standard deviation!pooled}19881989\begin{verbatim}1990def CohenEffectSize(group1, group2):1991diff = group1.mean() - group2.mean()19921993var1 = group1.var()1994var2 = group2.var()1995n1, n2 = len(group1), len(group2)19961997pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)1998d = diff / math.sqrt(pooled_var)1999return d2000\end{verbatim}20012002In this example, the difference in means is 0.029 standard deviations,2003which is small. To put that in perspective, the difference in2004height between men and women is about 1.7 standard deviations (see2005\url{https://en.wikipedia.org/wiki/Effect_size}).200620072008\section{Reporting results}20092010We have seen several ways to describe the difference in pregnancy2011length (if there is one) between first babies and others. How should2012we report these results?2013\index{pregnancy length}20142015The answer depends on who is asking the question. A scientist might2016be interested in any (real) effect, no matter how small. A doctor2017might only care about effects that are {\bf clinically significant};2018that is, differences that affect treatment decisions. A pregnant2019woman might be interested in results that are relevant to her, like2020the probability of delivering early or late.2021\index{clinically significant} \index{significant}20222023How you report results also depends on your goals. If you are trying2024to demonstrate the importance of an effect, you might choose summary2025statistics that emphasize differences. If you are trying to reassure2026a patient, you might choose statistics that put the differences in2027context.20282029Of course your decisions should also be guided by professional ethics.2030It's ok to be persuasive; you {\em should\/} design statistical reports2031and visualizations that tell a story clearly. But you should also do2032your best to make your reports honest, and to acknowledge uncertainty2033and limitations.2034\index{ethics}203520362037\section{Exercises}20382039\begin{exercise}2040Based on the results in this chapter, suppose you were asked to2041summarize what you learned about whether first babies arrive late.20422043Which summary statistics would you use if you wanted to get a story2044on the evening news? 
Which ones would you use if you wanted to2045reassure an anxious patient?2046\index{Adams, Cecil}2047\index{Straight Dope, The}20482049Finally, imagine that you are Cecil Adams, author of {\it The Straight2050Dope\/} (\url{http://straightdope.com}), and your job is to answer the2051question, ``Do first babies arrive late?'' Write a paragraph that2052uses the results in this chapter to answer the question clearly,2053precisely, and honestly.2054\index{ethics}20552056\end{exercise}20572058\begin{exercise}2059In the repository you downloaded, you should find a file named2060\verb"chap02ex.ipynb"; open it. Some cells are already filled in, and2061you should execute them. Other cells give you instructions for2062exercises. Follow the instructions and fill in the answers.20632064A solution to this exercise is in \verb"chap02soln.ipynb"2065\end{exercise}20662067In the repository you downloaded, you should find a file named2068\verb"chap02ex.py"; you can use this file as a starting place2069for the following exercises.2070My solution is in \verb"chap02soln.py".20712072\begin{exercise}2073The mode of a distribution is the most frequent value; see2074\url{http://wikipedia.org/wiki/Mode_(statistics)}. Write a function2075called {\tt Mode} that takes a Hist and returns the most2076frequent value.\index{mode}2077\index{Hist}20782079As a more challenging exercise, write a function called {\tt AllModes}2080that returns a list of value-frequency pairs in descending order of2081frequency.2082\index{frequency}2083\end{exercise}20842085\begin{exercise}2086Using the variable \verb"totalwgt_lb", investigate whether first2087babies are lighter or heavier than others. Compute Cohen's $d$2088to quantify the difference between the groups. How does it2089compare to the difference in pregnancy length?2090\index{pregnancy length}2091\end{exercise}209220932094\section{Glossary}20952096\begin{itemize}20972098\item distribution: The values that appear in a sample2099and the frequency of each.2100\index{distribution}21012102\item histogram: A mapping from values to frequencies, or a graph2103that shows this mapping.2104\index{histogram}21052106\item frequency: The number of times a value appears in a sample.2107\index{frequency}21082109\item mode: The most frequent value in a sample, or one of the2110most frequent values.2111\index{mode}21122113\item normal distribution: An idealization of a bell-shaped distribution;2114also known as a Gaussian distribution.2115\index{Gaussian distribution}2116\index{normal distribution}21172118\item uniform distribution: A distribution in which all values have2119the same frequency.2120\index{uniform distribution}21212122\item tail: The part of a distribution at the high and low extremes.2123\index{tail}21242125\item central tendency: A characteristic of a sample or population;2126intuitively, it is an average or typical value.2127\index{central tendency}21282129\item outlier: A value far from the central tendency.2130\index{outlier}21312132\item spread: A measure of how spread out the values in a distribution2133are.2134\index{spread}21352136\item summary statistic: A statistic that quantifies some aspect2137of a distribution, like central tendency or spread.2138\index{summary statistic}21392140\item variance: A summary statistic often used to quantify spread.2141\index{variance}21422143\item standard deviation: The square root of variance, also used2144as a measure of spread.2145\index{standard deviation}21462147\item effect size: A summary statistic intended to quantify the 
size2148of an effect like a difference between groups.2149\index{effect size}21502151\item clinically significant: A result, like a difference between groups,2152that is relevant in practice.2153\index{clinically significant}21542155\end{itemize}21562157215821592160\chapter{Probability mass functions}2161\index{probability mass function}21622163The code for this chapter is in {\tt probability.py}.2164For information about downloading and2165working with this code, see Section~\ref{code}.216621672168\section{Pmfs}2169\index{Pmf}21702171Another way to represent a distribution is a {\bf probability mass2172function} (PMF), which maps from each value to its probability. A2173{\bf probability} is a frequency expressed as a fraction of the sample2174size, {\tt n}. To get from frequencies to probabilities, we divide2175through by {\tt n}, which is called {\bf normalization}.2176\index{frequency}2177\index{probability}2178\index{normalization}2179\index{PMF}2180\index{probability mass function}21812182Given a Hist, we can make a dictionary that maps from each2183value to its probability: \index{Hist}2184%2185\begin{verbatim}2186n = hist.Total()2187d = {}2188for x, freq in hist.Items():2189d[x] = freq / n2190\end{verbatim}2191%2192Or we can use the Pmf class provided by {\tt thinkstats2}.2193Like Hist, the Pmf constructor can take a list, pandas2194Series, dictionary, Hist, or another Pmf object. Here's an example2195with a simple list:2196%2197\begin{verbatim}2198>>> import thinkstats22199>>> pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])2200>>> pmf2201Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})2202\end{verbatim}22032204The Pmf is normalized so total probability is 1.22052206Pmf and Hist objects are similar in many ways; in fact, they inherit2207many of their methods from a common parent class. For example, the2208methods {\tt Values} and {\tt Items} work the same way for both. The2209biggest difference is that a Hist maps from values to integer2210counters; a Pmf maps from values to floating-point probabilities.2211\index{Hist}22122213To look up the probability associated with a value, use {\tt Prob}:2214%2215\begin{verbatim}2216>>> pmf.Prob(2)22170.42218\end{verbatim}22192220The bracket operator is equivalent:2221\index{bracket operator}22222223\begin{verbatim}2224>>> pmf[2]22250.42226\end{verbatim}22272228You can modify an existing Pmf by incrementing the probability2229associated with a value:2230%2231\begin{verbatim}2232>>> pmf.Incr(2, 0.2)2233>>> pmf.Prob(2)22340.62235\end{verbatim}22362237Or you can multiply a probability by a factor:2238%2239\begin{verbatim}2240>>> pmf.Mult(2, 0.5)2241>>> pmf.Prob(2)22420.32243\end{verbatim}22442245If you modify a Pmf, the result may not be normalized; that is, the2246probabilities may no longer add up to 1. 
To check, you can call {\tt Total}, which returns the sum of the
probabilities:
%
\begin{verbatim}
>>> pmf.Total()
0.9
\end{verbatim}

To renormalize, call {\tt Normalize}:
%
\begin{verbatim}
>>> pmf.Normalize()
>>> pmf.Total()
1.0
\end{verbatim}

Pmf objects provide a {\tt Copy} method so you can make
and modify a copy without affecting the original.
\index{Pmf}

My notation in this section might seem inconsistent, but there is a
system: I use Pmf for the name of the class, {\tt pmf} for an instance
of the class, and PMF for the mathematical concept of a
probability mass function.


\section{Plotting PMFs}
\index{PMF}

{\tt thinkplot} provides two ways to plot Pmfs:
\index{thinkplot}

\begin{itemize}

\item To plot a Pmf as a bar graph, you can use
{\tt thinkplot.Hist}. Bar graphs are most useful if the number
of values in the Pmf is small.
\index{bar plot}
\index{plot!bar}

\item To plot a Pmf as a step function, you can use
{\tt thinkplot.Pmf}. This option is most useful if there are
a large number of values and the Pmf is smooth. This function
also works with Hist objects.
\index{line plot}
\index{plot!line}
\index{Hist}
\index{Pmf}

\end{itemize}

In addition, {\tt pyplot} provides a function called {\tt hist} that
takes a sequence of values, computes a histogram, and plots it.
Since I use Hist objects, I usually don't use {\tt pyplot.hist}.
\index{pyplot}

\begin{figure}
% probability.py
\centerline{\includegraphics[height=3.0in]{figs/probability_nsfg_pmf.pdf}}
\caption{PMF of pregnancy lengths for first babies and others, using
bar graphs and step functions.}
\label{probability_nsfg_pmf}
\end{figure}
\index{pregnancy length}
\index{length!pregnancy}

Figure~\ref{probability_nsfg_pmf} shows PMFs of pregnancy length for
first babies and others using bar graphs (left) and step functions
(right).
\index{pregnancy length}

By plotting the PMF instead of the histogram, we can compare the two
distributions without being misled by the difference in sample
size. Based on this figure, first babies seem to be less likely than
others to arrive on time (week 39) and more likely to be late (weeks
41 and 42).

Here's the code that generates Figure~\ref{probability_nsfg_pmf}:

\begin{verbatim}
thinkplot.PrePlot(2, cols=2)
thinkplot.Hist(first_pmf, align='right', width=width)
thinkplot.Hist(other_pmf, align='left', width=width)
thinkplot.Config(xlabel='weeks',
                 ylabel='probability',
                 axis=[27, 46, 0, 0.6])

thinkplot.PrePlot(2)
thinkplot.SubPlot(2)
thinkplot.Pmfs([first_pmf, other_pmf])
thinkplot.Show(xlabel='weeks',
               axis=[27, 46, 0, 0.6])
\end{verbatim}

{\tt PrePlot} takes optional parameters {\tt rows} and {\tt cols}
to make a grid of figures, in this case one row of two figures.
The first figure (on the left) displays the Pmfs using {\tt thinkplot.Hist},
as we have seen before.
\index{thinkplot}
\index{Hist}

The second call to {\tt PrePlot} resets the color generator. Then
{\tt SubPlot} switches to the second figure (on the right) and
displays the Pmfs using {\tt thinkplot.Pmfs}.
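The code above does not show how \verb"first_pmf", \verb"other_pmf",
and {\tt width} are created; here is a sketch consistent with the
DataFrames from the previous chapter (the construction is my assumption,
based on how the Pmf constructor was described above):

\begin{verbatim}
width = 0.45
first_pmf = thinkstats2.Pmf(firsts.prglngth, label='first')
other_pmf = thinkstats2.Pmf(others.prglngth, label='other')
\end{verbatim}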
I used the {\tt axis} option2350to ensure that the two figures are on the same axes, which is2351generally a good idea if you intend to compare two figures.235223532354\section{Other visualizations}2355\label{visualization}23562357Histograms and PMFs are useful while you are exploring data and2358trying to identify patterns and relationships.2359Once you have an idea what is going on, a good next step is to2360design a visualization that makes the patterns you have identified2361as clear as possible.2362\index{exploratory data analysis}2363\index{visualization}23642365In the NSFG data, the biggest differences in the distributions are2366near the mode. So it makes sense to zoom in on that part of the2367graph, and to transform the data to emphasize differences:2368\index{National Survey of Family Growth}2369\index{NSFG}23702371\begin{verbatim}2372weeks = range(35, 46)2373diffs = []2374for week in weeks:2375p1 = first_pmf.Prob(week)2376p2 = other_pmf.Prob(week)2377diff = 100 * (p1 - p2)2378diffs.append(diff)23792380thinkplot.Bar(weeks, diffs)2381\end{verbatim}23822383In this code, {\tt weeks} is the range of weeks; {\tt diffs} is the2384difference between the two PMFs in percentage points.2385Figure~\ref{probability_nsfg_diffs} shows the result as a bar chart.2386This figure makes the pattern clearer: first babies are less likely to2387be born in week 39, and somewhat more likely to be born in weeks 412388and 42.2389\index{thinkplot}23902391\begin{figure}2392% probability.py2393\centerline{\includegraphics[height=2.5in]{figs/probability_nsfg_diffs.pdf}}2394\caption{Difference, in percentage points, by week.}2395\label{probability_nsfg_diffs}2396\end{figure}23972398For now we should hold this conclusion only tentatively.2399We used the same dataset to identify an2400apparent difference and then chose a visualization that makes the2401difference apparent. We can't be sure this effect is real;2402it might be due to random variation. We'll address this concern2403later.240424052406\section{The class size paradox}2407\index{class size}24082409Before we go on, I want to demonstrate2410one kind of computation you can do with Pmf objects; I call2411this example the ``class size paradox.''2412\index{Pmf}24132414At many American colleges and universities, the student-to-faculty2415ratio is about 10:1. But students are often surprised to discover2416that their average class size is bigger than 10. There2417are two reasons for the discrepancy:24182419\begin{itemize}24202421\item Students typically take 4--5 classes per semester, but2422professors often teach 1 or 2.24232424\item The number of students who enjoy a small class is small,2425but the number of students in a large class is (ahem!) large.24262427\end{itemize}24282429The first effect is obvious, at least once it is pointed out;2430the second is more subtle. Let's look at an example. Suppose2431that a college offers 65 classes in a given semester, with the2432following distribution of sizes:2433%2434\begin{verbatim}2435size count24365- 9 8243710-14 8243815-19 14243920-24 4244025-29 6244130-34 12244235-39 8244340-44 3244445-49 22445\end{verbatim}24462447If you ask the Dean for the average class size, he would2448construct a PMF, compute the mean, and report that the2449average class size is 23.7. 
Here's the code:24502451\begin{verbatim}2452d = { 7: 8, 12: 8, 17: 14, 22: 4,245327: 6, 32: 12, 37: 8, 42: 3, 47: 2 }24542455pmf = thinkstats2.Pmf(d, label='actual')2456print('mean', pmf.Mean())2457\end{verbatim}24582459But if you survey a group of students, ask them how many2460students are in their classes, and compute the mean, you would2461think the average class was bigger. Let's see how2462much bigger.24632464First, I compute the2465distribution as observed by students, where the probability2466associated with each class size is ``biased'' by the number2467of students in the class.2468\index{observer bias}2469\index{bias!observer}24702471\begin{verbatim}2472def BiasPmf(pmf, label):2473new_pmf = pmf.Copy(label=label)24742475for x, p in pmf.Items():2476new_pmf.Mult(x, x)24772478new_pmf.Normalize()2479return new_pmf2480\end{verbatim}24812482For each class size, {\tt x}, we multiply the probability by2483{\tt x}, the number of students who observe that class size.2484The result is a new Pmf that represents the biased distribution.24852486Now we can plot the actual and observed distributions:2487\index{thinkplot}24882489\begin{verbatim}2490biased_pmf = BiasPmf(pmf, label='observed')2491thinkplot.PrePlot(2)2492thinkplot.Pmfs([pmf, biased_pmf])2493thinkplot.Show(xlabel='class size', ylabel='PMF')2494\end{verbatim}24952496\begin{figure}2497% probability.py2498\centerline{\includegraphics[height=3.0in]{figs/class_size1.pdf}}2499\caption{Distribution of class sizes, actual and as observed by students.}2500\label{class_size1}2501\end{figure}25022503Figure~\ref{class_size1} shows the result. In the biased distribution2504there are fewer small classes and more large ones.2505The mean of the biased distribution is 29.1, almost 25\% higher2506than the actual mean.25072508It is also possible to invert this operation. Suppose you want to2509find the distribution of class sizes at a college, but you can't get2510reliable data from the Dean. An alternative is to choose a random2511sample of students and ask how many students are in their2512classes. \index{bias!oversampling} \index{oversampling}25132514The result would be biased for the reasons we've just seen, but you2515can use it to estimate the actual distribution. Here's the function2516that unbiases a Pmf:25172518\begin{verbatim}2519def UnbiasPmf(pmf, label):2520new_pmf = pmf.Copy(label=label)25212522for x, p in pmf.Items():2523new_pmf.Mult(x, 1.0/x)25242525new_pmf.Normalize()2526return new_pmf2527\end{verbatim}25282529It's similar to {\tt BiasPmf}; the only difference is that it2530divides each probability by {\tt x} instead of multiplying.253125322533\section{DataFrame indexing}25342535In Section~\ref{dataframe} we read a pandas DataFrame and used it to2536select and modify data columns. 
Now let's look at row selection.2537To start, I create a NumPy array of random numbers and use it2538to initialize a DataFrame:2539\index{NumPy}2540\index{pandas}2541\index{DataFrame}25422543\begin{verbatim}2544>>> import numpy as np2545>>> import pandas2546>>> array = np.random.randn(4, 2)2547>>> df = pandas.DataFrame(array)2548>>> df25490 125500 -0.143510 0.61605025511 -1.489647 0.30077425522 -0.074350 0.03962125533 -1.369968 0.5458972554\end{verbatim}25552556By default, the rows and columns are numbered starting at zero, but2557you can provide column names:25582559\begin{verbatim}2560>>> columns = ['A', 'B']2561>>> df = pandas.DataFrame(array, columns=columns)2562>>> df2563A B25640 -0.143510 0.61605025651 -1.489647 0.30077425662 -0.074350 0.03962125673 -1.369968 0.5458972568\end{verbatim}25692570You can also provide row names. The set of row names is called the2571{\bf index}; the row names themselves are called {\bf labels}.25722573\begin{verbatim}2574>>> index = ['a', 'b', 'c', 'd']2575>>> df = pandas.DataFrame(array, columns=columns, index=index)2576>>> df2577A B2578a -0.143510 0.6160502579b -1.489647 0.3007742580c -0.074350 0.0396212581d -1.369968 0.5458972582\end{verbatim}25832584As we saw in the previous chapter, simple indexing selects a2585column, returning a Series:2586\index{Series}25872588\begin{verbatim}2589>>> df['A']2590a -0.1435102591b -1.4896472592c -0.0743502593d -1.3699682594Name: A, dtype: float642595\end{verbatim}25962597To select a row by label, you can use the {\tt loc} attribute, which2598returns a Series:25992600\begin{verbatim}2601>>> df.loc['a']2602A -0.143512603B 0.616052604Name: a, dtype: float642605\end{verbatim}26062607If you know the integer position of a row, rather than its label, you2608can use the {\tt iloc} attribute, which also returns a Series.26092610\begin{verbatim}2611>>> df.iloc[0]2612A -0.143512613B 0.616052614Name: a, dtype: float642615\end{verbatim}26162617{\tt loc} can also take a list of labels; in that case,2618the result is a DataFrame.26192620\begin{verbatim}2621>>> indices = ['a', 'c']2622>>> df.loc[indices]2623A B2624a -0.14351 0.6160502625c -0.07435 0.0396212626\end{verbatim}26272628Finally, you can use a slice to select a range of rows by label:26292630\begin{verbatim}2631>>> df['a':'c']2632A B2633a -0.143510 0.6160502634b -1.489647 0.3007742635c -0.074350 0.0396212636\end{verbatim}26372638Or by integer position:26392640\begin{verbatim}2641>>> df[0:2]2642A B2643a -0.143510 0.6160502644b -1.489647 0.3007742645\end{verbatim}26462647The result in either case is a DataFrame, but notice that the first2648result includes the end of the slice; the second doesn't.2649\index{DataFrame}26502651My advice: if your rows have labels that are not simple integers, use2652the labels consistently and avoid using integer positions.2653265426552656\section{Exercises}26572658Solutions to these exercises are in \verb"chap03soln.ipynb"2659and \verb"chap03soln.py"26602661\begin{exercise}2662Something like the class size paradox appears if you survey children2663and ask how many children are in their family. 
Families with many2664children are more likely to appear in your sample, and2665families with no children have no chance to be in the sample.2666\index{observer bias}2667\index{bias!observer}26682669Use the NSFG respondent variable \verb"NUMKDHH" to construct the actual2670distribution for the number of children under 18 in the household.26712672Now compute the biased distribution we would see if we surveyed the2673children and asked them how many children under 18 (including themselves)2674are in their household.26752676Plot the actual and biased distributions, and compute their means.2677As a starting place, you can use \verb"chap03ex.ipynb".2678\end{exercise}267926802681\begin{exercise}2682\index{mean}2683\index{variance}2684\index{PMF}26852686In Section~\ref{mean} we computed the mean of a sample by adding up2687the elements and dividing by n. If you are given a PMF, you can2688still compute the mean, but the process is slightly different:2689%2690\[ \xbar = \sum_i p_i~x_i \]2691%2692where the $x_i$ are the unique values in the PMF and $p_i=PMF(x_i)$.2693Similarly, you can compute variance like this:2694%2695\[ S^2 = \sum_i p_i~(x_i - \xbar)^2\]2696%2697Write functions called {\tt PmfMean} and {\tt PmfVar} that take a2698Pmf object and compute the mean and variance. To test these methods,2699check that they are consistent with the methods {\tt Mean} and {\tt2700Var} provided by Pmf.2701\index{Pmf}27022703\end{exercise}270427052706\begin{exercise}2707I started with the question, ``Are first babies more likely2708to be late?'' To address it, I computed the difference in2709means between groups of babies, but I ignored the possibility2710that there might be a difference between first babies and2711others {\em for the same woman}.27122713To address this version of the question, select respondents who2714have at least two babies and compute pairwise differences. Does2715this formulation of the question yield a different result?27162717Hint: use {\tt nsfg.MakePregMap}.2718\end{exercise}271927202721\begin{exercise}2722\label{relay}27232724In most foot races, everyone starts at the same time. If you are a2725fast runner, you usually pass a lot of people at the beginning of the2726race, but after a few miles everyone around you is going at the same2727speed.2728\index{relay race}27292730When I ran a long-distance (209 miles) relay race for the first2731time, I noticed an odd phenomenon: when I overtook another runner, I2732was usually much faster, and when another runner overtook me, he was2733usually much faster.27342735At first I thought that the distribution of speeds might be bimodal;2736that is, there were many slow runners and many fast runners, but few2737at my speed.27382739Then I realized that I was the victim of a bias similar to the2740effect of class size. The race2741was unusual in two ways: it used a staggered start, so teams started2742at different times; also, many teams included runners at different2743levels of ability. \index{bias!selection} \index{selection bias}27442745As a result, runners were spread out along the course with little2746relationship between speed and location. When I joined the race, the2747runners near me were (pretty much) a random sample of the runners in2748the race.27492750So where does the bias come from? During my time on the course, the2751chance of overtaking a runner, or being overtaken, is proportional to2752the difference in our speeds. I am more likely to catch a slow2753runner, and more likely to be caught by a fast runner. 
But runners2754at the same speed are unlikely to see each other.27552756Write a function called {\tt ObservedPmf} that takes a Pmf representing2757the actual distribution of runners' speeds, and the speed of a running2758observer, and returns a new Pmf representing the distribution of2759runners' speeds as seen by the observer.2760\index{observer bias}2761\index{bias!observer}27622763To test your function, you can use {\tt relay.py}, which reads the2764results from the James Joyce Ramble 10K in Dedham MA and converts the2765pace of each runner to mph.27662767Compute the distribution of speeds you would observe if you ran a2768relay race at 7.5 mph with this group of runners. A solution to this2769exercise is in \verb"relay_soln.py".2770\end{exercise}277127722773\section{Glossary}27742775\begin{itemize}27762777\item Probability mass function (PMF): a representation of a distribution2778as a function that maps from values to probabilities.2779\index{PMF}2780\index{probability mass function}27812782\item probability: A frequency expressed as a fraction of the sample2783size.2784\index{frequency}2785\index{probability}27862787\item normalization: The process of dividing a frequency by a sample2788size to get a probability.2789\index{normalization}27902791\item index: In a pandas DataFrame, the index is a special column2792that contains the row labels.2793\index{pandas}2794\index{DataFrame}27952796\end{itemize}279727982799\chapter{Cumulative distribution functions}2800\label{cumulative}28012802The code for this chapter is in {\tt cumulative.py}.2803For information about downloading and2804working with this code, see Section~\ref{code}.280528062807\section{The limits of PMFs}2808\index{PMF}28092810PMFs work well if the number of values is small. But as the number of2811values increases, the probability associated with each value gets2812smaller and the effect of random noise increases.28132814For example, we might be interested in the distribution of birth2815weights. In the NSFG data, the variable \verb"totalwgt_lb" records2816weight at birth in pounds. Figure~\ref{nsfg_birthwgt_pmf} shows2817the PMF of these values for first babies and others.2818\index{National Survey of Family Growth} \index{NSFG} \index{birth weight}2819\index{weight!birth}28202821\begin{figure}2822% cumulative.py2823\centerline{\includegraphics[height=2.5in]{figs/nsfg_birthwgt_pmf.pdf}}2824\caption{PMF of birth weights. This figure shows a limitation2825of PMFs: they are hard to compare visually.}2826\label{nsfg_birthwgt_pmf}2827\end{figure}28282829Overall, these distributions resemble the bell shape of a normal2830distribution, with many values near the mean and a few values much2831higher and lower.28322833But parts of this figure are hard to interpret. There are many spikes2834and valleys, and some apparent differences between the distributions.2835It is hard to tell which of these features are meaningful. Also, it2836is hard to see overall patterns; for example, which distribution do2837you think has the higher mean?2838\index{binning}28392840These problems can be mitigated by binning the data; that is, dividing2841the range of values into non-overlapping intervals and counting the2842number of values in each bin. Binning can be useful, but it is tricky2843to get the size of the bins right. 
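For example, binning with NumPy might look like this (a sketch, not code
from the book; the choice of 20 bins is arbitrary):

\begin{verbatim}
import numpy as np

weights = live.totalwgt_lb.dropna()
counts, bin_edges = np.histogram(weights, bins=20)
\end{verbatim}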
If they are big enough to smooth2844out noise, they might also smooth out useful information.28452846An alternative that avoids these problems is the cumulative2847distribution function (CDF), which is the subject of this chapter.2848But before I can explain CDFs, I have to explain percentiles.2849\index{CDF}285028512852\section{Percentiles}2853\index{percentile rank}28542855If you have taken a standardized test, you probably got your2856results in the form of a raw score and a {\bf percentile rank}.2857In this context, the percentile rank is the fraction of people who2858scored lower than you (or the same). So if you are ``in the 90th2859percentile,'' you did as well as or better than 90\% of the people who2860took the exam.28612862Here's how you could compute the percentile rank of a value,2863\verb"your_score", relative to the values in the sequence {\tt2864scores}:2865%2866\begin{verbatim}2867def PercentileRank(scores, your_score):2868count = 02869for score in scores:2870if score <= your_score:2871count += 128722873percentile_rank = 100.0 * count / len(scores)2874return percentile_rank2875\end{verbatim}28762877As an example, if the2878scores in the sequence were 55, 66, 77, 88 and 99, and you got the 88,2879then your percentile rank would be {\tt 100 * 4 / 5} which is 80.28802881If you are given a value, it is easy to find its percentile rank; going2882the other way is slightly harder. If you are given a percentile rank2883and you want to find the corresponding value, one option is to2884sort the values and search for the one you want:2885%2886\begin{verbatim}2887def Percentile(scores, percentile_rank):2888scores.sort()2889for score in scores:2890if PercentileRank(scores, score) >= percentile_rank:2891return score2892\end{verbatim}28932894The result of this calculation is a {\bf percentile}. For example,2895the 50th percentile is the value with percentile rank 50. In the2896distribution of exam scores, the 50th percentile is 77.2897\index{percentile}28982899This implementation of {\tt Percentile} is not efficient. A2900better approach is to use the percentile rank to compute the index of2901the corresponding percentile:29022903\begin{verbatim}2904def Percentile2(scores, percentile_rank):2905scores.sort()2906index = percentile_rank * (len(scores)-1) // 1002907return scores[index]2908\end{verbatim}29092910The difference between ``percentile'' and ``percentile rank'' can2911be confusing, and people do not always use the terms precisely.2912To summarize, {\tt PercentileRank} takes a value and computes2913its percentile rank in a set of values; {\tt Percentile} takes2914a percentile rank and computes the corresponding value.2915\index{percentile rank}291629172918\section{CDFs}2919\index{CDF}29202921Now that we understand percentiles and percentile ranks,2922we are ready to tackle the {\bf cumulative distribution function}2923(CDF). The CDF is the function that maps from a value to its2924percentile rank.2925\index{cumulative distribution function}2926\index{percentile rank}29272928The CDF is a function of $x$, where $x$ is any value that might appear2929in the distribution. 
To evaluate $\CDF(x)$ for a particular value of2930$x$, we compute the fraction of values in the distribution less2931than or equal to $x$.29322933Here's what that looks like as a function that takes a sequence,2934{\tt sample}, and a value, {\tt x}:2935%2936\begin{verbatim}2937def EvalCdf(sample, x):2938count = 0.02939for value in sample:2940if value <= x:2941count += 129422943prob = count / len(sample)2944return prob2945\end{verbatim}29462947This function is almost identical to {\tt PercentileRank}, except that2948the result is a probability in the range 0--1 rather than a2949percentile rank in the range 0--100.2950\index{sample}29512952As an example, suppose we collect a sample with the values2953{\tt [1, 2, 2, 3, 5]}. Here are some values from its CDF:2954%2955\[ CDF(0) = 0 \]2956%2957\[ CDF(1) = 0.2\]2958%2959\[ CDF(2) = 0.6\]2960%2961\[ CDF(3) = 0.8\]2962%2963\[ CDF(4) = 0.8\]2964%2965\[ CDF(5) = 1\]2966%2967We can evaluate the CDF for any value of $x$, not just2968values that appear in the sample.2969If $x$ is less than the smallest value in the sample, $\CDF(x)$ is 0.2970If $x$ is greater than the largest value, $\CDF(x)$ is 1.29712972\begin{figure}2973% cumulative.py2974\centerline{\includegraphics[height=2.5in]{figs/cumulative_example_cdf.pdf}}2975\caption{Example of a CDF.}2976\label{example_cdf}2977\end{figure}29782979Figure~\ref{example_cdf} is a graphical representation of this CDF.2980The CDF of a sample is a step function.2981\index{step function}298229832984\section{Representing CDFs}2985\index{Cdf}29862987{\tt thinkstats2} provides a class named Cdf that represents2988CDFs. The fundamental methods Cdf provides are:29892990\begin{itemize}29912992\item {\tt Prob(x)}: Given a value {\tt x}, computes the probability2993$p = \CDF(x)$. The bracket operator is equivalent to {\tt Prob}.2994\index{bracket operator}29952996\item {\tt Value(p)}: Given a probability {\tt p}, computes the2997corresponding value, {\tt x}; that is, the {\bf inverse CDF} of {\tt p}.2998\index{inverse CDF}2999\index{CDF, inverse}30003001\end{itemize}30023003\begin{figure}3004% cumulative.py3005\centerline{\includegraphics[height=2.5in]{figs/cumulative_prglngth_cdf.pdf}}3006\caption{CDF of pregnancy length.}3007\label{cumulative_prglngth_cdf}3008\end{figure}30093010The Cdf constructor can take as an argument a list of values,3011a pandas Series, a Hist, Pmf, or another Cdf. The following3012code makes a Cdf for the distribution of pregnancy lengths in3013the NSFG:3014\index{NSFG}3015\index{pregnancy length}30163017\begin{verbatim}3018live, firsts, others = first.MakeFrames()3019cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')3020\end{verbatim}30213022{\tt thinkplot} provides a function named {\tt Cdf} that3023plots Cdfs as lines:3024\index{thinkplot}30253026\begin{verbatim}3027thinkplot.Cdf(cdf)3028thinkplot.Show(xlabel='weeks', ylabel='CDF')3029\end{verbatim}30303031Figure~\ref{cumulative_prglngth_cdf} shows the result. One way to3032read a CDF is to look up percentiles. For example, it looks like3033about 10\% of pregnancies are shorter than 36 weeks, and about 90\%3034are shorter than 41 weeks. The CDF also provides a visual3035representation of the shape of the distribution. Common values appear3036as steep or vertical sections of the CDF; in this example, the mode at303739 weeks is apparent. 
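These readings can be checked numerically with {\tt Prob} (a sketch; the
exact values depend on the data):

\begin{verbatim}
cdf.Prob(36)    # fraction at or below 36 weeks, about 0.1
cdf.Prob(41)    # fraction at or below 41 weeks, about 0.9
\end{verbatim}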
There are few values below 30 weeks, so3038the CDF in this range is flat.3039\index{CDF, interpreting}30403041It takes some time to get used to CDFs, but once you3042do, I think you will find that they show more information, more3043clearly, than PMFs.304430453046\section{Comparing CDFs}3047\label{birth_weights}3048\index{National Survey of Family Growth}3049\index{NSFG}3050\index{birth weight}3051\index{weight!birth}30523053CDFs are especially useful for comparing distributions. For3054example, here is the code that plots the CDF of birth3055weight for first babies and others.3056\index{thinkplot}3057\index{distributions, comparing}30583059\begin{verbatim}3060first_cdf = thinkstats2.Cdf(firsts.totalwgt_lb, label='first')3061other_cdf = thinkstats2.Cdf(others.totalwgt_lb, label='other')30623063thinkplot.PrePlot(2)3064thinkplot.Cdfs([first_cdf, other_cdf])3065thinkplot.Show(xlabel='weight (pounds)', ylabel='CDF')3066\end{verbatim}30673068\begin{figure}3069% cumulative.py3070\centerline{\includegraphics[height=2.5in]{figs/cumulative_birthwgt_cdf.pdf}}3071\caption{CDF of birth weights for first babies and others.}3072\label{cumulative_birthwgt_cdf}3073\end{figure}30743075Figure~\ref{cumulative_birthwgt_cdf} shows the result.3076Compared to Figure~\ref{nsfg_birthwgt_pmf},3077this figure makes the shape of the distributions, and the differences3078between them, much clearer. We can see that first babies are slightly3079lighter throughout the distribution, with a larger discrepancy above3080the mean.3081\index{shape}30823083308430853086\section{Percentile-based statistics}3087\index{summary statistic}3088\index{interquartile range}3089\index{quartile}3090\index{percentile}3091\index{median}3092\index{central tendency}3093\index{spread}30943095Once you have computed a CDF, it is easy to compute percentiles3096and percentile ranks. The Cdf class provides these two methods:3097\index{Cdf}3098\index{percentile rank}30993100\begin{itemize}31013102\item {\tt PercentileRank(x)}: Given a value {\tt x}, computes its3103percentile rank, $100 \cdot \CDF(x)$.31043105\item {\tt Percentile(p)}: Given a percentile rank {\tt p},3106computes the corresponding value, {\tt x}. Equivalent to {\tt3107Value(p/100)}.31083109\end{itemize}31103111{\tt Percentile} can be used to compute percentile-based summary3112statistics. For example, the 50th percentile is the value that3113divides the distribution in half, also known as the {\bf median}.3114Like the mean, the median is a measure of the central tendency3115of a distribution.31163117Actually, there are several definitions of ``median,'' each with3118different properties. But {\tt Percentile(50)} is simple and3119efficient to compute.31203121Another percentile-based statistic is the {\bf interquartile range} (IQR),3122which is a measure of the spread of a distribution. The IQR3123is the difference between the 75th and 25th percentiles.31243125More generally, percentiles are often used to summarize the shape3126of a distribution. For example, the distribution of income is3127often reported in ``quintiles''; that is, it is split at the312820th, 40th, 60th and 80th percentiles. Other distributions3129are divided into ten ``deciles''. 
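Using the {\tt cdf} of pregnancy lengths built earlier, these statistics
are one-liners; here is a sketch based on the methods listed above:

\begin{verbatim}
median = cdf.Percentile(50)
iqr = cdf.Percentile(75) - cdf.Percentile(25)
quintiles = [cdf.Percentile(p) for p in [20, 40, 60, 80]]
\end{verbatim}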
Statistics like these that represent3130equally-spaced points in a CDF are called {\bf quantiles}.3131For more, see \url{https://en.wikipedia.org/wiki/Quantile}.3132\index{quantile}3133\index{quintile}3134\index{decile}3135313631373138\section{Random numbers}3139\label{random}3140\index{random number}31413142Suppose we choose a random sample from the population of live3143births and look up the percentile rank of their birth weights.3144Now suppose we compute the CDF of the percentile ranks. What do3145you think the distribution will look like?3146\index{percentile rank}3147\index{birth weight}3148\index{weight!birth}31493150Here's how we can compute it. First, we make the Cdf of3151birth weights:3152\index{Cdf}31533154\begin{verbatim}3155weights = live.totalwgt_lb3156cdf = thinkstats2.Cdf(weights, label='totalwgt_lb')3157\end{verbatim}31583159Then we generate a sample and compute the percentile rank of3160each value in the sample.31613162\begin{verbatim}3163sample = np.random.choice(weights, 100, replace=True)3164ranks = [cdf.PercentileRank(x) for x in sample]3165\end{verbatim}31663167{\tt sample}3168is a random sample of 100 birth weights, chosen with {\bf replacement};3169that is, the same value could be chosen more than once. {\tt ranks}3170is a list of percentile ranks.3171\index{replacement}31723173Finally we make and plot the Cdf of the percentile ranks.3174\index{thinkplot}31753176\begin{verbatim}3177rank_cdf = thinkstats2.Cdf(ranks)3178thinkplot.Cdf(rank_cdf)3179thinkplot.Show(xlabel='percentile rank', ylabel='CDF')3180\end{verbatim}31813182\begin{figure}3183% cumulative.py3184\centerline{\includegraphics[height=2.5in]{figs/cumulative_random.pdf}}3185\caption{CDF of percentile ranks for a random sample of birth weights.}3186\label{cumulative_random}3187\end{figure}31883189Figure~\ref{cumulative_random} shows the result. The CDF is3190approximately a straight line, which means that the distribution3191is uniform.31923193That outcome might be non-obvious, but it is a consequence of3194the way the CDF is defined. What this figure shows is that 10\%3195of the sample is below the 10th percentile, 20\% is below the319620th percentile, and so on, exactly as we should expect.31973198So, regardless of the shape of the CDF, the distribution of3199percentile ranks is uniform. This property is useful, because it3200is the basis of a simple and efficient algorithm for generating3201random numbers with a given CDF. Here's how:3202\index{inverse CDF algorithm}3203\index{random number}32043205\begin{itemize}32063207\item Choose a percentile rank uniformly from the range 0--100.32083209\item Use {\tt Cdf.Percentile} to find the value in the distribution3210that corresponds to the percentile rank you chose.3211\index{Cdf}32123213\end{itemize}32143215Cdf provides an implementation of this algorithm, called3216{\tt Random}:32173218\begin{verbatim}3219# class Cdf:3220def Random(self):3221return self.Percentile(random.uniform(0, 100))3222\end{verbatim}32233224Cdf also provides {\tt Sample}, which takes an integer,3225{\tt n}, and returns a list of {\tt n} values chosen at random3226from the Cdf.322732283229\section{Comparing percentile ranks}32303231Percentile ranks are useful for comparing measurements across3232different groups. For example, people who compete in foot races are3233usually grouped by age and gender. 
To compare people in different3234age groups, you can convert race times to percentile ranks.3235\index{percentile rank}32363237A few years ago I ran the James Joyce Ramble 10K in3238Dedham MA; I finished in 42:44, which was 97th in a field of 1633. I beat or3239tied 1537 runners out of 1633, so my percentile rank in the field is324094\%. \index{James Joyce Ramble} \index{race time}32413242More generally, given position and field size, we can compute3243percentile rank:3244\index{field size}32453246\begin{verbatim}3247def PositionToPercentile(position, field_size):3248beat = field_size - position + 13249percentile = 100.0 * beat / field_size3250return percentile3251\end{verbatim}32523253In my age group, denoted M4049 for ``male between 40 and 49 years of3254age'', I came in 26th out of 256. So my percentile rank in my age3255group was 90\%.3256\index{age group}32573258If I am still running in 10 years (and I hope I am), I will be in3259the M5059 division. Assuming that my percentile rank in my division3260is the same, how much slower should I expect to be?32613262I can answer that question by converting my percentile rank in M40493263to a position in M5059. Here's the code:32643265\begin{verbatim}3266def PercentileToPosition(percentile, field_size):3267beat = percentile * field_size / 100.03268position = field_size - beat + 13269return position3270\end{verbatim}32713272There were 171 people in M5059, so I would have to come in between327317th and 18th place to have the same percentile rank. The finishing3274time of the 17th runner in M5059 was 46:05, so that's the time I will3275have to beat to maintain my percentile rank.327632773278\section{Exercises}32793280For the following exercises, you can start with \verb"chap04ex.ipynb".3281My solution is in \verb"chap04soln.ipynb".32823283\begin{exercise}3284How much did you weigh at birth? If you don't know, call your mother3285or someone else who knows. Using the NSFG data (all live births),3286compute the distribution of birth weights and use it to find your3287percentile rank. If you were a first baby, find your percentile rank3288in the distribution for first babies. Otherwise use the distribution3289for others. If you are in the 90th percentile or higher, call your3290mother back and apologize.3291\index{birth weight}3292\index{weight!birth}32933294\end{exercise}32953296\begin{exercise}3297The numbers generated by {\tt random.random} are supposed to be3298uniform between 0 and 1; that is, every value in the range3299should have the same probability.33003301Generate 1000 numbers from {\tt random.random} and plot their3302PMF and CDF. Is the distribution uniform?3303\index{uniform distribution}3304\index{distribution!uniform}3305\index{random number}33063307\end{exercise}330833093310\section{Glossary}33113312\begin{itemize}33133314\item percentile rank: The percentage of values in a distribution that are3315less than or equal to a given value.3316\index{percentile rank}33173318\item percentile: The value associated with a given percentile rank.3319\index{percentile}33203321\item cumulative distribution function (CDF): A function that maps3322from values to their cumulative probabilities. $\CDF(x)$ is the3323fraction of the sample less than or equal to $x$. 
\index{CDF}3324\index{cumulative probability}33253326\item inverse CDF: A function that maps from a cumulative probability,3327$p$, to the corresponding value.3328\index{inverse CDF}3329\index{CDF, inverse}33303331\item median: The 50th percentile, often used as a measure of central3332tendency. \index{median}33333334\item interquartile range: The difference between3335the 75th and 25th percentiles, used as a measure of spread.3336\index{interquartile range}33373338\item quantile: A sequence of values that correspond to equally spaced3339percentile ranks; for example, the quartiles of a distribution are3340the 25th, 50th and 75th percentiles.3341\index{quantile}33423343\item replacement: A property of a sampling process. ``With replacement''3344means that the same value can be chosen more than once; ``without3345replacement'' means that once a value is chosen, it is removed from3346the population.3347\index{replacement}33483349\end{itemize}335033513352\chapter{Modeling distributions}3353\label{modeling}33543355The distributions we have used so far are called {\bf empirical3356distributions} because they are based on empirical observations,3357which are necessarily finite samples.3358\index{analytic distribution}3359\index{distribution!analytic}3360\index{empirical distribution}3361\index{distribution!empirical}33623363The alternative is an {\bf analytic distribution}, which is3364characterized by a CDF that is a mathematical function.3365Analytic distributions can be used to model empirical distributions.3366In this context, a {\bf model} is a simplification that leaves out3367unneeded details. This chapter presents common analytic distributions3368and uses them to model data from a variety of sources.3369\index{model}33703371The code for this chapter is in {\tt analytic.py}. For information3372about downloading and working with this code, see Section~\ref{code}.3373337433753376\section{The exponential distribution}3377\label{exponential}3378\index{exponential distribution}3379\index{distribution!exponential}33803381\begin{figure}3382% analytic.py3383\centerline{\includegraphics[height=2.5in]{figs/analytic_expo_cdf.pdf}}3384\caption{CDFs of exponential distributions with various parameters.}3385\label{analytic_expo_cdf}3386\end{figure}33873388I'll start with the {\bf exponential distribution} because it is3389relatively simple. 
The CDF of the exponential distribution is3390%3391\[ \CDF(x) = 1 - e^{-\lambda x} \]3392%3393The parameter, $\lambda$, determines the shape of the distribution.3394Figure~\ref{analytic_expo_cdf} shows what this CDF looks like with3395$\lambda = $ 0.5, 1, and 2.3396\index{parameter}33973398In the real world, exponential distributions3399come up when we look at a series of events and measure the3400times between events, called {\bf interarrival times}.3401If the events are equally likely to occur at any time, the distribution3402of interarrival times tends to look like an exponential distribution.3403\index{interarrival time}34043405As an example, we will look at the interarrival time of births.3406On December 18, 1997, 44 babies were born in a hospital in Brisbane,3407Australia.\footnote{This example is based on information and data from3408Dunn, ``A Simple Dataset for Demonstrating Common Distributions,''3409Journal of Statistics Education v.7, n.3 (1999).} The time of3410birth for all 44 babies was reported in the local paper; the3411complete dataset is in a file called {\tt babyboom.dat}, in the3412{\tt ThinkStats2} repository.3413\index{birth time}3414\index{Australia} \index{Brisbane}34153416\begin{verbatim}3417df = ReadBabyBoom()3418diffs = df.minutes.diff()3419cdf = thinkstats2.Cdf(diffs, label='actual')34203421thinkplot.Cdf(cdf)3422thinkplot.Show(xlabel='minutes', ylabel='CDF')3423\end{verbatim}34243425{\tt ReadBabyBoom} reads the data file and returns a DataFrame3426with columns {\tt time}, {\tt sex}, \verb"weight_g", and {\tt minutes},3427where {\tt minutes} is time of birth converted to minutes since3428midnight.3429\index{DataFrame}3430\index{thinkplot}34313432\begin{figure}3433% analytic.py3434\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals.pdf}}3435\caption{CDF of interarrival times (left) and CCDF on a log-y scale (right).}3436\label{analytic_interarrival_cdf}3437\end{figure}34383439%\begin{figure}3440% analytic.py3441%\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals_logy.pdf}}3442%\caption{CCDF of interarrival times.}3443%\label{analytic_interarrival_ccdf}3444%\end{figure}34453446{\tt diffs} is the difference between consecutive birth times, and3447{\tt cdf} is the distribution of these interarrival times.3448Figure~\ref{analytic_interarrival_cdf} (left) shows the CDF. It seems3449to have the general shape of an exponential distribution, but how can3450we tell?34513452One way is to plot the {\bf complementary CDF}, which is $1 - \CDF(x)$,3453on a log-y scale. For data from an exponential distribution, the3454result is a straight line. Let's see why that works.3455\index{complementary CDF} \index{CDF!complementary} \index{CCDF}34563457If you plot the complementary CDF (CCDF) of a dataset that you think is3458exponential, you expect to see a function like:3459%3460\[ y \approx e^{-\lambda x} \]3461%3462Taking the log of both sides yields:3463%3464\[ \log y \approx -\lambda x\]3465%3466So on a log-y scale the CCDF is a straight line3467with slope $-\lambda$. Here's how we can generate a plot like that:3468\index{logarithmic scale}3469\index{complementary CDF}3470\index{CDF!complementary}3471\index{CCDF}347234733474\begin{verbatim}3475thinkplot.Cdf(cdf, complement=True)3476thinkplot.Show(xlabel='minutes',3477ylabel='CCDF',3478yscale='log')3479\end{verbatim}34803481With the argument {\tt complement=True}, {\tt thinkplot.Cdf} computes3482the complementary CDF before plotting. 
And with {\tt yscale='log'},3483{\tt thinkplot.Show} sets the {\tt y} axis to a logarithmic scale.3484\index{thinkplot}3485\index{Cdf}34863487Figure~\ref{analytic_interarrival_cdf} (right) shows the result. It is not3488exactly straight, which indicates that the exponential distribution is3489not a perfect model for this data. Most likely the underlying3490assumption---that a birth is equally likely at any time of day---is3491not exactly true. Nevertheless, it might be reasonable to model this3492dataset with an exponential distribution. With that simplification, we can3493summarize the distribution with a single parameter.3494\index{model}34953496The parameter, $\lambda$, can be interpreted as a rate; that is, the3497number of events that occur, on average, in a unit of time. In this3498example, 44 babies are born in 24 hours, so the rate is $\lambda =34990.0306$ births per minute. The mean of an exponential distribution is3500$1/\lambda$, so the mean time between births is 32.7 minutes.350135023503\section{The normal distribution}3504\label{normal}35053506The {\bf normal distribution}, also called Gaussian, is commonly3507used because it describes many phenomena, at least approximately.3508It turns out that there is a good reason for its ubiquity, which we3509will get to in Section~\ref{CLT}.3510\index{CDF}3511\index{parameter}3512\index{mean}3513\index{standard deviation}3514\index{normal distribution}3515\index{distribution!normal}3516\index{Gaussian distribution}3517\index{distribution!Gaussian}35183519%3520%\[ \CDF(z) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^z e^{-t^2/2} dt \]3521%35223523\begin{figure}3524% analytic.py3525\centerline{\includegraphics[height=2.5in]{figs/analytic_gaussian_cdf.pdf}}3526\caption{CDF of normal distributions with a range of parameters.}3527\label{analytic_gaussian_cdf}3528\end{figure}35293530The normal distribution is characterized by two parameters: the mean,3531$\mu$, and standard deviation $\sigma$. The normal distribution with3532$\mu=0$ and $\sigma=1$ is called the {\bf standard normal3533distribution}. Its CDF is defined by an integral that does not have3534a closed form solution, but there are algorithms that evaluate it3535efficiently. One of them is provided by SciPy: {\tt scipy.stats.norm}3536is an object that represents a normal distribution; it provides a3537method, {\tt cdf}, that evaluates the standard normal CDF:3538\index{SciPy}3539\index{closed form}35403541\begin{verbatim}3542>>> import scipy.stats3543>>> scipy.stats.norm.cdf(0)35440.53545\end{verbatim}35463547This result is correct: the median of the standard normal distribution3548is 0 (the same as the mean), and half of the values fall below the3549median, so $\CDF(0)$ is 0.5.35503551{\tt norm.cdf} takes optional parameters: {\tt loc}, which3552specifies the mean, and {\tt scale}, which specifies the3553standard deviation.35543555{\tt thinkstats2} makes this function a little easier to use3556by providing {\tt EvalNormalCdf}, which takes parameters {\tt mu}3557and {\tt sigma} and evaluates the CDF at {\tt x}:3558\index{normal distribution}35593560\begin{verbatim}3561def EvalNormalCdf(x, mu=0, sigma=1):3562return scipy.stats.norm.cdf(x, loc=mu, scale=sigma)3563\end{verbatim}35643565Figure~\ref{analytic_gaussian_cdf} shows CDFs for normal3566distributions with a range of parameters. The sigmoid shape of these3567curves is a recognizable characteristic of a normal distribution.35683569In the previous chapter we looked at the distribution of birth3570weights in the NSFG. 
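Here is a rough sketch of how one might overlay the empirical CDF of
those weights with a normal model that has the same mean and standard
deviation.  This is not the code in {\tt analytic.py}; the plotting
range and labels are chosen by hand for illustration:

\begin{verbatim}
import numpy as np
import first
import thinkstats2
import thinkplot

live, firsts, others = first.MakeFrames()
weights = live.totalwgt_lb.dropna()
mu, sigma = weights.mean(), weights.std()

# empirical CDF of the data
cdf = thinkstats2.Cdf(weights, label='data')
thinkplot.Cdf(cdf)

# normal model with the same mean and standard deviation
xs = np.linspace(0, 12.5, 101)
ps = thinkstats2.EvalNormalCdf(xs, mu=mu, sigma=sigma)
thinkplot.Plot(xs, ps, label='model')

thinkplot.Show(xlabel='birth weight (lbs)', ylabel='CDF')
\end{verbatim}
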
Figure~\ref{analytic_birthwgt_model} shows the3571empirical CDF of weights for all live births and the CDF of3572a normal distribution with the same mean and variance.3573\index{National Survey of Family Growth}3574\index{NSFG}3575\index{birth weight}3576\index{weight!birth}35773578\begin{figure}3579% analytic.py3580\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_model.pdf}}3581\caption{CDF of birth weights with a normal model.}3582\label{analytic_birthwgt_model}3583\end{figure}35843585The normal distribution is a good model for this dataset, so3586if we summarize the distribution with the parameters3587$\mu = 7.28$ and $\sigma = 1.24$, the resulting error3588(difference between the model and the data) is small.3589\index{model}3590\index{percentile}35913592Below the 10th percentile there is a discrepancy between the data3593and the model; there are more light babies than we would expect in3594a normal distribution. If we are specifically interested in preterm3595babies, it would be important to get this part of the distribution3596right, so it might not be appropriate to use the normal3597model.359835993600\section{Normal probability plot}36013602For the exponential distribution, and a few others, there are3603simple transformations we can use to test whether an analytic3604distribution is a good model for a dataset.3605\index{exponential distribution}3606\index{distribution!exponential}3607\index{model}36083609For the normal distribution there is no such transformation, but there3610is an alternative called a {\bf normal probability plot}. There3611are two ways to generate a normal probability plot: the hard way3612and the easy way. If you are interested in the hard way, you can3613read about it at \url{https://en.wikipedia.org/wiki/Normal_probability_plot}.3614Here's the easy way:3615\index{normal probability plot}3616\index{plot!normal probability}3617\index{normal distribution}3618\index{distribution!normal}3619\index{Gaussian distribution}3620\index{distribution!Gaussian}36213622\begin{enumerate}36233624\item Sort the values in the sample.36253626\item From a standard normal distribution ($\mu=0$ and $\sigma=1$),3627generate a random sample with the same size as the sample, and sort it.3628\index{random number}36293630\item Plot the sorted values from the sample versus the random values.36313632\end{enumerate}36333634If the distribution of the sample is approximately normal, the result3635is a straight line with intercept {\tt mu} and slope {\tt sigma}.3636{\tt thinkstats2} provides {\tt NormalProbability}, which takes a3637sample and returns two NumPy arrays:3638\index{NumPy}36393640\begin{verbatim}3641xs, ys = thinkstats2.NormalProbability(sample)3642\end{verbatim}36433644\begin{figure}3645% analytic.py3646\centerline{\includegraphics[height=2.5in]{figs/analytic_normal_prob_example.pdf}}3647\caption{Normal probability plot for random samples from normal distributions.}3648\label{analytic_normal_prob_example}3649\end{figure}36503651{\tt ys} contains the sorted values from {\tt sample}; {\tt xs}3652contains the random values from the standard normal distribution.36533654To test {\tt NormalProbability} I generated some fake samples that3655were actually drawn from normal distributions with various parameters.3656Figure~\ref{analytic_normal_prob_example} shows the results.3657The lines are approximately straight, with values in the tails3658deviating more than values near the mean.36593660Now let's try it with real data. 
Here's code to generate3661a normal probability plot for the birth weight data from the3662previous section. It plots a gray line that represents the model3663and a blue line that represents the data.3664\index{birth weight}3665\index{weight!birth}36663667\begin{verbatim}3668def MakeNormalPlot(weights):3669mean = weights.mean()3670std = weights.std()36713672xs = [-4, 4]3673fxs, fys = thinkstats2.FitLine(xs, inter=mean, slope=std)3674thinkplot.Plot(fxs, fys, color='gray', label='model')36753676xs, ys = thinkstats2.NormalProbability(weights)3677thinkplot.Plot(xs, ys, label='birth weights')3678\end{verbatim}36793680{\tt weights} is a pandas Series of birth weights;3681{\tt mean} and {\tt std} are the mean and standard deviation.3682\index{pandas}3683\index{Series}3684\index{thinkplot}3685\index{standard deviation}36863687{\tt FitLine} takes a sequence of {\tt xs}, an intercept, and a3688slope; it returns {\tt xs} and {\tt ys} that represent a line3689with the given parameters, evaluated at the values in {\tt xs}.36903691{\tt NormalProbability} returns {\tt xs} and {\tt ys} that3692contain values from the standard normal distribution and values3693from {\tt weights}. If the distribution of weights is normal,3694the data should match the model.3695\index{model}36963697\begin{figure}3698% analytic.py3699\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_normal.pdf}}3700\caption{Normal probability plot of birth weights.}3701\label{analytic_birthwgt_normal}3702\end{figure}37033704Figure~\ref{analytic_birthwgt_normal} shows the results for3705all live births, and also for full term births (pregnancy length greater3706than 36 weeks). Both curves match the model near the mean and3707deviate in the tails. The heaviest babies are heavier than what3708the model expects, and the lightest babies are lighter.3709\index{pregnancy length}37103711When we select only full term births, we remove some of the lightest3712weights, which reduces the discrepancy in the lower tail of the3713distribution.37143715This plot suggests that the normal model describes the distribution3716well within a few standard deviations from the mean, but not in the3717tails. Whether it is good enough for practical purposes depends3718on the purposes.3719\index{model}3720\index{birth weight}3721\index{weight!birth}3722\index{standard deviation}372337243725\section{The lognormal distribution}3726\label{brfss}3727\label{lognormal}37283729If the logarithms of a set of values have a normal distribution, the3730values have a {\bf lognormal distribution}. The CDF of the lognormal3731distribution is the same as the CDF of the normal distribution,3732with $\log x$ substituted for $x$.3733%3734\[ CDF_{lognormal}(x) = CDF_{normal}(\log x)\]3735%3736The parameters of the lognormal distribution are usually denoted3737$\mu$ and $\sigma$. 
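SciPy can be used to spot-check the relationship above.  The following
sketch uses illustrative parameter values; note that
{\tt scipy.stats.lognorm} takes $\sigma$ as its shape parameter and
$e^{\mu}$ as its {\tt scale}:

\begin{verbatim}
import numpy as np
import scipy.stats

mu, sigma = 0.8, 0.5     # illustrative parameter values
x = 3.0

# CDF of the lognormal distribution, evaluated directly
p1 = scipy.stats.lognorm.cdf(x, sigma, scale=np.exp(mu))

# CDF of the normal distribution, evaluated at log(x)
p2 = scipy.stats.norm.cdf(np.log(x), loc=mu, scale=sigma)

print(p1, p2)    # the two values agree
\end{verbatim}
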
But remember that these parameters are {\em not\/}3738the mean and standard deviation; the mean of a lognormal distribution3739is $\exp(\mu +\sigma^2/2)$ and the standard deviation is3740ugly (see \url{http://wikipedia.org/wiki/Log-normal_distribution}).3741\index{parameter} \index{weight!adult} \index{adult weight}3742\index{lognormal distribution}3743\index{distribution!lognormal}3744\index{CDF}37453746\begin{figure}3747% brfss.py3748\centerline{3749\includegraphics[height=2.5in]{figs/brfss_weight.pdf}}3750\caption{CDF of adult weights on a linear scale (left) and3751log scale (right).}3752\label{brfss_weight}3753\end{figure}37543755If a sample is approximately lognormal and you plot its CDF on a3756log-x scale, it will have the characteristic shape of a normal3757distribution. To test how well the sample fits a lognormal model, you3758can make a normal probability plot using the log of the values3759in the sample.3760\index{normal probability plot}3761\index{model}37623763As an example, let's look at the distribution of adult weights, which3764is approximately lognormal.\footnote{I was tipped off to this3765possibility by a comment (without citation) at3766\url{http://mathworld.wolfram.com/LogNormalDistribution.html}.3767Subsequently I found a paper that proposes the log transform and3768suggests a cause: Penman and Johnson, ``The Changing Shape of the3769Body Mass Index Distribution Curve in the Population,'' Preventing3770Chronic Disease, 2006 July; 3(3): A74. Online at3771\url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1636707}.}37723773The National Center for Chronic Disease3774Prevention and Health Promotion conducts an annual survey as part of3775the Behavioral Risk Factor Surveillance System3776(BRFSS).\footnote{Centers for Disease Control and Prevention3777(CDC). Behavioral Risk Factor Surveillance System Survey3778Data. Atlanta, Georgia: U.S. Department of Health and Human3779Services, Centers for Disease Control and Prevention, 2008.} In37802008, they interviewed 414,509 respondents and asked about their3781demographics, health, and health risks.3782Among the data they collected are the weights in kilograms of3783398,484 respondents.3784\index{Behavioral Risk Factor Surveillance System}3785\index{BRFSS}37863787The repository for this book contains {\tt CDBRFS08.ASC.gz},3788a fixed-width ASCII file that contains data from the BRFSS,3789and {\tt brfss.py}, which reads the file and analyzes the data.37903791\begin{figure}3792% brfss.py3793\centerline{3794\includegraphics[height=2.5in]{figs/brfss_weight_normal.pdf}}3795\caption{Normal probability plots for adult weight on a linear scale3796(left) and log scale (right).}3797\label{brfss_weight_normal}3798\end{figure}37993800Figure~\ref{brfss_weight} (left) shows the distribution of adult3801weights on a linear scale with a normal model.3802Figure~\ref{brfss_weight} (right) shows the same distribution on a log3803scale with a lognormal model. The lognormal model is a better fit,3804but this representation of the data does not make the difference3805particularly dramatic. \index{respondent} \index{model}38063807Figure~\ref{brfss_weight_normal} shows normal probability plots for3808adult weights, $w$, and for their logarithms, $\log_{10} w$. 
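A minimal sketch of how the log-transformed plot might be generated,
using the BRFSS data described above (this is not the code in
{\tt brfss.py}, and the axis labels are only illustrative):

\begin{verbatim}
import numpy as np
import brfss
import thinkstats2
import thinkplot

df = brfss.ReadBrfss(nrows=None)
weights = df.wtkg2.dropna()

# normal probability plot of log10(weight)
xs, ys = thinkstats2.NormalProbability(np.log10(weights))
thinkplot.Plot(xs, ys, label='log10 adult weight')
thinkplot.Show(xlabel='standard normal sample',
               ylabel='log10 weight (kg)')
\end{verbatim}
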
Now it3809is apparent that the data deviate substantially from the normal model.3810On the other hand, the lognormal model is a good match for the data.3811\index{normal distribution} \index{distribution!normal}3812\index{Gaussian distribution} \index{distribution!Gaussian}3813\index{lognormal distribution} \index{distribution!lognormal}3814\index{standard deviation} \index{adult weight} \index{weight!adult}3815\index{model} \index{normal probability plot}381638173818\section{The Pareto distribution}3819\index{Pareto distribution}3820\index{distribution!Pareto}3821\index{Pareto, Vilfredo}38223823The {\bf Pareto distribution} is named after the economist Vilfredo Pareto,3824who used it to describe the distribution of wealth (see3825\url{http://wikipedia.org/wiki/Pareto_distribution}). Since then, it3826has been used to describe phenomena in the natural and social sciences3827including sizes of cities and towns, sand particles and meteorites,3828forest fires and earthquakes. \index{CDF}38293830The CDF of the Pareto distribution is:3831%3832\[ CDF(x) = 1 - \left( \frac{x}{x_m} \right) ^{-\alpha} \]3833%3834The parameters $x_{m}$ and $\alpha$ determine the location and shape3835of the distribution. $x_{m}$ is the minimum possible value.3836Figure~\ref{analytic_pareto_cdf} shows CDFs of Pareto3837distributions with $x_{m} = 0.5$ and different values3838of $\alpha$.3839\index{parameter}38403841\begin{figure}3842% analytic.py3843\centerline{\includegraphics[height=2.5in]{figs/analytic_pareto_cdf.pdf}}3844\caption{CDFs of Pareto distributions with different parameters.}3845\label{analytic_pareto_cdf}3846\end{figure}38473848There is a simple visual test that indicates whether an empirical3849distribution fits a Pareto distribution: on a log-log scale, the CCDF3850looks like a straight line. Let's see why that works.38513852If you plot the CCDF of a sample from a Pareto distribution on a3853linear scale, you expect to see a function like:3854%3855\[ y \approx \left( \frac{x}{x_m} \right) ^{-\alpha} \]3856%3857Taking the log of both sides yields:3858%3859\[ \log y \approx -\alpha (\log x - \log x_{m})\]3860%3861So if you plot $\log y$ versus $\log x$, it should look like a straight3862line with slope $-\alpha$ and intercept3863$\alpha \log x_{m}$.38643865As an example, let's look at the sizes of cities and towns.3866The U.S.~Census Bureau publishes the3867population of every incorporated city and town in the United States.3868\index{Pareto distribution} \index{distribution!Pareto}3869\index{U.S.~Census Bureau} \index{population} \index{city size}38703871\begin{figure}3872% populations.py3873\centerline{\includegraphics[height=2.5in]{figs/populations_pareto.pdf}}3874\caption{CCDFs of city and town populations, on a log-log scale.}3875\label{populations_pareto}3876\end{figure}38773878I downloaded their data from3879\url{http://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012-3.html};3880it is in the repository for this book in a file named3881\verb"PEP_2012_PEPANNRES_with_ann.csv". The repository also3882contains {\tt populations.py}, which reads the file and plots3883the distribution of populations.38843885Figure~\ref{populations_pareto} shows the CCDF of populations on a3886log-log scale. The largest 1\% of cities and towns, below $10^{-2}$,3887fall along a straight line. 
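A minimal sketch of how a plot like this might be generated; {\tt pops}
is a hypothetical name for the sequence of populations (this is not the
code in {\tt populations.py}, and I am assuming {\tt thinkplot.Show}
accepts {\tt xscale} the same way it accepts {\tt yscale}):

\begin{verbatim}
import thinkstats2
import thinkplot

# pops: a sequence of city/town populations (hypothetical name)
cdf = thinkstats2.Cdf(pops, label='data')

# CCDF on a log-log scale; a Pareto tail appears as a straight line
thinkplot.Cdf(cdf, complement=True)
thinkplot.Show(xlabel='population',
               ylabel='CCDF',
               xscale='log',
               yscale='log')
\end{verbatim}
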
So we could3888conclude, as some researchers have, that the tail of this distribution3889fits a Pareto model.3890\index{model}38913892On the other hand, a lognormal distribution also models the data well.3893Figure~\ref{populations_normal} shows the CDF of populations and a3894lognormal model (left), and a normal probability plot (right). Both3895plots show good agreement between the data and the model.3896\index{normal probability plot}38973898Neither model is perfect.3899The Pareto model only applies to the largest 1\% of cities, but it3900is a better fit for that part of the distribution. The lognormal3901model is a better fit for the other 99\%.3902Which model is appropriate depends on which part of the distribution3903is relevant.39043905\begin{figure}3906% populations.py3907\centerline{\includegraphics[height=2.5in]{figs/populations_normal.pdf}}3908\caption{CDF of city and town populations on a log-x scale (left), and3909normal probability plot of log-transformed populations (right).}3910\label{populations_normal}3911\end{figure}391239133914\section{Generating random numbers}3915\index{exponential distribution}3916\index{distribution!exponential}3917\index{random number}3918\index{CDF}3919\index{inverse CDF algorithm}3920\index{uniform distribution}3921\index{distribution!uniform}39223923Analytic CDFs can be used to generate random numbers with a given3924distribution function, $p = \CDF(x)$. If there is an efficient way to3925compute the inverse CDF, we can generate random values3926with the appropriate distribution by choosing $p$ from a uniform3927distribution between 0 and 1, then choosing3928$x = ICDF(p)$.3929\index{inverse CDF}3930\index{CDF, inverse}39313932For example, the CDF of the exponential distribution is3933%3934\[ p = 1 - e^{-\lambda x} \]3935%3936Solving for $x$ yields:3937%3938\[ x = -\log (1 - p) / \lambda \]3939%3940So in Python we can write3941%3942\begin{verbatim}3943def expovariate(lam):3944p = random.random()3945x = -math.log(1-p) / lam3946return x3947\end{verbatim}39483949{\tt expovariate} takes {\tt lam} and returns a random value chosen3950from the exponential distribution with parameter {\tt lam}.39513952Two notes about this implementation:3953I called the parameter \verb"lam" because \verb"lambda" is a Python3954keyword. Also, since $\log 0$ is undefined, we have to3955be a little careful. The implementation of {\tt random.random}3956can return 0 but not 1, so $1 - p$ can be 1 but not 0, so3957{\tt log(1-p)} is always defined. \index{random module}395839593960\section{Why model?}3961\index{model}39623963At the beginning of this chapter, I said that many real world phenomena3964can be modeled with analytic distributions. ``So,'' you might ask,3965``what?'' \index{abstraction}39663967Like all models, analytic distributions are abstractions, which3968means they leave out details that are considered irrelevant.3969For example, an observed distribution might have measurement errors3970or quirks that are specific to the sample; analytic models smooth3971out these idiosyncrasies.3972\index{smoothing}39733974Analytic models are also a form of data compression. When a model3975fits a dataset well, a small set of parameters can summarize a3976large amount of data.3977\index{parameter}3978\index{compression}39793980It is sometimes surprising when data from a natural phenomenon fit an3981analytic distribution, but these observations can provide insight3982into physical systems. Sometimes we can explain why an observed3983distribution has a particular form. 
For example, Pareto distributions3984are often the result of generative processes with positive feedback3985(so-called preferential attachment processes: see3986\url{http://wikipedia.org/wiki/Preferential_attachment}.).3987\index{preferential attachment}3988\index{generative process}3989\index{Pareto distribution}3990\index{distribution!Pareto}3991\index{analysis}39923993Also, analytic distributions lend themselves to mathematical3994analysis, as we will see in Chapter~\ref{analysis}.39953996But it is important to remember that all models are imperfect.3997Data from the real world never fit an analytic distribution perfectly.3998People sometimes talk as if data are generated by models; for example,3999they might say that the distribution of human heights is normal,4000or the distribution of income is lognormal. Taken literally, these4001claims cannot be true; there are always differences between the4002real world and mathematical models.40034004Models are useful if they capture the relevant aspects of the4005real world and leave out unneeded details. But what is ``relevant''4006or ``unneeded'' depends on what you are planning to use the model4007for.400840094010\section{Exercises}40114012For the following exercises, you can start with \verb"chap05ex.ipynb".4013My solution is in \verb"chap05soln.ipynb".40144015\begin{exercise}4016In the BRFSS (see Section~\ref{lognormal}), the distribution of4017heights is roughly normal with parameters $\mu = 178$ cm and4018$\sigma = 7.7$ cm for men, and $\mu = 163$ cm and $\sigma = 7.3$ cm for4019women.4020\index{normal distribution}4021\index{distribution!normal}4022\index{Gaussian distribution}4023\index{distribution!Gaussian}4024\index{height}4025\index{Blue Man Group}4026\index{Group, Blue Man}40274028In order to join Blue Man Group, you have to be male between 5'10''4029and 6'1'' (see \url{http://bluemancasting.com}). What percentage of4030the U.S. male population is in this range? Hint: use {\tt4031scipy.stats.norm.cdf}.4032\index{SciPy}40334034\end{exercise}403540364037\begin{exercise}4038To get a feel for the Pareto distribution, let's see how different4039the world4040would be if the distribution of human height were Pareto.4041With the parameters $x_{m} = 1$ m and $\alpha = 1.7$, we4042get a distribution with a reasonable minimum, 1 m,4043and median, 1.5 m.4044\index{height}4045\index{Pareto distribution}4046\index{distribution!Pareto}40474048Plot this distribution. What is the mean human height in Pareto4049world? What fraction of the population is shorter than the mean? If4050there are 7 billion people in Pareto world, how many do we expect to4051be taller than 1 km? How tall do we expect the tallest person to be?4052\index{Pareto World}40534054\end{exercise}405540564057\begin{exercise}4058\label{weibull}40594060The Weibull distribution is a generalization of the exponential4061distribution that comes up in failure analysis4062(see \url{http://wikipedia.org/wiki/Weibull_distribution}). Its CDF is4063%4064\[ CDF(x) = 1 - e^{-(x / \lambda)^k} \]4065%4066Can you find a transformation that makes a Weibull distribution look4067like a straight line? 
What do the slope and intercept of the4068line indicate?4069\index{Weibull distribution}4070\index{distribution!Weibull}4071\index{exponential distribution}4072\index{distribution!exponential}4073\index{random module}40744075Use {\tt random.weibullvariate} to generate a sample from a4076Weibull distribution and use it to test your transformation.40774078\end{exercise}407940804081\begin{exercise}4082For small values of $n$, we don't expect an empirical distribution4083to fit an analytic distribution exactly. One way to evaluate4084the quality of fit is to generate a sample from an analytic4085distribution and see how well it matches the data.4086\index{empirical distribution}4087\index{distribution!empirical}4088\index{random module}40894090For example, in Section~\ref{exponential} we plotted the distribution4091of time between births and saw that it is approximately exponential.4092But the distribution is based on only 44 data points. To see whether4093the data might have come from an exponential distribution, generate 444094values from an exponential distribution with the same mean as the4095data, about 33 minutes between births.40964097Plot the distribution of the random values and compare it to the4098actual distribution. You can use {\tt random.expovariate}4099to generate the values.41004101\end{exercise}41024103\begin{exercise}4104In the repository for this book, you'll find a set of data files4105called {\tt mystery0.dat}, {\tt mystery1.dat}, and so on. Each4106contains a sequence of random numbers generated from an analytic4107distribution.4108\index{random number}41094110You will also find \verb"test_models.py", a script that reads4111data from a file and plots the CDF under a variety of transforms.4112You can run it like this:41134114\begin{verbatim}4115$ python test_models.py mystery0.dat4116\end{verbatim}41174118Based on these plots, you should be able to infer what kind of4119distribution generated each file. If you are stumped, you can4120look in {\tt mystery.py}, which contains the code that generated4121the files.41224123\end{exercise}412441254126\begin{exercise}4127\label{income}41284129The distributions of wealth and income are sometimes modeled using4130lognormal and Pareto distributions. To see which is better, let's4131look at some data.4132\index{Pareto distribution}4133\index{distribution!Pareto}4134\index{lognormal distribution}4135\index{distribution!lognormal}41364137The Current Population Survey (CPS) is a joint effort of the Bureau4138of Labor Statistics and the Census Bureau to study income and related4139variables. Data collected in 2013 is available from4140\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.4141I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with4142information about household income, and converted it to {\tt hinc06.csv},4143a CSV file you will find in the repository for this book. You4144will also find {\tt hinc.py}, which reads this file.41454146Extract the distribution of incomes from this dataset. Are any of the4147analytic distributions in this chapter a good model of the data? 
A4148solution to this exercise is in {\tt hinc_soln.py}.4149\index{model}41504151\end{exercise}41524153415441554156\section{Glossary}41574158\begin{itemize}41594160\item empirical distribution: The distribution of values in a sample.4161\index{empirical distribution} \index{distribution!empirical}41624163\item analytic distribution: A distribution whose CDF is an analytic4164function.4165\index{analytic distribution}4166\index{distribution!analytic}41674168\item model: A useful simplification. Analytic distributions are4169often good models of more complex empirical distributions.4170\index{model}41714172\item interarrival time: The elapsed time between two events.4173\index{interarrival time}41744175\item complementary CDF: A function that maps from a value, $x$,4176to the fraction of values that exceed $x$, which is $1 - \CDF(x)$.4177\index{complementary CDF} \index{CDF!complementary} \index{CCDF}41784179\item standard normal distribution: The normal distribution with4180mean 0 and standard deviation 1.4181\index{standard normal distribution}41824183\item normal probability plot: A plot of the values in a sample versus4184random values from a standard normal distribution.4185\index{normal probability plot}4186\index{plot!normal probability}41874188\end{itemize}418941904191\chapter{Probability density functions}4192\label{density}4193\index{PDF}4194\index{probability density function}4195\index{exponential distribution}4196\index{distribution!exponential}4197\index{normal distribution}4198\index{distribution!normal}4199\index{Gaussian distribution}4200\index{distribution!Gaussian}4201\index{CDF}4202\index{derivative}42034204The code for this chapter is in {\tt density.py}. For information4205about downloading and working with this code, see Section~\ref{code}.420642074208\section{PDFs}42094210The derivative of a CDF is called a {\bf probability density function},4211or PDF. For example, the PDF of an exponential distribution is4212%4213\[ \PDF_{expo}(x) = \lambda e^{-\lambda x} \]4214%4215The PDF of a normal distribution is4216%4217\[ \PDF_{normal}(x) = \frac{1}{\sigma \sqrt{2 \pi}}4218\exp \left[ -\frac{1}{2}4219\left( \frac{x - \mu}{\sigma} \right)^2 \right] \]4220%4221Evaluating a PDF for a particular value of $x$ is usually not useful.4222The result is not a probability; it is a probability {\em density}.4223\index{density}4224\index{mass}42254226In physics, density is mass per unit of4227volume; in order to get a mass, you have to multiply by volume or,4228if the density is not constant, you have to integrate over volume.42294230Similarly, {\bf probability density} measures probability per unit of $x$.4231In order to get a probability mass, you have to integrate over $x$.42324233{\tt thinkstats2} provides a class called Pdf that represents4234a probability density function. 
Every Pdf object provides the4235following methods:42364237\begin{itemize}42384239\item {\tt Density}, which takes a value, {\tt x}, and returns the4240density of the distribution at {\tt x}.42414242\item {\tt Render}, which evaluates the density at a discrete set of4243values and returns a pair of sequences: the sorted values, {\tt xs},4244and their probability densities, {\tt ds}.42454246\item {\tt MakePmf}, which evaluates {\tt Density}4247at a discrete set of values and returns a normalized Pmf that4248approximates the Pdf.4249\index{Pmf}42504251\item {\tt GetLinspace}, which returns the default set of points used4252by {\tt Render} and {\tt MakePmf}.42534254\end{itemize}42554256Pdf is an abstract parent class, which means you should not4257instantiate it; that is, you cannot create a Pdf object. Instead, you4258should define a child class that inherits from Pdf and provides4259definitions of {\tt Density} and {\tt GetLinspace}. Pdf provides4260{\tt Render} and {\tt MakePmf}.42614262For example, {\tt thinkstats2} provides a class named {\tt4263NormalPdf} that evaluates the normal density function.42644265\begin{verbatim}4266class NormalPdf(Pdf):42674268def __init__(self, mu=0, sigma=1, label=''):4269self.mu = mu4270self.sigma = sigma4271self.label = label42724273def Density(self, xs):4274return scipy.stats.norm.pdf(xs, self.mu, self.sigma)42754276def GetLinspace(self):4277low, high = self.mu-3*self.sigma, self.mu+3*self.sigma4278return np.linspace(low, high, 101)4279\end{verbatim}42804281The NormalPdf object contains the parameters {\tt mu} and4282{\tt sigma}. {\tt Density} uses4283{\tt scipy.stats.norm}, which is an object that represents a normal4284distribution and provides {\tt cdf} and {\tt pdf}, among other4285methods (see Section~\ref{normal}).4286\index{SciPy}42874288The following example creates a NormalPdf with the mean and variance4289of adult female heights, in cm, from the BRFSS (see4290Section~\ref{brfss}). Then it computes the density of the4291distribution at a location one standard deviation from the mean.4292\index{standard deviation}42934294\begin{verbatim}4295>>> mean, var = 163, 52.84296>>> std = math.sqrt(var)4297>>> pdf = thinkstats2.NormalPdf(mean, std)4298>>> pdf.Density(mean + std)42990.03330014300\end{verbatim}43014302The result is about 0.03, in units of probability mass per cm.4303Again, a probability density doesn't mean much by itself. But if4304we plot the Pdf, we can see the shape of the distribution:43054306\begin{verbatim}4307>>> thinkplot.Pdf(pdf, label='normal')4308>>> thinkplot.Show()4309\end{verbatim}43104311{\tt thinkplot.Pdf} plots the Pdf as a smooth function,4312as contrasted with {\tt thinkplot.Pmf}, which renders a Pmf as a4313step function. Figure~\ref{pdf_example} shows the result, as well4314as a PDF estimated from a sample, which we'll compute in the next4315section.4316\index{thinkplot}43174318You can use {\tt MakePmf} to approximate the Pdf:43194320\begin{verbatim}4321>>> pmf = pdf.MakePmf()4322\end{verbatim}43234324By default, the resulting Pmf contains 101 points equally spaced from4325{\tt mu - 3*sigma} to {\tt mu + 3*sigma}. 
Optionally, {\tt MakePmf}4326and {\tt Render} can take keyword arguments {\tt low}, {\tt high},4327and {\tt n}.43284329\begin{figure}4330% pdf_example.py4331\centerline{\includegraphics[height=2.2in]{figs/pdf_example.pdf}}4332\caption{A normal PDF that models adult female height in the U.S.,4333and the kernel density estimate of a sample with $n=500$.}4334\label{pdf_example}4335\end{figure}433643374338\section{Kernel density estimation}43394340{\bf Kernel density estimation} (KDE) is an algorithm that takes4341a sample and finds an appropriately smooth PDF that fits4342the data. You can read details at4343\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.4344\index{KDE}4345\index{kernel density estimation}43464347{\tt scipy} provides an implementation of KDE and {\tt thinkstats2}4348provides a class called {\tt EstimatedPdf} that uses it:4349\index{SciPy}4350\index{NumPy}43514352\begin{verbatim}4353class EstimatedPdf(Pdf):43544355def __init__(self, sample):4356self.kde = scipy.stats.gaussian_kde(sample)43574358def Density(self, xs):4359return self.kde.evaluate(xs)4360\end{verbatim}43614362\verb"__init__" takes a sample4363and computes a kernel density estimate. The result is a4364\verb"gaussian_kde" object that provides an {\tt evaluate}4365method.43664367{\tt Density} takes a value or sequence, calls4368\verb"gaussian_kde.evaluate", and returns the resulting density. The4369word ``Gaussian'' appears in the name because it uses a filter based4370on a Gaussian distribution to smooth the KDE. \index{density}43714372Here's an example that generates a sample from a normal4373distribution and then makes an EstimatedPdf to fit it:4374\index{NumPy}4375\index{EstimatedPdf}43764377\begin{verbatim}4378>>> sample = [random.gauss(mean, std) for i in range(500)]4379>>> sample_pdf = thinkstats2.EstimatedPdf(sample)4380>>> thinkplot.Pdf(sample_pdf, label='sample KDE')4381\end{verbatim}43824383\verb"sample" is a list of 500 random heights.4384\verb"sample_pdf" is a Pdf object that contains the estimated4385KDE of the sample.4386\index{thinkplot}4387\index{Pmf}43884389Figure~\ref{pdf_example} shows the normal density function and a KDE4390based on a sample of 500 random heights. The estimate is a good4391match for the original distribution.43924393Estimating a density function with KDE is useful for several purposes:43944395\begin{itemize}43964397\item {\it Visualization:\/} During the exploration phase of a project, CDFs4398are usually the best visualization of a distribution. After you4399look at a CDF, you can decide whether an estimated PDF is an4400appropriate model of the distribution. If so, it can be a better4401choice for presenting the distribution to an audience that is4402unfamiliar with CDFs.4403\index{visualization}4404\index{model}44054406\item {\it Interpolation:\/} An estimated PDF is a way to get from a sample4407to a model of the population. If you have reason to believe that4408the population distribution is smooth, you can use KDE to interpolate4409the density for values that don't appear in the sample.4410\index{interpolation}44114412\item {\it Simulation:\/} Simulations are often based on the distribution4413of a sample. 
If the sample size is small, it4414might be appropriate to smooth the sample distribution using KDE,4415which allows the simulation to explore more possible outcomes,4416rather than replicating the observed data.4417\index{simulation}44184419\end{itemize}442044214422\section{The distribution framework}4423\index{distribution framework}44244425\begin{figure}4426\centerline{\includegraphics[height=2.2in]{figs/distribution_functions.pdf}}4427\caption{A framework that relates representations of distribution4428functions.}4429\label{dist_framework}4430\end{figure}44314432At this point we have seen PMFs, CDFs and PDFs; let's take a minute4433to review. Figure~\ref{dist_framework} shows how these functions relate4434to each other.4435\index{Pmf}4436\index{Cdf}4437\index{Pdf}44384439We started with PMFs, which represent the probabilities for a discrete4440set of values. To get from a PMF to a CDF, you add up the probability4441masses to get cumulative probabilities.4442To get from a CDF back to a PMF, you compute differences in cumulative4443probabilities. We'll see the implementation of these operations4444in the next few sections.4445\index{cumulative probability}44464447A PDF is the derivative of a continuous CDF; or, equivalently,4448a CDF is the integral of a PDF. Remember that a PDF maps from4449values to probability densities; to get a probability, you have to4450integrate.4451\index{discrete distribution}4452\index{continuous distribution}4453\index{smoothing}44544455To get from a discrete to a continuous distribution, you can perform4456various kinds of smoothing. One form of smoothing is to assume that4457the data come from an analytic continuous distribution4458(like exponential or normal) and to estimate the parameters of that4459distribution. Another option is kernel density estimation.4460\index{exponential distribution}4461\index{distribution!exponential}4462\index{normal distribution}4463\index{distribution!normal}4464\index{Gaussian distribution}4465\index{distribution!Gaussian}44664467The opposite of smoothing is {\bf discretizing}, or quantizing. If you4468evaluate a PDF at discrete points, you can generate a PMF that is an4469approximation of the PDF. You can get a better approximation using4470numerical integration. \index{discretize}4471\index{quantize}4472\index{binning}44734474To distinguish between continuous and discrete CDFs, it might be4475better for a discrete CDF to be a ``cumulative mass function,'' but as4476far as I can tell no one uses that term. \index{CDF}4477447844794480\section{Hist implementation}44814482At this point you should know how to use the basic types provided4483by {\tt thinkstats2}: Hist, Pmf, Cdf, and Pdf. The next few sections4484provide details about how they are implemented. This material4485might help you use these classes more effectively, but it is not4486strictly necessary.4487\index{Hist}44884489Hist and Pmf inherit from a parent class called \verb"_DictWrapper".4490The leading underscore indicates that this class is ``internal;'' that4491is, it should not be used by code in other modules. The name4492indicates what it is: a dictionary wrapper. Its primary attribute is4493{\tt d}, the dictionary that maps from values to their frequencies.4494\index{DictWrapper}4495\index{internal class}4496\index{wrapper}44974498The values can be any hashable type. 
The frequencies should be integers,
but can be any numeric type.
\index{hashable}

\verb"_DictWrapper" contains methods appropriate for both
Hist and Pmf, including \verb"__init__", {\tt Values},
{\tt Items} and {\tt Render}.  It also provides modifier
methods {\tt Set}, {\tt Incr}, {\tt Mult}, and {\tt Remove}.  These
methods are all implemented with dictionary operations.  For example:
\index{dictionary}

\begin{verbatim}
# class _DictWrapper

    def Incr(self, x, term=1):
        self.d[x] = self.d.get(x, 0) + term

    def Mult(self, x, factor):
        self.d[x] = self.d.get(x, 0) * factor

    def Remove(self, x):
        del self.d[x]
\end{verbatim}

Hist also provides {\tt Freq}, which looks up the frequency
of a given value.
\index{frequency}

Because Hist operators and methods are based on dictionaries,
these methods are constant time operations;
that is, their run time does not increase as the Hist gets bigger.
\index{Hist}


\section{Pmf implementation}

Pmf and Hist are almost the same thing, except that a Pmf
maps values to floating-point probabilities, rather than integer
frequencies.  If the sum of the probabilities is 1, the Pmf is normalized.
\index{Pmf}

Pmf provides {\tt Normalize}, which computes the sum of the
probabilities and divides through by a factor:

\begin{verbatim}
# class Pmf

    def Normalize(self, fraction=1.0):
        total = self.Total()
        if total == 0.0:
            raise ValueError('Total probability is zero.')

        factor = float(fraction) / total
        for x in self.d:
            self.d[x] *= factor

        return total
\end{verbatim}

{\tt fraction} determines the sum of the probabilities after
normalizing; the default value is 1.  If the total probability is 0,
the Pmf cannot be normalized, so {\tt Normalize} raises
{\tt ValueError}.

Hist and Pmf have the same constructor.  It can take
as an argument a {\tt dict}, Hist, Pmf or Cdf, a pandas
Series, a list of (value, frequency) pairs, or a sequence of values.
\index{Hist}

If you instantiate a Pmf, the result is normalized.  If you
instantiate a Hist, it is not.  To construct an unnormalized Pmf,
you can create an empty Pmf and modify it.  The Pmf modifiers do
not renormalize the Pmf.


\section{Cdf implementation}

A CDF maps from values to cumulative probabilities, so I could have
implemented Cdf as a \verb"_DictWrapper".  But the values in a CDF are
ordered and the values in a \verb"_DictWrapper" are not.  Also, it is
often useful to compute the inverse CDF; that is, the map from
cumulative probability to value.  So the implementation I chose is two
sorted lists.  That way I can use binary search to do a forward or
inverse lookup in logarithmic time.
\index{Cdf}
\index{binary search}
\index{cumulative probability}
\index{DictWrapper}
\index{inverse CDF}
\index{CDF, inverse}

The Cdf constructor can take as a parameter a sequence of values
or a pandas Series, a dictionary that maps from values to
probabilities, a sequence of (value, probability) pairs, a Hist, Pmf,
or Cdf.  Or if it is given two parameters, it treats them as a sorted
sequence of values and the sequence of corresponding cumulative
probabilities.

Given a sequence, pandas Series, or dictionary, the constructor makes
a Hist.
Then it uses the Hist to initialize the attributes:

\begin{verbatim}
        self.xs, freqs = zip(*sorted(dw.Items()))
        self.ps = np.cumsum(freqs, dtype=float)
        self.ps /= self.ps[-1]
\end{verbatim}

{\tt xs} is the sorted list of values; {\tt freqs} is the list
of corresponding frequencies.  {\tt np.cumsum} computes
the cumulative sum of the frequencies.  Dividing through by the
total frequency yields cumulative probabilities.
For {\tt n} values, the time to construct the
Cdf is proportional to $n \log n$.
\index{frequency}

Here is the implementation of {\tt Prob}, which takes a value
and returns its cumulative probability:

\begin{verbatim}
# class Cdf
    def Prob(self, x):
        if x < self.xs[0]:
            return 0.0
        index = bisect.bisect(self.xs, x)
        p = self.ps[index - 1]
        return p
\end{verbatim}

The {\tt bisect} module provides an implementation of binary search.
And here is the implementation of {\tt Value}, which takes a
cumulative probability and returns the corresponding value:

\begin{verbatim}
# class Cdf
    def Value(self, p):
        if p < 0 or p > 1:
            raise ValueError('p must be in range [0, 1]')

        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]
\end{verbatim}

Given a Cdf, we can compute the Pmf by computing differences between
consecutive cumulative probabilities.  If you call the Cdf constructor
and pass a Pmf, it computes differences by calling {\tt Cdf.Items}:
\index{Pmf}
\index{Cdf}

\begin{verbatim}
# class Cdf
    def Items(self):
        a = self.ps
        b = np.roll(a, 1)
        b[0] = 0
        return zip(self.xs, a-b)
\end{verbatim}

{\tt np.roll} shifts the elements of {\tt a} to the right, and ``rolls''
the last one back to the beginning.  We replace the first element of
{\tt b} with 0 and then compute the difference {\tt a-b}.  The result
is a NumPy array of probabilities.
\index{NumPy}

Cdf provides {\tt Shift} and {\tt Scale}, which modify the
values in the Cdf, but the probabilities should be treated as
immutable.


\section{Moments}
\index{moment}

Any time you take a sample and reduce it to a single number, that
number is a statistic.  The statistics we have seen so far include
mean, variance, median, and interquartile range.

A {\bf raw moment} is a kind of statistic.  If you have a sample of
values, $x_i$, the $k$th raw moment is:
%
\[ m'_k = \frac{1}{n} \sum_i x_i^k \]
%
Or if you prefer Python notation:

\begin{verbatim}
def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)
\end{verbatim}

When $k=1$ the result is the sample mean, $\xbar$.  The other
raw moments don't mean much by themselves, but they are used
in some computations.

The {\bf central moments} are more useful.  The
$k$th central moment is:
%
\[ m_k = \frac{1}{n} \sum_i (x_i - \xbar)^k \]
%
Or in Python:

\begin{verbatim}
def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)
\end{verbatim}

When $k=2$ the result is the second central moment, which you might
recognize as variance.  The definition of variance gives a hint about
why these statistics are called moments.  If we attach a weight along a
ruler at each location, $x_i$, and then spin the ruler around
the mean, the moment of inertia of the spinning weights is the variance
of the values.
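
As a quick numerical check (a sketch, not code from {\tt thinkstats2}),
the second central moment of a small sample should agree with the
population variance that NumPy computes with its default {\tt ddof=0}:

\begin{verbatim}
import numpy as np

xs = [1.0, 2.0, 2.0, 3.0, 5.0]

# second central moment, using CentralMoment as defined above
m2 = CentralMoment(xs, 2)

# population variance (NumPy's default is ddof=0)
var = np.var(xs)

print(m2, var)    # both are 1.84
\end{verbatim}
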
If you are not familiar with moment of inertia, see4708\url{http://en.wikipedia.org/wiki/Moment_of_inertia}. \index{moment4709of inertia}47104711When you report moment-based statistics, it is important to think4712about the units. For example, if the values $x_i$ are in cm, the4713first raw moment is also in cm. But the second moment is in4714cm$^2$, the third moment is in cm$^3$, and so on.47154716Because of these units, moments are hard to interpret by themselves.4717That's why, for the second moment, it is common to report standard4718deviation, which is the square root of variance, so it is in the same4719units as $x_i$.4720\index{standard deviation}472147224723\section{Skewness}4724\index{skewness}47254726{\bf Skewness} is a property that describes the shape of a distribution.4727If the distribution is symmetric around its central tendency, it is4728unskewed. If the values extend farther to the right, it is ``right4729skewed'' and if the values extend left, it is ``left skewed.''4730\index{central tendency}47314732This use of ``skewed'' does not have the usual connotation of4733``biased.'' Skewness only describes the shape of the distribution;4734it says nothing about whether the sampling process might have been4735biased.4736\index{bias}4737\index{sample skewness}47384739Several statistics are commonly used to quantify the skewness of a4740distribution. Given a sequence of values, $x_i$, the {\bf sample4741skewness}, $g_1$, can be computed like this:47424743\begin{verbatim}4744def StandardizedMoment(xs, k):4745var = CentralMoment(xs, 2)4746std = math.sqrt(var)4747return CentralMoment(xs, k) / std**k47484749def Skewness(xs):4750return StandardizedMoment(xs, 3)4751\end{verbatim}47524753$g_1$ is the third {\bf standardized moment}, which means that it has4754been normalized so it has no units.4755\index{standardized moment}47564757Negative skewness indicates that a distribution4758skews left; positive skewness indicates4759that a distribution skews right. The magnitude of $g_1$ indicates4760the strength of the skewness, but by itself it is not easy to4761interpret.47624763In practice, computing sample skewness is usually not4764a good idea. If there are any outliers, they4765have a disproportionate effect on $g_1$.4766\index{outlier}47674768Another way to evaluate the asymmetry of a distribution is to look4769at the relationship between the mean and median.4770Extreme values have more effect on the mean than the median, so4771in a distribution that skews left, the mean is less than the median.4772In a distribution that skews right, the mean is greater.4773\index{symmetric}4774\index{Pearson median skewness}47754776{\bf Pearson's median skewness coefficient} is a measure4777of skewness based on the difference between the4778sample mean and median:4779%4780\[ g_p = 3 (\xbar - m) / S \]4781%4782Where $\xbar$ is the sample mean, $m$ is the median, and4783$S$ is the standard deviation. 
Or in Python:4784\index{standard deviation}47854786\begin{verbatim}4787def Median(xs):4788cdf = thinkstats2.Cdf(xs)4789return cdf.Value(0.5)47904791def PearsonMedianSkewness(xs):4792median = Median(xs)4793mean = RawMoment(xs, 1)4794var = CentralMoment(xs, 2)4795std = math.sqrt(var)4796gp = 3 * (mean - median) / std4797return gp4798\end{verbatim}47994800This statistic is {\bf robust}, which means that it is less vulnerable4801to the effect of outliers.4802\index{robust}4803\index{outlier}48044805\begin{figure}4806\centerline{\includegraphics[height=2.2in]{figs/density_totalwgt_kde.pdf}}4807\caption{Estimated PDF of birthweight data from the NSFG.}4808\label{density_totalwgt_kde}4809\end{figure}48104811As an example, let's look at the skewness of birth weights in the4812NSFG pregnancy data. Here's the code to estimate and plot the PDF:4813\index{thinkplot}48144815\begin{verbatim}4816live, firsts, others = first.MakeFrames()4817data = live.totalwgt_lb.dropna()4818pdf = thinkstats2.EstimatedPdf(data)4819thinkplot.Pdf(pdf, label='birth weight')4820\end{verbatim}48214822Figure~\ref{density_totalwgt_kde} shows the result. The left tail appears4823longer than the right, so we suspect the distribution is skewed left.4824The mean, 7.27 lbs, is a bit less than4825the median, 7.38 lbs, so that is consistent with left skew.4826And both skewness coefficients are negative:4827sample skewness is -0.59;4828Pearson's median skewness is -0.23.4829\index{skewness}4830\index{dropna}4831\index{NaN}48324833\begin{figure}4834\centerline{\includegraphics[height=2.2in]{figs/density_wtkg2_kde.pdf}}4835\caption{Estimated PDF of adult weight data from the BRFSS.}4836\label{density_wtkg2_kde}4837\end{figure}48384839Now let's compare this distribution to the distribution of adult4840weight in the BRFSS. Again, here's the code:4841\index{thinkplot}48424843\begin{verbatim}4844df = brfss.ReadBrfss(nrows=None)4845data = df.wtkg2.dropna()4846pdf = thinkstats2.EstimatedPdf(data)4847thinkplot.Pdf(pdf, label='adult weight')4848\end{verbatim}48494850Figure~\ref{density_wtkg2_kde} shows the result. The distribution4851appears skewed to the right. Sure enough, the mean, 79.0, is bigger4852than the median, 77.3. The sample skewness is 1.1 and Pearson's4853median skewness is 0.26.4854\index{dropna}4855\index{NaN}48564857The sign of the skewness coefficient indicates whether the distribution4858skews left or right, but other than that, they are hard to interpret.4859Sample skewness is less robust; that is, it is more4860susceptible to outliers. As a result it is less reliable4861when applied to skewed distributions, exactly when it would be most4862relevant.4863\index{outlier}4864\index{robust}48654866Pearson's median skewness is based on a computed mean and variance,4867so it is also susceptible to outliers, but since it does not depend4868on a third moment, it is somewhat more robust.4869\index{Pearson median skewness}487048714872\section{Exercises}48734874A solution to this exercise is in \verb"chap06soln.py".48754876\begin{exercise}48774878The distribution of income is famously skewed to the right. In this4879exercise, we'll measure how strong that skew is.4880\index{skewness}4881\index{income}48824883The Current Population Survey (CPS) is a joint effort of the Bureau4884of Labor Statistics and the Census Bureau to study income and related4885variables. 
Data collected in 2013 is available from4886\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.4887I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with4888information about household income, and converted it to {\tt hinc06.csv},4889a CSV file you will find in the repository for this book. You4890will also find {\tt hinc2.py}, which reads this file and transforms4891the data.4892\index{Current Population Survey}4893\index{Bureau of Labor Statistics}4894\index{Census Bureau}48954896The dataset is in the form of a series of income ranges and the number4897of respondents who fell in each range. The lowest range includes4898respondents who reported annual household income ``Under \$5000.''4899The highest range includes respondents who made ``\$250,000 or4900more.''49014902To estimate mean and other statistics from these data, we have to4903make some assumptions about the lower and upper bounds, and how4904the values are distributed in each range. {\tt hinc2.py} provides4905{\tt InterpolateSample}, which shows one way to model4906this data. It takes a DataFrame with a column, {\tt income}, that4907contains the upper bound of each range, and {\tt freq}, which contains4908the number of respondents in each frame.4909\index{DataFrame}4910\index{model}49114912It also takes \verb"log_upper", which is an assumed upper bound4913on the highest range, expressed in {\tt log10} dollars.4914The default value, \verb"log_upper=6.0" represents the assumption4915that the largest income among the respondents is4916$10^6$, or one million dollars.49174918{\tt InterpolateSample} generates a pseudo-sample; that is, a sample4919of household incomes that yields the same number of respondents4920in each range as the actual data. It assumes that incomes in4921each range are equally spaced on a log10 scale.49224923Compute the median, mean, skewness and Pearson's skewness of the4924resulting sample. What fraction of households reports a taxable4925income below the mean? How do the results depend on the assumed4926upper bound?4927\end{exercise}492849294930\section{Glossary}49314932\begin{itemize}49334934\item Probability density function (PDF): The derivative of a continuous CDF,4935a function that maps a value to its probability density.4936\index{PDF}4937\index{probability density function}49384939\item Probability density: A quantity that can be integrated over a4940range of values to yield a probability. If the values are in units4941of cm, for example, probability density is in units of probability4942per cm.4943\index{probability density}49444945\item Kernel density estimation (KDE): An algorithm that estimates a PDF4946based on a sample.4947\index{kernel density estimation}4948\index{KDE}49494950\item discretize: To approximate a continuous function or distribution4951with a discrete function. 
The opposite of smoothing.4952\index{discretize}49534954\item raw moment: A statistic based on the sum of data raised to a power.4955\index{raw moment}49564957\item central moment: A statistic based on deviation from the mean,4958raised to a power.4959\index{central moment}49604961\item standardized moment: A ratio of moments that has no units.4962\index{standardized moment}49634964\item skewness: A measure of how asymmetric a distribution is.4965\index{skewness}49664967\item sample skewness: A moment-based statistic intended to quantify4968the skewness of a distribution.4969\index{sample skewness}49704971\item Pearson's median skewness coefficient: A statistic intended to4972quantify the skewness of a distribution based on the median, mean,4973and standard deviation.4974\index{Pearson median skewness}49754976\item robust: A statistic is robust if it is relatively immune to the4977effect of outliers.4978\index{robust}49794980\end{itemize}4981498249834984\chapter{Relationships between variables}49854986So far we have only looked at one variable at a time. In this4987chapter we look at relationships between variables. Two variables are4988related if knowing one gives you information about the other. For4989example, height and weight are related; people who are taller tend to4990be heavier. Of course, it is not a perfect relationship: there4991are short heavy people and tall light ones. But if you are4992trying to guess someone's weight, you will be more accurate if you4993know their height than if you don't.4994\index{adult weight}4995\index{adult height}49964997The code for this chapter is in {\tt scatter.py}.4998For information about downloading and4999working with this code, see Section~\ref{code}.500050015002\section{Scatter plots}5003\index{scatter plot}5004\index{plot!scatter}50055006The simplest way to check for a relationship between two variables5007is a {\bf scatter plot}, but making a good scatter plot is not always easy.5008As an example, I'll plot weight versus height for the respondents5009in the BRFSS (see Section~\ref{lognormal}).5010\index{BRFSS}50115012Here's the code that reads the data file and extracts height and5013weight:50145015\begin{verbatim}5016df = brfss.ReadBrfss(nrows=None)5017sample = thinkstats2.SampleRows(df, 5000)5018heights, weights = sample.htm3, sample.wtkg25019\end{verbatim}50205021{\tt SampleRows} chooses a random subset of the data:5022\index{SampleRows}50235024\begin{verbatim}5025def SampleRows(df, nrows, replace=False):5026indices = np.random.choice(df.index, nrows, replace=replace)5027sample = df.loc[indices]5028return sample5029\end{verbatim}50305031{\tt df} is the DataFrame, {\tt nrows} is the number of rows to choose,5032and {\tt replace} is a boolean indicating whether sampling should be5033done with replacement; in other words, whether the same row could be5034chosen more than once.5035\index{DataFrame}5036\index{thinkplot}5037\index{boolean}5038\index{replacement}50395040{\tt thinkplot} provides {\tt Scatter}, which makes scatter plots:5041%5042\begin{verbatim}5043thinkplot.Scatter(heights, weights)5044thinkplot.Show(xlabel='Height (cm)',5045ylabel='Weight (kg)',5046axis=[140, 210, 20, 200])5047\end{verbatim}50485049The result, in Figure~\ref{scatter1} (left), shows the shape of5050the relationship. 
As we expected, taller5051people tend to be heavier.50525053\begin{figure}5054% scatter.py5055\centerline{\includegraphics[height=3.0in]{figs/scatter1.pdf}}5056\caption{Scatter plots of weight versus height for the respondents5057in the BRFSS, unjittered (left), jittered (right).}5058\label{scatter1}5059\end{figure}50605061But this is not the best representation of5062the data, because the data are packed into columns. The problem is5063that the heights are rounded to the nearest inch, converted to5064centimeters, and then rounded again. Some information is lost in5065translation. \index{height} \index{weight} \index{jitter}50665067We can't get that information back, but we can minimize the effect on5068the scatter plot by {\bf jittering} the data, which means adding random5069noise to reverse the effect of rounding off. Since these measurements5070were rounded to the nearest inch, they might be off by up to 0.5 inches or50711.3 cm. Similarly, the weights might be off by 0.5 kg.5072\index{uniform distribution}5073\index{distribution!uniform}5074\index{noise}50755076%5077\begin{verbatim}5078heights = thinkstats2.Jitter(heights, 1.3)5079weights = thinkstats2.Jitter(weights, 0.5)5080\end{verbatim}50815082Here's the implementation of {\tt Jitter}:50835084\begin{verbatim}5085def Jitter(values, jitter=0.5):5086n = len(values)5087return np.random.uniform(-jitter, +jitter, n) + values5088\end{verbatim}50895090The values can be any sequence; the result is a NumPy array.5091\index{NumPy}50925093Figure~\ref{scatter1} (right) shows the result. Jittering reduces the5094visual effect of rounding and makes the shape of the relationship5095clearer. But in general you should only jitter data for purposes of5096visualization and avoid using jittered data for analysis.50975098Even with jittering, this is not the best way to represent the data.5099There are many overlapping points, which hides data5100in the dense parts of the figure and gives disproportionate emphasis5101to outliers. This effect is called {\bf saturation}.5102\index{outlier}5103\index{saturation}51045105\begin{figure}5106% scatter.py5107\centerline{\includegraphics[height=3.0in]{figs/scatter2.pdf}}5108\caption{Scatter plot with jittering and transparency (left),5109hexbin plot (right).}5110\label{scatter2}5111\end{figure}51125113We can solve this problem with the {\tt alpha} parameter, which makes5114the points partly transparent:5115%5116\begin{verbatim}5117thinkplot.Scatter(heights, weights, alpha=0.2)5118\end{verbatim}5119%5120Figure~\ref{scatter2} (left) shows the result. Overlapping data5121points look darker, so darkness is proportional to density. In this5122version of the plot we can see two details that were not apparent before:5123vertical clusters at several heights and a horizontal line near 90 kg5124or 200 pounds. Since this data is based on self-reports in pounds,5125the most likely explanation is that some respondents reported5126rounded values.5127\index{thinkplot}5128\index{alpha}5129\index{transparency}51305131Using transparency works well for moderate-sized datasets, but this5132figure only shows the first 5000 records in the BRFSS, out of a total5133of 414 509.5134\index{hexbin plot}5135\index{plot!hexbin}51365137To handle larger datasets, another option is a hexbin plot, which5138divides the graph into hexagonal bins and colors each bin according to5139how many data points fall in it. 
{\tt thinkplot} provides5140{\tt HexBin}:5141%5142\begin{verbatim}5143thinkplot.HexBin(heights, weights)5144\end{verbatim}5145%5146Figure~\ref{scatter2} (right) shows the result. An advantage of a5147hexbin is that it shows the shape of the relationship well, and it is5148efficient for large datasets, both in time and in the size of the file5149it generates. A drawback is that it makes the outliers invisible.5150\index{thinkplot}5151\index{outlier}51525153The point of this example is that it is5154not easy to make a scatter plot that shows relationships clearly5155without introducing misleading artifacts.5156\index{artifact}515751585159\section{Characterizing relationships}5160\label{characterizing}51615162Scatter plots provide a general impression of the relationship between5163variables, but there are other visualizations that provide more5164insight into the nature of the relationship. One option is to bin one5165variable and plot percentiles of the other.5166\index{binning}51675168NumPy and pandas provide functions for binning data:5169\index{NumPy}5170\index{pandas}51715172\begin{verbatim}5173df = df.dropna(subset=['htm3', 'wtkg2'])5174bins = np.arange(135, 210, 5)5175indices = np.digitize(df.htm3, bins)5176groups = df.groupby(indices)5177\end{verbatim}51785179{\tt dropna} drops rows with {\tt nan} in any of the listed columns.5180{\tt arange} makes a NumPy array of bins from 135 to, but not including,5181210, in increments of 5.5182\index{dropna}5183\index{digitize}5184\index{NaN}51855186{\tt digitize} computes the index of the bin that contains each value5187in {\tt df.htm3}. The result is a NumPy array of integer indices.5188Values that fall below the lowest bin are mapped to index 0. Values5189above the highest bin are mapped to {\tt len(bins)}.51905191\begin{figure}5192% scatter.py5193\centerline{\includegraphics[height=2.5in]{figs/scatter3.pdf}}5194\caption{Percentiles of weight for a range of height bins.}5195\label{scatter3}5196\end{figure}51975198{\tt groupby} is a DataFrame method that returns a GroupBy object;5199used in a {\tt for} loop, {\tt groups} iterates the names of the groups5200and the DataFrames that represent them. So, for example, we can5201print the number of rows in each group like this:5202\index{DataFrame}5203\index{groupby}52045205\begin{verbatim}5206for i, group in groups:5207print(i, len(group))5208\end{verbatim}52095210Now for each group we can compute the mean height and the CDF5211of weight:5212\index{Cdf}52135214\begin{verbatim}5215heights = [group.htm3.mean() for i, group in groups]5216cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]5217\end{verbatim}52185219Finally, we can5220plot percentiles of weight versus height:5221\index{percentile}52225223\begin{verbatim}5224for percent in [75, 50, 25]:5225weights = [cdf.Percentile(percent) for cdf in cdfs]5226label = '%dth' % percent5227thinkplot.Plot(heights, weights, label=label)5228\end{verbatim}52295230Figure~\ref{scatter3} shows the result. Between 140 and 200 cm5231the relationship between these variables is roughly linear. This range5232includes more than 99\% of the data, so we don't have to worry5233too much about the extremes.5234\index{thinkplot}523552365237\section{Correlation}52385239A {\bf correlation} is a statistic intended to quantify the strength5240of the relationship between two variables.5241\index{correlation}52425243A challenge in measuring correlation is that the variables we want to5244compare are often not expressed in the same units. 
And even if they5245are in the same units, they come from different distributions.5246\index{units}52475248There are two common solutions to these problems:52495250\begin{enumerate}52515252\item Transform each value to a {\bf standard score}, which is the5253number of standard deviations from the mean.5254This transform leads to5255the ``Pearson product-moment correlation coefficient.''5256\index{standard score}5257\index{standard deviation}5258\index{Pearson coefficient of correlation}52595260\item Transform each value to its {\bf rank}, which is its index in5261the sorted list of values. This transform5262leads to the ``Spearman rank correlation coefficient.''5263\index{rank}5264\index{percentile rank}5265\index{Spearman coefficient of correlation}52665267\end{enumerate}52685269If $X$ is a series of $n$ values, $x_i$, we can convert to standard5270scores by subtracting the mean and dividing by the standard deviation:5271$z_i = (x_i - \mu) / \sigma$.5272\index{mean}5273\index{standard deviation}52745275The numerator is a deviation: the distance from the mean. Dividing by5276$\sigma$ {\bf standardizes} the deviation, so the values of $Z$ are5277dimensionless (no units) and their distribution has mean 0 and5278variance 1.5279\index{standardize}5280\index{deviation}5281\index{normal distribution}5282\index{distribution!normal}5283\index{Gaussian distribution}5284\index{distribution!Gaussian}52855286If $X$ is normally distributed, so is $Z$. But if $X$ is skewed or has5287outliers, so does $Z$; in those cases, it is more robust to use5288percentile ranks. If we compute a new variable, $R$, so that $r_i$ is5289the rank of $x_i$, the distribution of $R$ is uniform5290from 1 to $n$, regardless of the distribution of $X$.5291\index{uniform distribution} \index{distribution!uniform}5292\index{robust}5293\index{skewness}5294\index{outlier}529552965297\section{Covariance}5298\index{covariance}5299\index{deviation}53005301{\bf Covariance} is a measure of the tendency of two variables5302to vary together. If we have two series, $X$ and $Y$, their5303deviations from the mean are5304%5305\[ dx_i = x_i - \xbar \]5306\[ dy_i = y_i - \ybar \]5307%5308where $\xbar$ is the sample mean of $X$ and $\ybar$ is the sample mean5309of $Y$. If $X$ and $Y$ vary together, their deviations tend to have5310the same sign.53115312If we multiply them together, the product is positive when the5313deviations have the same sign and negative when they have the opposite5314sign. So adding up the products gives a measure of the tendency to5315vary together.53165317Covariance is the mean of these products:5318%5319\[ Cov(X,Y) = \frac{1}{n} \sum dx_i~dy_i \]5320%5321where $n$ is the length of the two series (they have to be the same5322length).53235324If you have studied linear algebra, you might recognize that5325{\tt Cov} is the dot product of the deviations, divided5326by their length. So the covariance is maximized if the two vectors5327are identical, 0 if they are orthogonal, and negative if they5328point in opposite directions. 
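To see the dot product interpretation in action, here is a minimal
sketch (using made-up numbers, not code from {\tt thinkstats2}) that
computes the covariance both as the mean of the products of deviations
and as a dot product of the deviation vectors; the two results agree.

\begin{verbatim}
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 3.0, 5.0, 10.0])

dx = xs - xs.mean()    # deviations from the mean of xs
dy = ys - ys.mean()    # deviations from the mean of ys

cov1 = np.sum(dx * dy) / len(xs)    # mean of the products
cov2 = np.dot(dx, dy) / len(xs)     # same quantity as a dot product
\end{verbatim}
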
{\tt thinkstats2} uses {\tt np.dot} to5329implement {\tt Cov} efficiently:5330\index{linear algebra}5331\index{dot product}5332\index{orthogonal vector}53335334\begin{verbatim}5335def Cov(xs, ys, meanx=None, meany=None):5336xs = np.asarray(xs)5337ys = np.asarray(ys)53385339if meanx is None:5340meanx = np.mean(xs)5341if meany is None:5342meany = np.mean(ys)53435344cov = np.dot(xs-meanx, ys-meany) / len(xs)5345return cov5346\end{verbatim}53475348By default {\tt Cov} computes deviations from the sample means,5349or you can provide known means. If {\tt xs} and {\tt ys} are5350Python sequences, {\tt np.asarray} converts them to NumPy arrays.5351If they are already NumPy arrays, {\tt np.asarray} does nothing.5352\index{NumPy}53535354This implementation of covariance is meant to be simple for purposes5355of explanation. NumPy and pandas also provide implementations of5356covariance, but both of them apply a correction for small sample sizes5357that we have not covered yet, and {\tt np.cov} returns a covariance5358matrix, which is more than we need for now.5359\index{pandas}536053615362\section{Pearson's correlation}5363\index{correlation}5364\index{standard score}53655366Covariance is useful in some computations, but it is seldom reported5367as a summary statistic because it is hard to interpret. Among other5368problems, its units are the product of the units of $X$ and $Y$. For5369example, the covariance of weight and height in the BRFSS dataset is5370113 kilogram-centimeters, whatever that means.5371\index{deviation}5372\index{units}53735374One solution to this problem is to divide the deviations by the standard5375deviation, which yields standard scores, and compute the product of5376standard scores:5377%5378\[ p_i = \frac{(x_i - \xbar)}{S_X} \frac{(y_i - \ybar)}{S_Y} \]5379%5380Where $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$.5381The mean of these products is \index{standard deviation}5382%5383\[ \rho = \frac{1}{n} \sum p_i \]5384%5385Or we can rewrite $\rho$ by factoring out $S_X$ and5386$S_Y$:5387%5388\[ \rho = \frac{Cov(X,Y)}{S_X S_Y} \]5389%5390This value is called {\bf Pearson's correlation} after Karl Pearson,5391an influential early statistician. It is easy to compute and easy to5392interpret. Because standard scores are dimensionless, so is $\rho$.5393\index{Pearson, Karl}5394\index{Pearson coefficient of correlation}53955396Here is the implementation in {\tt thinkstats2}:53975398\begin{verbatim}5399def Corr(xs, ys):5400xs = np.asarray(xs)5401ys = np.asarray(ys)54025403meanx, varx = MeanVar(xs)5404meany, vary = MeanVar(ys)54055406corr = Cov(xs, ys, meanx, meany) / math.sqrt(varx * vary)5407return corr5408\end{verbatim}54095410{\tt MeanVar} computes mean and variance slightly more efficiently5411than separate calls to {\tt np.mean} and {\tt np.var}.5412\index{MeanVar}54135414Pearson's correlation is always between -1 and +1 (including both).5415If $\rho$ is positive, we say that the correlation is positive,5416which means that when one variable is high, the other tends to be5417high. If $\rho$ is negative, the correlation is negative, so5418when one variable is high, the other is low.54195420The magnitude of $\rho$ indicates the strength of the correlation. If5421$\rho$ is 1 or -1, the variables are perfectly correlated, which means5422that if you know one, you can make a perfect prediction about the5423other. \index{prediction}54245425Most correlation in the real world is not perfect, but it is still5426useful. 
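As a usage sketch (assuming {\tt df} is the BRFSS DataFrame with
missing heights and weights dropped, as in
Section~\ref{characterizing}; this is not necessarily how
{\tt scatter.py} organizes the computation), the correlation of the
height and weight columns is:

\begin{verbatim}
rho = thinkstats2.Corr(df.htm3, df.wtkg2)
\end{verbatim}
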
The correlation of height and weight is 0.51, which is a5427strong correlation compared to similar human-related variables.542854295430\section{Nonlinear relationships}54315432If Pearson's correlation is near 0, it is tempting to conclude5433that there is no relationship between the variables, but that5434conclusion is not valid. Pearson's correlation only measures {\em5435linear\/} relationships. If there's a nonlinear relationship, $\rho$5436understates its strength. \index{linear relationship}5437\index{nonlinear}5438\index{Pearson coefficient of correlation}54395440\begin{figure}5441\centerline{\includegraphics[height=2.5in]{figs/Correlation_examples.png}}5442\caption{Examples of datasets with a range of correlations.}5443\label{corr_examples}5444\end{figure}54455446Figure~\ref{corr_examples} is from5447\url{http://wikipedia.org/wiki/Correlation_and_dependence}. It shows5448scatter plots and correlation coefficients for several5449carefully constructed datasets.5450\index{scatter plot}5451\index{plot!scatter}54525453The top row shows linear relationships with a range of correlations;5454you can use this row to get a sense of what different values of5455$\rho$ look like. The second row shows perfect correlations with a5456range of slopes, which demonstrates that correlation is unrelated to5457slope (we'll talk about estimating slope soon). The third row shows5458variables that are clearly related, but because the relationship is5459nonlinear, the correlation coefficient is 0.5460\index{nonlinear}54615462The moral of this story is that you should always look at a scatter5463plot of your data before blindly computing a correlation coefficient.5464\index{correlation}546554665467\section{Spearman's rank correlation}54685469Pearson's correlation works well if the relationship between variables5470is linear and if the variables are roughly normal. But it is not5471robust in the presence of outliers.5472\index{Pearson coefficient of correlation}5473\index{Spearman coefficient of correlation}5474\index{normal distribution}5475\index{distribution!normal}5476\index{Gaussian distribution}5477\index{distribution!Gaussian}5478\index{robust}5479Spearman's rank correlation is an alternative that mitigates the5480effect of outliers and skewed distributions. To compute Spearman's5481correlation, we have to compute the {\bf rank} of each value, which is its5482index in the sorted sample. For example, in the sample {\tt [1, 2, 5, 7]}5483the rank of the value 5 is 3, because it appears third in the sorted5484list. Then we compute Pearson's correlation for the ranks.5485\index{skewness}5486\index{outlier}5487\index{rank}54885489{\tt thinkstats2} provides a function that computes Spearman's rank5490correlation:54915492\begin{verbatim}5493def SpearmanCorr(xs, ys):5494xranks = pandas.Series(xs).rank()5495yranks = pandas.Series(ys).rank()5496return Corr(xranks, yranks)5497\end{verbatim}54985499I convert the arguments to pandas Series objects so I can use5500{\tt rank}, which computes the rank for each value and returns5501a Series. Then I use {\tt Corr} to compute the correlation5502of the ranks.5503\index{pandas}5504\index{Series}55055506I could also use {\tt Series.corr} directly and specify5507Spearman's method:55085509\begin{verbatim}5510def SpearmanCorr(xs, ys):5511xs = pandas.Series(xs)5512ys = pandas.Series(ys)5513return xs.corr(ys, method='spearman')5514\end{verbatim}55155516The Spearman rank correlation for the BRFSS data is 0.54, a little5517higher than the Pearson correlation, 0.51. 
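Here is a sketch of the corresponding call, with {\tt df} as before
(again, not necessarily the exact code in {\tt scatter.py}):

\begin{verbatim}
spearman = thinkstats2.SpearmanCorr(df.htm3, df.wtkg2)
\end{verbatim}
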
There are several possible
reasons for the difference, including:
\index{rank correlation}
\index{BRFSS}

\begin{itemize}

\item If the relationship is
nonlinear, Pearson's correlation tends to underestimate the strength
of the relationship, and
\index{nonlinear}

\item Pearson's correlation can be affected (in either direction)
if one of the distributions is skewed or contains outliers. Spearman's
rank correlation is more robust.
\index{skewness}
\index{outlier}
\index{robust}

\end{itemize}

In the BRFSS example, we know that the distribution of weights is
roughly lognormal; under a log transform it approximates a normal
distribution, so it has no skew.
So another way to eliminate the effect of skewness is to
compute Pearson's
correlation with log-weight and height:
\index{lognormal distribution}
\index{distribution!lognormal}

\begin{verbatim}
thinkstats2.Corr(df.htm3, np.log(df.wtkg2))
\end{verbatim}

The result is 0.53, close to the rank correlation, 0.54. This
suggests that skewness in the distribution of weight explains most of
the difference between Pearson's and Spearman's correlation.
\index{skewness}
\index{Spearman coefficient of correlation}
\index{Pearson coefficient of correlation}


\section{Correlation and causation}
\index{correlation}
\index{causation}

If variables A and B are correlated, there are three possible
explanations: A causes B, or B causes A, or some other set of factors
causes both A and B. These explanations are called ``causal
relationships''.
\index{causal relationship}

Correlation alone does not distinguish between these explanations,
so it does not tell you which ones are true.
This rule is often summarized with the phrase ``Correlation
does not imply causation,'' which is so pithy it has its own
Wikipedia page: \url{http://wikipedia.org/wiki/Correlation_does_not_imply_causation}.

So what can you do to provide evidence of causation?

\begin{enumerate}

\item Use time. If A comes before B, then A can cause B but not the
other way around (at least according to our common understanding of
causation). The order of events can help us infer the direction
of causation, but it does not preclude the possibility that something
else causes both A and B.

\item Use randomness.
If you divide a large sample into two5586groups at random and compute the means of almost any variable, you5587expect the difference to be small.5588If the groups are nearly identical in all variables but one, you5589can eliminate spurious relationships.5590\index{spurious relationship}55915592This works even if you don't know what the relevant variables5593are, but it works even better if you do, because you can check that5594the groups are identical.55955596\end{enumerate}55975598These ideas are the motivation for the {\bf randomized controlled5599trial}, in which subjects are assigned randomly to two (or more)5600groups: a {\bf treatment group} that receives some kind of intervention,5601like a new medicine, and a {\bf control group} that receives5602no intervention, or another treatment whose effects are known.5603\index{randomized controlled trial}5604\index{controlled trial}5605\index{treatment group}5606\index{control group}5607\index{medicine}56085609A randomized controlled trial is the most reliable way to demonstrate5610a causal relationship, and the foundation of science-based medicine5611(see \url{http://wikipedia.org/wiki/Randomized_controlled_trial}).56125613Unfortunately, controlled trials are only possible in the laboratory5614sciences, medicine, and a few other disciplines. In the social sciences,5615controlled experiments are rare, usually because they are impossible5616or unethical.5617\index{ethics}56185619An alternative is to look for a {\bf natural experiment}, where5620different ``treatments'' are applied to groups that are otherwise5621similar. One danger of natural experiments is that the groups might5622differ in ways that are not apparent. You can read more about this5623topic at \url{http://wikipedia.org/wiki/Natural_experiment}.5624\index{natural experiment}56255626In some cases it is possible to infer causal relationships using {\bf5627regression analysis}, which is the topic of Chapter~\ref{regression}.5628\index{regression analysis}562956305631\section{Exercises}56325633A solution to this exercise is in \verb"chap07soln.py".56345635\begin{exercise}5636Using data from the NSFG, make a scatter plot of birth weight5637versus mother's age. Plot percentiles of birth weight5638versus mother's age. 
Compute Pearson's and Spearman's correlations.5639How would you characterize the relationship5640between these variables?5641\index{birth weight}5642\index{weight!birth}5643\index{Pearson coefficient of correlation}5644\index{Spearman coefficient of correlation}5645\end{exercise}564656475648\section{Glossary}56495650\begin{itemize}56515652\item scatter plot: A visualization of the relationship between5653two variables, showing one point for each row of data.5654\index{scatter plot}56555656\item jitter: Random noise added to data for purposes of5657visualization.5658\index{jitter}56595660\item saturation: Loss of information when multiple points are5661plotted on top of each other.5662\index{saturation}56635664\item correlation: A statistic that measures the strength of the5665relationship between two variables.5666\index{correlation}56675668\item standardize: To transform a set of values so that their mean is 0 and5669their variance is 1.5670\index{standardize}56715672\item standard score: A value that has been standardized so that it is5673expressed in standard deviations from the mean.5674\index{standard score}5675\index{standard deviation}56765677\item covariance: A measure of the tendency of two variables5678to vary together.5679\index{covariance}56805681\item rank: The index where an element appears in a sorted list.5682\index{rank}56835684\item randomized controlled trial: An experimental design in which subjects5685are divided into groups at random, and different groups are given different5686treatments.5687\index{randomized controlled trial}56885689\item treatment group: A group in a controlled trial that receives5690some kind of intervention.5691\index{treatment group}56925693\item control group: A group in a controlled trial that receives no5694treatment, or a treatment whose effect is known.5695\index{control group}56965697\item natural experiment: An experimental design that takes advantage of5698a natural division of subjects into groups in ways that are at least5699approximately random.5700\index{natural experiment}57015702\end{itemize}57035704570557065707\chapter{Estimation}5708\label{estimation}5709\index{estimation}57105711The code for this chapter is in {\tt estimation.py}. For information5712about downloading and working with this code, see Section~\ref{code}.571357145715\section{The estimation game}57165717Let's play a game. I think of a distribution, and you have to guess5718what it is. I'll give you two hints: it's a5719normal distribution, and here's a random sample drawn from it:5720\index{normal distribution}5721\index{distribution!normal}5722\index{Gaussian distribution}5723\index{distribution!Gaussian}57245725{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}57265727What do you think is the mean parameter, $\mu$, of this distribution?5728\index{mean}5729\index{parameter}57305731One choice is to use the sample mean, $\xbar$, as an estimate of $\mu$.5732In this example, $\xbar$ is 0.155, so it would5733be reasonable to guess $\mu$ = 0.155.5734This process is called {\bf estimation}, and the statistic we used5735(the sample mean) is called an {\bf estimator}.5736\index{estimator}57375738Using the sample mean to estimate $\mu$ is so obvious that it is hard5739to imagine a reasonable alternative. 
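If you want to check the arithmetic, here is a one-off sketch with
NumPy (not part of {\tt estimation.py}):

\begin{verbatim}
import numpy as np

xs = [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]
xbar = np.mean(xs)    # approximately 0.155
\end{verbatim}
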
But suppose we change the game by5740introducing outliers.5741\index{normal distribution}5742\index{distribution!normal}5743\index{Gaussian distribution}5744\index{distribution!Gaussian}57455746{\em I'm thinking of a distribution.\/} It's a normal distribution, and5747here's a sample that was collected by an unreliable surveyor who5748occasionally puts the decimal point in the wrong place.5749\index{measurement error}57505751{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -213.8]}57525753Now what's your estimate of $\mu$? If you use the sample mean, your5754guess is -35.12. Is that the best choice? What are the alternatives?5755\index{outlier}57565757One option is to identify and discard outliers, then compute the sample5758mean of the rest. Another option is to use the median as an estimator.5759\index{median}57605761Which estimator is best depends on the circumstances (for example,5762whether there are outliers) and on what the goal is. Are you5763trying to minimize errors, or maximize your chance of getting the5764right answer?5765\index{error}5766\index{MSE}5767\index{mean squared error}57685769If there are no outliers, the sample mean minimizes the {\bf mean squared5770error} (MSE). That is, if we play the game many times, and each time5771compute the error $\xbar - \mu$, the sample mean minimizes5772%5773\[ MSE = \frac{1}{m} \sum (\xbar - \mu)^2 \]5774%5775Where $m$ is the number of times you play the estimation game, not5776to be confused with $n$, which is the size of the sample used to5777compute $\xbar$.57785779Here is a function that simulates the estimation game and computes5780the root mean squared error (RMSE), which is the square root of5781MSE:5782\index{mean squared error}5783\index{MSE}5784\index{RMSE}57855786\begin{verbatim}5787def Estimate1(n=7, m=1000):5788mu = 05789sigma = 157905791means = []5792medians = []5793for _ in range(m):5794xs = [random.gauss(mu, sigma) for i in range(n)]5795xbar = np.mean(xs)5796median = np.median(xs)5797means.append(xbar)5798medians.append(median)57995800print('rmse xbar', RMSE(means, mu))5801print('rmse median', RMSE(medians, mu))5802\end{verbatim}58035804Again, {\tt n} is the size of the sample, and {\tt m} is the5805number of times we play the game. {\tt means} is the list of5806estimates based on $\xbar$. {\tt medians} is the list of medians.5807\index{median}58085809Here's the function that computes RMSE:58105811\begin{verbatim}5812def RMSE(estimates, actual):5813e2 = [(estimate-actual)**2 for estimate in estimates]5814mse = np.mean(e2)5815return math.sqrt(mse)5816\end{verbatim}58175818{\tt estimates} is a list of estimates; {\tt actual} is the5819actual value being estimated. In practice, of course, we don't5820know {\tt actual}; if we did, we wouldn't have to estimate it.5821The purpose of this experiment is to compare the performance of5822the two estimators.5823\index{estimator}58245825When I ran this code, the RMSE of the sample mean was 0.41, which5826means that if we use $\xbar$ to estimate the mean of this5827distribution, based on a sample with $n=7$, we should expect to be off5828by 0.41 on average. Using the median to estimate the mean yields5829RMSE 0.53, which confirms that $\xbar$ yields lower RMSE, at least5830for this example.58315832Minimizing MSE is a nice property, but it's not always the best5833strategy. For example, suppose we are estimating the distribution of5834wind speeds at a building site. If the estimate is too high, we might5835overbuild the structure, increasing its cost. 
But if it's too5836low, the building might collapse. Because cost as a function of5837error is not symmetric, minimizing MSE is not the best strategy.5838\index{prediction}5839\index{cost function}5840\index{MSE}58415842As another example, suppose I roll three six-sided dice and ask you to5843predict the total. If you get it exactly right, you get a prize;5844otherwise you get nothing. In this case the value that minimizes MSE5845is 10.5, but that would be a bad guess, because the total of three5846dice is never 10.5. For this game, you want an estimator that has the5847highest chance of being right, which is a {\bf maximum likelihood5848estimator} (MLE). If you pick 10 or 11, your chance of winning is 15849in 8, and that's the best you can do. \index{MLE}5850\index{maximum likelihood estimator}5851\index{dice}585258535854\section{Guess the variance}5855\index{variance}5856\index{normal distribution}5857\index{distribution!normal}5858\index{Gaussian distribution}5859\index{distribution!Gaussian}58605861{\em I'm thinking of a distribution\/.} It's a normal distribution, and5862here's a (familiar) sample:58635864{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}58655866What do you think is the variance, $\sigma^2$, of my distribution?5867Again, the obvious choice is to use the sample variance, $S^2$, as an5868estimator.5869%5870\[ S^2 = \frac{1}{n} \sum (x_i - \xbar)^2 \]5871%5872For large samples, $S^2$ is an adequate estimator, but for small5873samples it tends to be too low. Because of this unfortunate5874property, it is called a {\bf biased} estimator.5875An estimator is {\bf unbiased} if the expected total (or mean) error,5876after many iterations of the estimation game, is 0.5877\index{sample variance}5878\index{biased estimator}5879\index{estimator!biased}5880\index{unbiased estimator}5881\index{estimator!unbiased}58825883Fortunately, there is another simple statistic that is an unbiased5884estimator of $\sigma^2$:5885%5886\[ S_{n-1}^2 = \frac{1}{n-1} \sum (x_i - \xbar)^2 \]5887%5888For an explanation of why $S^2$ is biased, and a proof that5889$S_{n-1}^2$ is unbiased, see5890\url{http://wikipedia.org/wiki/Bias_of_an_estimator}.58915892The biggest problem with this estimator is that its name and symbol5893are used inconsistently. The name ``sample variance'' can refer to5894either $S^2$ or $S_{n-1}^2$, and the symbol $S^2$ is used5895for either or both.58965897Here is a function that simulates the estimation game and tests5898the performance of $S^2$ and $S_{n-1}^2$:58995900\begin{verbatim}5901def Estimate2(n=7, m=1000):5902mu = 05903sigma = 159045905estimates1 = []5906estimates2 = []5907for _ in range(m):5908xs = [random.gauss(mu, sigma) for i in range(n)]5909biased = np.var(xs)5910unbiased = np.var(xs, ddof=1)5911estimates1.append(biased)5912estimates2.append(unbiased)59135914print('mean error biased', MeanError(estimates1, sigma**2))5915print('mean error unbiased', MeanError(estimates2, sigma**2))5916\end{verbatim}59175918Again, {\tt n} is the sample size and {\tt m} is the number of times5919we play the game. 
{\tt np.var} computes $S^2$ by default and5920$S_{n-1}^2$ if you provide the argument {\tt ddof=1}, which stands for5921``delta degrees of freedom.'' I won't explain that term, but you can read5922about it at5923\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.5924\index{degrees of freedom}59255926{\tt MeanError} computes the mean difference between the estimates5927and the actual value:59285929\begin{verbatim}5930def MeanError(estimates, actual):5931errors = [estimate-actual for estimate in estimates]5932return np.mean(errors)5933\end{verbatim}59345935When I ran this code, the mean error for $S^2$ was -0.13. As5936expected, this biased estimator tends to be too low. For $S_{n-1}^2$,5937the mean error was 0.014, about 10 times smaller. As {\tt m}5938increases, we expect the mean error for $S_{n-1}^2$ to approach 0.5939\index{mean error}59405941Properties like MSE and bias are long-term expectations based on5942many iterations of the estimation game. By running simulations like5943the ones in this chapter, we can compare estimators and check whether5944they have desired properties.5945\index{biased estimator}5946\index{estimator!biased}59475948But when you apply an estimator to real5949data, you just get one estimate. It would not be meaningful to say5950that the estimate is unbiased; being unbiased is a property of the5951estimator, not the estimate.59525953After you choose an estimator with appropriate properties, and use it to5954generate an estimate, the next step is to characterize the5955uncertainty of the estimate, which is the topic of the next5956section.595759585959\section{Sampling distributions}5960\label{gorilla}59615962Suppose you are a scientist studying gorillas in a wildlife5963preserve. You want to know the average weight of the adult5964female gorillas in the preserve. To weigh them, you have5965to tranquilize them, which is dangerous, expensive, and possibly5966harmful to the gorillas. But if it is important to obtain this5967information, it might be acceptable to weigh a sample of 95968gorillas. Let's assume that the population of the preserve is5969well known, so we can choose a representative sample of adult5970females. We could use the sample mean, $\xbar$, to estimate the5971unknown population mean, $\mu$.5972\index{gorilla}5973\index{population}5974\index{sample}59755976Having weighed 9 female gorillas, you might find $\xbar=90$ kg and5977sample standard deviation, $S=7.5$ kg. The sample mean5978is an unbiased estimator of $\mu$, and in the long run it5979minimizes MSE. So if you report a single5980estimate that summarizes the results, you would report 90 kg.5981\index{MSE}5982\index{sample mean}5983\index{biased estimator}5984\index{estimator!biased}5985\index{standard deviation}59865987But how confident should you be in this estimate? If you only weigh5988$n=9$ gorillas out of a much larger population, you might be unlucky5989and choose the 9 heaviest gorillas (or the 9 lightest ones) just by5990chance. 
Variation in the estimate caused by random selection is5991called {\bf sampling error}.5992\index{sampling error}59935994To quantify sampling error, we can simulate the5995sampling process with hypothetical values of $\mu$ and $\sigma$, and5996see how much $\xbar$ varies.59975998Since we don't know the actual values of5999$\mu$ and $\sigma$ in the population, we'll use the estimates6000$\xbar$ and $S$.6001So the question we answer is:6002``If the actual values of $\mu$ and $\sigma$ were 90 kg and 7.5 kg,6003and we ran the same experiment many times, how much would the6004estimated mean, $\xbar$, vary?''60056006The following function answers that question:60076008\begin{verbatim}6009def SimulateSample(mu=90, sigma=7.5, n=9, m=1000):6010means = []6011for j in range(m):6012xs = np.random.normal(mu, sigma, n)6013xbar = np.mean(xs)6014means.append(xbar)60156016cdf = thinkstats2.Cdf(means)6017ci = cdf.Percentile(5), cdf.Percentile(95)6018stderr = RMSE(means, mu)6019\end{verbatim}60206021{\tt mu} and {\tt sigma} are the {\em hypothetical\/} values of6022the parameters. {\tt n} is the sample size, the number of6023gorillas we measured. {\tt m} is the number of times we run6024the simulation.6025\index{gorilla}6026\index{sample size}6027\index{simulation}60286029\begin{figure}6030% estimation.py6031\centerline{\includegraphics[height=2.5in]{figs/estimation1.pdf}}6032\caption{Sampling distribution of $\xbar$, with confidence interval.}6033\label{estimation1}6034\end{figure}60356036In each iteration, we choose {\tt n} values from a normal6037distribution with the given parameters, and compute the sample mean,6038{\tt xbar}. We run 1000 simulations and then compute the6039distribution, {\tt cdf}, of the estimates. The result is shown in6040Figure~\ref{estimation1}. This distribution is called the {\bf6041sampling distribution} of the estimator. It shows how much the6042estimates would vary if we ran the experiment over and over.6043\index{sampling distribution}60446045The mean of the sampling distribution is pretty close6046to the hypothetical value of $\mu$, which means that the experiment6047yields the right answer, on average. After 1000 tries, the lowest6048result is 82 kg, and the highest is 98 kg. This range suggests that6049the estimate might be off by as much as 8 kg.60506051There are two common ways to summarize the sampling distribution:60526053\begin{itemize}60546055\item {\bf Standard error} (SE) is a measure of how far we expect the6056estimate to be off, on average. For each simulated experiment, we6057compute the error, $\xbar - \mu$, and then compute the root mean6058squared error (RMSE). In this example, it is roughly 2.5 kg.6059\index{standard error}60606061\item A {\bf confidence interval} (CI) is a range that includes a6062given fraction of the sampling distribution. For example, the 90\%6063confidence interval is the range from the 5th to the 95th6064percentile. In this example, the 90\% CI is $(86, 94)$ kg.6065\index{confidence interval}6066\index{sampling distribution}60676068\end{itemize}60696070Standard errors and confidence intervals are the source of much confusion:60716072\begin{itemize}60736074\item People often confuse standard error and standard deviation.6075Remember that standard deviation describes variability in a measured6076quantity; in this example, the standard deviation of gorilla weight6077is 7.5 kg. Standard error describes variability in an estimate. 
In6078this example, the standard error of the mean, based on a sample of 96079measurements, is 2.5 kg.6080\index{gorilla}6081\index{standard deviation}60826083One way to remember the difference is that, as sample size6084increases, standard error gets smaller; standard deviation does not.60856086\item People often think that there is a 90\% probability that the6087actual parameter, $\mu$, falls in the 90\% confidence interval.6088Sadly, that is not true. If you want to make a claim like that, you6089have to use Bayesian methods (see my book, {\it Think Bayes\/}).6090\index{Bayesian statistics}60916092The sampling distribution answers a different question: it gives you6093a sense of how reliable an estimate is by telling you how much it6094would vary if you ran the experiment again.6095\index{sampling distribution}60966097\end{itemize}60986099It is important to remember that confidence intervals6100and standard errors only quantify sampling error; that is,6101error due to measuring only part of the population.6102The sampling distribution does not account for other6103sources of error, notably sampling bias and measurement error,6104which are the topics of the next section.610561066107\section{Sampling bias}61086109Suppose that instead of the weight of gorillas in a nature preserve,6110you want to know the average weight of women in the city where you6111live. It is unlikely that you would be allowed6112to choose a representative sample of women and6113weigh them.6114\index{gorilla}6115\index{adult weight}6116\index{sampling bias}6117\index{bias!sampling}6118\index{measurement error}61196120A simple alternative would be6121``telephone sampling;'' that is,6122you could choose random numbers from the phone book, call and ask to6123speak to an adult woman, and ask how much she weighs.6124\index{telephone sampling}6125\index{random number}61266127Telephone sampling has obvious limitations. For example, the sample6128is limited to people whose telephone numbers are listed, so it6129eliminates people without phones (who might be poorer than average)6130and people with unlisted numbers (who might be richer). Also, if you6131call home telephones during the day, you are less likely to sample6132people with jobs. And if you only sample the person who answers the6133phone, you are less likely to sample people who share a phone line.61346135If factors like income, employment, and household size are related6136to weight---and it is plausible that they are---the results of your6137survey would be affected one way or another. This problem is6138called {\bf sampling bias} because it is a property of the sampling6139process.6140\index{sampling bias}61416142This sampling process is also vulnerable to self-selection, which is a6143kind of sampling bias. Some people will refuse to answer the6144question, and if the tendency to refuse is related to weight, that6145would affect the results.6146\index{self-selection}61476148Finally, if you ask people how much they weigh, rather than weighing6149them, the results might not be accurate. Even helpful respondents6150might round up or down if they are uncomfortable with their actual6151weight. And not all respondents are helpful. These inaccuracies are6152examples of {\bf measurement error}.6153\index{measurement error}61546155When you report an estimated quantity, it is useful to report6156standard error, or a confidence interval, or both, in order to6157quantify sampling error. 
But it is also important to remember that6158sampling error is only one source of error, and often it is not the6159biggest.6160\index{standard error}6161\index{confidence interval}616261636164\section{Exponential distributions}6165\index{exponential distribution}6166\index{distribution!exponential}61676168Let's play one more round of the estimation game.6169{\em I'm thinking of a distribution.\/} It's an exponential distribution, and6170here's a sample:61716172{\tt [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]}61736174What do you think is the parameter, $\lambda$, of this distribution?6175\index{parameter}6176\index{mean}61776178\newcommand{\lamhat}{L}6179\newcommand{\lamhatmed}{L_m}61806181In general, the mean of an exponential distribution is $1/\lambda$,6182so working backwards, we might choose6183%6184\[ \lamhat = 1 / \xbar\]6185%6186$\lamhat$ is an6187estimator of $\lambda$. And not just any estimator; it is also the6188maximum likelihood estimator (see6189\url{http://wikipedia.org/wiki/Exponential_distribution#Maximum_likelihood}).6190So if you want to maximize your chance of guessing $\lambda$ exactly,6191$\lamhat$ is the way to go.6192\index{MLE}6193\index{maximum likelihood estimator}61946195But we know that $\xbar$ is not robust in the presence of outliers, so6196we expect $\lamhat$ to have the same problem.6197\index{robust}6198\index{outlier}6199\index{sample median}62006201We can choose an alternative based on the sample median.6202The median of an exponential distribution is $\ln(2) / \lambda$,6203so working backwards again, we can define an estimator6204%6205\[ \lamhatmed = \ln(2) / m \]6206%6207where $m$ is the sample median.6208\index{median}62096210To test the performance of these estimators, we can simulate the6211sampling process:62126213\begin{verbatim}6214def Estimate3(n=7, m=1000):6215lam = 262166217means = []6218medians = []6219for _ in range(m):6220xs = np.random.exponential(1.0/lam, n)6221L = 1 / np.mean(xs)6222Lm = math.log(2) / thinkstats2.Median(xs)6223means.append(L)6224medians.append(Lm)62256226print('rmse L', RMSE(means, lam))6227print('rmse Lm', RMSE(medians, lam))6228print('mean error L', MeanError(means, lam))6229print('mean error Lm', MeanError(medians, lam))6230\end{verbatim}62316232When I run this experiment with $\lambda=2$, the RMSE of $L$ is62331.1. For the median-based estimator $L_m$, RMSE is 1.8. We can't6234tell from this experiment whether $L$ minimizes MSE, but at least6235it seems better than $L_m$.6236\index{MSE}6237\index{RMSE}62386239Sadly, it seems that both estimators are biased. For $L$ the mean6240error is 0.33; for $L_m$ it is 0.45. And neither converges to 06241as {\tt m} increases.6242\index{biased estimator}6243\index{estimator!biased}62446245It turns out that $\xbar$ is an unbiased estimator of the mean6246of the distribution, $1 / \lambda$, but $L$ is not an unbiased6247estimator of $\lambda$.624862496250\section{Exercises}62516252For the following exercises, you might want to start with a copy of6253{\tt estimation.py}. 
Solutions are in \verb"chap08soln.py".

\begin{exercise}

In this chapter we used $\xbar$ and median to estimate $\mu$, and
found that $\xbar$ yields lower MSE.
Also, we used $S^2$ and $S_{n-1}^2$ to estimate $\sigma^2$, and found that
$S^2$ is biased and $S_{n-1}^2$ unbiased.

Run similar experiments to see if $\xbar$ and median are biased estimates
of $\mu$.
Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
\index{sample mean}
\index{sample median}
\index{estimator!biased}

\end{exercise}


\begin{exercise}

Suppose you draw a sample with size $n=10$ from
an exponential distribution with $\lambda=2$. Simulate
this experiment 1000 times and plot the sampling distribution of
the estimate $\lamhat$. Compute the standard error of the estimate
and the 90\% confidence interval.
\index{standard error}
\index{confidence interval}
\index{sampling distribution}

Repeat the experiment with a few different values of $n$ and make
a plot of standard error versus $n$.
\index{exponential distribution}
\index{distribution!exponential}

\end{exercise}


\begin{exercise}

In games like hockey and soccer, the time between goals is
roughly exponential. So you could estimate a team's goal-scoring rate
by observing the number of goals they score in a game. This
estimation process is a little different from sampling the time
between goals, so let's see how it works.
\index{hockey}
\index{soccer}

Write a function that takes a goal-scoring rate, {\tt lam}, in goals
per game, and simulates a game by generating the time between goals
until the total time exceeds 1 game, then returns the number of goals
scored.

Write another function that simulates many games, stores the
estimates of {\tt lam}, then computes their mean error and RMSE.

Is this way of making an estimate biased? Plot the sampling
distribution of the estimates and the 90\% confidence interval. What
is the standard error? What happens to sampling error for increasing
values of {\tt lam}?
\index{estimator!biased}
\index{biased estimator}
\index{standard error}
\index{confidence interval}

\end{exercise}


\section{Glossary}

\begin{itemize}

\item estimation: The process of inferring the parameters of a distribution
from a sample.
\index{estimation}

\item estimator: A statistic used to estimate a parameter.
\index{estimation}

\item mean squared error (MSE): A measure of estimation error.
\index{mean squared error}
\index{MSE}

\item root mean squared error (RMSE): The square root of MSE,
a more meaningful representation of typical error magnitude.
\index{mean squared error}
\index{MSE}

\item maximum likelihood estimator (MLE): An estimator that computes the
point estimate most likely to be correct.
\index{MLE}
\index{maximum likelihood estimator}

\item bias (of an estimator): The tendency of an estimator to be above or
below the actual value of the parameter, when averaged over repeated
experiments. \index{biased estimator}

\item sampling error: Error in an estimate due to the limited
size of the sample and variation due to chance. \index{point estimation}

\item sampling bias: Error in an estimate due to a sampling process
that is not representative of the population.
\index{sampling bias}63566357\item measurement error: Error in an estimate due to inaccuracy collecting6358or recording data. \index{measurement error}63596360\item sampling distribution: The distribution of a statistic if an6361experiment is repeated many times. \index{sampling distribution}63626363\item standard error: The RMSE of an estimate,6364which quantifies variability due to sampling error (but not6365other sources of error).6366\index{standard error}63676368\item confidence interval: An interval that represents the expected6369range of an estimator if an experiment is repeated many times.6370\index{confidence interval} \index{interval!confidence}63716372\end{itemize}637363746375\chapter{Hypothesis testing}6376\label{testing}63776378The code for this chapter is in {\tt hypothesis.py}. For information6379about downloading and working with this code, see Section~\ref{code}.63806381\section{Classical hypothesis testing}6382\index{hypothesis testing}6383\index{apparent effect}63846385Exploring the data from the NSFG, we saw several ``apparent effects,''6386including differences between first babies and others.6387So far we have taken these effects at face value; in this chapter,6388we put them to the test.6389\index{National Survey of Family Growth}6390\index{NSFG}63916392The fundamental question we want to address is whether the effects6393we see in a sample are likely to appear in the larger population.6394For example, in the NSFG sample we see a difference in mean pregnancy6395length for first babies and others. We would like to know if6396that effect reflects a real difference for women6397in the U.S., or if it might appear in the sample by chance.6398\index{pregnancy length} \index{length!pregnancy}63996400There are several ways we could formulate this question, including6401Fisher null hypothesis testing, Neyman-Pearson decision theory, and6402Bayesian inference\footnote{For more about Bayesian inference, see the6403sequel to this book, {\it Think Bayes}.}. What I present here is a6404subset of all three that makes up most of what people use in practice,6405which I will call {\bf classical hypothesis testing}.6406\index{Bayesian inference}6407\index{null hypothesis}64086409The goal of classical hypothesis testing is to answer the question,6410``Given a sample and an apparent effect, what is the probability of6411seeing such an effect by chance?'' Here's how we answer that question:64126413\begin{itemize}64146415\item The first step is to quantify the size of the apparent effect by6416choosing a {\bf test statistic}. In the NSFG example, the apparent6417effect is a difference in pregnancy length between first babies and6418others, so a natural choice for the test statistic is the difference6419in means between the two groups.6420\index{test statistic}64216422\item The second step is to define a {\bf null hypothesis}, which is a6423model of the system based on the assumption that the apparent effect6424is not real. In the NSFG example the null hypothesis is that there6425is no difference between first babies and others; that is, that6426pregnancy lengths for both groups have the same distribution.6427\index{null hypothesis}6428\index{pregnancy length}6429\index{model}64306431\item The third step is to compute a {\bf p-value}, which is the6432probability of seeing the apparent effect if the null hypothesis is6433true. 
In the NSFG example, we would compute the actual difference6434in means, then compute the probability of seeing a6435difference as big, or bigger, under the null hypothesis.6436\index{p-value}64376438\item The last step is to interpret the result. If the p-value is6439low, the effect is said to be {\bf statistically significant}, which6440means that it is unlikely to have occurred by chance. In that case6441we infer that the effect is more likely to appear in the larger6442population. \index{statistically significant} \index{significant}64436444\end{itemize}64456446The logic of this process is similar to a proof by6447contradiction. To prove a mathematical statement, A, you assume6448temporarily that A is false. If that assumption leads to a6449contradiction, you conclude that A must actually be true.6450\index{contradiction, proof by}6451\index{proof by contradiction}64526453Similarly, to test a hypothesis like, ``This effect is real,'' we6454assume, temporarily, that it is not. That's the null hypothesis.6455Based on that assumption, we compute the probability of the apparent6456effect. That's the p-value. If the p-value is low, we6457conclude that the null hypothesis is unlikely to be true.6458\index{p-value}6459\index{null hypothesis}646064616462\section{HypothesisTest}6463\label{hypotest}6464\index{mean!difference in}64656466{\tt thinkstats2} provides {\tt HypothesisTest}, a6467class that represents the structure of a classical hypothesis6468test. Here is the definition:6469\index{HypothesisTest}64706471\begin{verbatim}6472class HypothesisTest(object):64736474def __init__(self, data):6475self.data = data6476self.MakeModel()6477self.actual = self.TestStatistic(data)64786479def PValue(self, iters=1000):6480self.test_stats = [self.TestStatistic(self.RunModel())6481for _ in range(iters)]64826483count = sum(1 for x in self.test_stats if x >= self.actual)6484return count / iters64856486def TestStatistic(self, data):6487raise UnimplementedMethodException()64886489def MakeModel(self):6490pass64916492def RunModel(self):6493raise UnimplementedMethodException()6494\end{verbatim}64956496{\tt HypothesisTest} is an abstract parent class that provides6497complete definitions for some methods and place-keepers for others.6498Child classes based on {\tt HypothesisTest} inherit \verb"__init__"6499and {\tt PValue} and provide {\tt TestStatistic},6500{\tt RunModel}, and optionally {\tt MakeModel}.6501\index{HypothesisTest}65026503\verb"__init__" takes the data in whatever form is appropriate. It6504calls {\tt MakeModel}, which builds a representation of the null6505hypothesis, then passes the data to {\tt TestStatistic}, which6506computes the size of the effect in the sample.6507\index{test statistic}6508\index{null hypothesis}65096510{\tt PValue} computes the probability of the apparent effect under6511the null hypothesis. It takes as a parameter {\tt iters}, which is6512the number of simulations to run. The first line generates simulated6513data, computes test statistics, and stores them in6514\verb"test_stats".6515The result is6516the fraction of elements in \verb"test_stats" that6517exceed or equal the observed test statistic, {\tt self.actual}.6518\index{simulation}65196520As a simple example\footnote{Adapted from MacKay, {\it Information6521Theory, Inference, and Learning Algorithms}, 2003.}, suppose we6522toss a coin 250 times and see 140 heads and 110 tails. Based on this6523result, we might suspect that the coin is biased; that is, more likely6524to land heads. 
To test this hypothesis, we compute the6525probability of seeing such a difference if the coin is actually6526fair:6527\index{biased coin}6528\index{MacKay, David}65296530\begin{verbatim}6531class CoinTest(thinkstats2.HypothesisTest):65326533def TestStatistic(self, data):6534heads, tails = data6535test_stat = abs(heads - tails)6536return test_stat65376538def RunModel(self):6539heads, tails = self.data6540n = heads + tails6541sample = [random.choice('HT') for _ in range(n)]6542hist = thinkstats2.Hist(sample)6543data = hist['H'], hist['T']6544return data6545\end{verbatim}65466547The parameter, {\tt data}, is a pair of6548integers: the number of heads and tails. The test statistic is6549the absolute difference between them, so {\tt self.actual}6550is 30.6551\index{HypothesisTest}65526553{\tt RunModel} simulates coin tosses assuming that the coin is6554actually fair. It generates a sample of 250 tosses, uses Hist6555to count the number of heads and tails, and returns a pair of6556integers.6557\index{Hist}6558\index{model}65596560Now all we have to do is instantiate {\tt CoinTest} and call6561{\tt PValue}:65626563\begin{verbatim}6564ct = CoinTest((140, 110))6565pvalue = ct.PValue()6566\end{verbatim}65676568The result is about 0.07, which means that if the coin is6569fair, we expect to see a difference as big as 30 about 7\% of the6570time.65716572How should we interpret this result? By convention,65735\% is the threshold of statistical significance. If the6574p-value is less than 5\%, the effect is considered significant; otherwise6575it is not.6576\index{p-value}6577\index{statistically significant} \index{significant}65786579But the choice of 5\% is arbitrary, and (as we will see later) the6580p-value depends on the choice of the test statistics and6581the model of the null hypothesis. So p-values should not be considered6582precise measurements.6583\index{null hypothesis}65846585I recommend interpreting p-values according to their order of6586magnitude: if the p-value is less than 1\%, the effect is unlikely to6587be due to chance; if it is greater than 10\%, the effect can plausibly6588be explained by chance. P-values between 1\% and 10\% should be6589considered borderline. So in this example I conclude that the6590data do not provide strong evidence that the coin is biased or not.659165926593\section{Testing a difference in means}6594\label{testdiff}6595\index{mean!difference in}65966597One of the most common effects to test is a difference in mean6598between two groups. In the NSFG data, we saw that the mean pregnancy6599length for first babies is slightly longer, and the mean birth weight6600is slightly smaller. Now we will see if those effects are6601statistically significant.6602\index{National Survey of Family Growth}6603\index{NSFG}6604\index{pregnancy length}6605\index{length!pregnancy}66066607For these examples, the null hypothesis is that the distributions6608for the two groups are the same. 
One way to model the null6609hypothesis is by {\bf permutation}; that is, we can take values6610for first babies and others and shuffle them, treating6611the two groups as one big group:6612\index{null hypothesis}6613\index{permutation}6614\index{model}66156616\begin{verbatim}6617class DiffMeansPermute(thinkstats2.HypothesisTest):66186619def TestStatistic(self, data):6620group1, group2 = data6621test_stat = abs(group1.mean() - group2.mean())6622return test_stat66236624def MakeModel(self):6625group1, group2 = self.data6626self.n, self.m = len(group1), len(group2)6627self.pool = np.hstack((group1, group2))66286629def RunModel(self):6630np.random.shuffle(self.pool)6631data = self.pool[:self.n], self.pool[self.n:]6632return data6633\end{verbatim}66346635{\tt data} is a pair of sequences, one for each6636group. The test statistic is the absolute difference in the means.6637\index{HypothesisTest}66386639{\tt MakeModel} records the sizes of the groups, {\tt n} and6640{\tt m}, and combines the groups into one NumPy6641array, {\tt self.pool}.6642\index{NumPy}66436644{\tt RunModel} simulates the null hypothesis by shuffling the6645pooled values and splitting them into two groups with sizes {\tt n}6646and {\tt m}. As always, the return value from {\tt RunModel} has6647the same format as the observed data.6648\index{null hypothesis}6649\index{model}66506651To test the difference in pregnancy length, we run:66526653\begin{verbatim}6654live, firsts, others = first.MakeFrames()6655data = firsts.prglngth.values, others.prglngth.values6656ht = DiffMeansPermute(data)6657pvalue = ht.PValue()6658\end{verbatim}66596660{\tt MakeFrames} reads the NSFG data and returns DataFrames6661representing all live births, first babies, and others.6662We extract pregnancy lengths as NumPy arrays, pass them as6663data to {\tt DiffMeansPermute}, and compute the p-value. The6664result is about 0.17, which means that we expect to see a difference6665as big as the observed effect about 17\% of the time. So6666this effect is not statistically significant.6667\index{DataFrame}6668\index{p-value}6669\index{significant} \index{statistically significant}6670\index{pregnancy length}66716672\begin{figure}6673% hypothesis.py6674\centerline{\includegraphics[height=2.5in]{figs/hypothesis1.pdf}}6675\caption{CDF of difference in mean pregnancy length under the null6676hypothesis.}6677\label{hypothesis1}6678\end{figure}66796680{\tt HypothesisTest} provides {\tt PlotCdf}, which plots the6681distribution of the test statistic and a gray line indicating6682the observed effect size:6683\index{thinkplot}6684\index{HypothesisTest}6685\index{Cdf}6686\index{effect size}66876688\begin{verbatim}6689ht.PlotCdf()6690thinkplot.Show(xlabel='test statistic',6691ylabel='CDF')6692\end{verbatim}66936694Figure~\ref{hypothesis1} shows the result. The CDF intersects the6695observed difference at 0.83, which is the complement of the p-value,66960.17.6697\index{p-value}66986699If we run the same analysis with birth weight, the computed p-value6700is 0; after 1000 attempts,6701the simulation never yields an effect6702as big as the observed difference, 0.12 lbs. So we would6703report $p < 0.001$, and6704conclude that the difference in birth weight is statistically6705significant.6706\index{birth weight}6707\index{weight!birth}6708\index{significant} \index{statistically significant}670967106711\section{Other test statistics}67126713Choosing the best test statistic depends on what question you are6714trying to address. 
For example, if the relevant question is whether6715pregnancy lengths are different for first6716babies, then it makes sense to test the absolute difference in means,6717as we did in the previous section.6718\index{test statistic}6719\index{pregnancy length}67206721If we had some reason to think that first babies are likely6722to be late, then we would not take the absolute value of the difference;6723instead we would use this test statistic:67246725\begin{verbatim}6726class DiffMeansOneSided(DiffMeansPermute):67276728def TestStatistic(self, data):6729group1, group2 = data6730test_stat = group1.mean() - group2.mean()6731return test_stat6732\end{verbatim}67336734{\tt DiffMeansOneSided} inherits {\tt MakeModel} and {\tt RunModel}6735from {\tt DiffMeansPermute}; the only difference is that6736{\tt TestStatistic} does not take the absolute value of the6737difference. This kind of test is called {\bf one-sided} because6738it only counts one side of the distribution of differences. The6739previous test, using both sides, is {\bf two-sided}.6740\index{one-sided test}6741\index{two-sided test}67426743For this version of the test, the p-value is 0.09. In general6744the p-value for a one-sided test is about half the p-value for6745a two-sided test, depending on the shape of the distribution.6746\index{p-value}67476748The one-sided hypothesis, that first babies are born late, is more6749specific than the two-sided hypothesis, so the p-value is smaller.6750But even for the stronger hypothesis, the difference is6751not statistically significant.6752\index{significant} \index{statistically significant}67536754We can use the same framework to test for a difference in standard6755deviation. In Section~\ref{visualization}, we saw some evidence that6756first babies are more likely to be early or late, and less likely to6757be on time. So we might hypothesize that the standard deviation is6758higher. Here's how we can test that:6759\index{standard deviation}67606761\begin{verbatim}6762class DiffStdPermute(DiffMeansPermute):67636764def TestStatistic(self, data):6765group1, group2 = data6766test_stat = group1.std() - group2.std()6767return test_stat6768\end{verbatim}67696770This is a one-sided test because the hypothesis is that the standard6771deviation for first babies is higher, not just different. The p-value6772is 0.09, which is not statistically significant.6773\index{p-value}6774\index{permutation}6775\index{significant} \index{statistically significant}677667776778\section{Testing a correlation}6779\label{corrtest}67806781This framework can also test correlations. For example, in the NSFG6782data set, the correlation between birth weight and mother's age is6783about 0.07. It seems like older mothers have heavier babies. But6784could this effect be due to chance?6785\index{correlation}6786\index{test statistic}67876788For the test statistic, I use6789Pearson's correlation, but Spearman's would work as well.6790If we had reason to expect positive correlation, we would do a6791one-sided test. But since we have no such reason, I'll6792do a two-sided test using the absolute value of correlation.6793\index{Pearson coefficient of correlation}6794\index{Spearman coefficient of correlation}67956796The null hypothesis is that there is no correlation between mother's6797age and birth weight. 
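Before simulating that hypothesis, here is a quick sketch that
computes the observed correlation (the same {\tt dropna} cleanup
appears again below):

\begin{verbatim}
# Sketch: the observed correlation we are testing.
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
print(thinkstats2.Corr(live.agepreg, live.totalwgt_lb))  # about 0.07
\end{verbatim}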
By shuffling the observed values, we can6798simulate a world where the distributions of age and6799birth weight are the same, but where the variables are unrelated:6800\index{birth weight}6801\index{weight!birth}6802\index{null hypothesis}68036804\begin{verbatim}6805class CorrelationPermute(thinkstats2.HypothesisTest):68066807def TestStatistic(self, data):6808xs, ys = data6809test_stat = abs(thinkstats2.Corr(xs, ys))6810return test_stat68116812def RunModel(self):6813xs, ys = self.data6814xs = np.random.permutation(xs)6815return xs, ys6816\end{verbatim}68176818{\tt data} is a pair of sequences. {\tt TestStatistic} computes the6819absolute value of Pearson's correlation. {\tt RunModel} shuffles the6820{\tt xs} and returns simulated data.6821\index{HypothesisTest}6822\index{permutation}6823\index{Pearson coefficient of correlation}68246825Here's the code that reads the data and runs the test:68266827\begin{verbatim}6828live, firsts, others = first.MakeFrames()6829live = live.dropna(subset=['agepreg', 'totalwgt_lb'])6830data = live.agepreg.values, live.totalwgt_lb.values6831ht = CorrelationPermute(data)6832pvalue = ht.PValue()6833\end{verbatim}68346835I use {\tt dropna} with the {\tt subset} argument to drop rows6836that are missing either of the variables we need.6837\index{dropna}6838\index{NaN}6839\index{missing values}68406841The actual correlation is 0.07. The computed p-value is 0; after 10006842iterations the largest simulated correlation is 0.04. So although the6843observed correlation is small, it is statistically significant.6844\index{p-value}6845\index{significant} \index{statistically significant}68466847This example is a reminder that ``statistically significant'' does not6848always mean that an effect is important, or significant in practice.6849It only means that it is unlikely to have occurred by chance.685068516852\section{Testing proportions}6853\label{casino}6854\index{chi-squared test}68556856Suppose you run a casino and you suspect that a customer is6857using a crooked die; that6858is, one that has been modified to make one of the faces more6859likely than the others. You apprehend the alleged6860cheater and confiscate the die, but now you have to prove that it6861is crooked. You roll the die 60 times and get the following results:6862\index{casino}6863\index{dice}6864\index{crooked die}68656866\begin{center}6867\begin{tabular}{|l|c|c|c|c|c|c|}6868\hline6869Value & 1 & 2 & 3 & 4 & 5 & 6 \\6870\hline6871Frequency & 8 & 9 & 19 & 5 & 8 & 11 \\6872\hline6873\end{tabular}6874\end{center}68756876On average you expect each value to appear 10 times. In this6877dataset, the value 3 appears more often than expected, and the value 46878appears less often. But are these differences statistically6879significant?6880\index{frequency}6881\index{significant} \index{statistically significant}68826883To test this hypothesis, we can compute the expected frequency for6884each value, the difference between the expected and observed6885frequencies, and the total absolute difference. In this6886example, we expect each side to come up 10 times out of 60; the6887deviations from this expectation are -2, -1, 9, -5, -2, and 1; so the6888total absolute difference is 20. 
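Here is a quick check of that arithmetic, a sketch using NumPy:

\begin{verbatim}
# Sketch: checking the total absolute deviation by hand.
import numpy as np

observed = np.array([8, 9, 19, 5, 8, 11])
expected = np.ones(6) * 10
print(observed - expected)            # [-2. -1.  9. -5. -2.  1.]
print(sum(abs(observed - expected)))  # 20.0
\end{verbatim}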
How often would we see such a
difference by chance?
\index{deviation}

Here's a version of {\tt HypothesisTest} that answers that question:
\index{HypothesisTest}

\begin{verbatim}
class DiceTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum(abs(observed - expected))
        return test_stat

    def RunModel(self):
        n = sum(self.data)
        values = [1, 2, 3, 4, 5, 6]
        rolls = np.random.choice(values, n, replace=True)
        hist = thinkstats2.Hist(rolls)
        freqs = hist.Freqs(values)
        return freqs
\end{verbatim}

The data are represented as a list of frequencies: the observed
values are {\tt [8, 9, 19, 5, 8, 11]}; the expected frequencies
are all 10.  The test statistic is the sum of the absolute differences.
\index{frequency}

The null hypothesis is that the die is fair, so we simulate that by
drawing random samples from {\tt values}.  {\tt RunModel} uses
{\tt Hist} to compute and return the list of frequencies.
\index{Hist}
\index{null hypothesis}
\index{model}

The p-value for this data is 0.13, which means that if the die is
fair we expect to see the observed total deviation, or more, about
13\% of the time.  So the apparent effect is not statistically
significant.
\index{p-value}
\index{deviation}
\index{significant} \index{statistically significant}


\section{Chi-squared tests}
\label{casino2}

In the previous section we used total deviation as the test statistic.
But for testing proportions it is more common to use the chi-squared
statistic:
%
\[ \goodchi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
%
%% TODO: Consider using upper case chi, which is more strictly correct,
%% but harder to distinguish from X.
%
where $O_i$ are the observed frequencies and $E_i$ are the expected
frequencies.  Here's the Python code:
\index{chi-squared test}
\index{chi-squared statistic}
\index{test statistic}

\begin{verbatim}
class DiceChiTest(DiceTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum((observed - expected)**2 / expected)
        return test_stat
\end{verbatim}

Squaring the deviations (rather than taking absolute values) gives
more weight to large deviations.  Dividing through by {\tt expected}
standardizes the deviations, although in this case it has no effect
because the expected frequencies are all equal.
\index{deviation}

The p-value using the chi-squared statistic is 0.04,
substantially smaller than what we got using total deviation, 0.13.
If we take the 5\% threshold seriously, we would consider this effect
statistically significant.  But considering the two tests together, I
would say that the results are borderline.
I would not rule out the6975possibility that the die is crooked, but I would not convict the6976accused cheater.6977\index{p-value}6978\index{significant} \index{statistically significant}69796980This example demonstrates an important point: the p-value depends6981on the choice of test statistic and the model of the null hypothesis,6982and sometimes these choices determine whether an effect is6983statistically significant or not.6984\index{null hypothesis}6985\index{model}698669876988\section{First babies again}69896990Earlier in this chapter we looked at pregnancy lengths for first6991babies and others, and concluded that the apparent differences in6992mean and standard deviation are not statistically significant. But in6993Section~\ref{visualization}, we saw several apparent differences6994in the distribution of pregnancy length, especially in the range from699535 to 43 weeks. To see whether those differences are statistically6996significant, we can use a test based on a chi-squared statistic.6997\index{standard deviation}6998\index{statistically significant} \index{significant}6999\index{pregnancy length}70007001The code combines elements from previous examples:7002\index{HypothesisTest}70037004\begin{verbatim}7005class PregLengthTest(thinkstats2.HypothesisTest):70067007def MakeModel(self):7008firsts, others = self.data7009self.n = len(firsts)7010self.pool = np.hstack((firsts, others))70117012pmf = thinkstats2.Pmf(self.pool)7013self.values = range(35, 44)7014self.expected_probs = np.array(pmf.Probs(self.values))70157016def RunModel(self):7017np.random.shuffle(self.pool)7018data = self.pool[:self.n], self.pool[self.n:]7019return data7020\end{verbatim}70217022The data are represented as two lists of pregnancy lengths. The null7023hypothesis is that both samples are drawn from the same distribution.7024{\tt MakeModel} models that distribution by pooling the two7025samples using {\tt hstack}. Then {\tt RunModel} generates7026simulated data by shuffling the pooled sample and splitting it7027into two parts.7028\index{null hypothesis}7029\index{model}7030\index{hstack}7031\index{pregnancy length}70327033{\tt MakeModel} also defines {\tt values}, which is the7034range of weeks we'll use, and \verb"expected_probs",7035which is the probability of each value in the pooled distribution.70367037Here's the code that computes the test statistic:70387039\begin{verbatim}7040# class PregLengthTest:70417042def TestStatistic(self, data):7043firsts, others = data7044stat = self.ChiSquared(firsts) + self.ChiSquared(others)7045return stat70467047def ChiSquared(self, lengths):7048hist = thinkstats2.Hist(lengths)7049observed = np.array(hist.Freqs(self.values))7050expected = self.expected_probs * len(lengths)7051stat = sum((observed - expected)**2 / expected)7052return stat7053\end{verbatim}70547055{\tt TestStatistic} computes the chi-squared statistic for7056first babies and others, and adds them.7057\index{chi-squared statistic}70587059{\tt ChiSquared} takes a sequence of pregnancy lengths, computes7060its histogram, and computes {\tt observed}, which is a list of7061frequencies corresponding to {\tt self.values}.7062To compute the list of expected frequencies, it multiplies the7063pre-computed probabilities, \verb"expected_probs", by the sample7064size. It returns the chi-squared statistic, {\tt stat}.70657066For the NSFG data the total chi-squared statistic is 102, which7067doesn't mean much by itself. But after 1000 iterations, the largest7068test statistic generated under the null hypothesis is 32. 
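Here is a sketch of how the test can be run; it follows the same
pattern as {\tt DiffMeansPermute}, and the exact numbers vary from
run to run:

\begin{verbatim}
# Sketch: running PregLengthTest on the NSFG pregnancy lengths.
live, firsts, others = first.MakeFrames()
data = firsts.prglngth.values, others.prglngth.values
ht = PregLengthTest(data)
pvalue = ht.PValue()
\end{verbatim}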
We conclude
that the observed chi-squared statistic is unlikely under the null
hypothesis, so the apparent effect is statistically significant.
\index{null hypothesis}
\index{statistically significant} \index{significant}

This example demonstrates a limitation of chi-squared tests: they
indicate that there is a difference between the two groups,
but they don't say anything specific about what the difference is.


\section{Errors}
\index{error}

In classical hypothesis testing, an effect is considered statistically
significant if the p-value is below some threshold, commonly 5\%.
This procedure raises two questions:
\index{p-value}
\index{threshold}
\index{statistically significant} \index{significant}

\begin{itemize}

\item If the effect is actually due to chance, what is the probability
  that we will wrongly consider it significant?  This
  probability is the {\bf false positive rate}.
\index{false positive}

\item If the effect is real, what is the chance that the hypothesis
  test will fail?  This probability is the {\bf false negative rate}.
\index{false negative}

\end{itemize}

The false positive rate is relatively easy to compute: if the
threshold is 5\%, the false positive rate is 5\%.  Here's why:

\begin{itemize}

\item If there is no real effect, the null hypothesis is true, so we
  can compute the distribution of the test statistic by simulating the
  null hypothesis.  Call this distribution $\CDF_T$.
\index{null hypothesis}
\index{CDF}

\item Each time we run an experiment, we get a test statistic, $t$,
  which is drawn from $\CDF_T$.  Then we compute a p-value, which is
  the probability that a random value from $\CDF_T$ exceeds $t$,
  so that's $1 - \CDF_T(t)$.

\item The p-value is less than 5\% if $\CDF_T(t)$ is greater
  than 95\%; that is, if $t$ exceeds the 95th percentile.
  And how often does a value chosen from $\CDF_T$ exceed
  the 95th percentile?  5\% of the time.

\end{itemize}

So if you perform one hypothesis test with a 5\% threshold, you expect
a false positive 1 time in 20.


\section{Power}
\label{power}

The false negative rate is harder to compute because it depends on
the actual effect size, and normally we don't know that.
One option is to compute a rate
conditioned on a hypothetical effect size.
\index{effect size}

For example, if we assume that the observed difference between groups
is accurate, we can use the observed samples as a model of the
population and run hypothesis tests with simulated data:
\index{model}

\begin{verbatim}
def FalseNegRate(data, num_runs=100):
    group1, group2 = data
    count = 0

    for i in range(num_runs):
        sample1 = thinkstats2.Resample(group1)
        sample2 = thinkstats2.Resample(group2)

        ht = DiffMeansPermute((sample1, sample2))
        pvalue = ht.PValue(iters=101)
        if pvalue > 0.05:
            count += 1

    return count / num_runs
\end{verbatim}

{\tt FalseNegRate} takes data in the form of two sequences, one for
each group.
Each time through the loop, it simulates an experiment by7162drawing a random sample from each group and running a hypothesis test.7163Then it checks the result and counts the number of false negatives.7164\index{Resample}7165\index{permutation}71667167{\tt Resample} takes a sequence and draws a sample with the same7168length, with replacement:7169\index{replacement}71707171\begin{verbatim}7172def Resample(xs):7173return np.random.choice(xs, len(xs), replace=True)7174\end{verbatim}71757176Here's the code that tests pregnancy lengths:71777178\begin{verbatim}7179live, firsts, others = first.MakeFrames()7180data = firsts.prglngth.values, others.prglngth.values7181neg_rate = FalseNegRate(data)7182\end{verbatim}71837184The result is about 70\%, which means that if the actual difference in7185mean pregnancy length is 0.078 weeks, we expect an experiment with this7186sample size to yield a negative test 70\% of the time.7187\index{pregnancy length}71887189This result is often presented the other way around: if the actual7190difference is 0.078 weeks, we should expect a positive test only 30\%7191of the time. This ``correct positive rate'' is called the {\bf power}7192of the test, or sometimes ``sensitivity''. It reflects the ability of7193the test to detect an effect of a given size.7194\index{power}7195\index{sensitivity}7196\index{correct positive}71977198In this example, the test had only a 30\% chance of yielding a7199positive result (again, assuming that the difference is 0.078 weeks).7200As a rule of thumb, a power of 80\% is considered acceptable, so7201we would say that this test was ``underpowered.''7202\index{underpowered}72037204In general a negative hypothesis test does not imply that there is no7205difference between the groups; instead it suggests that if there is a7206difference, it is too small to detect with this sample size.720772087209\section{Replication}7210\label{replication}72117212The hypothesis testing process I demonstrated in this chapter is not,7213strictly speaking, good practice.72147215First, I performed multiple tests. If you run one hypothesis test,7216the chance of a false positive is about 1 in 20, which might be7217acceptable. But if you run 20 tests, you should expect at least one7218false positive, most of the time.7219\index{multiple tests}72207221Second, I used the same dataset for exploration and testing. If7222you explore a large dataset, find a surprising effect, and then test7223whether it is significant, you have a good chance of generating a7224false positive.7225\index{statistically significant} \index{significant}72267227To compensate for multiple tests, you can adjust the p-value7228threshold (see7229\url{https://en.wikipedia.org/wiki/Holm-Bonferroni_method}). Or you7230can address both problems by partitioning the data, using one set for7231exploration and the other for testing.7232\index{p-value}7233\index{Holm-Bonferroni method}72347235In some fields these practices are required or at least encouraged.7236But it is also common to address these problems implicitly by7237replicating published results. Typically the first paper to report a7238new result is considered exploratory. Subsequent papers that7239replicate the result with new data are considered confirmatory.7240\index{confirmatory result}72417242As it happens, we have an opportunity to replicate the results in this7243chapter. The first edition of this book is based on Cycle 6 of the7244NSFG, which was released in 2002. 
In October 2011, the CDC released7245additional data based on interviews conducted from 2006--2010. {\tt7246nsfg2.py} contains code to read and clean this data. In the new7247dataset:7248\index{NSFG}72497250\begin{itemize}72517252\item The difference in mean pregnancy length is72530.16 weeks and statistically significant with $p < 0.001$ (compared7254to 0.078 weeks in the original dataset).7255\index{statistically significant} \index{significant}7256\index{pregnancy length}72577258\item The difference in birth weight is 0.17 pounds with $p < 0.001$7259(compared to 0.12 lbs in the original dataset).7260\index{birth weight}7261\index{weight!birth}72627263\item The correlation between birth weight and mother's age is72640.08 with $p < 0.001$ (compared to 0.07).72657266\item The chi-squared test is statistically significant with7267$p < 0.001$ (as it was in the original).72687269\end{itemize}72707271In summary, all of the effects that were statistically significant7272in the original dataset were replicated in the new dataset, and the7273difference in pregnancy length, which was not significant in the7274original, is bigger in the new dataset and significant.727572767277\section{Exercises}72787279A solution to these exercises is in \verb"chap09soln.py".72807281\begin{exercise}7282As sample size increases, the power of a hypothesis test increases,7283which means it is more likely to be positive if the effect is real.7284Conversely, as sample size decreases, the test is less likely to7285be positive even if the effect is real.7286\index{sample size}72877288To investigate this behavior, run the tests in this chapter with7289different subsets of the NSFG data. You can use {\tt thinkstats2.SampleRows}7290to select a random subset of the rows in a DataFrame.7291\index{National Survey of Family Growth}7292\index{NSFG}7293\index{DataFrame}72947295What happens to the p-values of these tests as sample size decreases?7296What is the smallest sample size that yields a positive test?7297\index{p-value}7298\end{exercise}7299730073017302\begin{exercise}73037304In Section~\ref{testdiff}, we simulated the null hypothesis by7305permutation; that is, we treated the observed values as if they7306represented the entire population, and randomly assigned the7307members of the population to the two groups.7308\index{null hypothesis}7309\index{permutation}73107311An alternative is to use the sample to estimate the distribution for7312the population, then draw a random sample from that distribution.7313This process is called {\bf resampling}. There are several ways to7314implement resampling, but one of the simplest is to draw a sample7315with replacement from the observed values, as in Section~\ref{power}.7316\index{resampling}7317\index{replacement}73187319Write a class named {\tt DiffMeansResample} that inherits from7320{\tt DiffMeansPermute} and overrides {\tt RunModel} to implement7321resampling, rather than permutation.7322\index{permutation}73237324Use this model to test the differences in pregnancy length and7325birth weight. 
How much does the model affect the results?7326\index{model}7327\index{birth weight}7328\index{weight!birth}7329\index{pregnancy length}73307331\end{exercise}733273337334\section{Glossary}73357336\begin{itemize}73377338\item hypothesis testing: The process of determining whether an apparent7339effect is statistically significant.7340\index{hypothesis testing}73417342\item test statistic: A statistic used to quantify an effect size.7343\index{test statistic}7344\index{effect size}73457346\item null hypothesis: A model of a system based on the assumption that7347an apparent effect is due to chance.7348\index{null hypothesis}73497350\item p-value: The probability that an effect could occur by chance.7351\index{p-value}73527353\item statistically significant: An effect is statistically7354significant if it is unlikely to occur by chance.7355\index{significant} \index{statistically significant}73567357\item permutation test: A way to compute p-values by generating7358permutations of an observed dataset.7359\index{permutation test}73607361\item resampling test: A way to compute p-values by generating7362samples, with replacement, from an observed dataset.7363\index{resampling test}73647365\item two-sided test: A test that asks, ``What is the chance of an effect7366as big as the observed effect, positive or negative?''73677368\item one-sided test: A test that asks, ``What is the chance of an effect7369as big as the observed effect, and with the same sign?''7370\index{one-sided test}7371\index{two-sided test}7372\index{test!one-sided}7373\index{test!two-sided}73747375\item chi-squared test: A test that uses the chi-squared statistic as7376the test statistic.7377\index{chi-squared test}73787379\item false positive: The conclusion that an effect is real when it is not.7380\index{false positive}73817382\item false negative: The conclusion that an effect is due to chance when it7383is not.7384\index{false negative}73857386\item power: The probability of a positive test if the null hypothesis7387is false.7388\index{power}7389\index{null hypothesis}73907391\end{itemize}739273937394\chapter{Linear least squares}7395\label{linear}73967397The code for this chapter is in {\tt linear.py}. For information7398about downloading and working with this code, see Section~\ref{code}.739974007401\section{Least squares fit}74027403Correlation coefficients measure the strength and sign of a7404relationship, but not the slope. There are several ways to estimate7405the slope; the most common is a {\bf linear least squares fit}. A7406``linear fit'' is a line intended to model the relationship between7407variables. A ``least squares'' fit is one that minimizes the mean7408squared error (MSE) between the line and the data.7409\index{least squares fit}7410\index{linear least squares}7411\index{model}74127413Suppose we have a sequence of points, {\tt ys}, that we want to7414express as a function of another sequence {\tt xs}. If there is a7415linear relationship between {\tt xs} and {\tt ys} with intercept {\tt7416inter} and slope {\tt slope}, we expect each {\tt y[i]} to be7417{\tt inter + slope * x[i]}. \index{residuals}74187419But unless the correlation is perfect, this prediction is only7420approximate. The vertical deviation from the line, or {\bf residual},7421is7422\index{deviation}74237424\begin{verbatim}7425res = ys - (inter + slope * xs)7426\end{verbatim}74277428The residuals might be due to random factors like measurement error,7429or non-random factors that are unknown. 
For example, if we are7430trying to predict weight as a function of height, unknown factors7431might include diet, exercise, and body type.7432\index{slope}7433\index{intercept}7434\index{measurement error}74357436If we get the parameters {\tt inter} and {\tt slope} wrong, the residuals7437get bigger, so it makes intuitive sense that the parameters we want7438are the ones that minimize the residuals.7439\index{parameter}74407441We might try to minimize the absolute value of the7442residuals, or their squares, or their cubes; but the most common7443choice is to minimize the sum of squared residuals,7444{\tt sum(res**2)}.74457446Why? There are three good reasons and one less important one:74477448\begin{itemize}74497450\item Squaring has the feature of treating positive and7451negative residuals the same, which is usually what we want.74527453\item Squaring gives more weight to large residuals, but not7454so much weight that the largest residual always dominates.74557456\item If the residuals are uncorrelated and normally distributed with7457mean 0 and constant (but unknown) variance, then the least squares7458fit is also the maximum likelihood estimator of {\tt inter} and {\tt7459slope}. See7460\url{https://en.wikipedia.org/wiki/Linear_regression}. \index{MLE}7461\index{maximum likelihood estimator}7462\index{correlation}74637464\item The values of {\tt inter} and {\tt slope} that minimize7465the squared residuals can be computed efficiently.74667467\end{itemize}74687469The last reason made sense when computational efficiency was more7470important than choosing the method most appropriate to the problem7471at hand. That's no longer the case, so it is worth considering7472whether squared residuals are the right thing to minimize.7473\index{computational methods}7474\index{squared residuals}74757476For example, if you are using {\tt xs} to predict values of {\tt ys},7477guessing too high might be better (or worse) than guessing too low.7478In that case you might want to compute some cost function for each7479residual, and minimize total cost, {\tt sum(cost(res))}.7480However, computing a least squares fit is quick, easy and often good7481enough.7482\index{cost function}748374847485\section{Implementation}74867487{\tt thinkstats2} provides simple functions that demonstrate7488linear least squares:7489\index{LeastSquares}74907491\begin{verbatim}7492def LeastSquares(xs, ys):7493meanx, varx = MeanVar(xs)7494meany = Mean(ys)74957496slope = Cov(xs, ys, meanx, meany) / varx7497inter = meany - slope * meanx74987499return inter, slope7500\end{verbatim}75017502{\tt LeastSquares} takes sequences7503{\tt xs} and {\tt ys} and returns the estimated parameters {\tt inter}7504and {\tt slope}.7505For details on how it works, see7506\url{http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares}.7507\index{parameter}75087509{\tt thinkstats2} also provides {\tt FitLine}, which takes {\tt inter}7510and {\tt slope} and returns the fitted line for a sequence7511of {\tt xs}.7512\index{FitLine}75137514\begin{verbatim}7515def FitLine(xs, inter, slope):7516fit_xs = np.sort(xs)7517fit_ys = inter + slope * fit_xs7518return fit_xs, fit_ys7519\end{verbatim}75207521We can use these functions to compute the least squares fit for7522birth weight as a function of mother's age.7523\index{birth weight}7524\index{weight!birth}7525\index{age}75267527\begin{verbatim}7528live, firsts, others = first.MakeFrames()7529live = live.dropna(subset=['agepreg', 'totalwgt_lb'])7530ages = live.agepreg7531weights = 
live.totalwgt_lb75327533inter, slope = thinkstats2.LeastSquares(ages, weights)7534fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)7535\end{verbatim}75367537The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year.7538These values are hard to interpret in this form: the intercept is7539the expected weight of a baby whose mother is 0 years old, which7540doesn't make sense in context, and the slope is too small to7541grasp easily.7542\index{slope}7543\index{intercept}7544\index{dropna}7545\index{NaN}75467547Instead of presenting the intercept at $x=0$, it7548is often helpful to present the intercept at the mean of $x$. In7549this case the mean age is about 25 years and the mean baby weight7550for a 25 year old mother is 7.3 pounds. The slope is 0.27 ounces7551per year, or 0.17 pounds per decade.75527553\begin{figure}7554% linear.py7555\centerline{\includegraphics[height=2.5in]{figs/linear1.pdf}}7556\caption{Scatter plot of birth weight and mother's age with7557a linear fit.}7558\label{linear1}7559\end{figure}75607561Figure~\ref{linear1} shows a scatter plot of birth weight and age7562along with the fitted line. It's a good idea to look at a figure like7563this to assess whether the relationship is linear and whether the7564fitted line seems like a good model of the relationship.7565\index{birth weight}7566\index{weight!birth}7567\index{scatter plot}7568\index{plot!scatter}7569\index{model}757075717572\section{Residuals}7573\label{residuals}75747575Another useful test is to plot the residuals.7576{\tt thinkstats2} provides a function that computes residuals:7577\index{residuals}75787579\begin{verbatim}7580def Residuals(xs, ys, inter, slope):7581xs = np.asarray(xs)7582ys = np.asarray(ys)7583res = ys - (inter + slope * xs)7584return res7585\end{verbatim}75867587{\tt Residuals} takes sequences {\tt xs} and {\tt ys} and7588estimated parameters {\tt inter} and {\tt slope}. It returns7589the differences between the actual values and the fitted line.75907591\begin{figure}7592% linear.py7593\centerline{\includegraphics[height=2.5in]{figs/linear2.pdf}}7594\caption{Residuals of the linear fit.}7595\label{linear2}7596\end{figure}75977598To visualize the residuals, I group respondents by age and compute7599percentiles in each group, as we saw in Section~\ref{characterizing}.7600Figure~\ref{linear2} shows the 25th, 50th and 75th percentiles of7601the residuals for each age group. The median is near zero, as7602expected, and the interquartile range is about 2 pounds. So if we7603know the mother's age, we can guess the baby's weight within a pound,7604about 50\% of the time.7605\index{visualization}76067607Ideally these lines should be flat, indicating that the residuals are7608random, and parallel, indicating that the variance of the residuals is7609the same for all age groups. In fact, the lines are close to7610parallel, so that's good; but they have some curvature, indicating7611that the relationship is nonlinear. Nevertheless, the linear fit7612is a simple model that is probably good enough for some purposes.7613\index{model}7614\index{nonlinear}761576167617\section{Estimation}7618\label{regest}76197620The parameters {\tt slope} and {\tt inter} are estimates based on a7621sample; like other estimates, they are vulnerable to sampling bias,7622measurement error, and sampling error. 
As discussed in7623Chapter~\ref{estimation}, sampling bias is caused by non-representative7624sampling, measurement error is caused by errors in collecting7625and recording data, and sampling error is the result of measuring a7626sample rather than the entire population.7627\index{sampling bias}7628\index{bias!sampling}7629\index{measurement error}7630\index{sampling error}7631\index{estimation}76327633To assess sampling error, we ask, ``If we run this experiment again,7634how much variability do we expect in the estimates?'' We can7635answer this question by running simulated experiments and computing7636sampling distributions of the estimates.7637\index{sampling error}7638\index{sampling distribution}76397640I simulate the experiments by resampling the data; that is, I treat7641the observed pregnancies as if they were the entire population7642and draw samples, with replacement, from the observed sample.7643\index{simulation}7644\index{replacement}76457646\begin{verbatim}7647def SamplingDistributions(live, iters=101):7648t = []7649for _ in range(iters):7650sample = thinkstats2.ResampleRows(live)7651ages = sample.agepreg7652weights = sample.totalwgt_lb7653estimates = thinkstats2.LeastSquares(ages, weights)7654t.append(estimates)76557656inters, slopes = zip(*t)7657return inters, slopes7658\end{verbatim}76597660{\tt SamplingDistributions} takes a DataFrame with one row per live7661birth, and {\tt iters}, the number of experiments to simulate. It7662uses {\tt ResampleRows} to resample the observed pregnancies. We've7663already seen {\tt SampleRows}, which chooses random rows from a7664DataFrame. {\tt thinkstats2} also provides {\tt ResampleRows}, which7665returns a sample the same size as the original:7666\index{DataFrame}7667\index{resampling}76687669\begin{verbatim}7670def ResampleRows(df):7671return SampleRows(df, len(df), replace=True)7672\end{verbatim}76737674After resampling, we use the simulated sample to estimate parameters.7675The result is two sequences: the estimated intercepts and estimated7676slopes.7677\index{parameter}76787679I summarize the sampling distributions by printing the standard7680error and confidence interval:7681\index{sampling distribution}76827683\begin{verbatim}7684def Summarize(estimates, actual=None):7685mean = thinkstats2.Mean(estimates)7686stderr = thinkstats2.Std(estimates, mu=actual)7687cdf = thinkstats2.Cdf(estimates)7688ci = cdf.ConfidenceInterval(90)7689print('mean, SE, CI', mean, stderr, ci)7690\end{verbatim}76917692{\tt Summarize} takes a sequence of estimates and the actual value.7693It prints the mean of the estimates, the standard error and7694a 90\% confidence interval.7695\index{standard error}7696\index{confidence interval}76977698For the intercept, the mean estimate is 6.83, with standard error76990.07 and 90\% confidence interval (6.71, 6.94). The estimated slope, in7700more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220).7701There is almost a factor of two between the low and high ends of7702this CI, so it should be considered a rough estimate.77037704%inter 6.83039697331 6.831740353667705%SE, CI 0.0699814482068 (6.7146843084406846, 6.9447797068631871)7706%slope 0.0174538514718 0.01738409269367707%SE, CI 0.00276116142884 (0.012635074392201724, 0.021975282350381781)77087709To visualize the sampling error of the estimate, we could plot7710all of the fitted lines, or for a less cluttered representation,7711plot a 90\% confidence interval for each age. 
Here's the code:77127713\begin{verbatim}7714def PlotConfidenceIntervals(xs, inters, slopes,7715percent=90, **options):7716fys_seq = []7717for inter, slope in zip(inters, slopes):7718fxs, fys = thinkstats2.FitLine(xs, inter, slope)7719fys_seq.append(fys)77207721p = (100 - percent) / 27722percents = p, 100 - p7723low, high = thinkstats2.PercentileRows(fys_seq, percents)7724thinkplot.FillBetween(fxs, low, high, **options)7725\end{verbatim}77267727{\tt xs} is the sequence of mother's age. {\tt inters} and {\tt slopes}7728are the estimated parameters generated by {\tt SamplingDistributions}.7729{\tt percent} indicates which confidence interval to plot.77307731{\tt PlotConfidenceIntervals} generates a fitted line for each pair7732of {\tt inter} and {\tt slope} and stores the results in a sequence,7733\verb"fys_seq". Then it uses {\tt PercentileRows} to select the7734upper and lower percentiles of {\tt y} for each value of {\tt x}.7735For a 90\% confidence interval, it selects the 5th and 95th percentiles.7736{\tt FillBetween} draws a polygon that fills the space between two7737lines.7738\index{thinkplot}7739\index{FillBetween}77407741\begin{figure}7742% linear.py7743\centerline{\includegraphics[height=2.5in]{figs/linear3.pdf}}7744\caption{50\% and 90\% confidence intervals showing variability in the7745fitted line due to sampling error of {\tt inter} and {\tt slope}.}7746\label{linear3}7747\end{figure}77487749Figure~\ref{linear3} shows the 50\% and 90\% confidence7750intervals for curves fitted to birth weight as a function of7751mother's age.7752The vertical width of the region represents the effect of7753sampling error; the effect is smaller for values near the mean and7754larger for the extremes.775577567757\section{Goodness of fit}7758\label{goodness}7759\index{goodness of fit}77607761There are several ways to measure the quality of a linear model, or7762{\bf goodness of fit}. One of the simplest is the standard deviation7763of the residuals.7764\index{standard deviation}7765\index{model}77667767If you use a linear model to make predictions, {\tt Std(res)}7768is the root mean squared error (RMSE) of your predictions. For7769example, if you use mother's age to guess birth weight, the RMSE of7770your guess would be 1.40 lbs.7771\index{birth weight}7772\index{weight!birth}77737774If you guess birth weight without knowing the mother's age, the RMSE7775of your guess is {\tt Std(ys)}, which is 1.41 lbs. So in this7776example, knowing a mother's age does not improve the predictions7777substantially.7778\index{prediction}77797780Another way to measure goodness of fit is the {\bf7781coefficient of determination}, usually denoted $R^2$ and7782called ``R-squared'':7783\index{coefficient of determination}7784\index{r-squared}77857786\begin{verbatim}7787def CoefDetermination(ys, res):7788return 1 - Var(res) / Var(ys)7789\end{verbatim}77907791{\tt Var(res)} is the MSE of your guesses using the model,7792{\tt Var(ys)} is the MSE without it. 
So their ratio is the fraction7793of MSE that remains if you use the model, and $R^2$ is the fraction7794of MSE the model eliminates.7795\index{MSE}77967797For birth weight and mother's age, $R^2$ is 0.0047, which means7798that mother's age predicts about half of 1\% of variance in7799birth weight.78007801There is a simple relationship between the coefficient of7802determination and Pearson's coefficient of correlation: $R^2 = \rho^2$.7803For example, if $\rho$ is 0.8 or -0.8, $R^2 = 0.64$.7804\index{Pearson coefficient of correlation}78057806Although $\rho$ and $R^2$ are often used to quantify the strength of a7807relationship, they are not easy to interpret in terms of predictive7808power. In my opinion, {\tt Std(res)} is the best representation7809of the quality of prediction, especially if it is presented7810in relation to {\tt Std(ys)}.7811\index{coefficient of determination}7812\index{r-squared}78137814For example, when people talk about the validity of the SAT7815(a standardized test used for college admission in the U.S.) they7816often talk about correlations between SAT scores and other measures of7817intelligence.7818\index{SAT}7819\index{IQ}78207821According to one study, there is a Pearson correlation of7822$\rho=0.72$ between total SAT scores and IQ scores, which sounds like7823a strong correlation. But $R^2 = \rho^2 = 0.52$, so SAT scores7824account for only 52\% of variance in IQ.78257826IQ scores are normalized with {\tt Std(ys) = 15}, so78277828\begin{verbatim}7829>>> var_ys = 15**27830>>> rho = 0.727831>>> r2 = rho**27832>>> var_res = (1 - r2) * var_ys7833>>> std_res = math.sqrt(var_res)783410.40967835\end{verbatim}78367837So using SAT score to predict IQ reduces RMSE from 15 points to 10.47838points. A correlation of 0.72 yields a reduction in RMSE of only783931\%.78407841If you see a correlation that looks impressive, remember that $R^2$ is7842a better indicator of reduction in MSE, and reduction in RMSE is a7843better indicator of predictive power.7844\index{coefficient of determination}7845\index{r-squared}7846\index{prediction}784778487849\section{Testing a linear model}78507851The effect of mother's age on birth weight is small, and has little7852predictive power. So is it possible that the apparent relationship7853is due to chance? There are several ways we might test the7854results of a linear fit.7855\index{birth weight}7856\index{weight!birth}7857\index{model}7858\index{linear model}78597860One option is to test whether the apparent reduction in MSE is due to7861chance. In that case, the test statistic is $R^2$ and the null7862hypothesis is that there is no relationship between the variables. We7863can simulate the null hypothesis by permutation, as in7864Section~\ref{corrtest}, when we tested the correlation between7865mother's age and birth weight. 
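That earlier test used $\rho$ as the test statistic; a quick sketch
confirms that $R^2$ carries the same information for this fit.  It
assumes {\tt ages}, {\tt weights}, {\tt inter}, and {\tt slope} from
the previous chapter, and that {\tt CoefDetermination} is available in
{\tt thinkstats2} along with {\tt Residuals}:

\begin{verbatim}
# Sketch: verifying that R^2 equals rho^2 for this fit.
rho = thinkstats2.Corr(ages, weights)
res = thinkstats2.Residuals(ages, weights, inter, slope)
r2 = thinkstats2.CoefDetermination(weights, res)
print(rho**2, r2)    # both are about 0.0047
\end{verbatim}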
In fact, because $R^2 = \rho^2$, a7866one-sided test of $R^2$ is equivalent to a two-sided test of $\rho$.7867We've already done that test, and found $p < 0.001$, so we conclude7868that the apparent relationship between mother's age and birth weight7869is statistically significant.7870\index{null hypothesis}7871\index{permutation}7872\index{coefficient of determination}7873\index{r-squared}7874\index{significant} \index{statistically significant}78757876Another approach is to test whether the apparent slope is due to chance.7877The null hypothesis is that the slope is actually zero; in that case7878we can model the birth weights as random variations around their mean.7879Here's a HypothesisTest for this model:7880\index{HypothesisTest}7881\index{model}78827883\begin{verbatim}7884class SlopeTest(thinkstats2.HypothesisTest):78857886def TestStatistic(self, data):7887ages, weights = data7888_, slope = thinkstats2.LeastSquares(ages, weights)7889return slope78907891def MakeModel(self):7892_, weights = self.data7893self.ybar = weights.mean()7894self.res = weights - self.ybar78957896def RunModel(self):7897ages, _ = self.data7898weights = self.ybar + np.random.permutation(self.res)7899return ages, weights7900\end{verbatim}79017902The data are represented as sequences of ages and weights. The7903test statistic is the slope estimated by {\tt LeastSquares}.7904The model of the null hypothesis is represented by the mean weight7905of all babies and the deviations from the mean. To7906generate simulated data, we permute the deviations and add them to7907the mean.7908\index{deviation}7909\index{null hypothesis}7910\index{permutation}79117912Here's the code that runs the hypothesis test:79137914\begin{verbatim}7915live, firsts, others = first.MakeFrames()7916live = live.dropna(subset=['agepreg', 'totalwgt_lb'])7917ht = SlopeTest((live.agepreg, live.totalwgt_lb))7918pvalue = ht.PValue()7919\end{verbatim}79207921The p-value is less than $0.001$, so although the estimated7922slope is small, it is unlikely to be due to chance.7923\index{p-value}7924\index{dropna}7925\index{NaN}79267927Estimating the p-value by simulating the null hypothesis is strictly7928correct, but there is a simpler alternative. Remember that we already7929computed the sampling distribution of the slope, in7930Section~\ref{regest}. To do that, we assumed that the observed slope7931was correct and simulated experiments by resampling.7932\index{null hypothesis}79337934Figure~\ref{linear4} shows the sampling distribution of the7935slope, from Section~\ref{regest}, and the distribution of slopes7936generated under the null hypothesis. The sampling distribution7937is centered about the estimated slope, 0.017 lbs/year, and the slopes7938under the null hypothesis are centered around 0; but other than7939that, the distributions are identical. The distributions are7940also symmetric, for reasons we will see in Section~\ref{CLT}.7941\index{symmetric}7942\index{sampling distribution}79437944\begin{figure}7945% linear.py7946\centerline{\includegraphics[height=2.5in]{figs/linear4.pdf}}7947\caption{The sampling distribution of the estimated7948slope and the distribution of slopes7949generated under the null hypothesis. 
The vertical lines are at 07950and the observed slope, 0.017 lbs/year.}7951\label{linear4}7952\end{figure}79537954So we could estimate the p-value two ways:7955\index{p-value}79567957\begin{itemize}79587959\item Compute the probability that the slope under the null7960hypothesis exceeds the observed slope.7961\index{null hypothesis}79627963\item Compute the probability that the slope in the sampling7964distribution falls below 0. (If the estimated slope were negative,7965we would compute the probability that the slope in the sampling7966distribution exceeds 0.)79677968\end{itemize}79697970The second option is easier because we normally want to compute the7971sampling distribution of the parameters anyway. And it is a good7972approximation unless the sample size is small {\em and\/} the7973distribution of residuals is skewed. Even then, it is usually good7974enough, because p-values don't have to be precise.7975\index{skewness}7976\index{parameter}79777978Here's the code that estimates the p-value of the slope using the7979sampling distribution:7980\index{sampling distribution}79817982\begin{verbatim}7983inters, slopes = SamplingDistributions(live, iters=1001)7984slope_cdf = thinkstats2.Cdf(slopes)7985pvalue = slope_cdf[0]7986\end{verbatim}79877988Again, we find $p < 0.001$.798979907991\section{Weighted resampling}7992\label{weighted}79937994So far we have treated the NSFG data as if it were a representative7995sample, but as I mentioned in Section~\ref{nsfg}, it is not. The7996survey deliberately oversamples several groups in order to7997improve the chance of getting statistically significant results; that7998is, in order to improve the power of tests involving these groups.7999\index{significant} \index{statistically significant}80008001This survey design is useful for many purposes, but it means that we8002cannot use the sample to estimate values for the general8003population without accounting for the sampling process.80048005For each respondent, the NSFG data includes a variable called {\tt8006finalwgt}, which is the number of people in the general population8007the respondent represents. This value is called a {\bf sampling8008weight}, or just ``weight.''8009\index{sampling weight}8010\index{weight}8011\index{weighted resampling}8012\index{resampling!weighted}80138014As an example, if you survey 100,000 people in a country of 3008015million, each respondent represents 3,000 people. If you oversample8016one group by a factor of 2, each person in the oversampled8017group would have a lower weight, about 1500.80188019To correct for oversampling, we can use resampling; that is, we8020can draw samples from the survey using probabilities proportional8021to sampling weights. Then, for any quantity we want to estimate, we can8022generate sampling distributions, standard errors, and confidence8023intervals. 
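One simple way to draw such a weighted sample is with NumPy's
{\tt random.choice}; the following is only a sketch of the idea, not
the {\tt Cdf}-based implementation used below:

\begin{verbatim}
# Sketch: weighted resampling with NumPy.
def ResampleRowsWeightedNumpy(df, column='finalwgt'):
    weights = df[column]
    probs = weights / weights.sum()
    indices = np.random.choice(df.index, len(df), replace=True, p=probs)
    return df.loc[indices]
\end{verbatim}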
As an example, I will estimate mean birth weight with8024and without sampling weights.8025\index{standard error}8026\index{confidence interval}8027\index{birth weight}8028\index{weight!birth}8029\index{sampling distribution}8030\index{oversampling}80318032In Section~\ref{regest}, we saw {\tt ResampleRows}, which chooses8033rows from a DataFrame, giving each row the same probability.8034Now we need to do the same thing using probabilities8035proportional to sampling weights.8036{\tt ResampleRowsWeighted} takes a DataFrame, resamples rows according8037to the weights in {\tt finalwgt}, and returns a DataFrame containing8038the resampled rows:8039\index{DataFrame}8040\index{resampling}80418042\begin{verbatim}8043def ResampleRowsWeighted(df, column='finalwgt'):8044weights = df[column]8045cdf = Cdf(dict(weights))8046indices = cdf.Sample(len(weights))8047sample = df.loc[indices]8048return sample8049\end{verbatim}80508051{\tt weights} is a Series; converting it to a dictionary makes8052a map from the indices to the weights. In {\tt cdf} the values8053are indices and the probabilities are proportional to the8054weights.80558056{\tt indices} is a sequence of row indices; {\tt sample} is a8057DataFrame that contains the selected rows. Since we sample with8058replacement, the same row might appear more than once. \index{Cdf}8059\index{replacement}80608061Now we can compare the effect of resampling with and without8062weights. Without weights, we generate the sampling distribution8063like this:8064\index{sampling distribution}80658066\begin{verbatim}8067estimates = [ResampleRows(live).totalwgt_lb.mean()8068for _ in range(iters)]8069\end{verbatim}80708071With weights, it looks like this:80728073\begin{verbatim}8074estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()8075for _ in range(iters)]8076\end{verbatim}80778078The following table summarizes the results:80798080\begin{center}8081\begin{tabular}{|l|c|c|c|}8082\hline8083& mean birth & standard & 90\% CI \\8084& weight (lbs) & error & \\8085\hline8086Unweighted & 7.27 & 0.014 & (7.24, 7.29) \\8087Weighted & 7.35 & 0.014 & (7.32, 7.37) \\8088\hline8089\end{tabular}8090\end{center}80918092%mean 7.265807895188093%stderr 0.01416835277928094%ci (7.2428565501217079, 7.2890814917127074)8095%mean 7.347780347188096%stderr 0.01427389723198097%ci (7.3232804012858885, 7.3704916897506925)80988099In this example, the effect of weighting is small but non-negligible.8100The difference in estimated means, with and without weighting, is8101about 0.08 pounds, or 1.3 ounces. This difference is substantially8102larger than the standard error of the estimate, 0.014 pounds, which8103implies that the difference is not due to chance.8104\index{standard error}8105\index{confidence interval}810681078108\section{Exercises}81098110A solution to this exercise is in \verb"chap10soln.ipynb"81118112\begin{exercise}81138114Using the data from the BRFSS, compute the linear least squares8115fit for log(weight) versus height.8116How would you best present the estimated parameters for a model8117like this where one of the variables is log-transformed?8118If you were trying to guess8119someone's weight, how much would it help to know their height?8120\index{Behavioral Risk Factor Surveillance System}8121\index{BRFSS}8122\index{model}81238124Like the NSFG, the BRFSS oversamples some groups and provides8125a sampling weight for each respondent. 
In the BRFSS data, the variable8126name for these weights is {\tt finalwt}.8127Use resampling, with and without weights, to estimate the mean height8128of respondents in the BRFSS, the standard error of the mean, and a812990\% confidence interval. How much does correct weighting affect the8130estimates?8131\index{confidence interval}8132\index{standard error}8133\index{oversampling}8134\index{sampling weight}8135\end{exercise}813681378138\section{Glossary}81398140\begin{itemize}81418142\item linear fit: a line intended to model the relationship between8143variables. \index{linear fit}81448145\item least squares fit: A model of a dataset that minimizes the8146sum of squares of the residuals.8147\index{least squares fit}81488149\item residual: The deviation of an actual value from a model.8150\index{residuals}81518152\item goodness of fit: A measure of how well a model fits data.8153\index{goodness of fit}81548155\item coefficient of determination: A statistic intended to8156quantify goodness of fit.8157\index{coefficient of determination}81588159\item sampling weight: A value associated with an observation in a8160sample that indicates what part of the population it represents.8161\index{sampling weight}81628163\end{itemize}8164816581668167\chapter{Regression}8168\label{regression}81698170The linear least squares fit in the previous chapter is an example of8171{\bf regression}, which is the more general problem of fitting any8172kind of model to any kind of data. This use of the term ``regression''8173is a historical accident; it is only indirectly related to the8174original meaning of the word.8175\index{model}8176\index{regression}81778178The goal of regression analysis is to describe the relationship8179between one set of variables, called the {\bf dependent variables},8180and another set of variables, called independent or {\bf8181explanatory variables}.8182\index{explanatory variable}8183\index{dependent variable}81848185In the previous chapter we used mother's age as an explanatory8186variable to predict birth weight as a dependent variable. When there8187is only one dependent and one explanatory variable, that's {\bf8188simple regression}. In this chapter, we move on to {\bf multiple8189regression}, with more than one explanatory variable. If there is8190more than one dependent variable, that's multivariate8191regression.8192\index{birth weight}8193\index{weight!birth}8194\index{simple regression}8195\index{multiple regression}81968197If the relationship between the dependent and explanatory variable8198is linear, that's {\bf linear regression}. For example,8199if the dependent variable is $y$ and the explanatory variables8200are $x_1$ and $x_2$, we would write the following linear8201regression model:8202%8203\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]8204%8205where $\beta_0$ is the intercept, $\beta_1$ is the parameter8206associated with $x_1$, $\beta_2$ is the parameter associated with8207$x_2$, and $\eps$ is the residual due to random variation or other8208unknown factors.8209\index{regression model}8210\index{linear regression}82118212Given a sequence of values for $y$ and sequences for $x_1$ and $x_2$,8213we can find the parameters, $\beta_0$, $\beta_1$, and $\beta_2$, that8214minimize the sum of $\eps^2$. This process is called8215{\bf ordinary least squares}. The computation is similar to {\tt8216thinkstats2.LeastSquare}, but generalized to deal with more than one8217explanatory variable. 
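Concretely, with two explanatory variables, ordinary least squares
chooses the parameters that minimize
%
\[ \sum_i \left(y_i - \beta_0 - \beta_1 x_{1,i} - \beta_2 x_{2,i}\right)^2 \]
%
which is the sum of the squared residuals.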
You can find the details at8218\url{https://en.wikipedia.org/wiki/Ordinary_least_squares}8219\index{explanatory variable}8220\index{ordinary least squares}8221\index{parameter}82228223The code for this chapter is in {\tt regression.py}. For information8224about downloading and working with this code, see Section~\ref{code}.82258226\section{StatsModels}8227\label{statsmodels}82288229In the previous chapter I presented {\tt thinkstats2.LeastSquares}, an8230implementation of simple linear regression intended to be easy to8231read. For multiple regression we'll switch to StatsModels, a Python8232package that provides several forms of regression and other8233analyses. If you are using Anaconda, you already have StatsModels;8234otherwise you might have to install it.8235\index{Anaconda}82368237As an example, I'll run the model from the previous chapter with8238StatsModels:8239\index{StatsModels}8240\index{model}82418242\begin{verbatim}8243import statsmodels.formula.api as smf82448245live, firsts, others = first.MakeFrames()8246formula = 'totalwgt_lb ~ agepreg'8247model = smf.ols(formula, data=live)8248results = model.fit()8249\end{verbatim}82508251{\tt statsmodels} provides two interfaces (APIs); the ``formula''8252API uses strings to identify the dependent and explanatory variables.8253It uses a syntax called {\tt patsy}; in this example, the \verb"~"8254operator separates the dependent variable on the left from the8255explanatory variables on the right.8256\index{explanatory variable}8257\index{dependent variable}8258\index{Patsy}82598260{\tt smf.ols} takes the formula string and the DataFrame, {\tt live},8261and returns an OLS object that represents the model. The name {\tt ols}8262stands for ``ordinary least squares.''8263\index{DataFrame}8264\index{model}8265\index{ordinary least squares}82668267The {\tt fit} method fits the model to the data and returns a8268RegressionResults object that contains the results.8269\index{RegressionResults}82708271The results are also available as attributes. {\tt params}8272is a Series that maps from variable names to their parameters, so we can8273get the intercept and slope like this:8274\index{Series}82758276\begin{verbatim}8277inter = results.params['Intercept']8278slope = results.params['agepreg']8279\end{verbatim}82808281The estimated parameters are 6.83 and 0.0175, the same as8282from {\tt LeastSquares}.8283\index{parameter}82848285{\tt pvalues} is a Series that maps from variable names to the associated8286p-values, so we can check whether the estimated slope is statistically8287significant:8288\index{p-value}8289\index{significant} \index{statistically significant}82908291\begin{verbatim}8292slope_pvalue = results.pvalues['agepreg']8293\end{verbatim}82948295The p-value associated with {\tt agepreg} is {\tt 5.7e-11}, which8296is less than $0.001$, as expected.8297\index{age}82988299{\tt results.rsquared} contains $R^2$, which is $0.0047$. 
{\tt8300results} also provides \verb"f_pvalue", which is the p-value8301associated with the model as a whole, similar to testing whether $R^2$8302is statistically significant.8303\index{model}8304\index{coefficient of determination}8305\index{r-squared}83068307And {\tt results} provides {\tt resid}, a sequence of residuals, and8308{\tt fittedvalues}, a sequence of fitted values corresponding to8309{\tt agepreg}.8310\index{residuals}83118312The results object provides {\tt summary()}, which8313represents the results in a readable format.83148315\begin{verbatim}8316print(results.summary())8317\end{verbatim}83188319But it prints a lot of information that is not relevant (yet), so8320I use a simpler function called {\tt SummarizeResults}. Here are8321the results of this model:83228323\begin{verbatim}8324Intercept 6.83 (0)8325agepreg 0.0175 (5.72e-11)8326R^2 0.0047388327Std(ys) 1.4088328Std(res) 1.4058329\end{verbatim}83308331{\tt Std(ys)} is the standard deviation of the dependent variable,8332which is the RMSE if you have to guess birth weights without the benefit of8333any explanatory variables. {\tt Std(res)} is the standard deviation8334of the residuals, which is the RMSE if your guesses are informed8335by the mother's age. As we have already seen, knowing the mother's8336age provides no substantial improvement to the predictions.8337\index{standard deviation}8338\index{birth weight}8339\index{weight!birth}8340\index{explanatory variable}8341\index{dependent variable}8342\index{RMSE}8343\index{predictive power}834483458346\section{Multiple regression}8347\label{multiple}83488349In Section~\ref{birth_weights} we saw that first babies tend to be8350lighter than others, and this effect is statistically significant.8351But it is a strange result because there is no obvious mechanism that8352would cause first babies to be lighter. So we might wonder whether8353this relationship is {\bf spurious}.8354\index{multiple regression}8355\index{spurious relationship}83568357In fact, there is a possible explanation for this effect. We have8358seen that birth weight depends on mother's age, and we might expect8359that mothers of first babies are younger than others.8360\index{weight}8361\index{age}83628363With a few calculations we can check whether this explanation8364is plausible. Then we'll use multiple regression to investigate8365more carefully. First, let's see how big the difference in weight8366is:83678368\begin{verbatim}8369diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()8370\end{verbatim}83718372First babies are 0.125 lbs lighter, or 2 ounces. And the difference8373in ages:83748375\begin{verbatim}8376diff_age = firsts.agepreg.mean() - others.agepreg.mean()8377\end{verbatim}83788379The mothers of first babies are 3.59 years younger. Running the8380linear model again, we get the change in birth weight as a function8381of age:8382\index{birth weight}8383\index{weight!birth}83848385\begin{verbatim}8386results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()8387slope = results.params['agepreg']8388\end{verbatim}83898390The slope is 0.0175 pounds per year. 
If we multiply the slope by
the difference in ages, we get the expected difference in birth
weight for first babies and others, due to mother's age:

\begin{verbatim}
slope * diff_age
\end{verbatim}

The result is 0.063, just about half of the observed difference.
So we conclude, tentatively, that the observed difference in birth
weight can be partly explained by the difference in mother's age.

Using multiple regression, we can explore these relationships
more systematically.
\index{multiple regression}

\begin{verbatim}
live['isfirst'] = live.birthord == 1
formula = 'totalwgt_lb ~ isfirst'
results = smf.ols(formula, data=live).fit()
\end{verbatim}

The first line creates a new column named {\tt isfirst} that is
True for first babies and False otherwise. Then we fit a model
using {\tt isfirst} as an explanatory variable.
\index{model}
\index{explanatory variable}

Here are the results:

\begin{verbatim}
Intercept        7.33    (0)
isfirst[T.True]  -0.125  (2.55e-05)
R^2 0.00196
\end{verbatim}

Because {\tt isfirst} is a boolean, {\tt ols} treats it as a
{\bf categorical variable}, which means that the values fall
into categories, like True and False, and should not be treated
as numbers. The estimated parameter is the effect on birth
weight when {\tt isfirst} is true, so the result,
-0.125 lbs, is the difference in
birth weight between first babies and others.
\index{birth weight}
\index{weight!birth}
\index{categorical variable}
\index{boolean}

The slope and the intercept are statistically significant,
which means that they were unlikely to occur by chance, but
the $R^2$ value for this model is small, which means that
{\tt isfirst} doesn't account for a substantial part of the
variation in birth weight.
\index{coefficient of determination}
\index{r-squared}

The results are similar with {\tt agepreg}:

\begin{verbatim}
Intercept        6.83    (0)
agepreg          0.0175  (5.72e-11)
R^2 0.004738
\end{verbatim}

Again, the parameters are statistically significant, but
$R^2$ is low.
\index{coefficient of determination}
\index{r-squared}

These models confirm results we have already seen. But now we
can fit a single model that includes both variables. With the
formula \verb"totalwgt_lb ~ isfirst + agepreg", we get:

\begin{verbatim}
Intercept        6.91    (0)
isfirst[T.True]  -0.0698 (0.0253)
agepreg          0.0154  (3.93e-08)
R^2 0.005289
\end{verbatim}

In the combined model, the parameter for {\tt isfirst} is smaller
by about half, which means that part of the apparent effect of
{\tt isfirst} is actually accounted for by {\tt agepreg}. And
the p-value for {\tt isfirst} is about 2.5\%, which is on the
border of statistical significance.
\index{p-value}
\index{model}

$R^2$ for this model is a little higher, which indicates that the
two variables together account for more variation in birth weight
than either alone (but not by much).
\index{birth weight}
\index{weight!birth}
\index{coefficient of determination}
\index{r-squared}


\section{Nonlinear relationships}
\label{nonlinear}

Remembering that the contribution of {\tt agepreg} might be nonlinear,
we might consider adding a variable to capture more of this
relationship.
One option is to create a column, {\tt agepreg2},
that contains the squares of the ages:
\index{nonlinear}

\begin{verbatim}
live['agepreg2'] = live.agepreg**2
formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
\end{verbatim}

Now by estimating parameters for {\tt agepreg} and {\tt agepreg2},
we are effectively fitting a parabola:

\begin{verbatim}
Intercept        5.69      (1.38e-86)
isfirst[T.True]  -0.0504   (0.109)
agepreg          0.112     (3.23e-07)
agepreg2         -0.00185  (8.8e-06)
R^2 0.007462
\end{verbatim}

The parameter of {\tt agepreg2} is negative, so the parabola
curves downward, which is consistent with the shape of the lines
in Figure~\ref{linear2}.
\index{parabola}

The quadratic model of {\tt agepreg} accounts for more of the
variability in birth weight; the parameter for {\tt isfirst}
is smaller in this model, and no longer statistically significant.
\index{birth weight}
\index{weight!birth}
\index{quadratic model}
\index{model}
\index{significant} \index{statistically significant}

Using computed variables like {\tt agepreg2} is a common way to
fit polynomials and other functions to data.
This process is still considered linear
regression, because the dependent variable is a linear function of
the explanatory variables, regardless of whether some variables
are nonlinear functions of others.
\index{explanatory variable}
\index{dependent variable}
\index{nonlinear}

The following table summarizes the results of these regressions:

\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline & isfirst & agepreg & agepreg2 & $R^2$ \\ \hline
Model 1 & -0.125 * & -- & -- & 0.002 \\
Model 2 & -- & 0.0175 * & -- & 0.0047 \\
Model 3 & -0.0698 (0.025) & 0.0154 * & -- & 0.0053 \\
Model 4 & -0.0504 (0.11) & 0.112 * & -0.00185 * & 0.0075 \\
\hline
\end{tabular}
\end{center}

The columns in this table are the explanatory variables and
the coefficient of determination, $R^2$. Each entry is an estimated
parameter and either a p-value in parentheses or an asterisk to
indicate a p-value less than 0.001.
\index{p-value}
\index{coefficient of determination}
\index{r-squared}
\index{explanatory variable}

We conclude that the apparent difference in birth weight
is explained, at least in part, by the difference in mother's age.
When we include mother's age in the model, the effect of
{\tt isfirst} gets smaller, and the remaining effect might be
due to chance.
\index{age}

In this example, mother's age acts as a {\bf control variable};
including {\tt agepreg} in the model ``controls for'' the
difference in age between first-time mothers and others, making
it possible to isolate the effect (if any) of {\tt isfirst}.
\index{control variable}


\section{Data mining}
\label{mining}

So far we have used regression models for explanation; for example,
in the previous section we discovered that an apparent difference
in birth weight is largely explained by a difference in mother's age.
But the $R^2$ values of those models are very low, which means that
they have little predictive power.
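One way to put ``little predictive power'' in concrete terms: with the
definition of $R^2$ used here, the standard deviation of the residuals is
related to the standard deviation of the dependent variable by
%
\[ \mathrm{Std}(\res) = \mathrm{Std}(\ys) \sqrt{1 - R^2} \]
%
With $R^2 = 0.0053$ for the model that includes {\tt isfirst} and
{\tt agepreg}, the factor is $\sqrt{1 - 0.0053} \approx 0.997$, so knowing
both variables reduces the RMSE of our guesses by less than 0.3\%.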
In this section we'll try to
do better.
\index{birth weight}
\index{weight!birth}
\index{regression model}
\index{coefficient of determination}
\index{r-squared}

Suppose one of your co-workers is expecting a baby and
there is an office pool to guess the baby's birth weight (if you are
not familiar with betting pools, see
\url{https://en.wikipedia.org/wiki/Betting_pool}).
\index{betting pool}

Now suppose that you {\em really\/} want to win the pool. What could
you do to improve your chances? Well,
the NSFG dataset includes 244 variables about each pregnancy and another
3087 variables about each respondent. Maybe some of those variables
have predictive power. To find out which ones are most useful,
why not try them all?
\index{NSFG}

Testing the variables in the pregnancy table is easy, but in order to
use the variables in the respondent table, we have to match up each
pregnancy with a respondent. In theory we could iterate through the
rows of the pregnancy table, use the {\tt caseid} to find the
corresponding respondent, and copy the values from the
respondent table into the pregnancy table. But that would be slow.
\index{join}
\index{SQL}

A better option is to recognize this process as a {\bf join} operation
as defined in SQL and other relational database languages (see
\url{https://en.wikipedia.org/wiki/Join_(SQL)}). Join is implemented
as a DataFrame method, so we can perform the operation like this:
\index{DataFrame}

\begin{verbatim}
live = live[live.prglngth>30]
resp = chap01soln.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')
\end{verbatim}

The first line selects records for pregnancies longer than 30 weeks,
assuming that the office pool is formed several weeks before the
due date.
\index{betting pool}

The next line reads the respondent file. The result is a DataFrame
with integer indices; in order to look up respondents efficiently,
I replace {\tt resp.index} with {\tt resp.caseid}.

The {\tt join} method is invoked on {\tt live}, which is considered
the ``left'' table, and passed {\tt resp}, which is the ``right'' table.
The keyword argument {\tt on} indicates the variable used to match up
rows from the two tables.

In this example some column names appear in both tables,
so we have to provide {\tt rsuffix}, which is a string that will be
appended to the names of overlapping columns from the right table.
For example, both tables have a column named {\tt race} that encodes
the race of the respondent. The result of the join contains two
columns named {\tt race} and \verb"race_r".
\index{race}

The pandas implementation is fast. Joining the NSFG tables takes
less than a second on an ordinary desktop computer.
Now we can start testing variables.
\index{pandas}
\index{join}

\begin{verbatim}
t = []
for name in join.columns:
    try:
        if join[name].var() < 1e-7:
            continue

        formula = 'totalwgt_lb ~ agepreg + ' + name
        model = smf.ols(formula, data=join)
        if model.nobs < len(join)/2:
            continue

        results = model.fit()
    except (ValueError, TypeError):
        continue

    t.append((results.rsquared, name))
\end{verbatim}

For each variable we construct a model, compute $R^2$, and append
the results to a list.
The models all include {\tt agepreg}, since8672we already know that it has some predictive power.8673\index{model}8674\index{coefficient of determination}8675\index{r-squared}86768677I check that each explanatory variable has some variability; otherwise8678the results of the regression are unreliable. I also check the number8679of observations for each model. Variables that contain a large number8680of {\tt nan}s are not good candidates for prediction.8681\index{explanatory variable}8682\index{NaN}86838684For most of these variables, we haven't done any cleaning. Some of them8685are encoded in ways that don't work very well for linear regression.8686As a result, we might overlook some variables that would be useful if8687they were cleaned properly. But maybe we will find some good candidates.8688\index{cleaning}868986908691\section{Prediction}86928693The next step is to sort the results and select the variables that8694yield the highest values of $R^2$.8695\index{prediction}86968697\begin{verbatim}8698t.sort(reverse=True)8699for mse, name in t[:30]:8700print(name, mse)8701\end{verbatim}87028703The first variable on the list is \verb"totalwgt_lb",8704followed by \verb"birthwgt_lb". Obviously, we can't use birth8705weight to predict birth weight.8706\index{birth weight}8707\index{weight!birth}87088709Similarly {\tt prglngth} has useful predictive power, but for the8710office pool we assume pregnancy length (and the related variables)8711are not known yet.8712\index{predictive power}8713\index{pregnancy length}87148715The first useful predictive variable is {\tt babysex} which indicates8716whether the baby is male or female. In the NSFG dataset, boys are8717about 0.3 lbs heavier. So, assuming that the sex of the baby is8718known, we can use it for prediction.8719\index{sex}87208721Next is {\tt race}, which indicates whether the respondent is white,8722black, or other. As an explanatory variable, race can be problematic.8723In datasets like the NSFG, race is correlated with many other8724variables, including income and other socioeconomic factors. In a8725regression model, race acts as a {\bf proxy variable},8726so apparent correlations with race are often caused, at least in8727part, by other factors.8728\index{explanatory variable}8729\index{race}87308731The next variable on the list is {\tt nbrnaliv}, which indicates8732whether the pregnancy yielded multiple births. Twins and triplets8733tend to be smaller than other babies, so if we know whether our8734hypothetical co-worker is expecting twins, that would help.8735\index{multiple birth}87368737Next on the list is {\tt paydu}, which indicates whether the8738respondent owns her home. It is one of several income-related8739variables that turn out to be predictive. In datasets like the NSFG,8740income and wealth are correlated with just about everything. In this8741example, income is related to diet, health, health care, and other8742factors likely to affect birth weight.8743\index{birth weight}8744\index{weight!birth}8745\index{income}8746\index{wealth}87478748Some of the other variables on the list are things that would not8749be known until later, like {\tt bfeedwks}, the number of weeks8750the baby was breast fed. We can't use these variables for prediction,8751but you might want to speculate on reasons8752{\tt bfeedwks} might be correlated with birth weight.87538754Sometimes you start with a theory and use data to test it. 
Other8755times you start with data and go looking for possible theories.8756The second approach, which this section demonstrates, is8757called {\bf data mining}. An advantage of data mining is that it8758can discover unexpected patterns. A hazard is that many of the8759patterns it discovers are either random or spurious.8760\index{theory}8761\index{data mining}87628763Having identified potential explanatory variables, I tested a few8764models and settled on this one:8765\index{model}8766\index{explanatory variable}87678768\begin{verbatim}8769formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '8770'nbrnaliv>1 + paydu==1 + totincr')8771results = smf.ols(formula, data=join).fit()8772\end{verbatim}87738774This formula uses some syntax we have not seen yet:8775{\tt C(race)} tells the formula parser (Patsy) to treat race as a8776categorical variable, even though it is encoded numerically.8777\index{Patsy}8778\index{categorical variable}87798780The encoding for {\tt babysex} is 1 for male, 2 for female; writing8781{\tt babysex==1} converts it to boolean, True for male and false for8782female.8783\index{boolean}87848785Similarly {\tt nbrnaliv>1} is True for multiple births and8786{\tt paydu==1} is True for respondents who own their houses.87878788{\tt totincr} is encoded numerically from 1-14, with each increment8789representing about \$5000 in annual income. So we can treat these8790values as numerical, expressed in units of \$5000.8791\index{income}87928793Here are the results of the model:87948795\begin{verbatim}8796Intercept 6.63 (0)8797C(race)[T.2] 0.357 (5.43e-29)8798C(race)[T.3] 0.266 (2.33e-07)8799babysex == 1[T.True] 0.295 (5.39e-29)8800nbrnaliv > 1[T.True] -1.38 (5.1e-37)8801paydu == 1[T.True] 0.12 (0.000114)8802agepreg 0.00741 (0.0035)8803totincr 0.0122 (0.00188)8804\end{verbatim}88058806The estimated parameters for race are larger than I expected,8807especially since we control for income. The encoding8808is 1 for black, 2 for white, and 3 for other. Babies of black8809mothers are lighter than babies of other races by 0.27--0.36 lbs.8810\index{control variable}8811\index{race}88128813As we've already seen, boys are heavier by about 0.3 lbs;8814twins and other multiplets are lighter by 1.4 lbs.8815\index{weight}88168817People who own their homes have heavier babies by about 0.12 lbs,8818even when we control for income. The parameter for mother's8819age is smaller than what we saw in Section~\ref{multiple}, which8820suggests that some of the other variables are correlated with8821age, probably including {\tt paydu} and {\tt totincr}.8822\index{income}88238824All of these variables are statistically significant, some with8825very low p-values, but8826$R^2$ is only 0.06, still quite small.8827RMSE without using the model is 1.27 lbs; with the model it drops8828to 1.23. So your chance of winning the pool is not substantially8829improved. Sorry!8830\index{p-value}8831\index{model}8832\index{coefficient of determination}8833\index{r-squared}8834\index{significant} \index{statistically significant}8835883688378838\section{Logistic regression}88398840In the previous examples, some of the explanatory variables were8841numerical and some categorical (including boolean). But the dependent8842variable was always numerical.8843\index{explanatory variable}8844\index{dependent variable}8845\index{categorical variable}88468847Linear regression can be generalized to handle other kinds of8848dependent variables. 
If the dependent variable is boolean, the8849generalized model is called {\bf logistic regression}. If the dependent8850variable is an integer count, it's called {\bf Poisson8851regression}.8852\index{model}8853\index{logistic regression}8854\index{Poisson regression}8855\index{boolean}88568857As an example of logistic regression, let's consider a variation8858on the office pool scenario.8859Suppose8860a friend of yours is pregnant and you want to predict whether the8861baby is a boy or a girl. You could use data from the NSFG to find8862factors that affect the ``sex ratio'', which is conventionally8863defined to be the probability8864of having a boy.8865\index{betting pool}8866\index{sex}88678868If you encode the dependent variable numerically, for example 0 for a8869girl and 1 for a boy, you could apply ordinary least squares, but8870there would be problems. The linear model might be something like8871this:8872%8873\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]8874%8875Where $y$ is the dependent variable, and $x_1$ and $x_2$ are8876explanatory variables. Then we could find the parameters that8877minimize the residuals.8878\index{regression model}8879\index{explanatory variable}8880\index{dependent variable}8881\index{ordinary least squares}88828883The problem with this approach is that it produces predictions that8884are hard to interpret. Given estimated parameters and values for8885$x_1$ and $x_2$, the model might predict $y=0.5$, but the only8886meaningful values of $y$ are 0 and 1.8887\index{parameter}88888889It is tempting to interpret a result like that as a probability; for8890example, we might say that a respondent with particular values of8891$x_1$ and $x_2$ has a 50\% chance of having a boy. But it is also8892possible for this model to predict $y=1.1$ or $y=-0.1$, and those8893are not valid probabilities.8894\index{probability}88958896Logistic regression avoids this problem by expressing predictions in8897terms of {\bf odds} rather than probabilities. If you are not8898familiar with odds, ``odds in favor'' of an event is the ratio of the8899probability it will occur to the probability that it will not.8900\index{odds}89018902So if I think my team has a 75\% chance of winning, I would8903say that the odds in their favor are three to one, because8904the chance of winning is three times the chance of losing.89058906Odds and probabilities are different representations of the same8907information. Given a probability, you can compute the odds like this:89088909\begin{verbatim}8910o = p / (1-p)8911\end{verbatim}89128913Given odds in favor, you can convert to8914probability like this:89158916\begin{verbatim}8917p = o / (o+1)8918\end{verbatim}89198920Logistic regression is based on the following model:8921%8922\[ \log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]8923%8924Where $o$ is the odds in favor of a particular outcome; in the8925example, $o$ would be the odds of having a boy.8926\index{regression model}89278928Suppose we have estimated the parameters $\beta_0$, $\beta_1$, and8929$\beta_2$ (I'll explain how in a minute). And suppose we are given8930values for $x_1$ and $x_2$. We can compute the predicted value of8931$\log o$, and then convert to a probability:89328933\begin{verbatim}8934o = np.exp(log_o)8935p = o / (o+1)8936\end{verbatim}89378938So in the office pool scenario we could compute the predictive8939probability of having a boy. 
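As a minimal sketch with made-up parameter values (not estimates from the
NSFG), the whole chain from explanatory values to a predicted probability
looks like this:

\begin{verbatim}
import numpy as np

# made-up parameters and explanatory values
beta = [-0.1, 0.2, 0.05]     # beta0, beta1, beta2
x1, x2 = 1, 3

# predicted log odds, then odds, then probability
log_o = beta[0] + beta[1] * x1 + beta[2] * x2
o = np.exp(log_o)
p = o / (o + 1)              # about 0.56
\end{verbatim}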
But how do we estimate the parameters?8940\index{parameter}894189428943\section{Estimating parameters}89448945Unlike linear regression, logistic regression does not have a8946closed form solution, so it is solved by guessing an initial8947solution and improving it iteratively.8948\index{logistic regression}8949\index{closed form}89508951The usual goal is to find the maximum-likelihood estimate (MLE),8952which is the set of parameters that maximizes the likelihood of the8953data. For example, suppose we have the following data:8954\index{MLE}8955\index{maximum likelihood estimator}89568957\begin{verbatim}8958>>> y = np.array([0, 1, 0, 1])8959>>> x1 = np.array([0, 0, 0, 1])8960>>> x2 = np.array([0, 1, 1, 1])8961\end{verbatim}89628963And we start with the initial guesses $\beta_0=-1.5$, $\beta_1=2.8$,8964and $\beta_2=1.1$:89658966\begin{verbatim}8967>>> beta = [-1.5, 2.8, 1.1]8968\end{verbatim}89698970Then for each row we can compute \verb"log_o":89718972\begin{verbatim}8973>>> log_o = beta[0] + beta[1] * x1 + beta[2] * x28974[-1.5 -0.4 -0.4 2.4]8975\end{verbatim}89768977And convert from log odds to probabilities:8978\index{log odds}89798980\begin{verbatim}8981>>> o = np.exp(log_o)8982[ 0.223 0.670 0.670 11.02 ]89838984>>> p = o / (o+1)8985[ 0.182 0.401 0.401 0.916 ]8986\end{verbatim}89878988Notice that when \verb"log_o" is greater than 0, {\tt o}8989is greater than 1 and {\tt p} is greater than 0.5.89908991The likelihood of an outcome is {\tt p} when {\tt y==1} and {\tt 1-p}8992when {\tt y==0}. For example, if we think the probability of a boy is89930.8 and the outcome is a boy, the likelihood is 0.8; if8994the outcome is a girl, the likelihood is 0.2. We can compute that8995like this:8996\index{likelihood}89978998\begin{verbatim}8999>>> likes = y * p + (1-y) * (1-p)9000[ 0.817 0.401 0.598 0.916 ]9001\end{verbatim}90029003The overall likelihood of the data is the product of {\tt likes}:90049005\begin{verbatim}9006>>> like = np.prod(likes)90070.189008\end{verbatim}90099010For these values of {\tt beta}, the likelihood of the data is 0.18.9011The goal of logistic regression is to find parameters that maximize9012this likelihood. To do that, most statistics packages use an9013iterative solver like Newton's method (see9014\url{https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting}).9015\index{Newton's method}9016\index{iterative solver}901790189019\section{Implementation}9020\label{implementation}90219022StatsModels provides an implementation of logistic regression9023called {\tt logit}, named for the function that converts from9024probability to log odds. To demonstrate its use, I'll look for9025variables that affect the sex ratio.9026\index{StatsModels}9027\index{sex ratio}9028\index{logit function}90299030Again, I load the NSFG data and select pregnancies longer than903130 weeks:90329033\begin{verbatim}9034live, firsts, others = first.MakeFrames()9035df = live[live.prglngth>30]9036\end{verbatim}90379038{\tt logit} requires the dependent variable to be binary (rather than9039boolean), so I create a new column named {\tt boy}, using {\tt9040astype(int)} to convert to binary integers:9041\index{dependent variable}9042\index{boolean}9043\index{binary}90449045\begin{verbatim}9046df['boy'] = (df.babysex==1).astype(int)9047\end{verbatim}90489049Factors that have been found to affect sex ratio include parents'9050age, birth order, race, and social status. We can use logistic9051regression to see if these effects appear in the NSFG data. 
I'll9052start with the mother's age:9053\index{age}9054\index{race}90559056\begin{verbatim}9057import statsmodels.formula.api as smf90589059model = smf.logit('boy ~ agepreg', data=df)9060results = model.fit()9061SummarizeResults(results)9062\end{verbatim}90639064{\tt logit} takes the same arguments as {\tt ols}, a formula9065in Patsy syntax and a DataFrame. The result is a Logit object9066that represents the model. It contains attributes called9067{\tt endog} and {\tt exog} that contain the {\bf endogenous9068variable}, another name for the dependent variable,9069and the {\bf exogenous variables}, another name for the9070explanatory variables. Since they are NumPy arrays, it is9071sometimes convenient to convert them to DataFrames:9072\index{NumPy}9073\index{pandas}9074\index{DataFrame}9075\index{explanatory variable}9076\index{dependent variable}9077\index{exogenous variable}9078\index{endogenous variable}9079\index{Patsy}90809081\begin{verbatim}9082endog = pandas.DataFrame(model.endog, columns=[model.endog_names])9083exog = pandas.DataFrame(model.exog, columns=model.exog_names)9084\end{verbatim}90859086The result of {\tt model.fit} is a BinaryResults object, which is9087similar to the RegressionResults object we got from {\tt ols}.9088Here is a summary of the results:90899090\begin{verbatim}9091Intercept 0.00579 (0.953)9092agepreg 0.00105 (0.783)9093R^2 6.144e-069094\end{verbatim}90959096The parameter of {\tt agepreg} is positive, which suggests that9097older mothers are more likely to have boys, but the p-value is90980.783, which means that the apparent effect could easily be due9099to chance.9100\index{p-value}9101\index{age}91029103The coefficient of determination, $R^2$, does not apply to logistic9104regression, but there are several alternatives that are used9105as ``pseudo $R^2$ values.'' These values can be useful for comparing9106models. For example, here's a model that includes several factors9107believed to be associated with sex ratio:9108\index{model}9109\index{coefficient of determination}9110\index{r-squared}9111\index{pseudo r-squared}91129113\begin{verbatim}9114formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'9115model = smf.logit(formula, data=df)9116results = model.fit()9117\end{verbatim}91189119Along with mother's age, this model includes father's age at9120birth ({\tt hpagelb}), birth order ({\tt birthord}), and9121race as a categorical variable. Here are the results:9122\index{categorical variable}91239124\begin{verbatim}9125Intercept -0.0301 (0.772)9126C(race)[T.2] -0.0224 (0.66)9127C(race)[T.3] -0.000457 (0.996)9128agepreg -0.00267 (0.629)9129hpagelb 0.0047 (0.266)9130birthord 0.00501 (0.821)9131R^2 0.0001449132\end{verbatim}91339134None of the estimated parameters are statistically significant. The9135pseudo-$R^2$ value is a little higher, but that could be due to9136chance.9137\index{pseudo r-squared}9138\index{significant} \index{statistically significant}913991409141\section{Accuracy}9142\label{accuracy}91439144In the office pool scenario,9145we are most interested in the accuracy of the model:9146the number of successful predictions, compared with what we would9147expect by chance.9148\index{model}9149\index{accuracy}91509151In the NSFG data, there are more boys than girls, so the baseline9152strategy is to guess ``boy'' every time. 
The accuracy of this9153strategy is just the fraction of boys:91549155\begin{verbatim}9156actual = endog['boy']9157baseline = actual.mean()9158\end{verbatim}91599160Since {\tt actual} is encoded in binary integers, the mean is the9161fraction of boys, which is 0.507.91629163Here's how we compute the accuracy of the model:91649165\begin{verbatim}9166predict = (results.predict() >= 0.5)9167true_pos = predict * actual9168true_neg = (1 - predict) * (1 - actual)9169\end{verbatim}91709171{\tt results.predict} returns a NumPy array of probabilities, which we9172round off to 0 or 1. Multiplying by {\tt actual}9173yields 1 if we predict a boy and get it right, 0 otherwise. So,9174\verb"true_pos" indicates ``true positives''.9175\index{NumPy}9176\index{true positive}9177\index{true negative}91789179Similarly, \verb"true_neg" indicates the cases where we guess ``girl''9180and get it right. Accuracy is the fraction of correct guesses:91819182\begin{verbatim}9183acc = (sum(true_pos) + sum(true_neg)) / len(actual)9184\end{verbatim}91859186The result is 0.512, slightly better than the9187baseline, 0.507. But, you should not take this result too seriously.9188We used the same data to build and test the model, so the model9189may not have predictive power on new data.9190\index{model}91919192Nevertheless, let's use the model to make a prediction for the office9193pool. Suppose your friend is 35 years old and white,9194her husband is 39, and they are expecting their third child:91959196\begin{verbatim}9197columns = ['agepreg', 'hpagelb', 'birthord', 'race']9198new = pandas.DataFrame([[35, 39, 3, 2]], columns=columns)9199y = results.predict(new)9200\end{verbatim}92019202To invoke {\tt results.predict} for a new case, you have to construct9203a DataFrame with a column for each variable in the model. The result9204in this case is 0.52, so you should guess ``boy.'' But if the model9205improves your chances of winning, the difference is very small.9206\index{DataFrame}9207920892099210\section{Exercises}92119212My solution to these exercises is in \verb"chap11soln.ipynb".92139214\begin{exercise}9215Suppose one of your co-workers is expecting a baby and you are9216participating in an office pool to predict the date of birth.9217Assuming that bets are placed during the 30th week of pregnancy, what9218variables could you use to make the best prediction? You should limit9219yourself to variables that are known before the birth, and likely to9220be available to the people in the pool.9221\index{betting pool}9222\index{date of birth}92239224\end{exercise}922592269227\begin{exercise}9228The Trivers-Willard hypothesis suggests that for many mammals the9229sex ratio depends on ``maternal condition''; that is,9230factors like the mother's age, size, health, and social status.9231See \url{https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis}9232\index{Trivers-Willard hypothesis}9233\index{sex ratio}92349235Some studies have shown this effect among humans, but results are9236mixed. In this chapter we tested some variables related to these9237factors, but didn't find any with a statistically significant effect9238on sex ratio.9239\index{significant} \index{statistically significant}92409241As an exercise, use a data mining approach to test the other variables9242in the pregnancy and respondent files. 
Can you find any factors with
a substantial effect?
\index{data mining}

\end{exercise}


\begin{exercise}
If the quantity you want to predict is a count, you can use Poisson
regression, which is implemented in StatsModels with a function called
{\tt poisson}. It works the same way as {\tt ols} and {\tt logit}.
As an exercise, let's use it to predict how many children a woman
has borne; in the NSFG dataset, this variable is called {\tt numbabes}.
\index{StatsModels}
\index{Poisson regression}

Suppose you meet a woman who is 35 years old, black, and a college
graduate whose annual household income exceeds \$75,000. How many
children would you predict she has borne?
\end{exercise}


\begin{exercise}
If the quantity you want to predict is categorical, you can use
multinomial logistic regression, which is implemented in StatsModels
with a function called {\tt mnlogit}. As an exercise, let's use it to
guess whether a woman is married, cohabitating, widowed, divorced,
separated, or never married; in the NSFG dataset, marital status is
encoded in a variable called {\tt rmarital}.
\index{categorical variable}
\index{marital status}

Suppose you meet a woman who is 25 years old, white, and a high
school graduate whose annual household income is about \$45,000.
What is the probability that she is married, cohabitating, etc.?
\end{exercise}



\section{Glossary}

\begin{itemize}

\item regression: One of several related processes for estimating parameters
that fit a model to data.
\index{regression}

\item dependent variables: The variables in a regression model we would
like to predict. Also known as endogenous variables.
\index{dependent variable}
\index{endogenous variable}

\item explanatory variables: The variables used to predict or explain
the dependent variables.
Also known as independent, or exogenous,9297variables.9298\index{explanatory variable}9299\index{exogenous variable}93009301\item simple regression: A regression with only one dependent and9302one explanatory variable.9303\index{simple regression}93049305\item multiple regression: A regression with multiple explanatory9306variables, but only one dependent variable.9307\index{multiple regression}93089309\item linear regression: A regression based on a linear model.9310\index{linear regression}93119312\item ordinary least squares: A linear regression that estimates9313parameters by minimizing the squared error of the residuals.9314\index{ordinary least squares}93159316\item spurious relationship: A relationship between two variables that is9317caused by a statistical artifact or a factor, not included in the9318model, that is related to both variables.9319\index{spurious relationship}93209321\item control variable: A variable included in a regression to9322eliminate or ``control for'' a spurious relationship.9323\index{control variable}93249325\item proxy variable: A variable that contributes information to9326a regression model indirectly because of a relationship with another9327factor, so it acts as a proxy for that factor.9328\index{proxy variable}93299330\item categorical variable: A variable that can have one of a9331discrete set of unordered values.9332\index{categorical variable}93339334\item join: An operation that combines data from two DataFrames9335using a key to match up rows in the two frames.9336\index{join}9337\index{DataFrame}93389339\item data mining: An approach to finding relationships between9340variables by testing a large number of models.9341\index{data mining}93429343\item logistic regression: A form of regression used when the9344dependent variable is boolean.9345\index{logistic regression}93469347\item Poisson regression: A form of regression used when the9348dependent variable is a non-negative integer, usually a count.9349\index{Poisson regression}93509351\item odds: An alternative way of representing a probability, $p$, as9352the ratio of the probability and its complement, $p / (1-p)$.9353\index{odds}93549355\end{itemize}9356935793589359\chapter{Time series analysis}93609361A {\bf time series} is a sequence of measurements from a system that9362varies in time. One famous example is the ``hockey stick graph'' that9363shows global average temperature over time (see9364\url{https://en.wikipedia.org/wiki/Hockey_stick_graph}).9365\index{time series}9366\index{hockey stick graph}93679368The example I work with in this chapter comes from Zachary M. Jones, a9369researcher in political science who studies the black market for9370cannabis in the U.S. (\url{http://zmjones.com/marijuana}). He9371collected data from a web site called ``Price of Weed'' that9372crowdsources market information by asking participants to report the9373price, quantity, quality, and location of cannabis transactions9374(\url{http://www.priceofweed.com/}). The goal of his project is to9375investigate the effect of policy decisions, like legalization, on9376markets. I find this project appealing because it is an example that9377uses data to address important political questions, like drug policy.9378\index{Price of Weed}9379\index{cannabis}93809381I hope you will9382find this chapter interesting, but I'll take this opportunity to9383reiterate the importance of maintaining a professional attitude to9384data analysis. 
Whether and which drugs should be illegal are9385important and difficult public policy questions; our decisions should9386be informed by accurate data reported honestly.9387\index{ethics}93889389The code for this chapter is in {\tt timeseries.py}. For information9390about downloading and working with this code, see Section~\ref{code}.939193929393\section{Importing and cleaning}93949395The data I downloaded from9396Mr. Jones's site is in the repository for this book.9397The following code reads it into a9398pandas DataFrame:9399\index{pandas}9400\index{DataFrame}94019402\begin{verbatim}9403transactions = pandas.read_csv('mj-clean.csv', parse_dates=[5])9404\end{verbatim}94059406\verb"parse_dates" tells \verb"read_csv" to interpret values in column 59407as dates and convert them to NumPy {\tt datetime64} objects.9408\index{NumPy}94099410The DataFrame has a row for each reported transaction and9411the following columns:94129413\begin{itemize}94149415\item city: string city name.94169417\item state: two-letter state abbreviation.94189419\item price: price paid in dollars.9420\index{price}94219422\item amount: quantity purchased in grams.94239424\item quality: high, medium, or low quality, as reported by the purchaser.94259426\item date: date of report, presumed to be shortly after date of purchase.94279428\item ppg: price per gram, in dollars.94299430\item state.name: string state name.94319432\item lat: approximate latitude of the transaction, based on city name.94339434\item lon: approximate longitude of the transaction.94359436\end{itemize}94379438Each transaction is an event in time, so we could treat this dataset9439as a time series. But the events are not equally spaced in time; the9440number of transactions reported each day varies from 0 to several9441hundred. Many methods used to analyze time series require the9442measurements to be equally spaced, or at least things are simpler if9443they are.9444\index{transaction}9445\index{equally spaced data}94469447In order to demonstrate these methods, I divide the dataset9448into groups by reported quality, and then transform each group into9449an equally spaced series by computing the mean daily price per gram.94509451\begin{verbatim}9452def GroupByQualityAndDay(transactions):9453groups = transactions.groupby('quality')9454dailies = {}9455for name, group in groups:9456dailies[name] = GroupByDay(group)94579458return dailies9459\end{verbatim}94609461{\tt groupby} is a DataFrame method that returns a GroupBy object,9462{\tt groups}; used in a for loop, it iterates the names of the groups9463and the DataFrames that represent them. Since the values of {\tt9464quality} are {\tt low}, {\tt medium}, and {\tt high}, we get three9465groups with those names. \index{DataFrame} \index{groupby}94669467The loop iterates through the groups and calls {\tt GroupByDay},9468which computes the daily average price and returns a new DataFrame:94699470\begin{verbatim}9471def GroupByDay(transactions, func=np.mean):9472grouped = transactions[['date', 'ppg']].groupby('date')9473daily = grouped.aggregate(func)94749475daily['date'] = daily.index9476start = daily.date[0]9477one_year = np.timedelta64(1, 'Y')9478daily['years'] = (daily.date - start) / one_year94799480return daily9481\end{verbatim}94829483The parameter, {\tt transactions}, is a DataFrame that contains9484columns {\tt date} and {\tt ppg}. 
We select these two9485columns, then group by {\tt date}.9486\index{groupby}94879488The result, {\tt grouped}, is a map from each date to a DataFrame that9489contains prices reported on that date. {\tt aggregate} is a9490GroupBy method that iterates through the groups and applies a9491function to each column of the group; in this case there is only one9492column, {\tt ppg}. So the result of {\tt aggregate} is a DataFrame9493with one row for each date and one column, {\tt ppg}.9494\index{aggregate}94959496Dates in these DataFrames are stored as NumPy {\tt datetime64}9497objects, which are represented as 64-bit integers in nanoseconds.9498For some of the analyses coming up, it will be convenient to9499work with time in more human-friendly units, like years. So9500{\tt GroupByDay} adds a column named {\tt date} by copying9501the {\tt index}, then adds {\tt years}, which contains the number9502of years since the first transaction as a floating-point number.9503\index{NumPy}9504\index{datetime64}95059506The resulting DataFrame has columns {\tt ppg}, {\tt date}, and9507{\tt years}.9508\index{DataFrame}950995109511\section{Plotting}95129513The result from {\tt GroupByQualityAndDay} is a map from each quality9514to a DataFrame of daily prices. Here's the code I use to plot9515the three time series:9516\index{DataFrame}9517\index{visualization}95189519\begin{verbatim}9520thinkplot.PrePlot(rows=3)9521for i, (name, daily) in enumerate(dailies.items()):9522thinkplot.SubPlot(i+1)9523title = 'price per gram ($)' if i==0 else ''9524thinkplot.Config(ylim=[0, 20], title=title)9525thinkplot.Scatter(daily.index, daily.ppg, s=10, label=name)9526if i == 2:9527pyplot.xticks(rotation=30)9528else:9529thinkplot.Config(xticks=[])9530\end{verbatim}95319532{\tt PrePlot} with {\tt rows=3} means that we are planning to9533make three subplots laid out in three rows. The loop iterates9534through the DataFrames and creates a scatter plot for each. It is9535common to plot time series with line segments between the points,9536but in this case there are many data points and prices are highly9537variable, so adding lines would not help.9538\index{thinkplot}95399540Since the labels on the x-axis are dates, I use {\tt pyplot.xticks}9541to rotate the ``ticks'' 30 degrees, making them more readable.9542\index{pyplot}9543\index{ticks}9544\index{xticks}95459546\begin{figure}9547% timeseries.py9548\centerline{\includegraphics[width=3.5in]{figs/timeseries1.pdf}}9549\caption{Time series of daily price per gram for high, medium, and low9550quality cannabis.}9551\label{timeseries1}9552\end{figure}95539554Figure~\ref{timeseries1} shows the result. One apparent feature in9555these plots is a gap around November 2013. It's possible that data9556collection was not active during this time, or the data might not9557be available. We will consider ways to deal with this missing data9558later.9559\index{missing values}95609561Visually, it looks like the price of high quality cannabis is9562declining during this period, and the price of medium quality is9563increasing. The price of low quality might also be increasing, but it9564is harder to tell, since it seems to be more volatile. 
Keep in mind9565that quality data is reported by volunteers, so trends over time9566might reflect changes in how participants apply these labels.9567\index{price}956895699570\section{Linear regression}9571\label{timeregress}95729573Although there are methods specific to time series analysis, for many9574problems a simple way to get started is by applying general-purpose9575tools like linear regression. The following function takes a9576DataFrame of daily prices and computes a least squares fit, returning9577the model and results objects from StatsModels:9578\index{DataFrame}9579\index{StatsModels}9580\index{linear regression}95819582\begin{verbatim}9583def RunLinearModel(daily):9584model = smf.ols('ppg ~ years', data=daily)9585results = model.fit()9586return model, results9587\end{verbatim}95889589Then we can iterate through the qualities and fit a model to9590each:95919592\begin{verbatim}9593for name, daily in dailies.items():9594model, results = RunLinearModel(daily)9595print(name)9596regression.SummarizeResults(results)9597\end{verbatim}95989599Here are the results:96009601\begin{center}9602\begin{tabular}{|l|l|l|c|} \hline9603quality & intercept & slope & $R^2$ \\ \hline9604high & 13.450 & -0.708 & 0.444 \\9605medium & 8.879 & 0.283 & 0.050 \\9606low & 5.362 & 0.568 & 0.030 \\9607\hline9608\end{tabular}9609\end{center}96109611The estimated slopes indicate that the price of high quality cannabis9612dropped by about 71 cents per year during the observed interval; for9613medium quality it increased by 28 cents per year, and for low quality9614it increased by 57 cents per year. These estimates are all9615statistically significant with very small p-values.9616\index{p-value}9617\index{significant} \index{statistically significant}96189619The $R^2$ value for high quality cannabis is 0.44, which means9620that time as an explanatory variable accounts for 44\% of the observed9621variability in price. For the other qualities, the change in price9622is smaller, and variability in prices is higher, so the values9623of $R^2$ are smaller (but still statistically significant).9624\index{explanatory variable}9625\index{significant} \index{statistically significant}96269627The following code plots the observed prices and the fitted values:96289629\begin{verbatim}9630def PlotFittedValues(model, results, label=''):9631years = model.exog[:,1]9632values = model.endog9633thinkplot.Scatter(years, values, s=15, label=label)9634thinkplot.Plot(years, results.fittedvalues, label='model')9635\end{verbatim}96369637As we saw in Section~\ref{implementation}, {\tt model} contains9638{\tt exog} and {\tt endog}, NumPy arrays with the exogenous9639(explanatory) and endogenous (dependent) variables.9640\index{NumPy}9641\index{explanatory variable}9642\index{dependent variable}9643\index{exogenous variable}9644\index{endogenous variable}96459646\begin{figure}9647% timeseries.py9648\centerline{\includegraphics[height=2.5in]{figs/timeseries2.pdf}}9649\caption{Time series of daily price per gram for high quality cannabis,9650and a linear least squares fit.}9651\label{timeseries2}9652\end{figure}96539654{\tt PlotFittedValues} makes a scatter plot of the data points and a line9655plot of the fitted values. Figure~\ref{timeseries2} shows the results9656for high quality cannabis. 
The model seems like a good linear fit9657for the data; nevertheless, linear regression is not the most9658appropriate choice for this data:9659\index{model}9660\index{fitted values}96619662\begin{itemize}96639664\item First, there is no reason to expect the long-term trend to be a9665line or any other simple function. In general, prices are9666determined by supply and demand, both of which vary over time in9667unpredictable ways.9668\index{trend}96699670\item Second, the linear regression model gives equal weight to all9671data, recent and past. For purposes of prediction, we should9672probably give more weight to recent data.9673\index{weight}96749675\item Finally, one of the assumptions of linear regression is that the9676residuals are uncorrelated noise. With time series data, this9677assumption is often false because successive values are correlated.9678\index{residuals}96799680\end{itemize}96819682The next section presents an alternative that is more appropriate9683for time series data.968496859686\section{Moving averages}96879688Most time series analysis is based on the modeling assumption that the9689observed series is the sum of three components:9690\index{model}9691\index{moving average}96929693\begin{itemize}96949695\item Trend: A smooth function that captures persistent changes.9696\index{trend}96979698\item Seasonality: Periodic variation, possibly including daily,9699weekly, monthly, or yearly cycles.9700\index{seasonality}97019702\item Noise: Random variation around the long-term trend.9703\index{noise}97049705\end{itemize}97069707Regression is one way to extract the trend from a series, as we9708saw in the previous section. But if the trend is not a simple9709function, a good alternative is a {\bf moving average}. A moving9710average divides the series into overlapping regions, called {\bf windows},9711and computes the average of the values in each window.9712\index{window}97139714One of the simplest moving averages is the {\bf rolling mean}, which9715computes the mean of the values in each window. For example, if9716the window size is 3, the rolling mean computes the mean of9717values 0 through 2, 1 through 3, 2 through 4, etc.9718\index{rolling mean}9719\index{mean!rolling}97209721pandas provides \verb"rolling_mean", which takes a Series and a9722window size and returns a new Series.9723\index{pandas}9724\index{Series}97259726\begin{verbatim}9727>>> series = np.arange(10)9728array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])97299730>>> pandas.rolling_mean(series, 3)9731array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])9732\end{verbatim}97339734The first two values are {\tt nan}; the next value is the mean of9735the first three elements, 0, 1, and 2. The next value is the mean9736of 1, 2, and 3. And so on.97379738Before we can apply \verb"rolling_mean" to the cannabis data, we9739have to deal with missing values. There are a few days in the9740observed interval with no reported transactions for one or more9741quality categories, and a period in 2013 when data collection was9742not active.9743\index{missing values}97449745In the DataFrames we have used so far, these dates are absent;9746the index skips days with no data. For the analysis that follows,9747we need to represent this missing data explicitly. 
We can do9748that by ``reindexing'' the DataFrame:9749\index{DataFrame}9750\index{reindex}97519752\begin{verbatim}9753dates = pandas.date_range(daily.index.min(), daily.index.max())9754reindexed = daily.reindex(dates)9755\end{verbatim}97569757The first line computes a date range that includes every day from the9758beginning to the end of the observed interval. The second line9759creates a new DataFrame with all of the data from {\tt daily}, but9760including rows for all dates, filled with {\tt nan}.9761\index{interval}9762\index{date range}97639764Now we can plot the rolling mean like this:97659766\begin{verbatim}9767roll_mean = pandas.rolling_mean(reindexed.ppg, 30)9768thinkplot.Plot(roll_mean.index, roll_mean)9769\end{verbatim}97709771The window size is 30, so each value in \verb"roll_mean" is9772the mean of 30 values from {\tt reindexed.ppg}.9773\index{pandas}9774\index{window}97759776\begin{figure}9777% timeseries.py9778\centerline{\includegraphics[height=2.5in]{figs/timeseries10.pdf}}9779\caption{Daily price and a rolling mean (left) and exponentially-weighted9780moving average (right).}9781\label{timeseries10}9782\end{figure}97839784Figure~\ref{timeseries10} (left)9785shows the result.9786The rolling mean seems to do a good job of smoothing out the noise and9787extracting the trend. The first 29 values are {\tt nan}, and wherever9788there's a missing value, it's followed by another 29 {\tt nan}s.9789There are ways to fill in these gaps, but they are a minor nuisance.9790\index{missing values}9791\index{noise}9792\index{smoothing}97939794An alternative is the {\bf exponentially-weighted moving average} (EWMA),9795which has two advantages. First, as the name suggests, it computes9796a weighted average where the most recent value has the highest weight9797and the weights for previous values drop off exponentially.9798Second, the pandas implementation of EWMA handles missing values9799better.9800\index{reindex}9801\index{exponentially-weighted moving average}9802\index{EWMA}98039804\begin{verbatim}9805ewma = pandas.ewma(reindexed.ppg, span=30)9806thinkplot.Plot(ewma.index, ewma)9807\end{verbatim}98089809The {\bf span} parameter corresponds roughly to the window size of9810a moving average; it controls how fast the weights drop off, so it9811determines the number of points that make a non-negligible contribution9812to each average.9813\index{span}9814\index{window}98159816Figure~\ref{timeseries10} (right) shows the EWMA for the same data.9817It is similar to the rolling mean, where they are both defined,9818but it has no missing values, which makes it easier to work with. The9819values are noisy at the beginning of the time series, because they are9820based on fewer data points.9821\index{missing values}982298239824\section{Missing values}98259826Now that we have characterized the trend of the time series, the9827next step is to investigate seasonality, which is periodic behavior.9828Time series data based on human behavior often exhibits daily,9829weekly, monthly, or yearly cycles. In the next section I present9830methods to test for seasonality, but they don't work well with9831missing data, so we have to solve that problem first.9832\index{missing values}9833\index{seasonality}98349835A simple and common way to fill missing data is to use a moving9836average. 
The Series method {\tt fillna} does just what we want:9837\index{Series}9838\index{fillna}98399840\begin{verbatim}9841reindexed.ppg.fillna(ewma, inplace=True)9842\end{verbatim}98439844Wherever {\tt reindexed.ppg} is {\tt nan}, {\tt fillna} replaces9845it with the corresponding value from {\tt ewma}. The {\tt inplace}9846flag tells {\tt fillna} to modify the existing Series rather than9847create a new one.98489849A drawback of this method is that it understates the noise in the9850series. We can solve that problem by adding in resampled9851residuals:9852\index{resampling}9853\index{noise}98549855\begin{verbatim}9856resid = (reindexed.ppg - ewma).dropna()9857fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))9858reindexed.ppg.fillna(fake_data, inplace=True)9859\end{verbatim}98609861% (One note on vocabulary: in this book I am using9862%``resampling'' in the statistical sense, which is drawing a random9863%sample from a population that is, itself, a sample. In the context9864%of time series analysis, it has another meaning: changing the9865%time between measurements in a series. I don't use the second9866%meaning in this book, but you might encounter it.)98679868{\tt resid} contains the residual values, not including days9869when {\tt ppg} is {\tt nan}. \verb"fake_data" contains the9870sum of the moving average and a random sample of residuals.9871Finally, {\tt fillna} replaces {\tt nan} with values from9872\verb"fake_data".9873\index{dropna}9874\index{fillna}9875\index{NaN}98769877\begin{figure}9878% timeseries.py9879\centerline{\includegraphics[height=2.5in]{figs/timeseries8.pdf}}9880\caption{Daily price with filled data.}9881\label{timeseries8}9882\end{figure}98839884Figure~\ref{timeseries8} shows the result. The filled data is visually9885similar to the actual values. Since the resampled residuals are9886random, the results are different every time; later we'll see how9887to characterize the error created by missing values.9888\index{resampling}9889\index{missing values}989098919892\section{Serial correlation}98939894As prices vary from day to day, you might expect to see patterns.9895If the price is high on Monday,9896you might expect it to be high for a few more days; and9897if it's low, you might expect it to stay low. A pattern9898like this is called {\bf serial9899correlation}, because each value is correlated with the next one9900in the series.9901\index{correlation!serial}9902\index{serial correlation}99039904To compute serial correlation, we can shift the time series9905by an interval called a {\bf lag}, and then compute the correlation9906of the shifted series with the original:9907\index{lag}99089909\begin{verbatim}9910def SerialCorr(series, lag=1):9911xs = series[lag:]9912ys = series.shift(lag)[lag:]9913corr = thinkstats2.Corr(xs, ys)9914return corr9915\end{verbatim}99169917After the shift, the first {\tt lag} values are {\tt nan}, so9918I use a slice to remove them before computing {\tt Corr}.9919\index{NaN}99209921%high 0.4801218161549922%medium 0.1646000783629923%low 0.10337362013199249925If we apply {\tt SerialCorr} to the raw price data with lag 1, we find9926serial correlation 0.48 for the high quality category, 0.16 for9927medium and 0.10 for low. 
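As an aside, a quick way to build intuition for these numbers is to apply the same computation to synthetic data. Here is a standalone sketch (separate from the code in {\tt timeseries.py}) that mimics {\tt SerialCorr} using NumPy's {\tt corrcoef}; for a series dominated by a trend, the lag-1 correlation comes out close to 1:

\begin{verbatim}
import numpy as np
import pandas

def SerialCorrSketch(series, lag=1):
    # same slicing as SerialCorr, but using np.corrcoef
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return np.corrcoef(xs, ys)[0][1]

# a lightly noisy series with a strong downward trend
trend = np.linspace(10, 5, 365)
noise = np.random.normal(0, 0.1, 365)
series = pandas.Series(trend + noise)

SerialCorrSketch(series, lag=1)    # close to 1
\end{verbatim}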
In any time series with a long-term trend, we expect to see strong serial correlations; for example, if prices are falling, we expect to see values above the mean in the first half of the series and values below the mean in the second half.

It is more interesting to see if the correlation persists if you subtract away the trend. For example, we can compute the residual of the EWMA and then compute its serial correlation:
\index{EWMA}

\begin{verbatim}
ewma = pandas.ewma(reindexed.ppg, span=30)
resid = reindexed.ppg - ewma
corr = SerialCorr(resid, 1)
\end{verbatim}

With lag=1, the serial correlations for the de-trended data are -0.022 for high quality, -0.015 for medium, and 0.036 for low. These values are small, indicating that there is little or no one-day serial correlation in this series.
\index{pandas}

To check for weekly, monthly, and yearly seasonality, I ran the analysis again with different lags. Here are the results:
\index{seasonality}

\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
lag & high & medium & low \\ \hline
1   & -0.029 & -0.014  & 0.034 \\
7   & 0.02   & -0.042  & -0.0097 \\
30  & 0.014  & -0.0064 & -0.013 \\
365 & 0.045  & 0.015   & 0.033 \\
\hline
\end{tabular}
\end{center}

In the next section we'll test whether these correlations are statistically significant (they are not), but at this point we can tentatively conclude that there are no substantial seasonal patterns in these series, at least not with these lags.
\index{significant} \index{statistically significant}


\section{Autocorrelation}

If you think a series might have some serial correlation, but you don't know which lags to test, you can test them all! The {\bf autocorrelation function} is a function that maps from lag to the serial correlation with the given lag. ``Autocorrelation'' is another name for serial correlation, used more often when the lag is not 1.
\index{autocorrelation function}

StatsModels, which we used for linear regression in Section~\ref{statsmodels}, also provides functions for time series analysis, including {\tt acf}, which computes the autocorrelation function:
\index{StatsModels}

\begin{verbatim}
import statsmodels.tsa.stattools as smtsa
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
\end{verbatim}

{\tt acf} computes serial correlations with lags from 0 through {\tt nlags}. The {\tt unbiased} flag tells {\tt acf} to correct the estimates for the sample size. The result is an array of correlations. If we select daily prices for high quality, and extract correlations for lags 1, 7, 30, and 365, we can confirm that {\tt acf} and {\tt SerialCorr} yield approximately the same results:
\index{acf}

\begin{verbatim}
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
\end{verbatim}

With {\tt lag=0}, {\tt acf} computes the correlation of the series with itself, which is always 1.
\index{lag}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries9.pdf}}
\caption{Autocorrelation function for daily prices (left), and
daily prices with a simulated weekly seasonality (right).}
\label{timeseries9}
\end{figure}

Figure~\ref{timeseries9} (left) shows autocorrelation functions for the three quality categories, with {\tt nlags=40}.
The gray region shows the normal variability we would expect if there is no actual autocorrelation; anything that falls outside this range is statistically significant, with a p-value less than 5\%. Since the false positive rate is 5\%, and we are computing 120 correlations (40 lags for each of 3 time series), we expect to see about 6 points outside this region. In fact, there are 7. We conclude that there are no autocorrelations in these series that could not be explained by chance.
\index{p-value}
\index{significant} \index{statistically significant}
\index{false positive}

I computed the gray regions by resampling the residuals. You can see my code in {\tt timeseries.py}; the function is called {\tt SimulateAutocorrelation}.
\index{resampling}

To see what the autocorrelation function looks like when there is a seasonal component, I generated simulated data by adding a weekly cycle. Assuming that demand for cannabis is higher on weekends, we might expect the price to be higher. To simulate this effect, I select dates that fall on Friday or Saturday and add a random amount to the price, chosen from a uniform distribution from \$0 to \$2.
\index{simulation}
\index{uniform distribution}
\index{distribution!uniform}

\begin{verbatim}
def AddWeeklySeasonality(daily):
    frisat = (daily.index.dayofweek==4) | (daily.index.dayofweek==5)
    fake = daily.copy()
    fake.ppg[frisat] += np.random.uniform(0, 2, frisat.sum())
    return fake
\end{verbatim}

{\tt frisat} is a boolean Series, {\tt True} if the day of the week is Friday or Saturday. {\tt fake} is a new DataFrame, initially a copy of {\tt daily}, which we modify by adding random values to {\tt ppg}. {\tt frisat.sum()} is the total number of Fridays and Saturdays, which is the number of random values we have to generate.
\index{DataFrame}
\index{Series}
\index{boolean}

Figure~\ref{timeseries9} (right) shows autocorrelation functions for prices with this simulated seasonality. As expected, the correlations are highest when the lag is a multiple of 7. For high and medium quality, the new correlations are statistically significant. For low quality they are not, because residuals in this category are large; the effect would have to be bigger to be visible through the noise.
\index{significant} \index{statistically significant}
\index{residuals}
\index{lag}


\section{Prediction}

Time series analysis can be used to investigate, and sometimes explain, the behavior of systems that vary in time. It can also make predictions.
\index{prediction}

The linear regressions we used in Section~\ref{timeregress} can be used for prediction. The RegressionResults class provides {\tt predict}, which takes a DataFrame containing the explanatory variables and returns a sequence of predictions. Here's the code:
\index{explanatory variable}
\index{linear regression}

\begin{verbatim}
def GenerateSimplePrediction(results, years):
    n = len(years)
    inter = np.ones(n)
    d = dict(Intercept=inter, years=years)
    predict_df = pandas.DataFrame(d)
    predict = results.predict(predict_df)
    return predict
\end{verbatim}

{\tt results} is a RegressionResults object; {\tt years} is the sequence of time values we want predictions for.
The function10103constructs a DataFrame, passes it to {\tt predict}, and10104returns the result.10105\index{pandas}10106\index{DataFrame}1010710108If all we want is a single, best-guess prediction, we're done. But10109for most purposes it is important to quantify error. In other words,10110we want to know how accurate the prediction is likely to be.1011110112There are three sources of error we should take into account:1011310114\begin{itemize}1011510116\item Sampling error: The prediction is based on estimated10117parameters, which depend on random variation10118in the sample. If we run the experiment again, we expect10119the estimates to vary.10120\index{sampling error}10121\index{parameter}1012210123\item Random variation: Even if the estimated parameters are10124perfect, the observed data varies randomly around the long-term10125trend, and we expect this variation to continue in the future.10126\index{noise}1012710128\item Modeling error: We have already seen evidence that the long-term10129trend is not linear, so predictions based on a linear model will10130eventually fail.10131\index{modeling error}1013210133\end{itemize}1013410135Another source of error to consider is unexpected future events.10136Agricultural prices are affected by weather, and all prices are10137affected by politics and law. As I write this, cannabis is legal in10138two states and legal for medical purposes in 20 more. If more states10139legalize it, the price is likely to go down. But if10140the federal government cracks down, the price might go up.1014110142Modeling errors and unexpected future events are hard to quantify.10143Sampling error and random variation are easier to deal with, so we'll10144do that first.1014510146To quantify sampling error, I use resampling, as we did in10147Section~\ref{regest}. As always, the goal is to use the actual10148observations to simulate what would happen if we ran the experiment10149again. The simulations are based on the assumption that the estimated10150parameters are correct, but the random residuals could have been10151different. Here is a function that runs the simulations:10152\index{resampling}1015310154\begin{verbatim}10155def SimulateResults(daily, iters=101):10156model, results = RunLinearModel(daily)10157fake = daily.copy()1015810159result_seq = []10160for i in range(iters):10161fake.ppg = results.fittedvalues + Resample(results.resid)10162_, fake_results = RunLinearModel(fake)10163result_seq.append(fake_results)1016410165return result_seq10166\end{verbatim}1016710168{\tt daily} is a DataFrame containing the observed prices;10169{\tt iters} is the number of simulations to run.10170\index{DataFrame}10171\index{price}1017210173{\tt SimulateResults} uses {\tt RunLinearModel}, from10174Section~\ref{timeregress}, to estimate the slope and intercept10175of the observed values.1017610177Each time through the loop, it generates a ``fake'' dataset by10178resampling the residuals and adding them to the fitted values. 
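{\tt Resample}, as used here, draws a bootstrap sample of the residuals: a sample of the same size, drawn with replacement. A minimal sketch of that behavior (the version the book actually uses lives in {\tt thinkstats2}) is:

\begin{verbatim}
import numpy as np

def Resample(xs, n=None):
    # draw n values from xs with replacement; by default, len(xs) values
    if n is None:
        n = len(xs)
    return np.random.choice(xs, n, replace=True)
\end{verbatim}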
Then10179it runs a linear model on the fake data and stores the RegressionResults10180object.10181\index{model}10182\index{residuals}1018310184The next step is to use the simulated results to generate predictions:1018510186\begin{verbatim}10187def GeneratePredictions(result_seq, years, add_resid=False):10188n = len(years)10189d = dict(Intercept=np.ones(n), years=years, years2=years**2)10190predict_df = pandas.DataFrame(d)1019110192predict_seq = []10193for fake_results in result_seq:10194predict = fake_results.predict(predict_df)10195if add_resid:10196predict += thinkstats2.Resample(fake_results.resid, n)10197predict_seq.append(predict)1019810199return predict_seq10200\end{verbatim}1020110202{\tt GeneratePredictions} takes the sequence of results from the10203previous step, as well as {\tt years}, which is a sequence of10204floats that specifies the interval to generate predictions for,10205and \verb"add_resid", which indicates whether it should add resampled10206residuals to the straight-line prediction.10207{\tt GeneratePredictions} iterates through the sequence of10208RegressionResults and generates a sequence of predictions.10209\index{resampling}1021010211\begin{figure}10212% timeseries.py10213\centerline{\includegraphics[height=2.5in]{figs/timeseries4.pdf}}10214\caption{Predictions based on linear fits, showing variation due10215to sampling error and prediction error.}10216\label{timeseries4}10217\end{figure}1021810219Finally, here's the code that plots a 90\% confidence interval for10220the predictions:10221\index{confidence interval}1022210223\begin{verbatim}10224def PlotPredictions(daily, years, iters=101, percent=90):10225result_seq = SimulateResults(daily, iters=iters)10226p = (100 - percent) / 210227percents = p, 100-p1022810229predict_seq = GeneratePredictions(result_seq, years, True)10230low, high = thinkstats2.PercentileRows(predict_seq, percents)10231thinkplot.FillBetween(years, low, high, alpha=0.3, color='gray')1023210233predict_seq = GeneratePredictions(result_seq, years, False)10234low, high = thinkstats2.PercentileRows(predict_seq, percents)10235thinkplot.FillBetween(years, low, high, alpha=0.5, color='gray')10236\end{verbatim}1023710238{\tt PlotPredictions} calls {\tt GeneratePredictions} twice: once10239with \verb"add_resid=True" and again with \verb"add_resid=False".10240It uses {\tt PercentileRows} to select the 5th and 95th percentiles10241for each year, then plots a gray region between these bounds.10242\index{FillBetween}1024310244Figure~\ref{timeseries4} shows the result.10245The dark gray region represents a 90\% confidence interval for10246the sampling error; that is, uncertainty about the estimated10247slope and intercept due to sampling.10248\index{sampling error}1024910250The lighter region shows10251a 90\% confidence interval for prediction error, which is the10252sum of sampling error and random variation.10253\index{noise}1025410255These regions quantify sampling error and random variation, but10256not modeling error. 
In general modeling error is hard to quantify,10257but in this case we can address at least one source of error,10258unpredictable external events.10259\index{modeling error}1026010261The regression model is based on the assumption that the system10262is {\bf stationary}; that is, that the parameters of the model10263don't change over time.10264Specifically, it assumes that the slope and10265intercept are constant, as well as the distribution of residuals.10266\index{stationary model}10267\index{parameter}1026810269But looking at the moving averages in Figure~\ref{timeseries10}, it10270seems like the slope changes at least once during the observed10271interval, and the variance of the residuals seems bigger in the first10272half than the second.10273\index{slope}1027410275As a result, the parameters we get depend on the interval we10276observe. To see how much effect this has on the predictions,10277we can extend {\tt SimulateResults} to use intervals of observation10278with different start and end dates. My implementation is in10279{\tt timeseries.py}.10280\index{simulation}1028110282\begin{figure}10283% timeseries.py10284\centerline{\includegraphics[height=2.5in]{figs/timeseries5.pdf}}10285\caption{Predictions based on linear fits, showing10286variation due to the interval of observation.}10287\label{timeseries5}10288\end{figure}1028910290Figure~\ref{timeseries5} shows the result for the medium quality10291category. The lightest gray area shows a confidence interval that10292includes uncertainty due to sampling error, random variation, and10293variation in the interval of observation.10294\index{confidence interval}10295\index{interval}1029610297The model based on the entire interval has positive slope, indicating10298that prices were increasing. But the most recent interval shows signs10299of decreasing prices, so models based on the most recent data have10300negative slope. As a result, the widest predictive interval includes10301the possibility of decreasing prices over the next year.10302\index{model}103031030410305\section{Further reading}1030610307Time series analysis is a big topic; this chapter has only scratched10308the surface. An important tool for working with time series data10309is autoregression, which I did not cover here, mostly because it turns10310out not to be useful for the example data I worked with.10311\index{time series}1031210313But once you10314have learned the material in this chapter, you are well prepared10315to learn about autoregression. One resource I recommend is10316Philipp Janert's book, {\it Data Analysis with Open Source Tools},10317O'Reilly Media, 2011. His chapter on time series analysis picks up10318where this one leaves off.10319\index{Janert, Philipp}103201032110322\section{Exercises}1032310324My solution to these exercises is in \verb"chap12soln.py".1032510326\begin{exercise}10327The linear model I used in this chapter has the obvious drawback10328that it is linear, and there is no reason to expect prices to10329change linearly over time.10330We can add flexibility to the model by adding a quadratic term,10331as we did in Section~\ref{nonlinear}.10332\index{nonlinear}10333\index{linear model}10334\index{quadratic model}1033510336Use a quadratic model to fit the time series of daily prices,10337and use the model to generate predictions. 
You will have to10338write a version of {\tt RunLinearModel} that runs that quadratic10339model, but after that you should be able to reuse code in10340{\tt timeseries.py} to generate predictions.10341\index{prediction}1034210343\end{exercise}1034410345\begin{exercise}10346Write a definition for a class named {\tt SerialCorrelationTest}10347that extends {\tt HypothesisTest} from Section~\ref{hypotest}.10348It should take a series and a lag as data, compute the serial10349correlation of the series with the given lag, and then compute10350the p-value of the observed correlation.10351\index{HypothesisTest}10352\index{p-value}10353\index{lag}1035410355Use this class to test whether the serial correlation in raw10356price data is statistically significant. Also test the residuals10357of the linear model and (if you did the previous exercise),10358the quadratic model.10359\index{quadratic model}10360\index{significant} \index{statistically significant}1036110362\end{exercise}1036310364\begin{exercise}10365There are several ways to extend the EWMA model to generate predictions.10366One of the simplest is something like this:10367\index{EWMA}1036810369\begin{enumerate}1037010371\item Compute the EWMA of the time series and use the last point10372as an intercept, {\tt inter}.1037310374\item Compute the EWMA of differences between successive elements in10375the time series and use the last point as a slope, {\tt slope}.10376\index{slope}1037710378\item To predict values at future times, compute {\tt inter + slope * dt},10379where {\tt dt} is the difference between the time of the prediction and10380the time of the last observation.10381\index{prediction}1038210383\end{enumerate}1038410385Use this method to generate predictions for a year after the last10386observation. A few hints:1038710388\begin{itemize}1038910390\item Use {\tt timeseries.FillMissing} to fill in missing values10391before running this analysis. That way the time between consecutive10392elements is consistent.10393\index{missing values}1039410395\item Use {\tt Series.diff} to compute differences between successive10396elements.10397\index{Series}1039810399\item Use {\tt reindex} to extend the DataFrame index into the future.10400\index{reindex}1040110402\item Use {\tt fillna} to put your predicted values into the DataFrame.10403\index{fillna}1040410405\end{itemize}1040610407\end{exercise}104081040910410\section{Glossary}1041110412\begin{itemize}1041310414\item time series: A dataset where each value is associated with10415a timestamp, often a series of measurements and the times they10416were collected.10417\index{time series}1041810419\item window: A sequence of consecutive values in a time series,10420often used to compute a moving average.10421\index{window}1042210423\item moving average: One of several statistics intended to estimate10424the underlying trend in a time series by computing averages (of10425some kind) for a series of overlapping windows.10426\index{moving average}1042710428\item rolling mean: A moving average based on the mean value in10429each window.10430\index{rolling mean}1043110432\item exponentially-weighted moving average (EWMA): A moving10433average based on a weighted mean that gives the highest weight10434to the most recent values, and exponentially decreasing weights10435to earlier values. 
\index{exponentially-weighted moving average} \index{EWMA}1043610437\item span: A parameter of EWMA that determines how quickly the10438weights decrease.10439\index{span}1044010441\item serial correlation: Correlation between a time series and10442a shifted or lagged version of itself.10443\index{serial correlation}1044410445\item lag: The size of the shift in a serial correlation or10446autocorrelation.10447\index{lag}1044810449\item autocorrelation: A more general term for a serial correlation10450with any amount of lag.10451\index{autocorrelation function}1045210453\item autocorrelation function: A function that maps from lag to10454serial correlation.1045510456\item stationary: A model is stationary if the parameters and the10457distribution of residuals does not change over time.10458\index{model}10459\index{stationary model}1046010461\end{itemize}10462104631046410465\chapter{Survival analysis}1046610467{\bf Survival analysis} is a way to describe how long things last.10468It is often used to study human lifetimes, but it10469also applies to ``survival'' of mechanical and electronic components, or10470more generally to intervals in time before an event.10471\index{survival analysis}10472\index{mechanical component}10473\index{electrical component}1047410475If someone you know has been diagnosed with a life-threatening10476disease, you might have seen a ``5-year survival rate,'' which10477is the probability of surviving five years after diagnosis. That10478estimate and related statistics are the result of survival analysis.10479\index{survival rate}1048010481The code in this chapter is in {\tt survival.py}. For information10482about downloading and working with this code, see Section~\ref{code}.104831048410485\section{Survival curves}10486\label{survival}1048710488The fundamental concept in survival analysis is the {\bf survival10489curve}, $S(t)$, which is a function that maps from a duration, $t$, to the10490probability of surviving longer than $t$. If you know the distribution10491of durations, or ``lifetimes'', finding the survival curve is easy;10492it's just the complement of the CDF: \index{survival curve}10493%10494\[ S(t) = 1 - \CDF(t) \]10495%10496where $CDF(t)$ is the probability of a lifetime less than or equal10497to $t$.10498\index{complementary CDF} \index{CDF!complementary} \index{CCDF}1049910500For example, in the NSFG dataset, we know the duration of 1118910501complete pregnancies. We can read this data and compute the CDF:10502\index{pregnancy length}1050310504\begin{verbatim}10505preg = nsfg.ReadFemPreg()10506complete = preg.query('outcome in [1, 3, 4]').prglngth10507cdf = thinkstats2.Cdf(complete, label='cdf')10508\end{verbatim}1050910510The outcome codes {\tt 1, 3, 4} indicate live birth, stillbirth,10511and miscarriage. For this analysis I am excluding induced abortions,10512ectopic pregnancies, and pregnancies that were in progress when10513the respondent was interviewed.1051410515The DataFrame method {\tt query} takes a boolean expression and10516evaluates it for each row, selecting the rows that yield True.10517\index{DataFrame}10518\index{boolean}10519\index{query}1052010521\begin{figure}10522% survival.py10523\centerline{\includegraphics[height=3.0in]{figs/survival1.pdf}}10524\caption{Cdf and survival curve for pregnancy length (top),10525hazard curve (bottom).}10526\label{survival1}10527\end{figure}1052810529Figure~\ref{survival1} (top) shows the CDF of pregnancy length10530and its complement, the survival curve. 
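If all you need are the numbers, the survival probabilities can be computed directly from the Cdf; a two-line sketch using the {\tt cdf} defined above:

\begin{verbatim}
ts = cdf.xs        # pregnancy lengths
ss = 1 - cdf.ps    # probability of lasting longer than each length
\end{verbatim}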
To represent the10531survival curve, I define an object that wraps a Cdf and10532adapts the interface:10533\index{Cdf}10534\index{pregnancy length}10535\index{SurvivalFunction}1053610537\begin{verbatim}10538class SurvivalFunction(object):10539def __init__(self, cdf, label=''):10540self.cdf = cdf10541self.label = label or cdf.label1054210543@property10544def ts(self):10545return self.cdf.xs1054610547@property10548def ss(self):10549return 1 - self.cdf.ps10550\end{verbatim}1055110552{\tt SurvivalFunction} provides two properties: {\tt ts}, which10553is the sequence of lifetimes, and {\tt ss}, which is the survival10554curve. In Python, a ``property'' is a method that can be10555invoked as if it were a variable.1055610557We can instantiate a {\tt SurvivalFunction} by passing10558the CDF of lifetimes:10559\index{property}1056010561\begin{verbatim}10562sf = SurvivalFunction(cdf)10563\end{verbatim}1056410565{\tt SurvivalFunction} also provides \verb"__getitem__" and10566{\tt Prob}, which evaluates the survival curve:1056710568\begin{verbatim}10569# class SurvivalFunction1057010571def __getitem__(self, t):10572return self.Prob(t)1057310574def Prob(self, t):10575return 1 - self.cdf.Prob(t)10576\end{verbatim}1057710578For example, {\tt sf[13]} is the fraction of pregnancies that10579proceed past the first trimester:10580\index{trimester}1058110582\begin{verbatim}10583>>> sf[13]105840.8602210585>>> cdf[13]105860.1397810587\end{verbatim}1058810589About 86\% of pregnancies proceed past the first trimester;10590about 14\% do not.1059110592{\tt SurvivalFunction} provides {\tt Render}, so we can10593plot {\tt sf} using the functions in {\tt thinkplot}:10594\index{thinkplot}1059510596\begin{verbatim}10597thinkplot.Plot(sf)10598\end{verbatim}1059910600Figure~\ref{survival1} (top) shows the result. The curve is nearly10601flat between 13 and 26 weeks, which shows that few pregnancies10602end in the second trimester. And the curve is steepest around 3910603weeks, which is the most common pregnancy length.10604\index{pregnancy length}106051060610607\section{Hazard function}10608\label{hazard}1060910610From the survival curve we can derive the {\bf hazard function};10611for pregnancy lengths, the hazard function maps from a time, $t$, to10612the fraction of pregnancies that continue until $t$ and then end at10613$t$. To be more precise:10614%10615\[ \lambda(t) = \frac{S(t) - S(t+1)}{S(t)} \]10616%10617The numerator is the fraction of lifetimes that end at $t$, which10618is also $\PMF(t)$.10619\index{hazard function}1062010621{\tt SurvivalFunction} provides {\tt MakeHazard}, which calculates10622the hazard function:1062310624\begin{verbatim}10625# class SurvivalFunction1062610627def MakeHazard(self, label=''):10628ss = self.ss10629lams = {}10630for i, t in enumerate(self.ts[:-1]):10631hazard = (ss[i] - ss[i+1]) / ss[i]10632lams[t] = hazard1063310634return HazardFunction(lams, label=label)10635\end{verbatim}1063610637The {\tt HazardFunction} object is a wrapper for a pandas10638Series:10639\index{pandas}10640\index{Series}10641\index{wrapper}1064210643\begin{verbatim}10644class HazardFunction(object):1064510646def __init__(self, d, label=''):10647self.series = pandas.Series(d)10648self.label = label10649\end{verbatim}1065010651{\tt d} can be a dictionary or any other type that can initialize10652a Series, including another Series. 
{\tt label} is a string used10653to identify the HazardFunction when plotted.10654\index{HazardFunction}1065510656{\tt HazardFunction} provides \verb"__getitem__", so we can evaluate10657it like this:1065810659\begin{verbatim}10660>>> hf = sf.MakeHazard()10661>>> hf[39]106620.4968910663\end{verbatim}1066410665So of all pregnancies that proceed until week 39, about1066650\% end in week 39.1066710668Figure~\ref{survival1} (bottom) shows the hazard function for10669pregnancy lengths. For times after week 42, the hazard function10670is erratic because it is based on a small number of cases.10671Other than that the shape of the curve is as expected: it is10672highest around 39 weeks, and a little higher in the first10673trimester than in the second.10674\index{pregnancy length}1067510676The hazard function is useful in its own right, but it is also an10677important tool for estimating survival curves, as we'll see in the10678next section.106791068010681\section{Inferring survival curves}1068210683If someone gives you the CDF of lifetimes, it is easy to compute the10684survival and hazard functions. But in many real-world10685scenarios, we can't measure the distribution of lifetimes directly.10686We have to infer it.10687\index{survival curve}10688\index{CDF}1068910690For example, suppose you are following a group of patients to see how10691long they survive after diagnosis. Not all patients are diagnosed on10692the same day, so at any point in time, some patients have survived10693longer than others. If some patients have died, we know their10694survival times. For patients who are still alive, we don't know10695survival times, but we have a lower bound.10696\index{diagnosis}1069710698If we wait until all patients are dead, we can compute the survival10699curve, but if we are evaluating the effectiveness of a new treatment,10700we can't wait that long! We need a way to estimate survival curves10701using incomplete information.10702\index{incomplete information}1070310704As a more cheerful example, I will use NSFG data to quantify how10705long respondents ``survive'' until they get married for the10706first time. The range of respondents' ages is 14 to 44 years, so10707the dataset provides a snapshot of women at different stages in their10708lives.10709\index{marital status}1071010711For women who have been married, the dataset includes the date10712of their first marriage and their age at the time.10713For women who have not been married, we know their age when interviewed,10714but have no way of knowing when or if they will get married.10715\index{age}1071610717Since we know the age at first marriage for {\em some\/} women, it10718might be tempting to exclude the rest and compute the CDF of10719the known data. That is a bad idea. The result would10720be doubly misleading: (1) older women would be overrepresented,10721because they are more likely to be married when interviewed,10722and (2) married women would be overrepresented! 
In fact, this10723analysis would lead to the conclusion that all women get married,10724which is obviously incorrect.107251072610727\section{Kaplan-Meier estimation}1072810729In this example it is not only desirable but necessary to include10730observations of unmarried women, which brings us to one of the central10731algorithms in survival analysis, {\bf Kaplan-Meier estimation}.10732\index{Kaplan-Meier estimation}1073310734The general idea is that we can use the data to estimate the hazard10735function, then convert the hazard function to a survival curve.10736To estimate the hazard function, we consider, for each age,10737(1) the number of women who got married at that age and (2) the number10738of women ``at risk'' of getting married, which includes all women10739who were not married at an earlier age.10740\index{hazard function}10741\index{at risk}1074210743Here's the code:1074410745\begin{verbatim}10746def EstimateHazardFunction(complete, ongoing, label=''):1074710748hist_complete = Counter(complete)10749hist_ongoing = Counter(ongoing)1075010751ts = list(hist_complete | hist_ongoing)10752ts.sort()1075310754at_risk = len(complete) + len(ongoing)1075510756lams = pandas.Series(index=ts)10757for t in ts:10758ended = hist_complete[t]10759censored = hist_ongoing[t]1076010761lams[t] = ended / at_risk10762at_risk -= ended + censored1076310764return HazardFunction(lams, label=label)10765\end{verbatim}1076610767{\tt complete} is the set of complete observations; in this case,10768the ages when respondents got married. {\tt ongoing} is the set10769of incomplete observations; that is, the ages of unmarried women10770when they were interviewed.1077110772First, we precompute \verb"hist_complete", which is a Counter10773that maps from each age to the number of women married at that10774age, and \verb"hist_ongoing" which maps from each age to the10775number of unmarried women interviewed at that age.1077610777\index{Counter}10778\index{survival curve}1077910780{\tt ts} is the union of ages when respondents got married10781and ages when unmarried women were interviewed, sorted in10782increasing order.1078310784\verb"at_risk" keeps track of the number of respondents considered10785``at risk'' at each age; initially, it is the total number of10786respondents.1078710788The result is stored in a Pandas {\tt Series} that maps from10789each age to the estimated hazard function at that age.1079010791Each time through the loop, we consider one age, {\tt t},10792and compute the number of events that end at {\tt t} (that is,10793the number of respondents married at that age) and the number10794of events censored at {\tt t} (that is, the number of women10795interviewed at {\tt t} whose future marriage dates are10796censored). In this context, ``censored'' means that the10797data are unavailable because of the data collection process.1079810799The estimated hazard function is the fraction of the cases10800at risk that end at {\tt t}.1080110802At the end of the loop, we subtract from \verb"at_risk" the10803number of cases that ended or were censored at {\tt t}.1080410805Finally, we pass {\tt lams} to the {\tt HazardFunction}10806constructor and return the result.1080710808\index{HazardFunction}108091081010811\section{The marriage curve}1081210813To test this function, we have to do some data cleaning and10814transformation. 
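Before turning to the real data, a toy example (not from {\tt survival.py}) may help make the bookkeeping concrete. Suppose three subjects had the event at times 1, 2, and 2, and two were still ``ongoing'' at times 2 and 3 when observation stopped:

\begin{verbatim}
complete = [1, 2, 2]    # observed event times
ongoing = [2, 3]        # censored observation times

hf = EstimateHazardFunction(complete, ongoing)
# at t=1, 1 of the 5 cases at risk ends:    hazard 1/5
# at t=2, 2 of the remaining 4 cases end:   hazard 2/4
# at t=3, 0 of the remaining 1 case ends:   hazard 0
\end{verbatim}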
The NSFG variables we need are:10815\index{marital status}1081610817\begin{itemize}1081810819\item {\tt cmbirth}: The respondent's date of birth, known for10820all respondents.10821\index{date of birth}1082210823\item {\tt cmintvw}: The date the respondent was interviewed,10824known for all respondents.1082510826\item {\tt cmmarrhx}: The date the respondent was first married,10827if applicable and known.1082810829\item {\tt evrmarry}: 1 if the respondent had been10830married prior to the date of interview, 0 otherwise.1083110832\end{itemize}1083310834The first three variables are encoded in ``century-months''; that is, the10835integer number of months since December 1899. So century-month108361 is January 1900.10837\index{century month}1083810839First, we read the respondent file and replace invalid values of10840{\tt cmmarrhx}:1084110842\begin{verbatim}10843resp = chap01soln.ReadFemResp()10844resp.cmmarrhx.replace([9997, 9998, 9999], np.nan, inplace=True)10845\end{verbatim}1084610847Then we compute each respondent's age when married and age when10848interviewed:10849\index{NaN}1085010851\begin{verbatim}10852resp['agemarry'] = (resp.cmmarrhx - resp.cmbirth) / 12.010853resp['age'] = (resp.cmintvw - resp.cmbirth) / 12.010854\end{verbatim}1085510856Next we extract {\tt complete}, which is the age at marriage for10857women who have been married, and {\tt ongoing}, which is the10858age at interview for women who have not:10859\index{age}1086010861\begin{verbatim}10862complete = resp[resp.evrmarry==1].agemarry10863ongoing = resp[resp.evrmarry==0].age10864\end{verbatim}1086510866Finally we compute the10867hazard function.10868\index{hazard function}1086910870\begin{verbatim}10871hf = EstimateHazardFunction(complete, ongoing)10872\end{verbatim}1087310874Figure~\ref{survival2} (top) shows the estimated hazard function;10875it is low in the teens,10876higher in the 20s, and declining in the 30s. It increases again in10877the 40s, but that is an artifact of the estimation process; as the10878number of respondents ``at risk'' decreases, a small number of10879women getting married yields a large estimated hazard. The survival10880curve will smooth out this noise.10881\index{noise}108821088310884\section{Estimating the survival curve}1088510886Once we have the hazard function, we can estimate the survival curve.10887The chance of surviving past time {\tt t} is the chance of surviving10888all times up through {\tt t}, which is the cumulative product of10889the complementary hazard function:10890%10891\[ [1-\lambda(0)] [1-\lambda(1)] \ldots [1-\lambda(t)] \]10892%10893The {\tt HazardFunction} class provides {\tt MakeSurvival}, which10894computes this product:10895\index{cumulative product}10896\index{SurvivalFunction}1089710898\begin{verbatim}10899# class HazardFunction:1090010901def MakeSurvival(self):10902ts = self.series.index10903ss = (1 - self.series).cumprod()10904cdf = thinkstats2.Cdf(ts, 1-ss)10905sf = SurvivalFunction(cdf)10906return sf10907\end{verbatim}1090810909{\tt ts} is the sequence of times where the hazard function is10910estimated. 
{\tt ss} is the cumulative product of the complementary hazard function, so it is the survival curve.

Because of the way {\tt SurvivalFunction} is implemented, we have to compute the complement of {\tt ss}, make a Cdf, and then instantiate a SurvivalFunction object.
\index{Cdf}
\index{complementary CDF}


\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival2.pdf}}
\caption{Hazard function for age at first marriage (top) and
survival curve (bottom).}
\label{survival2}
\end{figure}

Figure~\ref{survival2} (bottom) shows the result. The survival curve is steepest between 25 and 35, when most women get married. Between 35 and 45, the curve is nearly flat, indicating that women who do not marry before age 35 are unlikely to get married.

A curve like this was the basis of a famous magazine article in 1986; {\it Newsweek\/} reported that a 40-year-old unmarried woman was ``more likely to be killed by a terrorist'' than get married. These statistics were widely reported and became part of popular culture, but they were wrong then (because they were based on faulty analysis) and turned out to be even more wrong (because of cultural changes that were already in progress and continued). In 2006, {\it Newsweek\/} ran another article admitting that they were wrong.
\index{Newsweek}

I encourage you to read more about this article, the statistics it was based on, and the reaction. It should remind you of the ethical obligation to perform statistical analysis with care, interpret the results with appropriate skepticism, and present them to the public accurately and honestly.
\index{ethics}


\section{Confidence intervals}

Kaplan-Meier analysis yields a single estimate of the survival curve, but it is also important to quantify the uncertainty of the estimate. As usual, there are three possible sources of error: measurement error, sampling error, and modeling error.
\index{confidence interval}
\index{modeling error}
\index{sampling error}

In this example, measurement error is probably small. People generally know when they were born, whether they've been married, and when. And they can be expected to report this information accurately.
\index{measurement error}

We can quantify sampling error by resampling. Here's the code:
\index{resampling}

\begin{verbatim}
def ResampleSurvival(resp, iters=101):
    low, high = resp.agemarry.min(), resp.agemarry.max()
    ts = np.arange(low, high, 1/12.0)

    ss_seq = []
    for i in range(iters):
        sample = thinkstats2.ResampleRowsWeighted(resp)
        hf, sf = EstimateSurvival(sample)
        ss_seq.append(sf.Probs(ts))

    low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])
    thinkplot.FillBetween(ts, low, high)
\end{verbatim}

{\tt ResampleSurvival} takes {\tt resp}, a DataFrame of respondents, and {\tt iters}, the number of times to resample.
It computes {\tt10987ts}, which is the sequence of ages where we will evaluate the survival10988curves.10989\index{DataFrame}1099010991Inside the loop, {\tt ResampleSurvival}:1099210993\begin{itemize}1099410995\item Resamples the respondents using {\tt ResampleRowsWeighted},10996which we saw in Section~\ref{weighted}.10997\index{weighted resampling}1099810999\item Calls {\tt EstimateSurvival}, which uses the process in the11000previous sections to estimate the hazard and survival curves, and1100111002\item Evaluates the survival curve at each age in {\tt ts}.1100311004\end{itemize}1100511006\verb"ss_seq" is a sequence of evaluated survival curves.11007{\tt PercentileRows} takes this sequence and computes the 5th and 95th11008percentiles, returning a 90\% confidence interval for the survival11009curve.11010\index{FillBetween}1101111012\begin{figure}11013% survival.py11014\centerline{\includegraphics[height=2.5in]{figs/survival3.pdf}}11015\caption{Survival curve for age at first marriage (dark line) and a 90\%11016confidence interval based on weighted resampling (gray line).}11017\label{survival3}11018\end{figure}1101911020Figure~\ref{survival3} shows the result along with the survival11021curve we estimated in the previous section. The confidence11022interval takes into account the sampling weights, unlike the estimated11023curve. The discrepancy between them indicates that the sampling11024weights have a substantial effect on the estimate---we will have11025to keep that in mind.11026\index{confidence interval}11027\index{sampling weight}110281102911030\section{Cohort effects}1103111032One of the challenges of survival analysis is that different parts11033of the estimated curve are based on different groups of respondents.11034The part of the curve at time {\tt t} is based on respondents11035whose age was at least {\tt t} when they were interviewed.11036So the leftmost part of the curve includes data from all respondents,11037but the rightmost part includes only the oldest respondents.1103811039If the relevant characteristics of the respondents are not changing11040over time, that's fine, but in this case it seems likely that marriage11041patterns are different for women born in different generations.11042We can investigate this effect by grouping respondents according11043to their decade of birth. Groups like this, defined by date of11044birth or similar events, are called {\bf cohorts}, and differences11045between the groups are called {\bf cohort effects}.11046\index{cohort}11047\index{cohort effect}1104811049To investigate cohort effects in the NSFG marriage data, I gathered11050the Cycle 6 data from 2002 used throughout this book;11051the Cycle 7 data from 2006--2010 used in Section~\ref{replication};11052and the Cycle 5 data from 1995. 
In total these datasets include 30,769 respondents.

\begin{verbatim}
resp5 = ReadFemResp1995()
resp6 = ReadFemResp2002()
resp7 = ReadFemResp2010()
resps = [resp5, resp6, resp7]
\end{verbatim}

For each DataFrame, {\tt resp}, I use {\tt cmbirth} to compute the decade of birth for each respondent:
\index{pandas}
\index{DataFrame}

\begin{verbatim}
month0 = pandas.to_datetime('1899-12-15')
dates = [month0 + pandas.DateOffset(months=cm)
         for cm in resp.cmbirth]
resp['decade'] = (pandas.DatetimeIndex(dates).year - 1900) // 10
\end{verbatim}

{\tt cmbirth} is encoded as the integer number of months since December 1899; {\tt month0} represents that date as a Timestamp object. For each birth date, we instantiate a {\tt DateOffset} that contains the century-month and add it to {\tt month0}; the result is a sequence of Timestamps, which is converted to a {\tt DateTimeIndex}. Finally, we extract {\tt year} and compute decades.
\index{DateTimeIndex}
\index{Index}
\index{century month}

To take into account the sampling weights, and also to show variability due to sampling error, I resample the data, group respondents by decade, and plot survival curves:
\index{resampling}
\index{sampling error}

\begin{verbatim}
for i in range(iters):
    samples = [thinkstats2.ResampleRowsWeighted(resp)
               for resp in resps]
    sample = pandas.concat(samples, ignore_index=True)
    groups = sample.groupby('decade')

    EstimateSurvivalByDecade(groups, alpha=0.2)
\end{verbatim}

Data from the three NSFG cycles use different sampling weights, so I resample them separately and then use {\tt concat} to merge them into a single DataFrame. The parameter \verb"ignore_index" tells {\tt concat} not to match up respondents by index; instead it creates a new index from 0 to 30768.
\index{pandas}
\index{DataFrame}
\index{groupby}

{\tt EstimateSurvivalByDecade} plots survival curves for each cohort:

\begin{verbatim}
def EstimateSurvivalByDecade(groups, **options):
    for name, group in groups:
        hf, sf = EstimateSurvival(group)
        thinkplot.Plot(sf, **options)
\end{verbatim}

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival4.pdf}}
\caption{Survival curves for respondents born during different decades.}
\label{survival4}
\end{figure}

Figure~\ref{survival4} shows the results. Several patterns are visible:

\begin{itemize}

\item Women born in the 50s married earliest, with successive cohorts marrying later and later, at least until age 30 or so.

\item Women born in the 60s follow a surprising pattern. Prior to age 25, they were marrying at slower rates than their predecessors. After age 25, they were marrying faster. By age 32 they had overtaken the 50s cohort, and at age 44 they are substantially more likely to have married.
\index{marital status}

Women born in the 60s turned 25 between 1985 and 1995.
Remembering11142that the {\it Newsweek\/} article I mentioned was published in 1986, it11143is tempting to imagine that the article triggered a marriage boom.11144That explanation would be too pat, but it is possible that the article11145and the reaction to it were indicative of a mood that affected the11146behavior of this cohort.11147\index{Newsweek}1114811149\item The pattern of the 70s cohort is similar. They are less11150likely than their predecessors to be married before age 25, but11151at age 35 they have caught up with both of the previous cohorts.1115211153\item Women born in the 80s are even less likely to marry before11154age 25. What happens after that is not clear; for more data, we11155have to wait for the next cycle of the NSFG.1115611157\end{itemize}1115811159In the meantime we can make some predictions.11160\index{prediction}111611116211163\section{Extrapolation}1116411165The survival curve for the 70s cohort ends at about age 38;11166for the 80s cohort it ends at age 28, and for the 90s cohort11167we hardly have any data at all.11168\index{extrapolation}1116911170We can extrapolate these curves by ``borrowing'' data from the11171previous cohort. HazardFunction provides a method, {\tt Extend}, that11172copies the tail from another longer HazardFunction:11173\index{HazardFunction}1117411175\begin{verbatim}11176# class HazardFunction1117711178def Extend(self, other):11179last = self.series.index[-1]11180more = other.series[other.series.index > last]11181self.series = pandas.concat([self.series, more])11182\end{verbatim}1118311184As we saw in Section~\ref{hazard}, the HazardFunction contains a Series11185that maps from $t$ to $\lambda(t)$. {\tt Extend} finds {\tt last},11186which is the last index in {\tt self.series}, selects values from11187{\tt other} that come later than {\tt last}, and appends them11188onto {\tt self.series}.11189\index{pandas}11190\index{Series}1119111192Now we can extend the HazardFunction for each cohort, using values11193from the predecessor:1119411195\begin{verbatim}11196def PlotPredictionsByDecade(groups):11197hfs = []11198for name, group in groups:11199hf, sf = EstimateSurvival(group)11200hfs.append(hf)1120111202thinkplot.PrePlot(len(hfs))11203for i, hf in enumerate(hfs):11204if i > 0:11205hf.Extend(hfs[i-1])11206sf = hf.MakeSurvival()11207thinkplot.Plot(sf)11208\end{verbatim}1120911210{\tt groups} is a GroupBy object with respondents grouped by decade of11211birth. The first loop computes the HazardFunction for each group.11212\index{groupby}1121311214The second loop extends each HazardFunction with values from11215its predecessor, which might contain values from the previous11216group, and so on. Then it converts each HazardFunction to11217a SurvivalFunction and plots it.1121811219\begin{figure}11220% survival.py11221\centerline{\includegraphics[height=2.5in]{figs/survival5.pdf}}11222\caption{Survival curves for respondents born during different decades,11223with predictions for the later cohorts.}11224\label{survival5}11225\end{figure}1122611227Figure~\ref{survival5} shows the results; I've removed the 50s cohort11228to make the predictions more visible. These results suggest that by11229age 40, the most recent cohorts will converge with the 60s cohort,11230with fewer than 20\% never married.11231\index{visualization}112321123311234\section{Expected remaining lifetime}1123511236Given a survival curve, we can compute the expected remaining11237lifetime as a function of current age. 
For example, given the11238survival curve of pregnancy length from Section~\ref{survival},11239we can compute the expected time until delivery.11240\index{pregnancy length}1124111242The first step is to extract the PMF of lifetimes. {\tt SurvivalFunction}11243provides a method that does that:1124411245\begin{verbatim}11246# class SurvivalFunction1124711248def MakePmf(self, filler=None):11249pmf = thinkstats2.Pmf()11250for val, prob in self.cdf.Items():11251pmf.Set(val, prob)1125211253cutoff = self.cdf.ps[-1]11254if filler is not None:11255pmf[filler] = 1-cutoff1125611257return pmf11258\end{verbatim}1125911260Remember that the SurvivalFunction contains the Cdf of lifetimes.11261The loop copies the values and probabilities from the Cdf into11262a Pmf.11263\index{Pmf}11264\index{Cdf}1126511266{\tt cutoff} is the highest probability in the Cdf, which is 111267if the Cdf is complete, and otherwise less than 1.11268If the Cdf is incomplete, we plug in the provided value, {\tt filler},11269to cap it off.1127011271The Cdf of pregnancy lengths is complete, so we don't have to worry11272about this detail yet.11273\index{pregnancy length}1127411275The next step is to compute the expected remaining lifetime, where11276``expected'' means average. {\tt SurvivalFunction}11277provides a method that does that, too:11278\index{expected remaining lifetime}1127911280\begin{verbatim}11281# class SurvivalFunction1128211283def RemainingLifetime(self, filler=None, func=thinkstats2.Pmf.Mean):11284pmf = self.MakePmf(filler=filler)11285d = {}11286for t in sorted(pmf.Values())[:-1]:11287pmf[t] = 011288pmf.Normalize()11289d[t] = func(pmf) - t1129011291return pandas.Series(d)11292\end{verbatim}1129311294{\tt RemainingLifetime} takes {\tt filler}, which is passed along11295to {\tt MakePmf}, and {\tt func} which is the function used to11296summarize the distribution of remaining lifetimes.1129711298{\tt pmf} is the Pmf of lifetimes extracted from the SurvivalFunction.11299{\tt d} is a dictionary that contains the results, a map from11300current age, {\tt t}, to expected remaining lifetime.11301\index{Pmf}1130211303The loop iterates through the values in the Pmf. For each value11304of {\tt t} it computes the conditional distribution of lifetimes,11305given that the lifetime exceeds {\tt t}. It does that by removing11306values from the Pmf one at a time and renormalizing the remaining11307values.1130811309Then it uses {\tt func} to summarize the conditional distribution.11310In this example the result is the mean pregnancy length, given that11311the length exceeds {\tt t}. By subtracting {\tt t} we get the11312mean remaining pregnancy length.11313\index{pregnancy length}1131411315\begin{figure}11316% survival.py11317\centerline{\includegraphics[height=2.5in]{figs/survival6.pdf}}11318\caption{Expected remaining pregnancy length (left) and11319years until first marriage (right).}11320\label{survival6}11321\end{figure}1132211323Figure~\ref{survival6} (left) shows the expected remaining pregnancy11324length as a function of the current duration. For example, during11325Week 0, the expected remaining duration is about 34 weeks. That's11326less than full term (39 weeks) because terminations of pregnancy11327in the first trimester bring the average down.11328\index{pregnancy length}1132911330The curve drops slowly during the first trimester. After 13 weeks,11331the expected remaining lifetime has dropped by only 9 weeks, to1133225. 
After that the curve drops faster, by about a week per week.1133311334Between Week 37 and 42, the curve levels off between 1 and 2 weeks.11335At any time during this period, the expected remaining lifetime is the11336same; with each week that passes, the destination gets no closer.11337Processes with this property are called {\bf memoryless} because11338the past has no effect on the predictions.11339This behavior is the mathematical basis of the infuriating mantra11340of obstetrics nurses: ``any day now.''11341\index{memoryless}1134211343Figure~\ref{survival6} (right) shows the median remaining time until11344first marriage, as a function of age. For an 11 year-old girl, the11345median time until first marriage is about 14 years. The curve decreases11346until age 22 when the median remaining time is about 7 years.11347After that it increases again: by age 30 it is back where it started,11348at 14 years.1134911350Based on this data, young women have decreasing remaining11351``lifetimes''. Mechanical components with this property are called {\bf NBUE}11352for ``new better than used in expectation,'' meaning that a new part is11353expected to last longer.11354\index{NBUE}1135511356Women older than 22 have increasing remaining time until first11357marriage. Components with this property are called {\bf UBNE} for11358``used better than new in expectation.'' That is, the older the part,11359the longer it is expected to last. Newborns and cancer patients are11360also UBNE; their life expectancy increases the longer they live.11361\index{UBNE}1136211363For this example I computed median, rather than mean, because the11364Cdf is incomplete; the survival curve projects that about 20\%11365of respondents will not marry before age 44. The age of11366first marriage for these women is unknown, and might be non-existent,11367so we can't compute a mean.11368\index{Cdf}11369\index{median}1137011371I deal with these unknown values by replacing them with {\tt np.inf},11372a special value that represents infinity. That makes the mean11373infinity for all ages, but the median is well-defined as long as11374more than 50\% of the remaining lifetimes are finite, which is true11375until age 30. After that it is hard to define a meaningful11376expected remaining lifetime.11377\index{inf}1137811379Here's the code that computes and plots these functions:1138011381\begin{verbatim}11382rem_life1 = sf1.RemainingLifetime()11383thinkplot.Plot(rem_life1)1138411385func = lambda pmf: pmf.Percentile(50)11386rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)11387thinkplot.Plot(rem_life2)11388\end{verbatim}1138911390{\tt sf1} is the survival curve for pregnancy length;11391in this case we can use the default values for {\tt RemainingLifetime}.11392\index{pregnancy length}1139311394{\tt sf2} is the survival curve for age at first marriage;11395{\tt func} is a function that takes a Pmf and computes its11396median (50th percentile).11397\index{Pmf}113981139911400\section{Exercises}1140111402My solution to this exercise is in \verb"chap13soln.py".1140311404\begin{exercise}11405In NSFG Cycles 6 and 7, the variable {\tt cmdivorcx} contains the11406date of divorce for the respondent's first marriage, if applicable,11407encoded in century-months.11408\index{divorce}11409\index{marital status}1141011411Compute the duration of marriages that have ended in divorce, and11412the duration, so far, of marriages that are ongoing. 
\section{Exercises}

My solution to this exercise is in \verb"chap13soln.py".

\begin{exercise}
In NSFG Cycles 6 and 7, the variable {\tt cmdivorcx} contains the
date of divorce for the respondent's first marriage, if applicable,
encoded in century-months.
\index{divorce}
\index{marital status}

Compute the duration of marriages that have ended in divorce, and
the duration, so far, of marriages that are ongoing. Estimate the
hazard and survival curve for the duration of marriage.

Use resampling to take into account sampling weights, and plot
data from several resamples to visualize sampling error.
\index{resampling}

Consider dividing the respondents into groups by decade of birth,
and possibly by age at first marriage.
\index{groupby}

\end{exercise}


\section{Glossary}

\begin{itemize}

\item survival analysis: A set of methods for describing and
predicting lifetimes, or more generally time until an event occurs.
\index{survival analysis}

\item survival curve: A function that maps from a time, $t$, to the
probability of surviving past $t$.
\index{survival curve}

\item hazard function: A function that maps from $t$ to the fraction
of people alive until $t$ who die at $t$.
\index{hazard function}

\item Kaplan-Meier estimation: An algorithm for estimating hazard and
survival functions.
\index{Kaplan-Meier estimation}

\item cohort: A group of subjects defined by an event, like date of
birth, in a particular interval of time.
\index{cohort}

\item cohort effect: A difference between cohorts.
\index{cohort effect}

\item NBUE: A property of expected remaining lifetime, ``New
better than used in expectation.''
\index{NBUE}

\item UBNE: A property of expected remaining lifetime, ``Used
better than new in expectation.''
\index{UBNE}

\end{itemize}


\chapter{Analytic methods}
\label{analysis}

This book has focused on computational methods like simulation and
resampling, but some of the problems we solved have
analytic solutions that can be much faster.
\index{resampling}
\index{analytic methods}
\index{computational methods}

I present some of these methods in this chapter, and explain
how they work. At the end of the chapter, I make suggestions
for integrating computational and analytic methods for exploratory
data analysis.

The code in this chapter is in {\tt normal.py}. For information
about downloading and working with this code, see Section~\ref{code}.


\section{Normal distributions}
\label{why_normal}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

As a motivating example, let's review the problem from
Section~\ref{gorilla}:
\index{gorilla}

\begin{quotation}
\noindent Suppose you are a scientist studying gorillas in a wildlife
preserve. Having weighed 9 gorillas, you find sample mean $\xbar=90$ kg and
sample standard deviation, $S=7.5$ kg. If you use $\xbar$ to estimate
the population mean, what is the standard error of the estimate?
\end{quotation}

To answer that question, we need the sampling
distribution of $\xbar$. In Section~\ref{gorilla} we approximated
this distribution by simulating the experiment (weighing
9 gorillas), computing $\xbar$ for each simulated experiment, and
accumulating the distribution of estimates.
\index{standard error}
\index{standard deviation}

The result is an approximation of the sampling distribution. Then we
use the sampling distribution to compute standard errors and
confidence intervals:
\index{confidence interval}
\index{sampling distribution}

\begin{enumerate}

\item The standard deviation of the sampling distribution is the
standard error of the estimate; in the example, it is about
2.5 kg.

\item The interval between the 5th and 95th percentiles of the sampling
distribution is a 90\% confidence interval. If we run the
experiment many times, we expect the estimate to fall in this
interval 90\% of the time. In the example, the 90\% CI is
$(86, 94)$ kg.

\end{enumerate}
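
As a quick reminder of where those numbers come from, here is a
condensed simulation in the spirit of Section~\ref{gorilla} (my own
sketch, not the code from that section):

\begin{verbatim}
def SimulateSample(mu=90, sigma=7.5, n=9, iters=1000):
    xbars = [np.random.normal(mu, sigma, n).mean()
             for _ in range(iters)]
    return np.array(xbars)

xbars = SimulateSample()
stderr = xbars.std()                                     # about 2.5
ci = np.percentile(xbars, 5), np.percentile(xbars, 95)   # about (86, 94)
\end{verbatim}
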
Now we'll do the same calculation analytically. We
take advantage of the fact that the weights of adult female gorillas
are roughly normally distributed. Normal distributions have two
properties that make them amenable to analysis: they are ``closed'' under
linear transformation and addition. To explain what that means, I
need some notation. \index{analysis}
\index{linear transformation}
\index{addition, closed under}

If the distribution of a quantity, $X$, is
normal with parameters $\mu$ and $\sigma$, you can write
%
\[ X \sim \normal~(\mu, \sigma^{2})\]
%
where the symbol $\sim$ means ``is distributed'' and the script letter
$\normal$ stands for ``normal.''

%The other analytic distributions in this chapter are sometimes
%written $\mathrm{Exponential}(\lambda)$, $\mathrm{Pareto}(x_m,
%\alpha)$ and, for lognormal, $\mathrm{Log}-\normal~(\mu,
%\sigma^2)$.

A linear transformation of $X$ is something like $X' = a X + b$, where
$a$ and $b$ are real numbers.\index{linear transformation}
A family of distributions is closed under
linear transformation if $X'$ is in the same family as $X$. The normal
distribution has this property; if $X \sim \normal~(\mu,
\sigma^2)$, then
%
\[ X' \sim \normal~(a \mu + b, a^{2} \sigma^2) \tag*{(1)} \]
%
Normal distributions are also closed under addition.
If $Z = X + Y$, where $X$ and $Y$ are independent, with
$X \sim \normal~(\mu_{X}, \sigma_{X}^{2})$ and
$Y \sim \normal~(\mu_{Y}, \sigma_{Y}^{2})$, then
%
\[ Z \sim \normal~(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2) \tag*{(2)}\]
%
In the special case $Z = X + X$ (that is, the sum of two independent
draws from the same distribution), we have
%
\[ Z \sim \normal~(2 \mu_X, 2 \sigma_X^2) \]
%
and in general if we draw $n$ independent values of $X$ and add them up,
we have
%
\[ Z \sim \normal~(n \mu_X, n \sigma_X^2) \tag*{(3)}\]


\section{Sampling distributions}

Now we have everything we need to compute the sampling distribution of
$\xbar$. Remember that we compute $\xbar$ by weighing $n$ gorillas,
adding up the total weight, and dividing by $n$.
\index{sampling distribution}
\index{gorilla}
\index{weight}

Assume that the distribution of gorilla weights, $X$, is
approximately normal:
%
\[ X \sim \normal~(\mu, \sigma^2)\]
%
If we weigh $n$ gorillas, the total weight, $Y$, is distributed
%
\[ Y \sim \normal~(n \mu, n \sigma^2) \]
%
using Equation 3. And if we divide by $n$, the sample mean,
$Z$, is distributed
%
\[ Z \sim \normal~(\mu, \sigma^2/n) \]
%
using Equation 1 with $a = 1/n$.

The distribution of $Z$ is the sampling distribution of $\xbar$.
The mean of $Z$ is $\mu$, which shows that $\xbar$ is an unbiased
estimate of $\mu$.
The variance of the sampling distribution
is $\sigma^2 / n$.
\index{biased estimator}
\index{estimator!biased}

So the standard deviation of the sampling distribution, which is the
standard error of the estimate, is $\sigma / \sqrt{n}$. In the
example, $\sigma$ is 7.5 kg and $n$ is 9, so the standard error is 2.5
kg. That result is consistent with what we estimated by simulation,
but much faster to compute!
\index{standard error}
\index{standard deviation}

We can also use the sampling distribution to compute confidence
intervals. A 90\% confidence interval for $\xbar$ is the interval
between the 5th and 95th percentiles of $Z$. Since $Z$ is normally
distributed, we can compute percentiles by evaluating the inverse
CDF.
\index{inverse CDF}
\index{CDF, inverse}
\index{confidence interval}

There is no closed form for the CDF of the normal distribution
or its inverse, but there are fast numerical methods and they
are implemented in SciPy, as we saw in Section~\ref{normal}.
{\tt thinkstats2} provides a wrapper function that makes the
SciPy function a little easier to use:
\index{SciPy}
\index{normal distribution}
\index{wrapper}
\index{closed form}

\begin{verbatim}
def EvalNormalCdfInverse(p, mu=0, sigma=1):
    return scipy.stats.norm.ppf(p, loc=mu, scale=sigma)
\end{verbatim}

Given a probability, {\tt p}, it returns the corresponding
percentile from a normal distribution with parameters {\tt mu}
and {\tt sigma}. For the 90\% confidence interval of $\xbar$,
we compute the 5th and 95th percentiles like this:
\index{percentile}

\begin{verbatim}
>>> thinkstats2.EvalNormalCdfInverse(0.05, mu=90, sigma=2.5)
85.888

>>> thinkstats2.EvalNormalCdfInverse(0.95, mu=90, sigma=2.5)
94.112
\end{verbatim}

So if we run the experiment many times, we expect the
estimate, $\xbar$, to fall in the range $(85.9, 94.1)$ about
90\% of the time. Again, this is consistent with the result
we got by simulation.
\index{simulation}
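
If you prefer to get both endpoints at once, SciPy's {\tt interval}
method does the same computation (a convenience I am adding here,
equivalent to the two calls above):

\begin{verbatim}
ci = scipy.stats.norm.interval(0.90, loc=90, scale=2.5)   # (85.9, 94.1)
\end{verbatim}
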
\section{Representing normal distributions}

To make these calculations easier, I have defined a class called
{\tt Normal} that represents a normal distribution and encodes
the equations in the previous sections. Here's what it looks
like:
\index{Normal}

\begin{verbatim}
class Normal(object):

    def __init__(self, mu, sigma2):
        self.mu = mu
        self.sigma2 = sigma2

    def __str__(self):
        return 'N(%g, %g)' % (self.mu, self.sigma2)
\end{verbatim}

So we can instantiate a Normal that represents the distribution
of gorilla weights:
\index{gorilla}

\begin{verbatim}
>>> dist = Normal(90, 7.5**2)
>>> dist
N(90, 56.25)
\end{verbatim}

{\tt Normal} provides {\tt Sum}, which takes a sample size, {\tt n},
and returns the distribution of the sum of {\tt n} values, using
Equation 3:

\begin{verbatim}
    def Sum(self, n):
        return Normal(n * self.mu, n * self.sigma2)
\end{verbatim}

Normal also knows how to multiply and divide using
Equation 1:

\begin{verbatim}
    def __mul__(self, factor):
        return Normal(factor * self.mu, factor**2 * self.sigma2)

    __rmul__ = __mul__    # so scalar * Normal works, as in __div__

    def __div__(self, divisor):
        return 1.0 / divisor * self
\end{verbatim}

So we can compute the sampling distribution of the mean with sample
size 9:
\index{sampling distribution}
\index{sample size}

\begin{verbatim}
>>> dist_xbar = dist.Sum(9) / 9
>>> dist_xbar.sigma
2.5
\end{verbatim}

The standard deviation of the sampling distribution is 2.5 kg, as we
saw in the previous section. Finally, Normal provides {\tt
Percentile}, which we can use to compute a confidence interval:
\index{standard deviation}
\index{confidence interval}

\begin{verbatim}
>>> dist_xbar.Percentile(5), dist_xbar.Percentile(95)
(85.888, 94.112)
\end{verbatim}

And that's the same answer we got before. We'll use the Normal
class again later, but before we go on, we need one more bit of
analysis.
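
Two members of {\tt Normal} used above do not appear in these
excerpts: the {\tt sigma} attribute and {\tt Percentile}. The actual
definitions are in {\tt normal.py}; presumably they look something
like this sketch:

\begin{verbatim}
    # class Normal (sketch, not the actual code)

    @property
    def sigma(self):
        """Standard deviation."""
        return math.sqrt(self.sigma2)

    def Percentile(self, p):
        """Returns the pth percentile; p is in percent (0-100)."""
        return scipy.stats.norm.ppf(p / 100.0,
                                    loc=self.mu, scale=self.sigma)
\end{verbatim}

For {\tt dist\_xbar}, {\tt sigma2} is $56.25/9 = 6.25$, so {\tt sigma}
is 2.5, which matches the session above.
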
\section{Central limit theorem}
\label{CLT}

As we saw in the previous sections, if we add values drawn from normal
distributions, the distribution of the sum is normal.
Most other distributions don't have this property;
if we add values drawn from other distributions, the sum does not
generally have an analytic distribution.
\index{sum}
\index{normal distribution} \index{distribution!normal}
\index{Gaussian distribution} \index{distribution!Gaussian}

But if we add up {\tt n} values from
almost any distribution, the distribution of the sum converges to
normal as {\tt n} increases.

More specifically, if the distribution of the values has mean and
standard deviation $\mu$ and $\sigma$, the distribution of the sum is
approximately $\normal(n \mu, n \sigma^2)$.
\index{standard deviation}

This result is the Central Limit Theorem (CLT). It is one of the
most useful tools for statistical analysis, but it comes with
caveats:
\index{Central Limit Theorem}
\index{CLT}

\begin{itemize}

\item The values have to be drawn independently. If they are
correlated, the CLT doesn't apply (although this is seldom a problem
in practice).
\index{independent}

\item The values have to come from the same distribution (although
this requirement can be relaxed).
\index{identical}

\item The values have to be drawn
from a distribution with finite mean and variance. So most Pareto
distributions are out.
\index{mean}
\index{variance}
\index{Pareto distribution}
\index{distribution!Pareto}
\index{exponential distribution}
\index{distribution!exponential}

\item The rate of convergence depends
on the skewness of the distribution. Sums from an exponential
distribution converge for small {\tt n}. Sums from a
lognormal distribution require larger sample sizes.
\index{lognormal distribution}
\index{distribution!lognormal}
\index{skewness}

\end{itemize}

The Central Limit Theorem explains the prevalence
of normal distributions in the natural world. Many characteristics of
living things are affected by genetic
and environmental factors whose effect is additive. The characteristics
we measure are the sum of a large number of small effects, so their
distribution tends to be normal.
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{Central Limit Theorem}
\index{CLT}


\section{Testing the CLT}

To see how the Central Limit Theorem works, and when it doesn't,
let's try some experiments. First, we'll try
an exponential distribution:

\begin{verbatim}
def MakeExpoSamples(beta=2.0, iters=1000):
    samples = []
    for n in [1, 10, 100]:
        sample = [np.sum(np.random.exponential(beta, n))
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples
\end{verbatim}

{\tt MakeExpoSamples} generates samples of sums of exponential values
(I use ``exponential values'' as shorthand for ``values from an
exponential distribution'').
{\tt beta} is the parameter of the distribution; {\tt iters}
is the number of sums to generate.

To explain this function, I'll start from the inside and work my way
out. Each time we call {\tt np.random.exponential}, we get a sequence
of {\tt n} exponential values and compute its sum. {\tt sample}
is a list of these sums, with length {\tt iters}.
\index{NumPy}

It is easy to get {\tt n} and {\tt iters} confused: {\tt n} is the
number of terms in each sum; {\tt iters} is the number of sums we
compute in order to characterize the distribution of sums.

The return value is a list of {\tt (n, sample)} pairs. For
each pair, we make a normal probability plot:
\index{thinkplot}
\index{normal probability plot}

\begin{verbatim}
def NormalPlotSamples(samples, plot=1, ylabel=''):
    for n, sample in samples:
        thinkplot.SubPlot(plot)
        thinkstats2.NormalProbabilityPlot(sample)

        thinkplot.Config(title='n=%d' % n, ylabel=ylabel)
        plot += 1
\end{verbatim}

{\tt NormalPlotSamples} takes the list of pairs from {\tt
MakeExpoSamples} and generates a row of normal probability plots.
\index{normal probability plot}

\begin{figure}
% normal.py
\centerline{\includegraphics[height=3.5in]{figs/normal1.pdf}}
\caption{Distributions of sums of exponential values (top row) and
lognormal values (bottom row).}
\label{normal1}
\end{figure}

Figure~\ref{normal1} (top row) shows
the results. With {\tt n=1}, the distribution of the sum is still
exponential, so the normal probability plot is not a straight line.
But with {\tt n=10} the distribution of the sum is approximately
normal, and with {\tt n=100} it is all but indistinguishable from
normal.

Figure~\ref{normal1} (bottom row) shows similar results for a
lognormal distribution. Lognormal distributions are generally more
skewed than exponential distributions, so the distribution of sums
takes longer to converge. With {\tt n=10} the normal
probability plot is nowhere near straight, but with {\tt n=100}
it is approximately normal.
\index{lognormal distribution}
\index{distribution!lognormal}
\index{skewness}
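
The lognormal sums in the bottom row are generated the same way. A
generator along these lines is enough to reproduce the experiment
(the parameters here are arbitrary; the code that actually generated
the figure is in {\tt normal.py}):

\begin{verbatim}
def MakeLognormalSamples(mu=1.0, sigma=1.0, iters=1000):
    samples = []
    for n in [1, 10, 100]:
        sample = [np.sum(np.random.lognormal(mu, sigma, n))
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples
\end{verbatim}
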
\begin{figure}
% normal.py
\centerline{\includegraphics[height=3.5in]{figs/normal2.pdf}}
\caption{Distributions of sums of Pareto values (top row) and
correlated exponential values (bottom row).}
\label{normal2}
\end{figure}

Pareto distributions are even more skewed than lognormal. Depending
on the parameters, many Pareto distributions do not have finite mean
and variance. As a result, the Central Limit Theorem does not apply.
Figure~\ref{normal2} (top row) shows distributions of sums of
Pareto values. Even with {\tt n=100} the normal probability plot
is far from straight.
\index{Pareto distribution}
\index{distribution!Pareto}
\index{Central Limit Theorem}
\index{CLT}
\index{normal probability plot}

I also mentioned that the CLT does not apply if the values are correlated.
To test that, I generate correlated values from an exponential
distribution. The algorithm for generating correlated values is
(1) generate correlated normal values, (2) use the normal CDF
to transform the values to uniform, and (3) use the inverse
exponential CDF to transform the uniform values to exponential.
\index{inverse CDF}
\index{CDF, inverse}
\index{correlation}
\index{random number}

{\tt GenerateCorrelated} returns an iterator of {\tt n} normal values
with serial correlation {\tt rho}:
\index{iterator}

\begin{verbatim}
def GenerateCorrelated(rho, n):
    x = random.gauss(0, 1)
    yield x

    sigma = math.sqrt(1 - rho**2)
    for _ in range(n-1):
        x = random.gauss(x*rho, sigma)
        yield x
\end{verbatim}

The first value is a standard normal value. Each subsequent value
depends on its predecessor: if the previous value is {\tt x}, the mean of
the next value is {\tt x*rho}, with variance {\tt 1-rho**2}. Note that {\tt
random.gauss} takes the standard deviation as the second argument,
not the variance.
\index{standard deviation}
\index{standard normal distribution}

{\tt GenerateExpoCorrelated}
takes the resulting sequence and transforms it to exponential:

\begin{verbatim}
def GenerateExpoCorrelated(rho, n):
    normal = list(GenerateCorrelated(rho, n))
    uniform = scipy.stats.norm.cdf(normal)
    expo = scipy.stats.expon.ppf(uniform)
    return expo
\end{verbatim}

{\tt normal} is a list of correlated normal values. {\tt uniform}
is a sequence of uniform values between 0 and 1. {\tt expo} is
a correlated sequence of exponential values.
{\tt ppf} stands for ``percent point function,'' which is another
name for the inverse CDF.
\index{inverse CDF}
\index{CDF, inverse}
\index{percent point function}
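
A quick sanity check (my own, not in {\tt normal.py}): the lag-one
correlation of the normal values should be close to {\tt rho}, and a
bit lower, but still substantial, after the transformation to
exponential:

\begin{verbatim}
normal = np.array(list(GenerateCorrelated(0.9, 10000)))
expo = GenerateExpoCorrelated(0.9, 10000)

print(np.corrcoef(normal[:-1], normal[1:])[0][1])   # close to 0.9
print(np.corrcoef(expo[:-1], expo[1:])[0][1])       # a bit lower
\end{verbatim}
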
Figure~\ref{normal2} (bottom row) shows distributions of sums of
correlated exponential values with {\tt rho=0.9}. The correlation
slows the rate of convergence; nevertheless, with {\tt n=100} the
normal probability plot is nearly straight. So even though the CLT
does not strictly apply when the values are correlated, moderate
correlations are seldom a problem in practice.
\index{normal probability plot}
\index{correlation}

These experiments are meant to show how the Central Limit Theorem
works, and what happens when it doesn't. Now let's see how we can
use it.


\section{Applying the CLT}
\label{usingCLT}

To see why the Central Limit Theorem is useful, let's get back
to the example in Section~\ref{testdiff}: testing the apparent
difference in mean pregnancy length for first babies and others.
As we've seen, the apparent difference is about
0.078 weeks:
\index{pregnancy length}
\index{Central Limit Theorem}
\index{CLT}

\begin{verbatim}
>>> live, firsts, others = first.MakeFrames()
>>> delta = firsts.prglngth.mean() - others.prglngth.mean()
>>> delta
0.078
\end{verbatim}

Remember the logic of hypothesis testing: we compute a p-value, which
is the probability of the observed difference under the null
hypothesis; if it is small, we conclude that the observed difference
is unlikely to be due to chance.
\index{p-value}
\index{null hypothesis}
\index{hypothesis testing}

In this example, the null hypothesis is that the distribution of
pregnancy lengths is the same for first babies and others.
So we can compute the sampling distribution of the mean
like this:
\index{sampling distribution}

\begin{verbatim}
dist1 = SamplingDistMean(live.prglngth, len(firsts))
dist2 = SamplingDistMean(live.prglngth, len(others))
\end{verbatim}

Both sampling distributions are based on the same population, which is
the pool of all live births. {\tt SamplingDistMean} takes this
sequence of values and the sample size, and returns a Normal object
representing the sampling distribution:

\begin{verbatim}
def SamplingDistMean(data, n):
    mean, var = data.mean(), data.var()
    dist = Normal(mean, var)
    return dist.Sum(n) / n
\end{verbatim}

{\tt mean} and {\tt var} are the mean and variance of
{\tt data}. We approximate the distribution of the data with
a normal distribution, {\tt dist}.

In this example, the data are not normally distributed, so this
approximation is not very good. But then we compute {\tt dist.Sum(n)
/ n}, which is the sampling distribution of the mean of {\tt n}
values. Even if the data are not normally distributed, the sampling
distribution of the mean is, by the Central Limit Theorem.
\index{Central Limit Theorem}
\index{CLT}

Next, we compute the sampling distribution of the difference
in the means.
The {\tt Normal} class knows how to perform
subtraction using Equation 2:
\index{Normal}

\begin{verbatim}
    def __sub__(self, other):
        return Normal(self.mu - other.mu,
                      self.sigma2 + other.sigma2)
\end{verbatim}

So we can compute the sampling distribution of the difference like this:

\begin{verbatim}
>>> dist = dist1 - dist2
>>> dist
N(0, 0.0032)
\end{verbatim}

The mean is 0, which makes sense because we expect two samples from
the same distribution to have the same mean, on average. The variance
of the sampling distribution is 0.0032.
\index{sampling distribution}

{\tt Normal} provides {\tt Prob}, which evaluates the normal CDF.
We can use {\tt Prob} to compute the probability of a
difference as large as {\tt delta} under the null hypothesis:
\index{null hypothesis}

\begin{verbatim}
>>> 1 - dist.Prob(delta)
0.084
\end{verbatim}

Which means that the p-value for a one-sided test is 0.084. For
a two-sided test we would also compute
\index{p-value}
\index{one-sided test}
\index{two-sided test}

\begin{verbatim}
>>> dist.Prob(-delta)
0.084
\end{verbatim}

Which is the same because the normal distribution is symmetric.
The sum of the tails is 0.168, which is consistent with the estimate
in Section~\ref{testdiff}, which was 0.17.
\index{symmetric}



\section{Correlation test}

In Section~\ref{corrtest} we used a permutation test for the correlation
between birth weight and mother's age, and found that it is
statistically significant, with p-value less than 0.001.
\index{p-value}
\index{birth weight}
\index{weight!birth}
\index{permutation}
\index{significant} \index{statistically significant}

Now we can do the same thing analytically. The method is based
on this mathematical result: given two variables that are normally distributed
and uncorrelated, if we generate a sample with size $n$,
compute Pearson's correlation, $r$, and then compute the transformed
correlation
%
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
%
the distribution of $t$ is Student's t-distribution with parameter
$n-2$. The t-distribution is an analytic distribution; the CDF can
be computed efficiently using gamma functions.
\index{Pearson coefficient of correlation}
\index{correlation}

We can use this result to compute the sampling distribution of
correlation under the null hypothesis; that is, if we generate
uncorrelated sequences of normal values, what is the distribution of
their correlation? {\tt StudentCdf} takes the sample size, {\tt n}, and
returns the sampling distribution of correlation:
\index{null hypothesis}
\index{sampling distribution}

\begin{verbatim}
def StudentCdf(n):
    ts = np.linspace(-3, 3, 101)
    ps = scipy.stats.t.cdf(ts, df=n-2)
    rs = ts / np.sqrt(n - 2 + ts**2)
    return thinkstats2.Cdf(rs, ps)
\end{verbatim}

{\tt ts} is a NumPy array of values for $t$, the transformed
correlation. {\tt ps} contains the corresponding probabilities,
computed using the CDF of the Student's t-distribution implemented in
SciPy.
The parameter of the t-distribution, {\tt df}, stands for
``degrees of freedom.'' I won't explain that term, but you can read
about it at
\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.
\index{NumPy}
\index{SciPy}
\index{Student's t-distribution}
\index{distribution!Student's t}
\index{degrees of freedom}

\begin{figure}
% normal.py
\centerline{\includegraphics[height=2.5in]{figs/normal4.pdf}}
\caption{Sampling distribution of correlations for uncorrelated
normal variables.}
\label{normal4}
\end{figure}

To get from {\tt ts} to the correlation coefficients, {\tt rs},
we apply the inverse transform,
%
\[ r = t / \sqrt{n - 2 + t^2} \]
%
The result is the sampling distribution of $r$ under the null hypothesis.
Figure~\ref{normal4} shows this distribution along with the distribution
we generated in Section~\ref{corrtest} by resampling. They are nearly
identical. Although the actual distributions are not normal,
Pearson's coefficient of correlation is based on sample means
and variances. By the Central Limit Theorem, these moment-based
statistics are normally distributed even if the data are not.
\index{Central Limit Theorem}
\index{CLT}
\index{null hypothesis}
\index{resampling}

From Figure~\ref{normal4}, we can see that the
observed correlation, 0.07, is unlikely to occur if the variables
are actually uncorrelated.
Using the analytic distribution, we can compute just how unlikely:
\index{analytic distribution}

\begin{verbatim}
t = r * math.sqrt((n-2) / (1-r**2))
p_value = 1 - scipy.stats.t.cdf(t, df=n-2)
\end{verbatim}

We compute the value of {\tt t} that corresponds to {\tt r=0.07}, and
then evaluate the t-distribution at {\tt t}. The result is {\tt
2.9e-11}. This example demonstrates an advantage of the analytic
method: we can compute very small p-values. But in practice it
usually doesn't matter.
\index{SciPy}
\index{p-value}



\section{Chi-squared test}

In Section~\ref{casino2} we used the chi-squared statistic to
test whether a die is crooked. The chi-squared statistic measures
the total normalized deviation from the expected values in a table:
%
\[ \goodchi^2 = \sum_i \frac{{(O_i - E_i)}^2}{E_i} \]
%
One reason the chi-squared statistic is widely used is that
its sampling distribution under the null hypothesis is analytic;
by a remarkable coincidence\footnote{Not really.}, it is called
the chi-squared distribution. Like the t-distribution, the
chi-squared CDF can be computed efficiently using gamma functions.
\index{deviation}
\index{null hypothesis}
\index{sampling distribution}
\index{chi-squared test}
\index{chi-squared distribution}
\index{distribution!chi-squared}
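
As a reminder of what the statistic itself computes, here is the
formula translated directly into NumPy, with made-up counts for 60
rolls of a die (not the data from Section~\ref{casino2}):

\begin{verbatim}
observed = np.array([12, 8, 13, 9, 8, 10])   # made-up counts
expected = np.ones(6) * 10                   # fair die, 60 rolls

chi2_toy = np.sum((observed - expected)**2 / expected)
# chi2_toy is 2.2 for these counts
\end{verbatim}
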
\begin{figure}
% normal.py
\centerline{\includegraphics[height=2.5in]{figs/normal5.pdf}}
\caption{Sampling distribution of chi-squared statistics for
a fair six-sided die.}
\label{normal5}
\end{figure}

SciPy provides an implementation of the chi-squared distribution,
which we use to compute the sampling distribution of the
chi-squared statistic:
\index{SciPy}

\begin{verbatim}
def ChiSquaredCdf(n):
    xs = np.linspace(0, 25, 101)
    ps = scipy.stats.chi2.cdf(xs, df=n-1)
    return thinkstats2.Cdf(xs, ps)
\end{verbatim}

Figure~\ref{normal5} shows the analytic result along with the
distribution we got by resampling. They are very similar,
especially in the tail, which is the part we usually care most
about.
\index{resampling}
\index{tail}

We can use this distribution to compute the p-value of the
observed test statistic, {\tt chi2}:
\index{test statistic}
\index{p-value}

\begin{verbatim}
p_value = 1 - scipy.stats.chi2.cdf(chi2, df=n-1)
\end{verbatim}

The result is 0.041, which is consistent with the result
from Section~\ref{casino2}.

The parameter of the chi-squared distribution is ``degrees of
freedom'' again. In this case the correct parameter is {\tt n-1},
where {\tt n} is the size of the table, 6. Choosing this parameter
can be tricky; to be honest, I am never confident that I have it
right until I generate something like Figure~\ref{normal5} to compare
the analytic results to the resampling results.
\index{degrees of freedom}


\section{Discussion}

This book focuses on computational methods like resampling and
permutation. These methods have several advantages over analysis:
\index{resampling}
\index{permutation}
\index{computational methods}

\begin{itemize}

\item They are easier to explain and understand. For example, one of
the most difficult topics in an introductory statistics class is
hypothesis testing. Many students don't really understand what
p-values are. I think the approach I presented in
Chapter~\ref{testing}---simulating the null hypothesis and
computing test statistics---makes the fundamental idea clearer.
\index{p-value}
\index{null hypothesis}

\item They are robust and versatile. Analytic methods are often based
on assumptions that might not hold in practice. Computational
methods require fewer assumptions, and can be adapted and extended
more easily.
\index{robust}

\item They are debuggable. Analytic methods are often like a black
box: you plug in numbers and they spit out results. But it's easy
to make subtle errors, hard to be confident that the results are
right, and hard to find the problem if they are not. Computational
methods lend themselves to incremental development and testing,
which fosters confidence in the results.
\index{debugging}

\end{itemize}

But there is one drawback: computational methods can be slow.
Taking
into account these pros and cons, I recommend the following process:

\begin{enumerate}

\item Use computational methods during exploration. If you find a
satisfactory answer and the run time is acceptable, you can stop.
\index{exploration}

\item If run time is not acceptable, look for opportunities to
optimize. Using analytic methods is one of several approaches to
optimization.

\item If replacing a computational method with an analytic method is
appropriate, use the computational method as a basis of comparison,
providing mutual validation between the computational and
analytic results.
\index{model}

\end{enumerate}

For the vast majority of problems I have worked on, I didn't have
to go past Step 1.


\section{Exercises}

A solution to these exercises is in \verb"chap14soln.py".

\begin{exercise}
\label{log_clt}
In Section~\ref{lognormal}, we saw that the distribution
of adult weights is approximately lognormal. One possible
explanation is that the weight a person
gains each year is proportional to their current weight.
In that case, adult weight is the product of a large number
of multiplicative factors:
%
\[ w = w_0 f_1 f_2 \ldots f_n \]
%
where $w$ is adult weight, $w_0$ is birth weight, and $f_i$
is the weight gain factor for year $i$.
\index{birth weight}
\index{weight!birth}
\index{lognormal distribution}
\index{distribution!lognormal}
\index{adult weight}

The log of a product is the sum of the logs of the
factors:
%
\[ \log w = \log w_0 + \log f_1 + \log f_2 + \cdots + \log f_n \]
%
So by the Central Limit Theorem, the distribution of $\log w$ is
approximately normal for large $n$, which implies that the
distribution of $w$ is lognormal.
\index{Central Limit Theorem}
\index{CLT}

To model this phenomenon, choose a distribution for $f$ that seems
reasonable, then generate a sample of adult weights by choosing a
random value from the distribution of birth weights, choosing a
sequence of factors from the distribution of $f$, and computing the
product. What value of $n$ is needed to converge to a lognormal
distribution?
\index{model}

\index{logarithm}
\index{product}

\end{exercise}
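
A possible starting point for Exercise~\ref{log_clt} (a sketch only;
the choice of a lognormal distribution for the factors, and its
parameters, are arbitrary, just to get the simulation going):

\begin{verbatim}
def GenerateAdultWeight(birth_weights, n):
    # one simulated adult weight: a random birth weight (in lbs)
    # multiplied by n annual gain factors
    bw = np.random.choice(birth_weights)
    factors = np.random.lognormal(0.08, 0.03, size=n)
    return bw * np.prod(factors)
\end{verbatim}

Generating many adult weights for several values of {\tt n} and
making normal probability plots of their logs, as we did when testing
the CLT, shows how quickly the distribution converges.
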
\begin{exercise}
In Section~\ref{usingCLT} we used the Central Limit Theorem to find
the sampling distribution of the difference in means, $\delta$, under
the null hypothesis that both samples are drawn from the same
population.
\index{null hypothesis}
\index{sampling distribution}

We can also use this distribution to find the standard error of the
estimate and confidence intervals, but that would only be
approximately correct. To be more precise, we should compute the
sampling distribution of $\delta$ under the alternative hypothesis that
the samples are drawn from different populations.
\index{standard error}
\index{standard deviation}
\index{confidence interval}

Compute this distribution and use it to calculate the standard error
and a 90\% confidence interval for the difference in means.
\end{exercise}


\begin{exercise}
In a recent paper\footnote{``Evidence for the persistent effects of an
intervention to mitigate gender-stereotypical task allocation within
student engineering teams,'' Proceedings of the IEEE Frontiers in Education
Conference, 2014.}, Stein et al.~investigate the
effects of an intervention intended to mitigate gender-stereotypical
task allocation within student engineering teams.

Before and after the intervention, students responded to a survey that
asked them to rate their contribution to each aspect of class projects on
a 7-point scale.

Before the intervention, male students reported higher scores for the
programming aspect of the project than female students; on average men
reported a score of 3.57 with standard error 0.28. Women reported
1.91, on average, with standard error 0.32.
\index{standard error}

Compute the sampling distribution of the gender gap (the difference in
means), and test whether it is statistically significant. Because you
are given standard errors for the estimated means, you don't need to
know the sample size to figure out the sampling distributions.
\index{significant} \index{statistically significant}

After the intervention, the gender gap was smaller: the average score
for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE
0.16). Again, compute the sampling distribution of the gender gap and
test it.
\index{gender gap}

Finally, estimate the change in gender gap; what is the sampling
distribution of this change, and is it statistically significant?
\index{significant} \index{statistically significant}
\end{exercise}

\cleardoublepage
\phantomsection
\addcontentsline{toc}{chapter}{\indexname}%
\printindex

\clearemptydoublepage
%\blankpage
%\blankpage
%\blankpage


\end{document}