% LaTeX source for ``Think Bayes: Bayesian Statistics Made Simple''
% Copyright 2012 Allen B. Downey.

% License: Creative Commons Attribution-NonCommercial 3.0 Unported License.
% http://creativecommons.org/licenses/by-nc/3.0/
%

\documentclass[12pt]{book}
\usepackage[width=5.5in,height=8.5in,
hmarginratio=3:2,vmarginratio=1:1]{geometry}

% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)

\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{mathpazo}
\usepackage{url}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{makeidx}
\usepackage{setspace}
\usepackage{hevea}
\usepackage{upquote}
\usepackage{fancyhdr}
\usepackage[bookmarks]{hyperref}

\title{Think Bayes}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Bayes: Bayesian Statistics Made Simple}
\newcommand{\theversion}{1.0.8}

% these styles get translated in CSS for the HTML version
\newstyle{a:link}{color:black;}
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
\newstyle{img}{border:0px}

% change the arrows in the HTML version
\setlinkstext
{\imgsrc[ALT="Previous"]{back.png}}
{\imgsrc[ALT="Up"]{up.png}}
{\imgsrc[ALT="Next"]{next.png}}

\makeindex

\newif\ifplastex
\plastexfalse

\begin{document}

\frontmatter

\ifplastex

\else
\fi

\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}

\ifplastex
\usepackage{localdef}
\maketitle

\else

\newtheorem{exercise}{Exercise}[chapter]

\input{latexonly}

\begin{latexonly}

\newtheoremstyle{exercise}% name of the style to be used
{\topsep}% measure of space to leave above the theorem. E.g.: 3pt
{\topsep}% measure of space to leave below the theorem.
% E.g.: 3pt
{}% name of font to use in the body of the theorem
{0pt}% measure of space to indent
{\bfseries}% name of head font
{}% punctuation between head and body
{ }% space after theorem head; " " = normal interword space
{}% Manually specify head

\theoremstyle{exercise}

\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}

% TITLE PAGES FOR LATEX VERSION

%-half title--------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vfill

\end{flushright}

%--verso------------------------------------------------------

\blankpage
\blankpage

%--title page--------------------------------------------------
\pagebreak
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vspace{1in}

{\Large
Allen B. Downey\\
}

\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

\vfill

\end{flushright}

%--copyright--------------------------------------------------
\pagebreak
\thispagestyle{empty}

Copyright \copyright ~2012 Allen B. Downey.

\vspace{0.2in}

\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported
License, which is available at \url{http://creativecommons.org/licenses/by-nc/3.0/}.

\vspace{0.2in}

\end{latexonly}

% HTMLONLY

\begin{htmlonly}

% TITLE PAGE FOR HTML VERSION

{\Large \thetitle}

{\large Allen B.
Downey}

Version \theversion

\vspace{0.25in}

Copyright 2012 Allen B. Downey

\vspace{0.25in}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0
Unported License, which is available at
\url{http://creativecommons.org/licenses/by-nc/3.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

\fi
% END OF THE PART WE SKIP FOR PLASTEX

\chapter{Preface}
\label{preface}

\section{My theory, which is mine}

The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.

Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.

I think this presentation is easier to understand, at least for people with
programming skills. It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.

Also, it provides a smooth development path from simple examples to
real-world problems. Chapter~\ref{estimation} is a good example. It
starts with a simple example involving dice, one of the staples of
basic probability.
From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's
{\it Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.


\section{Modeling and approximation}

Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}

For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}

In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.

I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).

Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors.
But for
real-world problems, numerical errors are almost always
smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.

On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.

So I recommend a general process with these steps:

\begin{enumerate}

\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.

\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.

\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider.
You can review your code
and look for optimizations; for example, if you cache previously
computed results you might be able to avoid redundant computation.
Or you can look for analytic methods that yield computational
shortcuts.

\end{enumerate}

One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.

Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}


\section{Working with the code}
\label{download}

The code and sound samples used in this book are available from
\url{https://github.com/AllenDowney/ThinkBayes}. Git is a version
control system that allows you to keep track of the files that
make up a project. A collection of files under Git's control is
called a ``repository''. GitHub is a hosting service that provides
storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to
work with the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}

\item Or you could clone
my repository.
You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.

\end{itemize}

The code for the first edition of the book works with Python 2.
If you are using Python 3, you might want to use the updated code
in \url{https://github.com/AllenDowney/ThinkBayes2} instead.

I developed this book using Anaconda from
Continuum Analytics, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it does a user-level
installation, not system-level, so you don't need administrative
privileges. You can
download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation,
\url{http://www.scipy.org/};
\index{SciPy}

\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}

Many of the examples in this book use classes and functions defined in
{\tt thinkbayes.py}.
Some of them also use {\tt thinkplot.py}, which
provides wrappers for some of the functions in {\tt pyplot}, which is
part of {\tt matplotlib}.


\section{Code style}

Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}

Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".

I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.

Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.

And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.


\section{Prerequisites}

There are several excellent modules for doing Bayesian statistics in
Python, including {\tt pymc} and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.

Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code.
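The flavor of the code in the later chapters is easy to preview,
though. As a purely illustrative sketch (not the actual
{\tt thinkbayes.py} interface), a probability mass function can be
stored as a plain Python dictionary that maps each value to its
probability:

```python
# Illustrative sketch only: a PMF as a plain dictionary mapping
# values to probabilities. The book's own classes are more capable.

def normalize(pmf):
    """Scale the probabilities so they add up to 1."""
    total = sum(pmf.values())
    for value in pmf:
        pmf[value] /= total
    return pmf

# A PMF for a fair six-sided die: each outcome gets equal weight,
# then normalization turns the weights into probabilities.
die = normalize({side: 1 for side in range(1, 7)})
print(die[3])  # each side has probability 1/6
```

The dictionary-based {\tt Pmf} described next wraps this idea in a
friendlier interface.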
Chapter~\ref{compstat} introduces {\tt Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces {\tt Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.

In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).



\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!

\small

\begin{itemize}

\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods.
With his473permission, I use several problems from474his book as examples.475476\item This book also benefited from my interactions with Sanjoy477Mahajan, especially in fall 2012, when I audited his class on478Bayesian Inference at Olin College.479480\item I wrote parts of this book during project nights with the Boston481Python User Group, so I would like to thank them for their482company and pizza.483484\item Jonathan Edwards sent in the first typo.485486\item George Purkins found a markup error.487488\item Olivier Yiptong sent several helpful suggestions.489490\item Yuriy Pasichnyk found several errors.491492\item Kristopher Overholt sent a long list of corrections and suggestions.493494\item Robert Marcus found a misplaced {\it i}.495496\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.497498\item Markus Dobler pointed out that drawing cookies from a bowl499with replacement is an unrealistic scenario.500501\item Tom Pollard and Paul A. Giannaros spotted a version problem with502some of the numbers in the train example.503504\item Ram Limbu found a typo and suggested a clarification.505506\item In spring 2013, students in my class, Computational Bayesian507Statistics, made many helpful corrections and suggestions: Kai508Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun509Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,510Brendan Ritter, and Evan Simpson.511512\item Greg Marra and Matt Aasted helped me clarify the discussion of513{\it The Price is Right} problem.514515\item Marcus Ogren pointed out that the original statement of the516locomotive problem was ambiguous.517518\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the519book and found many opportunities for improvement.520521\item James Lawry spotted a math error.522523\item Ben Kahle found a reference to the wrong figure.524525\item Jeffrey Law found an inconsistency between the text and the code.526527\item Linda Pescatore found a typo 
and made some helpful suggestions.

\item Tomasz Mi\k{a}sko sent many excellent corrections and suggestions.

% ENDCONTRIB

\end{itemize}

\normalsize

\clearemptydoublepage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\clearemptydoublepage

\end{latexonly}

% START THE BOOK
\mainmatter

\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}


\chapter{Bayes's Theorem}
\label{intro}

\section{Conditional probability}

The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}

A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction. The value
1 represents certainty that a fact is true, or that a prediction
will come true. The value 0 represents certainty
that the fact is false.
\index{degree of belief}

Intermediate values represent degrees of certainty. The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not. For example, the probability that a tossed
coin lands face up is very close to 50\%.
\index{coin toss}

A conditional probability is a probability based on some background
information. For example, I want to know the probability
that I will have a heart attack in the next year. According to the
CDC, ``Every year about 785,000 Americans have a first coronary attack.
(\url{http://www.cdc.gov/heartdisease/facts.htm})''
\index{heart attack}

The U.S.
population is about 311 million, so the probability that a
randomly chosen American will have a heart attack in the next year is
roughly 0.3\%.

But I am not a randomly chosen American. Epidemiologists have
identified many factors that affect the risk of heart attacks;
depending on those factors, my risk might be higher or lower than
average.

I am male, 45 years old, and I have borderline high cholesterol.
Those factors increase my chances. However, I have low blood pressure
and I don't smoke, and those factors decrease my chances.

Plugging everything into the online calculator at
\url{http://cvdrisk.nhlbi.nih.gov/calculator.asp}, I find that my
risk of a heart attack in the next year is about 0.2\%, less than the
national average. That value is a conditional probability, because it
is based on a number of factors that make up my ``condition.''

The usual notation for conditional probability is \p{A|B}, which
is the probability of $A$ given that $B$ is true. In this
example, $A$ represents the prediction that I will have a heart
attack in the next year, and $B$ is the set of conditions I listed.


\section{Conjoint probability}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true.
I write \p{A \AND B} to mean the
probability that $A$ and $B$ are both true.
\index{conjoint probability}
\index{probability!conjoint}

If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.

But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the outcome of the first event does
not change the probability of the second. Or, more formally,
\p{B|A} = \p{B}.
\index{independence}
\index{dependence}

Here is a different example where the events are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow. If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.

In general, the probability of a conjunction is
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
for any $A$ and $B$. So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.


\section{The cookie problem}

We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.} Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.
\index{Bayes's theorem}
\index{cookie problem}

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla.
What is667the probability that it came from Bowl 1?668669This is a conditional probability; we want \p{\T{Bowl 1} |670\T{vanilla}}, but it is not obvious how to compute it. If I asked a671different question---the probability of a vanilla cookie given Bowl6721---it would be easy:673%674\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]675%676Sadly, \p{A|B} is {\em not} the same as \p{B|A}, but there677is a way to get from one to the other: Bayes's theorem.678679680\section{Bayes's theorem}681682At this point we have everything we need to derive Bayes's theorem.683We'll start with the observation that conjunction is commutative; that is684%685\[ \p{A \AND B} = \p{B \AND A} \]686%687for any events $A$ and $B$.688\index{Bayes's theorem!derivation}689\index{conjunction}690691Next, we write the probability of a conjunction:692%693\[ \p{A \AND B} = \p{A}~\p{B|A} \]694%695Since we have not said anything about what $A$ and $B$ mean, they696are interchangeable. Interchanging them yields697%698\[ \p{B \AND A} = \p{B}~\p{A|B} \]699%700That's all we need. Pulling those pieces together, we get701%702\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]703%704Which means there are two ways to compute the conjunction.705If you have \p{A}, you multiply by the conditional706probability \p{B|A}. Or you can do it the other way around; if you707know \p{B}, you multiply by \p{A|B}. Either way you should get708the same thing.709710Finally we can divide through by \p{B}:711%712\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]713%714And that's Bayes's theorem! It might not look like much, but715it turns out to be surprisingly powerful.716717For example, we can use it to solve the cookie problem. I'll write718$B_1$ for the hypothesis that the cookie came from Bowl 1719and $V$ for the vanilla cookie. Plugging in Bayes's theorem720we get721%722\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]723%724The term on the left is what we want: the probability of Bowl 1, given725that we chose a vanilla cookie. 
The terms on the right are:

\begin{itemize}

\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned
by what kind of cookie we got. Since the problem says we chose a
bowl at random, we can assume $\p{B_1} = 1/2$.

\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.

\item \p{V}: This is the probability of drawing a vanilla cookie from
either bowl. Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie. Between the two bowls there are
50 vanilla and 30
chocolate cookies, so \p{V} = 5/8.

\end{itemize}

Putting it together, we have
%
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
%
which reduces to 3/5. So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
\index{evidence}

This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from \p{B|A} to \p{A|B}. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.


\section{The diachronic interpretation}

There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.
\index{diachronic interpretation}

This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}.
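As a quick mechanical check of the cookie-problem arithmetic from the
previous section, the same update can be done in a few lines of plain
Python. This is an illustrative sketch only, not the book's code (the
book's {\tt Suite} class does this properly in Chapter~\ref{estimation}):

```python
# Hypothetical sketch: the cookie problem as prior times likelihood,
# followed by normalization. Not the book's actual code.

priors = {'Bowl 1': 0.5, 'Bowl 2': 0.5}
likelihoods = {'Bowl 1': 30 / 40, 'Bowl 2': 20 / 40}  # p(vanilla | bowl)

# Multiply each prior by the likelihood of the vanilla cookie...
unnorm = {h: priors[h] * likelihoods[h] for h in priors}

# ...and divide through by the normalizing constant p(V) = 5/8.
p_V = sum(unnorm.values())
posteriors = {h: p / p_V for h, p in unnorm.items()}

print(posteriors['Bowl 1'])  # 0.6, the 3/5 computed above
```

The multiply-then-normalize pattern in this sketch is the update
performed, in one form or another, throughout the book.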
``Diachronic'' means that something
is happening over time; in this case
the probability of the hypotheses changes, over time, as
we see new data.

Rewriting Bayes's theorem with $H$ and $D$ yields:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
In this interpretation, each term has a name:
\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}

\begin{itemize}

\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.

\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.

\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.

\item \p{D} is the probability of the data under any hypothesis,
called the {\bf normalizing constant}.

\end{itemize}

Sometimes we can compute the prior based on background
information. For example, the cookie problem specifies that we choose
a bowl at random with equal probability.

In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.
\index{subjective prior}

The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.

The normalizing constant can be tricky.
It is supposed to be the
probability of seeing the data under any hypothesis at all, but in the
most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses
that are
\index{mutually exclusive}
\index{collectively exhaustive}

\begin{description}

\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and

\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.

\end{description}

I use the word {\bf suite} for a set of hypotheses that has these
properties.
\index{suite}

In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.

In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
%
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier by mentally combining the two
bowls.
\index{total probability}


\newcommand{\MM}{M\&M}

\section{The \MM~problem}

\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}

In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.

Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag.
One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?

This problem is similar to the cookie problem, with the twist that I
draw one sample from each bowl/bag. This problem also gives me a
chance to demonstrate the table method, which is useful for solving
problems like this on paper. In the next chapter we will
solve them computationally.
\index{table method}

The first step is to enumerate the hypotheses. The bag the yellow
\MM~came from I'll call Bag 1; I'll call the other Bag 2. So
the hypotheses are:

\begin{itemize}

\item A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.

\item B: Bag 1 is from 1996 and Bag 2 from 1994.

\end{itemize}

Now we construct a table with a row for each hypothesis and a
column for each term in Bayes's theorem:

\begin{tabular}{|c|c|c|c|c|}
\hline
& Prior & Likelihood & & Posterior \\
& \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/2 & (20)(20) & 200 & 20/27 \\
B & 1/2 & (14)(10) & 70 & 7/27 \\
\hline
\end{tabular}

The first column has the priors.
Based on the statement of the problem,
it is reasonable to choose $\p{A} = \p{B} = 1/2$.

The second column has the likelihoods, which follow from the
information in the problem. For example, if $A$ is true, the yellow
\MM~came from the 1994 bag with probability 20\%, and the green came
from the 1996 bag with probability 20\%. If $B$ is true, the yellow
\MM~came from the 1996 bag with probability 14\%, and the green came
from the 1994 bag with probability 10\%.
Because the selections are
independent, we get the conjoint probability by multiplying.
\index{independence}

The third column is just the product of the previous two.
The sum of this column, 270, is the normalizing constant.
To get the last column, which contains the posteriors, we divide
the third column by the normalizing constant.

That's it.
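The table lends itself to a computational check as well. The following
is an illustrative sketch (not the book's code) that reproduces the
table, with the likelihoods expressed in percentage points; the
constant factor this introduces cancels during normalization:

```python
# Hypothetical sketch of the table method for the M&M problem.
# Likelihoods are in percentage points; the constant factor cancels
# when we normalize.

priors = {'A': 0.5, 'B': 0.5}
likelihoods = {'A': 20 * 20,   # yellow from 1994, green from 1996
               'B': 14 * 10}   # yellow from 1996, green from 1994

unnorm = {h: priors[h] * likelihoods[h] for h in priors}  # 200 and 70
total = sum(unnorm.values())                              # 270
posteriors = {h: p / total for h, p in unnorm.items()}

print(posteriors['A'])  # 20/27, about 0.74
```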
Simple, right?

Well, you might be bothered by one detail.  I write \p{D|H}
in terms of percentages, not probabilities, which means it
is off by a factor of 10,000.  But that
cancels out when we divide through by the normalizing constant, so
it doesn't affect the result.
\index{normalizing constant}

When the set of hypotheses is mutually exclusive and collectively
exhaustive, you can multiply the likelihoods by any factor, if it is
convenient, as long as you apply the same factor to the entire column.


\section{The Monty Hall problem}

The Monty Hall problem might be the most contentious question in
the history of probability.  The scenario is simple, but the correct
answer is so counterintuitive that many people just can't accept
it, and many smart people have embarrassed themselves not just by
getting it wrong but by arguing the wrong side, aggressively,
in public.
\index{Monty Hall problem}

Monty Hall was the original host of the game show {\em Let's Make a
Deal}.  The Monty Hall problem is based on one of the regular
games on the show.  If you are on the show, here's what happens:

\begin{itemize}

\item Monty shows you three closed doors and tells you that there is a
prize behind each door: one prize is a car, the other two are less
valuable prizes like peanut butter and fake finger nails.  The
prizes are arranged at random.

\item The object of the game is to guess which door has the car.  If
you guess right, you get to keep the car.

\item You pick a door, which we will call Door A.  We'll call the
other doors B and C.

\item Before opening the door you chose, Monty increases the
suspense by opening either Door B or C, whichever does not
have the car.
(If the car is actually behind Door A, Monty can
safely open B or C, so he chooses one at random.)

\item Then Monty offers you the option to stick with your original
choice or switch to the one remaining unopened door.

\end{itemize}

The question is, should you ``stick'' or ``switch'', or does it
make no difference?
\index{stick}
\index{switch}
\index{intuition}

Most people have the strong intuition that it makes no difference.
There are two doors left, they reason, so the chance that the car
is behind Door A is 50\%.

But that is wrong.  In fact, the chance of winning if you stick
with Door A is only 1/3; if you switch, your chances are 2/3.

By applying Bayes's theorem, we can break this problem into simple
pieces, and maybe convince ourselves that the correct answer is,
in fact, correct.

To start, we should make a careful statement of the data.  In
this case $D$ consists of two parts: Monty chooses Door B
{\em and} there is no car there.

Next we define three hypotheses: $A$, $B$, and $C$ represent the
hypotheses that the car is behind Door A, Door B, or Door C,
respectively.
Again, let's apply the table method:

\begin{tabular}{|c|c|c|c|c|}
\hline
  & Prior   & Likelihood &                 & Posterior \\
  & \p{H}   & \p{D|H}    & \p{H}~\p{D|H}   & \p{H|D} \\
\hline
A & 1/3     & 1/2        & 1/6             & 1/3 \\
B & 1/3     & 0          & 0               & 0 \\
C & 1/3     & 1          & 1/3             & 2/3 \\
\hline
\end{tabular}

Filling in the priors is easy because we are told that the prizes
are arranged at random, which suggests that the car is equally
likely to be behind any door.

Figuring out the likelihoods takes some thought, but with reasonable
care we can be confident that we have it right:

\begin{itemize}

\item If the car is actually behind A, Monty could safely open Doors B
or C.  So the probability that he chooses B is 1/2.
And since the
car is actually behind A, the probability that the car is not behind
B is 1.

\item If the car is actually behind B, Monty has to open Door C, so
the probability that he opens Door B is 0.

\item Finally, if the car is behind Door C, Monty opens B with
probability 1 and finds no car there with probability 1.

\end{itemize}

Now the hard part is over; the rest is just arithmetic.  The
sum of the third column is 1/2.  Dividing through yields
$\p{A|D} = 1/3$ and $\p{C|D} = 2/3$.  So you are better off switching.

There are many variations of the Monty Hall problem.  One of the
strengths of the Bayesian approach is that it generalizes to handle
these variations.

For example, suppose that Monty always chooses B if he can, and
only chooses C if he has to (because the car is behind B).  In
that case the revised table is:

\begin{tabular}{|c|c|c|c|c|}
\hline
  & Prior   & Likelihood &                 & Posterior \\
  & \p{H}   & \p{D|H}    & \p{H}~\p{D|H}   & \p{H|D} \\
\hline
A & 1/3     & 1          & 1/3             & 1/2 \\
B & 1/3     & 0          & 0               & 0 \\
C & 1/3     & 1          & 1/3             & 1/2 \\
\hline
\end{tabular}

The only change is \p{D|A}.  If the car is behind $A$, Monty can
choose to open B or C.  But in this variation he always chooses
B, so $\p{D|A} = 1$.

As a result, the likelihoods are the same for $A$ and $C$, and the
posteriors are the same: $\p{A|D} = \p{C|D} = 1/2$.  In this case, the
fact that Monty chose B reveals no information about the location of
the car, so it doesn't matter whether the contestant sticks or
switches.

On the other hand, if he had opened $C$, we would know $\p{B|D} = 1$.

I included the Monty Hall problem in this chapter because I think it
is fun, and because Bayes's theorem makes the complexity of the
problem a little more manageable.
But it is not a typical use of
Bayes's theorem, so if you found it confusing, don't worry!

\section{Discussion}

For many problems involving conditional probability, Bayes's theorem
provides a divide-and-conquer strategy.  If \p{A|B} is hard to
compute, or hard to measure experimentally, check whether it might be
easier to compute the other terms in Bayes's theorem, \p{B|A}, \p{A}
and \p{B}.
\index{divide-and-conquer}

If the Monty Hall problem is your idea of fun, I have collected a
number of similar problems in an article called ``All your Bayes are
belong to us,'' which you can read at
\url{http://allendowney.blogspot.com/2011/10/all-your-bayes-are-belong-to-us.html}.


\chapter{Computational Statistics}
\label{compstat}

\section{Distributions}

In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}

For example, if you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}

As another example, you might be interested in how many times each
word appears in common English usage.  You could build a distribution
that includes each word and how many times it appears.
\index{word frequency}

To represent a distribution in Python, you could use a dictionary that
maps from each value to its probability.  I have written a class
called {\tt Pmf} that uses a Python dictionary in exactly that way,
and provides a number of useful methods.
I called the class Pmf in reference to
a {\bf probability mass function}, which is a way to
represent a distribution mathematically.
\index{probability mass function}
\index{Pmf class}

{\tt Pmf} is defined in a Python module I wrote to accompany this
book, {\tt thinkbayes.py}.  You can download it from
\url{http://thinkbayes.com/thinkbayes.py}.
For more information
see Section~\ref{download}.

To use {\tt Pmf} you can import it like this:

\begin{verbatim}
from thinkbayes import Pmf
\end{verbatim}

The following code builds a Pmf to represent the distribution
of outcomes for a six-sided die:

\begin{verbatim}
pmf = Pmf()
for x in [1,2,3,4,5,6]:
    pmf.Set(x, 1/6.0)
\end{verbatim}

\verb"Pmf" creates an empty Pmf with no values.  The
\verb"Set" method sets the probability associated with each
value to $1/6$.

Here's another example that counts the number of times each word
appears in a sequence:

\begin{verbatim}
pmf = Pmf()
for word in word_list:
    pmf.Incr(word, 1)
\end{verbatim}

\verb"Incr" increases the ``probability'' associated with each
word by 1.  If a word is not already in the Pmf, it is added.

I put ``probability'' in quotes because in this example, the
probabilities are not normalized; that is, they do not add up to 1.
So they are not true probabilities.

But in this example the word counts are proportional to the
probabilities.  So after we count all the words, we can compute
probabilities by dividing through by the total number of words.
{\tt1163Pmf} provides a method, \verb"Normalize", that does exactly that:1164\index{Pmf methods}11651166\begin{verbatim}1167pmf.Normalize()1168\end{verbatim}11691170Once you have a Pmf object, you can ask for the probability1171associated with any value:1172\index{Prob}11731174\begin{verbatim}1175print pmf.Prob('the')1176\end{verbatim}11771178And that would print the frequency of the word ``the'' as a fraction1179of the words in the list.11801181Pmf uses a Python dictionary to store the values and their1182probabilities, so the values in the Pmf can be any hashable type.1183The probabilities can be any numerical type, but they are usually1184floating-point numbers (type \verb"float").118511861187\section{The cookie problem}11881189In the context of Bayes's theorem, it is natural to use a Pmf1190to map from each hypothesis to its probability. In the cookie1191problem, the hypotheses are $B_1$ and $B_2$. In Python, I1192represent them with strings:1193\index{cookie problem}11941195\begin{verbatim}1196pmf = Pmf()1197pmf.Set('Bowl 1', 0.5)1198pmf.Set('Bowl 2', 0.5)1199\end{verbatim}12001201This distribution, which contains the priors for each hypothesis,1202is called (wait for it) the {\bf prior distribution}.1203\index{prior distribution}12041205To update the distribution based on new data (the vanilla cookie),1206we multiply each prior by the corresponding likelihood. The likelihood1207of drawing a vanilla cookie from Bowl 1 is 3/4. The likelihood1208for Bowl 2 is 1/2.1209\index{Mult}12101211\begin{verbatim}1212pmf.Mult('Bowl 1', 0.75)1213pmf.Mult('Bowl 2', 0.5)1214\end{verbatim}12151216\verb"Mult" does what you would expect. 
It gets the probability
for the given hypothesis and multiplies by the given likelihood.

After this update, the distribution is no longer normalized, but
because these hypotheses are mutually exclusive and collectively
exhaustive, we can {\bf renormalize}:
\index{renormalize}

\begin{verbatim}
pmf.Normalize()
\end{verbatim}

The result is a distribution that contains the posterior probability
for each hypothesis, which is called (wait now) the
{\bf posterior distribution}.
\index{posterior distribution}

Finally, we can get the posterior probability for Bowl 1:

\begin{verbatim}
print pmf.Prob('Bowl 1')
\end{verbatim}

And the answer is 0.6.  You can download this example
from \url{http://thinkbayes.com/cookie.py}.  For more information
see Section~\ref{download}.
\index{cookie.py}


\section{The Bayesian framework}
\label{framework}

\index{Bayesian framework}
Before we go on to other problems, I want to rewrite the code
from the previous section to make it more general.  First I'll
define a class to encapsulate the code related to this problem:

\begin{verbatim}
class Cookie(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

A Cookie object is a Pmf that maps from hypotheses to their
probabilities.  The \verb"__init__" method gives each hypothesis
the same prior probability.
As in the previous section, there are
two hypotheses:

\begin{verbatim}
hypos = ['Bowl 1', 'Bowl 2']
pmf = Cookie(hypos)
\end{verbatim}

\verb"Cookie" provides an \verb"Update" method that takes
data as a parameter and updates the probabilities:
\index{Update}

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

\verb"Update" loops through each hypothesis in the suite
and multiplies its probability by the likelihood of the
data under the hypothesis, which is computed by \verb"Likelihood":
\index{Likelihood}

\begin{verbatim}
    mixes = {
        'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
        'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
        }

    def Likelihood(self, data, hypo):
        mix = self.mixes[hypo]
        like = mix[data]
        return like
\end{verbatim}

\verb"Likelihood" uses \verb"mixes", which is a dictionary
that maps from the name of a bowl to the mix of cookies in
the bowl.

Here's what the update looks like:

\begin{verbatim}
pmf.Update('vanilla')
\end{verbatim}

And then we can print the posterior probability of each hypothesis:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

The result is

\begin{verbatim}
Bowl 1 0.6
Bowl 2 0.4
\end{verbatim}

which is the same as what we got before.  This code is more complicated
than what we saw in the previous section.  One advantage is that it
generalizes to the case where we draw more than one cookie from the
same bowl (with replacement):

\begin{verbatim}
dataset = ['vanilla', 'chocolate', 'vanilla']
for data in dataset:
    pmf.Update(data)
\end{verbatim}

The other advantage is that it provides a framework for solving many
similar problems.
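If you want to check the arithmetic of the three-cookie update by hand, here is a minimal standalone version of the same computation in plain Python.  The {\tt update} function is an illustrative stand-in for the Pmf machinery, not the implementation in {\tt thinkbayes.py}:

```python
# Minimal stand-in for the Pmf-based update (illustrative only).
mixes = {
    'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
    'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
}

def update(pmf, data):
    # Multiply each hypothesis by the likelihood of the data,
    # then renormalize so the probabilities add up to 1.
    for hypo in pmf:
        pmf[hypo] *= mixes[hypo][data]
    total = sum(pmf.values())
    for hypo in pmf:
        pmf[hypo] /= total

pmf = {'Bowl 1': 0.5, 'Bowl 2': 0.5}
for data in ['vanilla', 'chocolate', 'vanilla']:
    update(pmf, data)
# pmf['Bowl 1'] is now 9/17, about 0.53
```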
In the next section we'll solve the Monty Hall
problem computationally and then see what parts of the framework are
the same.

The code in this section is available from
\url{http://thinkbayes.com/cookie2.py}.
For more information
see Section~\ref{download}.

\section{The Monty Hall problem}

To solve the Monty Hall problem, I'll define a new class:
\index{Monty Hall problem}

\begin{verbatim}
class Monty(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

So far \verb"Monty" and \verb"Cookie" are exactly the same.
And the code that creates the Pmf is the same, too, except for
the names of the hypotheses:

\begin{verbatim}
hypos = 'ABC'
pmf = Monty(hypos)
\end{verbatim}

Calling \verb"Update" is pretty much the same:

\begin{verbatim}
data = 'B'
pmf.Update(data)
\end{verbatim}

And the implementation of \verb"Update" is exactly the same:

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

The only part that requires some work is \verb"Likelihood":

\begin{verbatim}
    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

Finally, printing the results is the same:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

And the answer is

\begin{verbatim}
A 0.333333333333
B 0.0
C 0.666666666667
\end{verbatim}

In this example, writing \verb"Likelihood" is a little complicated,
but the framework of the Bayesian update is simple.
The code in
this section is available from \url{http://thinkbayes.com/monty.py}.
For more information
see Section~\ref{download}.

\section{Encapsulating the framework}

\index{Suite class}
Now that we see what elements of the framework are the same, we
can encapsulate them in an object---a \verb"Suite" is a \verb"Pmf"
that provides \verb"__init__", \verb"Update", and \verb"Print":

\begin{verbatim}
class Suite(Pmf):
    """Represents a suite of hypotheses and their probabilities."""

    def __init__(self, hypo=tuple()):
        """Initializes the distribution."""

    def Update(self, data):
        """Updates each hypothesis based on the data."""

    def Print(self):
        """Prints the hypotheses and their probabilities."""
\end{verbatim}

The implementation of \verb"Suite" is in \verb"thinkbayes.py".  To use
\verb"Suite", you should write a class that inherits from it and
provides \verb"Likelihood".  For example, here is the solution to the
Monty Hall problem rewritten to use \verb"Suite":

\begin{verbatim}
from thinkbayes import Suite

class Monty(Suite):

    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

And here's the code that uses this class:

\begin{verbatim}
suite = Monty('ABC')
suite.Update('B')
suite.Print()
\end{verbatim}

You can download this example from
\url{http://thinkbayes.com/monty2.py}.
For more information
see Section~\ref{download}.


\section{The \MM~problem}

\index{M and M problem}
We can use the \verb"Suite" framework to solve the \MM~problem.
Writing the \verb"Likelihood" function is tricky, but everything
else is straightforward.

First I need to encode the color mixes from before and
after 1995:

\begin{verbatim}
mix94 = dict(brown=30,
             yellow=20,
             red=20,
             green=10,
             orange=10,
             tan=10)

mix96 = dict(blue=24,
             green=20,
             orange=16,
             yellow=14,
             red=13,
             brown=13)
\end{verbatim}

Then I have to encode the hypotheses:

\begin{verbatim}
hypoA = dict(bag1=mix94, bag2=mix96)
hypoB = dict(bag1=mix96, bag2=mix94)
\end{verbatim}

\verb"hypoA" represents the hypothesis that Bag 1 is from
1994 and Bag 2 from 1996.  \verb"hypoB" is the other way
around.

Next I map from the name of the hypothesis to the representation:

\begin{verbatim}
hypotheses = dict(A=hypoA, B=hypoB)
\end{verbatim}

And finally I can write \verb"Likelihood".  In this case
the hypothesis, \verb"hypo", is a string, either \verb"A" or \verb"B".
The data is a tuple that specifies a bag
and a color.

\begin{verbatim}
    def Likelihood(self, data, hypo):
        bag, color = data
        mix = self.hypotheses[hypo][bag]
        like = mix[color]
        return like
\end{verbatim}

Here's the code that creates the suite and updates it:

\begin{verbatim}
suite = M_and_M('AB')

suite.Update(('bag1', 'yellow'))
suite.Update(('bag2', 'green'))

suite.Print()
\end{verbatim}

And here's the result:

\begin{verbatim}
A 0.740740740741
B 0.259259259259
\end{verbatim}

The posterior probability of A is approximately $20/27$, which is what
we got before.

The code in this section is available from
\url{http://thinkbayes.com/m_and_m.py}.  For more information see
Section~\ref{download}.

\section{Discussion}

This chapter presents the Suite class, which encapsulates the
Bayesian update framework.

{\tt Suite} is an {\bf abstract type}, which means that it defines the
interface a Suite is supposed to have, but does not provide a complete
implementation.
The {\tt Suite} interface includes {\tt Update} and
{\tt Likelihood}, but the {\tt Suite} class only provides an
implementation of {\tt Update}, not {\tt Likelihood}.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

A {\bf concrete type} is a class that extends an abstract parent
class and provides an implementation of the missing methods.
For example, {\tt Monty} extends {\tt Suite}, so it inherits
{\tt Update} and provides {\tt Likelihood}.

If you are familiar with
design patterns, you might recognize this as an example of the
template method pattern.
You can read about this pattern at
\url{http://en.wikipedia.org/wiki/Template_method_pattern}.
\index{template method pattern}

Most of the examples in the following chapters follow the same
pattern; for each problem we define a new class that extends {\tt
Suite}, inherits {\tt Update}, and provides {\tt Likelihood}.  In a
few cases we override {\tt Update}, usually to improve performance.

\section{Exercises}

\begin{exercise}

In Section~\ref{framework} I said that the solution to the cookie
problem generalizes to the case where we draw multiple cookies
with replacement.

But in the more likely scenario where we eat the cookies we draw,
the likelihood of each draw depends on the previous draws.

Modify the solution in this chapter to handle selection without
replacement.  Hint: add instance variables to {\tt Cookie} to
represent the hypothetical state of the bowls, and modify
{\tt Likelihood} accordingly.  You might want to define a
{\tt Bowl} object.

\end{exercise}


\chapter{Estimation}
\label{estimation}

\section{The dice problem}

\index{Dice problem}
Suppose I have a box of dice that contains a 4-sided die, a 6-sided
die, an 8-sided die, a 12-sided die, and a 20-sided die.
If you
have ever played {\it Dungeons~\&~Dragons}, you know what I am talking about.
\index{Dungeons and Dragons}

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?
\index{dice}

Let me suggest a three-step strategy for approaching a problem like this.

\begin{enumerate}

\item Choose a representation for the hypotheses.

\item Choose a representation for the data.

\item Write the likelihood function.

\end{enumerate}

In previous examples I used strings to represent hypotheses and
data, but for the dice problem I'll use numbers.  Specifically,
I'll use the integers 4, 6, 8, 12, and 20 to represent hypotheses:

\begin{verbatim}
suite = Dice([4, 6, 8, 12, 20])
\end{verbatim}

And integers from 1 to 20 for the data.
These representations make it easy to
write the likelihood function:

\begin{verbatim}
class Dice(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

Here's how \verb"Likelihood" works.  If \verb"hypo < data", that
means the roll is greater than the number of sides on the die.
That can't happen, so the likelihood is 0.

Otherwise the question is, ``Given that there are {\tt hypo}
sides, what is the chance of rolling {\tt data}?''  The
answer is \verb"1/hypo", regardless of {\tt data}.

Here is the statement that does the update (if I roll a 6):

\begin{verbatim}
suite.Update(6)
\end{verbatim}

And here is the posterior distribution:

\begin{verbatim}
4 0.0
6 0.392156862745
8 0.294117647059
12 0.196078431373
20 0.117647058824
\end{verbatim}

After we roll a 6, the probability for the 4-sided die is 0.
The
most likely alternative is the 6-sided die, but there is still
almost a 12\% chance for the 20-sided die.

What if we roll a few more times and get 6, 8, 7, 7, 5, and 4?

\begin{verbatim}
for roll in [6, 8, 7, 7, 5, 4]:
    suite.Update(roll)
\end{verbatim}

With this data the 6-sided die is eliminated, and the 8-sided
die seems quite likely.  Here are the results:

\begin{verbatim}
4 0.0
6 0.0
8 0.943248453672
12 0.0552061280613
20 0.0015454182665
\end{verbatim}

Now the probability is 94\% that we are rolling the 8-sided die,
and less than 1\% for the 20-sided die.

The dice problem is based on an example I saw in Sanjoy Mahajan's class on
Bayesian inference.  You can download the code in this section from
\url{http://thinkbayes.com/dice.py}.
For more information
see Section~\ref{download}.

\section{The locomotive problem}

\index{locomotive problem}
\index{Mosteller, Frederick}
\index{German tank problem}
I found the locomotive problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
Probability with Solutions} (Dover, 1987):

\begin{quote}
``A railroad numbers its locomotives in order 1..N.  One day you see a
locomotive with the number 60.  Estimate how many locomotives the
railroad has.''
\end{quote}

Based on this observation, we know the railroad has 60 or more
locomotives.  But how many more?  To apply Bayesian reasoning, we
can break this problem into two steps:

\begin{enumerate}

\item What did we know about $N$ before we saw the data?

\item For any given value of $N$, what is the likelihood of
seeing the data (a locomotive with number 60)?

\end{enumerate}

The answer to the first question is the prior.
The answer to the
second is the likelihood.

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
\caption{Posterior distribution for the locomotive problem, based
on a uniform prior.}
\label{fig.train1}
\end{figure}

We don't have much basis to choose a prior, but we can start with
something simple and then consider alternatives.  Let's assume that
$N$ is equally likely to be any value from 1 to 1000.

\begin{verbatim}
hypos = xrange(1, 1001)
\end{verbatim}

Now all we need is a likelihood function.  In a hypothetical fleet of
$N$ locomotives, what is the probability that we would see number 60?
If we assume that there is only one train-operating company (or only
one we care about) and that we are equally likely to see any of its
locomotives, then the chance of seeing any particular locomotive is
$1/N$.

Here's the likelihood function:
\index{likelihood function}

\begin{verbatim}
class Train(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

This might look familiar; the likelihood functions for the locomotive
problem and the dice problem are identical.
\index{dice problem}

Here's the update:

\begin{verbatim}
suite = Train(hypos)
suite.Update(60)
\end{verbatim}

There are too many hypotheses to print, so I plotted the
results in Figure~\ref{fig.train1}.  Not surprisingly, all
values of $N$ below 60 have been eliminated.

The most likely
value, if you had to guess, is 60.  That might not seem like
a very good guess; after all, what are the chances that you just
happened to see the train with the highest number?
Nevertheless, if you want to maximize the chance of getting
the answer exactly right, you should guess 60.

But maybe that's
not the right goal.
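You can confirm that 60 is the most likely value with a standalone sketch of the update in plain Python (independent of {\tt thinkbayes.py}):

```python
# Posterior for the locomotive problem: uniform prior on 1..1000,
# one observation, 60 (standalone sketch, not the book's code).
post = {}
for N in range(1, 1001):
    post[N] = 1.0 / N if N >= 60 else 0.0   # prior times likelihood
total = sum(post.values())
for N in post:
    post[N] /= total

best = max(post, key=post.get)   # the posterior mode; best == 60
```

The posterior mode is the observed number itself, which is why guessing 60 maximizes the chance of being exactly right.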
An alternative is to compute the
mean of the posterior distribution:

\begin{verbatim}
def Mean(suite):
    total = 0
    for hypo, prob in suite.Items():
        total += hypo * prob
    return total

print Mean(suite)
\end{verbatim}

Or you could use the very similar method provided by {\tt Pmf}:

\begin{verbatim}
print suite.Mean()
\end{verbatim}

The mean of the posterior is 333, so that might be a
good guess if you wanted to minimize error.  If you played this
guessing game over and over, using the mean of the posterior as your
estimate would minimize the mean squared error over the long run (see
\url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
\index{mean squared error}

You can download this example from \url{http://thinkbayes.com/train.py}.
For more information
see Section~\ref{download}.

\section{What about that prior?}

To make any progress on the locomotive problem we had to make
assumptions, and some of them were pretty arbitrary.  In
particular, we chose a uniform prior from 1 to 1000, without
much justification for choosing 1000, or for choosing a uniform
distribution.
\index{prior distribution}

It is not crazy to believe that a railroad company might operate 1000
locomotives, but a reasonable person might guess more or fewer.  So we
might wonder whether the posterior distribution is sensitive to these
assumptions.  With so little data---only one observation---it probably
is.

Recall that with a uniform prior from 1 to 1000, the mean of
the posterior is 333.  With an upper bound of 500, we get a
posterior mean of 207, and with an upper bound of 2000,
the posterior mean is 552.

So that's bad.
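The sensitivity to the upper bound is easy to reproduce.  This standalone sketch (plain Python, not using {\tt thinkbayes.py}) computes the posterior mean for each choice:

```python
def posterior_mean(upper, observation=60):
    # Uniform prior on 1..upper; likelihood 1/N for N >= observation,
    # zero otherwise, so only N >= observation contribute.
    probs = dict((N, 1.0 / N) for N in range(observation, upper + 1))
    total = sum(probs.values())
    return sum(N * p for N, p in probs.items()) / total

means = [posterior_mean(u) for u in (500, 1000, 2000)]
# rounds to [207, 333, 552], matching the text
```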
There are two ways to proceed:

\begin{itemize}

\item Get more data.

\item Get more background information.

\end{itemize}

With more data, posterior distributions based on different
priors tend to converge.  For example, suppose that in addition
to train 60 we also see trains 30 and 90.  We can update the
distribution like this:

\begin{verbatim}
for data in [60, 30, 90]:
    suite.Update(data)
\end{verbatim}

With these data, the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500  & 152 \\
1000 & 164 \\
2000 & 171 \\
\hline
\end{tabular}

So the differences are smaller.


\section{An alternative prior}

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
\caption{Posterior distribution based on a power law prior,
compared to a uniform prior.}
\label{fig.train4}
\end{figure}

If more data are not available, another option is to improve the
priors by gathering more background information.  It is probably
not reasonable to assume that a train-operating company with 1000 locomotives
is just as likely as a company with only 1.

With some effort, we could probably find a list of companies that
operate locomotives in the area of observation.  Or we could
interview an expert in rail shipping to gather information about
the typical size of companies.

But even without getting into the specifics of railroad economics, we
can make some educated guesses.  In most fields, there are many small
companies, fewer medium-sized companies, and only one or two very
large companies.
In fact, the distribution of company sizes tends to
follow a power law, as Robert Axtell reports in {\it Science} (see
\url{http://www.sciencemag.org/content/293/5536/1818.full.pdf}).
\index{power law}
\index{Axtell, Robert}

This law suggests that if there are 1000 companies with fewer than
10 locomotives, there might be 100 companies with 100 locomotives,
10 companies with 1000, and possibly one company with 10,000 locomotives.

Mathematically, a power law means that the number of companies
with a given size is inversely proportional to size, or
%
\[ \PMF(x) \propto \left( \frac{1}{x} \right)^{\alpha} \]
%
where $\PMF(x)$ is the probability mass function of $x$ and $\alpha$ is
a parameter that is often near 1.

We can construct a power law prior like this:

\begin{verbatim}
class Train(Dice):

    def __init__(self, hypos, alpha=1.0):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, hypo**(-alpha))
        self.Normalize()
\end{verbatim}

And here's the code that constructs the prior:

\begin{verbatim}
hypos = range(1, 1001)
suite = Train(hypos)
\end{verbatim}

Again, the upper bound is arbitrary, but with a power law
prior, the posterior is less sensitive to this choice.

Figure~\ref{fig.train4} shows the new posterior based on
the power law, compared to the posterior based on the
uniform prior.  Using the background information
represented in the power law prior, we can all but eliminate
values of $N$ greater than 700.

If we start with this prior and observe trains 30, 60, and 90,
the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500  & 131 \\
1000 & 133 \\
2000 & 134 \\
\hline
\end{tabular}

Now the differences are much smaller.
In fact,
with an arbitrarily large upper bound, the mean converges on 134.

So the power law prior is more realistic, because it is based on
general information about the size of companies, and it
behaves better in practice.

You can download the examples in this section from
\url{http://thinkbayes.com/train3.py}.
For more information
see Section~\ref{download}.

\section{Credible intervals}
\label{credible}

Once you have computed a posterior distribution, it is often useful
to summarize the results with a single point estimate or an interval.
For point estimates it is common to use the mean, median, or the
value with maximum likelihood.
\index{credible interval}
\index{maximum likelihood}

For intervals we usually report two values computed
so that there is a 90\% chance that the unknown value falls
between them (or any other probability).
These values define a {\bf credible interval}.

A simple way to compute a credible interval is to add up the
probabilities in the posterior distribution and record the values
that correspond to probabilities 5\% and 95\%. In other words,
the 5th and 95th percentiles.
\index{percentile}

\verb"thinkbayes" provides a function that computes percentiles:

\begin{verbatim}
def Percentile(pmf, percentage):
    p = percentage / 100.0
    total = 0
    for val, prob in pmf.Items():
        total += prob
        if total >= p:
            return val
\end{verbatim}

And here's the code that uses it:

\begin{verbatim}
interval = Percentile(suite, 5), Percentile(suite, 95)
print interval
\end{verbatim}

For the previous example---the locomotive problem with a power law prior
and three trains---the 90\% credible interval is $(91, 243)$.
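The same accumulation can be written against a plain dict, as a minimal sketch (the explicit sort makes the iteration order unambiguous; these names are hypothetical, not the \verb"thinkbayes" code):

```python
def percentile(pmf, percentage):
    # Return the smallest value whose cumulative probability
    # reaches percentage/100, iterating values in sorted order.
    p = percentage / 100.0
    total = 0.0
    for val in sorted(pmf):
        total += pmf[val]
        if total >= p:
            return val

# A toy posterior: eight equally likely values.
eighths = {v: 0.125 for v in range(1, 9)}
interval = percentile(eighths, 5), percentile(eighths, 95)   # (1, 8)
median = percentile(eighths, 50)                             # 4
```

Note that this relies on visiting the values in increasing order; a dict that iterates in arbitrary order would give wrong answers without the sort.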
The
width of this range suggests, correctly, that we are still quite
uncertain about how many locomotives there are.


\section{Cumulative distribution functions}

In the previous section we computed percentiles by iterating through
the values and probabilities in a Pmf. If we need to compute more
than a few percentiles, it is more efficient to use a cumulative
distribution function, or Cdf.
\index{cumulative distribution function}
\index{Cdf}

Cdfs and Pmfs are equivalent in the sense that they contain the
same information about the distribution, and you can always convert
from one to the other. The advantage of the Cdf is that you can
compute percentiles more efficiently.

{\tt thinkbayes} provides a {\tt Cdf} class that represents a
cumulative distribution function. {\tt Pmf} provides a method
that makes the corresponding Cdf:

\begin{verbatim}
cdf = suite.MakeCdf()
\end{verbatim}

And {\tt Cdf} provides a function named \verb"Percentile":

\begin{verbatim}
interval = cdf.Percentile(5), cdf.Percentile(95)
\end{verbatim}

Converting from a Pmf to a Cdf takes time proportional to the number
of values, {\tt len(pmf)}. The Cdf stores the values and
probabilities in sorted lists, so looking up a probability to get the
corresponding value takes ``log time'': that is, time proportional to
the logarithm of the number of values.
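The sorted-list representation and the binary search can be sketched like this, using the standard \verb"bisect" module (a stand-in for the book's {\tt Cdf} class, not its actual implementation):

```python
from bisect import bisect_left

class SimpleCdf(object):
    """Cumulative distribution over the sorted values of a pmf dict."""

    def __init__(self, pmf):
        self.values = sorted(pmf)
        self.probs = []
        total = 0.0
        for val in self.values:
            total += pmf[val]
            self.probs.append(total)

    def Percentile(self, percentage):
        # Binary search for the first cumulative probability
        # >= percentage/100: O(log n) per lookup instead of O(n).
        index = bisect_left(self.probs, percentage / 100.0)
        return self.values[index]

cdf = SimpleCdf({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
interval = cdf.Percentile(5), cdf.Percentile(95)   # (1, 4)
```

Building the sorted lists is the linear-time part; every percentile lookup after that is logarithmic.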
Looking up a value to get the
corresponding probability is also logarithmic, so Cdfs are efficient
for many calculations.

The examples in this section are in \url{http://thinkbayes.com/train3.py}.
For more information
see Section~\ref{download}.


\section{The German tank problem}

During World War II, the Economic Warfare Division of the American
Embassy in London used statistical analysis to estimate German
production of tanks and other equipment.\footnote{Ruggles and Brodie,
``An Empirical Approach to Economic Intelligence in World War II,''
{\em Journal of the American Statistical Association}, Vol. 42,
No. 237 (March 1947).}

The Western Allies had captured log books, inventories, and repair
records that included chassis and engine serial numbers for individual
tanks.

Analysis of these records indicated that serial numbers were allocated
by manufacturer and tank type in blocks of 100 numbers, that numbers
in each block were used sequentially, and that not all numbers in each
block were used. So the problem of estimating German tank production
could be reduced, within each block of 100 numbers, to a form of the
locomotive problem.

Based on this insight, American and British analysts produced
estimates substantially lower than estimates from other forms
of intelligence. And after the war, records indicated that they were
substantially more accurate.

They performed similar analyses for tires, trucks, rockets, and other
equipment, yielding accurate and actionable economic intelligence.

The German tank problem is historically interesting; it is also a nice
example of real-world application of statistical estimation. So far
many of the examples in this book have been toy problems, but it will
not be long before we start solving real problems.
I think it is an
advantage of Bayesian analysis, especially with the computational
approach we are taking, that it provides such a short path from a
basic introduction to the research frontier.


\section{Discussion}

Among Bayesians, there are two approaches to choosing prior
distributions. Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be {\bf informative}. The problem with using an informative
prior is that people might use different background information (or
interpret it differently). So informative priors often seem subjective.
\index{informative prior}

The alternative is a so-called {\bf uninformative prior}, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves. In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.
\index{uninformative prior}

Uninformative priors are appealing because they seem more
objective. But I am generally in favor of using informative priors.
Why? First, Bayesian analysis is always based on
modeling decisions. Choosing the prior is one of those decisions, but
it is not the only one, and it might not even be the most subjective.
So even if an uninformative prior is more objective, the entire analysis
is still subjective.
\index{modeling}
\index{subjectivity}
\index{objectivity}

Also, for most practical problems, you are likely to be in one of two
regimes: either you have a lot of data or not very much.
If you have
a lot of data, the choice of the prior doesn't matter very much;
informative and uninformative priors yield almost the same results.
We'll see an example like this in the next chapter.

But if, as in the locomotive problem, you don't have much data,
using relevant background information (like the power law distribution)
makes a big difference.
\index{locomotive problem}

And if, as in the German tank problem, you have to make life-and-death
decisions based on your results, you should probably use all of the
information at your disposal, rather than maintaining the illusion of
objectivity by pretending to know less than you do.
\index{German tank problem}


\section{Exercises}

\begin{exercise}
To write a likelihood function for the locomotive problem, we had
to answer this question: ``If the railroad has $N$ locomotives, what
is the probability that we see number 60?''

The answer depends on what sampling process we use when we observe the
locomotive. In this chapter, I resolved the ambiguity by specifying
that there is only one train-operating company (or only one that we
care about).

But suppose instead that there are many companies with different
numbers of trains.
And suppose that you are equally likely to see any
train operated by any company.
In that case, the likelihood function is different because you
are more likely to see a train operated by a large company.

As an exercise, implement the likelihood function for this variation
of the locomotive problem, and compare the results.

\end{exercise}



\chapter{More Estimation}
\label{more}

\section{The Euro problem}
\label{euro}

\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay
poses this problem:

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

To answer that question, we'll proceed in two steps. The first
is to estimate the probability that the coin lands face up. The second
is to evaluate whether the data support the hypothesis that the
coin is biased.

You can download the code in this section from
\url{http://thinkbayes.com/euro.py}.
For more information
see Section~\ref{download}.

Any given coin has some probability, $x$, of landing heads up when spun
on edge. It seems reasonable to believe that the value of $x$ depends
on some physical characteristics of the coin, primarily the distribution
of weight.

If a coin is perfectly balanced, we expect $x$ to be close to 50\%, but
for a lopsided coin, $x$ might be substantially different.
We can use
Bayes's theorem and the observed data to estimate $x$.

Let's define 101 hypotheses, where $H_x$ is the hypothesis that the
probability of heads is $x$\%, for values from 0 to 100. I'll
start with a uniform prior where the probability of $H_x$ is the same
for all $x$. We'll come back later to consider other priors.
\index{uniform distribution}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro1.pdf}}
\caption{Posterior distribution for the Euro problem
on a uniform prior.}
\label{fig.euro1}
\end{figure}

The likelihood function is relatively easy: If $H_x$ is true, the
probability of heads is $x/100$ and the probability of tails is
$1-x/100$.

\begin{verbatim}
class Euro(Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        if data == 'H':
            return x/100.0
        else:
            return 1 - x/100.0
\end{verbatim}

Here's the code that makes the suite and updates it:

\begin{verbatim}
suite = Euro(xrange(0, 101))
dataset = 'H' * 140 + 'T' * 110

for data in dataset:
    suite.Update(data)
\end{verbatim}

The result is in Figure~\ref{fig.euro1}.


\section{Summarizing the posterior}

Again, there are several ways to summarize the posterior distribution.
One option is to find the most likely value in the posterior
distribution. \verb"thinkbayes" provides a function that does
that:
\index{posterior distribution}
\index{maximum likelihood}

\begin{verbatim}
def MaximumLikelihood(pmf):
    """Returns the value with the highest probability."""
    prob, val = max((prob, val) for val, prob in pmf.Items())
    return val
\end{verbatim}

In this case the result is 56, which is also the observed percentage of
heads, $140/250 = 56\%$.
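Condensed into a stand-alone sketch (a plain dict instead of the {\tt Suite} class; the helper name is mine), the whole update and the maximum likelihood summary take only a few lines:

```python
def euro_posterior(heads, tails):
    # Uniform prior over hypotheses x = 0..100, then one batch update
    # with likelihood (x/100)**heads * (1 - x/100)**tails.
    pmf = {}
    for x in range(0, 101):
        p = x / 100.0
        pmf[x] = p ** heads * (1 - p) ** tails
    total = sum(pmf.values())
    return {x: v / total for x, v in pmf.items()}

posterior = euro_posterior(140, 110)
most_likely = max(posterior, key=posterior.get)   # 56
```

Updating with the whole dataset at once gives the same posterior as spin-by-spin updates, because the per-spin likelihoods just multiply.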
So that suggests (correctly) that the
observed percentage is the maximum likelihood estimator
for the population.

We might also summarize the posterior by computing the mean
and median:
\index{median}

\begin{verbatim}
print 'Mean', suite.Mean()
print 'Median', thinkbayes.Percentile(suite, 50)
\end{verbatim}

The mean is 55.95; the median is 56. Finally, we can compute a
credible interval:

\begin{verbatim}
print 'CI', thinkbayes.CredibleInterval(suite, 90)
\end{verbatim}

The result is $(51, 61)$.

Now, getting back to the original question, we would like to know
whether the coin is fair. We observe that the posterior credible
interval does not include 50\%, which suggests that the coin is not
fair.

But that is not exactly the question we started with. MacKay asked,
``Do these data give evidence that the coin is biased rather than
fair?'' To answer that question, we will have to be more precise
about what it means to say that data constitute evidence for
a hypothesis. And that is the subject of the next chapter.
\index{evidence}

But before we go on, I want to address one possible source of confusion.
Since we want to know whether the coin is fair, it might be tempting
to ask for the probability that {\tt x} is 50\%:

\begin{verbatim}
print suite.Prob(50)
\end{verbatim}

The result is 0.021, but that value is almost meaningless.
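Why is it almost meaningless? The value depends on how finely we carve up the range of hypotheses, which a quick sketch makes visible (the helper is hypothetical, not the book's code):

```python
def prob_of_half(num_points, heads=140, tails=110):
    # Posterior probability of the single hypothesis x = 0.5,
    # on a uniform grid of num_points hypotheses over [0, 1].
    hypos = [i / (num_points - 1.0) for i in range(num_points)]
    likes = [x ** heads * (1 - x) ** tails for x in hypos]
    total = sum(likes)
    return likes[hypos.index(0.5)] / total

coarse = prob_of_half(101)    # roughly 0.021
fine = prob_of_half(1001)     # roughly ten times smaller
```

Each hypothesis on the fine grid covers a tenth of the interval that a coarse-grid hypothesis covers, so it carries about a tenth of the probability mass; only quantities that are stable under refinement, like the credible interval, are meaningful summaries.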
The
decision to evaluate 101 hypotheses was arbitrary; we could have
divided the range into more or fewer pieces, and if we had, the
probability for any given hypothesis would be greater or less.


\section{Swamping the priors}
\label{triangle}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro2.pdf}}
\caption{Uniform and triangular priors for the
Euro problem.}
\label{fig.euro2}
\end{figure}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro3.pdf}}
\caption{Posterior distributions for the Euro problem.}
\label{fig.euro3}
\end{figure}

We started with a uniform prior, but that might not be a good
choice. I can believe
that if a coin is lopsided, $x$ might deviate substantially from
50\%, but it seems unlikely that the Belgian Euro coin is so
imbalanced that $x$ is 10\% or 90\%.

It might be more reasonable to choose a prior that gives
higher probability to values of $x$ near 50\% and lower probability
to extreme values.

As an example, I constructed a triangular prior, shown in
Figure~\ref{fig.euro2}. Here's the code that constructs the prior:

\begin{verbatim}
def TrianglePrior():
    suite = Euro()
    for x in range(0, 51):
        suite.Set(x, x)
    for x in range(51, 101):
        suite.Set(x, 100-x)
    suite.Normalize()
    return suite
\end{verbatim}

Figure~\ref{fig.euro2} shows the result (and the uniform prior for
comparison).
Updating this prior with the same dataset yields the posterior
distribution shown in Figure~\ref{fig.euro3}. Even with substantially
different priors, the posterior distributions are very similar. The
medians and the credible intervals are identical; the means differ by
less than 0.5\%.
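We can check that claim directly with a small sketch that runs both updates side by side (plain dicts rather than the {\tt Euro} class, with the same 140 heads and 110 tails):

```python
def posterior_mean(prior, heads=140, tails=110):
    # Batch Bayesian update of a (possibly unnormalized) prior over
    # x = 0..100, followed by the posterior mean.
    post = {x: p * (x / 100.0) ** heads * (1 - x / 100.0) ** tails
            for x, p in prior.items()}
    total = sum(post.values())
    return sum(x * p for x, p in post.items()) / total

uniform = {x: 1 for x in range(0, 101)}
triangle = {x: x for x in range(0, 51)}
triangle.update({x: 100 - x for x in range(51, 101)})

diff = abs(posterior_mean(uniform) - posterior_mean(triangle))
# diff comes out well under half a percentage point.
```

The priors do not need to be normalized before the update, because the division by \verb"total" normalizes at the end.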
\index{triangle distribution}

This is an example of {\bf swamping the priors}: with enough
data, people who start with different priors will tend to
converge on the same posterior.
\index{swamping the priors}
\index{convergence}


\section{Optimization}

The code I have shown so far is meant to be easy to read, but it
is not very efficient. In general, I like to develop code that
is demonstrably correct, then check whether it is fast enough for
my purposes. If so, there is no need to optimize.
For this example, if we care about run time,
there are several ways we can speed it up.
\index{optimization}

The first opportunity is to reduce the number of times we
normalize the suite.
In the original code, we call \verb"Update" once for each spin.

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails

for data in dataset:
    suite.Update(data)
\end{verbatim}

And here's what \verb"Update" looks like:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

Each update iterates through the hypotheses, then calls \verb"Normalize",
which iterates through the hypotheses again. We can save some
time by doing all of the updates before normalizing.

\verb"Suite" provides a method called \verb"UpdateSet" that does
exactly that. Here it is:

\begin{verbatim}
def UpdateSet(self, dataset):
    for data in dataset:
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's how we can invoke it:

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails
suite.UpdateSet(dataset)
\end{verbatim}

This optimization speeds things up, but the run time is still
proportional to the amount of data.
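Normalizing once at the end changes the run time but not the answer; here is a sketch that checks that the two strategies agree (dict-based stand-ins for the suite, with a small dataset for speed):

```python
def normalize(pmf):
    total = sum(pmf.values())
    return {h: p / total for h, p in pmf.items()}

def likelihood(data, hypo):
    x = hypo / 100.0
    return x if data == 'H' else 1 - x

prior = {h: 1.0 / 101 for h in range(0, 101)}
dataset = 'H' * 14 + 'T' * 11

# Strategy 1: normalize after every spin (like Update).
one_at_a_time = dict(prior)
for data in dataset:
    one_at_a_time = normalize(
        {h: p * likelihood(data, h) for h, p in one_at_a_time.items()})

# Strategy 2: multiply everything in, then normalize once (like UpdateSet).
batched = dict(prior)
for data in dataset:
    batched = {h: p * likelihood(data, h) for h, p in batched.items()}
batched = normalize(batched)

max_diff = max(abs(one_at_a_time[h] - batched[h]) for h in prior)
```

The intermediate normalizations only rescale the whole distribution, so the final posteriors match up to floating-point rounding.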
We can speed things up
even more by rewriting \verb"Likelihood" to process the entire
dataset, rather than one spin at a time.

In the original version,
\verb"data" is a string that encodes either heads or tails:

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    if data == 'H':
        return x
    else:
        return 1-x
\end{verbatim}

As an alternative, we could encode the dataset as a tuple of
two integers: the number of heads and tails.
In that case \verb"Likelihood" looks like this:
\index{tuple}

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

And then we can call \verb"Update" like this:

\begin{verbatim}
heads, tails = 140, 110
suite.Update((heads, tails))
\end{verbatim}

Since we have replaced repeated multiplication with exponentiation,
this version takes the same time for any number of spins.


\section{The beta distribution}
\label{beta}

\index{beta distribution}
There is one more optimization that solves this problem
even faster.

So far we have used a Pmf object to represent a discrete set of
values for {\tt x}. Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}

The beta distribution is defined on the interval from 0 to 1
(including both), so it is a natural choice for describing
proportions and probabilities. But wait, it gets better.

%TODO: explain the binomial distribution in the previous section

It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}.
That means that if the prior
distribution for {\tt x} is a beta distribution, the posterior is also
a beta distribution. But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}

The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or {\tt alpha} and {\tt beta}. If the prior
is a beta distribution with parameters {\tt alpha} and {\tt beta}, and
we see data with {\tt h} heads and {\tt t} tails, the posterior is a
beta distribution with parameters {\tt alpha+h} and {\tt beta+t}. In
other words, we can do an update with two additions.
\index{parameter}

So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior. Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match. The
beta distribution with {\tt alpha=1} and {\tt beta=1} is uniform from
0 to 1.

Let's see how we can take advantage of all this.
{\tt thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}

\begin{verbatim}
class Beta(object):

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{verbatim}

By default \verb"__init__" makes a uniform distribution.
{\tt Update} performs a Bayesian update:

\begin{verbatim}
def Update(self, data):
    heads, tails = data
    self.alpha += heads
    self.beta += tails
\end{verbatim}

{\tt data} is a pair of integers representing the number of
heads and tails.

So we have yet another way to solve the Euro problem:

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((140, 110))
print beta.Mean()
\end{verbatim}

{\tt Beta} provides {\tt Mean}, which
computes a simple function of {\tt alpha}
and {\tt beta}:

\begin{verbatim}
def Mean(self):
    return float(self.alpha) / (self.alpha + self.beta)
\end{verbatim}

For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.

{\tt Beta} also provides {\tt EvalPdf}, which evaluates
the probability density
function (PDF) of the beta distribution:
\index{probability density function}
\index{PDF}

\begin{verbatim}
def EvalPdf(self, x):
    return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{verbatim}

Finally, {\tt Beta} provides {\tt MakePmf}, which
uses {\tt EvalPdf} to generate a discrete approximation
of the beta distribution.

%This expression might look familiar. Here's {\tt
% thinkbayes.EvalBinomialPmf}

%\begin{verbatim}
%def EvalBinomialPmf(x, yes, no):
%    return x**yes * (1-x)**no
%\end{verbatim}

%It's the same function, but in {\tt EvalPdf}, we think of {\tt x} as a
%random variable and {\tt alpha} and {\tt beta} as parameters; in {\tt
% EvalBinomialPmf}, {\tt x} is the parameter, and {\tt yes} and {\tt
% no} are random variables. Distributions like these that share the
%same PDF are called {\bf conjugate distributions}.
%\index{conjugate distribution}


\section{Discussion}

In this chapter we solved the same problem with two different
priors and found that with a large dataset, the priors get
swamped. If two people start with different
prior beliefs, they generally find, as they see more data, that
their posterior distributions converge. At some point the
difference between their distributions is small enough that it has
no practical effect.
\index{swamping the priors}
\index{convergence}

When this happens, it relieves some of the worry about objectivity
that I discussed in the previous chapter. And for many real-world
problems even stark prior beliefs can eventually be reconciled
by data.

But that is not always the case.
First, remember that all Bayesian
analysis is based on modeling decisions. If you and I do not
choose the same model, we might interpret data differently. So
even with the same data, we would compute different likelihoods,
and our posterior beliefs might not converge.
\index{modeling}

Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$. In the Euro problem,
if you are convinced that $x$ is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}

This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}

Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.'' For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}


\section{Exercises}

\begin{exercise}

Suppose that instead of observing coin tosses directly, you measure
the outcome using an instrument that is not always correct.
Specifically,
suppose there is a probability {\tt y} that an actual heads is reported
as tails, or an actual tails reported as heads.

Write a class that estimates the bias of a coin given a series of
outcomes and the value of {\tt y}.

How does the spread of the posterior distribution depend on
{\tt y}?

\end{exercise}


\begin{exercise}

\index{Reddit}
This exercise is inspired by a question posted by a
``redditor'' named dominosci on Reddit's statistics ``subreddit'' at
\url{http://reddit.com/r/statistics}.

Reddit is an online forum with many interest groups called
subreddits. Users, called redditors, post links to online
content and other web pages. Other redditors vote on the links,
giving an ``upvote'' to high-quality links and a ``downvote'' to
links that are bad or irrelevant.

A problem, identified by dominosci, is that some redditors
are more reliable than others, and Reddit does not take
this into account.

The challenge is to devise a system so that when a redditor
casts a vote, the estimated quality of the link is updated
in accordance with the reliability of the redditor, and the
estimated reliability of the redditor is updated in accordance
with the quality of the link.

One approach is to model the quality of the link as the
probability of garnering an upvote, and to model the reliability
of the redditor as the probability of correctly giving an upvote
to a high-quality item.

Write class definitions for redditors and links and an update function
that updates both objects whenever a redditor casts a vote.

\end{exercise}



\chapter{Odds and Addends}

\section{Odds}

One way to represent a probability is with a number between
0 and 1, but that's not the only way.
If you have ever bet
on a football game or a horse race, you have probably encountered
another representation of probability, called {\bf odds}.
\index{odds}

You might have heard expressions like ``the odds are
three to one,'' but you might not know what that means.
The {\bf odds in favor} of an event are the ratio of the probability
it will occur to the probability that it will not.

So if I think my team has a 75\% chance of winning, I would
say that the odds in their favor are three to one, because
the chance of winning is three times the chance of losing.

You can write odds in decimal form, but it is most common to
write them as a ratio of integers. So ``three to one'' is
written $3:1$.

When probabilities are low, it is more common to report the
{\bf odds against} rather than the odds in favor. For
example, if I think my horse has a 10\% chance of winning,
I would say that the odds against are $9:1$.

Probabilities and odds are different representations of the
same information. Given a probability, you can compute the
odds like this:

\begin{verbatim}
def Odds(p):
    return p / (1-p)
\end{verbatim}

Given the odds in favor, in decimal form, you can convert to
probability like this:

\begin{verbatim}
def Probability(o):
    return o / (o+1)
\end{verbatim}

If you represent odds with a numerator and denominator, you
can convert to probability like this:

\begin{verbatim}
def Probability2(yes, no):
    return yes / (yes + no)
\end{verbatim}

When I work with odds in my head, I find it helpful to picture
people at the track.
If 20\% of them think my horse will win,
then 80\% of them don't, so the odds in favor are $20:80$ or
$1:4$.

If the odds are $5:1$ against my horse, then five out of six
people think she will lose, so the probability of winning
is $1/6$.
\index{horse racing}


\section{The odds form of Bayes's theorem}

\index{Bayes's theorem!odds form}
In Chapter~\ref{intro} I wrote Bayes's theorem in the {\bf probability
form}:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
If we have two hypotheses, $A$ and $B$,
we can write the ratio of posterior probabilities like this:
%
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}{\p{B}~\p{D|B}} \]
%
Notice that the normalizing constant, \p{D}, drops out of
this equation.
\index{normalizing constant}

If $A$ and $B$ are mutually exclusive and collectively exhaustive,
that means $\p{B} = 1 - \p{A}$, so we can rewrite the ratio of
the priors, and the ratio of the posteriors, as odds.

Writing \odds{A} for odds in favor of $A$, we get:
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
In words, this says that the posterior odds are the prior odds times
the likelihood ratio. This is the {\bf odds form} of Bayes's theorem.

This form is most convenient for computing a Bayesian update on
paper or in your head. For example, let's go back to the
cookie problem:
\index{cookie problem}

\begin{quote}
Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.

Now suppose you choose one of the bowls at random and, without looking,
select a cookie at random. The cookie is vanilla. What is the probability
that it came from Bowl 1?
\end{quote}

The prior probability is 50\%, so the prior odds are $1:1$, or just
1.
The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
So the posterior odds are $3:2$, which corresponds to probability
$3/5$.


\section{Oliver's blood}
\label{oliver}

\index{Oliver's blood problem}
\index{MacKay, David}
Here is another problem from MacKay's {\it Information Theory,
Inference, and Learning Algorithms}:

\begin{quote}
Two people have left traces of their own blood at the scene of
a crime. A suspect, Oliver, is tested and found to have type
`O' blood. The blood groups of the two traces are found to
be of type `O' (a common type in the local population, having frequency
60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data [the traces found at the scene] give evidence
in favor of the proposition that Oliver was one of the people
[who left blood at the scene]?
\end{quote}

To answer this question, we need to think about what it means
for data to give evidence in favor of (or against) a hypothesis.
Intuitively, we might say that data favor a hypothesis if the
hypothesis is more likely in light of the data than it was before.
\index{evidence}

In the cookie problem, the prior odds are $1:1$, or probability 50\%.
The posterior odds are $3:2$, or probability 60\%. So we could say
that the vanilla cookie is evidence in favor of Bowl 1.

The odds form of Bayes's theorem provides a way to make this
intuition more precise. Again,
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
Or dividing through by \odds{A}:
%
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
%
The term on the left is the ratio of the posterior and prior odds.
The term on the right is the likelihood ratio, also called the {\bf Bayes
factor}.
\index{likelihood ratio}
\index{Bayes factor}

If the Bayes factor is greater than 1, that means that the
data were more likely under $A$ than under $B$.
And since the
odds ratio is also greater than 1, that means that the odds are
greater, in light of the data, than they were before.

If the Bayes factor is less than 1, that means the data were
less likely under $A$ than under $B$, so the odds in
favor of $A$ go down.

Finally, if the Bayes factor is exactly 1, the data are equally
likely under either hypothesis, so the odds do not change.

Now we can get back to the Oliver's blood problem. If Oliver is
one of the people who left blood at the crime scene, then he
accounts for the `O' sample, so the probability of the data
is just the probability that a random member of the population
has type `AB' blood, which is 1\%.

If Oliver did not leave blood at the scene, then we have two
samples to account for. If we choose two random people from
the population, what is the chance of finding one with type `O'
and one with type `AB'? Well, there are two ways it might happen:
the first person we choose might have type `O' and the second
`AB', or the other way around. So the total probability is
$2 (0.6) (0.01) = 1.2\%$.

The likelihood of the data is slightly higher if Oliver is
{\it not} one of the people who left blood at the scene, so
the blood data is actually evidence against Oliver's guilt.
\index{evidence}

This example is a little contrived, but it is an example of
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.

If this result is so counterintuitive that it bothers you,
this way of thinking might help: the data consist of a common
event, type `O' blood, and a rare event, type `AB' blood.
If Oliver accounts for the common event, that leaves the rare
event still unexplained. If Oliver doesn't account for the
`O' blood, then we have two chances to find someone in the
population with `AB' blood.
And that factor of two makes
the difference.


\section{Addends}
\label{addends}

The fundamental operation of Bayesian statistics is
{\tt Update}, which takes a prior distribution and a set
of data, and produces a posterior distribution.  But solving
real problems usually involves a number of other operations,
including scaling, addition and other arithmetic operations,
max and min, and mixtures.
\index{distribution!operations}

This chapter presents addition and max; I will present
other operations as we need them.

The first example is based on
{\it Dungeons~\&~Dragons}, a role-playing game where the results
of players' decisions are usually determined by rolling dice.
In fact, before game play starts, players generate each
attribute of their characters---strength, intelligence, wisdom,
dexterity, constitution, and charisma---by rolling three
6-sided dice and adding them up.
\index{Dungeons and Dragons}

So you might be curious to know the distribution of this sum.
There are two ways you might compute it:
\index{simulation}
\index{enumeration}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single die, you can draw random samples, add them up,
and accumulate the distribution of simulated sums.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the sums.

\end{description}

\verb"thinkbayes" provides functions for both.  Here's an example
of the first approach.
First, I'll define a class to represent
a single die as a Pmf:

\begin{verbatim}
class Die(thinkbayes.Pmf):

    def __init__(self, sides):
        thinkbayes.Pmf.__init__(self)
        for x in xrange(1, sides+1):
            self.Set(x, 1)
        self.Normalize()
\end{verbatim}

Now I can create a 6-sided die:

\begin{verbatim}
d6 = Die(6)
\end{verbatim}

And use \verb"thinkbayes.SampleSum" to generate a sample of 1000 rolls:

\begin{verbatim}
dice = [d6] * 3
three = thinkbayes.SampleSum(dice, 1000)
\end{verbatim}

\verb"SampleSum" takes a list of distributions (either Pmf or Cdf
objects) and the sample size, {\tt n}.  It generates {\tt n} random
sums and returns their distribution as a Pmf object.

\begin{verbatim}
def SampleSum(dists, n):
    pmf = MakePmfFromList(RandomSum(dists) for i in xrange(n))
    return pmf
\end{verbatim}

\verb"SampleSum" uses \verb"RandomSum", also in \verb"thinkbayes.py":

\begin{verbatim}
def RandomSum(dists):
    total = sum(dist.Random() for dist in dists)
    return total
\end{verbatim}

{\tt RandomSum} invokes {\tt Random} on each distribution and
adds up the results.

The drawback of simulation is that the result
is only approximately correct.  As \verb"n" gets larger, it gets
more accurate, but of course the run time increases as well.

The other approach is to enumerate all pairs of values and
compute the sum and probability of each pair.  This is implemented
in \verb"Pmf.__add__":

\begin{verbatim}
# class Pmf

    def __add__(self, other):
        pmf = Pmf()
        for v1, p1 in self.Items():
            for v2, p2 in other.Items():
                pmf.Incr(v1+v2, p1*p2)
        return pmf
\end{verbatim}

{\tt self} is a Pmf, of course; {\tt other} can be a Pmf or anything
else that provides {\tt Items}.  The result is a new Pmf.
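If you want to experiment without {\tt thinkbayes.py}, the same
enumeration can be sketched with a plain dict standing in for a Pmf
(the helper name \verb"add_pmfs" is mine, not part of the library):

```python
def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of independent draws from two pmfs.

    Each pmf is a dict mapping a value to its probability.
    """
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            # Accumulate probability for each possible sum.
            result[v1 + v2] = result.get(v1 + v2, 0) + p1 * p2
    return result

die = {x: 1 / 6 for x in range(1, 7)}
two = add_pmfs(die, die)
three = add_pmfs(two, die)    # sum of three 6-sided dice
```

For example, \verb"three[3]" is $1/216$ (only 1-1-1 works), and
\verb"three[10]" is $27/216$, one of the two modal values.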
The time to
run \verb"__add__" depends on the number of items in {\tt self} and
{\tt other}; it is proportional to {\tt len(self) * len(other)}.

And here's how it's used:

\begin{verbatim}
three_exact = d6 + d6 + d6
\end{verbatim}

When you apply the {\tt +} operator to a Pmf, Python invokes
\verb"__add__".  In this example, \verb"__add__" is invoked twice.

Figure~\ref{fig.dungeons1} shows an approximate result generated
by simulation and the exact result computed by enumeration.

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons1.pdf}}
\caption{Approximate and exact distributions for the sum of
three 6-sided dice.}
\label{fig.dungeons1}
\end{figure}

\verb"Pmf.__add__" is based on the assumption that the random
selections from each Pmf are independent.  In the example of rolling
several dice, this assumption is pretty good.  In other cases, we
would have to extend this method to use conditional probabilities.
\index{independence}

The code from this section is available from
\url{http://thinkbayes.com/dungeons.py}.
For more information
see Section~\ref{download}.

\section{Maxima}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons2.pdf}}
\caption{Distribution of the maximum of six rolls of three dice.}
\label{fig.dungeons2}
\end{figure}

When you generate a {\it Dungeons~\&~Dragons} character, you are
particularly interested in the character's best attributes, so
you might like to know the
distribution of the maximum attribute.

There are three ways to compute the distribution of a maximum:
\index{maximum}
\index{simulation}
\index{enumeration}
\index{exponentiation}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single selection, you can generate random samples, find the maximum,
and accumulate the distribution of simulated maxima.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the maximum.

\item[Exponentiation:] If we convert a Pmf to a Cdf, there is a simple
and efficient algorithm for finding the Cdf of the maximum.

\end{description}

The code to simulate maxima is almost identical to the code for
simulating sums:

\begin{verbatim}
def RandomMax(dists):
    total = max(dist.Random() for dist in dists)
    return total

def SampleMax(dists, n):
    pmf = MakePmfFromList(RandomMax(dists) for i in xrange(n))
    return pmf
\end{verbatim}

All I did was replace ``sum'' with ``max''.  And the code
for enumeration is almost identical, too:

\begin{verbatim}
def PmfMax(pmf1, pmf2):
    res = thinkbayes.Pmf()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            res.Incr(max(v1, v2), p1*p2)
    return res
\end{verbatim}

In fact, you could generalize this function by taking the
appropriate operator as a parameter.

The only problem with this algorithm is that if each Pmf
has $m$ values, the run time is proportional to $m^2$.
And if we want the maximum of {\tt k} selections, it takes
time proportional to $k m^2$.

If we convert the Pmfs to Cdfs, we can do the same calculation
much faster!  The key is to remember the definition of the
cumulative distribution function:
%
\[ CDF(x) = \p{X \le x} \]
%
where $X$ is a random variable that means ``a value chosen
randomly from this distribution.''  So, for example, $CDF(5)$
is the probability that a value from this distribution is less
than or equal to 5.

If I draw $X$ from $CDF_1$ and $Y$ from $CDF_2$, and compute
the maximum $Z = \max(X, Y)$, what is the chance that $Z$ is
less than or equal to 5?
Well, in that case both $X$ and $Y$
must be less than or equal to 5.

\index{independence}
If the selections of $X$ and $Y$ are independent,
%
\[ CDF_3(5) = CDF_1(5) CDF_2(5) \]
%
where $CDF_3$ is the distribution of $Z$.  I chose the value
5 because I think it makes the formulas easy to read, but we
can generalize for any value of $z$:
%
\[ CDF_3(z) = CDF_1(z) CDF_2(z) \]
%
In the special case where we draw $k$ values from the same
distribution,
%
\[ CDF_k(z) = CDF_1(z)^k \]
%
So to find the distribution of the maximum of $k$ values,
we can enumerate the probabilities in the given Cdf
and raise them to the $k$th power.
\verb"Cdf" provides a method that does just that:

\begin{verbatim}
# class Cdf

    def Max(self, k):
        cdf = self.Copy()
        cdf.ps = [p**k for p in cdf.ps]
        return cdf
\end{verbatim}

\verb"Max" takes the number of selections, {\tt k}, and returns a new
Cdf that represents the distribution of the maximum of {\tt k}
selections.  The run time for this method is proportional to
$m$, the number of items in the Cdf.

\verb"Pmf.Max" does the same thing for Pmfs.
It has to do a little
more work to convert the Pmf to a Cdf, so the run time is proportional
to $m \log m$, but that's still better than quadratic.

Finally, here's an example that computes the distribution of
a character's best attribute:

\begin{verbatim}
best_attr_cdf = three_exact.Max(6)
best_attr_pmf = best_attr_cdf.MakePmf()
\end{verbatim}

where \verb"three_exact" is defined in the previous section.
If we print the results, we see that the chance of generating
a character with at least one attribute of 18 is about 3\%.
Figure~\ref{fig.dungeons2} shows the distribution.


\section{Mixtures}
\label{mixture}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons3.pdf}}
\caption{Distribution of the outcome for a random die from the box.}
\label{fig.dungeons3}
\end{figure}

Let's do one more example from {\it Dungeons~\&~Dragons}.  Suppose
I have a box of dice with the following inventory:

\begin{verbatim}
5   4-sided dice
4   6-sided dice
3   8-sided dice
2  12-sided dice
1  20-sided die
\end{verbatim}

I choose a die from the box and roll it.  What is the distribution
of the outcome?

If you know which die it is, the answer is easy.  A die with {\tt n}
sides yields a uniform distribution from 1 to {\tt n}, including both.
\index{uniform distribution}

But if we don't know which die it is, the resulting distribution is
a {\bf mixture} of uniform distributions with different bounds.
In general, this kind of mixture does not fit any simple mathematical
model, but it is straightforward to compute the distribution in
the form of a PMF.
\index{mixture}

As always, one option is to simulate the scenario, generate a random
sample, and compute the PMF of the sample.  This approach is simple
and it generates an approximate solution quickly.
But if we want an
exact solution, we need a different approach.
\index{simulation}

Let's start with a simple version of the problem where there are
only two dice, one with 6 sides and one with 8.  We can make a Pmf to
represent each die:

\begin{verbatim}
d6 = Die(6)
d8 = Die(8)
\end{verbatim}

Then we create a Pmf to represent the mixture:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die in [d6, d8]:
    for outcome, prob in die.Items():
        mix.Incr(outcome, prob)
mix.Normalize()
\end{verbatim}

The first loop enumerates the dice; the second enumerates the
outcomes and their probabilities.  Inside the loop,
{\tt Pmf.Incr} adds up the contributions from the two distributions.

This code assumes that the two dice are equally likely.  More
generally, we need to know the probability of each die so we can
weight the outcomes accordingly.

First we create a Pmf that maps from each die to the probability it is
selected:

\begin{verbatim}
pmf_dice = thinkbayes.Pmf()
pmf_dice.Set(Die(4), 5)
pmf_dice.Set(Die(6), 4)
pmf_dice.Set(Die(8), 3)
pmf_dice.Set(Die(12), 2)
pmf_dice.Set(Die(20), 1)
pmf_dice.Normalize()
\end{verbatim}

Next we need a more general version of the mixture algorithm:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die, weight in pmf_dice.Items():
    for outcome, prob in die.Items():
        mix.Incr(outcome, weight*prob)
\end{verbatim}

Now each die has a weight associated with it (which makes it a
weighted die, I suppose).  When we add each outcome to the mixture,
its probability is multiplied by {\tt weight}.

Figure~\ref{fig.dungeons3} shows the result.
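As a sanity check, the weighted mixture can be computed standalone
with plain dicts (a sketch that mirrors the loop above; the helper
\verb"die" is mine, not part of {\tt thinkbayes}):

```python
def die(sides):
    """A fair die as a dict mapping each outcome to its probability."""
    return {x: 1 / sides for x in range(1, sides + 1)}

# (die, number of copies in the box)
box = [(die(4), 5), (die(6), 4), (die(8), 3), (die(12), 2), (die(20), 1)]
total_weight = sum(weight for _, weight in box)

mix = {}
for pmf, weight in box:
    for outcome, prob in pmf.items():
        # Each outcome's probability is scaled by the chance of
        # drawing that die from the box.
        mix[outcome] = mix.get(outcome, 0) + (weight / total_weight) * prob
```

Values 1 through 4 get contributions from every die, so they share the
highest probability; values above 12 come only from the 20-sided die.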
As expected, values 1
through 4 are the most likely because any die can produce them.
Values above 12 are unlikely because there is only one die in the box
that can produce them (and it does so less than half the time).

{\tt thinkbayes} provides a function named {\tt MakeMixture}
that encapsulates this algorithm, so we could have written:

\begin{verbatim}
mix = thinkbayes.MakeMixture(pmf_dice)
\end{verbatim}

We'll use {\tt MakeMixture} again in Chapters~\ref{prediction}
and~\ref{observer}.


\section{Discussion}

Other than the odds form of Bayes's theorem, this chapter is not
specifically Bayesian.  But Bayesian analysis is all about
distributions, so it is important to understand the concept of a
distribution well.  From a computational point of view, a distribution
is any data structure that represents a set of values (possible
outcomes of a random process) and their probabilities.
\index{distribution}

We have seen two representations of distributions: Pmfs and Cdfs.
These representations are equivalent in the sense that they contain
the same information, so you can convert from one to the other.  The
primary difference between them is performance: some operations are
faster and easier with a Pmf; others are faster with a Cdf.
\index{Pmf} \index{Cdf}

The other goal of this chapter is to introduce operations that act on
distributions, like \verb"Pmf.__add__", {\tt Cdf.Max}, and {\tt
thinkbayes.MakeMixture}.  We will use these operations later, but I
introduce them now to encourage you to think of a distribution as a
fundamental unit of computation, not just a container for values and
probabilities.



\chapter{Decision Analysis}
\label{decisionanalysis}

\section{The {\it Price is Right} problem}

On November 1, 2007, contestants named Letia and Nathaniel appeared
on {\it The Price is Right}, an American game show.  They competed in
a game called {\it The Showcase}, where the objective is to guess the price
of a showcase of prizes.  The contestant who comes closest to the
actual price of the showcase, without going over, wins the prizes.
\index{Price is Right}
\index{Showcase}

Nathaniel went first.  His showcase included a dishwasher, a wine
cabinet, a laptop computer, and a car.  He bid \$26,000.

Letia's showcase included a pinball machine, a video arcade game, a
pool table, and a cruise of the Bahamas.  She bid \$21,500.

The actual price of Nathaniel's showcase was \$25,347.  His bid
was too high, so he lost.

The actual price of Letia's showcase was \$21,578.  She was only
off by \$78, so she won her showcase and, because
her bid was off by less than \$250, she also won Nathaniel's
showcase.

For a Bayesian thinker, this scenario suggests several questions:

\begin{enumerate}

\item Before seeing the prizes, what prior beliefs should the
contestant have about the price of the showcase?

\item After seeing the prizes, how should the contestant update
those beliefs?

\item Based on the posterior distribution, what should the
contestant bid?

\end{enumerate}

The third question demonstrates a common use of Bayesian analysis:
decision analysis.  Given a posterior distribution, we can choose
the bid that maximizes the contestant's expected return.
\index{decision analysis}

This problem is inspired by an example in Cameron Davidson-Pilon's
book, {\it Bayesian Methods for Hackers}.  The code I wrote for this
chapter is available from \url{http://thinkbayes.com/price.py}; it
reads data files you can download from
\url{http://thinkbayes.com/showcases.2011.csv} and
\url{http://thinkbayes.com/showcases.2012.csv}.
For more information
see Section~\ref{download}.
\index{Davidson-Pilon, Cameron}


\section{The prior}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price1.pdf}}
\caption{Distribution of prices for showcases on
{\it The Price is Right}, 2011--12.}
\label{fig.price1}
\end{figure}

To choose a prior distribution of prices, we can take advantage
of data from previous episodes.  Fortunately, fans of the show
keep detailed records.  When I corresponded with Mr.~Davidson-Pilon
about his book, he sent me data collected by Steve Gee at
\url{http://tpirsummaries.8m.com}.  It includes the price of
each showcase from the 2011 and 2012 seasons and the bids
offered by the contestants.
\index{Gee, Steve}

Figure~\ref{fig.price1} shows the distribution of prices for these
showcases.  The most common value for both showcases is around
\$28,000, but the first showcase has a second mode near \$50,000,
and the second showcase is occasionally worth more than \$70,000.

These distributions are based on actual data, but they
have been smoothed by Gaussian kernel density estimation (KDE).
Before we go on, I want to take a detour to talk about
probability density functions and KDE.
\index{kernel density estimation}
\index{KDE}


\section{Probability density functions}

So far we have been working with probability mass functions, or PMFs.
A PMF is a map from each possible value to its probability.
In my
implementation, a Pmf object provides a method named {\tt Prob} that
takes a value and returns a probability, also known as a {\bf probability
mass}.
\index{probability density function}
\index{Pdf}
\index{Pmf}

A {\bf probability density function}, or PDF, is the continuous version of a
PMF, where the possible values make up a continuous range rather than
a discrete set.

\index{Gaussian distribution}
In mathematical notation, PDFs are usually written as functions; for
example, here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1:
%
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
%
For a given value of $x$, this function computes a probability
density.
A density is similar
to a probability mass in the sense that a higher density indicates
that a value is more likely.
\index{density}
\index{probability density}
\index{probability}

But a density is not a probability.  A density can be 0 or any positive
value; it is not bounded, like a probability, between 0 and 1.

If you integrate a density
over a continuous range, the result is a probability.  But
for the applications in this book we seldom have to do that.

Instead we primarily use probability densities as part
of a likelihood function.  We will see an example soon.


\section{Representing PDFs}

\index{Pdf}
To represent PDFs in Python,
{\tt thinkbayes.py} provides a class named {\tt Pdf}.
{\tt Pdf} is an {\bf abstract type}, which means that it defines
the interface a Pdf is supposed to have, but does not provide
a complete implementation.
The {\tt Pdf} interface includes
two methods, {\tt Density} and {\tt MakePmf}:

\begin{verbatim}
class Pdf(object):

    def Density(self, x):
        raise UnimplementedMethodException()

    def MakePmf(self, xs):
        pmf = Pmf()
        for x in xs:
            pmf.Set(x, self.Density(x))
        pmf.Normalize()
        return pmf
\end{verbatim}

{\tt Density} takes a value, {\tt x}, and returns the corresponding
density.  {\tt MakePmf} makes a discrete approximation to the PDF.

{\tt Pdf} provides an implementation of {\tt MakePmf}, but not {\tt
Density}, which has to be provided by a child class.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

\index{Gaussian distribution}
A {\bf concrete type} is a child class that extends an abstract type
and provides an implementation of the missing methods.
For example, {\tt GaussianPdf} extends {\tt Pdf} and provides
{\tt Density}:

\begin{verbatim}
class GaussianPdf(Pdf):

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        return scipy.stats.norm.pdf(x, self.mu, self.sigma)
\end{verbatim}

\verb"__init__" takes {\tt mu} and {\tt sigma}, which are
the mean and standard deviation of the distribution, and stores
them as attributes.

{\tt Density} uses a function from {\tt scipy.stats} to evaluate the
Gaussian PDF.  The function is called {\tt norm.pdf} because the
Gaussian distribution is also called the ``normal'' distribution.
\index{scipy}
\index{normal distribution}

The Gaussian PDF is defined by a simple mathematical function,
so it is easy to evaluate.
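In fact, if you want to check {\tt Density} against the formula from
the previous section, a scipy-free stand-in is easy to write (a
sketch; the real class delegates to {\tt scipy.stats.norm.pdf}):

```python
import math

class GaussianPdf:
    """Evaluates the Gaussian density directly from the formula."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        # Standardize, then apply f(z) = exp(-z^2/2) / (sigma sqrt(2 pi)).
        z = (x - self.mu) / self.sigma
        return math.exp(-z**2 / 2) / (self.sigma * math.sqrt(2 * math.pi))

pdf = GaussianPdf(mu=0, sigma=1)
print(pdf.Density(0))  # 1/sqrt(2 pi), about 0.3989
```

The density is highest at the mean and symmetric around it, as
expected.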
And it is useful because many
quantities in the real world have distributions that are
approximately Gaussian.
\index{Gaussian distribution}
\index{Gaussian PDF}

But with real data, there is no guarantee that the distribution
is Gaussian or any other simple mathematical function.  In
that case we can use a sample to estimate the PDF of
the whole population.

For example, in {\it The Price Is Right} data, we have
313 prices for the first showcase.  We can think of these
values as a sample from the population of all possible showcase
prices.

This sample includes the following values (in order):
%
\[ 28800, 28868, 28941, 28957, 28958 \]
%
In the sample, no values appear between 28801 and 28867, but
there is no reason to think that these values are impossible.
Based on our background information, we expect all
values in this range to be equally likely.  In other words,
we expect the PDF to be fairly smooth.

Kernel density estimation (KDE) is an algorithm that takes
a sample and finds an appropriately smooth PDF that fits
the data.  You can read details at
\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.
\index{KDE}
\index{kernel density estimation}

{\tt scipy} provides an implementation of KDE and {\tt thinkbayes}
provides a class called {\tt EstimatedPdf} that
uses it:
\index{scipy}
\index{numpy}

\begin{verbatim}
class EstimatedPdf(Pdf):

    def __init__(self, sample):
        self.kde = scipy.stats.gaussian_kde(sample)

    def Density(self, x):
        return self.kde.evaluate(x)
\end{verbatim}

\verb"__init__" takes a sample
and computes a kernel density estimate.
The result is a
\verb"gaussian_kde" object that provides an {\tt evaluate}
method.

{\tt Density} takes a value, calls \verb"gaussian_kde.evaluate",
and returns the resulting density.
\index{density}

Finally, here's an outline of the code I used to generate
Figure~\ref{fig.price1}:
\index{numpy}

\begin{verbatim}
prices = ReadData()
pdf = thinkbayes.EstimatedPdf(prices)

low, high = 0, 75000
n = 101
xs = numpy.linspace(low, high, n)
pmf = pdf.MakePmf(xs)
\end{verbatim}

{\tt pdf} is a {\tt Pdf} object, estimated by KDE.  {\tt pmf}
is a Pmf object that approximates the Pdf by evaluating the density
at a sequence of equally spaced values.

{\tt linspace} stands for
``linear space.''  It takes a range, {\tt low} and {\tt high}, and
the number of points, {\tt n}, and returns a new {\tt numpy}
array with {\tt n} elements equally spaced between {\tt low} and
{\tt high}, including both.

And now back to {\it The Price is Right}.


\section{Modeling the contestants}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price2.pdf}}
\caption{Cumulative distribution (CDF) of the difference between the
contestant's bid and the actual price.}
\label{fig.price2}
\end{figure}

The PDFs in Figure~\ref{fig.price1} estimate the distribution of
possible prices.
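Before moving on, here is a toy, runnable version of the outline
above: a tiny hand-rolled Gaussian KDE in place of {\tt scipy}, the
five sample prices quoted earlier, and a hand-picked bandwidth (a
sketch only; {\tt gaussian\_kde} chooses its bandwidth automatically
and the real code uses all 313 prices):

```python
import math

sample = [28800, 28868, 28941, 28957, 28958]  # prices from the text
bandwidth = 500.0  # assumption; chosen by hand for this sketch

def density(x):
    # Average of Gaussian kernels centered on the sample points.
    k = sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) for s in sample)
    return k / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

# Discretize onto an evenly spaced grid, like MakePmf with linspace.
low, high, n = 27000, 31000, 101
xs = [low + (high - low) * i / (n - 1) for i in range(n)]
pmf = {x: density(x) for x in xs}
total = sum(pmf.values())
pmf = {x: p / total for x, p in pmf.items()}  # normalize
```

The resulting Pmf is smooth across the gap between 28801 and 28867,
which is the point of using KDE rather than the raw sample.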
If you were a contestant on the
show, you could use this distribution to quantify your prior belief
about the price of each showcase (before you see the prizes).

To update these priors, we have to answer these questions:

\begin{enumerate}

\item What data should we consider and how should we quantify it?

\item Can we compute a likelihood function; that is,
for each hypothetical value of {\tt price}, can we compute
the conditional likelihood of the data?

\end{enumerate}

To answer these questions, I am going to model the contestant
as a price-guessing instrument with known error characteristics.
In other words, when the contestant sees the prizes, he or she
guesses the price of each prize---ideally without taking into
consideration the fact that the prize is part of a showcase---and
adds up the prices.  Let's call this total {\tt guess}.
\index{error}

Under this model, the question we have to answer is, ``If the
actual price is {\tt price}, what is the likelihood that the
contestant's estimate would be {\tt guess}?''
\index{likelihood}

Or if we define
%
\begin{verbatim}
error = price - guess
\end{verbatim}
%
then we could ask, ``What is the likelihood
that the contestant's estimate is off by {\tt error}?''

To answer this question, we can use the historical data again.
Figure~\ref{fig.price2} shows the cumulative distribution of {\tt diff},
the difference between the contestant's bid and the actual price
of the showcase.
\index{Cdf}

The definition of diff is
%
\begin{verbatim}
diff = price - bid
\end{verbatim}
%
When {\tt diff} is negative, the bid is too high.
As an
aside, we can use this distribution to compute the probability that the
contestants overbid: the first contestant overbids 25\% of the
time; the second contestant overbids 29\% of the time.

We can also see that the bids are biased;
that is, they are more likely to be too low than too high.  And
that makes sense, given the rules of the game.

Finally, we can use this distribution to estimate the reliability of
the contestants' guesses.  This step is a little tricky because
we don't actually know the contestants' guesses; we only know
what they bid.

So we'll have to make some assumptions.  Specifically, I
assume that the distribution of {\tt error} is Gaussian with mean 0
and the same variance as {\tt diff}.
\index{Gaussian distribution}

The {\tt Player} class implements this model:
\index{numpy}

\begin{verbatim}
class Player(object):

    def __init__(self, prices, bids, diffs):
        self.pdf_price = thinkbayes.EstimatedPdf(prices)
        self.cdf_diff = thinkbayes.MakeCdfFromList(diffs)

        mu = 0
        sigma = numpy.std(diffs)
        self.pdf_error = thinkbayes.GaussianPdf(mu, sigma)
\end{verbatim}

{\tt prices} is a sequence of showcase prices, {\tt bids} is a
sequence of bids, and {\tt diffs} is a sequence of diffs, where
again {\tt diff = price - bid}.

\verb"pdf_price" is the smoothed PDF of prices, estimated by KDE.
\verb"cdf_diff" is the cumulative distribution of {\tt diff},
which we saw in Figure~\ref{fig.price2}.  And \verb"pdf_error"
is the PDF that characterizes the distribution of errors, where
{\tt error = price - guess}.

Again, we use the variance of {\tt diff} to estimate the variance of
{\tt error}.  This estimate is not perfect because contestants' bids
are sometimes strategic; for example, if Player 2 thinks that Player 1
has overbid, Player 2 might make a very low bid.  In that case {\tt
diff} does not reflect {\tt error}.
If this happens a lot, the
observed variance in {\tt diff} might overestimate the variance in
{\tt error}.  Nevertheless, I think it is a reasonable modeling
decision.

As an alternative, someone preparing to appear on the show could
estimate their own distribution of {\tt error} by watching previous shows
and recording their guesses and the actual prices.


\section{Likelihood}

Now we are ready to write the likelihood function.  As usual,
I define a new class that extends {\tt thinkbayes.Suite}:
\index{likelihood}

\begin{verbatim}
class Price(thinkbayes.Suite):

    def __init__(self, pmf, player):
        thinkbayes.Suite.__init__(self, pmf)
        self.player = player
\end{verbatim}

{\tt pmf} represents the prior distribution and
{\tt player} is a Player object as described in the previous
section.  Here's {\tt Likelihood}:

\begin{verbatim}
def Likelihood(self, data, hypo):
    price = hypo
    guess = data

    error = price - guess
    like = self.player.ErrorDensity(error)

    return like
\end{verbatim}

{\tt hypo} is the hypothetical price of the showcase.  {\tt data}
is the contestant's best guess at the price.  {\tt error} is
the difference, and {\tt like} is the likelihood of the data,
given the hypothesis.

{\tt ErrorDensity} is defined in {\tt Player}:

\begin{verbatim}
# class Player:

    def ErrorDensity(self, error):
        return self.pdf_error.Density(error)
\end{verbatim}

{\tt ErrorDensity} works by evaluating \verb"pdf_error" at
the given value of {\tt error}.
The result is a probability density, so it is not really a probability.
But remember that {\tt Likelihood} doesn't
need to compute a probability; it only has to compute something {\em
proportional} to a probability.
As long as the constant of
proportionality is the same for all likelihoods, it gets canceled out
when we normalize the posterior distribution.
\index{density}
\index{likelihood}

And therefore, a probability density is a perfectly good likelihood.


\section{Update}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price3.pdf}}
\caption{Prior and posterior distributions for Player 1, based on
a best guess of \$20,000.}
\label{fig.price3}
\end{figure}

{\tt Player} provides a method that takes the contestant's
guess and computes the posterior distribution:

\begin{verbatim}
# class Player

    def MakeBeliefs(self, guess):
        pmf = self.PmfPrice()
        self.prior = Price(pmf, self)
        self.posterior = self.prior.Copy()
        self.posterior.Update(guess)
\end{verbatim}

{\tt PmfPrice} generates a discrete approximation
to the PDF of price, which we use to construct the prior.

{\tt PmfPrice} uses {\tt MakePmf}, which
evaluates \verb"pdf_price" at a sequence of values:

\begin{verbatim}
# class Player

    n = 101
    price_xs = numpy.linspace(0, 75000, n)

    def PmfPrice(self):
        return self.pdf_price.MakePmf(self.price_xs)
\end{verbatim}

To construct the posterior, we make a copy of the
prior and then invoke {\tt Update}, which invokes {\tt Likelihood}
for each hypothesis, multiplies the priors by the likelihoods,
and renormalizes.
\index{normalize}

So let's get back to the original scenario.  Suppose you are
Player 1 and when you see your showcase, your best guess is
that the total price of the prizes is \$20,000.

Figure~\ref{fig.price3} shows prior and
posterior beliefs about the actual price.
The posterior is shifted
to the left because your guess
is on the low end of the prior range.

On one level, this result makes sense.
The most likely value
in the prior is \$27,750, your best guess is \$20,000, and
the mean of the posterior is somewhere in between: \$25,096.

On another level, you might find this result bizarre, because it
suggests that if you {\em think} the price is \$20,000, then you
should {\em believe} the price is \$24,000.

To resolve this apparent paradox, remember that you are combining two
sources of information, historical data about past showcases and
guesses about the prizes you see.

We are treating the historical data as the prior and updating it
based on your guesses, but we could equivalently use your guess
as a prior and update it based on historical data.

If you think of it that way, maybe it is less surprising that the
most likely value in the posterior is not your original guess.


\section{Optimal bidding}

Now that we have a posterior distribution, we can use it to
compute the optimal bid, which I define as the bid that maximizes
expected return (see \url{http://en.wikipedia.org/wiki/Expected_return}).
\index{decision analysis}

I'm going to present the methods in this section top-down, which
means I will show you how they are used before I show you how they
work.
If you see an unfamiliar method, don't worry; the definition
will be along shortly.

To compute optimal bids, I wrote a class called {\tt GainCalculator}:

\begin{verbatim}
class GainCalculator(object):

    def __init__(self, player, opponent):
        self.player = player
        self.opponent = opponent
\end{verbatim}

{\tt player} and {\tt opponent} are {\tt Player} objects.

{\tt GainCalculator} provides {\tt ExpectedGains}, which
computes a sequence of bids and the expected gain for each
bid:
\index{numpy}

\begin{verbatim}
    def ExpectedGains(self, low=0, high=75000, n=101):
        bids = numpy.linspace(low, high, n)

        gains = [self.ExpectedGain(bid) for bid in bids]

        return bids, gains
\end{verbatim}

{\tt low} and {\tt high} specify the range of possible bids;
{\tt n} is the number of bids to try.

{\tt ExpectedGains} calls {\tt ExpectedGain}, which
computes expected gain for a given bid:

\begin{verbatim}
    def ExpectedGain(self, bid):
        suite = self.player.posterior
        total = 0
        for price, prob in sorted(suite.Items()):
            gain = self.Gain(bid, price)
            total += prob * gain
        return total
\end{verbatim}

{\tt ExpectedGain} loops through the values in the posterior
and computes the gain for each bid, given the actual prices of
the showcase.
It weights each gain with the corresponding
probability and returns the total.

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price5.pdf}}
\caption{Expected gain versus bid in a scenario where Player 1's best
guess is \$20,000 and Player 2's best guess is \$40,000.}
\label{fig.price5}
\end{figure}

{\tt ExpectedGain} invokes {\tt Gain}, which takes a bid and an actual
price and returns the expected gain:

\begin{verbatim}
    def Gain(self, bid, price):
        if bid > price:
            return 0

        diff = price - bid
        prob = self.ProbWin(diff)

        if diff <= 250:
            return 2 * price * prob
        else:
            return price * prob
\end{verbatim}

If you overbid, you get nothing. Otherwise we compute
the difference between your bid and the price, which determines
your probability of winning.

If {\tt diff} is less than \$250, you win both showcases. For
simplicity, I assume that both showcases have the same price. Since
this outcome is rare, it doesn't make much difference.

Finally, we have to compute the probability of winning based
on {\tt diff}:

\begin{verbatim}
    def ProbWin(self, diff):
        prob = (self.opponent.ProbOverbid() +
                self.opponent.ProbWorseThan(diff))
        return prob
\end{verbatim}

If your opponent overbids, you win. Otherwise, you have to hope
that your opponent is off by more than {\tt diff}.
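The payoff rule in {\tt Gain} can be checked by hand. Here is a
standalone sketch (mine, not the book's code) with \verb"prob_win"
as a fixed stand-in for {\tt ProbWin}, which in the real class depends
on the opponent's distribution; the bids and prices are invented:

```python
# Sketch of the payoff logic in Gain; prob_win is a made-up stand-in
# for ProbWin, which really depends on the opponent's distribution.
def gain(bid, price, prob_win):
    if bid > price:
        return 0                        # overbid: you get nothing
    diff = price - bid
    if diff <= 250:
        return 2 * price * prob_win     # within $250: win both showcases
    return price * prob_win             # plain underbid: win one showcase

g_over = gain(21000, 20000, 0.5)    # overbid by $1000
g_close = gain(19800, 20000, 0.5)   # under by $200, within $250
g_under = gain(15000, 20000, 0.5)   # under by $5000
```

With a 50\% chance of winning, overbidding yields 0, landing within
\$250 yields twice the showcase value times the probability, and a
plain underbid yields the showcase value times the probability.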
{\tt Player}
provides methods to compute both probabilities:

\begin{verbatim}
# class Player:

    def ProbOverbid(self):
        return self.cdf_diff.Prob(-1)

    def ProbWorseThan(self, diff):
        return 1 - self.cdf_diff.Prob(diff)
\end{verbatim}

This code might be confusing because the computation is now from
the point of view of the opponent, who is computing, ``What is
the probability that I overbid?'' and ``What is the probability
that my bid is off by more than {\tt diff}?''

Both answers are based on the CDF of {\tt diff}. If the opponent's
{\tt diff} is less than or equal to -1, you win. If the opponent's
{\tt diff} is worse than yours, you win. Otherwise you lose.

Finally, here's the code that computes optimal bids:

\begin{verbatim}
# class Player:

    def OptimalBid(self, guess, opponent):
        self.MakeBeliefs(guess)
        calc = GainCalculator(self, opponent)
        bids, gains = calc.ExpectedGains()
        gain, bid = max(zip(gains, bids))
        return bid, gain
\end{verbatim}

Given a guess and an opponent, {\tt OptimalBid} computes
the posterior distribution, instantiates a {\tt GainCalculator},
computes expected gains for a range of bids, and returns
the optimal bid and expected gain. Whew!

Figure~\ref{fig.price5} shows the results for both players,
based on a scenario where Player 1's best guess is \$20,000
and Player 2's best guess is \$40,000.

For Player 1 the optimal bid is \$21,000, yielding an expected
return of almost \$16,700. This is a case (which turns out
to be unusual) where the optimal bid is actually higher than
the contestant's best guess.

For Player 2 the optimal bid is \$31,500, yielding an expected
return of almost \$19,400.
This is the more typical case where
the optimal bid is less than the best guess.


\section{Discussion}

One of the features of Bayesian estimation is that the
result comes in the form of a posterior distribution. Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.
\index{distribution}

In this example, we use the posterior distribution
to compute an optimal bid. The return on a given bid is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically. But it is relatively simple to do
computationally.
\index{decision analysis}

Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate. These summaries can be useful, but if that's
all you need, then you probably don't need Bayesian methods in the
first place.
\index{maximum likelihood}
\index{summary statistic}

Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the analysis to perform some
kind of decision analysis, as we did in this chapter, or some kind of
prediction, as we see in the next chapter.


\chapter{Prediction}
\label{prediction}

\section{The Boston Bruins problem}

In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
Bruins played a best-of-seven championship series against the despised
Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then
won the next two games 8-1 and 4-0.
At this point in the series, what
is the probability that Boston will win the next game, and what is
their probability of winning the championship?
\index{hockey}
\index{Boston Bruins}
\index{Vancouver Canucks}

As always, to answer a question like this, we need to make some
assumptions. First, it is reasonable to believe that goal scoring in
hockey is at least approximately a Poisson process, which means that
it is equally likely for a goal to be scored at any time during a
game. Second, we can assume that against a particular opponent, each team
has some long-term average goals per game, denoted $\lambda$.
\index{Poisson process}

Given these assumptions, my strategy for answering this question is

\begin{enumerate}

\item Use statistics from previous games to choose a prior
  distribution for $\lambda$.

\item Use the score from the first four games to estimate $\lambda$
  for each team.

\item Use the posterior distributions of $\lambda$ to compute the
  distribution of goals for each team, the distribution of the
  goal differential, and the probability that each team wins
  the next game.

\item Compute the probability that each team wins the series.

\end{enumerate}

To choose a prior distribution, I got some statistics from
\url{http://www.nhl.com}, specifically the average goals per game
for each team in the 2010-11 season. The distribution is roughly
Gaussian with mean 2.8 and standard deviation 0.3.
\index{National Hockey League}
\index{NHL}

The Gaussian distribution is continuous, but we'll approximate it with
a discrete Pmf.
\verb"thinkbayes" provides \verb"MakeGaussianPmf" to
do exactly that:
\index{numpy}
\index{Gaussian distribution}

\begin{verbatim}
def MakeGaussianPmf(mu, sigma, num_sigmas, n=101):
    pmf = Pmf()
    low = mu - num_sigmas*sigma
    high = mu + num_sigmas*sigma

    for x in numpy.linspace(low, high, n):
        p = scipy.stats.norm.pdf(x, mu, sigma)
        pmf.Set(x, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt mu} and {\tt sigma} are the mean and standard deviation of the
Gaussian distribution. \verb"num_sigmas" is the number of standard
deviations above and below the mean that the Pmf will span, and {\tt
n} is the number of values in the Pmf.

Again we use {\tt numpy.linspace} to make an array of {\tt n}
equally spaced values between {\tt low} and {\tt high}, including
both.

\verb"norm.pdf" evaluates the Gaussian probability density function (PDF).
\index{PDF}
\index{probability density function}

Getting back to the hockey problem, here's the definition for a suite
of hypotheses about the value of $\lambda$:

\begin{verbatim}
class Hockey(thinkbayes.Suite):

    def __init__(self, name=''):
        pmf = thinkbayes.MakeGaussianPmf(2.7, 0.3, 4)
        thinkbayes.Suite.__init__(self, pmf, name=name)
\end{verbatim}

So the prior distribution is Gaussian with mean 2.7, standard deviation
0.3, and it spans 4 sigmas above and below the mean.

As always, we have to decide how to represent each hypothesis; in
this case I represent the hypothesis that $\lambda=x$ with the
floating-point value {\tt x}.


\section{Poisson processes}

In mathematical statistics, a {\bf process} is a stochastic model of a
physical system (``stochastic'' means that the model has some kind of
randomness in it). For example, a Bernoulli process is a model of a
sequence of events, called trials, in which each trial has two
possible outcomes, like success and failure.
So a Bernoulli process
is a natural model for a series of coin flips, or a series of shots on
goal.
\index{process}
\index{Poisson process}

A Poisson process is the continuous version of a Bernoulli process,
where an event can occur at any point in time with equal probability.
Poisson processes can be used to model customers arriving in a store,
buses arriving at a bus stop, or goals scored in a hockey game.
\index{Bernoulli process}

In many real systems the probability of an event changes over time.
Customers are more likely to go to a store at certain times of day,
buses are supposed to arrive at fixed intervals, and goals are more
or less likely at different times during a game.

But all models are based on simplifications, and in this case modeling
a hockey game with a Poisson process is a reasonable choice. Heuer,
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
and come to the same conclusion; see
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}.
\index{Heuer, Andreas}

The benefit of using this model is that we can compute the distribution
of goals per game efficiently, as well as the distribution of time
between goals.
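These two distributions are linked. Here is a quick simulation (mine,
not the book's code) that draws exponentially distributed times between
goals at rate {\tt lam} and counts how many goals land in one game;
averaged over many simulated games, the count comes out close to
{\tt lam}, as the Poisson model predicts:

```python
import random

# Simulate a Poisson process: exponential gaps between goals at rate
# lam, counting goals that fall within one game (one unit of time).
def simulate_game(lam, rng):
    t, goals = 0.0, 0
    while True:
        t += rng.expovariate(lam)   # time until the next goal
        if t > 1.0:
            return goals
        goals += 1

rng = random.Random(17)             # fixed seed for repeatability
lam = 2.8
games = [simulate_game(lam, rng) for _ in range(20000)]
avg = sum(games) / len(games)
```

The average number of goals per simulated game is within a few
percent of {\tt lam}.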
Specifically, if the average number of goals
in a game is {\tt lam}, the distribution of goals per game is
given by the Poisson PMF:
\index{Poisson distribution}

\begin{verbatim}
def EvalPoissonPmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)
\end{verbatim}

And the distribution of time between goals is given by the
exponential PDF:
\index{exponential distribution}

\begin{verbatim}
def EvalExponentialPdf(x, lam):
    return lam * math.exp(-lam * x)
\end{verbatim}

I use the variable
{\tt lam} because {\tt lambda} is a reserved keyword in Python.
Both of these functions are in \verb"thinkbayes.py".


\section{The posteriors}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey1.pdf}}
\caption{Posterior distribution of the number of
goals per game.}
\label{fig.hockey1}
\end{figure}

Now we can compute the likelihood that a team with a hypothetical
value of {\tt lam} scores {\tt k} goals in a game:

\begin{verbatim}
# class Hockey

    def Likelihood(self, data, hypo):
        lam = hypo
        k = data
        like = thinkbayes.EvalPoissonPmf(k, lam)
        return like
\end{verbatim}

Each hypothesis is a possible value of $\lambda$; {\tt
data} is the observed number of goals, {\tt k}.

With the likelihood function in place, we can make a suite for each
team and update them with the scores from the first four games.

\begin{verbatim}
suite1 = Hockey('bruins')
suite1.UpdateSet([0, 2, 8, 4])

suite2 = Hockey('canucks')
suite2.UpdateSet([1, 3, 1, 0])
\end{verbatim}

Figure~\ref{fig.hockey1} shows the resulting posterior distributions
for {\tt lam}.
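The update just shown can be sketched without the {\tt thinkbayes}
machinery. In this sketch (mine, not the book's code) a plain dict
stands in for the Suite, and a hypothetical three-point prior over
{\tt lam} replaces the Gaussian prior, so the numbers are for
illustration only:

```python
import math

# Stripped-down version of the update: multiply each hypothesis by
# the Poisson likelihood of each observed score, then renormalize.
def eval_poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def update_set(suite, data):
    for k in data:
        for lam in suite:
            suite[lam] *= eval_poisson_pmf(k, lam)
    total = sum(suite.values())
    return {lam: p / total for lam, p in suite.items()}

prior = {2.0: 1/3, 3.0: 1/3, 4.0: 1/3}             # hypothetical prior
posterior = update_set(dict(prior), [0, 2, 8, 4])  # Bruins' scores
best = max(posterior, key=posterior.get)
```

With these four scores (mean 3.5), the posterior concentrates on the
larger values of {\tt lam}, as expected.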
Based on the first four games, the most likely
values for {\tt lam} are 2.6 for the Canucks and 2.9 for the Bruins.


\section{The distribution of goals}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey2.pdf}}
\caption{Distribution of goals in a single game.}
\label{fig.hockey2}
\end{figure}

To compute the probability that each team wins the next game,
we need to compute the distribution of goals for each team.

If we knew the value of {\tt lam} exactly, we could use the
Poisson distribution again. \verb"thinkbayes" provides a
method that computes a truncated approximation of a Poisson
distribution:
\index{Poisson distribution}

\begin{verbatim}
def MakePoissonPmf(lam, high):
    pmf = Pmf()
    for k in xrange(0, high+1):
        p = EvalPoissonPmf(k, lam)
        pmf.Set(k, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

The range of values in the computed Pmf is from 0 to {\tt high}.
So if the value of {\tt lam} were exactly 3.4, we would compute:

\begin{verbatim}
lam = 3.4
goal_dist = thinkbayes.MakePoissonPmf(lam, 10)
\end{verbatim}

I chose the upper bound, 10, because the probability of scoring
more than 10 goals in a game is quite low.

That's simple enough so far; the problem is that we don't know
the value of {\tt lam} exactly.
Instead, we have a distribution
of possible values for {\tt lam}.

For each value of {\tt lam}, the distribution of goals is Poisson.
So the overall distribution of goals is a mixture of these
Poisson distributions, weighted according to the probabilities
in the distribution of {\tt lam}.
\index{mixture}
\index{Poisson distribution}

Given the posterior distribution of {\tt lam}, here's the code
that makes the distribution of goals:

\begin{verbatim}
def MakeGoalPmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakePoissonPmf(lam, 10)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

For each value of {\tt lam} we make a Poisson Pmf and add it to the
meta-Pmf. I call it a meta-Pmf because it is a Pmf that contains
Pmfs as its values.
\index{meta-Pmf}

Then we use \verb"MakeMixture" to compute the mixture
(we saw {\tt MakeMixture} in Section~\ref{mixture}).
\index{mixture}
\index{MakeMixture}

Figure~\ref{fig.hockey2} shows the resulting distribution of goals for
the Bruins and Canucks. The Bruins are less likely to
score 3 goals or fewer in the next game, and more likely to score 4 or
more.


\section{The probability of winning}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey3.pdf}}
\caption{Distribution of time between goals.}
\label{fig.hockey3}
\end{figure}

To get the probability of winning, first we compute the
distribution of the goal differential:

\begin{verbatim}
goal_dist1 = MakeGoalPmf(suite1)
goal_dist2 = MakeGoalPmf(suite2)
diff = goal_dist1 - goal_dist2
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
pairs of values and computes the difference.
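That enumeration can be sketched with plain dicts. This is my own
illustration, not the book's code, and the two goal distributions
here are tiny made-up examples chosen so the arithmetic is easy to
check by hand:

```python
# Sketch of Pmf.__sub__: enumerate all pairs of values, accumulate
# the probability of each difference, then read off win/loss/tie.
def sub_pmfs(pmf1, pmf2):
    diff = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            d = v1 - v2
            diff[d] = diff.get(d, 0.0) + p1 * p2
    return diff

goals1 = {0: 0.3, 1: 0.7}   # hypothetical goal distribution, team 1
goals2 = {0: 0.5, 1: 0.5}   # hypothetical goal distribution, team 2
diff = sub_pmfs(goals1, goals2)

p_win = sum(p for d, p in diff.items() if d > 0)
p_loss = sum(p for d, p in diff.items() if d < 0)
p_tie = diff.get(0, 0.0)
```

For these made-up distributions, team 1 wins with probability 0.35,
loses with probability 0.15, and ties with probability 0.5.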
Subtracting two
distributions is almost the same as adding, which we saw in
Section~\ref{addends}.

If the goal differential is positive, the Bruins win; if negative, the
Canucks win; if 0, it's a tie:

\begin{verbatim}
p_win = diff.ProbGreater(0)
p_loss = diff.ProbLess(0)
p_tie = diff.Prob(0)
\end{verbatim}

With the distributions from the previous section, \verb"p_win"
is 46\%, \verb"p_loss" is 37\%, and \verb"p_tie" is 17\%.

In the event of a tie at the end of ``regulation play,'' the teams play
overtime periods until one team scores. Since the game ends
immediately when the first goal is scored, this overtime format
is known as ``sudden death.''
\index{overtime}
\index{sudden death}


\section{Sudden death}

To compute the probability of winning in a sudden death overtime,
the important statistic is not goals per game, but time until the
first goal. The assumption that goal-scoring is a Poisson process
implies that the time between goals
is exponentially distributed.
\index{Poisson process}
\index{exponential distribution}

Given {\tt lam}, we can compute the time between goals like this:

\begin{verbatim}
lam = 3.4
time_dist = thinkbayes.MakeExponentialPmf(lam, high=2, n=101)
\end{verbatim}

{\tt high} is the upper bound of the distribution. In this case
I chose 2, because the probability of going more than two games
without scoring is small. {\tt n} is the number of values in
the Pmf.

If we know {\tt lam} exactly, that's all there is to it.
But we don't; instead we have a posterior
distribution of possible values.
So as we did with the distribution
of goals, we make a meta-Pmf and compute a mixture of
Pmfs.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}

\begin{verbatim}
def MakeGoalTimePmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakeExponentialPmf(lam, high=2, n=2001)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

Figure~\ref{fig.hockey3} shows the resulting distributions. For
time values less than one period (one third of a game), the Bruins
are more likely to score. The time until the Canucks score is
more likely to be longer.

I set the number of values, {\tt n}, fairly high in order to minimize
the number of ties, since it is not possible for both teams
to score simultaneously.

Now we compute the probability that the Bruins score first:

\begin{verbatim}
time_dist1 = MakeGoalTimePmf(suite1)
time_dist2 = MakeGoalTimePmf(suite2)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)
\end{verbatim}

For the Bruins, the probability of winning in overtime is 52\%.

Finally, the total probability of winning is the chance of
winning at the end of regulation play plus the probability
of winning in overtime.

\begin{verbatim}
p_tie = diff.Prob(0)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)

p_win = diff.ProbGreater(0) + p_tie * p_overtime
\end{verbatim}

For the Bruins, the overall chance of winning the next game is 55\%.

To win the series, the Bruins can either win the next two games
or split the next two and win the third. Again, we can compute
the total probability:

\begin{verbatim}
# win the next two
p_series = p_win**2

# split the next two, win the third
p_series += 2 * p_win * (1-p_win) * p_win
\end{verbatim}

The Bruins' chance of winning the series is 57\%.
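The series arithmetic can be verified directly with the per-game
probability from the text:

```python
# Check the series calculation with p_win = 0.55 per game:
# win the next two outright, or split them and win the third.
p_win = 0.55
p_series = p_win**2                          # win-win
p_series += 2 * p_win * (1 - p_win) * p_win  # win-loss-win or loss-win-win
```

The total is 0.3025 + 0.27225 = 0.57475, which rounds to 57\%.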
And in 2011,
they did.


\section{Discussion}

As always, the analysis in this chapter is based on modeling decisions,
and modeling is almost always an iterative process. In general,
you want to start with something simple that yields an approximate
answer, identify likely sources of error, and look for opportunities
for improvement.
\index{modeling}
\index{iterative modeling}

In this example, I would consider these options:

\begin{itemize}

\item I chose a prior based on the average goals per game for each
  team. But this statistic is averaged across all opponents. Against
  a particular opponent, we might expect more variability. For
  example, if the team with the best offense plays the team with the
  worst defense, the expected goals per game might be several standard
  deviations above the mean.

\item For data I used only the first four games of the championship
  series. If the same teams played each other during the
  regular season, I could use the results from those games as well.
  One complication is that the composition of teams changes during
  the season due to trades and injuries. So it might be best to
  give more weight to recent games.

\item To take advantage of all available information, we could
  use results from all regular season games to estimate each team's
  goal scoring rate, possibly adjusted by estimating
  an additional factor for each pairwise match-up. This approach
  would be more complicated, but it is still feasible.

\end{itemize}

For the first option, we could use the results from the regular season
to estimate the variability across all pairwise match-ups.
Thanks to
Dirk Hoag at \url{http://forechecker.blogspot.com/}, I was able to get
the number of goals scored during regulation play (not overtime) for
each game in the regular season.
\index{Hoag, Dirk}

Teams in different conferences only play each other one or two
times in the regular season, so I focused on pairs that played
each other 4--6 times. For each pair, I computed the average
goals per game, which is an estimate of $\lambda$, then plotted
the distribution of these estimates.

The mean of these estimates is 2.8, again, but the standard
deviation is 0.85, substantially higher than what we got computing
one estimate for each team.

If we run the analysis again with the higher-variance prior, the
probability that the Bruins win the series is 80\%, substantially
higher than the result with the low-variance prior, 57\%.

So it turns out that the results are sensitive to the prior, which
makes sense considering how little data we have to work with. Based
on the difference between the low-variance model and the high-variance
model, it seems worthwhile to put some effort into getting the prior
right.

The code and data for this chapter are available from
\url{http://thinkbayes.com/hockey.py} and
\url{http://thinkbayes.com/hockey_data.csv}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

If buses arrive at a bus stop every 20 minutes, and you
arrive at the bus stop at a random time, your wait time until
the bus arrives is uniformly distributed from 0 to 20 minutes.
\index{bus stop problem}

But in reality, there is variability in the time between
buses. Suppose you are waiting for a bus, and you know the historical
distribution of time between buses. Compute your distribution
of wait times.

Hint: Suppose that the time between buses is either
5 or 10 minutes with equal probability.
What is the probability
that you arrive during one of the 10 minute intervals?

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that passengers arriving at the bus stop are well-modeled
by a Poisson process with parameter $\lambda$. If you arrive at the
stop and find 3 people waiting, what is your posterior distribution
for the time since the last bus arrived?
\index{Poisson process}
\index{bus stop problem}

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that you are an ecologist sampling the insect population in
a new environment. You deploy 100 traps in a test area and come back
the next day to check on them. You find that 37 traps have been
triggered, trapping an insect inside. Once a trap triggers, it
cannot trap another insect until it has been reset.
\index{insect sampling problem}

If you reset the traps and come back in two days, how many traps
do you expect to find triggered? Compute a posterior predictive
distribution for the number of traps.
\index{predictive distribution}

\end{exercise}


\begin{exercise}

Suppose you are the manager of an apartment building with
100 light bulbs in common areas. It is your responsibility
to replace light bulbs when they break.
\index{light bulb problem}

On January 1, all 100 bulbs are working. When you inspect
them on February 1, you find 3 light bulbs out. If you
come back on April 1, how many light bulbs do you expect to
find broken?

In the previous exercise, you could reasonably assume that an event is
equally likely at any time. For light bulbs, the likelihood of
failure depends on the age of the bulb.
Specifically, old bulbs
have an increasing failure rate due to evaporation of the filament.

This problem is more open-ended than some; you will have to make
modeling decisions. You might want to read about the Weibull
distribution
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
Or you might want to look around for information about
light bulb survival curves.
\index{Weibull distribution}

\end{exercise}


\chapter{Observer Bias}
\label{observer}

\section{The Red Line problem}

In Massachusetts, the Red Line is a subway that connects
Cambridge and Boston. When I was working in Cambridge I took the Red
Line from Kendall Square to South Station and caught the commuter rail
to Needham. During rush hour Red Line trains run every 7--8
minutes, on average.
\index{Red Line problem}
\index{Boston}

When I arrived at the station, I could estimate the time until
the next train based on the number of passengers on the platform.
If there were only a few people, I inferred that I just missed
a train and expected to wait about 7 minutes. If there were
more passengers, I expected the train to arrive sooner. But if
there were a large number of passengers, I suspected that
trains were not running on schedule, so I would go back to the
street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian
estimation could help predict my wait time and decide when I
should give up and take a taxi. This chapter presents the
analysis I came up with.

This chapter is based on a project by Brendan Ritter and
Kai Austin, who took a class with me at Olin College.
The code in this chapter is available from
\url{http://thinkbayes.com/redline.py}.
The code I used
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
For more information
see Section~\ref{download}.
\index{Olin College}


\section{The model}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
\caption{PMF of gaps between trains, based on collected data,
smoothed by KDE. {\tt z} is the actual distribution; {\tt zb}
is the biased distribution seen by passengers.}
\label{fig.redline0}
\end{figure}

Before we get to the analysis, we have to make some
modeling decisions. First, I will treat passenger arrivals as
a Poisson process, which means I assume that passengers are equally
likely to arrive at any time, and that they arrive at an unknown
rate, $\lambda$, measured in passengers per minute. Since I
observe passengers during a short period of time, and at the same
time every day, I assume that $\lambda$ is constant.
\index{Poisson process}

On the other hand, the arrival process for trains is not Poisson.
Trains to Boston are supposed to leave from the end of the line
(Alewife station) every 7--8 minutes during peak times, but by the time
they get to Kendall Square, the time between trains varies between 3
and 12 minutes.

To gather data on the time between trains, I wrote a script that
downloads real-time data from
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
trains arriving at Kendall Square, and records their arrival times
in a database. I ran the script from 4pm to 6pm every weekday
for 5 days, and recorded about 15 arrivals per day. Then
I computed the time between consecutive arrivals; the distribution
of these gaps is shown in Figure~\ref{fig.redline0}, labeled {\tt z}.

If you stood on the platform from 4pm to 6pm and recorded the time
between trains, this is the distribution you would see.
But if you
arrive at some random time (without regard to the train schedule) you
would see a different distribution. The average time
between trains, as seen by a random passenger, is substantially
higher than the true average.

Why? Because a passenger is more likely to arrive during a
large interval than a small one. Consider a simple example:
suppose that the time between trains is either 5 minutes
or 10 minutes with equal probability. In that case
the average time between
trains is 7.5 minutes.

But a passenger is more likely to arrive during a 10 minute gap
than a 5 minute gap; in fact, twice as likely. If we surveyed
arriving passengers, we would find that 2/3 of them arrived during
a 10 minute gap, and only 1/3 during a 5 minute gap. So the
average time between trains, as seen by an arriving passenger,
is 8.33 minutes.

This kind of {\bf observer bias} appears in many contexts. Students
think that classes are bigger than they are because more of them are
in the big classes. Airline passengers think that planes are fuller
than they are because more of them are on full flights.
\index{observer bias}

In each case, values from the actual distribution are
oversampled in proportion to their value. In the Red Line example,
a gap that is twice as big is twice as likely to be observed.

So given the actual distribution of gaps, we can compute the
distribution of gaps as seen by passengers. {\tt BiasPmf}
does this computation:

\begin{verbatim}
def BiasPmf(pmf):
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
\end{verbatim}

{\tt pmf} is the actual distribution; \verb"new_pmf" is the
biased distribution. Inside the loop, we multiply the
probability of each value, {\tt x}, by the likelihood it will
be observed, which is proportional to {\tt x}.
Then we
normalize the result.

Figure~\ref{fig.redline0} shows the actual distribution of gaps,
labeled {\tt z}, and the distribution of gaps seen by passengers,
labeled {\tt zb} for ``z biased''.


\section{Wait times}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
\caption{CDF of {\tt z}, {\tt zb}, and the wait time seen
by passengers, {\tt y}.}
\label{fig.redline2}
\end{figure}

Wait time, which I call {\tt y}, is the time between the arrival
of a passenger and the next arrival of a train. Elapsed time, which I
call {\tt x}, is the time between the arrival of the previous
train and the arrival of a passenger. I chose these definitions
so that {\tt zb = x + y}.

Given the distribution of {\tt zb}, we can compute the distribution of
{\tt y}. I'll start with a simple case and then generalize.
Suppose, as in the previous example, that {\tt zb} is either 5 minutes
with probability 1/3, or 10 minutes with probability 2/3.

If we arrive at a random time during a 5 minute gap,
{\tt y} is uniform from 0 to 5 minutes. If we arrive during a 10
minute gap, {\tt y} is uniform from 0 to 10. So the overall
distribution is a mixture of uniform distributions weighted
according to the probability of each gap.
\index{uniform distribution}

The following function takes the distribution of {\tt zb} and
computes the distribution of {\tt y}:

\begin{verbatim}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)

    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{verbatim}

{\tt PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
distribution to its probability.
Then it uses {\tt MakeMixture},
which we saw in Section~\ref{mixture}, to compute the mixture.
\index{mixture}
\index{MakeMixture}
\index{meta-Pmf}

{\tt PmfOfWaitTime} also uses {\tt MakeUniformPmf}, defined here:

\begin{verbatim}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt low} and {\tt high} are the range of the uniform distribution
(both ends included).  Finally, {\tt MakeUniformPmf} uses {\tt
MakeRange}, defined here:

\begin{verbatim}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{verbatim}

{\tt MakeRange} defines a set of possible values for wait time
(expressed in seconds).  By default it divides the range into
10 second intervals.

To encapsulate the process of computing these distributions, I
created a class called {\tt WaitTimeCalculator}:

\begin{verbatim}
class WaitTimeCalculator(object):

    def __init__(self, pmf_z):
        self.pmf_z = pmf_z
        self.pmf_zb = BiasPmf(pmf_z)

        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{verbatim}

The parameter, \verb"pmf_z", is the unbiased distribution of {\tt z}.
\verb"pmf_zb" is the biased distribution of gap time, as seen by
passengers.

\verb"pmf_y" is the distribution of wait time.  \verb"pmf_x" is the
distribution of elapsed time, which is the same as the distribution of
wait time.
To see why, remember that for a particular value of
{\tt zb}, the distribution of {\tt y} is uniform from 0 to {\tt zb}.
Also
%
\begin{verbatim}
x = zb - y
\end{verbatim}
%
So the distribution of {\tt x} is also uniform from 0 to {\tt zb}.

Figure~\ref{fig.redline2} shows the distribution of {\tt z}, {\tt zb},
and {\tt y} based on the data I collected from the Red Line web site.

To present these distributions, I am switching from Pmfs to Cdfs.
Most people are more familiar with Pmfs, but I think Cdfs are easier
to interpret, once you get used to them.  And if you want to plot
several distributions on the same axes, Cdfs are the way to go.
\index{Cdf}
\index{cumulative distribution function}

The mean of {\tt z} is 7.8 minutes.  The mean of {\tt zb} is 8.8
minutes, about 13\% higher.  The mean of {\tt y} is 4.4 minutes, half
the mean of {\tt zb}.

As an aside, the Red Line schedule reports that trains run every
9 minutes during peak times.  This is close to the average of
{\tt zb}, but higher than the average of {\tt z}.  I exchanged email
with a representative of the MBTA, who confirmed that the reported
time between trains is deliberately conservative in order to
account for variability.


\section{Predicting wait times}
\label{elapsed}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
\caption{Prior and posterior of {\tt x} and predicted {\tt y}.}
\label{fig.redline3}
\end{figure}

Let's get back to the motivating question: suppose that when
I arrive at the platform I see 10 people waiting.
How long should I expect to wait until the next train arrives?

As always, let's start with the easiest version of the problem
and work our way up.
Suppose we are given the actual distribution of
{\tt z}, and we know that the passenger arrival rate,
$\lambda$, is 2 passengers per minute.

In that case we can:

\begin{enumerate}

\item Use the distribution of {\tt z} to compute
the prior distribution of {\tt zb}, the time between trains
as seen by a passenger.

\item Then we can use the number of passengers to estimate the distribution
of {\tt x}, the elapsed time since the last train.

\item Finally, we use the relation {\tt y = zb - x} to get the
distribution of {\tt y}.

\end{enumerate}

The first step is to create a {\tt WaitTimeCalculator} that
encapsulates the distributions of {\tt zb}, {\tt x},
and {\tt y}, prior to taking into account the number of
passengers.

\begin{verbatim}
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}

\verb"pmf_z" is the given distribution of gap times.

The next step is to make an {\tt ElapsedTimeEstimator} (defined
below), which encapsulates the posterior distribution of {\tt x} and
the predictive distribution of {\tt y}.
\index{predictive distribution}

\begin{verbatim}
ete = ElapsedTimeEstimator(wtc,
                           lam=2.0/60,
                           num_passengers=15)
\end{verbatim}

The parameters are the {\tt WaitTimeCalculator}, the passenger
arrival rate, {\tt lam} (expressed in passengers per second),
and the observed number of passengers, let's say 15.

Here is the definition of {\tt ElapsedTimeEstimator}:

\begin{verbatim}
class ElapsedTimeEstimator(object):

    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)

        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))

        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{verbatim}

\verb"prior_x" and \verb"post_x" are the prior and
posterior distributions of elapsed time.
\verb"pmf_y" is
the predictive distribution of wait time.

{\tt ElapsedTimeEstimator} uses {\tt Elapsed} and {\tt PredictWaitTime},
defined below.

{\tt Elapsed} is a Suite that represents the hypothetical
distribution of {\tt x}.  The prior distribution of {\tt x}
comes straight from the {\tt WaitTimeCalculator}.  Then we
use the data, which consists of the arrival rate, {\tt lam},
and the number of passengers on the platform, to compute
the posterior distribution.

Here's the definition of {\tt Elapsed}:

\begin{verbatim}
class Elapsed(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{verbatim}

As always, {\tt Likelihood} takes a hypothesis and data, and
computes the likelihood of the data under the hypothesis.
In this case {\tt hypo} is the elapsed time since the last train
and {\tt data} is a tuple of {\tt lam} and the number of
passengers.
\index{likelihood}

The likelihood of the data is the probability of getting
{\tt k} arrivals in {\tt x} time, given arrival rate
{\tt lam}.  We compute that using the PMF of the Poisson
distribution.
\index{Poisson distribution}

Finally, here's the definition of {\tt PredictWaitTime}:

\begin{verbatim}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{verbatim}

\verb"pmf_zb" is the distribution of gaps between trains;
\verb"pmf_x" is the distribution of elapsed time, based on
the observed number of passengers.
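The first line of {\tt PredictWaitTime} subtracts one distribution from another, not one number from another.  As a sketch of what that means, here is the same operation on plain dicts (a hypothetical helper, not part of {\tt thinkbayes}):

```python
def sub_pmfs(pmf1, pmf2):
    # Distribution of (v1 - v2): enumerate all pairs of values
    # and accumulate the product of their probabilities.
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            diff = v1 - v2
            result[diff] = result.get(diff, 0.0) + p1 * p2
    return result

# Toy example: gaps of 5 or 10 minutes, minus elapsed time of 0 or 5.
pmf_zb = {5: 1.0 / 3, 10: 2.0 / 3}
pmf_x = {0: 0.5, 5: 0.5}
sub_pmfs(pmf_zb, pmf_x)   # {5: 1/2, 0: 1/6, 10: 1/3}
```

The real \verb"Pmf.__sub__" does the same enumeration over pairs of values.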
Since {\tt y = zb - x},
we can compute

\begin{verbatim}
pmf_y = pmf_zb - pmf_x
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
all pairs of {\tt zb} and {\tt x}, computes the differences, and adds
the results to \verb"pmf_y".

The resulting Pmf includes some negative values, which we know are
impossible.  For example, if you arrive during a gap of 5 minutes, you
can't wait more than 5 minutes.  {\tt RemoveNegatives} removes the
impossible values from the distribution and renormalizes.

\begin{verbatim}
def RemoveNegatives(pmf):
    for val in pmf.Values():
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{verbatim}

Figure~\ref{fig.redline3} shows the results.  The prior distribution
of {\tt x} is the same as the distribution of {\tt y} in
Figure~\ref{fig.redline2}.  The posterior distribution of {\tt x}
shows that, after seeing 15 passengers on the platform, we believe
that the time since the last train is probably 5-10 minutes.  The
predictive distribution of {\tt y} indicates that we expect the next
train in less than 5 minutes, with about 80\% confidence.
\index{predictive distribution}


\section{Estimating the arrival rate}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
\caption{Prior and posterior distributions of {\tt lam} based
on five days of passenger data.}
\label{fig.redline1}
\end{figure}

The analysis so far has been based on the assumption that we know (1)
the distribution of gaps and (2) the passenger arrival rate.  Now we
are ready to relax the second assumption.

Suppose that you just moved to Boston, so you don't know much about
the passenger arrival rate on the Red Line.  After a few days of
commuting, you could make a guess, at least qualitatively.
With5024a little more effort, you could estimate $\lambda$ quantitatively.5025\index{arrival rate}50265027Each day when you arrive at the platform, you should note the5028time and the number of passengers waiting (if the platform is too5029big, you could choose a sample area). Then you should record your5030wait time and the5031number of new arrivals while you are waiting.50325033After five days, you might have data like this:5034%5035\begin{verbatim}5036k1 y k25037-- --- --503817 4.6 9503922 1.0 0504023 1.4 4504118 5.4 1250424 5.8 115043\end{verbatim}5044%5045where {\tt k1} is the number of passengers waiting when you arrive,5046{\tt y} is your wait time in minutes, and {\tt k2} is the number of5047passengers who arrive while you are waiting.50485049Over the course of one week, you waited 18 minutes and saw 365050passengers arrive, so you would estimate that the arrival rate is50512 passengers per minute. For practical purposes that estimate is5052good enough, but for the sake of completeness I5053will compute a posterior distribution for $\lambda$ and show how5054to use that distribution in the rest of the analysis.50555056{\tt ArrivalRate} is a {\tt Suite} that represents hypotheses about5057$\lambda$. As always, {\tt Likelihood} takes a hypothesis and data,5058and computes the likelihood of the data under the hypothesis.50595060In this case the hypothesis is a value of $\lambda$. The data is a5061pair, {\tt y, k}, where {\tt y} is a wait time and {\tt k} is the5062number of passengers that arrived.50635064\begin{verbatim}5065class ArrivalRate(thinkbayes.Suite):50665067def Likelihood(self, data, hypo):5068lam = hypo5069y, k = data5070like = thinkbayes.EvalPoissonPmf(k, lam * y)5071return like5072\end{verbatim}50735074This {\tt Likelihood} might look familiar; it5075is almost identical to {\tt Elapsed.Likelihood} in5076Section~\ref{elapsed}. 
The difference is that in {\tt
Elapsed.Likelihood} the hypothesis is {\tt x}, the elapsed time; in
{\tt ArrivalRate.Likelihood} the hypothesis is {\tt lam}, the arrival
rate.  But in both cases the likelihood is the probability of seeing
{\tt k} arrivals in some period of time, given {\tt lam}.

{\tt ArrivalRateEstimator} encapsulates the process of estimating
$\lambda$.  The parameter, \verb"passenger_data", is a list
of {\tt k1, y, k2} tuples, as in the table above.
\index{numpy}

\begin{verbatim}
class ArrivalRateEstimator(object):

    def __init__(self, passenger_data):
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60

        self.prior_lam = ArrivalRate(hypos)

        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{verbatim}

\verb"__init__" builds
{\tt hypos}, which is a sequence of hypothetical values for {\tt lam},
then builds the prior distribution, \verb"prior_lam".
The {\tt for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".

Figure~\ref{fig.redline1} shows
the prior and posterior distributions.  As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute.  But the spread of the posterior distribution captures our
uncertainty about $\lambda$ based on a small sample.


\section{Incorporating uncertainty}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of {\tt y} for possible values
of {\tt lam}.
}
\label{fig.redline4}
\end{figure}

Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}

\begin{enumerate}

\item Implement the analysis based on a deterministic value of the
uncertain parameter (in this case $\lambda$).

\item Compute the distribution of the uncertain parameter.

\item Run the analysis for each value of the parameter, and generate a
set of predictive distributions.
\index{predictive distribution}

\item Compute a mixture of the predictive distributions, using the
weights from the distribution of the parameter.
\index{mixture}

\end{enumerate}

We have already done steps (1) and (2).  I wrote a class
called {\tt WaitMixtureEstimator} to handle steps (3) and (4).

\begin{verbatim}
class WaitMixtureEstimator(object):

    def __init__(self, wtc, are, num_passengers=15):
        self.metapmf = thinkbayes.Pmf()

        for lam, prob in sorted(are.post_lam.Items()):
            ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
            self.metapmf.Set(ete.pmf_y, prob)

        self.mixture = thinkbayes.MakeMixture(self.metapmf)
\end{verbatim}

{\tt wtc} is the {\tt WaitTimeCalculator} that contains the
distribution of {\tt zb}.  {\tt are} is the {\tt ArrivalRateEstimator}
that contains the distribution of {\tt lam}.

The first line makes a meta-Pmf that maps from each possible
distribution of {\tt y} to its probability.  For each value
of {\tt lam}, we use {\tt ElapsedTimeEstimator} to
compute the corresponding distribution of
{\tt y} and store it in the meta-Pmf.
Then
we use {\tt MakeMixture} to compute the mixture.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}

%For purposes of comparison, I also compute the distribution of
%{\tt y} based on a single point estimate of {\tt lam}, which is
%the mean of the posterior distribution.

Figure~\ref{fig.redline4} shows the results.  The shaded lines
in the background are the distributions of {\tt y} for each value
of {\tt lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.

In this case we could get a very similar result using a single point
estimate of {\tt lam}.  So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.

In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output.  In this case, posterior variability
in {\tt lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}


\section{Decision analysis}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform.}
\label{fig.redline5}
\end{figure}

At this point we can use the number of passengers on the platform
to predict the distribution of wait times.  Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}

Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail.
Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.

In that case I would like to know the probability that {\tt y} exceeds
15 minutes as a function of \verb"num_passengers".  It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".

But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.

I only have data from one week,
and the longest delay I observed was 15 minutes.  So I can't
estimate the frequency of longer delays accurately.

However, I can use previous observations to make at least a coarse
estimate.  When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop.  So I estimate that there are about
3 major delays per year.

But remember that my observations are biased.  I am more likely
to observe long delays because they affect a large number
of passengers.  So we should treat my observations as a sample
of {\tt zb} rather than {\tt z}.  Here's how we can do that.
\index{observer bias}

During my year of commuting, I took the Red Line home about 220
times.
So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:

\begin{verbatim}
n = 220
cdf_z = thinkbayes.MakeCdfFromList(gap_times)
sample_z = cdf_z.Sample(n)
pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{verbatim}

Next I bias \verb"pmf_z" to get the distribution of
{\tt zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):

\begin{verbatim}
cdf_zp = BiasPmf(pmf_z).MakeCdf()
sample_zb = cdf_zp.Sample(n) + [1800, 2400, 3000]
\end{verbatim}

{\tt Cdf.Sample} is more efficient than {\tt Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.

Next I use the sample of {\tt zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:

\begin{verbatim}
pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
xs = MakeRange(low=60)
pmf_zb = pdf_zb.MakePmf(xs)
\end{verbatim}

Finally I unbias the distribution of {\tt zb} to get the
distribution of {\tt z}, which I use to create the
{\tt WaitTimeCalculator}:

\begin{verbatim}
pmf_z = UnbiasPmf(pmf_zb)
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}

This process is complicated, but
all of the steps are operations we have seen before.
Now we are ready to compute the probability of a long wait.

\begin{verbatim}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{verbatim}

Given the number of passengers on the platform,
{\tt ProbLongWait}
makes an {\tt ElapsedTimeEstimator},
extracts the distribution of wait time, and
computes
the probability that wait time
exceeds {\tt minutes}.

Figure~\ref{fig.redline5} shows the result.
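One helper above, {\tt UnbiasPmf}, is not listed; it inverts {\tt BiasPmf}, dividing each probability by its value instead of multiplying.  Here is a sketch of the idea, with a plain dict standing in for a Pmf (a hypothetical helper; the real version operates on {\tt Pmf} objects):

```python
def unbias(pmf):
    # Invert observer bias: values were oversampled in proportion
    # to their size, so downweight each value x by 1/x, then renormalize.
    new_pmf = {x: p / x for x, p in pmf.items()}
    total = sum(new_pmf.values())
    return {x: p / total for x, p in new_pmf.items()}

# The biased gaps from the earlier example: passengers see the
# 10 minute gap twice as often as the 5 minute gap.
biased = {5: 1.0 / 3, 10: 2.0 / 3}
actual = unbias(biased)   # recovers {5: 0.5, 10: 0.5}
```

Applying {\tt unbias} after {\tt BiasPmf} recovers the original distribution, which is the sanity check you would want here.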
When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.

If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.

Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.

\section{Discussion}

The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day.  For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions.  For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of {\tt lam} would be too low, so the
estimates of {\tt x} and {\tt y} would be too high.

If special events are as common as major delays, it would
be important to include them in the model.  We could do that by
extending the distribution of {\tt lam} to include occasional
large values.

We started with the assumption that we know the
distribution of {\tt z}.
As an alternative, a passenger could estimate {\tt z}, but it would
not be easy.
As a passenger, you observe
only your own wait time, {\tt y}.  Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, {\tt z}.

However, we could make some inferences about {\tt zb}.
If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, {\tt x}.  Then we observe
{\tt y}.  If we add the posterior distribution of {\tt x} to
the observed {\tt y}, we get a distribution that represents
our posterior belief about the observed value of {\tt zb}.

We can use this distribution to update our beliefs about the
distribution of {\tt zb}.  Finally, we can compute the
inverse of {\tt BiasPmf} to get from the distribution of {\tt zb}
to the distribution of {\tt z}.

I leave this analysis as an exercise for the
reader.  One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lambda$.  Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm.  $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm.  What is the posterior distribution of $\lambda$?

\end{quote}

You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.

\end{exercise}


\chapter{Two Dimensions}
\label{paintball}

\section{Paintball}

Paintball is a sport in which competing teams try to shoot each other
with guns that fire paint-filled pellets that break on impact, leaving
a colorful mark on the target.
It is usually played in an
arena decorated with barriers and other objects that can be
used as cover.
\index{Paintball problem}

Suppose you are playing paintball in an indoor arena 30 feet
wide and 50 feet long.  You are standing near one of the 30 foot
walls, and you suspect that one of your opponents has taken cover
nearby.  Along the wall, you see several paint spatters, all the same
color, that you think your opponent fired recently.

The spatters are at 15, 16, 18, and 21 feet, measured from the
lower-left corner of the room.  Based on these data, where do you
think your opponent is hiding?

Figure~\ref{fig.paintball} shows a diagram of the arena.  Using the
lower-left corner of the room as the origin, I denote the unknown
location of the shooter with coordinates $\alpha$ and $\beta$, or {\tt
alpha} and {\tt beta}.  The location of a spatter is labeled
{\tt x}.  The angle the opponent shoots at is $\theta$ or {\tt theta}.

The Paintball problem is a modified version
of the Lighthouse problem, a common example of Bayesian analysis.  My
notation follows the presentation of the problem in D.S.~Sivia's {\it Data
Analysis: A Bayesian Tutorial, Second Edition} (Oxford, 2006).
\index{Sivia, D.S.}

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{The suite}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball.pdf}}
\caption{Diagram of the layout for the paintball problem.}
\label{fig.paintball}
\end{figure}

To get started, we need a Suite that represents a set of hypotheses
about the location of the opponent.
Each hypothesis is a
pair of coordinates: {\tt (alpha, beta)}.

Here is the definition of the Paintball suite:

\begin{verbatim}
class Paintball(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, alphas, betas, locations):
        self.locations = locations
        pairs = [(alpha, beta)
                 for alpha in alphas
                 for beta in betas]
        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt Paintball} inherits from {\tt Suite}, which we have seen before,
and {\tt Joint}, which I will explain soon.
\index{Joint pmf}

{\tt alphas} is the list of possible values for {\tt alpha}; {\tt
betas} is the list of values for {\tt beta}.  {\tt pairs} is a list
of all {\tt (alpha, beta)} pairs.

{\tt locations} is a list of possible locations along
the wall; it is stored for use in {\tt Likelihood}.

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball2.pdf}}
\caption{Posterior CDFs for {\tt alpha} and {\tt beta}, given the data.}
\label{fig.paintball2}
\end{figure}

The room is 30 feet wide and 50 feet long, so here's the code that
creates the suite:

\begin{verbatim}
alphas = range(0, 31)
betas = range(1, 51)
locations = range(0, 31)

suite = Paintball(alphas, betas, locations)
\end{verbatim}

This prior distribution assumes that all locations in the room are
equally likely.
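Concretely, the prior contains $31 \times 50 = 1550$ pairs, each with the same probability.  A plain-Python sketch of the construction (\verb"Suite.__init__" does the equivalent bookkeeping, assuming it assigns equal weights and normalizes):

```python
alphas = range(0, 31)   # 31 possible values of alpha
betas = range(1, 51)    # 50 possible values of beta

pairs = [(alpha, beta) for alpha in alphas for beta in betas]

# Equal prior probability for every (alpha, beta) pair.
prior = dict((pair, 1.0 / len(pairs)) for pair in pairs)

n_pairs = len(pairs)        # 1550 hypotheses
p_each = prior[(15, 25)]    # 1/1550, same for every pair
```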
Given a map of the room, we might choose a more
detailed prior, but we'll start simple.


\section{Trigonometry}

Now we need a likelihood function, which means we have to figure
out the likelihood of hitting any spot along the wall, given
the location of the opponent.
\index{likelihood}

As a simple model, imagine that the opponent is like a rotating
turret, equally likely to shoot in any direction.
In that case, he is most likely to hit
the wall at location {\tt alpha}, and less likely to hit the wall far
away from {\tt alpha}.
\index{trigonometry}

With a little trigonometry, we can compute the probability of hitting
any spot along the wall.  Imagine that the shooter fires a shot at
angle $\theta$; the pellet would hit the wall at location $x$, where
%
\[ x - \alpha = \beta \tan \theta \]
%
Solving this equation for $\theta$ yields
%
\[ \theta = \tan^{-1} \left( \frac{x - \alpha}{\beta} \right) \]
%
So given a location on the wall, we can find $\theta$.

Taking the derivative of the first equation with respect to
$\theta$ yields
%
\[ \frac{dx}{d\theta} = \frac{\beta}{\cos^2 \theta} \]
%
This derivative is what I'll call the ``strafing speed'',
which is the speed of the target location along the wall as $\theta$
increases.  The probability of hitting a given point on the wall is
inversely related to strafing speed.
\index{strafing speed}

If we know the coordinates of the shooter and a location
along the wall, we can compute strafing speed:

\begin{verbatim}
def StrafingSpeed(alpha, beta, x):
    theta = math.atan2(x - alpha, beta)
    speed = beta / math.cos(theta)**2
    return speed
\end{verbatim}

{\tt alpha} and {\tt beta} are the coordinates of the shooter;
{\tt x} is the location of a spatter.
The result is
the derivative of {\tt x} with respect to {\tt theta}.

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball1.pdf}}
\caption{PMF of location given {\tt alpha=10}, for several values of
{\tt beta}.}
\label{fig.paintball1}
\end{figure}

Now we can compute a Pmf that represents the probability of hitting
any location on the wall.  {\tt MakeLocationPmf} takes {\tt alpha} and
{\tt beta}, the coordinates of the shooter, and {\tt locations}, a
list of possible values of {\tt x}.

\begin{verbatim}
def MakeLocationPmf(alpha, beta, locations):
    pmf = thinkbayes.Pmf()
    for x in locations:
        prob = 1.0 / StrafingSpeed(alpha, beta, x)
        pmf.Set(x, prob)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt MakeLocationPmf} computes the probability of hitting
each location, which is inversely related to
strafing speed.  The result is a Pmf of locations and their
probabilities.

Figure~\ref{fig.paintball1} shows the Pmf of location with {\tt alpha
= 10} and a range of values for {\tt beta}.  For all values of {\tt beta}
the most likely spatter location is {\tt x = 10}; as {\tt beta}
increases, so does the spread of the Pmf.


\section{Likelihood}

Now all we need is a likelihood function.
We can use {\tt MakeLocationPmf} to compute the likelihood
of any value of {\tt x}, given the coordinates of the opponent.
\index{likelihood}

\begin{verbatim}
def Likelihood(self, data, hypo):
    alpha, beta = hypo
    x = data
    pmf = MakeLocationPmf(alpha, beta, self.locations)
    like = pmf.Prob(x)
    return like
\end{verbatim}

Again, {\tt alpha} and {\tt beta} are the hypothetical coordinates of
the shooter, and {\tt x} is the location of an observed spatter.

{\tt pmf} contains the probability of each location, given the
coordinates of the shooter.
From this Pmf, we select the probability
of the observed location.

And we're done.  To update the suite, we can use {\tt UpdateSet},
which is inherited from {\tt Suite}.

\begin{verbatim}
suite.UpdateSet([15, 16, 18, 21])
\end{verbatim}

The result is a distribution that maps each {\tt (alpha, beta)} pair
to a posterior probability.


\section{Joint distributions}

When each value in a distribution is a tuple of variables, it is
called a {\bf joint distribution} because it represents the
distributions of the variables together, that is ``jointly''.
A joint distribution contains the distributions of the variables,
as well as information about the relationships among them.
\index{joint distribution}

Given a joint distribution, we can compute the distributions
of each variable independently, which are called the {\bf marginal
distributions}.
\index{marginal distribution}
\index{Joint}

{\tt thinkbayes.Joint} provides a method that computes marginal
distributions:

\begin{verbatim}
# class Joint:

    def Marginal(self, i):
        pmf = Pmf()
        for vs, prob in self.Items():
            pmf.Incr(vs[i], prob)
        return pmf
\end{verbatim}

{\tt i} is the index of the variable we want; in this example
{\tt i=0} indicates the distribution of {\tt alpha}, and
{\tt i=1} indicates the distribution of {\tt beta}.

Here's the code that extracts the marginal distributions:

\begin{verbatim}
marginal_alpha = suite.Marginal(0)
marginal_beta = suite.Marginal(1)
\end{verbatim}

Figure~\ref{fig.paintball2} shows the results (converted to CDFs).
The median value for {\tt alpha} is 18, near the center of mass of
the observed spatters.
For {\tt beta}, the most likely values are
close to the wall, but beyond 10 feet the distribution is almost
uniform, which indicates that the data do not distinguish strongly
between these possible locations.

Given the posterior marginals, we can compute credible intervals
for each coordinate independently:
\index{credible interval}

\begin{verbatim}
print 'alpha CI', marginal_alpha.CredibleInterval(50)
print 'beta CI', marginal_beta.CredibleInterval(50)
\end{verbatim}

The 50\% credible intervals are {\tt (14, 21)} for {\tt alpha} and
{\tt (5, 31)} for {\tt beta}.  So the data provide evidence that the
shooter is in the near side of the room.  But it is not strong
evidence.  The 90\% credible intervals cover most of the room!
\index{evidence}


\section{Conditional distributions}
\label{conditional}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball3.pdf}}
\caption{Posterior distributions for {\tt alpha} conditioned on several values
of {\tt beta}.}
\label{fig.paintball3}
\end{figure}

The marginal distributions contain information about the variables
independently, but they do not capture the dependence between
variables, if any.
\index{independence}
\index{dependence}

One way to visualize dependence is by computing {\bf conditional
distributions}.
{\tt thinkbayes.Joint} provides a method that
does that:
\index{conditional distribution}
\index{Joint}

\begin{verbatim}
    def Conditional(self, i, j, val):
        pmf = Pmf()
        for vs, prob in self.Items():
            if vs[j] != val: continue
            pmf.Incr(vs[i], prob)

        pmf.Normalize()
        return pmf
\end{verbatim}

Again, {\tt i} is the index of the variable we want; {\tt j}
is the index of the conditioning variable, and {\tt val} is the
conditioning value.

The result is the distribution of the $i$th variable under the
condition that the $j$th variable is {\tt val}.

For example, the following code computes the conditional distributions
of {\tt alpha} for a range of values of {\tt beta}:

\begin{verbatim}
betas = [10, 20, 40]

for beta in betas:
    cond = suite.Conditional(0, 1, beta)
\end{verbatim}

Figure~\ref{fig.paintball3} shows the results, which we could
fully describe as ``posterior conditional marginal distributions.''
Whew!

If the variables were independent, the conditional distributions would
all be the same. Since they are all different, we can tell the
variables are dependent. For example, if we know (somehow) that
{\tt beta = 10}, the conditional distribution of {\tt alpha} is fairly
narrow. For larger values of {\tt beta}, the distribution of
{\tt alpha} is wider.
\index{dependence}
\index{independence}


\section{Credible intervals}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball5.pdf}}
\caption{Credible intervals for the coordinates of the opponent.}
\label{fig.paintball5}
\end{figure}

Another way to visualize the posterior joint distribution is to
compute credible intervals. When we looked at credible intervals
in Section~\ref{credible},
I skipped over a subtle point: for a given distribution, there
are many intervals with the same level of credibility.
For example,
if you want a 50\% credible interval, you could choose any set of
values whose probability adds up to 50\%.

When the values are one-dimensional, it is most common to choose
the {\bf central credible interval}; for example, the central 50\%
credible interval contains all values between the 25th and 75th
percentiles.
\index{central credible interval}

In multiple dimensions it is less obvious what the right credible
interval should be. The best choice might depend on context, but
one common choice is the maximum likelihood credible interval, which
contains the most likely values that add up to 50\% (or some other
percentage).
\index{maximum likelihood}

{\tt thinkbayes.Joint} provides a method that computes maximum
likelihood credible intervals.
\index{Joint}

\begin{verbatim}
# class Joint:

    def MaxLikeInterval(self, percentage=90):
        interval = []
        total = 0

        t = [(prob, val) for val, prob in self.Items()]
        t.sort(reverse=True)

        for prob, val in t:
            interval.append(val)
            total += prob
            if total >= percentage/100.0:
                break

        return interval
\end{verbatim}

The first step is to make a list of the values in the suite,
sorted in descending order by probability. Next we traverse the
list, adding each value to the interval, until the total
probability exceeds {\tt percentage}. The result is a list
of values from the suite.
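Here is a standalone sketch of the same greedy selection, using a plain dictionary with made-up probabilities rather than the paintball posterior:

```python
def max_like_interval(pmf, percentage=90):
    """Accumulate the most likely values until their total probability
    reaches the requested percentage."""
    interval = []
    total = 0.0
    for val, prob in sorted(pmf.items(), key=lambda pair: pair[1], reverse=True):
        interval.append(val)
        total += prob
        if total >= percentage / 100.0:
            break
    return interval

pmf = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
print(max_like_interval(pmf, 50))   # ['a', 'b']
```

With {\tt percentage=50}, the two most likely values already cover 70\% of the probability, so the interval stops there; a larger percentage pulls in more values.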
Notice that this set of values
is not necessarily contiguous.

To visualize the intervals, I wrote a function that ``colors''
each value according to how many intervals it appears in:

\begin{verbatim}
def MakeCrediblePlot(suite):
    d = dict((pair, 0) for pair in suite.Values())

    percentages = [75, 50, 25]
    for p in percentages:
        interval = suite.MaxLikeInterval(p)
        for pair in interval:
            d[pair] += 1

    return d
\end{verbatim}

{\tt d} is a dictionary that maps from each value in the suite
to the number of intervals it appears in. The loop computes intervals
for several percentages and modifies {\tt d}.

Figure~\ref{fig.paintball5} shows the result. The 25\% credible
interval is the darkest region near the bottom wall. For higher
percentages, the credible interval is bigger, of course, and skewed
toward the right side of the room.


\section{Discussion}

This chapter shows that the Bayesian framework from the previous
chapters can be extended to handle a two-dimensional parameter space.
The only difference is that each hypothesis is represented by
a tuple of parameters.

I also presented {\tt Joint}, which is a parent class that provides
methods that apply to joint distributions:
{\tt Marginal}, {\tt Conditional}, and {\tt MaxLikeInterval}.
In object-oriented terms,
{\tt Joint} is a mixin (see \url{http://en.wikipedia.org/wiki/Mixin}).
\index{Joint}

There is a lot of new vocabulary in this chapter, so let's review:

\begin{description}

\item[Joint distribution:] A distribution that represents all possible
values in a multidimensional space and their probabilities. The
example in this chapter is a two-dimensional space made up of the
coordinates {\tt alpha} and {\tt beta}.
The joint distribution
represents the probability of each ({\tt alpha}, {\tt beta}) pair.

\item[Marginal distribution:] The distribution of one parameter in a
joint distribution, treating the other parameters as unknown. For
example, Figure~\ref{fig.paintball2} shows the distributions of
{\tt alpha} and {\tt beta} independently.

\item[Conditional distribution:] The distribution of one parameter in
a joint distribution, conditioned on one or more of the other
parameters. Figure~\ref{fig.paintball3} shows several distributions for
{\tt alpha}, conditioned on different values of {\tt beta}.

\end{description}

Given the joint distribution, you can compute marginal and conditional
distributions. With enough conditional distributions, you could
re-create the joint distribution, at least approximately. But given
the marginal distributions you cannot re-create the joint distribution
because you have lost information about the dependence between
variables.
\index{joint distribution}
\index{conditional distribution}
\index{marginal distribution}

If there are $n$ possible values for each of two parameters, most
operations on the joint distribution take time proportional to $n^2$.
If there are $d$ parameters, run time is proportional to $n^d$,
which quickly becomes impractical as the number of dimensions increases.

If you can process a million hypotheses in a reasonable amount of time,
you could handle two dimensions with 1000 values for each parameter,
or three dimensions with 100 values each, or six dimensions with 10
values each.

If you need more dimensions, or more values per dimension, there are
optimizations you can try.
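The back-of-the-envelope arithmetic behind those grid sizes is easy to check:

```python
# with a budget of a million hypotheses, a grid with n values per
# parameter and d parameters must satisfy n**d <= budget
budget = 10 ** 6

print(1000 ** 2 == budget)   # two dimensions, 1000 values each
print(100 ** 3 == budget)    # three dimensions, 100 values each
print(10 ** 6 == budget)     # six dimensions, 10 values each
```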
I present an example of these optimizations
in Chapter~\ref{species}.

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
In our simple model, the opponent is equally likely to shoot in any
direction. As an exercise, let's consider improvements to this model.

The analysis in this chapter suggests that a shooter is most likely to
hit the closest wall. But in reality, if the opponent is close to a
wall, he is unlikely to shoot at the wall because he is unlikely to
see a target between himself and the wall.

Design an improved model that takes this behavior
into account. Try to find a model that is more realistic, but not
too complicated.
\end{exercise}


\chapter{Approximate Bayesian Computation}

\section{The Variability Hypothesis}

I have a soft spot for crank science. Recently I visited Norumbega
Tower, which is an enduring monument to the crackpot theories of Eben
Norton Horsford, inventor of double-acting baking powder and fake
history. But that's not what this chapter is about.
\index{crank science}
\index{Horsford, Eben Norton}

This chapter is about the Variability Hypothesis, which
\index{Variability Hypothesis}
\index{Meckel, Johann}

\begin{quote}
``originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men.
Because he
considered males to be the `superior animal,' Meckel concluded that
females' lack of variation was a sign of inferiority.''

From \url{http://en.wikipedia.org/wiki/Variability_hypothesis}.
\end{quote}

I particularly like that last part, because I suspect that if it turned
out that women are actually more variable, Meckel would take that as a
sign of inferiority, too. Anyway, you will not be surprised to hear
that the evidence for the Variability Hypothesis is weak.
\index{evidence}

Nevertheless, it came up in my class recently when we looked at data
from the CDC's Behavioral Risk Factor Surveillance System (BRFSS),
specifically the self-reported heights of adult American men and women.
The dataset includes responses from 154407 men and 254722 women.
Here's what we found:
\index{Centers for Disease Control}
\index{CDC}
\index{BRFSS}
\index{Behavioral Risk Factor Surveillance System}

\begin{itemize}

\item The average height for men is 178 cm; the average height for
women is 163 cm. So men are taller, on average. No surprise there.

\item For men the standard deviation is 7.7 cm; for women it is 7.3
cm. So in absolute terms, men's heights are more variable.

\item But to compare variability between groups, it is more meaningful
to use the coefficient of variation (CV), which is the standard
deviation divided by the mean. It is a dimensionless measure of
variability relative to scale. For men CV is 0.0433; for women it
is 0.0444.
\index{coefficient of variation}

\end{itemize}

That's very close, so we could conclude that this dataset provides
weak evidence against the Variability Hypothesis. But we can use
Bayesian methods to make that conclusion more precise.
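A quick check of the arithmetic, using the rounded summary statistics above; the values in the text come from the full dataset, so the match is only approximate:

```python
def coef_variation(mean, std):
    """Coefficient of variation: standard deviation relative to the mean."""
    return std / mean

cv_men = coef_variation(178, 7.7)
cv_women = coef_variation(163, 7.3)

print(round(cv_men, 4), round(cv_women, 4))   # roughly 0.0433 and 0.0448
print(cv_men < cv_women)                      # women have the slightly higher CV
```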
And answering
this question gives me a chance to demonstrate some techniques
for working with large datasets.
\index{height}

I will proceed in a few steps:

\begin{enumerate}

\item We'll start with the simplest implementation, but it only works
for datasets smaller than 1000 values.

\item By computing probabilities under a log transform, we can scale
up to the full size of the dataset, but the computation gets slow.

\item Finally, we speed things up substantially with Approximate
Bayesian Computation, also known as ABC.

\end{enumerate}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Mean and standard deviation}

In Chapter~\ref{paintball} we estimated two parameters simultaneously
using a joint distribution. In this chapter we use the same
method to estimate the parameters of a Gaussian distribution:
the mean, {\tt mu}, and the standard deviation, {\tt sigma}.
\index{Gaussian distribution}

For this problem, I define a Suite called {\tt Height} that
represents a map from each {\tt mu, sigma} pair to its probability:

\begin{verbatim}
class Height(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, mus, sigmas):
        pairs = [(mu, sigma)
                 for mu in mus
                 for sigma in sigmas]

        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt mus} is a sequence of possible values for {\tt mu}; {\tt sigmas}
is a sequence of values for {\tt sigma}. The prior distribution
is uniform over all {\tt mu, sigma} pairs.
\index{Joint}
\index{joint distribution}

The likelihood function is easy. Given hypothetical values
of {\tt mu} and {\tt sigma}, we compute the likelihood
of a particular value, {\tt x}.
That's what {\tt EvalGaussianPdf}
does, so all we have to do is use it:
\index{likelihood}

\begin{verbatim}
# class Height

    def Likelihood(self, data, hypo):
        x = data
        mu, sigma = hypo
        like = thinkbayes.EvalGaussianPdf(x, mu, sigma)
        return like
\end{verbatim}

If you have studied statistics from a mathematical perspective,
you know that when you evaluate a PDF, you get a probability
density. In order to get a probability, you have to integrate
probability densities over some range.
\index{density}

But for our purposes, we don't need a probability; we just
need something proportional to the probability we want.
A probability density does that job nicely.

The hardest part of this problem turns
out to be choosing appropriate ranges for {\tt mus} and
{\tt sigmas}. If the range is too small, we omit some
possibilities with non-negligible probability and get the
wrong answer. If the range is too big, we get the right answer,
but waste computational power.

So this is an opportunity to use classical estimation to
make Bayesian techniques more efficient.
Specifically, we can use
classical estimators to find a likely location for {\tt mu} and
{\tt sigma}, and use the standard errors of those estimates to choose a
likely spread.
\index{classical estimation}

If the true parameters of the distribution are $\mu$ and $\sigma$, and
we take a sample of $n$ values, an estimator of $\mu$ is the sample
mean, {\tt m}.

And an estimator of $\sigma$ is the sample standard
deviation, {\tt s}.

The standard error of the estimated $\mu$ is $s / \sqrt{n}$
and the standard error of the estimated $\sigma$ is
$s / \sqrt{2 (n-1)}$.

Here's the code to compute all that:

\begin{verbatim}
def FindPriorRanges(xs, num_points, num_stderrs=3.0):

    # compute m and s
    n = len(xs)
    m = numpy.mean(xs)
    s = numpy.std(xs)

    # compute ranges for m and s
    stderr_m = s / math.sqrt(n)
    mus = MakeRange(m, stderr_m, num_stderrs)

    stderr_s = s / math.sqrt(2 * (n-1))
    sigmas = MakeRange(s, stderr_s, num_stderrs)

    return mus, sigmas
\end{verbatim}

{\tt xs} is the dataset. \verb"num_points" is the desired number of
values in the range.
\verb"num_stderrs" is the width of the range on
each side of the estimate, in number of standard errors.

The return
value is a pair of sequences, {\tt mus} and {\tt sigmas}.

Here's {\tt MakeRange}:
\index{numpy}

\begin{verbatim}
def MakeRange(estimate, stderr, num_stderrs):
    spread = stderr * num_stderrs
    array = numpy.linspace(estimate-spread,
                           estimate+spread,
                           num_points)
    return array
\end{verbatim}

{\tt numpy.linspace} makes an array of equally spaced elements between
{\tt estimate-spread} and {\tt estimate+spread}, including both.
\index{linspace}


\section{Update}

Finally here's the code to make and update the suite:

\begin{verbatim}
mus, sigmas = FindPriorRanges(xs, num_points)
suite = Height(mus, sigmas)
suite.UpdateSet(xs)
print suite.MaximumLikelihood()
\end{verbatim}

This process might seem bogus, because we use the data to choose the
range of the prior distribution, and then use the data again to do the
update. In general, using the same data twice is, in fact, bogus.
\index{bogus}
\index{maximum likelihood}

But in this case it is ok. Really. We use the data to choose the
range for the prior, but only to avoid computing a lot of
probabilities that would have been very small anyway. With
\verb"num_stderrs=4", the range is big enough to cover all values with
non-negligible likelihood.
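Here is a self-contained sketch of the range-finding step using only the standard library; the sample is made up, and the names {\tt find_prior_ranges} and {\tt make_range} are mine, not the book's:

```python
import math
import statistics

def find_prior_ranges(xs, num_points=11, num_stderrs=4.0):
    """Center the grids on the classical estimates and make them
    num_stderrs standard errors wide on each side."""
    n = len(xs)
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)

    def make_range(estimate, stderr):
        spread = stderr * num_stderrs
        low, high = estimate - spread, estimate + spread
        step = (high - low) / (num_points - 1)
        return [low + i * step for i in range(num_points)]

    mus = make_range(m, s / math.sqrt(n))
    sigmas = make_range(s, s / math.sqrt(2 * (n - 1)))
    return mus, sigmas

sample = [170.0, 180.0, 175.0, 165.0, 185.0] * 20
mus, sigmas = find_prior_ranges(sample)
print(mus[0] < 175.0 < mus[-1])   # True: the grid brackets the sample mean
```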
Making the range bigger than that has no effect
on the results.

In effect, the prior is uniform over all values
of {\tt mu} and {\tt sigma}, but for computational efficiency
we ignore all the values that don't matter.

\section{The posterior distribution of CV}

Once we have the posterior joint distribution of {\tt mu} and
{\tt sigma}, we can compute the distribution of CV for men and women, and
then the probability that one exceeds the other.

To compute the distribution of CV, we enumerate pairs of
{\tt mu} and {\tt sigma}:

\begin{verbatim}
def CoefVariation(suite):
    pmf = thinkbayes.Pmf()
    for (mu, sigma), p in suite.Items():
        pmf.Incr(sigma/mu, p)
    return pmf
\end{verbatim}

Then we use \verb"thinkbayes.PmfProbGreater" to compute the
probability that men are more variable.

The analysis itself is simple, but there are two more issues we
have to deal with:

\begin{enumerate}

\item As the size of the dataset increases, we run into a series of
computational problems due to the limitations of floating-point
arithmetic.

\item The dataset contains a number of extreme values that are almost
certainly errors.
We will need to make the estimation process
robust in the presence of these outliers.

\end{enumerate}

The following sections explain these problems and their solutions.


\section{Underflow}
\label{underflow}

If we select the first 100 values from the BRFSS dataset and run the
analysis I just described, it runs without errors and we get posterior
distributions that look reasonable.

If we select the first 1000 values and run the program again, we get
an error in \verb"Pmf.Normalize":

\begin{verbatim}
ValueError: total probability is zero.
\end{verbatim}

The problem is that we are using probability densities to compute
likelihoods, and densities from continuous distributions tend to be
small. And if you take 1000 small values and multiply
them together, the result is very small. In this case it is so small
it can't be represented by a floating-point number, so it gets rounded
down to zero, which is called {\bf underflow}. And if all
probabilities in the distribution are 0, it's not a distribution any
more.
\index{underflow}

A possible solution is to renormalize the Pmf after each update,
or after each batch of 100. That would work, but it would be slow.

A better alternative is to compute likelihoods under a log
transform. That way, instead of multiplying small values, we can add
up log likelihoods. {\tt Pmf} provides methods {\tt Log},
{\tt LogUpdateSet} and {\tt Exp} to make this process easy.
\index{logarithm}
\index{log transform}

{\tt Log} computes the log of the probabilities in a Pmf:

\begin{verbatim}
# class Pmf

    def Log(self):
        m = self.MaxLike()
        for x, p in self.d.iteritems():
            if p:
                self.Set(x, math.log(p/m))
            else:
                self.Remove(x)
\end{verbatim}

Before applying the log transform {\tt Log} uses {\tt MaxLike} to find
{\tt m}, the highest probability in the Pmf.
It divides all
probabilities by {\tt m}, so the highest probability gets normalized
to 1, which yields a log of 0. The other log probabilities are all
negative. If there are any values in the Pmf with probability 0, they
are removed.

While the Pmf is under a log transform, we can't use {\tt Update},
{\tt UpdateSet}, or {\tt Normalize}. The result would be nonsensical;
if you try, Pmf raises an exception.
Instead, we have to use {\tt LogUpdate}
and {\tt LogUpdateSet}.
\index{exception}

Here's the implementation of {\tt LogUpdateSet}:

\begin{verbatim}
# class Suite

    def LogUpdateSet(self, dataset):
        for data in dataset:
            self.LogUpdate(data)
\end{verbatim}

{\tt LogUpdateSet} loops through the data and calls {\tt LogUpdate}:

\begin{verbatim}
# class Suite

    def LogUpdate(self, data):
        for hypo in self.Values():
            like = self.LogLikelihood(data, hypo)
            self.Incr(hypo, like)
\end{verbatim}

{\tt LogUpdate} is just like {\tt Update} except that it calls
{\tt LogLikelihood} instead of {\tt Likelihood}, and {\tt Incr}
instead of {\tt Mult}.

Using log-likelihoods avoids the problem with underflow, but while
the Pmf is under the log transform, there's not much we can do with
it. We have to use {\tt Exp} to invert the transform:

\begin{verbatim}
# class Pmf

    def Exp(self):
        m = self.MaxLike()
        for x, p in self.d.iteritems():
            self.Set(x, math.exp(p-m))
\end{verbatim}

If the log-likelihoods are large negative numbers, the resulting
likelihoods might underflow. So {\tt Exp} finds the maximum
log-likelihood, {\tt m}, and shifts all the likelihoods up by {\tt m}.
The resulting distribution has a maximum likelihood of 1.
This
process inverts the log transform with minimal loss of precision.
\index{maximum likelihood}


\section{Log-likelihood}

Now all we need is {\tt LogLikelihood}.

\begin{verbatim}
# class Height

    def LogLikelihood(self, data, hypo):
        x = data
        mu, sigma = hypo
        loglike = scipy.stats.norm.logpdf(x, mu, sigma)
        return loglike
\end{verbatim}

{\tt norm.logpdf} computes the log-likelihood of the
Gaussian PDF.
\index{scipy}
\index{log-likelihood}

Here's what the whole update process looks like:

\begin{verbatim}
suite.Log()
suite.LogUpdateSet(xs)
suite.Exp()
suite.Normalize()
\end{verbatim}

To review, {\tt Log} puts the suite under a log transform.
{\tt LogUpdateSet} calls {\tt LogUpdate}, which calls
{\tt LogLikelihood}. {\tt LogUpdate} uses {\tt Pmf.Incr},
because adding a log-likelihood is the same as multiplying
by a likelihood.

After the update, the log-likelihoods are large negative
numbers, so {\tt Exp} shifts them up before inverting the
transform, which is how we avoid underflow.

Once the suite is transformed back, the probabilities
are ``linear'' again, which means ``not logarithmic'',
so we can use {\tt Normalize} again.

Using this algorithm, we can process the entire dataset without
underflow, but it is still slow. On my computer it might
take an hour. We can do better.


\section{A little optimization}

This section uses math and computational optimization
to speed things up by a factor of 100. But the following section
presents an algorithm that is even faster. So if you want to
get right to the good stuff, feel free to skip this section.
\index{optimization}

{\tt Suite.LogUpdateSet} calls {\tt LogUpdate} once for each data
point.
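The failure mode, and the fix, can be demonstrated in a few lines; the density value is made up, but it is the right order of magnitude for a Gaussian PDF evaluated far from its mode:

```python
import math

# multiplying a hundred small densities underflows to zero...
densities = [1e-4] * 100
product = 1.0
for d in densities:
    product *= d
print(product)        # 0.0: 1e-400 is below the smallest positive float

# ...but summing log densities is perfectly stable
log_total = sum(math.log(d) for d in densities)
print(log_total)      # about -921, easily representable
```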
We can speed up {\tt LogUpdateSet} by computing the log-likelihood of the entire
dataset at once.

We'll start with the Gaussian PDF:
%
\[ \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \]
%
and compute the log (dropping the constant term):
%
\[ -\log \sigma -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \]
%
Given a sequence of values, $x_i$, the total log-likelihood is
%
\[ \sum_i -\log \sigma - \frac{1}{2} \left( \frac{x_i-\mu}{\sigma} \right)^2 \]
%
Pulling out the terms that don't depend on $i$, we get
%
\[ -n \log \sigma - \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2 \]
%
which we can translate into Python:

\begin{verbatim}
# class Height

    def LogUpdateSetFast(self, data):
        xs = tuple(data)
        n = len(xs)

        for hypo in self.Values():
            mu, sigma = hypo
            total = Summation(xs, mu)
            loglike = -n * math.log(sigma) - total / 2 / sigma**2
            self.Incr(hypo, loglike)
\end{verbatim}

By itself, this would be a small improvement, but it
creates an opportunity for a bigger one. Notice that the
summation only depends on {\tt mu}, not {\tt sigma}, so we only
have to compute it once for each value of {\tt mu}.
\index{optimization}

To avoid recomputing, I factor out a function that computes the
summation, and {\bf memoize} it so it stores previously computed
results in a dictionary (see
\url{http://en.wikipedia.org/wiki/Memoization}):
\index{memoization}

\begin{verbatim}
def Summation(xs, mu, cache={}):
    try:
        return cache[xs, mu]
    except KeyError:
        ds = [(x-mu)**2 for x in xs]
        total = sum(ds)
        cache[xs, mu] = total
        return total
\end{verbatim}

{\tt cache} stores previously computed sums.
The {\tt try} statement in {\tt Summation}
returns a result from the cache if possible; otherwise it computes
the summation, then caches and returns the result.
\index{cache}

The only catch is that we can't use a list as a key in the cache, because
it is not a hashable type. That's why {\tt LogUpdateSetFast} converts
the dataset to a tuple.

This optimization speeds up the computation by about a
factor of 100, processing the entire dataset (154~407 men and 254~722
women) in less than a minute on my not-very-fast computer.


\section{ABC}

But maybe you don't have that kind of time. In that case, Approximate
Bayesian Computation (ABC) might be the way to go. The motivation
behind ABC is that the likelihood of any particular dataset is:
\index{ABC}
\index{Approximate Bayesian Computation}

\begin{enumerate}

\item Very small, especially for large datasets, which is why we had
to use the log transform,

\item Expensive to compute, which is why we had to do so much
optimization, and

\item Not really what we want anyway.

\end{enumerate}

We don't really care about the likelihood of seeing the exact dataset
we saw. Especially for continuous variables, we care about the
likelihood of seeing any dataset like the one we saw.

For example, in the Euro problem, we don't care about the order of
the coin flips, only the total number of heads and tails. And in
the locomotive problem, we don't care about which particular trains were
seen, only the number of trains and the maximum of the serial numbers.
\index{locomotive problem}
\index{Euro problem}

Similarly, in the BRFSS sample, we don't really want to know the
probability of seeing one particular set of values (especially since
there are hundreds of thousands of them).
It is more
relevant to ask, ``If we sample 100,000 people from a population
with hypothetical values of $\mu$ and $\sigma$, what would be
the chance of collecting a sample with the observed mean and
variance?''
\index{BRFSS}

For samples from a Gaussian distribution, we can answer this question
efficiently because we can find the distribution of the sample
statistics analytically. In fact, we already did it when we computed
the range of the prior.
\index{Gaussian distribution}

If you draw $n$ values from a Gaussian distribution with parameters
$\mu$ and $\sigma$, and compute the sample mean, $m$, the
distribution of $m$ is Gaussian
with parameters $\mu$ and $\sigma / \sqrt{n}$.

Similarly, the distribution of the sample standard deviation, $s$, is
Gaussian with parameters $\sigma$ and $\sigma / \sqrt{2 (n-1)}$.
\index{sample statistics}

We can use these sample distributions to compute the likelihood of the
sample statistics, $m$ and $s$, given hypothetical values
for $\mu$ and $\sigma$.
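These sampling distributions are easy to spot-check by simulation; the parameters and sample sizes below are made up for the sketch:

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n = 163.0, 7.3, 100

# draw many samples of size n and collect the sample means
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(2000)]

# the spread of the sample means should be close to sigma / sqrt(n)
print(statistics.pstdev(means))   # close to 0.73
print(sigma / math.sqrt(n))       # 0.73
```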
Here's a new version of \verb"LogUpdateSet"
that does it:

\begin{verbatim}
def LogUpdateSetABC(self, data):
    xs = data
    n = len(xs)

    # compute sample statistics
    m = numpy.mean(xs)
    s = numpy.std(xs)

    for hypo in sorted(self.Values()):
        mu, sigma = hypo

        # compute log likelihood of m, given hypo
        stderr_m = sigma / math.sqrt(n)
        loglike = EvalGaussianLogPdf(m, mu, stderr_m)

        # compute log likelihood of s, given hypo
        stderr_s = sigma / math.sqrt(2 * (n-1))
        loglike += EvalGaussianLogPdf(s, sigma, stderr_s)

        self.Incr(hypo, loglike)
\end{verbatim}

On my computer this function processes the entire dataset in about a
second, and the result agrees with the exact result with about 5
digits of precision.


\section{Robust estimation}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_male.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for men in the U.S.}
\label{fig.variability1}
\end{figure}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_female.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for women in the U.S.}
\label{fig.variability2}
\end{figure}

We are almost ready to look at results, but we have one more
problem to deal with. There are a number of outliers in this
dataset that are almost certainly errors.
For example, there
are three adults with reported height of 61 cm, which would
place them among the shortest living adults in the world.
At the other end, there are four women with reported height
229 cm, just short of the tallest women in the world.

It is not impossible that these values are correct, but it is
unlikely, which makes it hard to know how to deal with them.
And we have to get
it right, because these extreme values have a disproportionate
effect on the estimated variability.

Because ABC is based on summary statistics, rather than the entire
dataset, we can make it more robust by choosing summary statistics
that are robust in the presence of outliers. For example, rather
than use the sample mean and standard deviation, we could use the median
and inter-quartile range
(IQR), which is the difference between the 25th and 75th percentiles.
\index{summary statistic}
\index{robust estimation}
\index{inter-quartile range}
\index{IQR}

More generally, we could compute an inter-percentile range (IPR) that
spans any given fraction of the distribution, {\tt p}:

\begin{verbatim}
def MedianIPR(xs, p):
    cdf = thinkbayes.MakeCdfFromList(xs)
    median = cdf.Percentile(50)

    alpha = (1-p) / 2
    ipr = cdf.Value(1-alpha) - cdf.Value(alpha)
    return median, ipr
\end{verbatim}

{\tt xs} is a sequence of values. {\tt p} is the desired range;
for example, {\tt p=0.5} yields the inter-quartile range.

{\tt MedianIPR} works by computing the CDF of {\tt xs},
then extracting the median and the difference between two
percentiles.

We can convert from {\tt ipr} to an estimate of {\tt sigma} using the
Gaussian CDF to compute the fraction of the distribution covered by a
given number of standard deviations.
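A quick demonstration of why the median resists outliers while the mean does not; the sample and the 61 cm outlier are invented for illustration:

```python
import statistics

heights = [160, 165, 170, 175, 180] * 20    # a clean made-up sample
corrupted = heights + [61]                  # add one impossible value

# the mean moves noticeably...
print(statistics.mean(heights))     # 170
print(statistics.mean(corrupted))   # about 168.9

# ...but the median does not move at all
print(statistics.median(heights))   # 170.0
print(statistics.median(corrupted)) # 170
```

Percentile-based spread measures like the IPR resist outliers for the same reason: a handful of extreme values cannot move the 25th or 75th percentile.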
For example, it is a well-known
rule of thumb that 68\% of a Gaussian distribution falls within one
standard deviation of the mean, which leaves 16\% in each tail. If we
compute the range between the 16th and 84th percentiles, we expect the
result to be {\tt 2 * sigma}. So we can estimate {\tt sigma} by
computing the 68\% IPR and dividing by 2.
\index{Gaussian distribution}

More generally we could use any number of {\tt sigmas}.
{\tt MedianS} performs the more general version of this
computation:

\begin{verbatim}
def MedianS(xs, num_sigmas):
    half_p = thinkbayes.StandardGaussianCdf(num_sigmas) - 0.5

    median, ipr = MedianIPR(xs, half_p * 2)
    s = ipr / 2 / num_sigmas

    return median, s
\end{verbatim}

Again, {\tt xs} is the sequence of values; \verb"num_sigmas" is the
number of standard deviations the results should be based on. The
result is {\tt median}, which estimates $\mu$, and {\tt s}, which
estimates $\sigma$.

Finally, in {\tt LogUpdateSetABC} we can replace the sample mean and
standard deviation with {\tt median} and {\tt s}. And that pretty
much does it.

It might seem odd that we are using observed percentiles to
estimate $\mu$ and $\sigma$, but it is an example of the
flexibility of the Bayesian approach. In effect we are asking,
``Given hypothetical values for $\mu$ and $\sigma$, and
a sampling process that has some chance of introducing errors,
what is the likelihood of generating a given set of sample
statistics?''

We are free to choose any sample statistics we like, up to a point:
$\mu$ and $\sigma$ determine the location and spread of
a distribution, so we need to choose statistics that capture those
characteristics. For example, if we chose the 49th and 51st percentiles,
we would get very little information about spread, so the estimate
of $\sigma$ would be left relatively unconstrained
by the data.
All values of {\tt sigma} would have nearly the
same likelihood of producing the observed values, so the posterior
distribution of {\tt sigma} would look a lot like the
prior.

\section{Who is more variable?}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_cv.pdf}}
\caption{Posterior distributions of CV for men and women, based on
robust estimators.}
\label{fig.variability3}
\end{figure}

Finally we are ready to answer the question we started with: is the
coefficient of variation greater for men than for women?

Using ABC based on the median and IPR with \verb"num_sigmas=1", I
computed posterior joint distributions for {\tt mu} and {\tt
sigma}. Figures~\ref{fig.variability1} and~\ref{fig.variability2}
show the results as a contour plot with {\tt mu} on the x-axis, {\tt
sigma} on the y-axis, and probability on the z-axis.

For each joint distribution, I computed the posterior distribution of
CV. Figure~\ref{fig.variability3} shows these distributions for men
and women. The mean for men is 0.0410; for women it is 0.0429.
Since there is no overlap between the distributions, we conclude with
near certainty that
women are more variable in height than men.

So is that the end of the Variability Hypothesis? Sadly, no. It turns
out that this
result depends on the choice of the
inter-percentile range. With \verb"num_sigmas=1", we conclude that
women are more variable, but with \verb"num_sigmas=2" we conclude
with equal confidence that men are more variable.

The reason for the difference is that there
are more men of short stature, and their distance from the mean is
greater.

So our evaluation of the Variability Hypothesis depends on the
interpretation of ``variability.'' With \verb"num_sigmas=1" we
focus on people near the mean.
As we increase
\verb"num_sigmas", we give more weight to the extremes.

To decide which
emphasis is appropriate, we would need a more precise statement
of the hypothesis. As it is, the Variability Hypothesis may be
too vague to evaluate.

Nevertheless, it helped
me demonstrate several new ideas and, I hope you agree,
it makes an interesting example.

\section{Discussion}

There are two ways you might think of ABC. One interpretation
is that it is, as the name suggests, an approximation that is
faster to compute than the exact value.

But remember that Bayesian analysis is always
based on modeling decisions, which implies that there is no
``exact'' solution. For any interesting
physical system there are many possible models, and each model
yields different results. To interpret the results, we have to
evaluate the models.
\index{modeling}

So another interpretation of ABC is that it represents an alternative
model of the likelihood. When we compute \p{D|H}, we are asking
``What is the likelihood of the data under a given hypothesis?''
\index{likelihood}

For large datasets, the likelihood of the data is very small, which
is a hint that we might not be asking the right question. What
we really want to know is the likelihood of any outcome
like the data, where the definition of ``like'' is yet another
modeling decision.

The underlying idea of ABC is that two datasets are alike if they yield
the same summary statistics.
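To make this idea concrete, here is a minimal, self-contained sketch
of rejection-style ABC, independent of the code in this chapter. It
accepts hypothetical values of {\tt mu} and {\tt sigma} whenever a
simulated dataset yields nearly the same median and IQR as the
observed data; the priors, tolerance, and sample sizes are arbitrary
choices for illustration.

\begin{verbatim}
import random
import statistics

def Summary(xs):
    # robust summary statistics: median and inter-quartile range
    qs = statistics.quantiles(xs, n=4)
    return statistics.median(xs), qs[2] - qs[0]

def AbcSample(data, n_draws=5000, tol=2.0):
    # rejection ABC: keep (mu, sigma) pairs whose simulated
    # datasets produce summaries close to the observed ones
    med, ipr = Summary(data)
    accepted = []
    for _ in range(n_draws):
        mu = random.uniform(med - 5, med + 5)
        sigma = random.uniform(0.1, 3 * ipr)
        sim = [random.gauss(mu, sigma) for _ in range(len(data))]
        sim_med, sim_ipr = Summary(sim)
        if abs(sim_med - med) < tol and abs(sim_ipr - ipr) < tol:
            accepted.append((mu, sigma))
    return accepted

random.seed(17)
data = [random.gauss(163, 7) for _ in range(500)]
posterior = AbcSample(data)
print(len(posterior))
\end{verbatim}

Because candidate parameters are judged only through their summary
statistics, swapping in different statistics changes the model itself.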
But in some cases, like the example in
this chapter, it is not obvious which summary statistics to choose.
\index{summary statistic}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

An ``effect size'' is a statistic intended to measure the difference
between two groups (see
\url{http://en.wikipedia.org/wiki/Effect_size}).

For example, we could use data from the BRFSS to estimate the
difference in height between men and women. By sampling values
from the posterior distributions of $\mu$ and
$\sigma$, we could generate the posterior distribution of this
difference.

But it might be better to use a dimensionless measure of effect
size, rather than a difference measured in cm. One option is
to divide the difference by the standard deviation (similar to what
we did with the coefficient of variation).

If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
parameters for Group 2 are $(\mu_2, \sigma_2)$, the dimensionless
effect size is
%
\[ \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
%
Write a function that takes joint distributions of
{\tt mu} and {\tt sigma} for two groups and returns
the posterior distribution of effect size.

Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.

\end{exercise}

\chapter{Hypothesis Testing}

\section{Back to the Euro problem}

In Section~\ref{euro} I presented a problem from MacKay's {\it Information
Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110.
`It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}

In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}

In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}

If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}. In fact, we already wrote the function
that does it.

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

To use it we can
create a {\tt Euro} suite and invoke
{\tt Likelihood}:

\begin{verbatim}
suite = Euro()
likelihood = suite.Likelihood(data, 50)
\end{verbatim}

\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.

It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.

One possibility is to cheat and look at the data before we define
the hypothesis.
In that case we would say that ``biased'' means that
the probability of heads is 140/250.

\begin{verbatim}
actual_percent = 100.0 * 140 / 250
likelihood = suite.Likelihood(data, actual_percent)
\end{verbatim}

This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1. So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}

But using the data to formulate the hypothesis
is obviously bogus. By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}

\section{Making a fair comparison}
\label{suitelike}

To make a legitimate comparison, we have to define $B$ without looking
at the data. So let's try a different definition. If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side. You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely. So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''

We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses. We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.

\begin{verbatim}
like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
likelihood = 0.5 * like40 + 0.5 * like60
\end{verbatim}

The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}

More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$.
In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.

\begin{verbatim}
b_uniform = Euro(xrange(0, 101))
b_uniform.Remove(50)
b_uniform.Normalize()
\end{verbatim}

I initialize \verb"b_uniform" with values from 0 to 100.
I removed the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair, but it has almost no
effect on the result whether you remove it or not.

To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.

\begin{verbatim}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{verbatim}

The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}

If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update. To refresh your memory, here's the {\tt Update} function:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's {\tt Normalize}:

\begin{verbatim}
def Normalize(self):
    total = self.Total()

    factor = 1.0 / total
    for x in self.d:
        self.d[x] *= factor

    return total
\end{verbatim}

The return value from {\tt Normalize} is the total of the
probabilities in the Suite, which is the average of the likelihoods
for the sub-hypotheses, weighted by the prior probabilities.
And {\tt
Update} passes this value along, so instead of using {\tt
SuiteLikelihood}, we could compute the likelihood of
\verb"b_uniform" like this:

\begin{verbatim}
likelihood = b_uniform.Update(data)
\end{verbatim}

\section{The triangle prior}

In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%. If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}

\begin{verbatim}
b_triangle = TrianglePrior()
likelihood = b_triangle.Update(data)
\end{verbatim}

The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}

The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}

\begin{tabular}{|l|r|r|}
\hline
Hypothesis & Likelihood & Bayes \\
 & $\times 10^{-76}$ & Factor \\
\hline
$F$ & 5.5 & -- \\
\verb"B_cheat" & 34 & 6.1 \\
\verb"B_two" & 7.4 & 1.3 \\
\verb"B_uniform" & 2.6 & 0.47 \\
\verb"B_triangle" & 4.6 & 0.84 \\
\hline
\end{tabular}

Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.

In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means.
This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.

My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$. In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}

Part of the answer is historical. Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:

\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}

In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.'' Other authors have suggested variations on the
wording. To avoid arguing about adjectives, we could think about odds
instead.

If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1. In terms of probability,
the data changed your degree of belief from 50\% to 66\%.
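This odds arithmetic is easy to check with a few lines of Python
(a standalone sketch, not part of the code for this chapter):

\begin{verbatim}
def PosteriorProb(prior_odds, bayes_factor):
    # posterior odds are prior odds times the Bayes factor;
    # odds o correspond to probability o / (o + 1)
    post_odds = prior_odds * bayes_factor
    return post_odds / (post_odds + 1)

print(PosteriorProb(1, 2))    # 2/3, about 66%
print(PosteriorProb(1, 100))  # 100/101, more than 99%
\end{verbatim}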
For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.

On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, or more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.

\section{Exercises}

\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}

What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not? Or are you
more skeptical about it? Write down your prior odds.

Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?
\end{exercise}

\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.

Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP.
Would that change your mind?

If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's essay ``Of
Miracles'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}

\end{exercise}

\chapter{Evidence}
\label{evidence}

\section{Interpreting SAT scores}

Suppose you are the Dean of Admission at a small engineering
college in Massachusetts, and you are considering two candidates,
Alice and Bob, whose qualifications are similar in many ways,
with the exception that Alice got a higher score on the Math
portion of the SAT, a standardized test intended to measure
preparation for college-level work in mathematics.
\index{SAT}
\index{standardized test}

If Alice got a 780 and Bob got a 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}

Now in reality, both scores are very good, and both
candidates are probably well prepared for college math. So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students. But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''

To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model. I pretend, temporarily, that
all SAT questions are equally difficult.
Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}

But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly. This
simplification makes it easy to compute the likelihood of a given
score.

\section{The scale}

In order to understand SAT scores, we have to understand the scoring
and scaling process. Each test-taker gets a raw score based on the
number of correct and incorrect questions. The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}

In 2009, there were 54 questions on the math SAT. The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.

The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores. I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}

You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.

\section{The prior}

The College Board also publishes the distribution of scaled scores
for all test-takers.
If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of \verb"p_correct".
So we can use the distribution of raw scores to model the
prior distribution of \verb"p_correct".

Here is the code that reads and processes the data:

\begin{verbatim}
class Exam(object):

    def __init__(self):
        self.scale = ReadScale()
        scores = ReadRanks()
        score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
        self.raw = self.ReverseScale(score_pmf)
        self.max_score = max(self.raw.Values())
        self.prior = DivideValues(self.raw, self.max_score)
\end{verbatim}

{\tt Exam} encapsulates the information we have about the exam.
{\tt ReadScale} and {\tt ReadRanks} read files and return
objects that contain the data:
{\tt self.scale} is the {\tt Interpolator} that converts
from raw to scaled scores and back; {\tt scores} is a list
of (score, frequency) pairs.

\verb"score_pmf" is the Pmf of
scaled scores. {\tt self.raw} is the Pmf of raw scores, and
{\tt self.prior} is the Pmf of \verb"p_correct".

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of {\tt p\_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}

Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct". This distribution is approximately Gaussian, but it
is compressed at the extremes. By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}

For each test-taker, I define a Suite called {\tt Sat} that
represents the distribution of \verb"p_correct".
Here's the definition:

\begin{verbatim}
class Sat(thinkbayes.Suite):

    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)

        self.exam = exam
        self.score = score

        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"__init__" takes an Exam object and a scaled score. It makes a
copy of the prior distribution and then updates itself based on the
exam score.

As usual, we inherit {\tt Update} from {\tt Suite} and provide
{\tt Likelihood}:

\begin{verbatim}
def Likelihood(self, data, hypo):
    p_correct = hypo
    score = data

    k = self.exam.Reverse(score)
    n = self.exam.max_score
    like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
    return like
\end{verbatim}

{\tt hypo} is a hypothetical
value of \verb"p_correct", and {\tt data} is a scaled score.

To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers.
With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}

\section{Posterior}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of {\tt p\_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}

Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.

Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?'' We can use the
posterior distributions of \verb"p_correct" to answer this question.

To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:

\begin{itemize}

\item $A$: \verb"p_correct" is higher for Alice than for Bob.

\item $B$: \verb"p_correct" is higher for Bob than for Alice.

\end{itemize}

To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.

So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:

\begin{verbatim}
class TopLevel(thinkbayes.Suite):

    def Update(self, data):
        a_sat, b_sat = data

        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)

        a_like += c_like / 2
        b_like += c_like / 2

        self.Mult('A', a_like)
        self.Mult('B', b_like)

        self.Normalize()
\end{verbatim}

Usually when we define a new Suite, we inherit {\tt Update}
and provide {\tt Likelihood}. In this case I override {\tt Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.

The data passed to {\tt Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".

\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.

\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values. If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero. So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".

Here is the code that creates {\tt TopLevel} and updates it:

\begin{verbatim}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)

top = TopLevel('AB')
top.Update((a_sat, b_sat))
top.Print()
\end{verbatim}

The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21. The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions.
If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}

\section{A better model}

Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.

But how big is the modeling error? If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough. If it's large,
we need a better model.
\index{modeling error}

In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small. So if
you are satisfied with the simple model, you can skip to the next
chapter. If you want to see how the more realistic model works,
read on...

\begin{itemize}

\item Assume that each test-taker has some
degree of {\tt efficacy}, which measures their
ability to answer SAT questions.
\index{efficacy}

\item Assume that each question has some level of
{\tt difficulty}.

\item Finally, assume that the chance that a test-taker answers a
question correctly is related to {\tt efficacy} and {\tt difficulty}
according to this function:

\begin{verbatim}
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{verbatim}

\end{itemize}

This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}.
{\tt
efficacy} and {\tt difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}

When {\tt efficacy} and {\tt difficulty} are equal, the
probability of getting the question right is 50\%. As
{\tt efficacy} increases, this probability approaches 100\%.
As it decreases (or as {\tt difficulty} increases), the
probability approaches 0\%.

Given the distribution of {\tt efficacy} across test-takers
and the distribution of {\tt difficulty} across questions, we
can compute the expected distribution of raw scores. We'll do that
in two steps. First, for a person with given {\tt efficacy},
we'll compute the distribution of raw scores.

\begin{verbatim}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])

    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{verbatim}

{\tt difficulties} is a list of difficulties, one for each question.
{\tt ps} is a list of probabilities, and {\tt pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:

\begin{verbatim}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{verbatim}

{\tt dist} is the sum of these Pmfs. Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums. In order to use Python's {\tt sum} to add up Pmfs,
we have to provide {\tt pmf0}, which is the identity for Pmfs,
so {\tt pmf + pmf0} is always {\tt pmf}.

If we know a person's efficacy, we can compute their distribution
of raw scores. For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture.
Here's the code
that computes the mixture:

\begin{verbatim}
# class Exam:

    def MakeRawScoreDist(self, efficacies):
        pmfs = thinkbayes.Pmf()
        for efficacy, prob in efficacies.Items():
            scores = PmfCorrect(efficacy, self.difficulties)
            pmfs.Set(scores, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return mix
\end{verbatim}

{\tt MakeRawScoreDist} takes {\tt efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers. I assume
it is Gaussian with mean 0 and standard deviation 1.5. This
choice is mostly arbitrary. The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly.
\index{Gaussian distribution}

{\tt pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy, and maps to the fraction of test-takers at that level. {\tt
MakeMixture} takes the meta-Pmf and computes the distribution of the
mixture (see Section~\ref{mixture}).
\index{meta-Pmf}
\index{MakeMixture}

\section{Calibration}

If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}

I assume that the distribution of difficulty is uniform with
parameters {\tt center} and {\tt width}.
{\tt MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}

\begin{verbatim}
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{verbatim}

By trying out a few combinations, I found that
{\tt center=-0.05} and {\tt width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}

So, assuming that the distribution of difficulty is uniform,
its range is approximately
{\tt -1.85} to {\tt 1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}

The following table shows the range of {\tt ProbCorrect} for
test-takers at different levels of efficacy:

\begin{tabular}{|r|r|r|r|}
\hline
& \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy & -1.85 & -0.05 & 1.75 \\
\hline
3.00 & 0.99 & 0.95 & 0.78 \\
1.50 & 0.97 & 0.82 & 0.44 \\
0.00 & 0.86 & 0.51 & 0.15 \\
-1.50 & 0.59 & 0.19 & 0.04 \\
-3.00 & 0.24 & 0.05 & 0.01 \\
\hline
\end{tabular}

Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of answering the easiest questions on
the exam, and a 78\% chance of answering the hardest. On the other
end of the range, someone two standard deviations below the mean
has only a 24\% chance of answering the easiest questions.

\section{Posterior distribution of efficacy}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}

Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob.
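As a quick check before moving on, the entries in the calibration
table above can be reproduced directly from {\tt ProbCorrect}; this
standalone sketch repeats the function definition so it runs on its
own:

\begin{verbatim}
import math

def ProbCorrect(efficacy, difficulty, a=1):
    # logistic curve from item response theory: the probability
    # depends only on the difference efficacy - difficulty
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))

for efficacy in [3.0, 1.5, 0.0, -1.5, -3.0]:
    row = [ProbCorrect(efficacy, d) for d in [-1.85, -0.05, 1.75]]
    print(efficacy, ['%0.2f' % p for p in row])
\end{verbatim}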
Here is a version of the Sat class that uses the new model:

\begin{verbatim}
class Sat2(thinkbayes.Suite):

    def __init__(self, exam, score):
        self.exam = exam
        self.score = score

        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"Update" invokes \verb"Likelihood", which computes the
likelihood of a given test score for a hypothetical level of
efficacy.

\begin{verbatim}
    def Likelihood(self, data, hypo):
        efficacy = hypo
        score = data
        raw = self.exam.Reverse(score)

        pmf = self.exam.PmfCorrect(efficacy)
        like = pmf.Prob(raw)
        return like
\end{verbatim}

{\tt pmf} is the distribution of raw scores for a test-taker with the
given efficacy; {\tt like} is the probability of the observed score.

Figure~\ref{fig.satposterior2} shows the posterior distributions of
efficacy for Alice and Bob. As expected, the location of Alice's
distribution is farther to the right, but again there is some
overlap.

Using {\tt TopLevel} again, we compare $A$, the hypothesis that
Alice's efficacy is higher, and $B$, the hypothesis that Bob's is
higher.
The likelihood ratio is 3.4, a bit smaller than what we got from the
simple model (3.8). So this model indicates that the data are
evidence in favor of $A$, but a little weaker than the previous
estimate.

If our prior belief is that $A$ and $B$ are equally likely, then in
light of this evidence we would give $A$ a posterior probability of
77\%, leaving a 23\% chance that Bob's efficacy is higher.


\section{Predictive distribution}

The analysis we have done so far generates estimates for Alice and
Bob's efficacy, but since efficacy is not directly observable, it is
hard to validate the results.
\index{predictive distribution}

To give the model predictive power, we can use it to answer a related
question: ``If Alice and Bob take the math SAT again, what is the
chance that Alice will do better again?''

We'll answer this question in two steps:

\begin{itemize}

\item We'll use the posterior distribution of efficacy to generate a
predictive distribution of raw score for each test-taker.

\item We'll compare the two predictive distributions to compute the
probability that Alice gets a higher score again.

\end{itemize}

We already have most of the code we need.
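The 77\% figure above follows from Bayes's theorem in odds form: with
even prior odds, the posterior odds equal the likelihood ratio, and
odds $o$ correspond to probability $o / (o + 1)$:

```python
likelihood_ratio = 3.4   # Bayes factor in favor of A, from the model
prior_odds = 1.0         # A and B equally likely a priori
posterior_odds = prior_odds * likelihood_ratio
prob_A = posterior_odds / (posterior_odds + 1)
print(round(prob_A * 100))   # -> 77
```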
To compute the predictive distributions, we can use
\verb"MakeRawScoreDist" again:

\begin{verbatim}
    exam = Exam()
    a_sat = Sat(exam, 780)
    b_sat = Sat(exam, 740)

    a_pred = exam.MakeRawScoreDist(a_sat)
    b_pred = exam.MakeRawScoreDist(b_sat)
\end{verbatim}

Then we can find the likelihood that Alice does better on the second
test, Bob does better, or they tie:

\begin{verbatim}
    a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
    b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
    c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{verbatim}

The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.

Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test. The posterior odds are 3:1 that
Alice's efficacy is higher, but only 2:1 that Alice will do better on
the next exam.


\section{Discussion}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of {\tt p\_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}

We started this chapter with the question, ``How strong is the
evidence that Alice is better prepared than Bob?'' On the face of it,
that sounds like we want to test two hypotheses: either Alice is more
prepared or Bob is.

But in order to compute likelihoods for these hypotheses, we have to
solve an estimation problem.
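The comparison functions used above just enumerate pairs of values
from the two distributions. A minimal sketch of {\tt PmfProbGreater},
using plain dicts as stand-ins for Pmf objects:

```python
def pmf_prob_greater(pmf1, pmf2):
    # Probability that a draw from pmf1 exceeds a draw from pmf2,
    # assuming the two draws are independent.
    return sum(p1 * p2
               for v1, p1 in pmf1.items()
               for v2, p2 in pmf2.items()
               if v1 > v2)

# Example: two independent rolls of a fair die.
die = {v: 1/6 for v in range(1, 7)}
print(pmf_prob_greater(die, die))   # 15/36, about 0.417
```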
For each test-taker we have to find the posterior distribution of
either \verb"p_correct" or \verb"efficacy".

Values like this are called {\bf nuisance parameters} because we
don't care what they are, but we have to estimate them to answer the
question we care about.
\index{nuisance parameter}

One way to visualize the analysis we did in this chapter is to plot
the space of these parameters. \verb"thinkbayes.MakeJoint" takes two
Pmfs, computes their joint distribution, and returns a joint Pmf that
maps from each possible pair of values to its probability.

\begin{verbatim}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{verbatim}

This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}

Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob. The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob. To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.

In {\tt TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line. For
the cells that fall on the line, we add up the total mass and split
it between $A$ and $B$.

The process we used in this chapter---estimating nuisance parameters
in order to evaluate the likelihood of competing hypotheses---is a
common Bayesian approach to problems like this.



\chapter{Simulation}

In this chapter I describe my solution to a problem posed by a
patient with a kidney tumor.
I think the problem is important and relevant to patients with these
tumors and doctors treating them.

And I think the solution is interesting because, although it is a
Bayesian approach to the problem, the use of Bayes's theorem is
implicit. I present the solution and my code; at the end of the
chapter I will explain the Bayesian part.

If you want more technical detail than I present here, you can read
my paper on this work at \url{http://arxiv.org/abs/1203.6890}.


\section{The Kidney Tumor problem}

\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader and occasional contributor to the online
statistics forum at \url{http://reddit.com/r/statistics}. In
November 2011, I read the following message:

\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
cancer formed before I retired from the military. ... Given the dates
of retirement and detection is it possible to determine when there
was a 50/50 chance that I developed the disease? Is it possible to
determine the probability on the retirement date? My tumor was 15.5
cm x 15 cm at detection. Grade II.''
\end{quote}

I contacted the author of the message and got more information; I
learned that veterans get different benefits if it is ``more likely
than not'' that a tumor formed while they were in military service
(among other considerations).

Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated. As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times. Several papers have reported these
growth rates.

I collected data from a paper by Zhang et al\footnote{Zhang et al,
Distribution of Renal Tumor Growth Rates Determined by Using Serial
Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
137--144.}.
I contacted the authors to see if I could get raw data, but they
refused on grounds of medical privacy. Nevertheless, I was able to
extract the data I needed by printing one of their graphs and
measuring it with a ruler.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}

They report growth rates in reciprocal doubling time (RDT), which is
in units of doublings per year. So a tumor with $RDT=1$ doubles in
volume each year; with $RDT=2$ it quadruples in the same time, and
with $RDT=-1$, it halves. Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}

The squares are the data points from the paper; the line is a model I
fit to the data. The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}



\section{A simple model}

It is usually a good idea to start with a simple model before trying
something more challenging.
Sometimes the simple model is sufficient for the problem at hand, and
if not, you can use it to validate the more complex model.
\index{modeling}

For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied
by eight.

I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median rate,
how big would it have been at the date of discharge?''

The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times longer.

\begin{verbatim}
# time between discharge and diagnosis, in days
interval = 3291.0

# doubling time in linear measure is doubling time in volume * 3
dt = 811.0 * 3

# number of doublings since discharge
doublings = interval / dt

# how big was the tumor at time of discharge (diameter in cm)
d1 = 15.5
d0 = d1 / 2.0 ** doublings
\end{verbatim}

You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}. For more information see
Section~\ref{download}.

The result, {\tt d0}, is about 6 cm. So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate. Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.

In addition, I computed the growth rate that would be implied if this
tumor had formed after the date of discharge.
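For concreteness, running the first calculation above produces the
numbers quoted:

```python
interval = 3291.0          # days between discharge and diagnosis
dt = 811.0 * 3             # linear doubling time, in days
doublings = interval / dt  # about 1.35 doublings at the median rate
d0 = 15.5 / 2.0 ** doublings
print(round(doublings, 2), round(d0, 1))   # -> 1.35 6.1
```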
If we assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:

\begin{verbatim}
# assume an initial linear measure of 0.1 cm
d0 = 0.1
d1 = 15.5

# how many doublings would it take to get from d0 to d1
doublings = log2(d1 / d0)

# what linear doubling time does that imply?
dt = interval / doublings

# compute the volumetric doubling time and RDT
vdt = dt / 3
rdt = 365 / vdt
\end{verbatim}

{\tt dt} is linear doubling time, so {\tt vdt} is volumetric doubling
time, and {\tt rdt} is reciprocal doubling time.

The number of doublings, in linear measure, is 7.3, which implies an
RDT of 2.4. In the data from Zhang et al, only 20\% of tumors grew
this fast during a period of observation. So again, I concluded that
it is ``more likely than not'' that the tumor formed prior to the
date of discharge.

These calculations are sufficient to answer the question as posed,
and on behalf of my correspondent, I wrote a letter explaining my
conclusions to the Veterans' Benefit Administration.
\index{Veterans' Benefit Administration}

Later I told a friend, who is an oncologist, about my results. He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors. He suggested that the
results might be interesting to researchers and doctors.

But in order to make them useful, I wanted a more general model of
the relationship between age and size.


\section{A more general model}

Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before any given
date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}

To find it, I run simulations of tumor growth to get the distribution
of size conditioned on age.
Then we can use a Bayesian approach to get the distribution of age
conditioned on size.
\index{conditional distribution}

The simulation starts with a small tumor and runs these steps:

\begin{enumerate}

\item Choose a growth rate from the distribution of RDT.

\item Compute the size of the tumor at the end of an interval.

\item Record the size of the tumor at each interval.

\item Repeat until the tumor exceeds the maximum relevant size.

\end{enumerate}

For the initial size I chose 0.3 cm, because carcinomas smaller than
that are less likely to be invasive and less likely to have the blood
supply needed for rapid growth (see
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
\index{carcinoma}

I chose an interval of 245 days (about 8 months) because that is the
median time between measurements in the data source.

For the maximum size I chose 20 cm. In the data source, the range of
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond the
observed range at each end, but not by far, and not in a way likely
to have a strong effect on the results.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
\caption{Simulations of tumor growth, size vs. time.}
\label{fig.kidney4}
\end{figure}

The simulation is based on one big simplification: the growth rate is
chosen independently during each interval, so it does not depend on
age, size, or growth rate during previous intervals.
\index{independence}

In Section~\ref{serial} I review these assumptions and consider more
detailed models. But first let's look at some examples.

Figure~\ref{fig.kidney4} shows the size of simulated tumors as a
function of age.
The dashed line at 10 cm shows the range of ages for tumors at that
size: the fastest-growing tumor gets there in 8 years; the slowest
takes more than 35.

I am presenting results in terms of linear measurements, but the
calculations are in terms of volume. To convert from one to the
other, again, I use the volume of a sphere with the given diameter.
\index{volume}
\index{sphere}


\section{Implementation}

Here is the kernel of the simulation:
\index{simulation}

\begin{verbatim}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0

    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break

    return seq
\end{verbatim}

\verb"rdt_seq" is an iterator that yields random values from the CDF
of growth rate. {\tt v0} is the initial volume in mL. {\tt interval}
is the time step in years. {\tt vmax} is the final volume
corresponding to a linear measurement of 20 cm.
\index{iterator}

{\tt Volume} converts from linear measurement in cm to volume in mL,
based on the simplification that the tumor is a sphere:

\begin{verbatim}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{verbatim}

{\tt ExtendSequence} computes the volume of the tumor at the end of
the interval.

\begin{verbatim}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)

    return final, new_seq
\end{verbatim}

{\tt age} is the age of the tumor at the end of the interval.
{\tt seq} is a tuple that contains the volumes so far.
{\tt rdt} is the growth rate during the interval, in doublings per
year. {\tt interval} is the size of the time step in years.

The return values are {\tt final}, the volume of the tumor at the end
of the interval, and \verb"new_seq", a new tuple containing the
volumes in {\tt seq} plus the new volume {\tt final}.

{\tt Cache.Add} records the age and size of each tumor at the end of
each interval, as explained in the next section.
\index{cache}


\section{Caching the joint distribution}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
\caption{Joint distribution of age and tumor size.}
\label{fig.kidney8}
\end{figure}

Here's how the cache works.

\begin{verbatim}
class Cache(object):

    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{verbatim}

{\tt joint} is a joint Pmf that records the frequency of each
age-size pair, so it approximates the joint distribution of age and
size.
\index{joint distribution}

At the end of each simulated interval, {\tt ExtendSequence} calls
{\tt Add}:

\begin{verbatim}
# class Cache

    def Add(self, age, seq, rdt):
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{verbatim}

Again, {\tt age} is the age of the tumor, and {\tt seq} is the
sequence of volumes so far. ({\tt rdt} is passed along by
{\tt ExtendSequence} but not needed here.)

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
\caption{Distributions of age, conditioned on size.}
\label{fig.kidney6}
\end{figure}

Before adding the new data to the joint distribution, we use
{\tt Diameter} to convert from volume to diameter in centimeters:

\begin{verbatim}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{verbatim}

And {\tt CmToBucket} to convert from centimeters to a
discrete bucket number:

\begin{verbatim}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{verbatim}

The buckets are equally spaced on a log scale. Using {\tt factor=10}
yields a reasonable number of buckets; for example, 1 cm maps to
bucket 0 and 10 cm maps to bucket 23.
\index{log scale}
\index{bucket}

After running the simulations, we can plot the joint distribution as
a pseudocolor plot, where each cell represents the number of tumors
observed at a given size-age pair. Figure~\ref{fig.kidney8} shows the
joint distribution after 1000 simulations.
\index{pseudocolor plot}



\section{Conditional distributions}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
\caption{Percentiles of tumor age as a function of size.}
\label{fig.kidney7}
\end{figure}

By taking a vertical slice from the joint distribution, we can get
the distribution of sizes for any given age. By taking a horizontal
slice, we can get the distribution of ages conditioned on size.
\index{conditional distribution}

Here's the code that reads the joint distribution and builds the
conditional distribution for a given size.
\index{joint distribution}

\begin{verbatim}
# class Cache

    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{verbatim}

\verb"bucket" is the integer bucket number corresponding to tumor
size. {\tt Joint.Conditional} computes the PMF of age conditioned on
{\tt bucket}. The result is the CDF of age conditioned on
{\tt bucket}.

Figure~\ref{fig.kidney6} shows several of these CDFs, for a range of
sizes.
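As an aside, the bucket spacing defined by {\tt CmToBucket} above is
easy to check: with {\tt factor=10}, consecutive buckets differ by a
factor of $e^{0.1} \approx 1.11$ in diameter, and the endpoints come
out as stated:

```python
import math

def cm_to_bucket(x, factor=10):
    # Buckets equally spaced on a log scale of diameter.
    return factor * math.log(x)

print(round(cm_to_bucket(1)))    # -> 0
print(round(cm_to_bucket(10)))   # -> 23
```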
To summarize these distributions, we can compute percentiles as a
function of size.
\index{percentile}

\begin{verbatim}
percentiles = [95, 75, 50, 25, 5]

for bucket in cache.GetBuckets():
    cdf = ConditionalCdf(bucket)
    ps = [cdf.Percentile(p) for p in percentiles]
\end{verbatim}

Figure~\ref{fig.kidney7} shows these percentiles for each size
bucket. The data points are computed from the estimated joint
distribution. In the model, size and time are discrete, which
contributes numerical errors, so I also show a least squares fit for
each sequence of percentiles.
\index{least squares fit}


\section{Serial Correlation}
\label{serial}

The results so far are based on a number of modeling decisions; let's
review them and consider which ones are the most likely sources of
error:
\index{modeling error}

\begin{itemize}

\item To convert from linear measure to volume, we assume that tumors
are approximately spherical. This assumption is probably fine for
tumors up to a few centimeters, but not for very large tumors.
\index{sphere}

\item The distribution of growth rates in the simulations is based on
a continuous model we chose to fit the data reported by Zhang et al,
which is based on 53 patients. The fit is only approximate and, more
importantly, a larger sample would yield a different distribution.
\index{growth rate}

\item The growth model does not take into account tumor subtype or
grade; this assumption is consistent with the conclusion of Zhang et
al: ``Growth rates in renal tumors of different sizes, subtypes and
grades represent a wide range and overlap substantially.'' But with a
larger sample, a difference might become apparent.
\index{tumor type}

\item The distribution of growth rate does not depend on the size of
the tumor.
This assumption would not be realistic for very small and very large
tumors, whose growth is limited by blood supply.

But tumors observed by Zhang et al ranged from 1 to 12 cm, and they
found no statistically significant relationship between size and
growth rate. So if there is a relationship, it is likely to be weak,
at least in this size range.

\item In the simulations, growth rate during each interval is
independent of previous growth rates. In reality it is plausible that
tumors that have grown quickly in the past are more likely to grow
quickly. In other words, there is probably a serial correlation in
growth rate.
\index{serial correlation}

\end{itemize}

Of these, the first and last seem the most problematic. I'll
investigate serial correlation first, then come back to spherical
geometry.

To simulate correlated growth, I wrote a generator\footnote{If you
are not familiar with Python generators, see
\url{http://wiki.python.org/moin/Generators}.} that yields a
correlated series from a given Cdf.
Here's how the algorithm works:
\index{generator}

\begin{enumerate}

\item Generate correlated values from a Gaussian distribution. This
is easy to do because we can compute the distribution of the next
value conditioned on the previous value.
\index{Gaussian distribution}

\item Transform each value to its cumulative probability using the
Gaussian CDF.
\index{cumulative probability}

\item Transform each cumulative probability to the corresponding
value using the given Cdf.

\end{enumerate}

Here's what that looks like in code:

\begin{verbatim}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{verbatim}

{\tt cdf} is the desired Cdf; {\tt rho} is the desired correlation.
The values of {\tt x} are Gaussian; {\tt Transform} converts them to
the desired distribution.

The first value of {\tt x} is Gaussian with mean 0 and standard
deviation 1. For subsequent values, the mean and standard deviation
depend on the previous value. Given the previous {\tt x}, the mean of
the next value is {\tt x * rho}, and the variance is
{\tt 1 - rho**2}.
\index{correlated random value}

{\tt Transform} maps from each Gaussian value, {\tt x}, to a value
from the given Cdf, {\tt y}.

\begin{verbatim}
def Transform(x):
    p = thinkbayes.GaussianCdf(x)
    y = cdf.Value(p)
    return y
\end{verbatim}

{\tt GaussianCdf} computes the CDF of the standard Gaussian
distribution at {\tt x}, returning a cumulative probability.
{\tt Cdf.Value} maps from a cumulative probability to the
corresponding value in {\tt cdf}.

Depending on the shape of {\tt cdf}, information can be lost in
transformation, so the actual correlation might be lower than
{\tt rho}.
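The Gaussian step of the algorithm can be tested in isolation. This
sketch (mine, not from {\tt kidney.py}) generates a series with the
same recurrence and measures its serial correlation, which should
come out close to {\tt rho}:

```python
import math
import random

def correlated_gaussians(rho, n, seed=17):
    # AR(1) recurrence: the next value is Gaussian with mean x*rho and
    # sd sqrt(1-rho^2), so the series stays standard normal overall.
    rng = random.Random(seed)
    x = rng.gauss(0, 1)
    series = [x]
    sigma = math.sqrt(1 - rho**2)
    for _ in range(n - 1):
        x = rng.gauss(x * rho, sigma)
        series.append(x)
    return series

def sample_corr(xs, ys):
    # Pearson correlation of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

series = correlated_gaussians(rho=0.4, n=100000)
r = sample_corr(series[:-1], series[1:])
print(round(r, 2))   # close to 0.4
```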
For example, when I generate 10000 values from the distribution of
growth rates with {\tt rho=0.4}, the actual correlation is 0.37. But
since we are guessing at the right correlation anyway, that's close
enough.

Remember that {\tt MakeSequence} takes an iterator as an argument.
That interface allows it to work with different generators:
\index{generator}

\begin{verbatim}
iterator = UncorrelatedGenerator(cdf)
seq1 = MakeSequence(iterator)

iterator = CorrelatedGenerator(cdf, rho)
seq2 = MakeSequence(iterator)
\end{verbatim}

In this example, {\tt seq1} and {\tt seq2} are drawn from the same
distribution, but the values in {\tt seq1} are uncorrelated and the
values in {\tt seq2} are correlated with a coefficient of
approximately {\tt rho}.
\index{serial correlation}

Now we can see what effect serial correlation has on the results; the
following table shows percentiles of age for a 6 cm tumor, using the
uncorrelated generator and a correlated generator with target
$\rho = 0.4$.
\index{percentile}

\begin{table}
\input{kidney_table2}
\caption{Percentiles of tumor age conditioned on size.}
\end{table}

Correlation makes the fastest growing tumors faster and the slowest
slower, so the range of ages is wider.
The difference is modest for low percentiles, but for the 95th
percentile it is more than 6 years. To compute these percentiles
precisely, we would need a better estimate of the actual serial
correlation.

However, this model is sufficient to answer the question we started
with: given a tumor with a linear dimension of 15.5 cm, what is the
probability that it formed more than 8 years ago?

Here's the code:

\begin{verbatim}
# class Cache

    def ProbOlder(self, cm, age):
        bucket = CmToBucket(cm)
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{verbatim}

{\tt cm} is the size of the tumor; {\tt age} is the age threshold in
years. {\tt ProbOlder} converts size to a bucket number, gets the Cdf
of age conditioned on bucket, and computes the probability that age
exceeds the given value.

With no serial correlation, the probability that a 15.5 cm tumor is
older than 8 years is 0.999, or almost certain. With correlation 0.4,
faster-growing tumors are more likely, but the probability is still
0.995. Even with correlation 0.8, the probability is 0.978.

Another likely source of error is the assumption that tumors are
approximately spherical. For a tumor with linear dimensions 15.5 x 15
cm, this assumption is probably not valid. If, as seems likely, a
tumor this size is relatively flat, it might have the same volume as
a 6 cm sphere. With this smaller volume and correlation 0.8, the
probability of age greater than 8 is still 95\%.

So even taking into account modeling errors, it is unlikely that such
a large tumor could have formed less than 8 years prior to the date
of diagnosis.
\index{modeling error}


\section{Discussion}

Well, we got through a whole chapter without using Bayes's theorem or
the {\tt Suite} class that encapsulates Bayesian updates.
What happened?

One way to think about Bayes's theorem is as an algorithm for
inverting conditional probabilities. Given \p{B|A}, we can compute
\p{A|B}, provided we know \p{A} and \p{B}. Of course this algorithm
is only useful if, for some reason, it is easier to compute \p{B|A}
than \p{A|B}.

In this example, it is. By running simulations, we can estimate the
distribution of size conditioned on age, or \p{size|age}. But it is
harder to get the distribution of age conditioned on size, or
\p{age|size}. So this seems like a perfect opportunity to use
Bayes's theorem.

The reason I didn't is computational efficiency. To estimate
\p{size|age} for any given size, you have to run a lot of
simulations. Along the way, you end up computing \p{size|age} for a
lot of sizes. In fact, you end up computing the entire joint
distribution of size and age, \p{size, age}.
\index{joint distribution}

And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in {\tt ConditionalCdf}.
\index{conditional distribution}

So we side-stepped Bayes, but he was with us in spirit.


\chapter{A Hierarchical Model}
\label{hierarchical}


\section{The Geiger counter problem}

I got the idea for the following problem from Tom Campbell-Ricketts,
author of the Maximum Entropy blog at
\url{http://maximum-entropy-blog.blogspot.com}. And he got the idea
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
Logic of Science}:
\index{Jaynes, E.~T.}
\index{Campbell-Ricketts, Tom}
\index{Geiger counter problem}

\begin{quote}
Suppose that a radioactive source emits particles toward a Geiger
counter at an average rate of $r$ particles per second, but the
counter only registers a fraction, $f$, of the particles that hit
it.
If $f$ is 10\% and the counter registers 15 particles in a one-second
interval, what is the posterior distribution of $n$, the actual
number of particles that hit the counter, and $r$, the average rate
particles are emitted?
\end{quote}

To get started on a problem like this, think about the chain of
causation that starts with the parameters of the system and ends with
the observed data:
\index{causation}

\begin{enumerate}

\item The source emits particles at an average rate, $r$.

\item During any given second, the source emits $n$ particles toward
the counter.

\item Out of those $n$ particles, some number, $k$, get counted.

\end{enumerate}

The probability that an atom decays is the same at any point in time,
so radioactive decay is well modeled by a Poisson process. Given $r$,
the distribution of $n$ is a Poisson distribution with parameter $r$.
\index{radioactive decay}
\index{Poisson process}

And if we assume that the probability of detection for each particle
is independent of the others, the distribution of $k$ is the binomial
distribution with parameters $n$ and $f$.
\index{binomial distribution}

Given the parameters of the system, we can find the distribution of
the data. So we can solve what is called the {\bf forward problem}.
\index{forward problem}

Now we want to go the other way: given the data, we want the
distribution of the parameters. This is called the {\bf inverse
problem}.
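The forward problem is easy to simulate. In this sketch (my own code,
with Knuth's multiplication method standing in for a library Poisson
sampler), we draw $n$ from a Poisson distribution with rate $r$ and
detect each particle with probability $f$; since a thinned Poisson
process is again Poisson, the mean of $k$ should be near $rf$:

```python
import math
import random

rng = random.Random(42)

def poisson(lam):
    # Knuth's algorithm: multiply uniforms until the product drops
    # below exp(-lam). Fine for moderate lam; a library sampler would
    # be the usual choice.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def forward(r, f):
    # One second of the causal chain: emit n particles, count each
    # independently with probability f.
    n = poisson(r)
    k = sum(1 for _ in range(n) if rng.random() < f)
    return n, k

r, f = 100, 0.1
ks = [forward(r, f)[1] for _ in range(10000)]
mean_k = sum(ks) / len(ks)
print(mean_k)   # close to r * f = 10
```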
And if you can solve the forward
problem, you can use Bayesian methods to solve the inverse problem.
\index{inverse problem}


\section{Start simple}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
\caption{Posterior distribution of $n$ for three values of $r$.}
\label{fig.jaynes1}
\end{figure}

Let's start with a simple version of the problem where we know
the value of $r$. We are given the value of $f$, so all we
have to do is estimate $n$.

I define a Suite called {\tt Detector} that models the behavior
of the detector and estimates $n$.

\begin{verbatim}
class Detector(thinkbayes.Suite):

    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{verbatim}

If the average emission rate is $r$ particles per second, the
distribution of $n$ is Poisson with parameter $r$.
{\tt high} and {\tt step} determine the upper bound for $n$
and the step size between hypothetical values.
\index{Poisson distribution}

Now we need a likelihood function:
\index{likelihood}

\begin{verbatim}
# class Detector

    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f

        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{verbatim}

{\tt data} is the number of particles detected, and {\tt hypo} is
the hypothetical number of particles emitted, $n$.

If there are actually $n$ particles, and the probability of detecting
any one of them is $f$, the probability of detecting $k$ particles is
given by the binomial distribution.
\index{binomial distribution}

That's it for the Detector.
We can try it out for a range
of values of $r$:

\begin{verbatim}
f = 0.1
k = 15

for r in [100, 250, 400]:
    suite = Detector(r, f, step=1)
    suite.Update(k)
    print suite.MaximumLikelihood()
\end{verbatim}

Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
several given values of $r$.


\section{Make it hierarchical}

In the previous section, we assume $r$ is known. Now let's
relax that assumption. I define another Suite, called {\tt Emitter},
that models the behavior of the emitter and estimates $r$:

\begin{verbatim}
class Emitter(thinkbayes.Suite):

    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{verbatim}

{\tt rs} is a sequence of hypothetical values for $r$. {\tt detectors}
is a sequence of Detector objects, one for each value of $r$. The
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
that is, a Suite that contains other Suites as values.
\index{meta-Suite}

To update the Emitter, we have to compute the likelihood of the data
under each hypothetical value of $r$.
But each value of $r$ is
represented by a Detector that contains a range of values for $n$.

To compute the likelihood of the data for a given Detector, we loop
through the values of $n$ and add up the total probability of $k$.
That's what {\tt SuiteLikelihood} does:

\begin{verbatim}
# class Detector

    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{verbatim}

Now we can write the Likelihood function for the Emitter:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{verbatim}

Each {\tt hypo} is a Detector, so we can invoke
{\tt SuiteLikelihood} to get the likelihood of the data under
the hypothesis.

After we update the Emitter, we have to update each of the
Detectors, too.

\begin{verbatim}
# class Emitter

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)

        for detector in self.Values():
            detector.Update(data)
\end{verbatim}

A model like this, with multiple levels of Suites, is called {\bf
hierarchical}. \index{hierarchical model}


\section{A little optimization}

You might recognize {\tt SuiteLikelihood}; we saw it
in Section~\ref{suitelike}.
At the time, I pointed out that
we didn't really need it, because the total probability
computed by {\tt SuiteLikelihood} is exactly the normalizing
constant computed and returned by {\tt Update}.
\index{normalizing constant}

So instead of updating the Emitter and then updating the
Detectors, we can do both steps at the same time, using
the result from {\tt Detector.Update} as the likelihood
of Emitter.

Here's the streamlined version of {\tt Emitter.Likelihood}:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{verbatim}

And with this version of {\tt Likelihood} we can use the
default version of {\tt Update}. So this version has fewer
lines of code, and it runs faster because it does not compute
the normalizing constant twice.
\index{optimization}


\section{Extracting the posteriors}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
\caption{Posterior distributions of $n$ and $r$.}
\label{fig.jaynes2}
\end{figure}

After we update the Emitter, we can get the posterior distribution
of $r$ by looping through the Detectors and their probabilities:

\begin{verbatim}
# class Emitter

    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{verbatim}

{\tt items} is a list of values of $r$ and their probabilities.
The result is the Pmf of $r$.

To get the posterior distribution of $n$, we have to compute
the mixture of the Detectors. We can use
{\tt thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
from each distribution to its probability.
And that's exactly
what the Emitter is:

\begin{verbatim}
# class Emitter

    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{verbatim}

Figure~\ref{fig.jaynes2} shows the results. Not surprisingly, the
most likely value for $n$ is 150. Given $f$ and $n$, the expected
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
$k / f$, which is 150.

And if 150 particles are emitted in one second, the most likely value
of $r$ is 150 particles per second. So the posterior distribution of
$r$ is also centered on 150.

The posterior distributions of $r$ and $n$ are similar;
the only difference is that we are slightly less certain about $n$.
In general, we can be more certain about the long-range emission rate,
$r$, than about the number of particles emitted in any particular second,
$n$.

You can download the code in this chapter from
\url{http://thinkbayes.com/jaynes.py}. For more information see
Section~\ref{download}.


\section{Discussion}

The Geiger counter problem demonstrates the connection between
causation and hierarchical modeling. In the example, the
emission rate $r$ has a causal effect on the number of particles,
$n$, which has a causal effect on the particle count, $k$.
\index{Geiger counter problem}
\index{causation}

The hierarchical model reflects the structure of the
system, with causes at the top and effects at the bottom.
\index{hierarchical model}

\begin{enumerate}

\item At the top level, we start with a range of hypothetical
values for $r$.

\item For each value of $r$, we have a range of values for $n$,
and the prior distribution of $n$ depends on $r$.

\item When we update the model, we go bottom-up.
We compute
a posterior distribution of $n$ for each value of $r$, then
compute the posterior distribution of $r$.

\end{enumerate}

So causal information flows down the hierarchy, and inference flows
up.


\section{Exercises}

\begin{exercise}
This exercise is also inspired by an example in Jaynes, {\em
Probability Theory}.

Suppose you buy a mosquito trap that is supposed to reduce the
population of mosquitoes near your house. Each
week, you empty the trap and count the number of mosquitoes
captured. After the first week, you count 30 mosquitoes.
After the second week, you count 20 mosquitoes. Estimate the
percentage change in the number of mosquitoes in your yard.

To answer this question, you have to make some modeling
decisions. Here are some suggestions:

\begin{itemize}

\item Suppose that each week a large number of mosquitoes, $N$, is bred
in a wetland near your home.

\item During the week, some fraction of
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
are caught in the trap.

\item Your solution should take into account your prior belief
about how much $N$ is likely to change from one week to the next.
You can do that by adding a level to the hierarchy to
model the percent change in $N$.

\end{itemize}

\end{exercise}


\chapter{Dealing with Dimensions}
\label{species}

\section{Belly button bacteria}

Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and parts of the body.
\index{biodiversity}
\index{belly button}
\index{bacteria}
\index{microbiome}

In
their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from. Each identified fragment is called a ``read.''
\index{navel}
\index{rDNA}
\index{pyrosequencing}

We can use these data to answer several related questions:

\begin{itemize}

\item Based on the number of species observed, can we estimate
the total number of species in the environment?
\index{species}

\item Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?
\index{prevalence}

\item If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

These questions make up what is called the {\bf Unseen Species problem}.
\index{Unseen Species problem}


\section{Lions and tigers and bears}

I'll start with a simplified version of the problem where we know that
there are exactly three species. Let's call them lions, tigers and
bears. Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.
\index{lions and tigers and bears}

If we have an equal chance of observing any animal in the preserve,
the number of each species we see is governed by the multinomial
distribution.
If the prevalence of lions and tigers and bears is
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to
\index{multinomial distribution}

\begin{verbatim}
p_lion**3 * p_tiger**2 * p_bear**1
\end{verbatim}

An approach that is tempting, but not correct, is to use beta
distributions, as in Section~\ref{beta}, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
distribution of \verb"p_lion" is:
\index{beta distribution}

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
\end{verbatim}

The maximum likelihood estimate for \verb"p_lion" is the observed
rate, 50\%. Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
are 33\% and 17\%.
\index{maximum likelihood}

But there are two problems:

\begin{enumerate}

\item We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct. The right prior should have a mean of 1/3,
and there should be zero likelihood that any species has a
prevalence of 100\%.

\item The distributions for each species are not independent, because
the prevalences have to add up to 1. To capture this dependence, we
need a joint distribution for the three prevalences.
\index{independence}
\index{joint distribution}

\end{enumerate}

We can use a Dirichlet distribution to solve both of these problems
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}).
In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of \verb"p_lion",
\verb"p_tiger" and \verb"p_bear".
\index{beta distribution}
\index{Dirichlet distribution}

The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution. Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.

If there are {\tt n} outcomes, the Dirichlet distribution is
described by {\tt n} parameters, written $\alpha_1$ through $\alpha_n$.

Here's the definition, from {\tt thinkbayes.py}, of a class that
represents a Dirichlet distribution:
\index{numpy}

\begin{verbatim}
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
\end{verbatim}

{\tt n} is the number of dimensions; initially the parameters
are all 1. I use a {\tt numpy} array to store the parameters
so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:

\begin{verbatim}
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
\end{verbatim}

{\tt i} is the index of the marginal distribution we want.
{\tt alpha0} is the sum of the parameters; {\tt alpha} is the
parameter for the given species.
\index{marginal distribution}

In the example, the prior marginal distribution for each species
is {\tt Beta(1, 2)}.
We can compute the prior means like
this:

\begin{verbatim}
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
\end{verbatim}

As expected, the prior mean prevalence for each species is 1/3.

To update the Dirichlet distribution, we add the
observations to the parameters like this:

\begin{verbatim}
    def Update(self, data):
        m = len(data)
        self.params[:m] += data
\end{verbatim}

Here {\tt data} is a sequence of counts in the same order as {\tt
params}, so in this example, it should be the number of lions,
tigers and bears.

{\tt data} can be shorter than {\tt params}; in that
case there are some species that have not been
observed.

Here's code that updates {\tt dirichlet} with the observed data and
computes the posterior marginal distributions:

\begin{verbatim}
data = [3, 2, 1]
dirichlet.Update(data)

for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print i, pmf.Mean()
\end{verbatim}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
\caption{Distribution of prevalences for three species.}
\label{fig.species1}
\end{figure}

Figure~\ref{fig.species1} shows the results. The posterior
mean prevalences are 44\%, 33\%, and 22\%.


\section{The hierarchical version}

We have solved a simplified version of the problem: if we
know how many species there are, we can estimate the prevalence
of each.
\index{prevalence}

Now let's get back to the original problem, estimating the total
number of species. To solve this problem I'll define a meta-Suite,
which is a Suite that contains other Suites as hypotheses.
In this
case, the top-level Suite contains hypotheses about the number of
species; the bottom level contains hypotheses about prevalences.
\index{hierarchical model}
\index{meta-Suite}

Here's the class definition:

\begin{verbatim}
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
\end{verbatim}

\verb"__init__" takes a list of possible values for {\tt n} and
makes a list of Dirichlet objects.

Here's the code that creates the top-level suite:

\begin{verbatim}
ns = range(3, 30)
suite = Species(ns)
\end{verbatim}

{\tt ns} is the list of possible values for {\tt n}. We have seen 3
species, so there have to be at least that many. I chose an upper
bound that seems reasonable, but we will check later that the
probability of exceeding this bound is low. And at least initially
we assume that any value in this range is equally likely.

To update a hierarchical model, you have to update all levels.
Usually you have to update the bottom
level first and work up, but in this case we can
update the top level first:

\begin{verbatim}
# class Species

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for hypo in self.Values():
            hypo.Update(data)
\end{verbatim}

{\tt Species.Update} invokes {\tt Update} in the parent class,
then loops through the sub-hypotheses and updates them.

Now all we need is a likelihood function:

\begin{verbatim}
# class Species

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(1000):
            like += dirichlet.Likelihood(data)

        return like
\end{verbatim}

{\tt data} is a sequence of
observed counts; {\tt hypo} is a Dirichlet object.
{\tt Species.Likelihood} calls
{\tt Dirichlet.Likelihood} 1000 times and returns the total.

Why call it
1000 times? Because {\tt
Dirichlet.Likelihood} doesn't actually compute the likelihood of the
data under the whole Dirichlet distribution. Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences.

Here's what it looks like:

\begin{verbatim}
# class Dirichlet

    def Likelihood(self, data):
        m = len(data)
        if self.n < m:
            return 0

        x = data
        p = self.Random()
        q = p[:m]**x
        return q.prod()
\end{verbatim}

The length of {\tt data} is the number of species observed. If
we see more species than we thought existed, the likelihood is 0.

\index{multinomial distribution}
Otherwise we select a random set of prevalences, {\tt p}, and
compute the multinomial PMF, which is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
where $p_i$ is the prevalence of the $i$th species, and $x_i$ is the
observed number. The first term, $c_x$, is the multinomial
coefficient; I leave it out of the computation because it is
a multiplicative factor that depends only
on the data, not the hypothesis, so it gets normalized away
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
\index{multinomial coefficient}

{\tt m} is the number of observed species.
We only need the first {\tt m} elements of {\tt p};
for the others, $x_i$ is 0, so
$p_i^{x_i}$ is 1, and we can leave them out of the product.


\section{Random sampling}
\label{randomdir}

There are two ways to generate a random sample from a Dirichlet
distribution.
One is to use the marginal beta distributions, but in
that case you have to select one at a time and scale the rest so they
add up to 1 (see
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
\index{random sample}

A less obvious, but faster, way is to select values from {\tt n} gamma
distributions, then normalize by dividing through by the total.
Here's the code:
\index{numpy}
\index{gamma distribution}

\begin{verbatim}
# class Dirichlet

    def Random(self):
        p = numpy.random.gamma(self.params)
        return p / p.sum()
\end{verbatim}

Now we're ready to look at some results. Here is the code that extracts
the posterior distribution of {\tt n}:

\begin{verbatim}
    def DistOfN(self):
        pmf = thinkbayes.Pmf()
        for hypo, prob in self.Items():
            pmf.Set(hypo.n, prob)
        return pmf
\end{verbatim}

{\tt DistOfN} iterates
through the top-level hypotheses and accumulates the probability
of each {\tt n}.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
\caption{Posterior distribution of {\tt n}.}
\label{fig.species2}
\end{figure}

Figure~\ref{fig.species2} shows the result. The most likely value is 4.
Values from 3 to 7 are reasonably likely; after that the probabilities
drop off quickly. The probability that there are 29 species is
low enough to be negligible; if we chose a higher bound,
we would get nearly the same result.

Remember that this result is based on a uniform prior for {\tt n}. If
we have background information about the number of species in the
environment, we might choose a different prior.
\index{uniform distribution}


\section{Optimization}

I have to admit that I am proud of this example.
The Unseen Species
problem is not easy, and I think this solution is simple and clear,
and takes surprisingly few lines of code (about 50 so far).

The only problem is that it is slow. It's good enough for the example
with only 3 observed species, but not good enough for the belly button
data, with more than 100 species in some samples.

The next few sections present a series of optimizations we need to
make this solution scale. Before we get into the details, here's
a road map.
\index{optimization}

\begin{itemize}

\item The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first {\tt m} parameters are
the same for all of them. The only difference is the number of
hypothetical unseen species. So we don't really need {\tt n}
Dirichlet objects; we can store the parameters in the top level of
the hierarchy. {\tt Species2} implements this optimization.

\item {\tt Species2} also uses the same set of random values for all
of the hypotheses. This saves time generating random values, but it
has a second benefit that turns out to be more important: by giving
all hypotheses the same selection from the sample space, we make
the comparison between the hypotheses more fair, so it takes
fewer iterations to converge.

\item Even with these changes there is a major performance problem.
As the number of observed species increases, the array of random
prevalences gets bigger, and the chance of choosing one that is
approximately right becomes small. So the vast majority of
iterations yield small likelihoods that don't contribute much to the
total, and don't discriminate between hypotheses.

The solution is to do the updates one species at a time.
{\tt
Species4} is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.

\item Finally, {\tt Species5} combines the sub-hypotheses into the top
level and uses {\tt numpy} array operations to speed things up.
\index{numpy}

\end{itemize}

If you are not interested in the details, feel free to skip to
Section~\ref{belly} where we look at results from the belly
button data.


\section{Collapsing the hierarchy}
\label{collapsing}

All of the bottom-level Dirichlet distributions are updated
with the same data, so the first {\tt m} parameters are the same for
all of them.
We can eliminate them and merge the parameters into
the top-level suite. {\tt Species2} implements this optimization:
\index{numpy}

\begin{verbatim}
class Species2(object):

    def __init__(self, ns):
        self.ns = ns
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        self.params = numpy.ones(ns[-1], dtype=numpy.int)
\end{verbatim}

{\tt ns} is the list of hypothetical values for {\tt n};
{\tt probs} is the list of corresponding probabilities. And
{\tt params} is the sequence of Dirichlet parameters, initially
all 1; its length is the largest value in {\tt ns}.

{\tt Species2.Update} updates both levels of
the hierarchy: first the probability for each value of {\tt n},
then the Dirichlet parameters:
\index{numpy}

\begin{verbatim}
# class Species2

    def Update(self, data):
        like = numpy.zeros(len(self.ns), dtype=numpy.double)
        for i in range(1000):
            like += self.SampleLikelihood(data)

        self.probs *= like
        self.probs /= self.probs.sum()

        m = len(data)
        self.params[:m] += data
\end{verbatim}

{\tt SampleLikelihood} returns an array of likelihoods, one for each
value of {\tt n}. {\tt like} accumulates the total likelihood for
1000 samples. {\tt self.probs} is multiplied by the total likelihood,
then normalized.
The last two lines, which update the parameters,
are the same as in {\tt Dirichlet.Update}.

Now let's look at {\tt SampleLikelihood}. There are two
opportunities for optimization here:

\begin{itemize}

\item When the hypothetical number of species, {\tt n},
exceeds the observed number, {\tt m}, we only need the first {\tt m}
terms of the multinomial PMF; the rest are 1.

\item If the number of species is large, the likelihood of the data
might be too small for floating-point (see Section~\ref{underflow}). So it
is safer to compute log-likelihoods.
\index{log-likelihood} \index{underflow}

\end{itemize}

\index{multinomial distribution}
Again, the multinomial PMF is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
So the log-likelihood is
%
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
%
which is fast and easy to compute. Again, $c_x$
is the same for all hypotheses, so we can drop it.
Here's the code:
\index{numpy}

\begin{verbatim}
# class Species2

    def SampleLikelihood(self, data):
        gammas = numpy.random.gamma(self.params)

        m = len(data)
        row = gammas[:m]
        col = numpy.cumsum(gammas)

        log_likes = []
        for n in self.ns:
            ps = row / col[n-1]
            terms = data * numpy.log(ps)
            log_like = terms.sum()
            log_likes.append(log_like)

        log_likes = numpy.asarray(log_likes)
        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)

        coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
        likes *= coefs

        return likes
\end{verbatim}

{\tt gammas} is an array of values from a gamma distribution; its
length is the largest hypothetical value of {\tt n}.
{\tt row} is
just the first {\tt m} elements of {\tt gammas}; since these are the
only elements that depend on the data, they are the only ones we need.
\index{gamma distribution}

For each value of {\tt n} we need to divide {\tt row} by the
total of the first {\tt n} values from {\tt gammas}. {\tt cumsum}
computes these cumulative sums and stores them in {\tt col}.
\index{cumulative sum}

The loop iterates through the values of {\tt n} and accumulates
a list of log-likelihoods.
\index{log-likelihood}

Inside the loop, {\tt ps} contains the row of probabilities, normalized
with the appropriate cumulative sum. {\tt terms} contains the
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
their sum.

After the loop, we want to convert the log-likelihoods to linear
likelihoods, but first it's a good idea to shift them so the largest
log-likelihood is 0; that way the linear likelihoods are not too
small (see Section~\ref{underflow}).

Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these {\tt m}
species, if the total number of species is {\tt n}.
{\tt BinomialCoef} computes ``n choose m'', which is written
$\binom{n}{m}$.
\index{binomial coefficient}

As often happens, the optimized version is less readable and more
error-prone than the original. But that's one reason I think it is
a good idea to start with the simple version; we can use it for
regression testing. I plotted results from both versions and confirmed
that they are approximately equal, and that they converge as the
number of iterations increases.
\index{regression testing}


\section{One more problem}

There's more we could do to optimize this code, but there's another
problem we need to fix first.
As the number of observed
species increases, this version gets noisier and takes more
iterations to converge on a good answer.

The problem is that if the prevalences we choose from the Dirichlet
distribution, the {\tt ps}, are not at least approximately right,
the likelihood of the observed data is close to zero and almost
equally bad for all values of {\tt n}. So most iterations don't
provide any useful contribution to the total likelihood. And as the
number of observed species, {\tt m}, gets large, the probability of
choosing {\tt ps} with non-negligible likelihood gets small. Really
small.

Fortunately, there is a solution. Remember that if you observe
a set of data, you can update the prior distribution with the
entire dataset, or you can break it up into a series of updates
with subsets of the data, and the result is the same either way.

For this example, the key is to perform the updates one species at
a time. That way when we generate a random set of {\tt ps}, only
one of them affects the computed likelihood, so the chance of choosing
a good one is much better.

Here's a new version that updates one species at a time:
\index{numpy}

\begin{verbatim}
class Species4(Species):

    def Update(self, data):
        m = len(data)

        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species}, so it
represents the hypotheses as a list of Dirichlet objects (unlike
{\tt Species2}).

{\tt Update} loops through the observed species and makes an
array, {\tt one}, with all zeros and one species count. Then
it calls {\tt Update} in the parent class, which computes
the likelihoods and updates the sub-hypotheses.

So in the running example, we do three updates.
The first
is something like ``I have seen three lions.'' The second is
``I have seen two tigers and no additional lions.'' And the third
is ``I have seen one bear and no more lions and tigers.''

Here's the new version of {\tt Likelihood}:

\begin{verbatim}
# class Species4

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(self.iterations):
            like += dirichlet.Likelihood(data)

        # correct for the number of unseen species the new one
        # could have been
        m = len(data)
        num_unseen = dirichlet.n - m + 1
        like *= num_unseen

        return like
\end{verbatim}

This is almost the same as {\tt Species.Likelihood}. The difference
is the factor, \verb"num_unseen". This correction is necessary
because each time we see a species for the first time, we have to
consider that there were some number of other unseen species that
we might have seen. For larger values of {\tt n} there are more
unseen species that we could have seen, which increases the likelihood
of the data.

This is a subtle point and I have to admit that I did not get it right
the first time. But again I was able to validate this version
by comparing it to the previous versions.
\index{regression testing}


\section{We're not done yet}

\newcommand{\BigO}[1]{\mathcal{O}(#1)}

Performing the updates one species at a time solves one problem, but
it creates another. Each update takes time proportional to $k m$,
where $k$ is the number of hypotheses and $m$ is the number of observed
species. So if we do $m$ updates, the total run time is
proportional to $k m^2$.

But we can speed things up using the same trick we used in
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
collapse the two levels of the hierarchy into a single object.
So
here's yet another version of {\tt Species}:

\begin{verbatim}
class Species5(Species2):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species2}, so
it uses {\tt ns} and {\tt probs} to represent the distribution
of {\tt n}, and {\tt params} to represent the parameters of
the Dirichlet distribution.

{\tt Update} is similar to what we saw in the previous section.
It loops through the observed species and calls {\tt UpdateOne}:
\index{numpy}

\begin{verbatim}
# class Species5

    def UpdateOne(self, i, count):
        likes = numpy.zeros(len(self.ns), dtype=numpy.double)
        for _ in range(self.iterations):
            likes += self.SampleLikelihood(i, count)

        unseen_species = [n-i+1 for n in self.ns]
        likes *= unseen_species

        self.probs *= likes
        self.probs /= self.probs.sum()
\end{verbatim}

This function is similar to {\tt Species2.Update}, with two changes:

\begin{itemize}

\item The interface is different.  Instead of the whole dataset, we
get {\tt i}, the index of the observed species, and {\tt count},
how many of that species we've seen.

\item We have to apply a correction factor for the number of unseen
species, as in {\tt Species4.Likelihood}.
The difference here is
that we update all of the likelihoods at once with array
multiplication.

\end{itemize}

Finally, here's {\tt SampleLikelihood}:
\index{numpy}

\begin{verbatim}
# class Species5

    def SampleLikelihood(self, i, count):
        gammas = numpy.random.gamma(self.params)

        sums = numpy.cumsum(gammas)[self.ns[0]-1:]

        ps = gammas[i-1] / sums
        log_likes = numpy.log(ps) * count

        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)

        return likes
\end{verbatim}

This is similar to {\tt Species2.SampleLikelihood}; the
difference is that each update only includes a single species,
so we don't need a loop.

The runtime of this function is proportional to the number
of hypotheses, $k$.  It runs $m$ times, so the run time of
the update is proportional to $k m$.
And the number of iterations we
need to get an accurate result is usually small.


\section{The belly button data}
\label{belly}

That's enough about lions and tigers and bears.
Let's get back to belly buttons.  To get a sense of what the
data look like, consider subject B1242,
whose sample of 400 reads yielded 61 species with the following
counts:

\begin{verbatim}
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
\end{verbatim}

There are a few dominant species that make up a large
fraction of the whole, but many species that yielded only
a single read.
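As a quick sanity check, we can tally these counts directly (a
standalone sketch, not part of {\tt species.py}):

```python
# Counts for subject B1242, transcribed from the list above.
counts = ([92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5] +
          [4] * 7 + [3] * 7 + [2] * 4 + [1] * 30)

num_species = len(counts)          # number of distinct species observed
num_reads = sum(counts)            # total number of reads
num_singletons = counts.count(1)   # species observed exactly once

print(num_species, num_reads, num_singletons)  # 61 400 30
```

So 30 of the 61 observed species, nearly half, are singletons.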
The number of these ``singletons'' suggests
that there are likely to be at least a few unseen species.
\index{species}

In the example with lions and tigers, we assume that each
animal in the preserve is equally likely to be observed.
Similarly, for the belly button data, we assume that each
bacterium is equally likely to yield a read.

In reality, each step in the data-collection
process might introduce biases.  Some species might
be more likely to be picked up by a swab, or to yield identifiable
amplicons.  So when we talk about the prevalence of each species,
we should remember this source of error.
\index{sample bias}

I should also acknowledge that I am using the term ``species''
loosely.  First, bacterial species are not well defined.  Second,
some reads identify a particular species, others only identify
a genus.  To be more precise, I should say ``operational
taxonomic unit'', or OTU.
\index{operational taxonomic unit}
\index{OTU}

Now let's process some of the belly button data.  I define
a class called {\tt Subject} to represent information about
each subject in the study:

\begin{verbatim}
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
\end{verbatim}

Each subject has a string code, like ``B1242'', and a list of
(count, species name) pairs, sorted in increasing order by count.
{\tt Subject} provides several methods to make it
easy to access these counts and species names.
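To give a sense of what those accessors look like, here is a
simplified sketch; the method names follow {\tt species.py}, but this
stripped-down version is hypothetical:

```python
# Simplified sketch of Subject and its accessors; the real class,
# which also handles sorting and file parsing, is in species.py.
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []          # list of (count, species name) pairs

    def Add(self, species, count):
        self.species.append((count, species))

    def GetNames(self):
        # species names, in the same order as the counts
        return [name for count, name in self.species]

    def GetCounts(self):
        return [count for count, name in self.species]

subject = Subject('B1242')
subject.Add('Corynebacterium', 92)
subject.Add('unclassified Firmicutes', 53)
print(subject.GetNames())  # ['Corynebacterium', 'unclassified Firmicutes']
```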
You can see the details
in \url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
\caption{Distribution of {\tt n} for subject B1242.}
\label{species-ndist}
\end{figure}

{\tt Subject} provides a method named {\tt Process} that creates and
updates a {\tt Species5} suite,
which represents the distributions of {\tt n} and the prevalences.
\index{prevalence}

And {\tt Species2} provides {\tt DistN}, which returns the posterior
distribution of {\tt n}.

\begin{verbatim}
# class Species2

    def DistN(self):
        items = zip(self.ns, self.probs)
        pmf = thinkbayes.MakePmfFromItems(items)
        return pmf
\end{verbatim}

Figure~\ref{species-ndist} shows the distribution of {\tt n} for
subject B1242.  The probability that there are exactly 61 species, and
no unseen species, is nearly zero.  The most likely value is 72, with
90\% credible interval 66 to 79.  At the high end, it is unlikely that
there are as many as 87 species.

Next we compute the posterior distribution of prevalence for
each species.  {\tt Species2} provides {\tt DistOfPrevalence}:

\begin{verbatim}
# class Species2

    def DistOfPrevalence(self, index):
        metapmf = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            metapmf.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(metapmf)
        return metapmf, mix
\end{verbatim}

{\tt index} indicates which species we want.
For each
{\tt n}, we have a different posterior distribution
of prevalence.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
\caption{Distribution of prevalences for subject B1242.}
\label{species-prev}
\end{figure}

The loop iterates through the possible values of {\tt n}
and their probabilities.  For each value of {\tt n} it gets
a Beta object representing the marginal distribution for the
indicated species.  Remember that Beta objects contain the
parameters {\tt alpha} and {\tt beta}; they don't have
values and probabilities like a Pmf, but they provide {\tt MakePmf},
which generates a discrete approximation to the continuous
beta distribution.
\index{Beta object}

{\tt metapmf} is a meta-Pmf that contains the distributions
of prevalence, conditioned on {\tt n}.  {\tt MakeMixture}
collapses the meta-Pmf into {\tt mix}, which combines the
conditional distributions into a single distribution
of prevalence.
\index{meta-Pmf}
\index{mixture}
\index{MakeMixture}

Figure~\ref{species-prev} shows results for the five
species with the most reads.  The most prevalent species accounts for
23\% of the 400 reads, but since there are almost certainly unseen
species, the most likely estimate for its prevalence is 20\%,
with 90\% credible interval between 17\% and 23\%.


\section{Predictive distributions}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
\caption{Simulated rarefaction curves for subject B1242.}
\label{species-rare}
\end{figure}

I introduced the hidden species problem in the form of four related
questions.
We have answered the first two by computing the posterior
distribution for {\tt n} and the prevalence of each species.
\index{predictive distribution}

The other two questions are:

\begin{itemize}

\item If we are planning to collect additional reads, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

To answer predictive questions like this we can use the posterior
distributions to simulate possible future events and compute
predictive distributions for the number of species, and fraction of
the total, we are likely to see.

The kernel of these simulations looks like this:
\index{simulation}

\begin{enumerate}

\item Choose {\tt n} from its posterior distribution.

\item Choose a prevalence for each species, including possible unseen
species, using the Dirichlet distribution.
\index{Dirichlet distribution}

\item Generate a random sequence of future observations.

\item Compute the number of new species, \verb"num_new", as a function
of the number of additional reads, {\tt k}.

\item Repeat the previous steps and accumulate the joint distribution
of \verb"num_new" and {\tt k}.
\index{joint distribution}

\end{enumerate}

And here's the code.
{\tt RunSimulation} runs a single simulation:

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
\end{verbatim}

\verb"num_reads" is the number of additional reads to simulate.
{\tt m} is the number of seen species, and {\tt seen} is a set of
strings with a unique name for each species.
{\tt n} is a random value from the posterior distribution, and
{\tt observations} is a random sequence of species names.

Each time through the loop, we add the new observation to
{\tt seen} and record the number of reads and the number of
new species so far.

The result of {\tt RunSimulation} is a {\bf rarefaction curve},
represented as a list of pairs with the number of reads and
the number of new species.
\index{rarefaction curve}

Before we see the results, let's look at {\tt GetSeenSpecies} and
{\tt GenerateObservations}.

\begin{verbatim}
# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
\end{verbatim}

{\tt GetNames} returns the list of species names that appear in
the data files, but for many subjects these names are not unique.
So I use {\tt SpeciesGenerator} to extend each name with a serial
number:
\index{generator}

\begin{verbatim}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{verbatim}

Given a name like {\tt Corynebacterium}, {\tt SpeciesGenerator} yields
{\tt Corynebacterium-1}.
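We can see what the generator produces by running it on a short list
of names (a self-contained sketch; the names are made up):

```python
def SpeciesGenerator(names, num):
    # yield each observed name with a serial number, then pad
    # the sequence with placeholder names for unseen species
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1

names = ['Staphylococcus', 'Corynebacterium']
print(list(SpeciesGenerator(names, 4)))
# ['Staphylococcus-0', 'Corynebacterium-1', 'unseen-2', 'unseen-3']
```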
When the list of names is exhausted, it
yields names like {\tt unseen-62}.

Here is {\tt GenerateObservations}:

\begin{verbatim}
# class Subject

    def GenerateObservations(self, num_reads):
        n, prevalences = self.suite.SamplePosterior()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_reads)

        return n, observations
\end{verbatim}

Again, \verb"num_reads" is the number of additional reads
to generate.  {\tt n} and {\tt prevalences} are samples from
the posterior distribution.

{\tt cdf} is a Cdf object that maps species names, including the
unseen, to cumulative probabilities.  Using a Cdf makes it efficient
to generate a random sequence of species names.
\index{Cdf}
\index{cumulative probability}

Finally, here is {\tt Species2.SamplePosterior}:

\begin{verbatim}
    def SamplePosterior(self):
        pmf = self.DistN()
        n = pmf.Random()
        prevalences = self.SamplePrevalences(n)
        return n, prevalences
\end{verbatim}

And {\tt SamplePrevalences}, which generates a sample of
prevalences conditioned on {\tt n}:
\index{numpy}
\index{random sample}

\begin{verbatim}
# class Species2

    def SamplePrevalences(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
\end{verbatim}

We saw this algorithm for generating random values from a Dirichlet
distribution in Section~\ref{randomdir}.

Figure~\ref{species-rare} shows 100 simulated rarefaction curves
for subject B1242.  The curves are ``jittered''; that is, I shifted
each curve by a random offset so they would not all overlap.
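The jittering itself is simple; here is one way to do it (a sketch
with a hypothetical function name; the curve format is the list of
pairs produced by {\tt RunSimulation}):

```python
import random

def JitterCurve(curve, dy=0.2):
    # shift the whole curve vertically by one random offset,
    # leaving the read counts unchanged
    offset = random.uniform(-dy, dy)
    return [(k, num_new + offset) for k, num_new in curve]

curve = [(1, 0), (2, 1), (3, 1)]
jittered = JitterCurve(curve)
```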
By inspection we can estimate that after
400 more reads we are likely to find 2--6 new species.


\section{Joint posterior}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
\caption{Distributions of the number of new species conditioned on
the number of additional reads.}
\label{species-cond}
\end{figure}

We can use these simulations to estimate the
joint distribution of \verb"num_new" and {\tt k}, and from that
we can get the distribution of \verb"num_new" conditioned on any
value of {\tt k}.
\index{joint distribution}

\begin{verbatim}
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
\end{verbatim}

{\tt MakeJointPredictive} makes a Joint object, which is a
Pmf whose values are tuples.
\index{Joint object}

{\tt curves} is a list of rarefaction curves created by
{\tt RunSimulation}.  Each curve contains a list of pairs of
{\tt k} and \verb"num_new".
\index{rarefaction curve}

The resulting joint distribution is a map from each pair to
its probability of occurring.  Given the joint distribution, we
can use {\tt Joint.Conditional} to
get the distribution of \verb"num_new" conditioned on {\tt k}
(see Section~\ref{conditional}).
\index{conditional distribution}

{\tt Subject.MakeConditionals} takes a list of {\tt ks}
and computes the conditional distribution of \verb"num_new"
for each {\tt k}.  The result is a list of Cdf objects.

\begin{verbatim}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
\end{verbatim}

Figure~\ref{species-cond} shows the results.
After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5.  After 800 reads, we expect to see 3 to 12 new
species.


\section{Coverage}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}

The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}

To answer this question, we need a version of {\tt RunSimulation}
that computes the fraction of observed species rather than the
number of new species.

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
\end{verbatim}

Next we loop through each curve and make a dictionary, {\tt d},
that maps from the number of additional reads, {\tt k}, to
a list of {\tt fracs}; that is, a list of values for the
coverage achieved after {\tt k} reads.

\begin{verbatim}
    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.iteritems():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
\end{verbatim}

Then for each value of {\tt k} we make a Cdf of {\tt fracs}; this Cdf
represents the distribution of coverage after {\tt k} reads.

Remember that the CDF tells you the probability of falling below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it.
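In code, the complementary probability falls out directly (a
standalone sketch, not using {\tt thinkbayes}; the data are made up):

```python
def ProbGreater(fracs, threshold):
    # fraction of simulated coverages that exceed the threshold;
    # this is the complementary CDF evaluated at the threshold
    count = sum(1 for frac in fracs if frac > threshold)
    return count / float(len(fracs))

# hypothetical coverages from 10 simulated curves at one value of k
fracs = [0.84, 0.87, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.97]
print(ProbGreater(fracs, 0.9))  # 0.6
```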
Figure~\ref{species-frac} shows
complementary CDFs for a range of values of {\tt k}.
\index{complementary CDF}

To read this figure, select the level of coverage you want to achieve
along the $x$-axis.  As an example, choose 90\%.
\index{coverage}

Now you can read up the chart to find the probability of achieving
90\% coverage after {\tt k} reads.  For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage.  With 1000 reads, you
have a 90\% chance of getting 90\% coverage.

With that, we have answered the four questions that make up the unseen
species problem.  To validate the algorithms in this chapter with
real data, I had to deal with a few more details.  But
this chapter is already too long, so I won't discuss them here.

You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.

You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.


\section{Discussion}

The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution.  So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier.
I'm very happy about that.

My goal for this book is to present three related ideas:

\begin{itemize}

\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
the idea of using probability distributions to represent uncertain
beliefs, using data to update those distributions, and using the
results to make predictions and inform decisions.

\item {\bf A computational approach}: The premise of this book is that
it is easier to understand Bayesian analysis using computation
rather than math, and easier to implement Bayesian methods with
reusable building blocks that can be rearranged to solve real-world
problems quickly.

\item {\bf Iterative modeling}: Most real-world problems involve
modeling decisions and trade-offs between realism and complexity.
It is often impossible to know ahead of time what factors should be
included in the model and which can be abstracted away.  The best
approach is to iterate, starting with simple models and adding
complexity gradually, using each model to validate the others.

\end{itemize}

These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.

If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work.
I hope you find
them useful; let me know how it goes!



%\chapter{Future chapters}

%Bayesian regression (hybrid version with resampling?)
%\url{http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/}

%Change point detection:

%Deconvolution: Estimating round trip times

%Bayesian search

%Extension of the Euro problem: evaluating reddit items and redditors
%\url{http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/}

%Charles Darwin problem (capture-tag-recapture)
%\url{http://maximum-entropy-blog.blogspot.com/2012/04/capture-recapture-and-charles-darwin.html}

% http://camdp.com/blogs/how-solve-price-rights-showdown

% https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

% http://blog.yhathq.com/posts/estimating-user-lifetimes-with-pymc.html

\printindex

\end{document}