% LaTeX source for ``Think Bayes: Bayesian Statistics Made Simple''
% Copyright 2012 Allen B. Downey.

% License: Creative Commons Attribution-NonCommercial 3.0 Unported License.
% http://creativecommons.org/licenses/by-nc/3.0/
%

\documentclass[12pt]{book}
\usepackage[width=5.5in,height=8.5in,
hmarginratio=3:2,vmarginratio=1:1]{geometry}

% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)

\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{mathpazo}
\usepackage{url}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{makeidx}
\usepackage{setspace}
\usepackage{hevea}
\usepackage{upquote}
\usepackage{fancyhdr}
\usepackage[bookmarks]{hyperref}

\title{Think Bayes}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Bayes: Bayesian Statistics Made Simple}
\newcommand{\theversion}{1.0.8}

% these styles get translated in CSS for the HTML version
\newstyle{a:link}{color:black;}
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
\newstyle{img}{border:0px}

% change the arrows in the HTML version
\setlinkstext
{\imgsrc[ALT="Previous"]{back.png}}
{\imgsrc[ALT="Up"]{up.png}}
{\imgsrc[ALT="Next"]{next.png}}

\makeindex

\newif\ifplastex
\plastexfalse

\begin{document}

\frontmatter

\ifplastex

\else
\fi

\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}

\ifplastex
\usepackage{localdef}
\maketitle

\else

\newtheorem{exercise}{Exercise}[chapter]

\input{latexonly}

\begin{latexonly}

\newtheoremstyle{exercise}% name of the style to be used
{\topsep}% measure of space to leave above the theorem. E.g.: 3pt
{\topsep}% measure of space to leave below the theorem.
% E.g.: 3pt
{}% name of font to use in the body of the theorem
{0pt}% measure of space to indent
{\bfseries}% name of head font
{}% punctuation between head and body
{ }% space after theorem head; " " = normal interword space
{}% Manually specify head

\theoremstyle{exercise}

\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}

% TITLE PAGES FOR LATEX VERSION

%-half title--------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vfill

\end{flushright}

%--verso------------------------------------------------------

\blankpage
\blankpage

%--title page--------------------------------------------------
\pagebreak
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vspace{1in}

{\Large
Allen B. Downey\\
}

\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

\vfill

\end{flushright}

%--copyright--------------------------------------------------
\pagebreak
\thispagestyle{empty}

Copyright \copyright ~2012 Allen B. Downey.

\vspace{0.2in}

\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported
License, which is available at \url{http://creativecommons.org/licenses/by-nc/3.0/}.

\vspace{0.2in}

\end{latexonly}

% HTMLONLY

\begin{htmlonly}

% TITLE PAGE FOR HTML VERSION

{\Large \thetitle}

{\large Allen B.
Downey}

Version \theversion

\vspace{0.25in}

Copyright 2012 Allen B. Downey

\vspace{0.25in}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0
Unported License, which is available at
\url{http://creativecommons.org/licenses/by-nc/3.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

\fi
% END OF THE PART WE SKIP FOR PLASTEX

\chapter{Preface}
\label{preface}

\section{My theory, which is mine}

The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.

Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.

I think this presentation is easier to understand, at least for people with
programming skills. It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.

Also, it provides a smooth development path from simple examples to
real-world problems. Chapter~\ref{estimation} is a good example. It
starts with a simple example involving dice, one of the staples of
basic probability.
From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's
{\it Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.


\section{Modeling and approximation}

Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}

For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}

In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.

I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).

Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors.
But for
real-world problems, numerical errors are almost always
smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.

On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.

So I recommend a general process with these steps:

\begin{enumerate}

\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.

\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.

\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider.
You can review your code
and look for optimizations; for example, if you cache previously
computed results you might be able to avoid redundant computation.
Or you can look for analytic methods that yield computational
shortcuts.

\end{enumerate}

One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.

Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}


\section{Working with the code}
\label{download}

The code and sound samples used in this book are available from
\url{https://github.com/AllenDowney/ThinkBayes}. Git is a version
control system that allows you to keep track of the files that
make up a project. A collection of files under Git's control is
called a ``repository''. GitHub is a hosting service that provides
storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to
work with the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}

\item Or you could clone
my repository.
You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.

\end{itemize}

The code for the first edition of the book works with Python 2.
If you are using Python 3, you might want to use the updated code
in \url{https://github.com/AllenDowney/ThinkBayes2} instead.

I developed this book using Anaconda from
Continuum Analytics, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it does a user-level
installation, not system-level, so you don't need administrative
privileges. You can
download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation,
\url{http://www.scipy.org/};
\index{SciPy}

\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}

Many of the examples in this book use classes and functions defined in
{\tt thinkbayes.py}.
Some of them also use {\tt thinkplot.py}, which
provides wrappers for some of the functions in {\tt pyplot}, which is
part of {\tt matplotlib}.


\section{Code style}

Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}

Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".

I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.

Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.

And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.


\section{Prerequisites}

There are several excellent modules for doing Bayesian statistics in
Python, including {\tt pymc} and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.

Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code.
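The flavor of the code in the later chapters is easy to preview,
though. As a purely illustrative sketch (not the actual
{\tt thinkbayes.py} interface), a probability mass function can be
stored as a plain Python dictionary that maps each value to its
probability:

```python
# Illustrative sketch only: a PMF as a plain dictionary mapping
# values to probabilities. The book's own classes are more capable.

def normalize(pmf):
    """Scale the probabilities so they add up to 1."""
    total = sum(pmf.values())
    for value in pmf:
        pmf[value] /= total
    return pmf

# A PMF for a fair six-sided die: each outcome gets equal weight,
# then normalization turns the weights into probabilities.
die = normalize({side: 1 for side in range(1, 7)})
print(die[3])  # each side has probability 1/6
```

The dictionary-based {\tt Pmf} described next wraps this idea in a
friendlier interface.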
Chapter~\ref{compstat} introduces {\tt Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces {\tt Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.

In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).



\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!

\small

\begin{itemize}

\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods.
With his473permission, I use several problems from474his book as examples.475476\item This book also benefited from my interactions with Sanjoy477Mahajan, especially in fall 2012, when I audited his class on478Bayesian Inference at Olin College.479480\item I wrote parts of this book during project nights with the Boston481Python User Group, so I would like to thank them for their482company and pizza.483484\item Jonathan Edwards sent in the first typo.485486\item George Purkins found a markup error.487488\item Olivier Yiptong sent several helpful suggestions.489490\item Yuriy Pasichnyk found several errors.491492\item Kristopher Overholt sent a long list of corrections and suggestions.493494\item Robert Marcus found a misplaced {\it i}.495496\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.497498\item Markus Dobler pointed out that drawing cookies from a bowl499with replacement is an unrealistic scenario.500501\item Tom Pollard and Paul A. Giannaros spotted a version problem with502some of the numbers in the train example.503504\item Ram Limbu found a typo and suggested a clarification.505506\item In spring 2013, students in my class, Computational Bayesian507Statistics, made many helpful corrections and suggestions: Kai508Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun509Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,510Brendan Ritter, and Evan Simpson.511512\item Greg Marra and Matt Aasted helped me clarify the discussion of513{\it The Price is Right} problem.514515\item Marcus Ogren pointed out that the original statement of the516locomotive problem was ambiguous.517518\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the519book and found many opportunities for improvement.520521\item James Lawry spotted a math error.522523\item Ben Kahle found a reference to the wrong figure.524525\item Jeffrey Law found an inconsistency between the text and the code.526527\item Linda Pescatore found a typo 
and made some helpful suggestions.

\item Tomasz Mi\k{a}sko sent many excellent corrections and suggestions.

% ENDCONTRIB

\end{itemize}

\normalsize

\clearemptydoublepage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\clearemptydoublepage

\end{latexonly}

% START THE BOOK
\mainmatter

\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}


\chapter{Bayes's Theorem}
\label{intro}

\section{Conditional probability}

The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}

A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction. The value
1 represents certainty that a fact is true, or that a prediction
will come true. The value 0 represents certainty
that the fact is false.
\index{degree of belief}

Intermediate values represent degrees of certainty. The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not. For example, the probability that a tossed
coin lands face up is very close to 50\%.
\index{coin toss}

A conditional probability is a probability based on some background
information. For example, I want to know the probability
that I will have a heart attack in the next year. According to the
CDC, ``Every year about 785,000 Americans have a first coronary attack.
(\url{http://www.cdc.gov/heartdisease/facts.htm})''
\index{heart attack}

The U.S.
population is about 311 million, so the probability that a
randomly chosen American will have a heart attack in the next year is
roughly 0.3\%.

But I am not a randomly chosen American. Epidemiologists have
identified many factors that affect the risk of heart attacks;
depending on those factors, my risk might be higher or lower than
average.

I am male, 45 years old, and I have borderline high cholesterol.
Those factors increase my chances. However, I have low blood pressure
and I don't smoke, and those factors decrease my chances.

Plugging everything into the online calculator at
\url{http://cvdrisk.nhlbi.nih.gov/calculator.asp}, I find that my
risk of a heart attack in the next year is about 0.2\%, less than the
national average. That value is a conditional probability, because it
is based on a number of factors that make up my ``condition.''

The usual notation for conditional probability is \p{A|B}, which
is the probability of $A$ given that $B$ is true. In this
example, $A$ represents the prediction that I will have a heart
attack in the next year, and $B$ is the set of conditions I listed.


\section{Conjoint probability}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true.
I write \p{A \AND B} to mean the
probability that $A$ and $B$ are both true.
\index{conjoint probability}
\index{probability!conjoint}

If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.

But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the outcome of the first event does
not change the probability of the second. Or, more formally,
\p{B|A} = \p{B}.
\index{independence}
\index{dependence}

Here is a different example where the events are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow. If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.

In general, the probability of a conjunction is
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
for any $A$ and $B$. So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.


\section{The cookie problem}

We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.} Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.
\index{Bayes's theorem}
\index{cookie problem}

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla.
What is667the probability that it came from Bowl 1?668669This is a conditional probability; we want \p{\T{Bowl 1} |670\T{vanilla}}, but it is not obvious how to compute it. If I asked a671different question---the probability of a vanilla cookie given Bowl6721---it would be easy:673%674\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]675%676Sadly, \p{A|B} is {\em not} the same as \p{B|A}, but there677is a way to get from one to the other: Bayes's theorem.678679680\section{Bayes's theorem}681682At this point we have everything we need to derive Bayes's theorem.683We'll start with the observation that conjunction is commutative; that is684%685\[ \p{A \AND B} = \p{B \AND A} \]686%687for any events $A$ and $B$.688\index{Bayes's theorem!derivation}689\index{conjunction}690691Next, we write the probability of a conjunction:692%693\[ \p{A \AND B} = \p{A}~\p{B|A} \]694%695Since we have not said anything about what $A$ and $B$ mean, they696are interchangeable. Interchanging them yields697%698\[ \p{B \AND A} = \p{B}~\p{A|B} \]699%700That's all we need. Pulling those pieces together, we get701%702\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]703%704Which means there are two ways to compute the conjunction.705If you have \p{A}, you multiply by the conditional706probability \p{B|A}. Or you can do it the other way around; if you707know \p{B}, you multiply by \p{A|B}. Either way you should get708the same thing.709710Finally we can divide through by \p{B}:711%712\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]713%714And that's Bayes's theorem! It might not look like much, but715it turns out to be surprisingly powerful.716717For example, we can use it to solve the cookie problem. I'll write718$B_1$ for the hypothesis that the cookie came from Bowl 1719and $V$ for the vanilla cookie. Plugging in Bayes's theorem720we get721%722\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]723%724The term on the left is what we want: the probability of Bowl 1, given725that we chose a vanilla cookie. 
The terms on the right are:

\begin{itemize}

\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned
by what kind of cookie we got. Since the problem says we chose a
bowl at random, we can assume $\p{B_1} = 1/2$.

\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.

\item \p{V}: This is the probability of drawing a vanilla cookie from
either bowl. Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie. Between the two bowls there are
50 vanilla and 30
chocolate cookies, so \p{V} = 5/8.

\end{itemize}

Putting it together, we have
%
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
%
which reduces to 3/5. So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
\index{evidence}

This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from \p{B|A} to \p{A|B}. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.


\section{The diachronic interpretation}

There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.
\index{diachronic interpretation}

This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}.
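As a quick mechanical check of the cookie-problem arithmetic from the
previous section, the same update can be done in a few lines of plain
Python. This is an illustrative sketch only, not the book's code (the
book's {\tt Suite} class does this properly in Chapter~\ref{estimation}):

```python
# Hypothetical sketch: the cookie problem as prior times likelihood,
# followed by normalization. Not the book's actual code.

priors = {'Bowl 1': 0.5, 'Bowl 2': 0.5}
likelihoods = {'Bowl 1': 30 / 40, 'Bowl 2': 20 / 40}  # p(vanilla | bowl)

# Multiply each prior by the likelihood of the vanilla cookie...
unnorm = {h: priors[h] * likelihoods[h] for h in priors}

# ...and divide through by the normalizing constant p(V) = 5/8.
p_V = sum(unnorm.values())
posteriors = {h: p / p_V for h, p in unnorm.items()}

print(posteriors['Bowl 1'])  # 0.6, the 3/5 computed above
```

The multiply-then-normalize pattern in this sketch is the update
performed, in one form or another, throughout the book.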
``Diachronic'' means that something
is happening over time; in this case
the probability of the hypotheses changes, over time, as
we see new data.

Rewriting Bayes's theorem with $H$ and $D$ yields:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
In this interpretation, each term has a name:
\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}

\begin{itemize}

\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.

\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.

\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.

\item \p{D} is the probability of the data under any hypothesis,
called the {\bf normalizing constant}.

\end{itemize}

Sometimes we can compute the prior based on background
information. For example, the cookie problem specifies that we choose
a bowl at random with equal probability.

In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.
\index{subjective prior}

The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.

The normalizing constant can be tricky.
It is supposed to be the
probability of seeing the data under any hypothesis at all, but in the
most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses
that are
\index{mutually exclusive}
\index{collectively exhaustive}

\begin{description}

\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and

\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.

\end{description}

I use the word {\bf suite} for a set of hypotheses that has these
properties.
\index{suite}

In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.

In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
%
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier by mentally combining the two
bowls.
\index{total probability}


\newcommand{\MM}{M\&M}

\section{The \MM~problem}

\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}

In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.

Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag.
One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?

This problem is similar to the cookie problem, with the twist that I
draw one sample from each bowl/bag. This problem also gives me a
chance to demonstrate the table method, which is useful for solving
problems like this on paper. In the next chapter we will
solve them computationally.
\index{table method}

The first step is to enumerate the hypotheses. The bag the yellow
\MM~came from I'll call Bag 1; I'll call the other Bag 2. So
the hypotheses are:

\begin{itemize}

\item A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.

\item B: Bag 1 is from 1996 and Bag 2 from 1994.

\end{itemize}

Now we construct a table with a row for each hypothesis and a
column for each term in Bayes's theorem:

\begin{tabular}{|c|c|c|c|c|}
\hline
& Prior & Likelihood & & Posterior \\
& \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/2 & (20)(20) & 200 & 20/27 \\
B & 1/2 & (14)(10) & 70 & 7/27 \\
\hline
\end{tabular}

The first column has the priors.
Based on the statement of the problem,
it is reasonable to choose $\p{A} = \p{B} = 1/2$.

The second column has the likelihoods, which follow from the
information in the problem. For example, if $A$ is true, the yellow
\MM~came from the 1994 bag with probability 20\%, and the green came
from the 1996 bag with probability 20\%. If $B$ is true, the yellow
\MM~came from the 1996 bag with probability 14\%, and the green came
from the 1994 bag with probability 10\%.
Because the selections are
independent, we get the conjoint probability by multiplying.
\index{independence}

The third column is just the product of the previous two.
The sum of this column, 270, is the normalizing constant.
To get the last column, which contains the posteriors, we divide
the third column by the normalizing constant.

That's it.
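The table lends itself to a computational check as well. The following
is an illustrative sketch (not the book's code) that reproduces the
table, with the likelihoods expressed in percentage points; the
constant factor this introduces cancels during normalization:

```python
# Hypothetical sketch of the table method for the M&M problem.
# Likelihoods are in percentage points; the constant factor cancels
# when we normalize.

priors = {'A': 0.5, 'B': 0.5}
likelihoods = {'A': 20 * 20,   # yellow from 1994, green from 1996
               'B': 14 * 10}   # yellow from 1996, green from 1994

unnorm = {h: priors[h] * likelihoods[h] for h in priors}  # 200 and 70
total = sum(unnorm.values())                              # 270
posteriors = {h: p / total for h, p in unnorm.items()}

print(posteriors['A'])  # 20/27, about 0.74
```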
Simple, right?

Well, you might be bothered by one detail.  I write \p{D|H}
in terms of percentages, not probabilities, which means it
is off by a factor of 10,000.  But that
cancels out when we divide through by the normalizing constant, so
it doesn't affect the result.
\index{normalizing constant}

When the set of hypotheses is mutually exclusive and collectively
exhaustive, you can multiply the likelihoods by any factor, if it is
convenient, as long as you apply the same factor to the entire column.


\section{The Monty Hall problem}

The Monty Hall problem might be the most contentious question in
the history of probability.  The scenario is simple, but the correct
answer is so counterintuitive that many people just can't accept
it, and many smart people have embarrassed themselves not just by
getting it wrong but by arguing the wrong side, aggressively,
in public.
\index{Monty Hall problem}

Monty Hall was the original host of the game show {\em Let's Make a
Deal}.  The Monty Hall problem is based on one of the regular
games on the show.  If you are on the show, here's what happens:

\begin{itemize}

\item Monty shows you three closed doors and tells you that there is a
prize behind each door: one prize is a car, the other two are less
valuable prizes like peanut butter and fake finger nails.  The
prizes are arranged at random.

\item The object of the game is to guess which door has the car.  If
you guess right, you get to keep the car.

\item You pick a door, which we will call Door A.  We'll call the
other doors B and C.

\item Before opening the door you chose, Monty increases the
suspense by opening either Door B or C, whichever does not
have the car.
(If the car is actually behind Door A, Monty can
safely open B or C, so he chooses one at random.)

\item Then Monty offers you the option to stick with your original
choice or switch to the one remaining unopened door.

\end{itemize}

The question is, should you ``stick'' or ``switch'', or does it
make no difference?
\index{stick}
\index{switch}
\index{intuition}

Most people have the strong intuition that it makes no difference.
There are two doors left, they reason, so the chance that the car
is behind Door A is 50\%.

But that is wrong.  In fact, the chance of winning if you stick
with Door A is only 1/3; if you switch, your chances are 2/3.

By applying Bayes's theorem, we can break this problem into simple
pieces, and maybe convince ourselves that the correct answer is,
in fact, correct.

To start, we should make a careful statement of the data.  In
this case $D$ consists of two parts: Monty chooses Door B
{\em and} there is no car there.

Next we define three hypotheses: $A$, $B$, and $C$ represent the
hypotheses that the car is behind Door A, Door B, or Door C,
respectively.
Again, let's apply the table method:

\begin{tabular}{|c|c|c|c|c|}
\hline
  & Prior   & Likelihood &                 & Posterior \\
  & \p{H}   & \p{D|H}    & \p{H}~\p{D|H}   & \p{H|D} \\
\hline
A & 1/3     & 1/2        & 1/6             & 1/3 \\
B & 1/3     & 0          & 0               & 0 \\
C & 1/3     & 1          & 1/3             & 2/3 \\
\hline
\end{tabular}

Filling in the priors is easy because we are told that the prizes
are arranged at random, which suggests that the car is equally
likely to be behind any door.

Figuring out the likelihoods takes some thought, but with reasonable
care we can be confident that we have it right:

\begin{itemize}

\item If the car is actually behind A, Monty could safely open Doors B
or C.  So the probability that he chooses B is 1/2.
And since the
car is actually behind A, the probability that the car is not behind
B is 1.

\item If the car is actually behind B, Monty has to open Door C, so
the probability that he opens Door B is 0.

\item Finally, if the car is behind Door C, Monty opens B with
probability 1 and finds no car there with probability 1.

\end{itemize}

Now the hard part is over; the rest is just arithmetic.  The
sum of the third column is 1/2.  Dividing through yields
$\p{A|D} = 1/3$ and $\p{C|D} = 2/3$.  So you are better off switching.

There are many variations of the Monty Hall problem.  One of the
strengths of the Bayesian approach is that it generalizes to handle
these variations.

For example, suppose that Monty always chooses B if he can, and
only chooses C if he has to (because the car is behind B).  In
that case the revised table is:

\begin{tabular}{|c|c|c|c|c|}
\hline
  & Prior   & Likelihood &                 & Posterior \\
  & \p{H}   & \p{D|H}    & \p{H}~\p{D|H}   & \p{H|D} \\
\hline
A & 1/3     & 1          & 1/3             & 1/2 \\
B & 1/3     & 0          & 0               & 0 \\
C & 1/3     & 1          & 1/3             & 1/2 \\
\hline
\end{tabular}

The only change is \p{D|A}.  If the car is behind $A$, Monty can
choose to open B or C.  But in this variation he always chooses
B, so $\p{D|A} = 1$.

As a result, the likelihoods are the same for $A$ and $C$, and the
posteriors are the same: $\p{A|D} = \p{C|D} = 1/2$.  In this case, the
fact that Monty chose B reveals no information about the location of
the car, so it doesn't matter whether the contestant sticks or
switches.

On the other hand, if he had opened $C$, we would know $\p{B|D} = 1$.

I included the Monty Hall problem in this chapter because I think it
is fun, and because Bayes's theorem makes the complexity of the
problem a little more manageable.
But it is not a typical use of
Bayes's theorem, so if you found it confusing, don't worry!

\section{Discussion}

For many problems involving conditional probability, Bayes's theorem
provides a divide-and-conquer strategy.  If \p{A|B} is hard to
compute, or hard to measure experimentally, check whether it might be
easier to compute the other terms in Bayes's theorem, \p{B|A}, \p{A}
and \p{B}.
\index{divide-and-conquer}

If the Monty Hall problem is your idea of fun, I have collected a
number of similar problems in an article called ``All your Bayes are
belong to us,'' which you can read at
\url{http://allendowney.blogspot.com/2011/10/all-your-bayes-are-belong-to-us.html}.


\chapter{Computational Statistics}
\label{compstat}

\section{Distributions}

In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}

For example, if you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}

As another example, you might be interested in how many times each
word appears in common English usage.  You could build a distribution
that includes each word and how many times it appears.
\index{word frequency}

To represent a distribution in Python, you could use a dictionary that
maps from each value to its probability.  I have written a class
called {\tt Pmf} that uses a Python dictionary in exactly that way,
and provides a number of useful methods.
I called the class Pmf in reference to
a {\bf probability mass function}, which is a way to
represent a distribution mathematically.
\index{probability mass function}
\index{Pmf class}

{\tt Pmf} is defined in a Python module I wrote to accompany this
book, {\tt thinkbayes.py}.  You can download it from
\url{http://thinkbayes.com/thinkbayes.py}.
For more information
see Section~\ref{download}.

To use {\tt Pmf} you can import it like this:

\begin{verbatim}
from thinkbayes import Pmf
\end{verbatim}

The following code builds a Pmf to represent the distribution
of outcomes for a six-sided die:

\begin{verbatim}
pmf = Pmf()
for x in [1,2,3,4,5,6]:
    pmf.Set(x, 1/6.0)
\end{verbatim}

\verb"Pmf" creates an empty Pmf with no values.  The
\verb"Set" method sets the probability associated with each
value to $1/6$.

Here's another example that counts the number of times each word
appears in a sequence:

\begin{verbatim}
pmf = Pmf()
for word in word_list:
    pmf.Incr(word, 1)
\end{verbatim}

\verb"Incr" increases the ``probability'' associated with each
word by 1.  If a word is not already in the Pmf, it is added.

I put ``probability'' in quotes because in this example, the
probabilities are not normalized; that is, they do not add up to 1.
So they are not true probabilities.

But in this example the word counts are proportional to the
probabilities.  So after we count all the words, we can compute
probabilities by dividing through by the total number of words.
{\tt1163Pmf} provides a method, \verb"Normalize", that does exactly that:1164\index{Pmf methods}11651166\begin{verbatim}1167pmf.Normalize()1168\end{verbatim}11691170Once you have a Pmf object, you can ask for the probability1171associated with any value:1172\index{Prob}11731174\begin{verbatim}1175print pmf.Prob('the')1176\end{verbatim}11771178And that would print the frequency of the word ``the'' as a fraction1179of the words in the list.11801181Pmf uses a Python dictionary to store the values and their1182probabilities, so the values in the Pmf can be any hashable type.1183The probabilities can be any numerical type, but they are usually1184floating-point numbers (type \verb"float").118511861187\section{The cookie problem}11881189In the context of Bayes's theorem, it is natural to use a Pmf1190to map from each hypothesis to its probability. In the cookie1191problem, the hypotheses are $B_1$ and $B_2$. In Python, I1192represent them with strings:1193\index{cookie problem}11941195\begin{verbatim}1196pmf = Pmf()1197pmf.Set('Bowl 1', 0.5)1198pmf.Set('Bowl 2', 0.5)1199\end{verbatim}12001201This distribution, which contains the priors for each hypothesis,1202is called (wait for it) the {\bf prior distribution}.1203\index{prior distribution}12041205To update the distribution based on new data (the vanilla cookie),1206we multiply each prior by the corresponding likelihood. The likelihood1207of drawing a vanilla cookie from Bowl 1 is 3/4. The likelihood1208for Bowl 2 is 1/2.1209\index{Mult}12101211\begin{verbatim}1212pmf.Mult('Bowl 1', 0.75)1213pmf.Mult('Bowl 2', 0.5)1214\end{verbatim}12151216\verb"Mult" does what you would expect. 
It gets the probability
for the given hypothesis and multiplies by the given likelihood.

After this update, the distribution is no longer normalized, but
because these hypotheses are mutually exclusive and collectively
exhaustive, we can {\bf renormalize}:
\index{renormalize}

\begin{verbatim}
pmf.Normalize()
\end{verbatim}

The result is a distribution that contains the posterior probability
for each hypothesis, which is called (wait now) the
{\bf posterior distribution}.
\index{posterior distribution}

Finally, we can get the posterior probability for Bowl 1:

\begin{verbatim}
print pmf.Prob('Bowl 1')
\end{verbatim}

And the answer is 0.6.  You can download this example
from \url{http://thinkbayes.com/cookie.py}.  For more information
see Section~\ref{download}.
\index{cookie.py}


\section{The Bayesian framework}
\label{framework}

\index{Bayesian framework}
Before we go on to other problems, I want to rewrite the code
from the previous section to make it more general.  First I'll
define a class to encapsulate the code related to this problem:

\begin{verbatim}
class Cookie(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

A Cookie object is a Pmf that maps from hypotheses to their
probabilities.  The \verb"__init__" method gives each hypothesis
the same prior probability.
As in the previous section, there are
two hypotheses:

\begin{verbatim}
hypos = ['Bowl 1', 'Bowl 2']
pmf = Cookie(hypos)
\end{verbatim}

\verb"Cookie" provides an \verb"Update" method that takes
data as a parameter and updates the probabilities:
\index{Update}

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

\verb"Update" loops through each hypothesis in the suite
and multiplies its probability by the likelihood of the
data under the hypothesis, which is computed by \verb"Likelihood":
\index{Likelihood}

\begin{verbatim}
    mixes = {
        'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
        'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
        }

    def Likelihood(self, data, hypo):
        mix = self.mixes[hypo]
        like = mix[data]
        return like
\end{verbatim}

\verb"Likelihood" uses \verb"mixes", which is a dictionary
that maps from the name of a bowl to the mix of cookies in
the bowl.

Here's what the update looks like:

\begin{verbatim}
pmf.Update('vanilla')
\end{verbatim}

And then we can print the posterior probability of each hypothesis:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

The result is

\begin{verbatim}
Bowl 1 0.6
Bowl 2 0.4
\end{verbatim}

which is the same as what we got before.  This code is more complicated
than what we saw in the previous section.  One advantage is that it
generalizes to the case where we draw more than one cookie from the
same bowl (with replacement):

\begin{verbatim}
dataset = ['vanilla', 'chocolate', 'vanilla']
for data in dataset:
    pmf.Update(data)
\end{verbatim}

The other advantage is that it provides a framework for solving many
similar problems.
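If you want to check the arithmetic of the three-cookie update by hand, here is a minimal standalone version of the same computation in plain Python.  The {\tt update} function is an illustrative stand-in for the Pmf machinery, not the implementation in {\tt thinkbayes.py}:

```python
# Minimal stand-in for the Pmf-based update (illustrative only).
mixes = {
    'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
    'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
}

def update(pmf, data):
    # Multiply each hypothesis by the likelihood of the data,
    # then renormalize so the probabilities add up to 1.
    for hypo in pmf:
        pmf[hypo] *= mixes[hypo][data]
    total = sum(pmf.values())
    for hypo in pmf:
        pmf[hypo] /= total

pmf = {'Bowl 1': 0.5, 'Bowl 2': 0.5}
for data in ['vanilla', 'chocolate', 'vanilla']:
    update(pmf, data)
# pmf['Bowl 1'] is now 9/17, about 0.53
```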
In the next section we'll solve the Monty Hall
problem computationally and then see what parts of the framework are
the same.

The code in this section is available from
\url{http://thinkbayes.com/cookie2.py}.
For more information
see Section~\ref{download}.

\section{The Monty Hall problem}

To solve the Monty Hall problem, I'll define a new class:
\index{Monty Hall problem}

\begin{verbatim}
class Monty(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

So far \verb"Monty" and \verb"Cookie" are exactly the same.
And the code that creates the Pmf is the same, too, except for
the names of the hypotheses:

\begin{verbatim}
hypos = 'ABC'
pmf = Monty(hypos)
\end{verbatim}

Calling \verb"Update" is pretty much the same:

\begin{verbatim}
data = 'B'
pmf.Update(data)
\end{verbatim}

And the implementation of \verb"Update" is exactly the same:

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

The only part that requires some work is \verb"Likelihood":

\begin{verbatim}
    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

Finally, printing the results is the same:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

And the answer is

\begin{verbatim}
A 0.333333333333
B 0.0
C 0.666666666667
\end{verbatim}

In this example, writing \verb"Likelihood" is a little complicated,
but the framework of the Bayesian update is simple.
The code in
this section is available from \url{http://thinkbayes.com/monty.py}.
For more information
see Section~\ref{download}.

\section{Encapsulating the framework}

\index{Suite class}
Now that we see what elements of the framework are the same, we
can encapsulate them in an object---a \verb"Suite" is a \verb"Pmf"
that provides \verb"__init__", \verb"Update", and \verb"Print":

\begin{verbatim}
class Suite(Pmf):
    """Represents a suite of hypotheses and their probabilities."""

    def __init__(self, hypo=tuple()):
        """Initializes the distribution."""

    def Update(self, data):
        """Updates each hypothesis based on the data."""

    def Print(self):
        """Prints the hypotheses and their probabilities."""
\end{verbatim}

The implementation of \verb"Suite" is in \verb"thinkbayes.py".  To use
\verb"Suite", you should write a class that inherits from it and
provides \verb"Likelihood".  For example, here is the solution to the
Monty Hall problem rewritten to use \verb"Suite":

\begin{verbatim}
from thinkbayes import Suite

class Monty(Suite):

    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

And here's the code that uses this class:

\begin{verbatim}
suite = Monty('ABC')
suite.Update('B')
suite.Print()
\end{verbatim}

You can download this example from
\url{http://thinkbayes.com/monty2.py}.
For more information
see Section~\ref{download}.


\section{The \MM~problem}

\index{M and M problem}
We can use the \verb"Suite" framework to solve the \MM~problem.
Writing the \verb"Likelihood" function is tricky, but everything
else is straightforward.

First I need to encode the color mixes from before and
after 1995:

\begin{verbatim}
mix94 = dict(brown=30,
             yellow=20,
             red=20,
             green=10,
             orange=10,
             tan=10)

mix96 = dict(blue=24,
             green=20,
             orange=16,
             yellow=14,
             red=13,
             brown=13)
\end{verbatim}

Then I have to encode the hypotheses:

\begin{verbatim}
hypoA = dict(bag1=mix94, bag2=mix96)
hypoB = dict(bag1=mix96, bag2=mix94)
\end{verbatim}

\verb"hypoA" represents the hypothesis that Bag 1 is from
1994 and Bag 2 from 1996.  \verb"hypoB" is the other way
around.

Next I map from the name of the hypothesis to the representation:

\begin{verbatim}
hypotheses = dict(A=hypoA, B=hypoB)
\end{verbatim}

And finally I can write \verb"Likelihood".  In this case
the hypothesis, \verb"hypo", is a string, either \verb"A" or \verb"B".
The data is a tuple that specifies a bag
and a color.

\begin{verbatim}
    def Likelihood(self, data, hypo):
        bag, color = data
        mix = self.hypotheses[hypo][bag]
        like = mix[color]
        return like
\end{verbatim}

Here's the code that creates the suite and updates it:

\begin{verbatim}
suite = M_and_M('AB')

suite.Update(('bag1', 'yellow'))
suite.Update(('bag2', 'green'))

suite.Print()
\end{verbatim}

And here's the result:

\begin{verbatim}
A 0.740740740741
B 0.259259259259
\end{verbatim}

The posterior probability of A is approximately $20/27$, which is what
we got before.

The code in this section is available from
\url{http://thinkbayes.com/m_and_m.py}.  For more information see
Section~\ref{download}.

\section{Discussion}

This chapter presents the Suite class, which encapsulates the
Bayesian update framework.

{\tt Suite} is an {\bf abstract type}, which means that it defines the
interface a Suite is supposed to have, but does not provide a complete
implementation.
The {\tt Suite} interface includes {\tt Update} and
{\tt Likelihood}, but the {\tt Suite} class only provides an
implementation of {\tt Update}, not {\tt Likelihood}.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

A {\bf concrete type} is a class that extends an abstract parent
class and provides an implementation of the missing methods.
For example, {\tt Monty} extends {\tt Suite}, so it inherits
{\tt Update} and provides {\tt Likelihood}.

If you are familiar with
design patterns, you might recognize this as an example of the
template method pattern.
You can read about this pattern at
\url{http://en.wikipedia.org/wiki/Template_method_pattern}.
\index{template method pattern}

Most of the examples in the following chapters follow the same
pattern; for each problem we define a new class that extends {\tt
Suite}, inherits {\tt Update}, and provides {\tt Likelihood}.  In a
few cases we override {\tt Update}, usually to improve performance.

\section{Exercises}

\begin{exercise}

In Section~\ref{framework} I said that the solution to the cookie
problem generalizes to the case where we draw multiple cookies
with replacement.

But in the more likely scenario where we eat the cookies we draw,
the likelihood of each draw depends on the previous draws.

Modify the solution in this chapter to handle selection without
replacement.  Hint: add instance variables to {\tt Cookie} to
represent the hypothetical state of the bowls, and modify
{\tt Likelihood} accordingly.  You might want to define a
{\tt Bowl} object.

\end{exercise}


\chapter{Estimation}
\label{estimation}

\section{The dice problem}

\index{Dice problem}
Suppose I have a box of dice that contains a 4-sided die, a 6-sided
die, an 8-sided die, a 12-sided die, and a 20-sided die.
If you
have ever played {\it Dungeons~\&~Dragons}, you know what I am talking about.
\index{Dungeons and Dragons}

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?
\index{dice}

Let me suggest a three-step strategy for approaching a problem like this.

\begin{enumerate}

\item Choose a representation for the hypotheses.

\item Choose a representation for the data.

\item Write the likelihood function.

\end{enumerate}

In previous examples I used strings to represent hypotheses and
data, but for the dice problem I'll use numbers.  Specifically,
I'll use the integers 4, 6, 8, 12, and 20 to represent hypotheses:

\begin{verbatim}
suite = Dice([4, 6, 8, 12, 20])
\end{verbatim}

And integers from 1 to 20 for the data.
These representations make it easy to
write the likelihood function:

\begin{verbatim}
class Dice(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

Here's how \verb"Likelihood" works.  If \verb"hypo < data", that
means the roll is greater than the number of sides on the die.
That can't happen, so the likelihood is 0.

Otherwise the question is, ``Given that there are {\tt hypo}
sides, what is the chance of rolling {\tt data}?''  The
answer is \verb"1/hypo", regardless of {\tt data}.

Here is the statement that does the update (if I roll a 6):

\begin{verbatim}
suite.Update(6)
\end{verbatim}

And here is the posterior distribution:

\begin{verbatim}
4 0.0
6 0.392156862745
8 0.294117647059
12 0.196078431373
20 0.117647058824
\end{verbatim}

After we roll a 6, the probability for the 4-sided die is 0.
The
most likely alternative is the 6-sided die, but there is still
almost a 12\% chance for the 20-sided die.

What if we roll a few more times and get 6, 8, 7, 7, 5, and 4?

\begin{verbatim}
for roll in [6, 8, 7, 7, 5, 4]:
    suite.Update(roll)
\end{verbatim}

With this data the 6-sided die is eliminated, and the 8-sided
die seems quite likely.  Here are the results:

\begin{verbatim}
4 0.0
6 0.0
8 0.943248453672
12 0.0552061280613
20 0.0015454182665
\end{verbatim}

Now the probability is 94\% that we are rolling the 8-sided die,
and less than 1\% for the 20-sided die.

The dice problem is based on an example I saw in Sanjoy Mahajan's class on
Bayesian inference.  You can download the code in this section from
\url{http://thinkbayes.com/dice.py}.
For more information
see Section~\ref{download}.

\section{The locomotive problem}

\index{locomotive problem}
\index{Mosteller, Frederick}
\index{German tank problem}
I found the locomotive problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
Probability with Solutions} (Dover, 1987):

\begin{quote}
``A railroad numbers its locomotives in order 1..N.  One day you see a
locomotive with the number 60.  Estimate how many locomotives the
railroad has.''
\end{quote}

Based on this observation, we know the railroad has 60 or more
locomotives.  But how many more?  To apply Bayesian reasoning, we
can break this problem into two steps:

\begin{enumerate}

\item What did we know about $N$ before we saw the data?

\item For any given value of $N$, what is the likelihood of
seeing the data (a locomotive with number 60)?

\end{enumerate}

The answer to the first question is the prior.
The answer to the
second is the likelihood.

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
\caption{Posterior distribution for the locomotive problem, based
on a uniform prior.}
\label{fig.train1}
\end{figure}

We don't have much basis to choose a prior, but we can start with
something simple and then consider alternatives.  Let's assume that
$N$ is equally likely to be any value from 1 to 1000.

\begin{verbatim}
hypos = xrange(1, 1001)
\end{verbatim}

Now all we need is a likelihood function.  In a hypothetical fleet of
$N$ locomotives, what is the probability that we would see number 60?
If we assume that there is only one train-operating company (or only
one we care about) and that we are equally likely to see any of its
locomotives, then the chance of seeing any particular locomotive is
$1/N$.

Here's the likelihood function:
\index{likelihood function}

\begin{verbatim}
class Train(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

This might look familiar; the likelihood functions for the locomotive
problem and the dice problem are identical.
\index{dice problem}

Here's the update:

\begin{verbatim}
suite = Train(hypos)
suite.Update(60)
\end{verbatim}

There are too many hypotheses to print, so I plotted the
results in Figure~\ref{fig.train1}.  Not surprisingly, all
values of $N$ below 60 have been eliminated.

The most likely
value, if you had to guess, is 60.  That might not seem like
a very good guess; after all, what are the chances that you just
happened to see the train with the highest number?
Nevertheless, if you want to maximize the chance of getting
the answer exactly right, you should guess 60.

But maybe that's
not the right goal.
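You can confirm that 60 is the most likely value with a standalone sketch of the update in plain Python (independent of {\tt thinkbayes.py}):

```python
# Posterior for the locomotive problem: uniform prior on 1..1000,
# one observation, 60 (standalone sketch, not the book's code).
post = {}
for N in range(1, 1001):
    post[N] = 1.0 / N if N >= 60 else 0.0   # prior times likelihood
total = sum(post.values())
for N in post:
    post[N] /= total

best = max(post, key=post.get)   # the posterior mode; best == 60
```

The posterior mode is the observed number itself, which is why guessing 60 maximizes the chance of being exactly right.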
An alternative is to compute the
mean of the posterior distribution:

\begin{verbatim}
def Mean(suite):
    total = 0
    for hypo, prob in suite.Items():
        total += hypo * prob
    return total

print Mean(suite)
\end{verbatim}

Or you could use the very similar method provided by {\tt Pmf}:

\begin{verbatim}
print suite.Mean()
\end{verbatim}

The mean of the posterior is 333, so that might be a
good guess if you wanted to minimize error.  If you played this
guessing game over and over, using the mean of the posterior as your
estimate would minimize the mean squared error over the long run (see
\url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
\index{mean squared error}

You can download this example from \url{http://thinkbayes.com/train.py}.
For more information
see Section~\ref{download}.

\section{What about that prior?}

To make any progress on the locomotive problem we had to make
assumptions, and some of them were pretty arbitrary.  In
particular, we chose a uniform prior from 1 to 1000, without
much justification for choosing 1000, or for choosing a uniform
distribution.
\index{prior distribution}

It is not crazy to believe that a railroad company might operate 1000
locomotives, but a reasonable person might guess more or fewer.  So we
might wonder whether the posterior distribution is sensitive to these
assumptions.  With so little data---only one observation---it probably
is.

Recall that with a uniform prior from 1 to 1000, the mean of
the posterior is 333.  With an upper bound of 500, we get a
posterior mean of 207, and with an upper bound of 2000,
the posterior mean is 552.

So that's bad.
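The sensitivity to the upper bound is easy to reproduce.  This standalone sketch (plain Python, not using {\tt thinkbayes.py}) computes the posterior mean for each choice:

```python
def posterior_mean(upper, observation=60):
    # Uniform prior on 1..upper; likelihood 1/N for N >= observation,
    # zero otherwise, so only N >= observation contribute.
    probs = dict((N, 1.0 / N) for N in range(observation, upper + 1))
    total = sum(probs.values())
    return sum(N * p for N, p in probs.items()) / total

means = [posterior_mean(u) for u in (500, 1000, 2000)]
# rounds to [207, 333, 552], matching the text
```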
There are two ways to proceed:

\begin{itemize}

\item Get more data.

\item Get more background information.

\end{itemize}

With more data, posterior distributions based on different
priors tend to converge.  For example, suppose that in addition
to train 60 we also see trains 30 and 90.  We can update the
distribution like this:

\begin{verbatim}
for data in [60, 30, 90]:
    suite.Update(data)
\end{verbatim}

With these data, the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500  & 152 \\
1000 & 164 \\
2000 & 171 \\
\hline
\end{tabular}

So the differences are smaller.


\section{An alternative prior}

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
\caption{Posterior distribution based on a power law prior,
compared to a uniform prior.}
\label{fig.train4}
\end{figure}

If more data are not available, another option is to improve the
priors by gathering more background information.  It is probably
not reasonable to assume that a train-operating company with 1000 locomotives
is just as likely as a company with only 1.

With some effort, we could probably find a list of companies that
operate locomotives in the area of observation.  Or we could
interview an expert in rail shipping to gather information about
the typical size of companies.

But even without getting into the specifics of railroad economics, we
can make some educated guesses.  In most fields, there are many small
companies, fewer medium-sized companies, and only one or two very
large companies.
In fact, the distribution of company sizes tends to
follow a power law, as Robert Axtell reports in {\it Science} (see
\url{http://www.sciencemag.org/content/293/5536/1818.full.pdf}).
\index{power law}
\index{Axtell, Robert}

This law suggests that if there are 1000 companies with fewer than
10 locomotives, there might be 100 companies with 100 locomotives,
10 companies with 1000, and possibly one company with 10,000 locomotives.

Mathematically, a power law means that the number of companies
with a given size is inversely proportional to size, or
%
\[ \PMF(x) \propto \left( \frac{1}{x} \right)^{\alpha} \]
%
where $\PMF(x)$ is the probability mass function of $x$ and $\alpha$ is
a parameter that is often near 1.

We can construct a power law prior like this:

\begin{verbatim}
class Train(Dice):

    def __init__(self, hypos, alpha=1.0):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, hypo**(-alpha))
        self.Normalize()
\end{verbatim}

And here's the code that constructs the prior:

\begin{verbatim}
hypos = range(1, 1001)
suite = Train(hypos)
\end{verbatim}

Again, the upper bound is arbitrary, but with a power law
prior, the posterior is less sensitive to this choice.

Figure~\ref{fig.train4} shows the new posterior based on
the power law, compared to the posterior based on the
uniform prior.  Using the background information
represented in the power law prior, we can all but eliminate
values of $N$ greater than 700.

If we start with this prior and observe trains 30, 60, and 90,
the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500  & 131 \\
1000 & 133 \\
2000 & 134 \\
\hline
\end{tabular}

Now the differences are much smaller.
In fact,
with an arbitrarily large upper bound, the mean converges on 134.

So the power law prior is more realistic, because it is based on
general information about the size of companies, and it
behaves better in practice.

You can download the examples in this section from
\url{http://thinkbayes.com/train3.py}.
For more information
see Section~\ref{download}.

\section{Credible intervals}
\label{credible}

Once you have computed a posterior distribution, it is often useful
to summarize the results with a single point estimate or an interval.
For point estimates it is common to use the mean, median, or the
value with maximum likelihood.
\index{credible interval}
\index{maximum likelihood}

For intervals we usually report two values computed
so that there is a 90\% chance that the unknown value falls
between them (or any other probability).
These values define a {\bf credible interval}.

A simple way to compute a credible interval is to add up the
probabilities in the posterior distribution and record the values
that correspond to probabilities 5\% and 95\%. In other words,
the 5th and 95th percentiles.
\index{percentile}

\verb"thinkbayes" provides a function that computes percentiles:

\begin{verbatim}
def Percentile(pmf, percentage):
    p = percentage / 100.0
    total = 0
    for val, prob in pmf.Items():
        total += prob
        if total >= p:
            return val
\end{verbatim}

And here's the code that uses it:

\begin{verbatim}
interval = Percentile(suite, 5), Percentile(suite, 95)
print interval
\end{verbatim}

For the previous example---the locomotive problem with a power law prior
and three trains---the 90\% credible interval is $(91, 243)$.
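The same accumulation can be written against a plain dict, as a minimal sketch (the explicit sort makes the iteration order unambiguous; these names are hypothetical, not the \verb"thinkbayes" code):

```python
def percentile(pmf, percentage):
    # Return the smallest value whose cumulative probability
    # reaches percentage/100, iterating values in sorted order.
    p = percentage / 100.0
    total = 0.0
    for val in sorted(pmf):
        total += pmf[val]
        if total >= p:
            return val

# A toy posterior: eight equally likely values.
eighths = {v: 0.125 for v in range(1, 9)}
interval = percentile(eighths, 5), percentile(eighths, 95)   # (1, 8)
median = percentile(eighths, 50)                             # 4
```

Note that this relies on visiting the values in increasing order; a dict that iterates in arbitrary order would give wrong answers without the sort.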
The
width of this range suggests, correctly, that we are still quite
uncertain about how many locomotives there are.


\section{Cumulative distribution functions}

In the previous section we computed percentiles by iterating through
the values and probabilities in a Pmf. If we need to compute more
than a few percentiles, it is more efficient to use a cumulative
distribution function, or Cdf.
\index{cumulative distribution function}
\index{Cdf}

Cdfs and Pmfs are equivalent in the sense that they contain the
same information about the distribution, and you can always convert
from one to the other. The advantage of the Cdf is that you can
compute percentiles more efficiently.

{\tt thinkbayes} provides a {\tt Cdf} class that represents a
cumulative distribution function. {\tt Pmf} provides a method
that makes the corresponding Cdf:

\begin{verbatim}
cdf = suite.MakeCdf()
\end{verbatim}

And {\tt Cdf} provides a function named \verb"Percentile":

\begin{verbatim}
interval = cdf.Percentile(5), cdf.Percentile(95)
\end{verbatim}

Converting from a Pmf to a Cdf takes time proportional to the number
of values, {\tt len(pmf)}. The Cdf stores the values and
probabilities in sorted lists, so looking up a probability to get the
corresponding value takes ``log time'': that is, time proportional to
the logarithm of the number of values.
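The sorted-list representation and the binary search can be sketched like this, using the standard \verb"bisect" module (a stand-in for the book's {\tt Cdf} class, not its actual implementation):

```python
from bisect import bisect_left

class SimpleCdf(object):
    """Cumulative distribution over the sorted values of a pmf dict."""

    def __init__(self, pmf):
        self.values = sorted(pmf)
        self.probs = []
        total = 0.0
        for val in self.values:
            total += pmf[val]
            self.probs.append(total)

    def Percentile(self, percentage):
        # Binary search for the first cumulative probability
        # >= percentage/100: O(log n) per lookup instead of O(n).
        index = bisect_left(self.probs, percentage / 100.0)
        return self.values[index]

cdf = SimpleCdf({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
interval = cdf.Percentile(5), cdf.Percentile(95)   # (1, 4)
```

Building the sorted lists is the linear-time part; every percentile lookup after that is logarithmic.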
Looking up a value to get the
corresponding probability is also logarithmic, so Cdfs are efficient
for many calculations.

The examples in this section are in \url{http://thinkbayes.com/train3.py}.
For more information
see Section~\ref{download}.


\section{The German tank problem}

During World War II, the Economic Warfare Division of the American
Embassy in London used statistical analysis to estimate German
production of tanks and other equipment.\footnote{Ruggles and Brodie,
``An Empirical Approach to Economic Intelligence in World War II,''
{\em Journal of the American Statistical Association}, Vol. 42,
No. 237 (March 1947).}

The Western Allies had captured log books, inventories, and repair
records that included chassis and engine serial numbers for individual
tanks.

Analysis of these records indicated that serial numbers were allocated
by manufacturer and tank type in blocks of 100 numbers, that numbers
in each block were used sequentially, and that not all numbers in each
block were used. So the problem of estimating German tank production
could be reduced, within each block of 100 numbers, to a form of the
locomotive problem.

Based on this insight, American and British analysts produced
estimates substantially lower than estimates from other forms
of intelligence. And after the war, records indicated that they were
substantially more accurate.

They performed similar analyses for tires, trucks, rockets, and other
equipment, yielding accurate and actionable economic intelligence.

The German tank problem is historically interesting; it is also a nice
example of real-world application of statistical estimation. So far
many of the examples in this book have been toy problems, but it will
not be long before we start solving real problems.
I think it is an
advantage of Bayesian analysis, especially with the computational
approach we are taking, that it provides such a short path from a
basic introduction to the research frontier.


\section{Discussion}

Among Bayesians, there are two approaches to choosing prior
distributions. Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be {\bf informative}. The problem with using an informative
prior is that people might use different background information (or
interpret it differently). So informative priors often seem subjective.
\index{informative prior}

The alternative is a so-called {\bf uninformative prior}, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves. In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.
\index{uninformative prior}

Uninformative priors are appealing because they seem more
objective. But I am generally in favor of using informative priors.
Why? First, Bayesian analysis is always based on
modeling decisions. Choosing the prior is one of those decisions, but
it is not the only one, and it might not even be the most subjective.
So even if an uninformative prior is more objective, the entire analysis
is still subjective.
\index{modeling}
\index{subjectivity}
\index{objectivity}

Also, for most practical problems, you are likely to be in one of two
regimes: either you have a lot of data or not very much.
If you have
a lot of data, the choice of the prior doesn't matter very much;
informative and uninformative priors yield almost the same results.
We'll see an example like this in the next chapter.

But if, as in the locomotive problem, you don't have much data,
using relevant background information (like the power law distribution)
makes a big difference.
\index{locomotive problem}

And if, as in the German tank problem, you have to make life-and-death
decisions based on your results, you should probably use all of the
information at your disposal, rather than maintaining the illusion of
objectivity by pretending to know less than you do.
\index{German tank problem}


\section{Exercises}

\begin{exercise}
To write a likelihood function for the locomotive problem, we had
to answer this question: ``If the railroad has $N$ locomotives, what
is the probability that we see number 60?''

The answer depends on what sampling process we use when we observe the
locomotive. In this chapter, I resolved the ambiguity by specifying
that there is only one train-operating company (or only one that we
care about).

But suppose instead that there are many companies with different
numbers of trains.
And suppose that you are equally likely to see any
train operated by any company.
In that case, the likelihood function is different because you
are more likely to see a train operated by a large company.

As an exercise, implement the likelihood function for this variation
of the locomotive problem, and compare the results.

\end{exercise}



\chapter{More Estimation}
\label{more}

\section{The Euro problem}
\label{euro}

\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay
poses this problem:

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

To answer that question, we'll proceed in two steps. The first
is to estimate the probability that the coin lands face up. The second
is to evaluate whether the data support the hypothesis that the
coin is biased.

You can download the code in this section from
\url{http://thinkbayes.com/euro.py}.
For more information
see Section~\ref{download}.

Any given coin has some probability, $x$, of landing heads up when spun
on edge. It seems reasonable to believe that the value of $x$ depends
on some physical characteristics of the coin, primarily the distribution
of weight.

If a coin is perfectly balanced, we expect $x$ to be close to 50\%, but
for a lopsided coin, $x$ might be substantially different.
We can use
Bayes's theorem and the observed data to estimate $x$.

Let's define 101 hypotheses, where $H_x$ is the hypothesis that the
probability of heads is $x$\%, for values from 0 to 100. I'll
start with a uniform prior where the probability of $H_x$ is the same
for all $x$. We'll come back later to consider other priors.
\index{uniform distribution}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro1.pdf}}
\caption{Posterior distribution for the Euro problem
on a uniform prior.}
\label{fig.euro1}
\end{figure}

The likelihood function is relatively easy: If $H_x$ is true, the
probability of heads is $x/100$ and the probability of tails is
$1-x/100$.

\begin{verbatim}
class Euro(Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        if data == 'H':
            return x/100.0
        else:
            return 1 - x/100.0
\end{verbatim}

Here's the code that makes the suite and updates it:

\begin{verbatim}
suite = Euro(xrange(0, 101))
dataset = 'H' * 140 + 'T' * 110

for data in dataset:
    suite.Update(data)
\end{verbatim}

The result is in Figure~\ref{fig.euro1}.


\section{Summarizing the posterior}

Again, there are several ways to summarize the posterior distribution.
One option is to find the most likely value in the posterior
distribution. \verb"thinkbayes" provides a function that does
that:
\index{posterior distribution}
\index{maximum likelihood}

\begin{verbatim}
def MaximumLikelihood(pmf):
    """Returns the value with the highest probability."""
    prob, val = max((prob, val) for val, prob in pmf.Items())
    return val
\end{verbatim}

In this case the result is 56, which is also the observed percentage of
heads, $140/250 = 56\%$.
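Condensed into a stand-alone sketch (a plain dict instead of the {\tt Suite} class; the helper name is mine), the whole update and the maximum likelihood summary take only a few lines:

```python
def euro_posterior(heads, tails):
    # Uniform prior over hypotheses x = 0..100, then one batch update
    # with likelihood (x/100)**heads * (1 - x/100)**tails.
    pmf = {}
    for x in range(0, 101):
        p = x / 100.0
        pmf[x] = p ** heads * (1 - p) ** tails
    total = sum(pmf.values())
    return {x: v / total for x, v in pmf.items()}

posterior = euro_posterior(140, 110)
most_likely = max(posterior, key=posterior.get)   # 56
```

Updating with the whole dataset at once gives the same posterior as spin-by-spin updates, because the per-spin likelihoods just multiply.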
So that suggests (correctly) that the
observed percentage is the maximum likelihood estimator
for the population.

We might also summarize the posterior by computing the mean
and median:
\index{median}

\begin{verbatim}
print 'Mean', suite.Mean()
print 'Median', thinkbayes.Percentile(suite, 50)
\end{verbatim}

The mean is 55.95; the median is 56. Finally, we can compute a
credible interval:

\begin{verbatim}
print 'CI', thinkbayes.CredibleInterval(suite, 90)
\end{verbatim}

The result is $(51, 61)$.

Now, getting back to the original question, we would like to know
whether the coin is fair. We observe that the posterior credible
interval does not include 50\%, which suggests that the coin is not
fair.

But that is not exactly the question we started with. MacKay asked,
``Do these data give evidence that the coin is biased rather than
fair?'' To answer that question, we will have to be more precise
about what it means to say that data constitute evidence for
a hypothesis. And that is the subject of the next chapter.
\index{evidence}

But before we go on, I want to address one possible source of confusion.
Since we want to know whether the coin is fair, it might be tempting
to ask for the probability that {\tt x} is 50\%:

\begin{verbatim}
print suite.Prob(50)
\end{verbatim}

The result is 0.021, but that value is almost meaningless.
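Why is it almost meaningless? The value depends on how finely we carve up the range of hypotheses, which a quick sketch makes visible (the helper is hypothetical, not the book's code):

```python
def prob_of_half(num_points, heads=140, tails=110):
    # Posterior probability of the single hypothesis x = 0.5,
    # on a uniform grid of num_points hypotheses over [0, 1].
    hypos = [i / (num_points - 1.0) for i in range(num_points)]
    likes = [x ** heads * (1 - x) ** tails for x in hypos]
    total = sum(likes)
    return likes[hypos.index(0.5)] / total

coarse = prob_of_half(101)    # roughly 0.021
fine = prob_of_half(1001)     # roughly ten times smaller
```

Each hypothesis on the fine grid covers a tenth of the interval that a coarse-grid hypothesis covers, so it carries about a tenth of the probability mass; only quantities that are stable under refinement, like the credible interval, are meaningful summaries.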
The
decision to evaluate 101 hypotheses was arbitrary; we could have
divided the range into more or fewer pieces, and if we had, the
probability for any given hypothesis would be greater or less.


\section{Swamping the priors}
\label{triangle}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro2.pdf}}
\caption{Uniform and triangular priors for the
Euro problem.}
\label{fig.euro2}
\end{figure}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro3.pdf}}
\caption{Posterior distributions for the Euro problem.}
\label{fig.euro3}
\end{figure}

We started with a uniform prior, but that might not be a good
choice. I can believe
that if a coin is lopsided, $x$ might deviate substantially from
50\%, but it seems unlikely that the Belgian Euro coin is so
imbalanced that $x$ is 10\% or 90\%.

It might be more reasonable to choose a prior that gives
higher probability to values of $x$ near 50\% and lower probability
to extreme values.

As an example, I constructed a triangular prior, shown in
Figure~\ref{fig.euro2}. Here's the code that constructs the prior:

\begin{verbatim}
def TrianglePrior():
    suite = Euro()
    for x in range(0, 51):
        suite.Set(x, x)
    for x in range(51, 101):
        suite.Set(x, 100-x)
    suite.Normalize()
    return suite
\end{verbatim}

Figure~\ref{fig.euro2} shows the result (and the uniform prior for
comparison).
Updating this prior with the same dataset yields the posterior
distribution shown in Figure~\ref{fig.euro3}. Even with substantially
different priors, the posterior distributions are very similar. The
medians and the credible intervals are identical; the means differ by
less than 0.5\%.
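We can check that claim directly with a small sketch that runs both updates side by side (plain dicts rather than the {\tt Euro} class, with the same 140 heads and 110 tails):

```python
def posterior_mean(prior, heads=140, tails=110):
    # Batch Bayesian update of a (possibly unnormalized) prior over
    # x = 0..100, followed by the posterior mean.
    post = {x: p * (x / 100.0) ** heads * (1 - x / 100.0) ** tails
            for x, p in prior.items()}
    total = sum(post.values())
    return sum(x * p for x, p in post.items()) / total

uniform = {x: 1 for x in range(0, 101)}
triangle = {x: x for x in range(0, 51)}
triangle.update({x: 100 - x for x in range(51, 101)})

diff = abs(posterior_mean(uniform) - posterior_mean(triangle))
# diff comes out well under half a percentage point.
```

The priors do not need to be normalized before the update, because the division by \verb"total" normalizes at the end.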
\index{triangle distribution}

This is an example of {\bf swamping the priors}: with enough
data, people who start with different priors will tend to
converge on the same posterior.
\index{swamping the priors}
\index{convergence}


\section{Optimization}

The code I have shown so far is meant to be easy to read, but it
is not very efficient. In general, I like to develop code that
is demonstrably correct, then check whether it is fast enough for
my purposes. If so, there is no need to optimize.
For this example, if we care about run time,
there are several ways we can speed it up.
\index{optimization}

The first opportunity is to reduce the number of times we
normalize the suite.
In the original code, we call \verb"Update" once for each spin.

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails

for data in dataset:
    suite.Update(data)
\end{verbatim}

And here's what \verb"Update" looks like:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

Each update iterates through the hypotheses, then calls \verb"Normalize",
which iterates through the hypotheses again. We can save some
time by doing all of the updates before normalizing.

\verb"Suite" provides a method called \verb"UpdateSet" that does
exactly that. Here it is:

\begin{verbatim}
def UpdateSet(self, dataset):
    for data in dataset:
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's how we can invoke it:

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails
suite.UpdateSet(dataset)
\end{verbatim}

This optimization speeds things up, but the run time is still
proportional to the amount of data.
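Normalizing once at the end changes the run time but not the answer; here is a sketch that checks that the two strategies agree (dict-based stand-ins for the suite, with a small dataset for speed):

```python
def normalize(pmf):
    total = sum(pmf.values())
    return {h: p / total for h, p in pmf.items()}

def likelihood(data, hypo):
    x = hypo / 100.0
    return x if data == 'H' else 1 - x

prior = {h: 1.0 / 101 for h in range(0, 101)}
dataset = 'H' * 14 + 'T' * 11

# Strategy 1: normalize after every spin (like Update).
one_at_a_time = dict(prior)
for data in dataset:
    one_at_a_time = normalize(
        {h: p * likelihood(data, h) for h, p in one_at_a_time.items()})

# Strategy 2: multiply everything in, then normalize once (like UpdateSet).
batched = dict(prior)
for data in dataset:
    batched = {h: p * likelihood(data, h) for h, p in batched.items()}
batched = normalize(batched)

max_diff = max(abs(one_at_a_time[h] - batched[h]) for h in prior)
```

The intermediate normalizations only rescale the whole distribution, so the final posteriors match up to floating-point rounding.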
We can speed things up
even more by rewriting \verb"Likelihood" to process the entire
dataset, rather than one spin at a time.

In the original version,
\verb"data" is a string that encodes either heads or tails:

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    if data == 'H':
        return x
    else:
        return 1-x
\end{verbatim}

As an alternative, we could encode the dataset as a tuple of
two integers: the number of heads and tails.
In that case \verb"Likelihood" looks like this:
\index{tuple}

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

And then we can call \verb"Update" like this:

\begin{verbatim}
heads, tails = 140, 110
suite.Update((heads, tails))
\end{verbatim}

Since we have replaced repeated multiplication with exponentiation,
this version takes the same time for any number of spins.


\section{The beta distribution}
\label{beta}

\index{beta distribution}
There is one more optimization that solves this problem
even faster.

So far we have used a Pmf object to represent a discrete set of
values for {\tt x}. Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}

The beta distribution is defined on the interval from 0 to 1
(including both), so it is a natural choice for describing
proportions and probabilities. But wait, it gets better.

%TODO: explain the binomial distribution in the previous section

It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}.
That means that if the prior
distribution for {\tt x} is a beta distribution, the posterior is also
a beta distribution. But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}

The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or {\tt alpha} and {\tt beta}. If the prior
is a beta distribution with parameters {\tt alpha} and {\tt beta}, and
we see data with {\tt h} heads and {\tt t} tails, the posterior is a
beta distribution with parameters {\tt alpha+h} and {\tt beta+t}. In
other words, we can do an update with two additions.
\index{parameter}

So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior. Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match. The
beta distribution with {\tt alpha=1} and {\tt beta=1} is uniform from
0 to 1.

Let's see how we can take advantage of all this.
{\tt thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}

\begin{verbatim}
class Beta(object):

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{verbatim}

By default \verb"__init__" makes a uniform distribution.
{\tt Update} performs a Bayesian update:

\begin{verbatim}
def Update(self, data):
    heads, tails = data
    self.alpha += heads
    self.beta += tails
\end{verbatim}

{\tt data} is a pair of integers representing the number of
heads and tails.

So we have yet another way to solve the Euro problem:

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((140, 110))
print beta.Mean()
\end{verbatim}

{\tt Beta} provides {\tt Mean}, which
computes a simple function of {\tt alpha}
and {\tt beta}:

\begin{verbatim}
def Mean(self):
    return float(self.alpha) / (self.alpha + self.beta)
\end{verbatim}

For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.

{\tt Beta} also provides {\tt EvalPdf}, which evaluates
the probability density
function (PDF) of the beta distribution:
\index{probability density function}
\index{PDF}

\begin{verbatim}
def EvalPdf(self, x):
    return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{verbatim}

Finally, {\tt Beta} provides {\tt MakePmf}, which
uses {\tt EvalPdf} to generate a discrete approximation
of the beta distribution.

%This expression might look familiar. Here's {\tt
% thinkbayes.EvalBinomialPmf}

%\begin{verbatim}
%def EvalBinomialPmf(x, yes, no):
%    return x**yes * (1-x)**no
%\end{verbatim}

%It's the same function, but in {\tt EvalPdf}, we think of {\tt x} as a
%random variable and {\tt alpha} and {\tt beta} as parameters; in {\tt
% EvalBinomialPmf}, {\tt x} is the parameter, and {\tt yes} and {\tt
% no} are random variables. Distributions like these that share the
%same PDF are called {\bf conjugate distributions}.
%\index{conjugate distribution}


\section{Discussion}

In this chapter we solved the same problem with two different
priors and found that with a large dataset, the priors get
swamped. If two people start with different
prior beliefs, they generally find, as they see more data, that
their posterior distributions converge. At some point the
difference between their distributions is small enough that it has
no practical effect.
\index{swamping the priors}
\index{convergence}

When this happens, it relieves some of the worry about objectivity
that I discussed in the previous chapter. And for many real-world
problems even stark prior beliefs can eventually be reconciled
by data.

But that is not always the case.
First, remember that all Bayesian
analysis is based on modeling decisions. If you and I do not
choose the same model, we might interpret data differently. So
even with the same data, we would compute different likelihoods,
and our posterior beliefs might not converge.
\index{modeling}

Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$. In the Euro problem,
if you are convinced that $x$ is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}

This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}

Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.'' For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}


\section{Exercises}

\begin{exercise}

Suppose that instead of observing coin tosses directly, you measure
the outcome using an instrument that is not always correct.
Specifically,
suppose there is a probability {\tt y} that an actual heads is reported
as tails, or an actual tails reported as heads.

Write a class that estimates the bias of a coin given a series of
outcomes and the value of {\tt y}.

How does the spread of the posterior distribution depend on
{\tt y}?

\end{exercise}


\begin{exercise}

\index{Reddit}
This exercise is inspired by a question posted by a
``redditor'' named dominosci on Reddit's statistics ``subreddit'' at
\url{http://reddit.com/r/statistics}.

Reddit is an online forum with many interest groups called
subreddits. Users, called redditors, post links to online
content and other web pages. Other redditors vote on the links,
giving an ``upvote'' to high-quality links and a ``downvote'' to
links that are bad or irrelevant.

A problem, identified by dominosci, is that some redditors
are more reliable than others, and Reddit does not take
this into account.

The challenge is to devise a system so that when a redditor
casts a vote, the estimated quality of the link is updated
in accordance with the reliability of the redditor, and the
estimated reliability of the redditor is updated in accordance
with the quality of the link.

One approach is to model the quality of the link as the
probability of garnering an upvote, and to model the reliability
of the redditor as the probability of correctly giving an upvote
to a high-quality item.

Write class definitions for redditors and links and an update function
that updates both objects whenever a redditor casts a vote.

\end{exercise}



\chapter{Odds and Addends}

\section{Odds}

One way to represent a probability is with a number between
0 and 1, but that's not the only way.
If you have ever bet
on a football game or a horse race, you have probably encountered
another representation of probability, called {\bf odds}.
\index{odds}

You might have heard expressions like ``the odds are
three to one,'' but you might not know what that means.
The {\bf odds in favor} of an event are the ratio of the probability
it will occur to the probability that it will not.

So if I think my team has a 75\% chance of winning, I would
say that the odds in their favor are three to one, because
the chance of winning is three times the chance of losing.

You can write odds in decimal form, but it is most common to
write them as a ratio of integers. So ``three to one'' is
written $3:1$.

When probabilities are low, it is more common to report the
{\bf odds against} rather than the odds in favor. For
example, if I think my horse has a 10\% chance of winning,
I would say that the odds against are $9:1$.

Probabilities and odds are different representations of the
same information. Given a probability, you can compute the
odds like this:

\begin{verbatim}
def Odds(p):
    return p / (1-p)
\end{verbatim}

Given the odds in favor, in decimal form, you can convert to
probability like this:

\begin{verbatim}
def Probability(o):
    return o / (o+1)
\end{verbatim}

If you represent odds with a numerator and denominator, you
can convert to probability like this:

\begin{verbatim}
def Probability2(yes, no):
    return yes / (yes + no)
\end{verbatim}

When I work with odds in my head, I find it helpful to picture
people at the track.
If 20\% of them think my horse will win,
then 80\% of them don't, so the odds in favor are $20:80$ or
$1:4$.

If the odds are $5:1$ against my horse, then five out of six
people think she will lose, so the probability of winning
is $1/6$.
\index{horse racing}


\section{The odds form of Bayes's theorem}

\index{Bayes's theorem!odds form}
In Chapter~\ref{intro} I wrote Bayes's theorem in the {\bf probability
form}:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
If we have two hypotheses, $A$ and $B$,
we can write the ratio of posterior probabilities like this:
%
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}{\p{B}~\p{D|B}} \]
%
Notice that the normalizing constant, \p{D}, drops out of
this equation.
\index{normalizing constant}

If $A$ and $B$ are mutually exclusive and collectively exhaustive,
that means $\p{B} = 1 - \p{A}$, so we can rewrite the ratio of
the priors, and the ratio of the posteriors, as odds.

Writing \odds{A} for odds in favor of $A$, we get:
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
In words, this says that the posterior odds are the prior odds times
the likelihood ratio. This is the {\bf odds form} of Bayes's theorem.

This form is most convenient for computing a Bayesian update on
paper or in your head. For example, let's go back to the
cookie problem:
\index{cookie problem}

\begin{quote}
Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.

Now suppose you choose one of the bowls at random and, without looking,
select a cookie at random. The cookie is vanilla. What is the probability
that it came from Bowl 1?
\end{quote}

The prior probability is 50\%, so the prior odds are $1:1$, or just
1.
The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
So the posterior odds are $3:2$, which corresponds to probability
$3/5$.


\section{Oliver's blood}
\label{oliver}

\index{Oliver's blood problem}
\index{MacKay, David}
Here is another problem from MacKay's {\it Information Theory,
Inference, and Learning Algorithms}:

\begin{quote}
Two people have left traces of their own blood at the scene of
a crime. A suspect, Oliver, is tested and found to have type
`O' blood. The blood groups of the two traces are found to
be of type `O' (a common type in the local population, having frequency
60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data [the traces found at the scene] give evidence
in favor of the proposition that Oliver was one of the people
[who left blood at the scene]?
\end{quote}

To answer this question, we need to think about what it means
for data to give evidence in favor of (or against) a hypothesis.
Intuitively, we might say that data favor a hypothesis if the
hypothesis is more likely in light of the data than it was before.
\index{evidence}

In the cookie problem, the prior odds are $1:1$, or probability 50\%.
The posterior odds are $3:2$, or probability 60\%. So we could say
that the vanilla cookie is evidence in favor of Bowl 1.

The odds form of Bayes's theorem provides a way to make this
intuition more precise. Again,
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
Or dividing through by \odds{A}:
%
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
%
The term on the left is the ratio of the posterior and prior odds.
The term on the right is the likelihood ratio, also called the {\bf Bayes
factor}.
\index{likelihood ratio}
\index{Bayes factor}

If the Bayes factor is greater than 1, that means that the
data were more likely under $A$ than under $B$.
And since the
odds ratio is also greater than 1, that means that the odds are
greater, in light of the data, than they were before.

If the Bayes factor is less than 1, that means the data were
less likely under $A$ than under $B$, so the odds in
favor of $A$ go down.

Finally, if the Bayes factor is exactly 1, the data are equally
likely under either hypothesis, so the odds do not change.

Now we can get back to the Oliver's blood problem. If Oliver is
one of the people who left blood at the crime scene, then he
accounts for the `O' sample, so the probability of the data
is just the probability that a random member of the population
has type `AB' blood, which is 1\%.

If Oliver did not leave blood at the scene, then we have two
samples to account for. If we choose two random people from
the population, what is the chance of finding one with type `O'
and one with type `AB'? Well, there are two ways it might happen:
the first person we choose might have type `O' and the second
`AB', or the other way around. So the total probability is
$2 (0.6) (0.01) = 1.2\%$.

The likelihood of the data is slightly higher if Oliver is
{\it not} one of the people who left blood at the scene, so
the blood data is actually evidence against Oliver's guilt.
\index{evidence}

This example is a little contrived, but it is an example of
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.

If this result is so counterintuitive that it bothers you,
this way of thinking might help: the data consist of a common
event, type `O' blood, and a rare event, type `AB' blood.
If Oliver accounts for the common event, that leaves the rare
event still unexplained. If Oliver doesn't account for the
`O' blood, then we have two chances to find someone in the
population with `AB' blood.
And that factor of two makes
the difference.


\section{Addends}
\label{addends}

The fundamental operation of Bayesian statistics is
{\tt Update}, which takes a prior distribution and a set
of data, and produces a posterior distribution.  But solving
real problems usually involves a number of other operations,
including scaling, addition and other arithmetic operations,
max and min, and mixtures.
\index{distribution!operations}

This chapter presents addition and max; I will present
other operations as we need them.

The first example is based on
{\it Dungeons~\&~Dragons}, a role-playing game where the results
of players' decisions are usually determined by rolling dice.
In fact, before game play starts, players generate each
attribute of their characters---strength, intelligence, wisdom,
dexterity, constitution, and charisma---by rolling three
6-sided dice and adding them up.
\index{Dungeons and Dragons}

So you might be curious to know the distribution of this sum.
There are two ways you might compute it:
\index{simulation}
\index{enumeration}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single die, you can draw random samples, add them up,
and accumulate the distribution of simulated sums.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the sums.

\end{description}

\verb"thinkbayes" provides functions for both.  Here's an example
of the first approach.
First, I'll define a class to represent
a single die as a Pmf:

\begin{verbatim}
class Die(thinkbayes.Pmf):

    def __init__(self, sides):
        thinkbayes.Pmf.__init__(self)
        for x in xrange(1, sides+1):
            self.Set(x, 1)
        self.Normalize()
\end{verbatim}

Now I can create a 6-sided die:

\begin{verbatim}
d6 = Die(6)
\end{verbatim}

And use \verb"thinkbayes.SampleSum" to generate a sample of 1000 rolls:

\begin{verbatim}
dice = [d6] * 3
three = thinkbayes.SampleSum(dice, 1000)
\end{verbatim}

\verb"SampleSum" takes a list of distributions (either Pmf or Cdf
objects) and the sample size, {\tt n}.  It generates {\tt n} random
sums and returns their distribution as a Pmf object.

\begin{verbatim}
def SampleSum(dists, n):
    pmf = MakePmfFromList(RandomSum(dists) for i in xrange(n))
    return pmf
\end{verbatim}

\verb"SampleSum" uses \verb"RandomSum", also in \verb"thinkbayes.py":

\begin{verbatim}
def RandomSum(dists):
    total = sum(dist.Random() for dist in dists)
    return total
\end{verbatim}

{\tt RandomSum} invokes {\tt Random} on each distribution and
adds up the results.

The drawback of simulation is that the result
is only approximately correct.  As \verb"n" gets larger, it gets
more accurate, but of course the run time increases as well.

The other approach is to enumerate all pairs of values and
compute the sum and probability of each pair.  This is implemented
in \verb"Pmf.__add__":

\begin{verbatim}
# class Pmf

    def __add__(self, other):
        pmf = Pmf()
        for v1, p1 in self.Items():
            for v2, p2 in other.Items():
                pmf.Incr(v1+v2, p1*p2)
        return pmf
\end{verbatim}

{\tt self} is a Pmf, of course; {\tt other} can be a Pmf or anything
else that provides {\tt Items}.  The result is a new Pmf.
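If you want to experiment without {\tt thinkbayes.py}, the same
enumeration can be sketched with a plain dict standing in for a Pmf
(the helper name \verb"add_pmfs" is mine, not part of the library):

```python
def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of independent draws from two pmfs.

    Each pmf is a dict mapping a value to its probability.
    """
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            # Accumulate probability for each possible sum.
            result[v1 + v2] = result.get(v1 + v2, 0) + p1 * p2
    return result

die = {x: 1 / 6 for x in range(1, 7)}
two = add_pmfs(die, die)
three = add_pmfs(two, die)    # sum of three 6-sided dice
```

For example, \verb"three[3]" is $1/216$ (only 1-1-1 works), and
\verb"three[10]" is $27/216$, one of the two modal values.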
The time to
run \verb"__add__" depends on the number of items in {\tt self} and
{\tt other}; it is proportional to {\tt len(self) * len(other)}.

And here's how it's used:

\begin{verbatim}
three_exact = d6 + d6 + d6
\end{verbatim}

When you apply the {\tt +} operator to a Pmf, Python invokes
\verb"__add__".  In this example, \verb"__add__" is invoked twice.

Figure~\ref{fig.dungeons1} shows an approximate result generated
by simulation and the exact result computed by enumeration.

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons1.pdf}}
\caption{Approximate and exact distributions for the sum of
three 6-sided dice.}
\label{fig.dungeons1}
\end{figure}

\verb"Pmf.__add__" is based on the assumption that the random
selections from each Pmf are independent.  In the example of rolling
several dice, this assumption is pretty good.  In other cases, we
would have to extend this method to use conditional probabilities.
\index{independence}

The code from this section is available from
\url{http://thinkbayes.com/dungeons.py}.
For more information
see Section~\ref{download}.

\section{Maxima}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons2.pdf}}
\caption{Distribution of the maximum of six rolls of three dice.}
\label{fig.dungeons2}
\end{figure}

When you generate a {\it Dungeons~\&~Dragons} character, you are
particularly interested in the character's best attributes, so
you might like to know the
distribution of the maximum attribute.

There are three ways to compute the distribution of a maximum:
\index{maximum}
\index{simulation}
\index{enumeration}
\index{exponentiation}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single selection, you can generate random samples, find the maximum,
and accumulate the distribution of simulated maxima.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the maximum.

\item[Exponentiation:] If we convert a Pmf to a Cdf, there is a simple
and efficient algorithm for finding the Cdf of the maximum.

\end{description}

The code to simulate maxima is almost identical to the code for
simulating sums:

\begin{verbatim}
def RandomMax(dists):
    total = max(dist.Random() for dist in dists)
    return total

def SampleMax(dists, n):
    pmf = MakePmfFromList(RandomMax(dists) for i in xrange(n))
    return pmf
\end{verbatim}

All I did was replace ``sum'' with ``max''.  And the code
for enumeration is almost identical, too:

\begin{verbatim}
def PmfMax(pmf1, pmf2):
    res = thinkbayes.Pmf()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            res.Incr(max(v1, v2), p1*p2)
    return res
\end{verbatim}

In fact, you could generalize this function by taking the
appropriate operator as a parameter.

The only problem with this algorithm is that if each Pmf
has $m$ values, the run time is proportional to $m^2$.
And if we want the maximum of {\tt k} selections, it takes
time proportional to $k m^2$.

If we convert the Pmfs to Cdfs, we can do the same calculation
much faster!  The key is to remember the definition of the
cumulative distribution function:
%
\[ CDF(x) = \p{X \le x} \]
%
where $X$ is a random variable that means ``a value chosen
randomly from this distribution.''  So, for example, $CDF(5)$
is the probability that a value from this distribution is less
than or equal to 5.

If I draw $X$ from $CDF_1$ and $Y$ from $CDF_2$, and compute
the maximum $Z = \max(X, Y)$, what is the chance that $Z$ is
less than or equal to 5?
Well, in that case both $X$ and $Y$
must be less than or equal to 5.

\index{independence}
If the selections of $X$ and $Y$ are independent,
%
\[ CDF_3(5) = CDF_1(5) CDF_2(5) \]
%
where $CDF_3$ is the distribution of $Z$.  I chose the value
5 because I think it makes the formulas easy to read, but we
can generalize for any value of $z$:
%
\[ CDF_3(z) = CDF_1(z) CDF_2(z) \]
%
In the special case where we draw $k$ values from the same
distribution,
%
\[ CDF_k(z) = CDF_1(z)^k \]
%
So to find the distribution of the maximum of $k$ values,
we can enumerate the probabilities in the given Cdf
and raise them to the $k$th power.
\verb"Cdf" provides a method that does just that:

\begin{verbatim}
# class Cdf

    def Max(self, k):
        cdf = self.Copy()
        cdf.ps = [p**k for p in cdf.ps]
        return cdf
\end{verbatim}

\verb"Max" takes the number of selections, {\tt k}, and returns a new
Cdf that represents the distribution of the maximum of {\tt k}
selections.  The run time for this method is proportional to
$m$, the number of items in the Cdf.

\verb"Pmf.Max" does the same thing for Pmfs.
It has to do a little
more work to convert the Pmf to a Cdf, so the run time is proportional
to $m \log m$, but that's still better than quadratic.

Finally, here's an example that computes the distribution of
a character's best attribute:

\begin{verbatim}
best_attr_cdf = three_exact.Max(6)
best_attr_pmf = best_attr_cdf.MakePmf()
\end{verbatim}

where \verb"three_exact" is defined in the previous section.
If we print the results, we see that the chance of generating
a character with at least one attribute of 18 is about 3\%.
Figure~\ref{fig.dungeons2} shows the distribution.


\section{Mixtures}
\label{mixture}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons3.pdf}}
\caption{Distribution of the outcome for a random die from the box.}
\label{fig.dungeons3}
\end{figure}

Let's do one more example from {\it Dungeons~\&~Dragons}.  Suppose
I have a box of dice with the following inventory:

\begin{verbatim}
5   4-sided dice
4   6-sided dice
3   8-sided dice
2  12-sided dice
1  20-sided die
\end{verbatim}

I choose a die from the box and roll it.  What is the distribution
of the outcome?

If you know which die it is, the answer is easy.  A die with {\tt n}
sides yields a uniform distribution from 1 to {\tt n}, including both.
\index{uniform distribution}

But if we don't know which die it is, the resulting distribution is
a {\bf mixture} of uniform distributions with different bounds.
In general, this kind of mixture does not fit any simple mathematical
model, but it is straightforward to compute the distribution in
the form of a PMF.
\index{mixture}

As always, one option is to simulate the scenario, generate a random
sample, and compute the PMF of the sample.  This approach is simple
and it generates an approximate solution quickly.
But if we want an
exact solution, we need a different approach.
\index{simulation}

Let's start with a simple version of the problem where there are
only two dice, one with 6 sides and one with 8.  We can make a Pmf to
represent each die:

\begin{verbatim}
d6 = Die(6)
d8 = Die(8)
\end{verbatim}

Then we create a Pmf to represent the mixture:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die in [d6, d8]:
    for outcome, prob in die.Items():
        mix.Incr(outcome, prob)
mix.Normalize()
\end{verbatim}

The first loop enumerates the dice; the second enumerates the
outcomes and their probabilities.  Inside the loop,
{\tt Pmf.Incr} adds up the contributions from the two distributions.

This code assumes that the two dice are equally likely.  More
generally, we need to know the probability of each die so we can
weight the outcomes accordingly.

First we create a Pmf that maps from each die to the probability it is
selected:

\begin{verbatim}
pmf_dice = thinkbayes.Pmf()
pmf_dice.Set(Die(4), 5)
pmf_dice.Set(Die(6), 4)
pmf_dice.Set(Die(8), 3)
pmf_dice.Set(Die(12), 2)
pmf_dice.Set(Die(20), 1)
pmf_dice.Normalize()
\end{verbatim}

Next we need a more general version of the mixture algorithm:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die, weight in pmf_dice.Items():
    for outcome, prob in die.Items():
        mix.Incr(outcome, weight*prob)
\end{verbatim}

Now each die has a weight associated with it (which makes it a
weighted die, I suppose).  When we add each outcome to the mixture,
its probability is multiplied by {\tt weight}.

Figure~\ref{fig.dungeons3} shows the result.
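As a sanity check, the weighted mixture can be computed standalone
with plain dicts (a sketch that mirrors the loop above; the helper
\verb"die" is mine, not part of {\tt thinkbayes}):

```python
def die(sides):
    """A fair die as a dict mapping each outcome to its probability."""
    return {x: 1 / sides for x in range(1, sides + 1)}

# (die, number of copies in the box)
box = [(die(4), 5), (die(6), 4), (die(8), 3), (die(12), 2), (die(20), 1)]
total_weight = sum(weight for _, weight in box)

mix = {}
for pmf, weight in box:
    for outcome, prob in pmf.items():
        # Each outcome's probability is scaled by the chance of
        # drawing that die from the box.
        mix[outcome] = mix.get(outcome, 0) + (weight / total_weight) * prob
```

Values 1 through 4 get contributions from every die, so they share the
highest probability; values above 12 come only from the 20-sided die.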
As expected, values 1
through 4 are the most likely because any die can produce them.
Values above 12 are unlikely because there is only one die in the box
that can produce them (and it does so less than half the time).

{\tt thinkbayes} provides a function named {\tt MakeMixture}
that encapsulates this algorithm, so we could have written:

\begin{verbatim}
mix = thinkbayes.MakeMixture(pmf_dice)
\end{verbatim}

We'll use {\tt MakeMixture} again in Chapters~\ref{prediction}
and~\ref{observer}.


\section{Discussion}

Other than the odds form of Bayes's theorem, this chapter is not
specifically Bayesian.  But Bayesian analysis is all about
distributions, so it is important to understand the concept of a
distribution well.  From a computational point of view, a distribution
is any data structure that represents a set of values (possible
outcomes of a random process) and their probabilities.
\index{distribution}

We have seen two representations of distributions: Pmfs and Cdfs.
These representations are equivalent in the sense that they contain
the same information, so you can convert from one to the other.  The
primary difference between them is performance: some operations are
faster and easier with a Pmf; others are faster with a Cdf.
\index{Pmf} \index{Cdf}

The other goal of this chapter is to introduce operations that act on
distributions, like \verb"Pmf.__add__", {\tt Cdf.Max}, and {\tt
thinkbayes.MakeMixture}.  We will use these operations later, but I
introduce them now to encourage you to think of a distribution as a
fundamental unit of computation, not just a container for values and
probabilities.



\chapter{Decision Analysis}
\label{decisionanalysis}

\section{The {\it Price is Right} problem}

On November 1, 2007, contestants named Letia and Nathaniel appeared
on {\it The Price is Right}, an American game show.  They competed in
a game called {\it The Showcase}, where the objective is to guess the price
of a showcase of prizes.  The contestant who comes closest to the
actual price of the showcase, without going over, wins the prizes.
\index{Price is Right}
\index{Showcase}

Nathaniel went first.  His showcase included a dishwasher, a wine
cabinet, a laptop computer, and a car.  He bid \$26,000.

Letia's showcase included a pinball machine, a video arcade game, a
pool table, and a cruise of the Bahamas.  She bid \$21,500.

The actual price of Nathaniel's showcase was \$25,347.  His bid
was too high, so he lost.

The actual price of Letia's showcase was \$21,578.  She was only
off by \$78, so she won her showcase and, because
her bid was off by less than \$250, she also won Nathaniel's
showcase.

For a Bayesian thinker, this scenario suggests several questions:

\begin{enumerate}

\item Before seeing the prizes, what prior beliefs should the
contestant have about the price of the showcase?

\item After seeing the prizes, how should the contestant update
those beliefs?

\item Based on the posterior distribution, what should the
contestant bid?

\end{enumerate}

The third question demonstrates a common use of Bayesian analysis:
decision analysis.  Given a posterior distribution, we can choose
the bid that maximizes the contestant's expected return.
\index{decision analysis}

This problem is inspired by an example in Cameron Davidson-Pilon's
book, {\it Bayesian Methods for Hackers}.  The code I wrote for this
chapter is available from \url{http://thinkbayes.com/price.py}; it
reads data files you can download from
\url{http://thinkbayes.com/showcases.2011.csv} and
\url{http://thinkbayes.com/showcases.2012.csv}.
For more information
see Section~\ref{download}.
\index{Davidson-Pilon, Cameron}


\section{The prior}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price1.pdf}}
\caption{Distribution of prices for showcases on
{\it The Price is Right}, 2011--12.}
\label{fig.price1}
\end{figure}

To choose a prior distribution of prices, we can take advantage
of data from previous episodes.  Fortunately, fans of the show
keep detailed records.  When I corresponded with Mr.~Davidson-Pilon
about his book, he sent me data collected by Steve Gee at
\url{http://tpirsummaries.8m.com}.  It includes the price of
each showcase from the 2011 and 2012 seasons and the bids
offered by the contestants.
\index{Gee, Steve}

Figure~\ref{fig.price1} shows the distribution of prices for these
showcases.  The most common value for both showcases is around
\$28,000, but the first showcase has a second mode near \$50,000,
and the second showcase is occasionally worth more than \$70,000.

These distributions are based on actual data, but they
have been smoothed by Gaussian kernel density estimation (KDE).
Before we go on, I want to take a detour to talk about
probability density functions and KDE.
\index{kernel density estimation}
\index{KDE}


\section{Probability density functions}

So far we have been working with probability mass functions, or PMFs.
A PMF is a map from each possible value to its probability.
In my
implementation, a Pmf object provides a method named {\tt Prob} that
takes a value and returns a probability, also known as a {\bf probability
mass}.
\index{probability density function}
\index{Pdf}
\index{Pmf}

A {\bf probability density function}, or PDF, is the continuous version of a
PMF, where the possible values make up a continuous range rather than
a discrete set.

\index{Gaussian distribution}
In mathematical notation, PDFs are usually written as functions; for
example, here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1:
%
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
%
For a given value of $x$, this function computes a probability
density.
A density is similar
to a probability mass in the sense that a higher density indicates
that a value is more likely.
\index{density}
\index{probability density}
\index{probability}

But a density is not a probability.  A density can be 0 or any positive
value; it is not bounded, like a probability, between 0 and 1.

If you integrate a density
over a continuous range, the result is a probability.  But
for the applications in this book we seldom have to do that.

Instead we primarily use probability densities as part
of a likelihood function.  We will see an example soon.


\section{Representing PDFs}

\index{Pdf}
To represent PDFs in Python,
{\tt thinkbayes.py} provides a class named {\tt Pdf}.
{\tt Pdf} is an {\bf abstract type}, which means that it defines
the interface a Pdf is supposed to have, but does not provide
a complete implementation.
The {\tt Pdf} interface includes
two methods, {\tt Density} and {\tt MakePmf}:

\begin{verbatim}
class Pdf(object):

    def Density(self, x):
        raise UnimplementedMethodException()

    def MakePmf(self, xs):
        pmf = Pmf()
        for x in xs:
            pmf.Set(x, self.Density(x))
        pmf.Normalize()
        return pmf
\end{verbatim}

{\tt Density} takes a value, {\tt x}, and returns the corresponding
density.  {\tt MakePmf} makes a discrete approximation to the PDF.

{\tt Pdf} provides an implementation of {\tt MakePmf}, but not {\tt
Density}, which has to be provided by a child class.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

\index{Gaussian distribution}
A {\bf concrete type} is a child class that extends an abstract type
and provides an implementation of the missing methods.
For example, {\tt GaussianPdf} extends {\tt Pdf} and provides
{\tt Density}:

\begin{verbatim}
class GaussianPdf(Pdf):

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        return scipy.stats.norm.pdf(x, self.mu, self.sigma)
\end{verbatim}

\verb"__init__" takes {\tt mu} and {\tt sigma}, which are
the mean and standard deviation of the distribution, and stores
them as attributes.

{\tt Density} uses a function from {\tt scipy.stats} to evaluate the
Gaussian PDF.  The function is called {\tt norm.pdf} because the
Gaussian distribution is also called the ``normal'' distribution.
\index{scipy}
\index{normal distribution}

The Gaussian PDF is defined by a simple mathematical function,
so it is easy to evaluate.
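In fact, if you want to check {\tt Density} against the formula from
the previous section, a scipy-free stand-in is easy to write (a
sketch; the real class delegates to {\tt scipy.stats.norm.pdf}):

```python
import math

class GaussianPdf:
    """Evaluates the Gaussian density directly from the formula."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        # Standardize, then apply f(z) = exp(-z^2/2) / (sigma sqrt(2 pi)).
        z = (x - self.mu) / self.sigma
        return math.exp(-z**2 / 2) / (self.sigma * math.sqrt(2 * math.pi))

pdf = GaussianPdf(mu=0, sigma=1)
print(pdf.Density(0))  # 1/sqrt(2 pi), about 0.3989
```

The density is highest at the mean and symmetric around it, as
expected.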
And it is useful because many
quantities in the real world have distributions that are
approximately Gaussian.
\index{Gaussian distribution}
\index{Gaussian PDF}

But with real data, there is no guarantee that the distribution
is Gaussian or any other simple mathematical function.  In
that case we can use a sample to estimate the PDF of
the whole population.

For example, in {\it The Price Is Right} data, we have
313 prices for the first showcase.  We can think of these
values as a sample from the population of all possible showcase
prices.

This sample includes the following values (in order):
%
\[ 28800, 28868, 28941, 28957, 28958 \]
%
In the sample, no values appear between 28801 and 28867, but
there is no reason to think that these values are impossible.
Based on our background information, we expect all
values in this range to be equally likely.  In other words,
we expect the PDF to be fairly smooth.

Kernel density estimation (KDE) is an algorithm that takes
a sample and finds an appropriately smooth PDF that fits
the data.  You can read details at
\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.
\index{KDE}
\index{kernel density estimation}

{\tt scipy} provides an implementation of KDE and {\tt thinkbayes}
provides a class called {\tt EstimatedPdf} that
uses it:
\index{scipy}
\index{numpy}

\begin{verbatim}
class EstimatedPdf(Pdf):

    def __init__(self, sample):
        self.kde = scipy.stats.gaussian_kde(sample)

    def Density(self, x):
        return self.kde.evaluate(x)
\end{verbatim}

\verb"__init__" takes a sample
and computes a kernel density estimate.
The result is a
\verb"gaussian_kde" object that provides an {\tt evaluate}
method.

{\tt Density} takes a value, calls \verb"gaussian_kde.evaluate",
and returns the resulting density.
\index{density}

Finally, here's an outline of the code I used to generate
Figure~\ref{fig.price1}:
\index{numpy}

\begin{verbatim}
prices = ReadData()
pdf = thinkbayes.EstimatedPdf(prices)

low, high = 0, 75000
n = 101
xs = numpy.linspace(low, high, n)
pmf = pdf.MakePmf(xs)
\end{verbatim}

{\tt pdf} is a {\tt Pdf} object, estimated by KDE.  {\tt pmf}
is a Pmf object that approximates the Pdf by evaluating the density
at a sequence of equally spaced values.

{\tt linspace} stands for
``linear space.''  It takes a range, {\tt low} and {\tt high}, and
the number of points, {\tt n}, and returns a new {\tt numpy}
array with {\tt n} elements equally spaced between {\tt low} and
{\tt high}, including both.

And now back to {\it The Price is Right}.


\section{Modeling the contestants}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price2.pdf}}
\caption{Cumulative distribution (CDF) of the difference between the
contestant's bid and the actual price.}
\label{fig.price2}
\end{figure}

The PDFs in Figure~\ref{fig.price1} estimate the distribution of
possible prices.
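Before moving on, here is a toy, runnable version of the outline
above: a tiny hand-rolled Gaussian KDE in place of {\tt scipy}, the
five sample prices quoted earlier, and a hand-picked bandwidth (a
sketch only; {\tt gaussian\_kde} chooses its bandwidth automatically
and the real code uses all 313 prices):

```python
import math

sample = [28800, 28868, 28941, 28957, 28958]  # prices from the text
bandwidth = 500.0  # assumption; chosen by hand for this sketch

def density(x):
    # Average of Gaussian kernels centered on the sample points.
    k = sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) for s in sample)
    return k / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

# Discretize onto an evenly spaced grid, like MakePmf with linspace.
low, high, n = 27000, 31000, 101
xs = [low + (high - low) * i / (n - 1) for i in range(n)]
pmf = {x: density(x) for x in xs}
total = sum(pmf.values())
pmf = {x: p / total for x, p in pmf.items()}  # normalize
```

The resulting Pmf is smooth across the gap between 28801 and 28867,
which is the point of using KDE rather than the raw sample.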
If you were a contestant on the
show, you could use this distribution to quantify your prior belief
about the price of each showcase (before you see the prizes).

To update these priors, we have to answer these questions:

\begin{enumerate}

\item What data should we consider and how should we quantify it?

\item Can we compute a likelihood function; that is,
for each hypothetical value of {\tt price}, can we compute
the conditional likelihood of the data?

\end{enumerate}

To answer these questions, I am going to model the contestant
as a price-guessing instrument with known error characteristics.
In other words, when the contestant sees the prizes, he or she
guesses the price of each prize---ideally without taking into
consideration the fact that the prize is part of a showcase---and
adds up the prices.  Let's call this total {\tt guess}.
\index{error}

Under this model, the question we have to answer is, ``If the
actual price is {\tt price}, what is the likelihood that the
contestant's estimate would be {\tt guess}?''
\index{likelihood}

Or if we define
%
\begin{verbatim}
error = price - guess
\end{verbatim}
%
then we could ask, ``What is the likelihood
that the contestant's estimate is off by {\tt error}?''

To answer this question, we can use the historical data again.
Figure~\ref{fig.price2} shows the cumulative distribution of {\tt diff},
the difference between the contestant's bid and the actual price
of the showcase.
\index{Cdf}

The definition of diff is
%
\begin{verbatim}
diff = price - bid
\end{verbatim}
%
When {\tt diff} is negative, the bid is too high.
As an
aside, we can use this distribution to compute the probability that the
contestants overbid: the first contestant overbids 25\% of the
time; the second contestant overbids 29\% of the time.

We can also see that the bids are biased;
that is, they are more likely to be too low than too high.  And
that makes sense, given the rules of the game.

Finally, we can use this distribution to estimate the reliability of
the contestants' guesses.  This step is a little tricky because
we don't actually know the contestants' guesses; we only know
what they bid.

So we'll have to make some assumptions.  Specifically, I
assume that the distribution of {\tt error} is Gaussian with mean 0
and the same variance as {\tt diff}.
\index{Gaussian distribution}

The {\tt Player} class implements this model:
\index{numpy}

\begin{verbatim}
class Player(object):

    def __init__(self, prices, bids, diffs):
        self.pdf_price = thinkbayes.EstimatedPdf(prices)
        self.cdf_diff = thinkbayes.MakeCdfFromList(diffs)

        mu = 0
        sigma = numpy.std(diffs)
        self.pdf_error = thinkbayes.GaussianPdf(mu, sigma)
\end{verbatim}

{\tt prices} is a sequence of showcase prices, {\tt bids} is a
sequence of bids, and {\tt diffs} is a sequence of diffs, where
again {\tt diff = price - bid}.

\verb"pdf_price" is the smoothed PDF of prices, estimated by KDE.
\verb"cdf_diff" is the cumulative distribution of {\tt diff},
which we saw in Figure~\ref{fig.price2}.  And \verb"pdf_error"
is the PDF that characterizes the distribution of errors, where
{\tt error = price - guess}.

Again, we use the variance of {\tt diff} to estimate the variance of
{\tt error}.  This estimate is not perfect because contestants' bids
are sometimes strategic; for example, if Player 2 thinks that Player 1
has overbid, Player 2 might make a very low bid.  In that case {\tt
diff} does not reflect {\tt error}.
If this happens a lot, the
observed variance in {\tt diff} might overestimate the variance in
{\tt error}.  Nevertheless, I think it is a reasonable modeling
decision.

As an alternative, someone preparing to appear on the show could
estimate their own distribution of {\tt error} by watching previous shows
and recording their guesses and the actual prices.


\section{Likelihood}

Now we are ready to write the likelihood function.  As usual,
I define a new class that extends {\tt thinkbayes.Suite}:
\index{likelihood}

\begin{verbatim}
class Price(thinkbayes.Suite):

    def __init__(self, pmf, player):
        thinkbayes.Suite.__init__(self, pmf)
        self.player = player
\end{verbatim}

{\tt pmf} represents the prior distribution and
{\tt player} is a Player object as described in the previous
section.  Here's {\tt Likelihood}:

\begin{verbatim}
def Likelihood(self, data, hypo):
    price = hypo
    guess = data

    error = price - guess
    like = self.player.ErrorDensity(error)

    return like
\end{verbatim}

{\tt hypo} is the hypothetical price of the showcase.  {\tt data}
is the contestant's best guess at the price.  {\tt error} is
the difference, and {\tt like} is the likelihood of the data,
given the hypothesis.

{\tt ErrorDensity} is defined in {\tt Player}:

\begin{verbatim}
# class Player:

    def ErrorDensity(self, error):
        return self.pdf_error.Density(error)
\end{verbatim}

{\tt ErrorDensity} works by evaluating \verb"pdf_error" at
the given value of {\tt error}.
The result is a probability density, so it is not really a probability.
But remember that {\tt Likelihood} doesn't
need to compute a probability; it only has to compute something {\em
proportional} to a probability.
As long as the constant of
proportionality is the same for all likelihoods, it gets canceled out
when we normalize the posterior distribution.
\index{density}
\index{likelihood}

And therefore, a probability density is a perfectly good likelihood.


\section{Update}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price3.pdf}}
\caption{Prior and posterior distributions for Player 1, based on
a best guess of \$20,000.}
\label{fig.price3}
\end{figure}

{\tt Player} provides a method that takes the contestant's
guess and computes the posterior distribution:

\begin{verbatim}
# class Player

    def MakeBeliefs(self, guess):
        pmf = self.PmfPrice()
        self.prior = Price(pmf, self)
        self.posterior = self.prior.Copy()
        self.posterior.Update(guess)
\end{verbatim}

{\tt PmfPrice} generates a discrete approximation
to the PDF of price, which we use to construct the prior.

{\tt PmfPrice} uses {\tt MakePmf}, which
evaluates \verb"pdf_price" at a sequence of values:

\begin{verbatim}
# class Player

    n = 101
    price_xs = numpy.linspace(0, 75000, n)

    def PmfPrice(self):
        return self.pdf_price.MakePmf(self.price_xs)
\end{verbatim}

To construct the posterior, we make a copy of the
prior and then invoke {\tt Update}, which invokes {\tt Likelihood}
for each hypothesis, multiplies the priors by the likelihoods,
and renormalizes.
\index{normalize}

So let's get back to the original scenario.  Suppose you are
Player 1 and when you see your showcase, your best guess is
that the total price of the prizes is \$20,000.

Figure~\ref{fig.price3} shows prior and
posterior beliefs about the actual price.
The posterior is shifted
to the left because your guess
is on the low end of the prior range.

On one level, this result makes sense.
The most likely value
in the prior is \$27,750, your best guess is \$20,000, and
the mean of the posterior is somewhere in between: \$25,096.

On another level, you might find this result bizarre, because it
suggests that if you {\em think} the price is \$20,000, then you
should {\em believe} the price is \$24,000.

To resolve this apparent paradox, remember that you are combining two
sources of information, historical data about past showcases and
guesses about the prizes you see.

We are treating the historical data as the prior and updating it
based on your guesses, but we could equivalently use your guess
as a prior and update it based on historical data.

If you think of it that way, maybe it is less surprising that the
most likely value in the posterior is not your original guess.


\section{Optimal bidding}

Now that we have a posterior distribution, we can use it to
compute the optimal bid, which I define as the bid that maximizes
expected return (see \url{http://en.wikipedia.org/wiki/Expected_return}).
\index{decision analysis}

I'm going to present the methods in this section top-down, which
means I will show you how they are used before I show you how they
work.
If you see an unfamiliar method, don't worry; the definition
will be along shortly.

To compute optimal bids, I wrote a class called {\tt GainCalculator}:

\begin{verbatim}
class GainCalculator(object):

    def __init__(self, player, opponent):
        self.player = player
        self.opponent = opponent
\end{verbatim}

{\tt player} and {\tt opponent} are {\tt Player} objects.

{\tt GainCalculator} provides {\tt ExpectedGains}, which
computes a sequence of bids and the expected gain for each
bid:
\index{numpy}

\begin{verbatim}
    def ExpectedGains(self, low=0, high=75000, n=101):
        bids = numpy.linspace(low, high, n)

        gains = [self.ExpectedGain(bid) for bid in bids]

        return bids, gains
\end{verbatim}

{\tt low} and {\tt high} specify the range of possible bids;
{\tt n} is the number of bids to try.

{\tt ExpectedGains} calls {\tt ExpectedGain}, which
computes expected gain for a given bid:

\begin{verbatim}
    def ExpectedGain(self, bid):
        suite = self.player.posterior
        total = 0
        for price, prob in sorted(suite.Items()):
            gain = self.Gain(bid, price)
            total += prob * gain
        return total
\end{verbatim}

{\tt ExpectedGain} loops through the values in the posterior
and computes the gain for each bid, given the actual prices of
the showcase.
It weights each gain with the corresponding
probability and returns the total.

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price5.pdf}}
\caption{Expected gain versus bid in a scenario where Player 1's best
guess is \$20,000 and Player 2's best guess is \$40,000.}
\label{fig.price5}
\end{figure}

{\tt ExpectedGain} invokes {\tt Gain}, which takes a bid and an actual
price and returns the expected gain:

\begin{verbatim}
    def Gain(self, bid, price):
        if bid > price:
            return 0

        diff = price - bid
        prob = self.ProbWin(diff)

        if diff <= 250:
            return 2 * price * prob
        else:
            return price * prob
\end{verbatim}

If you overbid, you get nothing. Otherwise we compute
the difference between your bid and the price, which determines
your probability of winning.

If {\tt diff} is less than \$250, you win both showcases. For
simplicity, I assume that both showcases have the same price. Since
this outcome is rare, it doesn't make much difference.

Finally, we have to compute the probability of winning based
on {\tt diff}:

\begin{verbatim}
    def ProbWin(self, diff):
        prob = (self.opponent.ProbOverbid() +
                self.opponent.ProbWorseThan(diff))
        return prob
\end{verbatim}

If your opponent overbids, you win. Otherwise, you have to hope
that your opponent is off by more than {\tt diff}.
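The payoff rule in {\tt Gain} can be checked by hand. Here is a
standalone sketch (mine, not the book's code) with \verb"prob_win"
as a fixed stand-in for {\tt ProbWin}, which in the real class depends
on the opponent's distribution; the bids and prices are invented:

```python
# Sketch of the payoff logic in Gain; prob_win is a made-up stand-in
# for ProbWin, which really depends on the opponent's distribution.
def gain(bid, price, prob_win):
    if bid > price:
        return 0                        # overbid: you get nothing
    diff = price - bid
    if diff <= 250:
        return 2 * price * prob_win     # within $250: win both showcases
    return price * prob_win             # plain underbid: win one showcase

g_over = gain(21000, 20000, 0.5)    # overbid by $1000
g_close = gain(19800, 20000, 0.5)   # under by $200, within $250
g_under = gain(15000, 20000, 0.5)   # under by $5000
```

With a 50\% chance of winning, overbidding yields 0, landing within
\$250 yields twice the showcase value times the probability, and a
plain underbid yields the showcase value times the probability.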
{\tt Player}
provides methods to compute both probabilities:

\begin{verbatim}
# class Player:

    def ProbOverbid(self):
        return self.cdf_diff.Prob(-1)

    def ProbWorseThan(self, diff):
        return 1 - self.cdf_diff.Prob(diff)
\end{verbatim}

This code might be confusing because the computation is now from
the point of view of the opponent, who is computing, ``What is
the probability that I overbid?'' and ``What is the probability
that my bid is off by more than {\tt diff}?''

Both answers are based on the CDF of {\tt diff}. If the opponent's
{\tt diff} is less than or equal to -1, you win. If the opponent's
{\tt diff} is worse than yours, you win. Otherwise you lose.

Finally, here's the code that computes optimal bids:

\begin{verbatim}
# class Player:

    def OptimalBid(self, guess, opponent):
        self.MakeBeliefs(guess)
        calc = GainCalculator(self, opponent)
        bids, gains = calc.ExpectedGains()
        gain, bid = max(zip(gains, bids))
        return bid, gain
\end{verbatim}

Given a guess and an opponent, {\tt OptimalBid} computes
the posterior distribution, instantiates a {\tt GainCalculator},
computes expected gains for a range of bids, and returns
the optimal bid and expected gain. Whew!

Figure~\ref{fig.price5} shows the results for both players,
based on a scenario where Player 1's best guess is \$20,000
and Player 2's best guess is \$40,000.

For Player 1 the optimal bid is \$21,000, yielding an expected
return of almost \$16,700. This is a case (which turns out
to be unusual) where the optimal bid is actually higher than
the contestant's best guess.

For Player 2 the optimal bid is \$31,500, yielding an expected
return of almost \$19,400.
This is the more typical case where
the optimal bid is less than the best guess.


\section{Discussion}

One of the features of Bayesian estimation is that the
result comes in the form of a posterior distribution. Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.
\index{distribution}

In this example, we use the posterior distribution
to compute an optimal bid. The return on a given bid is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically. But it is relatively simple to do
computationally.
\index{decision analysis}

Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate. These summaries can be useful, but if that's
all you need, then you probably don't need Bayesian methods in the
first place.
\index{maximum likelihood}
\index{summary statistic}

Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the analysis to perform some
kind of decision analysis, as we did in this chapter, or some kind of
prediction, as we see in the next chapter.


\chapter{Prediction}
\label{prediction}

\section{The Boston Bruins problem}

In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
Bruins played a best-of-seven championship series against the despised
Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then
won the next two games 8-1 and 4-0.
At this point in the series, what
is the probability that Boston will win the next game, and what is
their probability of winning the championship?
\index{hockey}
\index{Boston Bruins}
\index{Vancouver Canucks}

As always, to answer a question like this, we need to make some
assumptions. First, it is reasonable to believe that goal scoring in
hockey is at least approximately a Poisson process, which means that
it is equally likely for a goal to be scored at any time during a
game. Second, we can assume that against a particular opponent, each team
has some long-term average goals per game, denoted $\lambda$.
\index{Poisson process}

Given these assumptions, my strategy for answering this question is

\begin{enumerate}

\item Use statistics from previous games to choose a prior
  distribution for $\lambda$.

\item Use the score from the first four games to estimate $\lambda$
  for each team.

\item Use the posterior distributions of $\lambda$ to compute the
  distribution of goals for each team, the distribution of the
  goal differential, and the probability that each team wins
  the next game.

\item Compute the probability that each team wins the series.

\end{enumerate}

To choose a prior distribution, I got some statistics from
\url{http://www.nhl.com}, specifically the average goals per game
for each team in the 2010-11 season. The distribution is roughly
Gaussian with mean 2.8 and standard deviation 0.3.
\index{National Hockey League}
\index{NHL}

The Gaussian distribution is continuous, but we'll approximate it with
a discrete Pmf.
\verb"thinkbayes" provides \verb"MakeGaussianPmf" to
do exactly that:
\index{numpy}
\index{Gaussian distribution}

\begin{verbatim}
def MakeGaussianPmf(mu, sigma, num_sigmas, n=101):
    pmf = Pmf()
    low = mu - num_sigmas*sigma
    high = mu + num_sigmas*sigma

    for x in numpy.linspace(low, high, n):
        p = scipy.stats.norm.pdf(x, mu, sigma)
        pmf.Set(x, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt mu} and {\tt sigma} are the mean and standard deviation of the
Gaussian distribution. \verb"num_sigmas" is the number of standard
deviations above and below the mean that the Pmf will span, and {\tt
n} is the number of values in the Pmf.

Again we use {\tt numpy.linspace} to make an array of {\tt n}
equally spaced values between {\tt low} and {\tt high}, including
both.

\verb"norm.pdf" evaluates the Gaussian probability density function (PDF).
\index{PDF}
\index{probability density function}

Getting back to the hockey problem, here's the definition for a suite
of hypotheses about the value of $\lambda$:

\begin{verbatim}
class Hockey(thinkbayes.Suite):

    def __init__(self, name=''):
        pmf = thinkbayes.MakeGaussianPmf(2.7, 0.3, 4)
        thinkbayes.Suite.__init__(self, pmf, name=name)
\end{verbatim}

So the prior distribution is Gaussian with mean 2.7, standard deviation
0.3, and it spans 4 sigmas above and below the mean.

As always, we have to decide how to represent each hypothesis; in
this case I represent the hypothesis that $\lambda=x$ with the
floating-point value {\tt x}.


\section{Poisson processes}

In mathematical statistics, a {\bf process} is a stochastic model of a
physical system (``stochastic'' means that the model has some kind of
randomness in it). For example, a Bernoulli process is a model of a
sequence of events, called trials, in which each trial has two
possible outcomes, like success and failure.
So a Bernoulli process
is a natural model for a series of coin flips, or a series of shots on
goal.
\index{process}
\index{Poisson process}

A Poisson process is the continuous version of a Bernoulli process,
where an event can occur at any point in time with equal probability.
Poisson processes can be used to model customers arriving in a store,
buses arriving at a bus stop, or goals scored in a hockey game.
\index{Bernoulli process}

In many real systems the probability of an event changes over time.
Customers are more likely to go to a store at certain times of day,
buses are supposed to arrive at fixed intervals, and goals are more
or less likely at different times during a game.

But all models are based on simplifications, and in this case modeling
a hockey game with a Poisson process is a reasonable choice. Heuer,
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
and come to the same conclusion; see
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}.
\index{Heuer, Andreas}

The benefit of using this model is that we can compute the distribution
of goals per game efficiently, as well as the distribution of time
between goals.
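These two distributions are linked. Here is a quick simulation (mine,
not the book's code) that draws exponentially distributed times between
goals at rate {\tt lam} and counts how many goals land in one game;
averaged over many simulated games, the count comes out close to
{\tt lam}, as the Poisson model predicts:

```python
import random

# Simulate a Poisson process: exponential gaps between goals at rate
# lam, counting goals that fall within one game (one unit of time).
def simulate_game(lam, rng):
    t, goals = 0.0, 0
    while True:
        t += rng.expovariate(lam)   # time until the next goal
        if t > 1.0:
            return goals
        goals += 1

rng = random.Random(17)             # fixed seed for repeatability
lam = 2.8
games = [simulate_game(lam, rng) for _ in range(20000)]
avg = sum(games) / len(games)
```

The average number of goals per simulated game is within a few
percent of {\tt lam}.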
Specifically, if the average number of goals
in a game is {\tt lam}, the distribution of goals per game is
given by the Poisson PMF:
\index{Poisson distribution}

\begin{verbatim}
def EvalPoissonPmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)
\end{verbatim}

And the distribution of time between goals is given by the
exponential PDF:
\index{exponential distribution}

\begin{verbatim}
def EvalExponentialPdf(x, lam):
    return lam * math.exp(-lam * x)
\end{verbatim}

I use the variable
{\tt lam} because {\tt lambda} is a reserved keyword in Python.
Both of these functions are in \verb"thinkbayes.py".


\section{The posteriors}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey1.pdf}}
\caption{Posterior distribution of the number of
goals per game.}
\label{fig.hockey1}
\end{figure}

Now we can compute the likelihood that a team with a hypothetical
value of {\tt lam} scores {\tt k} goals in a game:

\begin{verbatim}
# class Hockey

    def Likelihood(self, data, hypo):
        lam = hypo
        k = data
        like = thinkbayes.EvalPoissonPmf(k, lam)
        return like
\end{verbatim}

Each hypothesis is a possible value of $\lambda$; {\tt
data} is the observed number of goals, {\tt k}.

With the likelihood function in place, we can make a suite for each
team and update them with the scores from the first four games.

\begin{verbatim}
suite1 = Hockey('bruins')
suite1.UpdateSet([0, 2, 8, 4])

suite2 = Hockey('canucks')
suite2.UpdateSet([1, 3, 1, 0])
\end{verbatim}

Figure~\ref{fig.hockey1} shows the resulting posterior distributions
for {\tt lam}.
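The update just shown can be sketched without the {\tt thinkbayes}
machinery. In this sketch (mine, not the book's code) a plain dict
stands in for the Suite, and a hypothetical three-point prior over
{\tt lam} replaces the Gaussian prior, so the numbers are for
illustration only:

```python
import math

# Stripped-down version of the update: multiply each hypothesis by
# the Poisson likelihood of each observed score, then renormalize.
def eval_poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def update_set(suite, data):
    for k in data:
        for lam in suite:
            suite[lam] *= eval_poisson_pmf(k, lam)
    total = sum(suite.values())
    return {lam: p / total for lam, p in suite.items()}

prior = {2.0: 1/3, 3.0: 1/3, 4.0: 1/3}             # hypothetical prior
posterior = update_set(dict(prior), [0, 2, 8, 4])  # Bruins' scores
best = max(posterior, key=posterior.get)
```

With these four scores (mean 3.5), the posterior concentrates on the
larger values of {\tt lam}, as expected.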
Based on the first four games, the most likely
values for {\tt lam} are 2.6 for the Canucks and 2.9 for the Bruins.


\section{The distribution of goals}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey2.pdf}}
\caption{Distribution of goals in a single game.}
\label{fig.hockey2}
\end{figure}

To compute the probability that each team wins the next game,
we need to compute the distribution of goals for each team.

If we knew the value of {\tt lam} exactly, we could use the
Poisson distribution again. \verb"thinkbayes" provides a
method that computes a truncated approximation of a Poisson
distribution:
\index{Poisson distribution}

\begin{verbatim}
def MakePoissonPmf(lam, high):
    pmf = Pmf()
    for k in xrange(0, high+1):
        p = EvalPoissonPmf(k, lam)
        pmf.Set(k, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

The range of values in the computed Pmf is from 0 to {\tt high}.
So if the value of {\tt lam} were exactly 3.4, we would compute:

\begin{verbatim}
lam = 3.4
goal_dist = thinkbayes.MakePoissonPmf(lam, 10)
\end{verbatim}

I chose the upper bound, 10, because the probability of scoring
more than 10 goals in a game is quite low.

That's simple enough so far; the problem is that we don't know
the value of {\tt lam} exactly.
Instead, we have a distribution
of possible values for {\tt lam}.

For each value of {\tt lam}, the distribution of goals is Poisson.
So the overall distribution of goals is a mixture of these
Poisson distributions, weighted according to the probabilities
in the distribution of {\tt lam}.
\index{mixture}
\index{Poisson distribution}

Given the posterior distribution of {\tt lam}, here's the code
that makes the distribution of goals:

\begin{verbatim}
def MakeGoalPmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakePoissonPmf(lam, 10)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

For each value of {\tt lam} we make a Poisson Pmf and add it to the
meta-Pmf. I call it a meta-Pmf because it is a Pmf that contains
Pmfs as its values.
\index{meta-Pmf}

Then we use \verb"MakeMixture" to compute the mixture
(we saw {\tt MakeMixture} in Section~\ref{mixture}).
\index{mixture}
\index{MakeMixture}

Figure~\ref{fig.hockey2} shows the resulting distribution of goals for
the Bruins and Canucks. The Bruins are less likely to
score 3 goals or fewer in the next game, and more likely to score 4 or
more.


\section{The probability of winning}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey3.pdf}}
\caption{Distribution of time between goals.}
\label{fig.hockey3}
\end{figure}

To get the probability of winning, first we compute the
distribution of the goal differential:

\begin{verbatim}
goal_dist1 = MakeGoalPmf(suite1)
goal_dist2 = MakeGoalPmf(suite2)
diff = goal_dist1 - goal_dist2
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
pairs of values and computes the difference.
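That enumeration can be sketched with plain dicts. This is my own
illustration, not the book's code, and the two goal distributions
here are tiny made-up examples chosen so the arithmetic is easy to
check by hand:

```python
# Sketch of Pmf.__sub__: enumerate all pairs of values, accumulate
# the probability of each difference, then read off win/loss/tie.
def sub_pmfs(pmf1, pmf2):
    diff = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            d = v1 - v2
            diff[d] = diff.get(d, 0.0) + p1 * p2
    return diff

goals1 = {0: 0.3, 1: 0.7}   # hypothetical goal distribution, team 1
goals2 = {0: 0.5, 1: 0.5}   # hypothetical goal distribution, team 2
diff = sub_pmfs(goals1, goals2)

p_win = sum(p for d, p in diff.items() if d > 0)
p_loss = sum(p for d, p in diff.items() if d < 0)
p_tie = diff.get(0, 0.0)
```

For these made-up distributions, team 1 wins with probability 0.35,
loses with probability 0.15, and ties with probability 0.5.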
Subtracting two
distributions is almost the same as adding, which we saw in
Section~\ref{addends}.

If the goal differential is positive, the Bruins win; if negative, the
Canucks win; if 0, it's a tie:

\begin{verbatim}
p_win = diff.ProbGreater(0)
p_loss = diff.ProbLess(0)
p_tie = diff.Prob(0)
\end{verbatim}

With the distributions from the previous section, \verb"p_win"
is 46\%, \verb"p_loss" is 37\%, and \verb"p_tie" is 17\%.

In the event of a tie at the end of ``regulation play,'' the teams play
overtime periods until one team scores. Since the game ends
immediately when the first goal is scored, this overtime format
is known as ``sudden death.''
\index{overtime}
\index{sudden death}


\section{Sudden death}

To compute the probability of winning in a sudden death overtime,
the important statistic is not goals per game, but time until the
first goal. The assumption that goal-scoring is a Poisson process
implies that the time between goals
is exponentially distributed.
\index{Poisson process}
\index{exponential distribution}

Given {\tt lam}, we can compute the time between goals like this:

\begin{verbatim}
lam = 3.4
time_dist = thinkbayes.MakeExponentialPmf(lam, high=2, n=101)
\end{verbatim}

{\tt high} is the upper bound of the distribution. In this case
I chose 2, because the probability of going more than two games
without scoring is small. {\tt n} is the number of values in
the Pmf.

If we know {\tt lam} exactly, that's all there is to it.
But we don't; instead we have a posterior
distribution of possible values.
So as we did with the distribution
of goals, we make a meta-Pmf and compute a mixture of
Pmfs.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}

\begin{verbatim}
def MakeGoalTimePmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakeExponentialPmf(lam, high=2, n=2001)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

Figure~\ref{fig.hockey3} shows the resulting distributions. For
time values less than one period (one third of a game), the Bruins
are more likely to score. The time until the Canucks score is
more likely to be longer.

I set the number of values, {\tt n}, fairly high in order to minimize
the number of ties, since it is not possible for both teams
to score simultaneously.

Now we compute the probability that the Bruins score first:

\begin{verbatim}
time_dist1 = MakeGoalTimePmf(suite1)
time_dist2 = MakeGoalTimePmf(suite2)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)
\end{verbatim}

For the Bruins, the probability of winning in overtime is 52\%.

Finally, the total probability of winning is the chance of
winning at the end of regulation play plus the probability
of winning in overtime.

\begin{verbatim}
p_tie = diff.Prob(0)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)

p_win = diff.ProbGreater(0) + p_tie * p_overtime
\end{verbatim}

For the Bruins, the overall chance of winning the next game is 55\%.

To win the series, the Bruins can either win the next two games
or split the next two and win the third. Again, we can compute
the total probability:

\begin{verbatim}
# win the next two
p_series = p_win**2

# split the next two, win the third
p_series += 2 * p_win * (1-p_win) * p_win
\end{verbatim}

The Bruins' chance of winning the series is 57\%.
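The series arithmetic can be verified directly with the per-game
probability from the text:

```python
# Check the series calculation with p_win = 0.55 per game:
# win the next two outright, or split them and win the third.
p_win = 0.55
p_series = p_win**2                          # win-win
p_series += 2 * p_win * (1 - p_win) * p_win  # win-loss-win or loss-win-win
```

The total is 0.3025 + 0.27225 = 0.57475, which rounds to 57\%.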
And in 2011,
they did.


\section{Discussion}

As always, the analysis in this chapter is based on modeling decisions,
and modeling is almost always an iterative process. In general,
you want to start with something simple that yields an approximate
answer, identify likely sources of error, and look for opportunities
for improvement.
\index{modeling}
\index{iterative modeling}

In this example, I would consider these options:

\begin{itemize}

\item I chose a prior based on the average goals per game for each
  team. But this statistic is averaged across all opponents. Against
  a particular opponent, we might expect more variability. For
  example, if the team with the best offense plays the team with the
  worst defense, the expected goals per game might be several standard
  deviations above the mean.

\item For data I used only the first four games of the championship
  series. If the same teams played each other during the
  regular season, I could use the results from those games as well.
  One complication is that the composition of teams changes during
  the season due to trades and injuries. So it might be best to
  give more weight to recent games.

\item To take advantage of all available information, we could
  use results from all regular season games to estimate each team's
  goal scoring rate, possibly adjusted by estimating
  an additional factor for each pairwise match-up. This approach
  would be more complicated, but it is still feasible.

\end{itemize}

For the first option, we could use the results from the regular season
to estimate the variability across all pairwise match-ups.
Thanks to
Dirk Hoag at \url{http://forechecker.blogspot.com/}, I was able to get
the number of goals scored during regulation play (not overtime) for
each game in the regular season.
\index{Hoag, Dirk}

Teams in different conferences only play each other one or two
times in the regular season, so I focused on pairs that played
each other 4--6 times. For each pair, I computed the average
goals per game, which is an estimate of $\lambda$, then plotted
the distribution of these estimates.

The mean of these estimates is 2.8, again, but the standard
deviation is 0.85, substantially higher than what we got computing
one estimate for each team.

If we run the analysis again with the higher-variance prior, the
probability that the Bruins win the series is 80\%, substantially
higher than the result with the low-variance prior, 57\%.

So it turns out that the results are sensitive to the prior, which
makes sense considering how little data we have to work with. Based
on the difference between the low-variance model and the high-variance
model, it seems worthwhile to put some effort into getting the prior
right.

The code and data for this chapter are available from
\url{http://thinkbayes.com/hockey.py} and
\url{http://thinkbayes.com/hockey_data.csv}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

If buses arrive at a bus stop every 20 minutes, and you
arrive at the bus stop at a random time, your wait time until
the bus arrives is uniformly distributed from 0 to 20 minutes.
\index{bus stop problem}

But in reality, there is variability in the time between
buses. Suppose you are waiting for a bus, and you know the historical
distribution of time between buses. Compute your distribution
of wait times.

Hint: Suppose that the time between buses is either
5 or 10 minutes with equal probability.
What is the probability
that you arrive during one of the 10 minute intervals?

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that passengers arriving at the bus stop are well-modeled
by a Poisson process with parameter $\lambda$. If you arrive at the
stop and find 3 people waiting, what is your posterior distribution
for the time since the last bus arrived?
\index{Poisson process}
\index{bus stop problem}

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that you are an ecologist sampling the insect population in
a new environment. You deploy 100 traps in a test area and come back
the next day to check on them. You find that 37 traps have been
triggered, trapping an insect inside. Once a trap triggers, it
cannot trap another insect until it has been reset.
\index{insect sampling problem}

If you reset the traps and come back in two days, how many traps
do you expect to find triggered? Compute a posterior predictive
distribution for the number of traps.
\index{predictive distribution}

\end{exercise}


\begin{exercise}

Suppose you are the manager of an apartment building with
100 light bulbs in common areas. It is your responsibility
to replace light bulbs when they break.
\index{light bulb problem}

On January 1, all 100 bulbs are working. When you inspect
them on February 1, you find 3 light bulbs out. If you
come back on April 1, how many light bulbs do you expect to
find broken?

In the previous exercise, you could reasonably assume that an event is
equally likely at any time. For light bulbs, the likelihood of
failure depends on the age of the bulb.
Specifically, old bulbs
have an increasing failure rate due to evaporation of the filament.

This problem is more open-ended than some; you will have to make
modeling decisions. You might want to read about the Weibull
distribution
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
Or you might want to look around for information about
light bulb survival curves.
\index{Weibull distribution}

\end{exercise}


\chapter{Observer Bias}
\label{observer}

\section{The Red Line problem}

In Massachusetts, the Red Line is a subway that connects
Cambridge and Boston. When I was working in Cambridge I took the Red
Line from Kendall Square to South Station and caught the commuter rail
to Needham. During rush hour Red Line trains run every 7--8
minutes, on average.
\index{Red Line problem}
\index{Boston}

When I arrived at the station, I could estimate the time until
the next train based on the number of passengers on the platform.
If there were only a few people, I inferred that I just missed
a train and expected to wait about 7 minutes. If there were
more passengers, I expected the train to arrive sooner. But if
there were a large number of passengers, I suspected that
trains were not running on schedule, so I would go back to the
street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian
estimation could help predict my wait time and decide when I
should give up and take a taxi. This chapter presents the
analysis I came up with.

This chapter is based on a project by Brendan Ritter and
Kai Austin, who took a class with me at Olin College.
The code in this chapter is available from
\url{http://thinkbayes.com/redline.py}.
The code I used
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
For more information
see Section~\ref{download}.
\index{Olin College}


\section{The model}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
\caption{PMF of gaps between trains, based on collected data,
smoothed by KDE. {\tt z} is the actual distribution; {\tt zb}
is the biased distribution seen by passengers.}
\label{fig.redline0}
\end{figure}

Before we get to the analysis, we have to make some
modeling decisions. First, I will treat passenger arrivals as
a Poisson process, which means I assume that passengers are equally
likely to arrive at any time, and that they arrive at an unknown
rate, $\lambda$, measured in passengers per minute. Since I
observe passengers during a short period of time, and at the same
time every day, I assume that $\lambda$ is constant.
\index{Poisson process}

On the other hand, the arrival process for trains is not Poisson.
Trains to Boston are supposed to leave from the end of the line
(Alewife station) every 7--8 minutes during peak times, but by the time
they get to Kendall Square, the time between trains varies between 3
and 12 minutes.

To gather data on the time between trains, I wrote a script that
downloads real-time data from
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
trains arriving at Kendall Square, and records their arrival times
in a database. I ran the script from 4pm to 6pm every weekday
for 5 days, and recorded about 15 arrivals per day. Then
I computed the time between consecutive arrivals; the distribution
of these gaps is shown in Figure~\ref{fig.redline0}, labeled {\tt z}.

If you stood on the platform from 4pm to 6pm and recorded the time
between trains, this is the distribution you would see.
But if you
arrive at some random time (without regard to the train schedule) you
would see a different distribution. The average time
between trains, as seen by a random passenger, is substantially
higher than the true average.

Why? Because a passenger is more likely to arrive during a
large interval than a small one. Consider a simple example:
suppose that the time between trains is either 5 minutes
or 10 minutes with equal probability. In that case
the average time between
trains is 7.5 minutes.

But a passenger is more likely to arrive during a 10 minute gap
than a 5 minute gap; in fact, twice as likely. If we surveyed
arriving passengers, we would find that 2/3 of them arrived during
a 10 minute gap, and only 1/3 during a 5 minute gap. So the
average time between trains, as seen by an arriving passenger,
is 8.33 minutes.

This kind of {\bf observer bias} appears in many contexts. Students
think that classes are bigger than they are because more of them are
in the big classes. Airline passengers think that planes are fuller
than they are because more of them are on full flights.
\index{observer bias}

In each case, values from the actual distribution are
oversampled in proportion to their value. In the Red Line example,
a gap that is twice as big is twice as likely to be observed.

So given the actual distribution of gaps, we can compute the
distribution of gaps as seen by passengers. {\tt BiasPmf}
does this computation:

\begin{verbatim}
def BiasPmf(pmf):
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
\end{verbatim}

{\tt pmf} is the actual distribution; \verb"new_pmf" is the
biased distribution. Inside the loop, we multiply the
probability of each value, {\tt x}, by the likelihood it will
be observed, which is proportional to {\tt x}.
Then we
normalize the result.

Figure~\ref{fig.redline0} shows the actual distribution of gaps,
labeled {\tt z}, and the distribution of gaps seen by passengers,
labeled {\tt zb} for ``z biased''.


\section{Wait times}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
\caption{CDF of {\tt z}, {\tt zb}, and the wait time seen
by passengers, {\tt y}.}
\label{fig.redline2}
\end{figure}

Wait time, which I call {\tt y}, is the time between the arrival
of a passenger and the next arrival of a train. Elapsed time, which I
call {\tt x}, is the time between the arrival of the previous
train and the arrival of a passenger. I chose these definitions
so that {\tt zb = x + y}.

Given the distribution of {\tt zb}, we can compute the distribution of
{\tt y}. I'll start with a simple case and then generalize.
Suppose, as in the previous example, that {\tt zb} is either 5 minutes
with probability 1/3, or 10 minutes with probability 2/3.

If we arrive at a random time during a 5 minute gap,
{\tt y} is uniform from 0 to 5 minutes. If we arrive during a 10
minute gap, {\tt y} is uniform from 0 to 10. So the overall
distribution is a mixture of uniform distributions weighted
according to the probability of each gap.
\index{uniform distribution}

The following function takes the distribution of {\tt zb} and
computes the distribution of {\tt y}:

\begin{verbatim}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)

    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{verbatim}

{\tt PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
distribution to its probability.
Then it uses {\tt MakeMixture},
which we saw in Section~\ref{mixture}, to compute the mixture.
\index{mixture}
\index{MakeMixture}
\index{meta-Pmf}

{\tt PmfOfWaitTime} also uses {\tt MakeUniformPmf}, defined here:

\begin{verbatim}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt low} and {\tt high} are the range of the uniform distribution
(both ends included).  Finally, {\tt MakeUniformPmf} uses {\tt
MakeRange}, defined here:

\begin{verbatim}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{verbatim}

{\tt MakeRange} defines a set of possible values for wait time
(expressed in seconds).  By default it divides the range into
10 second intervals.

To encapsulate the process of computing these distributions, I
created a class called {\tt WaitTimeCalculator}:

\begin{verbatim}
class WaitTimeCalculator(object):

    def __init__(self, pmf_z):
        self.pmf_z = pmf_z
        self.pmf_zb = BiasPmf(pmf_z)

        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{verbatim}

The parameter, \verb"pmf_z", is the unbiased distribution of {\tt z}.
\verb"pmf_zb" is the biased distribution of gap time, as seen by
passengers.

\verb"pmf_y" is the distribution of wait time.  \verb"pmf_x" is the
distribution of elapsed time, which is the same as the distribution of
wait time.
To see why, remember that for a particular value of
{\tt zb}, the distribution of {\tt y} is uniform from 0 to {\tt zb}.
Also
%
\begin{verbatim}
x = zb - y
\end{verbatim}
%
So the distribution of {\tt x} is also uniform from 0 to {\tt zb}.

Figure~\ref{fig.redline2} shows the distribution of {\tt z}, {\tt zb},
and {\tt y} based on the data I collected from the Red Line web site.

To present these distributions, I am switching from Pmfs to Cdfs.
Most people are more familiar with Pmfs, but I think Cdfs are easier
to interpret, once you get used to them.  And if you want to plot
several distributions on the same axes, Cdfs are the way to go.
\index{Cdf}
\index{cumulative distribution function}

The mean of {\tt z} is 7.8 minutes.  The mean of {\tt zb} is 8.8
minutes, about 13\% higher.  The mean of {\tt y} is 4.4 minutes, half
the mean of {\tt zb}.

As an aside, the Red Line schedule reports that trains run every
9 minutes during peak times.  This is close to the average of
{\tt zb}, but higher than the average of {\tt z}.  I exchanged email
with a representative of the MBTA, who confirmed that the reported
time between trains is deliberately conservative in order to
account for variability.


\section{Predicting wait times}
\label{elapsed}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
\caption{Prior and posterior of {\tt x} and predicted {\tt y}.}
\label{fig.redline3}
\end{figure}

Let's get back to the motivating question: suppose that when
I arrive at the platform I see 10 people waiting.
How long should I expect to wait until the next train arrives?

As always, let's start with the easiest version of the problem
and work our way up.
Suppose we are given the actual distribution of
{\tt z}, and we know that the passenger arrival rate,
$\lambda$, is 2 passengers per minute.

In that case we can:

\begin{enumerate}

\item Use the distribution of {\tt z} to compute
the prior distribution of {\tt zb}, the time between trains
as seen by a passenger.

\item Then we can use the number of passengers to estimate the distribution
of {\tt x}, the elapsed time since the last train.

\item Finally, we use the relation {\tt y = zb - x} to get the
distribution of {\tt y}.

\end{enumerate}

The first step is to create a {\tt WaitTimeCalculator} that
encapsulates the distributions of {\tt zb}, {\tt x},
and {\tt y}, prior to taking into account the number of
passengers.

\begin{verbatim}
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}

\verb"pmf_z" is the given distribution of gap times.

The next step is to make an {\tt ElapsedTimeEstimator} (defined
below), which encapsulates the posterior distribution of {\tt x} and
the predictive distribution of {\tt y}.
\index{predictive distribution}

\begin{verbatim}
ete = ElapsedTimeEstimator(wtc,
                           lam=2.0/60,
                           num_passengers=15)
\end{verbatim}

The parameters are the {\tt WaitTimeCalculator}, the passenger
arrival rate, {\tt lam} (expressed in passengers per second),
and the observed number of passengers, let's say 15.

Here is the definition of {\tt ElapsedTimeEstimator}:

\begin{verbatim}
class ElapsedTimeEstimator(object):

    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)

        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))

        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{verbatim}

\verb"prior_x" and \verb"post_x" are the prior and
posterior distributions of elapsed time.
\verb"pmf_y" is
the predictive distribution of wait time.

{\tt ElapsedTimeEstimator} uses {\tt Elapsed} and {\tt PredictWaitTime},
defined below.

{\tt Elapsed} is a Suite that represents the hypothetical
distribution of {\tt x}.  The prior distribution of {\tt x}
comes straight from the {\tt WaitTimeCalculator}.  Then we
use the data, which consists of the arrival rate, {\tt lam},
and the number of passengers on the platform, to compute
the posterior distribution.

Here's the definition of {\tt Elapsed}:

\begin{verbatim}
class Elapsed(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{verbatim}

As always, {\tt Likelihood} takes a hypothesis and data, and
computes the likelihood of the data under the hypothesis.
In this case {\tt hypo} is the elapsed time since the last train
and {\tt data} is a tuple of {\tt lam} and the number of
passengers.
\index{likelihood}

The likelihood of the data is the probability of getting
{\tt k} arrivals in {\tt x} time, given arrival rate
{\tt lam}.  We compute that using the PMF of the Poisson
distribution.
\index{Poisson distribution}

Finally, here's the definition of {\tt PredictWaitTime}:

\begin{verbatim}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{verbatim}

\verb"pmf_zb" is the distribution of gaps between trains;
\verb"pmf_x" is the distribution of elapsed time, based on
the observed number of passengers.
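The first line of {\tt PredictWaitTime} subtracts one distribution from another, not one number from another.  As a sketch of what that means, here is the same operation on plain dicts (a hypothetical helper, not part of {\tt thinkbayes}):

```python
def sub_pmfs(pmf1, pmf2):
    # Distribution of (v1 - v2): enumerate all pairs of values
    # and accumulate the product of their probabilities.
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            diff = v1 - v2
            result[diff] = result.get(diff, 0.0) + p1 * p2
    return result

# Toy example: gaps of 5 or 10 minutes, minus elapsed time of 0 or 5.
pmf_zb = {5: 1.0 / 3, 10: 2.0 / 3}
pmf_x = {0: 0.5, 5: 0.5}
sub_pmfs(pmf_zb, pmf_x)   # {5: 1/2, 0: 1/6, 10: 1/3}
```

The real \verb"Pmf.__sub__" does the same enumeration over pairs of values.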
Since {\tt y = zb - x},
we can compute

\begin{verbatim}
pmf_y = pmf_zb - pmf_x
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
all pairs of {\tt zb} and {\tt x}, computes the differences, and adds
the results to \verb"pmf_y".

The resulting Pmf includes some negative values, which we know are
impossible.  For example, if you arrive during a gap of 5 minutes, you
can't wait more than 5 minutes.  {\tt RemoveNegatives} removes the
impossible values from the distribution and renormalizes.

\begin{verbatim}
def RemoveNegatives(pmf):
    for val in pmf.Values():
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{verbatim}

Figure~\ref{fig.redline3} shows the results.  The prior distribution
of {\tt x} is the same as the distribution of {\tt y} in
Figure~\ref{fig.redline2}.  The posterior distribution of {\tt x}
shows that, after seeing 15 passengers on the platform, we believe
that the time since the last train is probably 5-10 minutes.  The
predictive distribution of {\tt y} indicates that we expect the next
train in less than 5 minutes, with about 80\% confidence.
\index{predictive distribution}


\section{Estimating the arrival rate}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
\caption{Prior and posterior distributions of {\tt lam} based
on five days of passenger data.}
\label{fig.redline1}
\end{figure}

The analysis so far has been based on the assumption that we know (1)
the distribution of gaps and (2) the passenger arrival rate.  Now we
are ready to relax the second assumption.

Suppose that you just moved to Boston, so you don't know much about
the passenger arrival rate on the Red Line.  After a few days of
commuting, you could make a guess, at least qualitatively.
With5024a little more effort, you could estimate $\lambda$ quantitatively.5025\index{arrival rate}50265027Each day when you arrive at the platform, you should note the5028time and the number of passengers waiting (if the platform is too5029big, you could choose a sample area). Then you should record your5030wait time and the5031number of new arrivals while you are waiting.50325033After five days, you might have data like this:5034%5035\begin{verbatim}5036k1 y k25037-- --- --503817 4.6 9503922 1.0 0504023 1.4 4504118 5.4 1250424 5.8 115043\end{verbatim}5044%5045where {\tt k1} is the number of passengers waiting when you arrive,5046{\tt y} is your wait time in minutes, and {\tt k2} is the number of5047passengers who arrive while you are waiting.50485049Over the course of one week, you waited 18 minutes and saw 365050passengers arrive, so you would estimate that the arrival rate is50512 passengers per minute. For practical purposes that estimate is5052good enough, but for the sake of completeness I5053will compute a posterior distribution for $\lambda$ and show how5054to use that distribution in the rest of the analysis.50555056{\tt ArrivalRate} is a {\tt Suite} that represents hypotheses about5057$\lambda$. As always, {\tt Likelihood} takes a hypothesis and data,5058and computes the likelihood of the data under the hypothesis.50595060In this case the hypothesis is a value of $\lambda$. The data is a5061pair, {\tt y, k}, where {\tt y} is a wait time and {\tt k} is the5062number of passengers that arrived.50635064\begin{verbatim}5065class ArrivalRate(thinkbayes.Suite):50665067def Likelihood(self, data, hypo):5068lam = hypo5069y, k = data5070like = thinkbayes.EvalPoissonPmf(k, lam * y)5071return like5072\end{verbatim}50735074This {\tt Likelihood} might look familiar; it5075is almost identical to {\tt Elapsed.Likelihood} in5076Section~\ref{elapsed}. 
The difference is that in {\tt
Elapsed.Likelihood} the hypothesis is {\tt x}, the elapsed time; in
{\tt ArrivalRate.Likelihood} the hypothesis is {\tt lam}, the arrival
rate.  But in both cases the likelihood is the probability of seeing
{\tt k} arrivals in some period of time, given {\tt lam}.

{\tt ArrivalRateEstimator} encapsulates the process of estimating
$\lambda$.  The parameter, \verb"passenger_data", is a list
of {\tt k1, y, k2} tuples, as in the table above.
\index{numpy}

\begin{verbatim}
class ArrivalRateEstimator(object):

    def __init__(self, passenger_data):
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60

        self.prior_lam = ArrivalRate(hypos)

        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{verbatim}

\verb"__init__" builds
{\tt hypos}, which is a sequence of hypothetical values for {\tt lam},
then builds the prior distribution, \verb"prior_lam".
The {\tt for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".

Figure~\ref{fig.redline1} shows
the prior and posterior distributions.  As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute.  But the spread of the posterior distribution captures our
uncertainty about $\lambda$ based on a small sample.


\section{Incorporating uncertainty}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of {\tt y} for possible values
of {\tt lam}.
}
\label{fig.redline4}
\end{figure}

Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}

\begin{enumerate}

\item Implement the analysis based on a deterministic value of the
uncertain parameter (in this case $\lambda$).

\item Compute the distribution of the uncertain parameter.

\item Run the analysis for each value of the parameter, and generate a
set of predictive distributions.
\index{predictive distribution}

\item Compute a mixture of the predictive distributions, using the
weights from the distribution of the parameter.
\index{mixture}

\end{enumerate}

We have already done steps (1) and (2).  I wrote a class
called {\tt WaitMixtureEstimator} to handle steps (3) and (4).

\begin{verbatim}
class WaitMixtureEstimator(object):

    def __init__(self, wtc, are, num_passengers=15):
        self.metapmf = thinkbayes.Pmf()

        for lam, prob in sorted(are.post_lam.Items()):
            ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
            self.metapmf.Set(ete.pmf_y, prob)

        self.mixture = thinkbayes.MakeMixture(self.metapmf)
\end{verbatim}

{\tt wtc} is the {\tt WaitTimeCalculator} that contains the
distribution of {\tt zb}.  {\tt are} is the {\tt ArrivalRateEstimator}
that contains the distribution of {\tt lam}.

The first line makes a meta-Pmf that maps from each possible
distribution of {\tt y} to its probability.  For each value
of {\tt lam}, we use {\tt ElapsedTimeEstimator} to
compute the corresponding distribution of
{\tt y} and store it in the meta-Pmf.
Then
we use {\tt MakeMixture} to compute the mixture.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}

%For purposes of comparison, I also compute the distribution of
%{\tt y} based on a single point estimate of {\tt lam}, which is
%the mean of the posterior distribution.

Figure~\ref{fig.redline4} shows the results.  The shaded lines
in the background are the distributions of {\tt y} for each value
of {\tt lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.

In this case we could get a very similar result using a single point
estimate of {\tt lam}.  So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.

In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output.  In this case, posterior variability
in {\tt lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}


\section{Decision analysis}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform.}
\label{fig.redline5}
\end{figure}

At this point we can use the number of passengers on the platform
to predict the distribution of wait times.  Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}

Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail.
Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.

In that case I would like to know the probability that {\tt y} exceeds
15 minutes as a function of \verb"num_passengers".  It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".

But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.

I only have data from one week,
and the longest delay I observed was 15 minutes.  So I can't
estimate the frequency of longer delays accurately.

However, I can use previous observations to make at least a coarse
estimate.  When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop.  So I estimate that there are about
3 major delays per year.

But remember that my observations are biased.  I am more likely
to observe long delays because they affect a large number
of passengers.  So we should treat my observations as a sample
of {\tt zb} rather than {\tt z}.  Here's how we can do that.
\index{observer bias}

During my year of commuting, I took the Red Line home about 220
times.
So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:

\begin{verbatim}
n = 220
cdf_z = thinkbayes.MakeCdfFromList(gap_times)
sample_z = cdf_z.Sample(n)
pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{verbatim}

Next I bias \verb"pmf_z" to get the distribution of
{\tt zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):

\begin{verbatim}
cdf_zp = BiasPmf(pmf_z).MakeCdf()
sample_zb = cdf_zp.Sample(n) + [1800, 2400, 3000]
\end{verbatim}

{\tt Cdf.Sample} is more efficient than {\tt Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.

Next I use the sample of {\tt zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:

\begin{verbatim}
pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
xs = MakeRange(low=60)
pmf_zb = pdf_zb.MakePmf(xs)
\end{verbatim}

Finally I unbias the distribution of {\tt zb} to get the
distribution of {\tt z}, which I use to create the
{\tt WaitTimeCalculator}:

\begin{verbatim}
pmf_z = UnbiasPmf(pmf_zb)
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}

This process is complicated, but
all of the steps are operations we have seen before.
Now we are ready to compute the probability of a long wait.

\begin{verbatim}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{verbatim}

Given the number of passengers on the platform,
{\tt ProbLongWait}
makes an {\tt ElapsedTimeEstimator},
extracts the distribution of wait time, and
computes
the probability that wait time
exceeds {\tt minutes}.

Figure~\ref{fig.redline5} shows the result.
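One helper above, {\tt UnbiasPmf}, is not listed; it inverts {\tt BiasPmf}, dividing each probability by its value instead of multiplying.  Here is a sketch of the idea, with a plain dict standing in for a Pmf (a hypothetical helper; the real version operates on {\tt Pmf} objects):

```python
def unbias(pmf):
    # Invert observer bias: values were oversampled in proportion
    # to their size, so downweight each value x by 1/x, then renormalize.
    new_pmf = {x: p / x for x, p in pmf.items()}
    total = sum(new_pmf.values())
    return {x: p / total for x, p in new_pmf.items()}

# The biased gaps from the earlier example: passengers see the
# 10 minute gap twice as often as the 5 minute gap.
biased = {5: 1.0 / 3, 10: 2.0 / 3}
actual = unbias(biased)   # recovers {5: 0.5, 10: 0.5}
```

Applying {\tt unbias} after {\tt BiasPmf} recovers the original distribution, which is the sanity check you would want here.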
When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.

If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.

Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.

\section{Discussion}

The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day.  For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions.  For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of {\tt lam} would be too low, so the
estimates of {\tt x} and {\tt y} would be too high.

If special events are as common as major delays, it would
be important to include them in the model.  We could do that by
extending the distribution of {\tt lam} to include occasional
large values.

We started with the assumption that we know the
distribution of {\tt z}.
As an alternative, a passenger could estimate {\tt z}, but it would
not be easy.
As a passenger, you observe
only your own wait time, {\tt y}.  Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, {\tt z}.

However, we could make some inferences about {\tt zb}.
If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, {\tt x}.  Then we observe
{\tt y}.  If we add the posterior distribution of {\tt x} to
the observed {\tt y}, we get a distribution that represents
our posterior belief about the observed value of {\tt zb}.

We can use this distribution to update our beliefs about the
distribution of {\tt zb}.  Finally, we can compute the
inverse of {\tt BiasPmf} to get from the distribution of {\tt zb}
to the distribution of {\tt z}.

I leave this analysis as an exercise for the
reader.  One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lambda$.  Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm.  $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm.  What is the posterior distribution of $\lambda$?

\end{quote}

You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.

\end{exercise}


\chapter{Two Dimensions}
\label{paintball}

\section{Paintball}

Paintball is a sport in which competing teams try to shoot each other
with guns that fire paint-filled pellets that break on impact, leaving
a colorful mark on the target.
It is usually played in an
arena decorated with barriers and other objects that can be
used as cover.
\index{Paintball problem}

Suppose you are playing paintball in an indoor arena 30 feet
wide and 50 feet long.  You are standing near one of the 30 foot
walls, and you suspect that one of your opponents has taken cover
nearby.  Along the wall, you see several paint spatters, all the same
color, that you think your opponent fired recently.

The spatters are at 15, 16, 18, and 21 feet, measured from the
lower-left corner of the room.  Based on these data, where do you
think your opponent is hiding?

Figure~\ref{fig.paintball} shows a diagram of the arena.  Using the
lower-left corner of the room as the origin, I denote the unknown
location of the shooter with coordinates $\alpha$ and $\beta$, or {\tt
alpha} and {\tt beta}.  The location of a spatter is labeled
{\tt x}.  The angle the opponent shoots at is $\theta$ or {\tt theta}.

The Paintball problem is a modified version
of the Lighthouse problem, a common example of Bayesian analysis.  My
notation follows the presentation of the problem in D.S.~Sivia's {\it Data
Analysis: A Bayesian Tutorial, Second Edition} (Oxford, 2006).
\index{Sivia, D.S.}

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{The suite}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball.pdf}}
\caption{Diagram of the layout for the paintball problem.}
\label{fig.paintball}
\end{figure}

To get started, we need a Suite that represents a set of hypotheses
about the location of the opponent.
Each hypothesis is a
pair of coordinates: {\tt (alpha, beta)}.

Here is the definition of the Paintball suite:

\begin{verbatim}
class Paintball(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, alphas, betas, locations):
        self.locations = locations
        pairs = [(alpha, beta)
                 for alpha in alphas
                 for beta in betas]
        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt Paintball} inherits from {\tt Suite}, which we have seen before,
and {\tt Joint}, which I will explain soon.
\index{Joint pmf}

{\tt alphas} is the list of possible values for {\tt alpha}; {\tt
betas} is the list of values for {\tt beta}.  {\tt pairs} is a list
of all {\tt (alpha, beta)} pairs.

{\tt locations} is a list of possible locations along
the wall; it is stored for use in {\tt Likelihood}.

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball2.pdf}}
\caption{Posterior CDFs for {\tt alpha} and {\tt beta}, given the data.}
\label{fig.paintball2}
\end{figure}

The room is 30 feet wide and 50 feet long, so here's the code that
creates the suite:

\begin{verbatim}
alphas = range(0, 31)
betas = range(1, 51)
locations = range(0, 31)

suite = Paintball(alphas, betas, locations)
\end{verbatim}

This prior distribution assumes that all locations in the room are
equally likely.
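Concretely, the prior contains $31 \times 50 = 1550$ pairs, each with the same probability.  A plain-Python sketch of the construction (\verb"Suite.__init__" does the equivalent bookkeeping, assuming it assigns equal weights and normalizes):

```python
alphas = range(0, 31)   # 31 possible values of alpha
betas = range(1, 51)    # 50 possible values of beta

pairs = [(alpha, beta) for alpha in alphas for beta in betas]

# Equal prior probability for every (alpha, beta) pair.
prior = dict((pair, 1.0 / len(pairs)) for pair in pairs)

n_pairs = len(pairs)        # 1550 hypotheses
p_each = prior[(15, 25)]    # 1/1550, same for every pair
```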
Given a map of the room, we might choose a more
detailed prior, but we'll start simple.


\section{Trigonometry}

Now we need a likelihood function, which means we have to figure
out the likelihood of hitting any spot along the wall, given
the location of the opponent.
\index{likelihood}

As a simple model, imagine that the opponent is like a rotating
turret, equally likely to shoot in any direction.
In that case, he is most likely to hit
the wall at location {\tt alpha}, and less likely to hit the wall far
away from {\tt alpha}.
\index{trigonometry}

With a little trigonometry, we can compute the probability of hitting
any spot along the wall.  Imagine that the shooter fires a shot at
angle $\theta$; the pellet would hit the wall at location $x$, where
%
\[ x - \alpha = \beta \tan \theta \]
%
Solving this equation for $\theta$ yields
%
\[ \theta = \tan^{-1} \left( \frac{x - \alpha}{\beta} \right) \]
%
So given a location on the wall, we can find $\theta$.

Taking the derivative of the first equation with respect to
$\theta$ yields
%
\[ \frac{dx}{d\theta} = \frac{\beta}{\cos^2 \theta} \]
%
This derivative is what I'll call the ``strafing speed'',
which is the speed of the target location along the wall as $\theta$
increases.  The probability of hitting a given point on the wall is
inversely related to strafing speed.
\index{strafing speed}

If we know the coordinates of the shooter and a location
along the wall, we can compute strafing speed:

\begin{verbatim}
def StrafingSpeed(alpha, beta, x):
    theta = math.atan2(x - alpha, beta)
    speed = beta / math.cos(theta)**2
    return speed
\end{verbatim}

{\tt alpha} and {\tt beta} are the coordinates of the shooter;
{\tt x} is the location of a spatter.
The result is
the derivative of {\tt x} with respect to {\tt theta}.

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball1.pdf}}
\caption{PMF of location given {\tt alpha=10}, for several values of
{\tt beta}.}
\label{fig.paintball1}
\end{figure}

Now we can compute a Pmf that represents the probability of hitting
any location on the wall.  {\tt MakeLocationPmf} takes {\tt alpha} and
{\tt beta}, the coordinates of the shooter, and {\tt locations}, a
list of possible values of {\tt x}.

\begin{verbatim}
def MakeLocationPmf(alpha, beta, locations):
    pmf = thinkbayes.Pmf()
    for x in locations:
        prob = 1.0 / StrafingSpeed(alpha, beta, x)
        pmf.Set(x, prob)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt MakeLocationPmf} computes the probability of hitting
each location, which is inversely related to
strafing speed.  The result is a Pmf of locations and their
probabilities.

Figure~\ref{fig.paintball1} shows the Pmf of location with {\tt alpha
= 10} and a range of values for {\tt beta}.  For all values of {\tt beta}
the most likely spatter location is {\tt x = 10}; as {\tt beta}
increases, so does the spread of the Pmf.


\section{Likelihood}

Now all we need is a likelihood function.
We can use {\tt MakeLocationPmf} to compute the likelihood
of any value of {\tt x}, given the coordinates of the opponent.
\index{likelihood}

\begin{verbatim}
def Likelihood(self, data, hypo):
    alpha, beta = hypo
    x = data
    pmf = MakeLocationPmf(alpha, beta, self.locations)
    like = pmf.Prob(x)
    return like
\end{verbatim}

Again, {\tt alpha} and {\tt beta} are the hypothetical coordinates of
the shooter, and {\tt x} is the location of an observed spatter.

{\tt pmf} contains the probability of each location, given the
coordinates of the shooter.
From this Pmf, we select the probability
of the observed location.

And we're done.  To update the suite, we can use {\tt UpdateSet},
which is inherited from {\tt Suite}.

\begin{verbatim}
suite.UpdateSet([15, 16, 18, 21])
\end{verbatim}

The result is a distribution that maps each {\tt (alpha, beta)} pair
to a posterior probability.


\section{Joint distributions}

When each value in a distribution is a tuple of variables, it is
called a {\bf joint distribution} because it represents the
distributions of the variables together, that is ``jointly''.
A joint distribution contains the distributions of the variables,
as well as information about the relationships among them.
\index{joint distribution}

Given a joint distribution, we can compute the distributions
of each variable independently, which are called the {\bf marginal
distributions}.
\index{marginal distribution}
\index{Joint}

{\tt thinkbayes.Joint} provides a method that computes marginal
distributions:

\begin{verbatim}
# class Joint:

    def Marginal(self, i):
        pmf = Pmf()
        for vs, prob in self.Items():
            pmf.Incr(vs[i], prob)
        return pmf
\end{verbatim}

{\tt i} is the index of the variable we want; in this example
{\tt i=0} indicates the distribution of {\tt alpha}, and
{\tt i=1} indicates the distribution of {\tt beta}.

Here's the code that extracts the marginal distributions:

\begin{verbatim}
marginal_alpha = suite.Marginal(0)
marginal_beta = suite.Marginal(1)
\end{verbatim}

Figure~\ref{fig.paintball2} shows the results (converted to CDFs).
The median value for {\tt alpha} is 18, near the center of mass of
the observed spatters.
For {\tt beta}, the most likely values are
close to the wall, but beyond 10 feet the distribution is almost
uniform, which indicates that the data do not distinguish strongly
between these possible locations.

Given the posterior marginals, we can compute credible intervals
for each coordinate independently:
\index{credible interval}

\begin{verbatim}
print 'alpha CI', marginal_alpha.CredibleInterval(50)
print 'beta CI', marginal_beta.CredibleInterval(50)
\end{verbatim}

The 50\% credible intervals are {\tt (14, 21)} for {\tt alpha} and
{\tt (5, 31)} for {\tt beta}.  So the data provide evidence that the
shooter is in the near side of the room.  But it is not strong
evidence.  The 90\% credible intervals cover most of the room!
\index{evidence}


\section{Conditional distributions}
\label{conditional}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball3.pdf}}
\caption{Posterior distributions for {\tt alpha} conditioned on several values
of {\tt beta}.}
\label{fig.paintball3}
\end{figure}

The marginal distributions contain information about the variables
independently, but they do not capture the dependence between
variables, if any.
\index{independence}
\index{dependence}

One way to visualize dependence is by computing {\bf conditional
distributions}.
{\tt thinkbayes.Joint} provides a method that
does that:
\index{conditional distribution}
\index{Joint}

\begin{verbatim}
    def Conditional(self, i, j, val):
        pmf = Pmf()
        for vs, prob in self.Items():
            if vs[j] != val: continue
            pmf.Incr(vs[i], prob)

        pmf.Normalize()
        return pmf
\end{verbatim}

Again, {\tt i} is the index of the variable we want; {\tt j}
is the index of the conditioning variable, and {\tt val} is the
conditioning value.

The result is the distribution of the $i$th variable under the
condition that the $j$th variable is {\tt val}.

For example, the following code computes the conditional distributions
of {\tt alpha} for a range of values of {\tt beta}:

\begin{verbatim}
betas = [10, 20, 40]

for beta in betas:
    cond = suite.Conditional(0, 1, beta)
\end{verbatim}

Figure~\ref{fig.paintball3} shows the results, which we could
fully describe as ``posterior conditional marginal distributions.''
Whew!

If the variables were independent, the conditional distributions would
all be the same. Since they are all different, we can tell the
variables are dependent. For example, if we know (somehow) that
{\tt beta = 10}, the conditional distribution of {\tt alpha} is fairly
narrow. For larger values of {\tt beta}, the distribution of
{\tt alpha} is wider.
\index{dependence}
\index{independence}


\section{Credible intervals}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball5.pdf}}
\caption{Credible intervals for the coordinates of the opponent.}
\label{fig.paintball5}
\end{figure}

Another way to visualize the posterior joint distribution is to
compute credible intervals. When we looked at credible intervals
in Section~\ref{credible},
I skipped over a subtle point: for a given distribution, there
are many intervals with the same level of credibility.
For example,
if you want a 50\% credible interval, you could choose any set of
values whose probability adds up to 50\%.

When the values are one-dimensional, it is most common to choose
the {\bf central credible interval}; for example, the central 50\%
credible interval contains all values between the 25th and 75th
percentiles.
\index{central credible interval}

In multiple dimensions it is less obvious what the right credible
interval should be. The best choice might depend on context, but
one common choice is the maximum likelihood credible interval, which
contains the most likely values that add up to 50\% (or some other
percentage).
\index{maximum likelihood}

{\tt thinkbayes.Joint} provides a method that computes maximum
likelihood credible intervals.
\index{Joint}

\begin{verbatim}
# class Joint:

    def MaxLikeInterval(self, percentage=90):
        interval = []
        total = 0

        t = [(prob, val) for val, prob in self.Items()]
        t.sort(reverse=True)

        for prob, val in t:
            interval.append(val)
            total += prob
            if total >= percentage/100.0:
                break

        return interval
\end{verbatim}

The first step is to make a list of the values in the suite,
sorted in descending order by probability. Next we traverse the
list, adding each value to the interval, until the total
probability exceeds {\tt percentage}. The result is a list
of values from the suite.
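Here is a standalone sketch of the same greedy selection, using a plain dictionary with made-up probabilities rather than the paintball posterior:

```python
def max_like_interval(pmf, percentage=90):
    """Accumulate the most likely values until their total probability
    reaches the requested percentage."""
    interval = []
    total = 0.0
    for val, prob in sorted(pmf.items(), key=lambda pair: pair[1], reverse=True):
        interval.append(val)
        total += prob
        if total >= percentage / 100.0:
            break
    return interval

pmf = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
print(max_like_interval(pmf, 50))   # ['a', 'b']
```

With {\tt percentage=50}, the two most likely values already cover 70\% of the probability, so the interval stops there; a larger percentage pulls in more values.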
Notice that this set of values
is not necessarily contiguous.

To visualize the intervals, I wrote a function that ``colors''
each value according to how many intervals it appears in:

\begin{verbatim}
def MakeCrediblePlot(suite):
    d = dict((pair, 0) for pair in suite.Values())

    percentages = [75, 50, 25]
    for p in percentages:
        interval = suite.MaxLikeInterval(p)
        for pair in interval:
            d[pair] += 1

    return d
\end{verbatim}

{\tt d} is a dictionary that maps from each value in the suite
to the number of intervals it appears in. The loop computes intervals
for several percentages and modifies {\tt d}.

Figure~\ref{fig.paintball5} shows the result. The 25\% credible
interval is the darkest region near the bottom wall. For higher
percentages, the credible interval is bigger, of course, and skewed
toward the right side of the room.


\section{Discussion}

This chapter shows that the Bayesian framework from the previous
chapters can be extended to handle a two-dimensional parameter space.
The only difference is that each hypothesis is represented by
a tuple of parameters.

I also presented {\tt Joint}, which is a parent class that provides
methods that apply to joint distributions:
{\tt Marginal}, {\tt Conditional}, and {\tt MaxLikeInterval}.
In object-oriented terms,
{\tt Joint} is a mixin (see \url{http://en.wikipedia.org/wiki/Mixin}).
\index{Joint}

There is a lot of new vocabulary in this chapter, so let's review:

\begin{description}

\item[Joint distribution:] A distribution that represents all possible
values in a multidimensional space and their probabilities. The
example in this chapter is a two-dimensional space made up of the
coordinates {\tt alpha} and {\tt beta}.
The joint distribution
represents the probability of each ({\tt alpha}, {\tt beta}) pair.

\item[Marginal distribution:] The distribution of one parameter in a
joint distribution, treating the other parameters as unknown. For
example, Figure~\ref{fig.paintball2} shows the distributions of
{\tt alpha} and {\tt beta} independently.

\item[Conditional distribution:] The distribution of one parameter in
a joint distribution, conditioned on one or more of the other
parameters. Figure~\ref{fig.paintball3} shows several distributions for
{\tt alpha}, conditioned on different values of {\tt beta}.

\end{description}

Given the joint distribution, you can compute marginal and conditional
distributions. With enough conditional distributions, you could
re-create the joint distribution, at least approximately. But given
the marginal distributions you cannot re-create the joint distribution
because you have lost information about the dependence between
variables.
\index{joint distribution}
\index{conditional distribution}
\index{marginal distribution}

If there are $n$ possible values for each of two parameters, most
operations on the joint distribution take time proportional to $n^2$.
If there are $d$ parameters, run time is proportional to $n^d$,
which quickly becomes impractical as the number of dimensions increases.

If you can process a million hypotheses in a reasonable amount of time,
you could handle two dimensions with 1000 values for each parameter,
or three dimensions with 100 values each, or six dimensions with 10
values each.

If you need more dimensions, or more values per dimension, there are
optimizations you can try.
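The back-of-the-envelope arithmetic behind those grid sizes is easy to check:

```python
# with a budget of a million hypotheses, a grid with n values per
# parameter and d parameters must satisfy n**d <= budget
budget = 10 ** 6

print(1000 ** 2 == budget)   # two dimensions, 1000 values each
print(100 ** 3 == budget)    # three dimensions, 100 values each
print(10 ** 6 == budget)     # six dimensions, 10 values each
```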
I present an example of these optimizations
in Chapter~\ref{species}.

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
In our simple model, the opponent is equally likely to shoot in any
direction. As an exercise, let's consider improvements to this model.

The analysis in this chapter suggests that a shooter is most likely to
hit the closest wall. But in reality, if the opponent is close to a
wall, he is unlikely to shoot at the wall because he is unlikely to
see a target between himself and the wall.

Design an improved model that takes this behavior
into account. Try to find a model that is more realistic, but not
too complicated.
\end{exercise}


\chapter{Approximate Bayesian Computation}

\section{The Variability Hypothesis}

I have a soft spot for crank science. Recently I visited Norumbega
Tower, which is an enduring monument to the crackpot theories of Eben
Norton Horsford, inventor of double-acting baking powder and fake
history. But that's not what this chapter is about.
\index{crank science}
\index{Horsford, Eben Norton}

This chapter is about the Variability Hypothesis, which
\index{Variability Hypothesis}
\index{Meckel, Johann}

\begin{quote}
``originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men.
Because he
considered males to be the `superior animal,' Meckel concluded that
females' lack of variation was a sign of inferiority.''

From \url{http://en.wikipedia.org/wiki/Variability_hypothesis}.
\end{quote}

I particularly like that last part, because I suspect that if it turned
out that women are actually more variable, Meckel would take that as a
sign of inferiority, too. Anyway, you will not be surprised to hear
that the evidence for the Variability Hypothesis is weak.
\index{evidence}

Nevertheless, it came up in my class recently when we looked at data
from the CDC's Behavioral Risk Factor Surveillance System (BRFSS),
specifically the self-reported heights of adult American men and women.
The dataset includes responses from 154407 men and 254722 women.
Here's what we found:
\index{Centers for Disease Control}
\index{CDC}
\index{BRFSS}
\index{Behavioral Risk Factor Surveillance System}

\begin{itemize}

\item The average height for men is 178 cm; the average height for
women is 163 cm. So men are taller, on average. No surprise there.

\item For men the standard deviation is 7.7 cm; for women it is 7.3
cm. So in absolute terms, men's heights are more variable.

\item But to compare variability between groups, it is more meaningful
to use the coefficient of variation (CV), which is the standard
deviation divided by the mean. It is a dimensionless measure of
variability relative to scale. For men CV is 0.0433; for women it
is 0.0444.
\index{coefficient of variation}

\end{itemize}

That's very close, so we could conclude that this dataset provides
weak evidence against the Variability Hypothesis. But we can use
Bayesian methods to make that conclusion more precise.
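A quick check of the arithmetic, using the rounded summary statistics above; the values in the text come from the full dataset, so the match is only approximate:

```python
def coef_variation(mean, std):
    """Coefficient of variation: standard deviation relative to the mean."""
    return std / mean

cv_men = coef_variation(178, 7.7)
cv_women = coef_variation(163, 7.3)

print(round(cv_men, 4), round(cv_women, 4))   # roughly 0.0433 and 0.0448
print(cv_men < cv_women)                      # women have the slightly higher CV
```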
And answering
this question gives me a chance to demonstrate some techniques
for working with large datasets.
\index{height}

I will proceed in a few steps:

\begin{enumerate}

\item We'll start with the simplest implementation, but it only works
for datasets smaller than 1000 values.

\item By computing probabilities under a log transform, we can scale
up to the full size of the dataset, but the computation gets slow.

\item Finally, we speed things up substantially with Approximate
Bayesian Computation, also known as ABC.

\end{enumerate}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Mean and standard deviation}

In Chapter~\ref{paintball} we estimated two parameters simultaneously
using a joint distribution. In this chapter we use the same
method to estimate the parameters of a Gaussian distribution:
the mean, {\tt mu}, and the standard deviation, {\tt sigma}.
\index{Gaussian distribution}

For this problem, I define a Suite called {\tt Height} that
represents a map from each {\tt mu, sigma} pair to its probability:

\begin{verbatim}
class Height(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, mus, sigmas):
        pairs = [(mu, sigma)
                 for mu in mus
                 for sigma in sigmas]

        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt mus} is a sequence of possible values for {\tt mu}; {\tt sigmas}
is a sequence of values for {\tt sigma}. The prior distribution
is uniform over all {\tt mu, sigma} pairs.
\index{Joint}
\index{joint distribution}

The likelihood function is easy. Given hypothetical values
of {\tt mu} and {\tt sigma}, we compute the likelihood
of a particular value, {\tt x}.
That's what {\tt EvalGaussianPdf}
does, so all we have to do is use it:
\index{likelihood}

\begin{verbatim}
# class Height

    def Likelihood(self, data, hypo):
        x = data
        mu, sigma = hypo
        like = thinkbayes.EvalGaussianPdf(x, mu, sigma)
        return like
\end{verbatim}

If you have studied statistics from a mathematical perspective,
you know that when you evaluate a PDF, you get a probability
density. In order to get a probability, you have to integrate
probability densities over some range.
\index{density}

But for our purposes, we don't need a probability; we just
need something proportional to the probability we want.
A probability density does that job nicely.

The hardest part of this problem turns
out to be choosing appropriate ranges for {\tt mus} and
{\tt sigmas}. If the range is too small, we omit some
possibilities with non-negligible probability and get the
wrong answer. If the range is too big, we get the right answer,
but waste computational power.

So this is an opportunity to use classical estimation to
make Bayesian techniques more efficient.
Specifically, we can use
classical estimators to find a likely location for {\tt mu} and
{\tt sigma}, and use the standard errors of those estimates to choose a
likely spread.
\index{classical estimation}

If the true parameters of the distribution are $\mu$ and $\sigma$, and
we take a sample of $n$ values, an estimator of $\mu$ is the sample
mean, {\tt m}.

And an estimator of $\sigma$ is the sample standard
deviation, {\tt s}.

The standard error of the estimated $\mu$ is $s / \sqrt{n}$
and the standard error of the estimated $\sigma$ is
$s / \sqrt{2 (n-1)}$.

Here's the code to compute all that:

\begin{verbatim}
def FindPriorRanges(xs, num_points, num_stderrs=3.0):

    # compute m and s
    n = len(xs)
    m = numpy.mean(xs)
    s = numpy.std(xs)

    # compute ranges for m and s
    stderr_m = s / math.sqrt(n)
    mus = MakeRange(m, stderr_m, num_stderrs)

    stderr_s = s / math.sqrt(2 * (n-1))
    sigmas = MakeRange(s, stderr_s, num_stderrs)

    return mus, sigmas
\end{verbatim}

{\tt xs} is the dataset. \verb"num_points" is the desired number of
values in the range.
\verb"num_stderrs" is the width of the range on
each side of the estimate, in number of standard errors.

The return
value is a pair of sequences, {\tt mus} and {\tt sigmas}.

Here's {\tt MakeRange}:
\index{numpy}

\begin{verbatim}
def MakeRange(estimate, stderr, num_stderrs):
    spread = stderr * num_stderrs
    array = numpy.linspace(estimate-spread,
                           estimate+spread,
                           num_points)
    return array
\end{verbatim}

{\tt numpy.linspace} makes an array of equally spaced elements between
{\tt estimate-spread} and {\tt estimate+spread}, including both.
\index{linspace}


\section{Update}

Finally here's the code to make and update the suite:

\begin{verbatim}
mus, sigmas = FindPriorRanges(xs, num_points)
suite = Height(mus, sigmas)
suite.UpdateSet(xs)
print suite.MaximumLikelihood()
\end{verbatim}

This process might seem bogus, because we use the data to choose the
range of the prior distribution, and then use the data again to do the
update. In general, using the same data twice is, in fact, bogus.
\index{bogus}
\index{maximum likelihood}

But in this case it is ok. Really. We use the data to choose the
range for the prior, but only to avoid computing a lot of
probabilities that would have been very small anyway. With
\verb"num_stderrs=4", the range is big enough to cover all values with
non-negligible likelihood.
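Here is a self-contained sketch of the range-finding step using only the standard library; the sample is made up, and the names {\tt find_prior_ranges} and {\tt make_range} are mine, not the book's:

```python
import math
import statistics

def find_prior_ranges(xs, num_points=11, num_stderrs=4.0):
    """Center the grids on the classical estimates and make them
    num_stderrs standard errors wide on each side."""
    n = len(xs)
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)

    def make_range(estimate, stderr):
        spread = stderr * num_stderrs
        low, high = estimate - spread, estimate + spread
        step = (high - low) / (num_points - 1)
        return [low + i * step for i in range(num_points)]

    mus = make_range(m, s / math.sqrt(n))
    sigmas = make_range(s, s / math.sqrt(2 * (n - 1)))
    return mus, sigmas

sample = [170.0, 180.0, 175.0, 165.0, 185.0] * 20
mus, sigmas = find_prior_ranges(sample)
print(mus[0] < 175.0 < mus[-1])   # True: the grid brackets the sample mean
```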
Making the range bigger than that has no effect
on the results.

In effect, the prior is uniform over all values
of {\tt mu} and {\tt sigma}, but for computational efficiency
we ignore all the values that don't matter.

\section{The posterior distribution of CV}

Once we have the posterior joint distribution of {\tt mu} and
{\tt sigma}, we can compute the distribution of CV for men and women, and
then the probability that one exceeds the other.

To compute the distribution of CV, we enumerate pairs of
{\tt mu} and {\tt sigma}:

\begin{verbatim}
def CoefVariation(suite):
    pmf = thinkbayes.Pmf()
    for (mu, sigma), p in suite.Items():
        pmf.Incr(sigma/mu, p)
    return pmf
\end{verbatim}

Then we use \verb"thinkbayes.PmfProbGreater" to compute the
probability that men are more variable.

The analysis itself is simple, but there are two more issues we
have to deal with:

\begin{enumerate}

\item As the size of the dataset increases, we run into a series of
computational problems due to the limitations of floating-point
arithmetic.

\item The dataset contains a number of extreme values that are almost
certainly errors.
We will need to make the estimation process
robust in the presence of these outliers.

\end{enumerate}

The following sections explain these problems and their solutions.


\section{Underflow}
\label{underflow}

If we select the first 100 values from the BRFSS dataset and run the
analysis I just described, it runs without errors and we get posterior
distributions that look reasonable.

If we select the first 1000 values and run the program again, we get
an error in \verb"Pmf.Normalize":

\begin{verbatim}
ValueError: total probability is zero.
\end{verbatim}

The problem is that we are using probability densities to compute
likelihoods, and densities from continuous distributions tend to be
small. And if you take 1000 small values and multiply
them together, the result is very small. In this case it is so small
it can't be represented by a floating-point number, so it gets rounded
down to zero, which is called {\bf underflow}. And if all
probabilities in the distribution are 0, it's not a distribution any
more.
\index{underflow}

A possible solution is to renormalize the Pmf after each update,
or after each batch of 100. That would work, but it would be slow.

A better alternative is to compute likelihoods under a log
transform. That way, instead of multiplying small values, we can add
up log likelihoods. {\tt Pmf} provides methods {\tt Log},
{\tt LogUpdateSet} and {\tt Exp} to make this process easy.
\index{logarithm}
\index{log transform}

{\tt Log} computes the log of the probabilities in a Pmf:

\begin{verbatim}
# class Pmf

    def Log(self):
        m = self.MaxLike()
        for x, p in self.d.iteritems():
            if p:
                self.Set(x, math.log(p/m))
            else:
                self.Remove(x)
\end{verbatim}

Before applying the log transform {\tt Log} uses {\tt MaxLike} to find
{\tt m}, the highest probability in the Pmf.
It divides all
probabilities by {\tt m}, so the highest probability gets normalized
to 1, which yields a log of 0. The other log probabilities are all
negative. If there are any values in the Pmf with probability 0, they
are removed.

While the Pmf is under a log transform, we can't use {\tt Update},
{\tt UpdateSet}, or {\tt Normalize}. The result would be nonsensical;
if you try, Pmf raises an exception.
Instead, we have to use {\tt LogUpdate}
and {\tt LogUpdateSet}.
\index{exception}

Here's the implementation of {\tt LogUpdateSet}:

\begin{verbatim}
# class Suite

    def LogUpdateSet(self, dataset):
        for data in dataset:
            self.LogUpdate(data)
\end{verbatim}

{\tt LogUpdateSet} loops through the data and calls {\tt LogUpdate}:

\begin{verbatim}
# class Suite

    def LogUpdate(self, data):
        for hypo in self.Values():
            like = self.LogLikelihood(data, hypo)
            self.Incr(hypo, like)
\end{verbatim}

{\tt LogUpdate} is just like {\tt Update} except that it calls
{\tt LogLikelihood} instead of {\tt Likelihood}, and {\tt Incr}
instead of {\tt Mult}.

Using log-likelihoods avoids the problem with underflow, but while
the Pmf is under the log transform, there's not much we can do with
it. We have to use {\tt Exp} to invert the transform:

\begin{verbatim}
# class Pmf

    def Exp(self):
        m = self.MaxLike()
        for x, p in self.d.iteritems():
            self.Set(x, math.exp(p-m))
\end{verbatim}

If the log-likelihoods are large negative numbers, the resulting
likelihoods might underflow. So {\tt Exp} finds the maximum
log-likelihood, {\tt m}, and shifts all the likelihoods up by {\tt m}.
The resulting distribution has a maximum likelihood of 1.
This
process inverts the log transform with minimal loss of precision.
\index{maximum likelihood}


\section{Log-likelihood}

Now all we need is {\tt LogLikelihood}.

\begin{verbatim}
# class Height

    def LogLikelihood(self, data, hypo):
        x = data
        mu, sigma = hypo
        loglike = scipy.stats.norm.logpdf(x, mu, sigma)
        return loglike
\end{verbatim}

{\tt norm.logpdf} computes the log-likelihood of the
Gaussian PDF.
\index{scipy}
\index{log-likelihood}

Here's what the whole update process looks like:

\begin{verbatim}
suite.Log()
suite.LogUpdateSet(xs)
suite.Exp()
suite.Normalize()
\end{verbatim}

To review, {\tt Log} puts the suite under a log transform.
{\tt LogUpdateSet} calls {\tt LogUpdate}, which calls
{\tt LogLikelihood}. {\tt LogUpdate} uses {\tt Pmf.Incr},
because adding a log-likelihood is the same as multiplying
by a likelihood.

After the update, the log-likelihoods are large negative
numbers, so {\tt Exp} shifts them up before inverting the
transform, which is how we avoid underflow.

Once the suite is transformed back, the probabilities
are ``linear'' again, which means ``not logarithmic'',
so we can use {\tt Normalize} again.

Using this algorithm, we can process the entire dataset without
underflow, but it is still slow. On my computer it might
take an hour. We can do better.


\section{A little optimization}

This section uses math and computational optimization
to speed things up by a factor of 100. But the following section
presents an algorithm that is even faster. So if you want to
get right to the good stuff, feel free to skip this section.
\index{optimization}

{\tt Suite.LogUpdateSet} calls {\tt LogUpdate} once for each data
point.
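The failure mode, and the fix, can be demonstrated in a few lines; the density value is made up, but it is the right order of magnitude for a Gaussian PDF evaluated far from its mode:

```python
import math

# multiplying a hundred small densities underflows to zero...
densities = [1e-4] * 100
product = 1.0
for d in densities:
    product *= d
print(product)        # 0.0: 1e-400 is below the smallest positive float

# ...but summing log densities is perfectly stable
log_total = sum(math.log(d) for d in densities)
print(log_total)      # about -921, easily representable
```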
We can speed up {\tt LogUpdateSet} by computing the log-likelihood of the entire
dataset at once.

We'll start with the Gaussian PDF:
%
\[ \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \]
%
and compute the log (dropping the constant term):
%
\[ -\log \sigma -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \]
%
Given a sequence of values, $x_i$, the total log-likelihood is
%
\[ \sum_i -\log \sigma - \frac{1}{2} \left( \frac{x_i-\mu}{\sigma} \right)^2 \]
%
Pulling out the terms that don't depend on $i$, we get
%
\[ -n \log \sigma - \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2 \]
%
which we can translate into Python:

\begin{verbatim}
# class Height

    def LogUpdateSetFast(self, data):
        xs = tuple(data)
        n = len(xs)

        for hypo in self.Values():
            mu, sigma = hypo
            total = Summation(xs, mu)
            loglike = -n * math.log(sigma) - total / 2 / sigma**2
            self.Incr(hypo, loglike)
\end{verbatim}

By itself, this would be a small improvement, but it
creates an opportunity for a bigger one. Notice that the
summation only depends on {\tt mu}, not {\tt sigma}, so we only
have to compute it once for each value of {\tt mu}.
\index{optimization}

To avoid recomputing, I factor out a function that computes the
summation, and {\bf memoize} it so it stores previously computed
results in a dictionary (see
\url{http://en.wikipedia.org/wiki/Memoization}):
\index{memoization}

\begin{verbatim}
def Summation(xs, mu, cache={}):
    try:
        return cache[xs, mu]
    except KeyError:
        ds = [(x-mu)**2 for x in xs]
        total = sum(ds)
        cache[xs, mu] = total
        return total
\end{verbatim}

{\tt cache} stores previously computed sums.
The {\tt try} statement in {\tt Summation}
returns a result from the cache if possible; otherwise it computes
the summation, then caches and returns the result.
\index{cache}

The only catch is that we can't use a list as a key in the cache, because
it is not a hashable type. That's why {\tt LogUpdateSetFast} converts
the dataset to a tuple.

This optimization speeds up the computation by about a
factor of 100, processing the entire dataset (154~407 men and 254~722
women) in less than a minute on my not-very-fast computer.


\section{ABC}

But maybe you don't have that kind of time. In that case, Approximate
Bayesian Computation (ABC) might be the way to go. The motivation
behind ABC is that the likelihood of any particular dataset is:
\index{ABC}
\index{Approximate Bayesian Computation}

\begin{enumerate}

\item Very small, especially for large datasets, which is why we had
to use the log transform,

\item Expensive to compute, which is why we had to do so much
optimization, and

\item Not really what we want anyway.

\end{enumerate}

We don't really care about the likelihood of seeing the exact dataset
we saw. Especially for continuous variables, we care about the
likelihood of seeing any dataset like the one we saw.

For example, in the Euro problem, we don't care about the order of
the coin flips, only the total number of heads and tails. And in
the locomotive problem, we don't care about which particular trains were
seen, only the number of trains and the maximum of the serial numbers.
\index{locomotive problem}
\index{Euro problem}

Similarly, in the BRFSS sample, we don't really want to know the
probability of seeing one particular set of values (especially since
there are hundreds of thousands of them).
It is more
relevant to ask, ``If we sample 100,000 people from a population
with hypothetical values of $\mu$ and $\sigma$, what would be
the chance of collecting a sample with the observed mean and
variance?''
\index{BRFSS}

For samples from a Gaussian distribution, we can answer this question
efficiently because we can find the distribution of the sample
statistics analytically. In fact, we already did it when we computed
the range of the prior.
\index{Gaussian distribution}

If you draw $n$ values from a Gaussian distribution with parameters
$\mu$ and $\sigma$, and compute the sample mean, $m$, the
distribution of $m$ is Gaussian
with parameters $\mu$ and $\sigma / \sqrt{n}$.

Similarly, the distribution of the sample standard deviation, $s$, is
Gaussian with parameters $\sigma$ and $\sigma / \sqrt{2 (n-1)}$.
\index{sample statistics}

We can use these sample distributions to compute the likelihood of the
sample statistics, $m$ and $s$, given hypothetical values
for $\mu$ and $\sigma$.
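These sampling distributions are easy to spot-check by simulation; the parameters and sample sizes below are made up for the sketch:

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n = 163.0, 7.3, 100

# draw many samples of size n and collect the sample means
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(2000)]

# the spread of the sample means should be close to sigma / sqrt(n)
print(statistics.pstdev(means))   # close to 0.73
print(sigma / math.sqrt(n))       # 0.73
```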
Here's a new version of \verb"LogUpdateSet"
that does it:

\begin{verbatim}
def LogUpdateSetABC(self, data):
    xs = data
    n = len(xs)

    # compute sample statistics
    m = numpy.mean(xs)
    s = numpy.std(xs)

    for hypo in sorted(self.Values()):
        mu, sigma = hypo

        # compute log likelihood of m, given hypo
        stderr_m = sigma / math.sqrt(n)
        loglike = EvalGaussianLogPdf(m, mu, stderr_m)

        # compute log likelihood of s, given hypo
        stderr_s = sigma / math.sqrt(2 * (n-1))
        loglike += EvalGaussianLogPdf(s, sigma, stderr_s)

        self.Incr(hypo, loglike)
\end{verbatim}

On my computer this function processes the entire dataset in about a
second, and the result agrees with the exact result with about 5
digits of precision.


\section{Robust estimation}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_male.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for men in the U.S.}
\label{fig.variability1}
\end{figure}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_female.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for women in the U.S.}
\label{fig.variability2}
\end{figure}

We are almost ready to look at results, but we have one more
problem to deal with. There are a number of outliers in this
dataset that are almost certainly errors.
For example, there
are three adults with reported height of 61 cm, which would
place them among the shortest living adults in the world.
At the other end, there are four women with reported height
229 cm, just short of the tallest women in the world.

It is not impossible that these values are correct, but it is
unlikely, which makes it hard to know how to deal with them.
And we have to get
it right, because these extreme values have a disproportionate
effect on the estimated variability.

Because ABC is based on summary statistics, rather than the entire
dataset, we can make it more robust by choosing summary statistics
that are robust in the presence of outliers. For example, rather
than use the sample mean and standard deviation, we could use the median
and inter-quartile range
(IQR), which is the difference between the 25th and 75th percentiles.
\index{summary statistic}
\index{robust estimation}
\index{inter-quartile range}
\index{IQR}

More generally, we could compute an inter-percentile range (IPR) that
spans any given fraction of the distribution, {\tt p}:

\begin{verbatim}
def MedianIPR(xs, p):
    cdf = thinkbayes.MakeCdfFromList(xs)
    median = cdf.Percentile(50)

    alpha = (1-p) / 2
    ipr = cdf.Value(1-alpha) - cdf.Value(alpha)
    return median, ipr
\end{verbatim}

{\tt xs} is a sequence of values. {\tt p} is the desired range;
for example, {\tt p=0.5} yields the inter-quartile range.

{\tt MedianIPR} works by computing the CDF of {\tt xs},
then extracting the median and the difference between two
percentiles.

We can convert from {\tt ipr} to an estimate of {\tt sigma} using the
Gaussian CDF to compute the fraction of the distribution covered by a
given number of standard deviations.
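A quick demonstration of why the median resists outliers while the mean does not; the sample and the 61 cm outlier are invented for illustration:

```python
import statistics

heights = [160, 165, 170, 175, 180] * 20    # a clean made-up sample
corrupted = heights + [61]                  # add one impossible value

# the mean moves noticeably...
print(statistics.mean(heights))     # 170
print(statistics.mean(corrupted))   # about 168.9

# ...but the median does not move at all
print(statistics.median(heights))   # 170.0
print(statistics.median(corrupted)) # 170
```

Percentile-based spread measures like the IPR resist outliers for the same reason: a handful of extreme values cannot move the 25th or 75th percentile.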
For example, it is a well-known
rule of thumb that 68\% of a Gaussian distribution falls within one
standard deviation of the mean, which leaves 16\% in each tail. If we
compute the range between the 16th and 84th percentiles, we expect the
result to be {\tt 2 * sigma}. So we can estimate {\tt sigma} by
computing the 68\% IPR and dividing by 2.
\index{Gaussian distribution}

More generally we could use any number of {\tt sigmas}.
{\tt MedianS} performs the more general version of this
computation:

\begin{verbatim}
def MedianS(xs, num_sigmas):
    half_p = thinkbayes.StandardGaussianCdf(num_sigmas) - 0.5

    median, ipr = MedianIPR(xs, half_p * 2)
    s = ipr / 2 / num_sigmas

    return median, s
\end{verbatim}

Again, {\tt xs} is the sequence of values; \verb"num_sigmas" is the
number of standard deviations the results should be based on. The
result is {\tt median}, which estimates $\mu$, and {\tt s}, which
estimates $\sigma$.

Finally, in {\tt LogUpdateSetABC} we can replace the sample mean and
standard deviation with {\tt median} and {\tt s}. And that pretty
much does it.

It might seem odd that we are using observed percentiles to
estimate $\mu$ and $\sigma$, but it is an example of the
flexibility of the Bayesian approach. In effect we are asking,
``Given hypothetical values for $\mu$ and $\sigma$, and
a sampling process that has some chance of introducing errors,
what is the likelihood of generating a given set of sample
statistics?''

We are free to choose any sample statistics we like, up to a point:
$\mu$ and $\sigma$ determine the location and spread of
a distribution, so we need to choose statistics that capture those
characteristics. For example, if we chose the 49th and 51st percentiles,
we would get very little information about spread, so the estimate
of $\sigma$ would be left relatively unconstrained
by the data.
All values of {\tt sigma} would have nearly the
same likelihood of producing the observed values, so the posterior
distribution of {\tt sigma} would look a lot like the
prior.

\section{Who is more variable?}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_cv.pdf}}
\caption{Posterior distributions of CV for men and women, based on
robust estimators.}
\label{fig.variability3}
\end{figure}

Finally we are ready to answer the question we started with: is the
coefficient of variation greater for men than for women?

Using ABC based on the median and IPR with \verb"num_sigmas=1", I
computed posterior joint distributions for {\tt mu} and {\tt
sigma}. Figures~\ref{fig.variability1} and~\ref{fig.variability2}
show the results as a contour plot with {\tt mu} on the x-axis, {\tt
sigma} on the y-axis, and probability on the z-axis.

For each joint distribution, I computed the posterior distribution of
CV. Figure~\ref{fig.variability3} shows these distributions for men
and women. The mean for men is 0.0410; for women it is 0.0429.
Since there is no overlap between the distributions, we conclude with
near certainty that
women are more variable in height than men.

So is that the end of the Variability Hypothesis? Sadly, no. It turns
out that this
result depends on the choice of the
inter-percentile range. With \verb"num_sigmas=1", we conclude that
women are more variable, but with \verb"num_sigmas=2" we conclude
with equal confidence that men are more variable.

The reason for the difference is that there
are more men of short stature, and their distance from the mean is
greater.

So our evaluation of the Variability Hypothesis depends on the
interpretation of ``variability.'' With \verb"num_sigmas=1" we
focus on people near the mean.
As we increase
\verb"num_sigmas", we give more weight to the extremes.

To decide which
emphasis is appropriate, we would need a more precise statement
of the hypothesis. As it is, the Variability Hypothesis may be
too vague to evaluate.

Nevertheless, it helped
me demonstrate several new ideas and, I hope you agree,
it makes an interesting example.

\section{Discussion}

There are two ways you might think of ABC. One interpretation
is that it is, as the name suggests, an approximation that is
faster to compute than the exact value.

But remember that Bayesian analysis is always
based on modeling decisions, which implies that there is no
``exact'' solution. For any interesting
physical system there are many possible models, and each model
yields different results. To interpret the results, we have to
evaluate the models.
\index{modeling}

So another interpretation of ABC is that it represents an alternative
model of the likelihood. When we compute \p{D|H}, we are asking
``What is the likelihood of the data under a given hypothesis?''
\index{likelihood}

For large datasets, the likelihood of the data is very small, which
is a hint that we might not be asking the right question. What
we really want to know is the likelihood of any outcome
like the data, where the definition of ``like'' is yet another
modeling decision.

The underlying idea of ABC is that two datasets are alike if they yield
the same summary statistics.
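To make this idea concrete, here is a minimal, self-contained sketch
of rejection-style ABC, independent of the code in this chapter. It
accepts hypothetical values of {\tt mu} and {\tt sigma} whenever a
simulated dataset yields nearly the same median and IQR as the
observed data; the priors, tolerance, and sample sizes are arbitrary
choices for illustration.

\begin{verbatim}
import random
import statistics

def Summary(xs):
    # robust summary statistics: median and inter-quartile range
    qs = statistics.quantiles(xs, n=4)
    return statistics.median(xs), qs[2] - qs[0]

def AbcSample(data, n_draws=5000, tol=2.0):
    # rejection ABC: keep (mu, sigma) pairs whose simulated
    # datasets produce summaries close to the observed ones
    med, ipr = Summary(data)
    accepted = []
    for _ in range(n_draws):
        mu = random.uniform(med - 5, med + 5)
        sigma = random.uniform(0.1, 3 * ipr)
        sim = [random.gauss(mu, sigma) for _ in range(len(data))]
        sim_med, sim_ipr = Summary(sim)
        if abs(sim_med - med) < tol and abs(sim_ipr - ipr) < tol:
            accepted.append((mu, sigma))
    return accepted

random.seed(17)
data = [random.gauss(163, 7) for _ in range(500)]
posterior = AbcSample(data)
print(len(posterior))
\end{verbatim}

Because candidate parameters are judged only through their summary
statistics, swapping in different statistics changes the model itself.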
But in some cases, like the example in
this chapter, it is not obvious which summary statistics to choose.
\index{summary statistic}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

An ``effect size'' is a statistic intended to measure the difference
between two groups (see
\url{http://en.wikipedia.org/wiki/Effect_size}).

For example, we could use data from the BRFSS to estimate the
difference in height between men and women. By sampling values
from the posterior distributions of $\mu$ and
$\sigma$, we could generate the posterior distribution of this
difference.

But it might be better to use a dimensionless measure of effect
size, rather than a difference measured in cm. One option is
to divide the difference by the standard deviation (similar to what
we did with the coefficient of variation).

If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
parameters for Group 2 are $(\mu_2, \sigma_2)$, the dimensionless
effect size is
%
\[ \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
%
Write a function that takes joint distributions of
{\tt mu} and {\tt sigma} for two groups and returns
the posterior distribution of effect size.

Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.

\end{exercise}

\chapter{Hypothesis Testing}

\section{Back to the Euro problem}

In Section~\ref{euro} I presented a problem from MacKay's {\it Information
Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110.
`It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}

In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}

In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}

If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}. In fact, we already wrote the function
that does it.

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

To use it we can
create a {\tt Euro} suite and invoke
{\tt Likelihood}:

\begin{verbatim}
suite = Euro()
likelihood = suite.Likelihood(data, 50)
\end{verbatim}

\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.

It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.

One possibility is to cheat and look at the data before we define
the hypothesis.
In that case we would say that ``biased'' means that
the probability of heads is 140/250.

\begin{verbatim}
actual_percent = 100.0 * 140 / 250
likelihood = suite.Likelihood(data, actual_percent)
\end{verbatim}

This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1. So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}

But using the data to formulate the hypothesis
is obviously bogus. By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}

\section{Making a fair comparison}
\label{suitelike}

To make a legitimate comparison, we have to define $B$ without looking
at the data. So let's try a different definition. If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side. You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely. So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''

We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses. We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.

\begin{verbatim}
like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
likelihood = 0.5 * like40 + 0.5 * like60
\end{verbatim}

The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}

More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$.
In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.

\begin{verbatim}
b_uniform = Euro(xrange(0, 101))
b_uniform.Remove(50)
b_uniform.Normalize()
\end{verbatim}

I initialize \verb"b_uniform" with values from 0 to 100.
I removed the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair, but it has almost no
effect on the result whether you remove it or not.

To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.

\begin{verbatim}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{verbatim}

The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}

If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update. To refresh your memory, here's the {\tt Update} function:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's {\tt Normalize}:

\begin{verbatim}
def Normalize(self):
    total = self.Total()

    factor = 1.0 / total
    for x in self.d:
        self.d[x] *= factor

    return total
\end{verbatim}

The return value from {\tt Normalize} is the total of the
probabilities in the Suite, which is the average of the likelihoods
for the sub-hypotheses, weighted by the prior probabilities.
And {\tt
Update} passes this value along, so instead of using {\tt
SuiteLikelihood}, we could compute the likelihood of
\verb"b_uniform" like this:

\begin{verbatim}
likelihood = b_uniform.Update(data)
\end{verbatim}

\section{The triangle prior}

In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%. If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}

\begin{verbatim}
b_triangle = TrianglePrior()
likelihood = b_triangle.Update(data)
\end{verbatim}

The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}

The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}

\begin{tabular}{|l|r|r|}
\hline
Hypothesis & Likelihood & Bayes \\
 & $\times 10^{-76}$ & Factor \\
\hline
$F$ & 5.5 & -- \\
\verb"B_cheat" & 34 & 6.1 \\
\verb"B_two" & 7.4 & 1.3 \\
\verb"B_uniform" & 2.6 & 0.47 \\
\verb"B_triangle" & 4.6 & 0.84 \\
\hline
\end{tabular}

Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.

In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means.
This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.

My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$. In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}

Part of the answer is historical. Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:

\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}

In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.'' Other authors have suggested variations on the
wording. To avoid arguing about adjectives, we could think about odds
instead.

If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1. In terms of probability,
the data changed your degree of belief from 50\% to 66\%.
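This odds arithmetic is easy to check with a few lines of Python
(a standalone sketch, not part of the code for this chapter):

\begin{verbatim}
def PosteriorProb(prior_odds, bayes_factor):
    # posterior odds are prior odds times the Bayes factor;
    # odds o correspond to probability o / (o + 1)
    post_odds = prior_odds * bayes_factor
    return post_odds / (post_odds + 1)

print(PosteriorProb(1, 2))    # 2/3, about 66%
print(PosteriorProb(1, 100))  # 100/101, more than 99%
\end{verbatim}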
For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.

On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, or more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.

\section{Exercises}

\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}

What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not? Or are you
more skeptical about it? Write down your prior odds.

Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?
\end{exercise}

\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.

Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP.
Would that change your mind?

If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's essay ``Of
Miracles'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}

\end{exercise}

\chapter{Evidence}
\label{evidence}

\section{Interpreting SAT scores}

Suppose you are the Dean of Admission at a small engineering
college in Massachusetts, and you are considering two candidates,
Alice and Bob, whose qualifications are similar in many ways,
with the exception that Alice got a higher score on the Math
portion of the SAT, a standardized test intended to measure
preparation for college-level work in mathematics.
\index{SAT}
\index{standardized test}

If Alice got a 780 and Bob got a 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}

Now in reality, both scores are very good, and both
candidates are probably well prepared for college math. So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students. But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''

To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model. I pretend, temporarily, that
all SAT questions are equally difficult.
Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}

But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly. This
simplification makes it easy to compute the likelihood of a given
score.

\section{The scale}

In order to understand SAT scores, we have to understand the scoring
and scaling process. Each test-taker gets a raw score based on the
number of correct and incorrect questions. The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}

In 2009, there were 54 questions on the math SAT. The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.

The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores. I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}

You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.

\section{The prior}

The College Board also publishes the distribution of scaled scores
for all test-takers.
If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of \verb"p_correct".
So we can use the distribution of raw scores to model the
prior distribution of \verb"p_correct".

Here is the code that reads and processes the data:

\begin{verbatim}
class Exam(object):

    def __init__(self):
        self.scale = ReadScale()
        scores = ReadRanks()
        score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
        self.raw = self.ReverseScale(score_pmf)
        self.max_score = max(self.raw.Values())
        self.prior = DivideValues(self.raw, self.max_score)
\end{verbatim}

{\tt Exam} encapsulates the information we have about the exam.
{\tt ReadScale} and {\tt ReadRanks} read files and return
objects that contain the data:
{\tt self.scale} is the {\tt Interpolator} that converts
from raw to scaled scores and back; {\tt scores} is a list
of (score, frequency) pairs.

\verb"score_pmf" is the Pmf of
scaled scores. {\tt self.raw} is the Pmf of raw scores, and
{\tt self.prior} is the Pmf of \verb"p_correct".

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of {\tt p\_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}

Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct". This distribution is approximately Gaussian, but it
is compressed at the extremes. By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}

For each test-taker, I define a Suite called {\tt Sat} that
represents the distribution of \verb"p_correct".
Here's the definition:

\begin{verbatim}
class Sat(thinkbayes.Suite):

    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)

        self.exam = exam
        self.score = score

        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"__init__" takes an Exam object and a scaled score. It makes a
copy of the prior distribution and then updates itself based on the
exam score.

As usual, we inherit {\tt Update} from {\tt Suite} and provide
{\tt Likelihood}:

\begin{verbatim}
def Likelihood(self, data, hypo):
    p_correct = hypo
    score = data

    k = self.exam.Reverse(score)
    n = self.exam.max_score
    like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
    return like
\end{verbatim}

{\tt hypo} is a hypothetical
value of \verb"p_correct", and {\tt data} is a scaled score.

To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers.
With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}

\section{Posterior}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of {\tt p\_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}

Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.

Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?'' We can use the
posterior distributions of \verb"p_correct" to answer this question.

To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:

\begin{itemize}

\item $A$: \verb"p_correct" is higher for Alice than for Bob.

\item $B$: \verb"p_correct" is higher for Bob than for Alice.

\end{itemize}

To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.

So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:

\begin{verbatim}
class TopLevel(thinkbayes.Suite):

    def Update(self, data):
        a_sat, b_sat = data

        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)

        a_like += c_like / 2
        b_like += c_like / 2

        self.Mult('A', a_like)
        self.Mult('B', b_like)

        self.Normalize()
\end{verbatim}

Usually when we define a new Suite, we inherit {\tt Update}
and provide {\tt Likelihood}. In this case I override {\tt Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.

The data passed to {\tt Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".

\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.

\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values. If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero. So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".

Here is the code that creates {\tt TopLevel} and updates it:

\begin{verbatim}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)

top = TopLevel('AB')
top.Update((a_sat, b_sat))
top.Print()
\end{verbatim}

The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21. The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions.
If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}

\section{A better model}

Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.

But how big is the modeling error? If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough. If it's large,
we need a better model.
\index{modeling error}

In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small. So if
you are satisfied with the simple model, you can skip to the next
chapter. If you want to see how the more realistic model works,
read on...

\begin{itemize}

\item Assume that each test-taker has some
degree of {\tt efficacy}, which measures their
ability to answer SAT questions.
\index{efficacy}

\item Assume that each question has some level of
{\tt difficulty}.

\item Finally, assume that the chance that a test-taker answers a
question correctly is related to {\tt efficacy} and {\tt difficulty}
according to this function:

\begin{verbatim}
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{verbatim}

\end{itemize}

This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}.
{\tt
efficacy} and {\tt difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}

When {\tt efficacy} and {\tt difficulty} are equal, the
probability of getting the question right is 50\%. As
{\tt efficacy} increases, this probability approaches 100\%.
As it decreases (or as {\tt difficulty} increases), the
probability approaches 0\%.

Given the distribution of {\tt efficacy} across test-takers
and the distribution of {\tt difficulty} across questions, we
can compute the expected distribution of raw scores. We'll do that
in two steps. First, for a person with given {\tt efficacy},
we'll compute the distribution of raw scores.

\begin{verbatim}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])

    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{verbatim}

{\tt difficulties} is a list of difficulties, one for each question.
{\tt ps} is a list of probabilities, and {\tt pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:

\begin{verbatim}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{verbatim}

{\tt dist} is the sum of these Pmfs. Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums. In order to use Python's {\tt sum} to add up Pmfs,
we have to provide {\tt pmf0}, which is the identity for Pmfs,
so {\tt pmf + pmf0} is always {\tt pmf}.

If we know a person's efficacy, we can compute their distribution
of raw scores. For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture.
Here's the code
that computes the mixture:

\begin{verbatim}
# class Exam:

    def MakeRawScoreDist(self, efficacies):
        pmfs = thinkbayes.Pmf()
        for efficacy, prob in efficacies.Items():
            scores = PmfCorrect(efficacy, self.difficulties)
            pmfs.Set(scores, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return mix
\end{verbatim}

{\tt MakeRawScoreDist} takes {\tt efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers. I assume
it is Gaussian with mean 0 and standard deviation 1.5. This
choice is mostly arbitrary. The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly.
\index{Gaussian distribution}

{\tt pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy, and maps to the fraction of test-takers at that level. {\tt
MakeMixture} takes the meta-Pmf and computes the distribution of the
mixture (see Section~\ref{mixture}).
\index{meta-Pmf}
\index{MakeMixture}

\section{Calibration}

If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}

I assume that the distribution of difficulty is uniform with
parameters {\tt center} and {\tt width}.
{\tt MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}

\begin{verbatim}
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{verbatim}

By trying out a few combinations, I found that
{\tt center=-0.05} and {\tt width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}

So, assuming that the distribution of difficulty is uniform,
its range is approximately
{\tt -1.85} to {\tt 1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}

The following table shows the range of {\tt ProbCorrect} for
test-takers at different levels of efficacy:

\begin{tabular}{|r|r|r|r|}
\hline
& \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy & -1.85 & -0.05 & 1.75 \\
\hline
3.00 & 0.99 & 0.95 & 0.78 \\
1.50 & 0.97 & 0.82 & 0.44 \\
0.00 & 0.86 & 0.51 & 0.15 \\
-1.50 & 0.59 & 0.19 & 0.04 \\
-3.00 & 0.24 & 0.05 & 0.01 \\
\hline
\end{tabular}

Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of answering the easiest questions on
the exam, and a 78\% chance of answering the hardest. On the other
end of the range, someone two standard deviations below the mean
has only a 24\% chance of answering the easiest questions.

\section{Posterior distribution of efficacy}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}

Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob.
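As a quick check before moving on, the entries in the calibration
table above can be reproduced directly from {\tt ProbCorrect}; this
standalone sketch repeats the function definition so it runs on its
own:

\begin{verbatim}
import math

def ProbCorrect(efficacy, difficulty, a=1):
    # logistic curve from item response theory: the probability
    # depends only on the difference efficacy - difficulty
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))

for efficacy in [3.0, 1.5, 0.0, -1.5, -3.0]:
    row = [ProbCorrect(efficacy, d) for d in [-1.85, -0.05, 1.75]]
    print(efficacy, ['%0.2f' % p for p in row])
\end{verbatim}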
Here is a version of the Sat class that uses the new model:

\begin{verbatim}
class Sat2(thinkbayes.Suite):

    def __init__(self, exam, score):
        self.exam = exam
        self.score = score

        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"Update" invokes \verb"Likelihood", which computes the
likelihood of a given test score for a hypothetical level of
efficacy.

\begin{verbatim}
    def Likelihood(self, data, hypo):
        efficacy = hypo
        score = data
        raw = self.exam.Reverse(score)

        pmf = self.exam.PmfCorrect(efficacy)
        like = pmf.Prob(raw)
        return like
\end{verbatim}

{\tt pmf} is the distribution of raw scores for a test-taker with the
given efficacy; {\tt like} is the probability of the observed score.

Figure~\ref{fig.satposterior2} shows the posterior distributions of
efficacy for Alice and Bob. As expected, the location of Alice's
distribution is farther to the right, but again there is some
overlap.

Using {\tt TopLevel} again, we compare $A$, the hypothesis that
Alice's efficacy is higher, and $B$, the hypothesis that Bob's is
higher.
The likelihood ratio is 3.4, a bit smaller than what we got from the
simple model (3.8). So this model indicates that the data are
evidence in favor of $A$, but a little weaker than the previous
estimate.

If our prior belief is that $A$ and $B$ are equally likely, then in
light of this evidence we would give $A$ a posterior probability of
77\%, leaving a 23\% chance that Bob's efficacy is higher.


\section{Predictive distribution}

The analysis we have done so far generates estimates for Alice and
Bob's efficacy, but since efficacy is not directly observable, it is
hard to validate the results.
\index{predictive distribution}

To give the model predictive power, we can use it to answer a related
question: ``If Alice and Bob take the math SAT again, what is the
chance that Alice will do better again?''

We'll answer this question in two steps:

\begin{itemize}

\item We'll use the posterior distribution of efficacy to generate a
predictive distribution of raw score for each test-taker.

\item We'll compare the two predictive distributions to compute the
probability that Alice gets a higher score again.

\end{itemize}

We already have most of the code we need.
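The 77\% figure above follows from Bayes's theorem in odds form: with
even prior odds, the posterior odds equal the likelihood ratio, and
odds $o$ correspond to probability $o / (o + 1)$:

```python
likelihood_ratio = 3.4   # Bayes factor in favor of A, from the model
prior_odds = 1.0         # A and B equally likely a priori
posterior_odds = prior_odds * likelihood_ratio
prob_A = posterior_odds / (posterior_odds + 1)
print(round(prob_A * 100))   # -> 77
```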
To compute the predictive distributions, we can use
\verb"MakeRawScoreDist" again:

\begin{verbatim}
    exam = Exam()
    a_sat = Sat(exam, 780)
    b_sat = Sat(exam, 740)

    a_pred = exam.MakeRawScoreDist(a_sat)
    b_pred = exam.MakeRawScoreDist(b_sat)
\end{verbatim}

Then we can find the likelihood that Alice does better on the second
test, Bob does better, or they tie:

\begin{verbatim}
    a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
    b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
    c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{verbatim}

The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.

Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test. The posterior odds are 3:1 that
Alice's efficacy is higher, but only 2:1 that Alice will do better on
the next exam.


\section{Discussion}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of {\tt p\_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}

We started this chapter with the question, ``How strong is the
evidence that Alice is better prepared than Bob?'' On the face of it,
that sounds like we want to test two hypotheses: either Alice is more
prepared or Bob is.

But in order to compute likelihoods for these hypotheses, we have to
solve an estimation problem.
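The comparison functions used above just enumerate pairs of values
from the two distributions. A minimal sketch of {\tt PmfProbGreater},
using plain dicts as stand-ins for Pmf objects:

```python
def pmf_prob_greater(pmf1, pmf2):
    # Probability that a draw from pmf1 exceeds a draw from pmf2,
    # assuming the two draws are independent.
    return sum(p1 * p2
               for v1, p1 in pmf1.items()
               for v2, p2 in pmf2.items()
               if v1 > v2)

# Example: two independent rolls of a fair die.
die = {v: 1/6 for v in range(1, 7)}
print(pmf_prob_greater(die, die))   # 15/36, about 0.417
```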
For each test-taker we have to find the posterior distribution of
either \verb"p_correct" or \verb"efficacy".

Values like this are called {\bf nuisance parameters} because we
don't care what they are, but we have to estimate them to answer the
question we care about.
\index{nuisance parameter}

One way to visualize the analysis we did in this chapter is to plot
the space of these parameters. \verb"thinkbayes.MakeJoint" takes two
Pmfs, computes their joint distribution, and returns a joint Pmf that
maps from each possible pair of values to its probability.

\begin{verbatim}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{verbatim}

This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}

Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob. The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob. To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.

In {\tt TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line. For
the cells that fall on the line, we add up the total mass and split
it between $A$ and $B$.

The process we used in this chapter---estimating nuisance parameters
in order to evaluate the likelihood of competing hypotheses---is a
common Bayesian approach to problems like this.



\chapter{Simulation}

In this chapter I describe my solution to a problem posed by a
patient with a kidney tumor.
I think the problem is important and relevant to patients with these
tumors and doctors treating them.

And I think the solution is interesting because, although it is a
Bayesian approach to the problem, the use of Bayes's theorem is
implicit. I present the solution and my code; at the end of the
chapter I will explain the Bayesian part.

If you want more technical detail than I present here, you can read
my paper on this work at \url{http://arxiv.org/abs/1203.6890}.


\section{The Kidney Tumor problem}

\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader and occasional contributor to the online
statistics forum at \url{http://reddit.com/r/statistics}. In
November 2011, I read the following message:

\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
cancer formed before I retired from the military. ... Given the dates
of retirement and detection is it possible to determine when there
was a 50/50 chance that I developed the disease? Is it possible to
determine the probability on the retirement date? My tumor was 15.5
cm x 15 cm at detection. Grade II.''
\end{quote}

I contacted the author of the message and got more information; I
learned that veterans get different benefits if it is ``more likely
than not'' that a tumor formed while they were in military service
(among other considerations).

Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated. As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times. Several papers have reported these
growth rates.

I collected data from a paper by Zhang et al\footnote{Zhang et al,
Distribution of Renal Tumor Growth Rates Determined by Using Serial
Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
137--144.}.
I contacted the authors to see if I could get raw data, but they
refused on grounds of medical privacy. Nevertheless, I was able to
extract the data I needed by printing one of their graphs and
measuring it with a ruler.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}

They report growth rates in reciprocal doubling time (RDT), which is
in units of doublings per year. So a tumor with $RDT=1$ doubles in
volume each year; with $RDT=2$ it quadruples in the same time, and
with $RDT=-1$, it halves. Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}

The squares are the data points from the paper; the line is a model I
fit to the data. The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}



\section{A simple model}

It is usually a good idea to start with a simple model before trying
something more challenging.
Sometimes the simple model is sufficient for the problem at hand, and
if not, you can use it to validate the more complex model.
\index{modeling}

For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied
by eight.

I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median rate,
how big would it have been at the date of discharge?''

The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times longer.

\begin{verbatim}
# time between discharge and diagnosis, in days
interval = 3291.0

# doubling time in linear measure is doubling time in volume * 3
dt = 811.0 * 3

# number of doublings since discharge
doublings = interval / dt

# how big was the tumor at time of discharge (diameter in cm)
d1 = 15.5
d0 = d1 / 2.0 ** doublings
\end{verbatim}

You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}. For more information see
Section~\ref{download}.

The result, {\tt d0}, is about 6 cm. So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate. Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.

In addition, I computed the growth rate that would be implied if this
tumor had formed after the date of discharge.
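For concreteness, running the first calculation above produces the
numbers quoted:

```python
interval = 3291.0          # days between discharge and diagnosis
dt = 811.0 * 3             # linear doubling time, in days
doublings = interval / dt  # about 1.35 doublings at the median rate
d0 = 15.5 / 2.0 ** doublings
print(round(doublings, 2), round(d0, 1))   # -> 1.35 6.1
```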
If we assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:

\begin{verbatim}
# assume an initial linear measure of 0.1 cm
d0 = 0.1
d1 = 15.5

# how many doublings would it take to get from d0 to d1
doublings = log2(d1 / d0)

# what linear doubling time does that imply?
dt = interval / doublings

# compute the volumetric doubling time and RDT
vdt = dt / 3
rdt = 365 / vdt
\end{verbatim}

{\tt dt} is linear doubling time, so {\tt vdt} is volumetric doubling
time, and {\tt rdt} is reciprocal doubling time.

The number of doublings, in linear measure, is 7.3, which implies an
RDT of 2.4. In the data from Zhang et al, only 20\% of tumors grew
this fast during a period of observation. So again, I concluded that
it is ``more likely than not'' that the tumor formed prior to the
date of discharge.

These calculations are sufficient to answer the question as posed,
and on behalf of my correspondent, I wrote a letter explaining my
conclusions to the Veterans' Benefit Administration.
\index{Veterans' Benefit Administration}

Later I told a friend, who is an oncologist, about my results. He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors. He suggested that the
results might be interesting to researchers and doctors.

But in order to make them useful, I wanted a more general model of
the relationship between age and size.


\section{A more general model}

Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before any given
date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}

To find it, I run simulations of tumor growth to get the distribution
of size conditioned on age.
Then we can use a Bayesian approach to get the distribution of age
conditioned on size.
\index{conditional distribution}

The simulation starts with a small tumor and runs these steps:

\begin{enumerate}

\item Choose a growth rate from the distribution of RDT.

\item Compute the size of the tumor at the end of an interval.

\item Record the size of the tumor at each interval.

\item Repeat until the tumor exceeds the maximum relevant size.

\end{enumerate}

For the initial size I chose 0.3 cm, because carcinomas smaller than
that are less likely to be invasive and less likely to have the blood
supply needed for rapid growth (see
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
\index{carcinoma}

I chose an interval of 245 days (about 8 months) because that is the
median time between measurements in the data source.

For the maximum size I chose 20 cm. In the data source, the range of
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond the
observed range at each end, but not by far, and not in a way likely
to have a strong effect on the results.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
\caption{Simulations of tumor growth, size vs. time.}
\label{fig.kidney4}
\end{figure}

The simulation is based on one big simplification: the growth rate is
chosen independently during each interval, so it does not depend on
age, size, or growth rate during previous intervals.
\index{independence}

In Section~\ref{serial} I review these assumptions and consider more
detailed models. But first let's look at some examples.

Figure~\ref{fig.kidney4} shows the size of simulated tumors as a
function of age.
The dashed line at 10 cm shows the range of ages for tumors at that
size: the fastest-growing tumor gets there in 8 years; the slowest
takes more than 35.

I am presenting results in terms of linear measurements, but the
calculations are in terms of volume. To convert from one to the
other, again, I use the volume of a sphere with the given diameter.
\index{volume}
\index{sphere}


\section{Implementation}

Here is the kernel of the simulation:
\index{simulation}

\begin{verbatim}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0

    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break

    return seq
\end{verbatim}

\verb"rdt_seq" is an iterator that yields random values from the CDF
of growth rate. {\tt v0} is the initial volume in mL. {\tt interval}
is the time step in years. {\tt vmax} is the final volume
corresponding to a linear measurement of 20 cm.
\index{iterator}

{\tt Volume} converts from linear measurement in cm to volume in mL,
based on the simplification that the tumor is a sphere:

\begin{verbatim}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{verbatim}

{\tt ExtendSequence} computes the volume of the tumor at the end of
the interval.

\begin{verbatim}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)

    return final, new_seq
\end{verbatim}

{\tt age} is the age of the tumor at the end of the interval.
{\tt seq} is a tuple that contains the volumes so far.
{\tt rdt} is the growth rate during the interval, in doublings per
year. {\tt interval} is the size of the time step in years.

The return values are {\tt final}, the volume of the tumor at the end
of the interval, and \verb"new_seq", a new tuple containing the
volumes in {\tt seq} plus the new volume {\tt final}.

{\tt Cache.Add} records the age and size of each tumor at the end of
each interval, as explained in the next section.
\index{cache}


\section{Caching the joint distribution}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
\caption{Joint distribution of age and tumor size.}
\label{fig.kidney8}
\end{figure}

Here's how the cache works.

\begin{verbatim}
class Cache(object):

    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{verbatim}

{\tt joint} is a joint Pmf that records the frequency of each
age-size pair, so it approximates the joint distribution of age and
size.
\index{joint distribution}

At the end of each simulated interval, {\tt ExtendSequence} calls
{\tt Add}:

\begin{verbatim}
# class Cache

    def Add(self, age, seq, rdt):
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{verbatim}

Again, {\tt age} is the age of the tumor, and {\tt seq} is the
sequence of volumes so far. ({\tt rdt} is passed along by
{\tt ExtendSequence} but not needed here.)

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
\caption{Distributions of age, conditioned on size.}
\label{fig.kidney6}
\end{figure}

Before adding the new data to the joint distribution, we use
{\tt Diameter} to convert from volume to diameter in centimeters:

\begin{verbatim}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{verbatim}

And {\tt CmToBucket} to convert from centimeters to a
discrete bucket number:

\begin{verbatim}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{verbatim}

The buckets are equally spaced on a log scale. Using {\tt factor=10}
yields a reasonable number of buckets; for example, 1 cm maps to
bucket 0 and 10 cm maps to bucket 23.
\index{log scale}
\index{bucket}

After running the simulations, we can plot the joint distribution as
a pseudocolor plot, where each cell represents the number of tumors
observed at a given size-age pair. Figure~\ref{fig.kidney8} shows the
joint distribution after 1000 simulations.
\index{pseudocolor plot}



\section{Conditional distributions}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
\caption{Percentiles of tumor age as a function of size.}
\label{fig.kidney7}
\end{figure}

By taking a vertical slice from the joint distribution, we can get
the distribution of sizes for any given age. By taking a horizontal
slice, we can get the distribution of ages conditioned on size.
\index{conditional distribution}

Here's the code that reads the joint distribution and builds the
conditional distribution for a given size.
\index{joint distribution}

\begin{verbatim}
# class Cache

    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{verbatim}

\verb"bucket" is the integer bucket number corresponding to tumor
size. {\tt Joint.Conditional} computes the PMF of age conditioned on
{\tt bucket}. The result is the CDF of age conditioned on
{\tt bucket}.

Figure~\ref{fig.kidney6} shows several of these CDFs, for a range of
sizes.
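As an aside, the bucket spacing defined by {\tt CmToBucket} above is
easy to check: with {\tt factor=10}, consecutive buckets differ by a
factor of $e^{0.1} \approx 1.11$ in diameter, and the endpoints come
out as stated:

```python
import math

def cm_to_bucket(x, factor=10):
    # Buckets equally spaced on a log scale of diameter.
    return factor * math.log(x)

print(round(cm_to_bucket(1)))    # -> 0
print(round(cm_to_bucket(10)))   # -> 23
```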
To summarize these distributions, we can compute percentiles as a
function of size.
\index{percentile}

\begin{verbatim}
percentiles = [95, 75, 50, 25, 5]

for bucket in cache.GetBuckets():
    cdf = ConditionalCdf(bucket)
    ps = [cdf.Percentile(p) for p in percentiles]
\end{verbatim}

Figure~\ref{fig.kidney7} shows these percentiles for each size
bucket. The data points are computed from the estimated joint
distribution. In the model, size and time are discrete, which
contributes numerical errors, so I also show a least squares fit for
each sequence of percentiles.
\index{least squares fit}


\section{Serial Correlation}
\label{serial}

The results so far are based on a number of modeling decisions; let's
review them and consider which ones are the most likely sources of
error:
\index{modeling error}

\begin{itemize}

\item To convert from linear measure to volume, we assume that tumors
are approximately spherical. This assumption is probably fine for
tumors up to a few centimeters, but not for very large tumors.
\index{sphere}

\item The distribution of growth rates in the simulations is based on
a continuous model we chose to fit the data reported by Zhang et al,
which is based on 53 patients. The fit is only approximate and, more
importantly, a larger sample would yield a different distribution.
\index{growth rate}

\item The growth model does not take into account tumor subtype or
grade; this assumption is consistent with the conclusion of Zhang et
al: ``Growth rates in renal tumors of different sizes, subtypes and
grades represent a wide range and overlap substantially.'' But with a
larger sample, a difference might become apparent.
\index{tumor type}

\item The distribution of growth rate does not depend on the size of
the tumor.
This assumption would not be realistic for very small and very large
tumors, whose growth is limited by blood supply.

But tumors observed by Zhang et al ranged from 1 to 12 cm, and they
found no statistically significant relationship between size and
growth rate. So if there is a relationship, it is likely to be weak,
at least in this size range.

\item In the simulations, growth rate during each interval is
independent of previous growth rates. In reality it is plausible that
tumors that have grown quickly in the past are more likely to grow
quickly. In other words, there is probably a serial correlation in
growth rate.
\index{serial correlation}

\end{itemize}

Of these, the first and last seem the most problematic. I'll
investigate serial correlation first, then come back to spherical
geometry.

To simulate correlated growth, I wrote a generator\footnote{If you
are not familiar with Python generators, see
\url{http://wiki.python.org/moin/Generators}.} that yields a
correlated series from a given Cdf.
Here's how the algorithm works:
\index{generator}

\begin{enumerate}

\item Generate correlated values from a Gaussian distribution. This
is easy to do because we can compute the distribution of the next
value conditioned on the previous value.
\index{Gaussian distribution}

\item Transform each value to its cumulative probability using the
Gaussian CDF.
\index{cumulative probability}

\item Transform each cumulative probability to the corresponding
value using the given Cdf.

\end{enumerate}

Here's what that looks like in code:

\begin{verbatim}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{verbatim}

{\tt cdf} is the desired Cdf; {\tt rho} is the desired correlation.
The values of {\tt x} are Gaussian; {\tt Transform} converts them to
the desired distribution.

The first value of {\tt x} is Gaussian with mean 0 and standard
deviation 1. For subsequent values, the mean and standard deviation
depend on the previous value. Given the previous {\tt x}, the mean of
the next value is {\tt x * rho}, and the variance is
{\tt 1 - rho**2}.
\index{correlated random value}

{\tt Transform} maps from each Gaussian value, {\tt x}, to a value
from the given Cdf, {\tt y}.

\begin{verbatim}
def Transform(x):
    p = thinkbayes.GaussianCdf(x)
    y = cdf.Value(p)
    return y
\end{verbatim}

{\tt GaussianCdf} computes the CDF of the standard Gaussian
distribution at {\tt x}, returning a cumulative probability.
{\tt Cdf.Value} maps from a cumulative probability to the
corresponding value in {\tt cdf}.

Depending on the shape of {\tt cdf}, information can be lost in
transformation, so the actual correlation might be lower than
{\tt rho}.
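The Gaussian step of the algorithm can be tested in isolation. This
sketch (mine, not from {\tt kidney.py}) generates a series with the
same recurrence and measures its serial correlation, which should
come out close to {\tt rho}:

```python
import math
import random

def correlated_gaussians(rho, n, seed=17):
    # AR(1) recurrence: the next value is Gaussian with mean x*rho and
    # sd sqrt(1-rho^2), so the series stays standard normal overall.
    rng = random.Random(seed)
    x = rng.gauss(0, 1)
    series = [x]
    sigma = math.sqrt(1 - rho**2)
    for _ in range(n - 1):
        x = rng.gauss(x * rho, sigma)
        series.append(x)
    return series

def sample_corr(xs, ys):
    # Pearson correlation of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

series = correlated_gaussians(rho=0.4, n=100000)
r = sample_corr(series[:-1], series[1:])
print(round(r, 2))   # close to 0.4
```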
For example, when I generate 10000 values from the distribution of
growth rates with {\tt rho=0.4}, the actual correlation is 0.37. But
since we are guessing at the right correlation anyway, that's close
enough.

Remember that {\tt MakeSequence} takes an iterator as an argument.
That interface allows it to work with different generators:
\index{generator}

\begin{verbatim}
iterator = UncorrelatedGenerator(cdf)
seq1 = MakeSequence(iterator)

iterator = CorrelatedGenerator(cdf, rho)
seq2 = MakeSequence(iterator)
\end{verbatim}

In this example, {\tt seq1} and {\tt seq2} are drawn from the same
distribution, but the values in {\tt seq1} are uncorrelated and the
values in {\tt seq2} are correlated with a coefficient of
approximately {\tt rho}.
\index{serial correlation}

Now we can see what effect serial correlation has on the results; the
following table shows percentiles of age for a 6 cm tumor, using the
uncorrelated generator and a correlated generator with target
$\rho = 0.4$.
\index{percentile}

\begin{table}
\input{kidney_table2}
\caption{Percentiles of tumor age conditioned on size.}
\end{table}

Correlation makes the fastest growing tumors faster and the slowest
slower, so the range of ages is wider.
The difference is modest for low percentiles, but for the 95th
percentile it is more than 6 years. To compute these percentiles
precisely, we would need a better estimate of the actual serial
correlation.

However, this model is sufficient to answer the question we started
with: given a tumor with a linear dimension of 15.5 cm, what is the
probability that it formed more than 8 years ago?

Here's the code:

\begin{verbatim}
# class Cache

    def ProbOlder(self, cm, age):
        bucket = CmToBucket(cm)
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{verbatim}

{\tt cm} is the size of the tumor; {\tt age} is the age threshold in
years. {\tt ProbOlder} converts size to a bucket number, gets the Cdf
of age conditioned on bucket, and computes the probability that age
exceeds the given value.

With no serial correlation, the probability that a 15.5 cm tumor is
older than 8 years is 0.999, or almost certain. With correlation 0.4,
faster-growing tumors are more likely, but the probability is still
0.995. Even with correlation 0.8, the probability is 0.978.

Another likely source of error is the assumption that tumors are
approximately spherical. For a tumor with linear dimensions 15.5 x 15
cm, this assumption is probably not valid. If, as seems likely, a
tumor this size is relatively flat, it might have the same volume as
a 6 cm sphere. With this smaller volume and correlation 0.8, the
probability of age greater than 8 is still 95\%.

So even taking into account modeling errors, it is unlikely that such
a large tumor could have formed less than 8 years prior to the date
of diagnosis.
\index{modeling error}


\section{Discussion}

Well, we got through a whole chapter without using Bayes's theorem or
the {\tt Suite} class that encapsulates Bayesian updates.
What happened?

One way to think about Bayes's theorem is as an algorithm for
inverting conditional probabilities. Given \p{B|A}, we can compute
\p{A|B}, provided we know \p{A} and \p{B}. Of course this algorithm
is only useful if, for some reason, it is easier to compute \p{B|A}
than \p{A|B}.

In this example, it is. By running simulations, we can estimate the
distribution of size conditioned on age, or \p{size|age}. But it is
harder to get the distribution of age conditioned on size, or
\p{age|size}. So this seems like a perfect opportunity to use
Bayes's theorem.

The reason I didn't is computational efficiency. To estimate
\p{size|age} for any given size, you have to run a lot of
simulations. Along the way, you end up computing \p{size|age} for a
lot of sizes. In fact, you end up computing the entire joint
distribution of size and age, \p{size, age}.
\index{joint distribution}

And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in {\tt ConditionalCdf}.
\index{conditional distribution}

So we side-stepped Bayes, but he was with us in spirit.


\chapter{A Hierarchical Model}
\label{hierarchical}


\section{The Geiger counter problem}

I got the idea for the following problem from Tom Campbell-Ricketts,
author of the Maximum Entropy blog at
\url{http://maximum-entropy-blog.blogspot.com}. And he got the idea
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
Logic of Science}:
\index{Jaynes, E.~T.}
\index{Campbell-Ricketts, Tom}
\index{Geiger counter problem}

\begin{quote}
Suppose that a radioactive source emits particles toward a Geiger
counter at an average rate of $r$ particles per second, but the
counter only registers a fraction, $f$, of the particles that hit
it.
If $f$ is 10\% and the counter registers 15 particles in a one-second
interval, what is the posterior distribution of $n$, the actual
number of particles that hit the counter, and $r$, the average rate
particles are emitted?
\end{quote}

To get started on a problem like this, think about the chain of
causation that starts with the parameters of the system and ends with
the observed data:
\index{causation}

\begin{enumerate}

\item The source emits particles at an average rate, $r$.

\item During any given second, the source emits $n$ particles toward
the counter.

\item Out of those $n$ particles, some number, $k$, get counted.

\end{enumerate}

The probability that an atom decays is the same at any point in time,
so radioactive decay is well modeled by a Poisson process. Given $r$,
the distribution of $n$ is a Poisson distribution with parameter $r$.
\index{radioactive decay}
\index{Poisson process}

And if we assume that the probability of detection for each particle
is independent of the others, the distribution of $k$ is the binomial
distribution with parameters $n$ and $f$.
\index{binomial distribution}

Given the parameters of the system, we can find the distribution of
the data. So we can solve what is called the {\bf forward problem}.
\index{forward problem}

Now we want to go the other way: given the data, we want the
distribution of the parameters. This is called the {\bf inverse
problem}.
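The forward problem is easy to simulate. In this sketch (my own code,
with Knuth's multiplication method standing in for a library Poisson
sampler), we draw $n$ from a Poisson distribution with rate $r$ and
detect each particle with probability $f$; since a thinned Poisson
process is again Poisson, the mean of $k$ should be near $rf$:

```python
import math
import random

rng = random.Random(42)

def poisson(lam):
    # Knuth's algorithm: multiply uniforms until the product drops
    # below exp(-lam). Fine for moderate lam; a library sampler would
    # be the usual choice.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def forward(r, f):
    # One second of the causal chain: emit n particles, count each
    # independently with probability f.
    n = poisson(r)
    k = sum(1 for _ in range(n) if rng.random() < f)
    return n, k

r, f = 100, 0.1
ks = [forward(r, f)[1] for _ in range(10000)]
mean_k = sum(ks) / len(ks)
print(mean_k)   # close to r * f = 10
```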
And if you can solve the forward
problem, you can use Bayesian methods to solve the inverse problem.
\index{inverse problem}


\section{Start simple}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
\caption{Posterior distribution of $n$ for three values of $r$.}
\label{fig.jaynes1}
\end{figure}

Let's start with a simple version of the problem where we know
the value of $r$. We are given the value of $f$, so all we
have to do is estimate $n$.

I define a Suite called {\tt Detector} that models the behavior
of the detector and estimates $n$.

\begin{verbatim}
class Detector(thinkbayes.Suite):

    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{verbatim}

If the average emission rate is $r$ particles per second, the
distribution of $n$ is Poisson with parameter $r$.
{\tt high} and {\tt step} determine the upper bound for $n$
and the step size between hypothetical values.
\index{Poisson distribution}

Now we need a likelihood function:
\index{likelihood}

\begin{verbatim}
# class Detector

    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f

        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{verbatim}

{\tt data} is the number of particles detected, and {\tt hypo} is
the hypothetical number of particles emitted, $n$.

If there are actually $n$ particles, and the probability of detecting
any one of them is $f$, the probability of detecting $k$ particles is
given by the binomial distribution.
\index{binomial distribution}

That's it for the Detector.
We can try it out for a range
of values of $r$:

\begin{verbatim}
f = 0.1
k = 15

for r in [100, 250, 400]:
    suite = Detector(r, f, step=1)
    suite.Update(k)
    print suite.MaximumLikelihood()
\end{verbatim}

Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
several given values of $r$.


\section{Make it hierarchical}

In the previous section, we assume $r$ is known. Now let's
relax that assumption. I define another Suite, called {\tt Emitter},
that models the behavior of the emitter and estimates $r$:

\begin{verbatim}
class Emitter(thinkbayes.Suite):

    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{verbatim}

{\tt rs} is a sequence of hypothetical values for $r$. {\tt detectors}
is a sequence of Detector objects, one for each value of $r$. The
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
that is, a Suite that contains other Suites as values.
\index{meta-Suite}

To update the Emitter, we have to compute the likelihood of the data
under each hypothetical value of $r$.
But each value of $r$ is
represented by a Detector that contains a range of values for $n$.

To compute the likelihood of the data for a given Detector, we loop
through the values of $n$ and add up the total probability of $k$.
That's what {\tt SuiteLikelihood} does:

\begin{verbatim}
# class Detector

    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{verbatim}

Now we can write the Likelihood function for the Emitter:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{verbatim}

Each {\tt hypo} is a Detector, so we can invoke
{\tt SuiteLikelihood} to get the likelihood of the data under
the hypothesis.

After we update the Emitter, we have to update each of the
Detectors, too.

\begin{verbatim}
# class Emitter

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)

        for detector in self.Values():
            detector.Update(data)
\end{verbatim}

A model like this, with multiple levels of Suites, is called {\bf
hierarchical}. \index{hierarchical model}


\section{A little optimization}

You might recognize {\tt SuiteLikelihood}; we saw it
in Section~\ref{suitelike}.
At the time, I pointed out that
we didn't really need it, because the total probability
computed by {\tt SuiteLikelihood} is exactly the normalizing
constant computed and returned by {\tt Update}.
\index{normalizing constant}

So instead of updating the Emitter and then updating the
Detectors, we can do both steps at the same time, using
the result from {\tt Detector.Update} as the likelihood
of Emitter.

Here's the streamlined version of {\tt Emitter.Likelihood}:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{verbatim}

And with this version of {\tt Likelihood} we can use the
default version of {\tt Update}. So this version has fewer
lines of code, and it runs faster because it does not compute
the normalizing constant twice.
\index{optimization}


\section{Extracting the posteriors}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
\caption{Posterior distributions of $n$ and $r$.}
\label{fig.jaynes2}
\end{figure}

After we update the Emitter, we can get the posterior distribution
of $r$ by looping through the Detectors and their probabilities:

\begin{verbatim}
# class Emitter

    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{verbatim}

{\tt items} is a list of values of $r$ and their probabilities.
The result is the Pmf of $r$.

To get the posterior distribution of $n$, we have to compute
the mixture of the Detectors. We can use
{\tt thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
from each distribution to its probability.
And that's exactly
what the Emitter is:

\begin{verbatim}
# class Emitter

    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{verbatim}

Figure~\ref{fig.jaynes2} shows the results. Not surprisingly, the
most likely value for $n$ is 150. Given $f$ and $n$, the expected
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
$k / f$, which is 150.

And if 150 particles are emitted in one second, the most likely value
of $r$ is 150 particles per second. So the posterior distribution of
$r$ is also centered on 150.

The posterior distributions of $r$ and $n$ are similar;
the only difference is that we are slightly less certain about $n$.
In general, we can be more certain about the long-range emission rate,
$r$, than about the number of particles emitted in any particular second,
$n$.

You can download the code in this chapter from
\url{http://thinkbayes.com/jaynes.py}. For more information see
Section~\ref{download}.


\section{Discussion}

The Geiger counter problem demonstrates the connection between
causation and hierarchical modeling. In the example, the
emission rate $r$ has a causal effect on the number of particles,
$n$, which has a causal effect on the particle count, $k$.
\index{Geiger counter problem}
\index{causation}

The hierarchical model reflects the structure of the
system, with causes at the top and effects at the bottom.
\index{hierarchical model}

\begin{enumerate}

\item At the top level, we start with a range of hypothetical
values for $r$.

\item For each value of $r$, we have a range of values for $n$,
and the prior distribution of $n$ depends on $r$.

\item When we update the model, we go bottom-up.
We compute
a posterior distribution of $n$ for each value of $r$, then
compute the posterior distribution of $r$.

\end{enumerate}

So causal information flows down the hierarchy, and inference flows
up.


\section{Exercises}

\begin{exercise}
This exercise is also inspired by an example in Jaynes, {\em
Probability Theory}.

Suppose you buy a mosquito trap that is supposed to reduce the
population of mosquitoes near your house. Each
week, you empty the trap and count the number of mosquitoes
captured. After the first week, you count 30 mosquitoes.
After the second week, you count 20 mosquitoes. Estimate the
percentage change in the number of mosquitoes in your yard.

To answer this question, you have to make some modeling
decisions. Here are some suggestions:

\begin{itemize}

\item Suppose that each week a large number of mosquitoes, $N$, is bred
in a wetland near your home.

\item During the week, some fraction of
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
are caught in the trap.

\item Your solution should take into account your prior belief
about how much $N$ is likely to change from one week to the next.
You can do that by adding a level to the hierarchy to
model the percent change in $N$.

\end{itemize}

\end{exercise}


\chapter{Dealing with Dimensions}
\label{species}

\section{Belly button bacteria}

Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and parts of the body.
\index{biodiversity}
\index{belly button}
\index{bacteria}
\index{microbiome}

In
their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from. Each identified fragment is called a ``read.''
\index{navel}
\index{rDNA}
\index{pyrosequencing}

We can use these data to answer several related questions:

\begin{itemize}

\item Based on the number of species observed, can we estimate
the total number of species in the environment?
\index{species}

\item Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?
\index{prevalence}

\item If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

These questions make up what is called the {\bf Unseen Species problem}.
\index{Unseen Species problem}


\section{Lions and tigers and bears}

I'll start with a simplified version of the problem where we know that
there are exactly three species. Let's call them lions, tigers and
bears. Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.
\index{lions and tigers and bears}

If we have an equal chance of observing any animal in the preserve,
the number of each species we see is governed by the multinomial
distribution.
If the prevalence of lions and tigers and bears is
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to
\index{multinomial distribution}

\begin{verbatim}
p_lion**3 * p_tiger**2 * p_bear**1
\end{verbatim}

An approach that is tempting, but not correct, is to use beta
distributions, as in Section~\ref{beta}, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
distribution of \verb"p_lion" is:
\index{beta distribution}

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
\end{verbatim}

The maximum likelihood estimate for \verb"p_lion" is the observed
rate, 50\%. Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
are 33\% and 17\%.
\index{maximum likelihood}

But there are two problems:

\begin{enumerate}

\item We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct. The right prior should have a mean of 1/3,
and there should be zero likelihood that any species has a
prevalence of 100\%.

\item The distributions for each species are not independent, because
the prevalences have to add up to 1. To capture this dependence, we
need a joint distribution for the three prevalences.
\index{independence}
\index{joint distribution}

\end{enumerate}

We can use a Dirichlet distribution to solve both of these problems
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}).
In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of \verb"p_lion",
\verb"p_tiger" and \verb"p_bear".
\index{beta distribution}
\index{Dirichlet distribution}

The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution. Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.

If there are {\tt n} outcomes, the Dirichlet distribution is
described by {\tt n} parameters, written $\alpha_1$ through $\alpha_n$.

Here's the definition, from {\tt thinkbayes.py}, of a class that
represents a Dirichlet distribution:
\index{numpy}

\begin{verbatim}
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
\end{verbatim}

{\tt n} is the number of dimensions; initially the parameters
are all 1. I use a {\tt numpy} array to store the parameters
so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:

\begin{verbatim}
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
\end{verbatim}

{\tt i} is the index of the marginal distribution we want.
{\tt alpha0} is the sum of the parameters; {\tt alpha} is the
parameter for the given species.
\index{marginal distribution}

In the example, the prior marginal distribution for each species
is {\tt Beta(1, 2)}.
We can compute the prior means like
this:

\begin{verbatim}
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
\end{verbatim}

As expected, the prior mean prevalence for each species is 1/3.

To update the Dirichlet distribution, we add the
observations to the parameters like this:

\begin{verbatim}
    def Update(self, data):
        m = len(data)
        self.params[:m] += data
\end{verbatim}

Here {\tt data} is a sequence of counts in the same order as {\tt
params}, so in this example, it should be the number of lions,
tigers and bears.

{\tt data} can be shorter than {\tt params}; in that
case there are some species that have not been
observed.

Here's code that updates {\tt dirichlet} with the observed data and
computes the posterior marginal distributions:

\begin{verbatim}
data = [3, 2, 1]
dirichlet.Update(data)

for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print i, pmf.Mean()
\end{verbatim}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
\caption{Distribution of prevalences for three species.}
\label{fig.species1}
\end{figure}

Figure~\ref{fig.species1} shows the results. The posterior
mean prevalences are 44\%, 33\%, and 22\%.


\section{The hierarchical version}

We have solved a simplified version of the problem: if we
know how many species there are, we can estimate the prevalence
of each.
\index{prevalence}

Now let's get back to the original problem, estimating the total
number of species. To solve this problem I'll define a meta-Suite,
which is a Suite that contains other Suites as hypotheses.
In this
case, the top-level Suite contains hypotheses about the number of
species; the bottom level contains hypotheses about prevalences.
\index{hierarchical model}
\index{meta-Suite}

Here's the class definition:

\begin{verbatim}
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
\end{verbatim}

\verb"__init__" takes a list of possible values for {\tt n} and
makes a list of Dirichlet objects.

Here's the code that creates the top-level suite:

\begin{verbatim}
ns = range(3, 30)
suite = Species(ns)
\end{verbatim}

{\tt ns} is the list of possible values for {\tt n}. We have seen 3
species, so there have to be at least that many. I chose an upper
bound that seems reasonable, but we will check later that the
probability of exceeding this bound is low. And at least initially
we assume that any value in this range is equally likely.

To update a hierarchical model, you have to update all levels.
Usually you have to update the bottom
level first and work up, but in this case we can
update the top level first:

\begin{verbatim}
# class Species

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for hypo in self.Values():
            hypo.Update(data)
\end{verbatim}

{\tt Species.Update} invokes {\tt Update} in the parent class,
then loops through the sub-hypotheses and updates them.

Now all we need is a likelihood function:

\begin{verbatim}
# class Species

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(1000):
            like += dirichlet.Likelihood(data)

        return like
\end{verbatim}

{\tt data} is a sequence of
observed counts; {\tt hypo} is a Dirichlet object.
{\tt Species.Likelihood} calls
{\tt Dirichlet.Likelihood} 1000 times and returns the total.

Why call it
1000 times? Because {\tt
Dirichlet.Likelihood} doesn't actually compute the likelihood of the
data under the whole Dirichlet distribution. Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences.

Here's what it looks like:

\begin{verbatim}
# class Dirichlet

    def Likelihood(self, data):
        m = len(data)
        if self.n < m:
            return 0

        x = data
        p = self.Random()
        q = p[:m]**x
        return q.prod()
\end{verbatim}

The length of {\tt data} is the number of species observed. If
we see more species than we thought existed, the likelihood is 0.

\index{multinomial distribution}
Otherwise we select a random set of prevalences, {\tt p}, and
compute the multinomial PMF, which is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
where $p_i$ is the prevalence of the $i$th species, and $x_i$ is the
observed number. The first term, $c_x$, is the multinomial
coefficient; I leave it out of the computation because it is
a multiplicative factor that depends only
on the data, not the hypothesis, so it gets normalized away
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
\index{multinomial coefficient}

{\tt m} is the number of observed species.
We only need the first {\tt m} elements of {\tt p};
for the others, $x_i$ is 0, so
$p_i^{x_i}$ is 1, and we can leave them out of the product.


\section{Random sampling}
\label{randomdir}

There are two ways to generate a random sample from a Dirichlet
distribution.
One is to use the marginal beta distributions, but in
that case you have to select one at a time and scale the rest so they
add up to 1 (see
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
\index{random sample}

A less obvious, but faster, way is to select values from {\tt n} gamma
distributions, then normalize by dividing through by the total.
Here's the code:
\index{numpy}
\index{gamma distribution}

\begin{verbatim}
# class Dirichlet

    def Random(self):
        p = numpy.random.gamma(self.params)
        return p / p.sum()
\end{verbatim}

Now we're ready to look at some results. Here is the code that extracts
the posterior distribution of {\tt n}:

\begin{verbatim}
    def DistOfN(self):
        pmf = thinkbayes.Pmf()
        for hypo, prob in self.Items():
            pmf.Set(hypo.n, prob)
        return pmf
\end{verbatim}

{\tt DistOfN} iterates
through the top-level hypotheses and accumulates the probability
of each {\tt n}.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
\caption{Posterior distribution of {\tt n}.}
\label{fig.species2}
\end{figure}

Figure~\ref{fig.species2} shows the result. The most likely value is 4.
Values from 3 to 7 are reasonably likely; after that the probabilities
drop off quickly. The probability that there are 29 species is
low enough to be negligible; if we chose a higher bound,
we would get nearly the same result.

Remember that this result is based on a uniform prior for {\tt n}. If
we have background information about the number of species in the
environment, we might choose a different prior.
\index{uniform distribution}


\section{Optimization}

I have to admit that I am proud of this example.
The Unseen Species
problem is not easy, and I think this solution is simple and clear,
and takes surprisingly few lines of code (about 50 so far).

The only problem is that it is slow. It's good enough for the example
with only 3 observed species, but not good enough for the belly button
data, with more than 100 species in some samples.

The next few sections present a series of optimizations we need to
make this solution scale. Before we get into the details, here's
a road map.
\index{optimization}

\begin{itemize}

\item The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first {\tt m} parameters are
the same for all of them. The only difference is the number of
hypothetical unseen species. So we don't really need {\tt n}
Dirichlet objects; we can store the parameters in the top level of
the hierarchy. {\tt Species2} implements this optimization.

\item {\tt Species2} also uses the same set of random values for all
of the hypotheses. This saves time generating random values, but it
has a second benefit that turns out to be more important: by giving
all hypotheses the same selection from the sample space, we make
the comparison between the hypotheses more fair, so it takes
fewer iterations to converge.

\item Even with these changes there is a major performance problem.
As the number of observed species increases, the array of random
prevalences gets bigger, and the chance of choosing one that is
approximately right becomes small. So the vast majority of
iterations yield small likelihoods that don't contribute much to the
total, and don't discriminate between hypotheses.

The solution is to do the updates one species at a time.
{\tt
Species4} is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.

\item Finally, {\tt Species5} combines the sub-hypotheses into the top
level and uses {\tt numpy} array operations to speed things up.
\index{numpy}

\end{itemize}

If you are not interested in the details, feel free to skip to
Section~\ref{belly} where we look at results from the belly
button data.


\section{Collapsing the hierarchy}
\label{collapsing}

All of the bottom-level Dirichlet distributions are updated
with the same data, so the first {\tt m} parameters are the same for
all of them.
We can eliminate them and merge the parameters into
the top-level suite. {\tt Species2} implements this optimization:
\index{numpy}

\begin{verbatim}
class Species2(object):

    def __init__(self, ns):
        self.ns = ns
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        self.params = numpy.ones(ns[-1], dtype=numpy.int)
\end{verbatim}

{\tt ns} is the list of hypothetical values for {\tt n};
{\tt probs} is the list of corresponding probabilities. And
{\tt params} is the sequence of Dirichlet parameters, initially
all 1; its length is the largest value in {\tt ns}.

{\tt Species2.Update} updates both levels of
the hierarchy: first the probability for each value of {\tt n},
then the Dirichlet parameters:
\index{numpy}

\begin{verbatim}
# class Species2

    def Update(self, data):
        like = numpy.zeros(len(self.ns), dtype=numpy.double)
        for i in range(1000):
            like += self.SampleLikelihood(data)

        self.probs *= like
        self.probs /= self.probs.sum()

        m = len(data)
        self.params[:m] += data
\end{verbatim}

{\tt SampleLikelihood} returns an array of likelihoods, one for each
value of {\tt n}. {\tt like} accumulates the total likelihood for
1000 samples. {\tt self.probs} is multiplied by the total likelihood,
then normalized.
The last two lines, which update the parameters,
are the same as in {\tt Dirichlet.Update}.

Now let's look at {\tt SampleLikelihood}. There are two
opportunities for optimization here:

\begin{itemize}

\item When the hypothetical number of species, {\tt n},
exceeds the observed number, {\tt m}, we only need the first {\tt m}
terms of the multinomial PMF; the rest are 1.

\item If the number of species is large, the likelihood of the data
might be too small for floating-point (see Section~\ref{underflow}). So it
is safer to compute log-likelihoods.
\index{log-likelihood} \index{underflow}

\end{itemize}

\index{multinomial distribution}
Again, the multinomial PMF is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
So the log-likelihood is
%
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
%
which is fast and easy to compute. Again, $c_x$
is the same for all hypotheses, so we can drop it.
Here's the code:
\index{numpy}

\begin{verbatim}
# class Species2

    def SampleLikelihood(self, data):
        gammas = numpy.random.gamma(self.params)

        m = len(data)
        row = gammas[:m]
        col = numpy.cumsum(gammas)

        log_likes = []
        for n in self.ns:
            ps = row / col[n-1]
            terms = data * numpy.log(ps)
            log_like = terms.sum()
            log_likes.append(log_like)

        log_likes = numpy.asarray(log_likes)
        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)

        coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
        likes *= coefs

        return likes
\end{verbatim}

{\tt gammas} is an array of values from a gamma distribution; its
length is the largest hypothetical value of {\tt n}.
{\tt row} is
just the first {\tt m} elements of {\tt gammas}; since these are the
only elements that depend on the data, they are the only ones we need.
\index{gamma distribution}

For each value of {\tt n} we need to divide {\tt row} by the
total of the first {\tt n} values from {\tt gammas}. {\tt cumsum}
computes these cumulative sums and stores them in {\tt col}.
\index{cumulative sum}

The loop iterates through the values of {\tt n} and accumulates
a list of log-likelihoods.
\index{log-likelihood}

Inside the loop, {\tt ps} contains the row of probabilities, normalized
with the appropriate cumulative sum. {\tt terms} contains the
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
their sum.

After the loop, we want to convert the log-likelihoods to linear
likelihoods, but first it's a good idea to shift them so the largest
log-likelihood is 0; that way the linear likelihoods are not too
small (see Section~\ref{underflow}).

Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these {\tt m}
species, if the total number of species is {\tt n}.
{\tt BinomialCoef} computes ``n choose m'', which is written
$\binom{n}{m}$.
\index{binomial coefficient}

As often happens, the optimized version is less readable and more
error-prone than the original. But that's one reason I think it is
a good idea to start with the simple version; we can use it for
regression testing. I plotted results from both versions and confirmed
that they are approximately equal, and that they converge as the
number of iterations increases.
\index{regression testing}


\section{One more problem}

There's more we could do to optimize this code, but there's another
problem we need to fix first.
As the number of observed
species increases, this version gets noisier and takes more
iterations to converge on a good answer.

The problem is that if the prevalences we choose from the Dirichlet
distribution, the {\tt ps}, are not at least approximately right,
the likelihood of the observed data is close to zero and almost
equally bad for all values of {\tt n}. So most iterations don't
provide any useful contribution to the total likelihood. And as the
number of observed species, {\tt m}, gets large, the probability of
choosing {\tt ps} with non-negligible likelihood gets small. Really
small.

Fortunately, there is a solution. Remember that if you observe
a set of data, you can update the prior distribution with the
entire dataset, or you can break it up into a series of updates
with subsets of the data, and the result is the same either way.

For this example, the key is to perform the updates one species at
a time. That way when we generate a random set of {\tt ps}, only
one of them affects the computed likelihood, so the chance of choosing
a good one is much better.

Here's a new version that updates one species at a time:
\index{numpy}

\begin{verbatim}
class Species4(Species):

    def Update(self, data):
        m = len(data)

        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species}, so it
represents the hypotheses as a list of Dirichlet objects (unlike
{\tt Species2}).

{\tt Update} loops through the observed species and makes an
array, {\tt one}, with all zeros and one species count. Then
it calls {\tt Update} in the parent class, which computes
the likelihoods and updates the sub-hypotheses.

So in the running example, we do three updates.
The first
is something like ``I have seen three lions.'' The second is
``I have seen two tigers and no additional lions.'' And the third
is ``I have seen one bear and no more lions and tigers.''

Here's the new version of {\tt Likelihood}:

\begin{verbatim}
# class Species4

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(self.iterations):
            like += dirichlet.Likelihood(data)

        # correct for the number of unseen species the new one
        # could have been
        m = len(data)
        num_unseen = dirichlet.n - m + 1
        like *= num_unseen

        return like
\end{verbatim}

This is almost the same as {\tt Species.Likelihood}. The difference
is the factor, \verb"num_unseen". This correction is necessary
because each time we see a species for the first time, we have to
consider that there were some number of other unseen species that
we might have seen. For larger values of {\tt n} there are more
unseen species that we could have seen, which increases the likelihood
of the data.

This is a subtle point and I have to admit that I did not get it right
the first time. But again I was able to validate this version
by comparing it to the previous versions.
\index{regression testing}


\section{We're not done yet}

\newcommand{\BigO}[1]{\mathcal{O}(#1)}

Performing the updates one species at a time solves one problem, but
it creates another. Each update takes time proportional to $k m$,
where $k$ is the number of hypotheses and $m$ is the number of observed
species. So if we do $m$ updates, the total run time is
proportional to $k m^2$.

But we can speed things up using the same trick we used in
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
collapse the two levels of the hierarchy into a single object.
So
here's yet another version of {\tt Species}:

\begin{verbatim}
class Species5(Species2):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species2}, so
it uses {\tt ns} and {\tt probs} to represent the distribution
of {\tt n}, and {\tt params} to represent the parameters of
the Dirichlet distribution.

{\tt Update} is similar to what we saw in the previous section.
It loops through the observed species and calls {\tt UpdateOne}:
\index{numpy}

\begin{verbatim}
# class Species5

    def UpdateOne(self, i, count):
        likes = numpy.zeros(len(self.ns), dtype=numpy.double)
        for _ in range(self.iterations):
            likes += self.SampleLikelihood(i, count)

        unseen_species = [n-i+1 for n in self.ns]
        likes *= unseen_species

        self.probs *= likes
        self.probs /= self.probs.sum()
\end{verbatim}

This function is similar to {\tt Species2.Update}, with two changes:

\begin{itemize}

\item The interface is different.  Instead of the whole dataset, we
get {\tt i}, the index of the observed species, and {\tt count},
how many of that species we've seen.

\item We have to apply a correction factor for the number of unseen
species, as in {\tt Species4.Likelihood}.
The difference here is
that we update all of the likelihoods at once with array
multiplication.

\end{itemize}

Finally, here's {\tt SampleLikelihood}:
\index{numpy}

\begin{verbatim}
# class Species5

    def SampleLikelihood(self, i, count):
        gammas = numpy.random.gamma(self.params)

        sums = numpy.cumsum(gammas)[self.ns[0]-1:]

        ps = gammas[i-1] / sums
        log_likes = numpy.log(ps) * count

        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)

        return likes
\end{verbatim}

This is similar to {\tt Species2.SampleLikelihood}; the
difference is that each update only includes a single species,
so we don't need a loop.

The runtime of this function is proportional to the number
of hypotheses, $k$.  It runs $m$ times, so the run time of
the update is proportional to $k m$.
And the number of iterations we
need to get an accurate result is usually small.


\section{The belly button data}
\label{belly}

That's enough about lions and tigers and bears.
Let's get back to belly buttons.  To get a sense of what the
data look like, consider subject B1242,
whose sample of 400 reads yielded 61 species with the following
counts:

\begin{verbatim}
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
\end{verbatim}

There are a few dominant species that make up a large
fraction of the whole, but many species that yielded only
a single read.
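As a quick sanity check, we can tally these counts directly (a
standalone sketch, not part of {\tt species.py}):

```python
# Counts for subject B1242, transcribed from the list above.
counts = ([92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5] +
          [4] * 7 + [3] * 7 + [2] * 4 + [1] * 30)

num_species = len(counts)          # number of distinct species observed
num_reads = sum(counts)            # total number of reads
num_singletons = counts.count(1)   # species observed exactly once

print(num_species, num_reads, num_singletons)  # 61 400 30
```

So 30 of the 61 observed species, nearly half, are singletons.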
The number of these ``singletons'' suggests
that there are likely to be at least a few unseen species.
\index{species}

In the example with lions and tigers, we assume that each
animal in the preserve is equally likely to be observed.
Similarly, for the belly button data, we assume that each
bacterium is equally likely to yield a read.

In reality, each step in the data-collection
process might introduce biases.  Some species might
be more likely to be picked up by a swab, or to yield identifiable
amplicons.  So when we talk about the prevalence of each species,
we should remember this source of error.
\index{sample bias}

I should also acknowledge that I am using the term ``species''
loosely.  First, bacterial species are not well defined.  Second,
some reads identify a particular species, others only identify
a genus.  To be more precise, I should say ``operational
taxonomic unit'', or OTU.
\index{operational taxonomic unit}
\index{OTU}

Now let's process some of the belly button data.  I define
a class called {\tt Subject} to represent information about
each subject in the study:

\begin{verbatim}
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
\end{verbatim}

Each subject has a string code, like ``B1242'', and a list of
(count, species name) pairs, sorted in increasing order by count.
{\tt Subject} provides several methods to make it
easy to access these counts and species names.
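To give a sense of what those accessors look like, here is a
simplified sketch; the method names follow {\tt species.py}, but this
stripped-down version is hypothetical:

```python
# Simplified sketch of Subject and its accessors; the real class,
# which also handles sorting and file parsing, is in species.py.
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []          # list of (count, species name) pairs

    def Add(self, species, count):
        self.species.append((count, species))

    def GetNames(self):
        # species names, in the same order as the counts
        return [name for count, name in self.species]

    def GetCounts(self):
        return [count for count, name in self.species]

subject = Subject('B1242')
subject.Add('Corynebacterium', 92)
subject.Add('unclassified Firmicutes', 53)
print(subject.GetNames())  # ['Corynebacterium', 'unclassified Firmicutes']
```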
You can see the details
in \url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
\caption{Distribution of {\tt n} for subject B1242.}
\label{species-ndist}
\end{figure}

{\tt Subject} provides a method named {\tt Process} that creates and
updates a {\tt Species5} suite,
which represents the distributions of {\tt n} and the prevalences.
\index{prevalence}

And {\tt Species2} provides {\tt DistN}, which returns the posterior
distribution of {\tt n}.

\begin{verbatim}
# class Species2

    def DistN(self):
        items = zip(self.ns, self.probs)
        pmf = thinkbayes.MakePmfFromItems(items)
        return pmf
\end{verbatim}

Figure~\ref{species-ndist} shows the distribution of {\tt n} for
subject B1242.  The probability that there are exactly 61 species, and
no unseen species, is nearly zero.  The most likely value is 72, with
90\% credible interval 66 to 79.  At the high end, it is unlikely that
there are as many as 87 species.

Next we compute the posterior distribution of prevalence for
each species.  {\tt Species2} provides {\tt DistOfPrevalence}:

\begin{verbatim}
# class Species2

    def DistOfPrevalence(self, index):
        metapmf = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            metapmf.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(metapmf)
        return metapmf, mix
\end{verbatim}

{\tt index} indicates which species we want.
For each
{\tt n}, we have a different posterior distribution
of prevalence.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
\caption{Distribution of prevalences for subject B1242.}
\label{species-prev}
\end{figure}

The loop iterates through the possible values of {\tt n}
and their probabilities.  For each value of {\tt n} it gets
a Beta object representing the marginal distribution for the
indicated species.  Remember that Beta objects contain the
parameters {\tt alpha} and {\tt beta}; they don't have
values and probabilities like a Pmf, but they provide {\tt MakePmf},
which generates a discrete approximation to the continuous
beta distribution.
\index{Beta object}

{\tt metapmf} is a meta-Pmf that contains the distributions
of prevalence, conditioned on {\tt n}.  {\tt MakeMixture}
collapses the meta-Pmf into {\tt mix}, which combines the
conditional distributions into a single distribution
of prevalence.
\index{meta-Pmf}
\index{mixture}
\index{MakeMixture}

Figure~\ref{species-prev} shows results for the five
species with the most reads.  The most prevalent species accounts for
23\% of the 400 reads, but since there are almost certainly unseen
species, the most likely estimate for its prevalence is 20\%,
with 90\% credible interval between 17\% and 23\%.


\section{Predictive distributions}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
\caption{Simulated rarefaction curves for subject B1242.}
\label{species-rare}
\end{figure}

I introduced the hidden species problem in the form of four related
questions.
We have answered the first two by computing the posterior
distribution for {\tt n} and the prevalence of each species.
\index{predictive distribution}

The other two questions are:

\begin{itemize}

\item If we are planning to collect additional reads, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

To answer predictive questions like this we can use the posterior
distributions to simulate possible future events and compute
predictive distributions for the number of species, and fraction of
the total, we are likely to see.

The kernel of these simulations looks like this:
\index{simulation}

\begin{enumerate}

\item Choose {\tt n} from its posterior distribution.

\item Choose a prevalence for each species, including possible unseen
species, using the Dirichlet distribution.
\index{Dirichlet distribution}

\item Generate a random sequence of future observations.

\item Compute the number of new species, \verb"num_new", as a function
of the number of additional reads, {\tt k}.

\item Repeat the previous steps and accumulate the joint distribution
of \verb"num_new" and {\tt k}.
\index{joint distribution}

\end{enumerate}

And here's the code.
{\tt RunSimulation} runs a single simulation:

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
\end{verbatim}

\verb"num_reads" is the number of additional reads to simulate.
{\tt m} is the number of seen species, and {\tt seen} is a set of
strings with a unique name for each species.
{\tt n} is a random value from the posterior distribution, and
{\tt observations} is a random sequence of species names.

Each time through the loop, we add the new observation to
{\tt seen} and record the number of reads and the number of
new species so far.

The result of {\tt RunSimulation} is a {\bf rarefaction curve},
represented as a list of pairs with the number of reads and
the number of new species.
\index{rarefaction curve}

Before we see the results, let's look at {\tt GetSeenSpecies} and
{\tt GenerateObservations}.

\begin{verbatim}
# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
\end{verbatim}

{\tt GetNames} returns the list of species names that appear in
the data files, but for many subjects these names are not unique.
So I use {\tt SpeciesGenerator} to extend each name with a serial
number:
\index{generator}

\begin{verbatim}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{verbatim}

Given a name like {\tt Corynebacterium}, {\tt SpeciesGenerator} yields
{\tt Corynebacterium-1}.
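We can see what the generator produces by running it on a short list
of names (a self-contained sketch; the names are made up):

```python
def SpeciesGenerator(names, num):
    # yield each observed name with a serial number, then pad
    # the sequence with placeholder names for unseen species
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1

names = ['Staphylococcus', 'Corynebacterium']
print(list(SpeciesGenerator(names, 4)))
# ['Staphylococcus-0', 'Corynebacterium-1', 'unseen-2', 'unseen-3']
```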
When the list of names is exhausted, it
yields names like {\tt unseen-62}.

Here is {\tt GenerateObservations}:

\begin{verbatim}
# class Subject

    def GenerateObservations(self, num_reads):
        n, prevalences = self.suite.SamplePosterior()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_reads)

        return n, observations
\end{verbatim}

Again, \verb"num_reads" is the number of additional reads
to generate.  {\tt n} and {\tt prevalences} are samples from
the posterior distribution.

{\tt cdf} is a Cdf object that maps species names, including the
unseen, to cumulative probabilities.  Using a Cdf makes it efficient
to generate a random sequence of species names.
\index{Cdf}
\index{cumulative probability}

Finally, here is {\tt Species2.SamplePosterior}:

\begin{verbatim}
    def SamplePosterior(self):
        pmf = self.DistN()
        n = pmf.Random()
        prevalences = self.SamplePrevalences(n)
        return n, prevalences
\end{verbatim}

And {\tt SamplePrevalences}, which generates a sample of
prevalences conditioned on {\tt n}:
\index{numpy}
\index{random sample}

\begin{verbatim}
# class Species2

    def SamplePrevalences(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
\end{verbatim}

We saw this algorithm for generating random values from a Dirichlet
distribution in Section~\ref{randomdir}.

Figure~\ref{species-rare} shows 100 simulated rarefaction curves
for subject B1242.  The curves are ``jittered''; that is, I shifted
each curve by a random offset so they would not all overlap.
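The jittering itself is simple; here is one way to do it (a sketch
with a hypothetical function name; the curve format is the list of
pairs produced by {\tt RunSimulation}):

```python
import random

def JitterCurve(curve, dy=0.2):
    # shift the whole curve vertically by one random offset,
    # leaving the read counts unchanged
    offset = random.uniform(-dy, dy)
    return [(k, num_new + offset) for k, num_new in curve]

curve = [(1, 0), (2, 1), (3, 1)]
jittered = JitterCurve(curve)
```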
By inspection we can estimate that after
400 more reads we are likely to find 2--6 new species.


\section{Joint posterior}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
\caption{Distributions of the number of new species conditioned on
the number of additional reads.}
\label{species-cond}
\end{figure}

We can use these simulations to estimate the
joint distribution of \verb"num_new" and {\tt k}, and from that
we can get the distribution of \verb"num_new" conditioned on any
value of {\tt k}.
\index{joint distribution}

\begin{verbatim}
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
\end{verbatim}

{\tt MakeJointPredictive} makes a Joint object, which is a
Pmf whose values are tuples.
\index{Joint object}

{\tt curves} is a list of rarefaction curves created by
{\tt RunSimulation}.  Each curve contains a list of pairs of
{\tt k} and \verb"num_new".
\index{rarefaction curve}

The resulting joint distribution is a map from each pair to
its probability of occurring.  Given the joint distribution, we
can use {\tt Joint.Conditional} to
get the distribution of \verb"num_new" conditioned on {\tt k}
(see Section~\ref{conditional}).
\index{conditional distribution}

{\tt Subject.MakeConditionals} takes a list of {\tt ks}
and computes the conditional distribution of \verb"num_new"
for each {\tt k}.  The result is a list of Cdf objects.

\begin{verbatim}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
\end{verbatim}

Figure~\ref{species-cond} shows the results.
After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5.  After 800 reads, we expect to see 3 to 12 new
species.


\section{Coverage}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}

The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}

To answer this question, we need a version of {\tt RunSimulation}
that computes the fraction of observed species rather than the
number of new species.

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
\end{verbatim}

Next we loop through each curve and make a dictionary, {\tt d},
that maps from the number of additional reads, {\tt k}, to
a list of {\tt fracs}; that is, a list of values for the
coverage achieved after {\tt k} reads.

\begin{verbatim}
    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.iteritems():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
\end{verbatim}

Then for each value of {\tt k} we make a Cdf of {\tt fracs}; this Cdf
represents the distribution of coverage after {\tt k} reads.

Remember that the CDF tells you the probability of falling below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it.
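In code, the complementary probability falls out directly (a
standalone sketch, not using {\tt thinkbayes}; the data are made up):

```python
def ProbGreater(fracs, threshold):
    # fraction of simulated coverages that exceed the threshold;
    # this is the complementary CDF evaluated at the threshold
    count = sum(1 for frac in fracs if frac > threshold)
    return count / float(len(fracs))

# hypothetical coverages from 10 simulated curves at one value of k
fracs = [0.84, 0.87, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.97]
print(ProbGreater(fracs, 0.9))  # 0.6
```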
Figure~\ref{species-frac} shows
complementary CDFs for a range of values of {\tt k}.
\index{complementary CDF}

To read this figure, select the level of coverage you want to achieve
along the $x$-axis.  As an example, choose 90\%.
\index{coverage}

Now you can read up the chart to find the probability of achieving
90\% coverage after {\tt k} reads.  For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage.  With 1000 reads, you
have a 90\% chance of getting 90\% coverage.

With that, we have answered the four questions that make up the unseen
species problem.  To validate the algorithms in this chapter with
real data, I had to deal with a few more details.  But
this chapter is already too long, so I won't discuss them here.

You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.

You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.


\section{Discussion}

The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution.  So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier.
I'm very happy about that.

My goal for this book is to present three related ideas:

\begin{itemize}

\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
the idea of using probability distributions to represent uncertain
beliefs, using data to update those distributions, and using the
results to make predictions and inform decisions.

\item {\bf A computational approach}: The premise of this book is that
it is easier to understand Bayesian analysis using computation
rather than math, and easier to implement Bayesian methods with
reusable building blocks that can be rearranged to solve real-world
problems quickly.

\item {\bf Iterative modeling}: Most real-world problems involve
modeling decisions and trade-offs between realism and complexity.
It is often impossible to know ahead of time what factors should be
included in the model and which can be abstracted away.  The best
approach is to iterate, starting with simple models and adding
complexity gradually, using each model to validate the others.

\end{itemize}

These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.

If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work.
I hope you find
them useful; let me know how it goes!



%\chapter{Future chapters}

%Bayesian regression (hybrid version with resampling?)
%\url{http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/}

%Change point detection:

%Deconvolution: Estimating round trip times

%Bayesian search

%Extension of the Euro problem: evaluating reddit items and redditors
%\url{http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/}

%Charles Darwin problem (capture-tag-recapture)
%\url{http://maximum-entropy-blog.blogspot.com/2012/04/capture-recapture-and-charles-darwin.html}

% http://camdp.com/blogs/how-solve-price-rights-showdown

% https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

% http://blog.yhathq.com/posts/estimating-user-lifetimes-with-pymc.html

\printindex

\end{document}