Repository for a workshop on Bayesian statistics

"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function, division

import thinkbayes
import thinkplot

import numpy

"""
Bayesian solution to the Lincoln index, described in a blog
article at Probably Overthinking It.

Last year my occasional correspondent John D. Cook wrote an excellent
blog post about the Lincoln index, which is a way to estimate the
number of errors in a document (or program) by comparing results from
two independent testers. Here's his presentation of the problem:

"Suppose you have a tester who finds 20 bugs in your program. You
want to estimate how many bugs are really in the program. You know
there are at least 20 bugs, and if you have supreme confidence in your
tester, you may suppose there are around 20 bugs. But maybe your
tester isn't very good. Maybe there are hundreds of bugs. How can you
have any idea how many bugs there are? There's no way to know with one
tester. But if you have two testers, you can get a good idea, even if
you don't know how skilled the testers are."

Then he presents the Lincoln index, an estimator "described by
Frederick Charles Lincoln in 1930," where Wikipedia's use of
"described" is a hint that the index is another example of Stigler's
law of eponymy.

"Suppose two testers independently search for bugs. Let k1 be the
number of errors the first tester finds and k2 the number of errors
the second tester finds. Let c be the number of errors both testers
find. The Lincoln Index estimates the total number of errors as k1 k2
/ c [I changed his notation to be consistent with mine]."

So if the first tester finds 20 bugs, the second finds 15, and they
find 3 in common, we estimate that there are about 100 bugs.

Of course, whenever I see something like this, the idea that pops into
my head is that there must be a (better) Bayesian solution! And there
is.

"""
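Before the Bayesian version below, the closed-form estimate quoted above can be written as a one-line helper. This is an illustrative sketch (the name `lincoln_index` is not part of the original script):

```python
def lincoln_index(k1, k2, c):
    """Classical Lincoln index: point estimate of the total number of errors.

    k1: errors found by the first tester
    k2: errors found by the second tester
    c: errors found by both testers
    """
    return k1 * k2 / c

# The example from the text: 20 and 15 bugs found, 3 in common.
print(lincoln_index(20, 15, 3))  # 100.0
```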


def choose(n, k, d={}):
    """The binomial coefficient "n choose k".

    Args:
        n: number of trials
        k: number of successes
        d: map from (n, k) tuples to cached results

    Returns:
        int
    """
    if k == 0:
        return 1
    if n == 0:
        return 0

    try:
        return d[n, k]
    except KeyError:
        res = choose(n-1, k) + choose(n-1, k-1)
        d[n, k] = res
        return res

def binom(k, n, p):
    """Computes the binomial PMF without the binomial coefficient,
    which is computed separately by choose.

    k: number of hits
    n: number of attempts
    p: probability of a hit
    """
    return p**k * (1-p)**(n-k)

class Lincoln(thinkbayes.Suite, thinkbayes.Joint):
    """Represents hypotheses about the number of errors."""

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: n, p1, p2
        data: k1, k2, c
        """
        n, p1, p2 = hypo
        k1, k2, c = data

        part1 = choose(n, k1) * binom(k1, n, p1)
        part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
        return part1 * part2

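The likelihood above factors into two binomial pieces: tester 1 finds k1 of the n bugs, and tester 2 finds c of tester 1's k1 bugs plus k2-c of the remaining n-k1. A self-contained version of the same calculation, using only the standard library (`lincoln_likelihood` is an illustrative name, not part of the original script):

```python
from math import comb

def lincoln_likelihood(hypo, data):
    """Likelihood of (k1, k2, c) given (n, p1, p2), mirroring Lincoln.Likelihood.

    hypo: (n, p1, p2) -- total bugs and each tester's detection probability
    data: (k1, k2, c) -- bugs found by each tester and bugs found by both
    """
    n, p1, p2 = hypo
    k1, k2, c = data
    # Tester 1 finds k1 of the n bugs.
    part1 = comb(n, k1) * p1**k1 * (1 - p1)**(n - k1)
    # Tester 2 finds c of tester 1's k1 bugs and k2-c of the remaining n-k1.
    part2 = comb(k1, c) * comb(n - k1, k2 - c) * p2**k2 * (1 - p2)**(n - k2)
    return part1 * part2
```

Because `math.comb(n, k)` returns 0 when k exceeds n, impossible configurations (for example, more common bugs than tester 1 found in total) get zero likelihood, just as with the recursive `choose` above.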
def main():
    data = 20, 15, 3
    probs = numpy.linspace(0, 1, 101)
    hypos = []
    for n in range(32, 350):
        for p1 in probs:
            for p2 in probs:
                hypos.append((n, p1, p2))

    suite = Lincoln(hypos)
    suite.Update(data)

    n_marginal = suite.Marginal(0)

    thinkplot.Pmf(n_marginal, label='n')
    thinkplot.Save(root='lincoln1',
                   xlabel='number of bugs',
                   ylabel='PMF',
                   formats=['pdf', 'png'])

    print(n_marginal.Mean())
    print(n_marginal.MaximumLikelihood())


if __name__ == '__main__':
    main()