Repository for a workshop on Bayesian statistics

"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function, division

import thinkbayes
import thinkplot

import numpy

"""
Bayesian solution to the Lincoln index, described in a blog
article at Probably Overthinking It.

Last year my occasional correspondent John D. Cook wrote an excellent
blog post about the Lincoln index, which is a way to estimate the
number of errors in a document (or program) by comparing results from
two independent testers. Here's his presentation of the problem:

"Suppose you have a tester who finds 20 bugs in your program. You
want to estimate how many bugs are really in the program. You know
there are at least 20 bugs, and if you have supreme confidence in your
tester, you may suppose there are around 20 bugs. But maybe your
tester isn't very good. Maybe there are hundreds of bugs. How can you
have any idea how many bugs there are? There's no way to know with one
tester. But if you have two testers, you can get a good idea, even if
you don't know how skilled the testers are."

Then he presents the Lincoln index, an estimator "described by
Frederick Charles Lincoln in 1930," where Wikipedia's use of
"described" is a hint that the index is another example of Stigler's
law of eponymy.

"Suppose two testers independently search for bugs. Let k1 be the
number of errors the first tester finds and k2 the number of errors
the second tester finds. Let c be the number of errors both testers
find. The Lincoln Index estimates the total number of errors as k1 k2
/ c [I changed his notation to be consistent with mine]."

So if the first tester finds 20 bugs, the second finds 15, and they
find 3 in common, we estimate that there are about 100 bugs.

Of course, whenever I see something like this, the idea that pops into
my head is that there must be a (better) Bayesian solution! And there
is.

"""
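Before the Bayesian version below, the closed-form estimate quoted above can be written as a one-line helper. This is an illustrative sketch (the name `lincoln_index` is not part of the original script):

```python
def lincoln_index(k1, k2, c):
    """Classical Lincoln index: point estimate of the total number of errors.

    k1: errors found by the first tester
    k2: errors found by the second tester
    c: errors found by both testers
    """
    return k1 * k2 / c

# The example from the text: 20 and 15 bugs found, 3 in common.
print(lincoln_index(20, 15, 3))  # 100.0
```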


def choose(n, k, d={}):
    """The binomial coefficient "n choose k".

    Args:
        n: number of trials
        k: number of successes
        d: map from (n, k) tuples to cached results

    Returns:
        int
    """
    if k == 0:
        return 1
    if n == 0:
        return 0

    try:
        return d[n, k]
    except KeyError:
        res = choose(n-1, k) + choose(n-1, k-1)
        d[n, k] = res
        return res

def binom(k, n, p):
    """Computes the binomial PMF without the binomial coefficient,
    which is computed separately by choose.

    k: number of hits
    n: number of attempts
    p: probability of a hit
    """
    return p**k * (1-p)**(n-k)

class Lincoln(thinkbayes.Suite, thinkbayes.Joint):
    """Represents hypotheses about the number of errors."""

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: n, p1, p2
        data: k1, k2, c
        """
        n, p1, p2 = hypo
        k1, k2, c = data

        part1 = choose(n, k1) * binom(k1, n, p1)
        part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
        return part1 * part2

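The likelihood above factors into two binomial pieces: tester 1 finds k1 of the n bugs, and tester 2 finds c of tester 1's k1 bugs plus k2-c of the remaining n-k1. A self-contained version of the same calculation, using only the standard library (`lincoln_likelihood` is an illustrative name, not part of the original script):

```python
from math import comb

def lincoln_likelihood(hypo, data):
    """Likelihood of (k1, k2, c) given (n, p1, p2), mirroring Lincoln.Likelihood.

    hypo: (n, p1, p2) -- total bugs and each tester's detection probability
    data: (k1, k2, c) -- bugs found by each tester and bugs found by both
    """
    n, p1, p2 = hypo
    k1, k2, c = data
    # Tester 1 finds k1 of the n bugs.
    part1 = comb(n, k1) * p1**k1 * (1 - p1)**(n - k1)
    # Tester 2 finds c of tester 1's k1 bugs and k2-c of the remaining n-k1.
    part2 = comb(k1, c) * comb(n - k1, k2 - c) * p2**k2 * (1 - p2)**(n - k2)
    return part1 * part2
```

Because `math.comb(n, k)` returns 0 when k exceeds n, impossible configurations (for example, more common bugs than tester 1 found in total) get zero likelihood, just as with the recursive `choose` above.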
def main():
    data = 20, 15, 3
    probs = numpy.linspace(0, 1, 101)
    hypos = []
    for n in range(32, 350):
        for p1 in probs:
            for p2 in probs:
                hypos.append((n, p1, p2))

    suite = Lincoln(hypos)
    suite.Update(data)

    n_marginal = suite.Marginal(0)

    thinkplot.Pmf(n_marginal, label='n')
    thinkplot.Save(root='lincoln1',
                   xlabel='number of bugs',
                   ylabel='PMF',
                   formats=['pdf', 'png'])

    print(n_marginal.Mean())
    print(n_marginal.MaximumLikelihood())


if __name__ == '__main__':
    main()