Repository for a workshop on Bayesian statistics

Bayesian Statistics Made Simple

Code and exercises from my workshop on Bayesian statistics in Python.

MIT License: https://opensource.org/licenses/MIT

In [1]:

from __future__ import print_function, division

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from thinkbayes2 import Pmf, Suite
import thinkplot

Working with Pmfs

Create a Pmf object to represent a six-sided die.

In [2]:

d6 = Pmf()

A Pmf is a map from possible outcomes to their probabilities.

In [3]:

for x in [1,2,3,4,5,6]:
    d6[x] = 1

Initially the probabilities don't add up to 1.

In [4]:

d6.Print()

Out[4]:

Normalize adds up the probabilities and divides through. The return value is the total probability before normalizing.

In [5]:

d6.Normalize()

Out[5]:

6

Now the Pmf is normalized.

In [6]:

d6.Print()

Out[6]:

1 0.16666666666666666
2 0.16666666666666666
3 0.16666666666666666
4 0.16666666666666666
5 0.16666666666666666
6 0.16666666666666666
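
As a rough illustration of what normalizing does, here is a plain-Python sketch that uses a dict instead of a Pmf. The helper name `normalize` and the variable `d6_sketch` are made up for this example; this is not the thinkbayes2 implementation, and it uses exact fractions only to keep the arithmetic easy to check.

```python
from fractions import Fraction


def normalize(weights):
    """Scale the weights in place so they sum to 1.

    Returns the total weight before scaling, mirroring the return
    value described for Normalize above.
    """
    total = sum(weights.values())
    for outcome in weights:
        weights[outcome] = Fraction(weights[outcome], total)
    return total


# Hypothetical stand-in for the d6 Pmf: every outcome starts with weight 1.
d6_sketch = {x: 1 for x in [1, 2, 3, 4, 5, 6]}
total = normalize(d6_sketch)

print(total)                 # 6
print(float(d6_sketch[1]))   # 0.16666666666666666
```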

And we can compute its mean (which only works if it's normalized).

In [7]:

d6.Mean()

Out[7]:

3.5
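
To see why the mean only makes sense for a normalized distribution, here is a small sketch with plain dicts (the function name `pmf_mean` is an assumption for this example, not part of thinkbayes2). The mean is the sum of each outcome times its probability, so unnormalized weights give a scaled-up number.

```python
from fractions import Fraction


def pmf_mean(pmf):
    """Sum of outcome * probability over all outcomes."""
    return sum(x * p for x, p in pmf.items())


normalized = {x: Fraction(1, 6) for x in range(1, 7)}
unnormalized = {x: 1 for x in range(1, 7)}

print(float(pmf_mean(normalized)))  # 3.5
print(pmf_mean(unnormalized))       # 21, which is 3.5 times the total weight of 6
```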

Random chooses a random value from the Pmf.

In [8]:

d6.Random()

Out[8]:

4
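
One way to draw a random value in proportion to its probability is the standard library's `random.choices`. The sketch below assumes a plain dict in place of a Pmf; `pmf_random` is a hypothetical helper, and thinkbayes2 may do this differently.

```python
import random


def pmf_random(pmf, rng=random):
    """Pick one outcome, with probability proportional to its weight."""
    outcomes = list(pmf)
    weights = [pmf[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


d6_sketch = {x: 1 / 6 for x in range(1, 7)}
rng = random.Random(17)  # seeded only so that repeated runs agree
rolls = [pmf_random(d6_sketch, rng) for _ in range(1000)]

# Every roll should be one of the six faces.
print(sorted(set(rolls)))
```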

thinkplot provides functions for plotting Pmfs in a few different styles.

In [9]:

thinkplot.Hist(d6)

Out[9]:

Exercise 1: The Pmf object provides __add__, so you can use the + operator to compute the Pmf of the sum of two dice.

Compute and plot the Pmf of the sum of two 6-sided dice.

In [10]:

# Solution goes here

Exercise 2: Suppose I roll two dice and tell you the result is greater than 3.

Plot the Pmf of the remaining possible outcomes and compute its mean.

In [11]:

# Solution goes here

The cookie problem

Create a Pmf with two equally likely hypotheses.

In [12]:

cookie = Pmf(['Bowl 1', 'Bowl 2'])
cookie.Print()

Out[12]:

Bowl 1 0.5
Bowl 2 0.5

Update each hypothesis with the likelihood of the data (a vanilla cookie).

In [13]:

cookie['Bowl 1'] *= 0.75
cookie['Bowl 2'] *= 0.5
cookie.Normalize()

Out[13]:

0.625

Print the posterior probabilities.

In [14]:

cookie.Print()

Out[14]:

Bowl 1 0.6000000000000001
Bowl 2 0.4
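
The same update can be checked by hand with exact fractions. This is a plain-Python sketch of the arithmetic in the cells above, not a replacement for the Pmf code; the dict names are chosen just for this example.

```python
from fractions import Fraction

# Prior: the two bowls are equally likely.
prior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}

# Likelihood of drawing a vanilla cookie from each bowl (0.75 and 0.5 above).
likelihood_vanilla = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}

# Multiply prior by likelihood, then divide by the total.
unnormalized = {h: prior[h] * likelihood_vanilla[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}

print(float(total))         # 0.625, the value returned by Normalize
print(posterior['Bowl 1'])  # 3/5
print(posterior['Bowl 2'])  # 2/5
```

Working in fractions also shows that the trailing digits in 0.6000000000000001 are floating-point rounding: the exact posterior for Bowl 1 is 3/5.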

Exercise 3: Suppose we put the first cookie back, stir, choose again from the same bowl, and get a chocolate cookie.

Hint: The posterior (after the first cookie) becomes the prior (before the second cookie).

In [15]:

# Solution goes here

Exercise 4: Instead of doing two updates, what if we collapse the two pieces of data into one update?

Re-initialize Pmf with two equally likely hypotheses and perform one update based on two pieces of data, a vanilla cookie and a chocolate cookie.

The result should be the same regardless of how many updates you do (or the order of updates).
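
The claim about order can be checked on a toy case. The sketch below deliberately uses made-up hypotheses 'A' and 'B' and made-up likelihoods rather than the cookie numbers, so it does not give away the exercise; the `update` helper is also hypothetical.

```python
from fractions import Fraction


def update(pmf, likelihood):
    """Multiply each hypothesis by its likelihood and renormalize."""
    unnormalized = {h: p * likelihood[h] for h, p in pmf.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


prior = {'A': Fraction(1, 2), 'B': Fraction(1, 2)}

# Made-up likelihoods for two independent observations, x and y.
like_x = {'A': Fraction(1, 5), 'B': Fraction(3, 5)}
like_y = {'A': Fraction(4, 5), 'B': Fraction(3, 10)}
like_xy = {h: like_x[h] * like_y[h] for h in prior}

x_then_y = update(update(prior, like_x), like_y)
y_then_x = update(update(prior, like_y), like_x)
both_at_once = update(prior, like_xy)

print(x_then_y == y_then_x == both_at_once)  # True
print(both_at_once['A'])                     # 8/17
```

The results agree because each update only multiplies by a likelihood and rescales, and multiplication does not depend on order.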

In [16]:

# Solution goes here

The dice problem

Create a Pmf to represent dice with different numbers of sides.

In [17]:

pmf = Pmf([4, 6, 8, 12])
pmf.Print()

Out[17]:

4 0.25
6 0.25
8 0.25
12 0.25

Exercise 5: We'll solve this problem two ways. First we'll do it "by hand", as we did with the cookie problem; that is, we'll multiply each hypothesis by the likelihood of the data, and then renormalize.

In the space below, update pmf based on the likelihood of the data (rolling a 6), then normalize and print the results.

In [18]:

# Solution goes here

Exercise 6: Now let's do the same calculation using Suite.Update.

Write a definition for a new class called Dice that extends Suite. Then define a method called Likelihood that takes data and hypo and returns the probability of the data (the outcome of rolling the die) for a given hypothesis (number of sides on the die).

Hint: What should you do if the outcome exceeds the hypothetical number of sides on the die?

Here's an outline to get you started:

In [19]:

class Dice(Suite):
    # hypo is the number of sides on the die
    # data is the outcome
    def Likelihood(self, data, hypo):
        return 1

In [20]:

# Solution goes here

Now we can create a Dice object and update it.

In [21]:

dice = Dice([4, 6, 8, 12])
dice.Update(6)
dice.Print()

Out[21]:

4 0.25
6 0.25
8 0.25
12 0.25

If we get more data, we can perform more updates.

In [22]:

for roll in [8, 7, 7, 5, 4]:
    dice.Update(roll)

Here are the results.

In [23]:

dice.Print()

Out[23]:

4 0.25
6 0.25
8 0.25
12 0.25

The German tank problem

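The German tank problem has the same structure as the dice problem: each hypothesis is a total number of tanks, and an observed serial number plays the role of a die roll.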

In [24]:

class Tank(Suite):
    # hypo is the number of tanks
    # data is an observed serial number
    def Likelihood(self, data, hypo):
        if data > hypo:
            return 0
        else:
            return 1 / hypo

Here are the posterior probabilities after seeing Tank #37.

In [25]:

tank = Tank(range(100))
tank.Update(37)
thinkplot.Pdf(tank)
tank.Mean()

Out[25]:

62.822944785168964
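
For readers who want to see the arithmetic without thinkbayes2, here is a plain-Python sketch of the same update. It copies the Likelihood logic from the Tank class above; the function names and the dict-based prior are assumptions made for this example.

```python
def tank_likelihood(data, hypo):
    """Probability of seeing serial number `data` if there are `hypo` tanks."""
    if data > hypo:
        return 0.0
    return 1.0 / hypo


def update(pmf, data):
    """Multiply each hypothesis by its likelihood and renormalize."""
    unnormalized = {h: p * tank_likelihood(data, h) for h, p in pmf.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Uniform prior over the same hypotheses as Tank(range(100)): 0 through 99.
prior = {n: 1 / 100 for n in range(100)}
posterior = update(prior, 37)
mean = sum(n * p for n, p in posterior.items())

print(mean)  # should agree with tank.Mean() above, up to floating-point rounding
```

Hypotheses below 37 are ruled out, and among the rest the smaller fleets get more weight because each one makes serial number 37 more likely.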

Exercise 7: Suppose we see another tank with serial number 17. What effect does this have on the posterior probabilities?

Update the suite again with the new data and plot the results.

In [26]:

# Solution goes here

The Euro problem

Exercise 8: Write a class definition for Euro, which extends Suite and defines a likelihood function that computes the probability of the data (heads or tails) for a given value of x (the probability of heads).

Note that hypo is in the range 0 to 100. Here's an outline to get you started.

In [27]:

class Euro(Suite):
    
    def Likelihood(self, data, hypo):
        """ 
        hypo is the prob of heads (0-100)
        data is a string, either 'H' or 'T'
        """
        return 1

In [28]:

# Solution goes here

We'll start with a uniform distribution from 0 to 100.

In [29]:

euro = Euro(range(101))
thinkplot.Pdf(euro)

Out[29]:

Now we can update with a single heads:

In [30]:

euro.Update('H')
thinkplot.Pdf(euro)

Out[30]:

Another heads:

In [31]:

euro.Update('H')
thinkplot.Pdf(euro)

Out[31]:

And a tails:

In [32]:

euro.Update('T')
thinkplot.Pdf(euro)

Out[32]:

Starting over, here's what it looks like after 7 heads and 3 tails.

In [33]:

euro = Euro(range(101))

for outcome in 'HHHHHHHTTT':
    euro.Update(outcome)

thinkplot.Pdf(euro)
euro.MaximumLikelihood()

Out[33]:

100

The value with the maximum posterior probability is 70%, which matches the observed proportion of heads (7 out of 10).
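
The 70% figure can be checked with a short plain-Python sketch. It assumes one possible likelihood for the Euro problem (probability of heads is hypo/100, probability of tails is 1 - hypo/100); that choice and the names below are assumptions for this example, and working through Exercise 8 first is recommended.

```python
from fractions import Fraction


def euro_likelihood(data, hypo):
    """Assumed likelihood: hypo is the chance of heads, in percent."""
    x = Fraction(hypo, 100)
    return x if data == 'H' else 1 - x


# Uniform prior over 0..100, then one update per coin toss.
posterior = {x: Fraction(1, 101) for x in range(101)}
for outcome in 'HHHHHHHTTT':
    posterior = {h: p * euro_likelihood(outcome, h)
                 for h, p in posterior.items()}

total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}

# The hypothesis with the highest posterior probability.
map_value = max(posterior, key=posterior.get)
print(map_value)  # 70
```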

Here are the posterior probabilities after 140 heads and 110 tails.

In [34]:

euro = Euro(range(101))

evidence = 'H' * 140 + 'T' * 110
for outcome in evidence:
    euro.Update(outcome)
    
thinkplot.Pdf(euro)

Out[34]:

The posterior mean is about 56%.

In [35]:

euro.Mean()

Out[35]:

49.9999999999999

So is the value with the maximum a posteriori probability (MAP).

In [36]:

euro.MAP()

Out[36]:

100
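
The outputs shown above appear to come from the outline's placeholder Likelihood, which returns 1 and so leaves the uniform prior unchanged. As a sketch of what a worked likelihood would give, the code below assumes that each head contributes a factor of x/100 and each tail a factor of 1 - x/100. With a uniform prior the posterior weight of x is then proportional to x^140 * (100 - x)^110, and the common factor of 100^250 can be dropped. The variable names are made up for this example.

```python
from fractions import Fraction

heads, tails = 140, 110

# Unnormalized posterior weights under a uniform prior over 0..100.
weights = {x: x**heads * (100 - x)**tails for x in range(101)}
total = sum(weights.values())
posterior = {x: Fraction(w, total) for x, w in weights.items()}

mean = float(sum(x * p for x, p in posterior.items()))
map_value = max(posterior, key=posterior.get)

print(mean)       # roughly 55.95
print(map_value)  # 56
```

Python integers have arbitrary precision, so the large powers are computed exactly and only the final mean is converted to a float.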

The posterior credible interval has a 90% chance of containing the true value (provided that the prior distribution truly represents our background knowledge).

In [37]:

euro.CredibleInterval(90)

Out[37]:

(5, 95)
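
One common way to get a central credible interval is to walk the cumulative distribution and cut off equal tails on each side. The sketch below assumes that approach; `credible_interval` is a hypothetical helper, and thinkbayes2 may compute its interval differently. It is applied to a uniform distribution over 0..100, which reproduces the (5, 95) shown above.

```python
from fractions import Fraction


def credible_interval(pmf, percentage):
    """Central interval that cuts off equal tails, e.g. 5% each for 90."""
    tail = Fraction(100 - percentage, 200)

    def value_at(prob):
        # Smallest value whose cumulative probability reaches prob.
        cumulative = 0
        for x in sorted(pmf):
            cumulative += pmf[x]
            if cumulative >= prob:
                return x
        return max(pmf)

    return value_at(tail), value_at(1 - tail)


uniform = {x: Fraction(1, 101) for x in range(101)}
interval = credible_interval(uniform, 90)
print(interval)  # (5, 95)
```

An interval this wide is what an unchanged uniform prior gives; a posterior that has absorbed 250 coin tosses would be much narrower.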

Swamping the prior

The following function makes a Euro object with a triangle prior.

In [38]:

def TrianglePrior():
    """Makes a Suite with a triangular prior."""
    suite = Euro(label='triangle')
    for x in range(0, 51):
        suite[x] = x
    for x in range(51, 101):
        suite[x] = 100-x 
    suite.Normalize()
    return suite

And here's what it looks like:

In [39]:

euro1 = Euro(range(101), label='uniform')
euro2 = TrianglePrior()
thinkplot.Pdfs([euro1, euro2])
thinkplot.Config(title='Priors')

Out[39]:

Exercise 9: Update euro1 and euro2 with the same data we used before (140 heads and 110 tails) and plot the posteriors. How big is the difference in the means?

In [40]:

# Solution goes here
