📚 The CoCalc Library - books, templates and other resources
License: OTHER
###Bayes Factor for Log Regression/
#####Is there really a relationship between failure and temperature?
An critism of our above analysis is that assumed that the relationship followed a logistic model, this we implictly assumed that the probabilities change over temperature. Let's look at the data again. (Top figure)
Could it be that infact this particular sequence of defects occured by chance? After all, I can produce similar plots. (Bottom figure) This might explain the large overlap in temperatures.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-a4aeded293db> in <module>()
----> 1 figsize( 12.5, 6)
2 subplot(211)
3 plt.scatter( challenger_data[:,0], challenger_data[:,1], s = 75, \
4 color="k", alpha = 0.5)
5 plt.yticks([0,1])
NameError: name 'figsize' is not defined
Introducing the Bayes factor
The Bayes factor is a measure comparing two models. In our example, on the one hand we believe that temperature plays an important role in the probability of O-ring failures. On the other hand, we believe that there is no significant connection, and the pattern appears by coincidence. We can compare these model using the ratio of the probabilities of observing the data, given the model:
Calculating this is not at all obvious. For simplicity, let's select a set of parameters from the posterior distribution (which is tantamount to selecting a particular model).
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-bf332447957b> in <module>()
----> 1 alpha_hat = alpha_samples[0,0] #select the first alpha sample
2 beta_hat = beta_samples[0,0] #select the first beta sample
3
4 p_hat = logistic( temperature, beta_hat, alpha_hat)
5 print "estimates of probability at observed temperature, defects: "
NameError: name 'alpha_samples' is not defined
To calculate the numerator in the Bayes factor, we start by assuming a model, in our case assume alpha_hat
, beta_hat
are true, and calculate the probability of the defects we observe, equal to the product of:
For example, using the output above, the first computation, , would look like
The probability of this ocurring is (Recall the definition of a Bernoulli random variable : it is equal to with probability and equal to 0 with probability ). As each observation is independent, we multiply all the observations together. A clever way to do this for a specific is, recalling the can only be 1 or 0:
(it is possible that the quantity can overflow, so it is recommended to take the of the above::
and sum the terms instead of multiplying. Be sure to use this transform for both models you are comparing.)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-be149a40cb07> in <module>()
----> 1 print p_hat
2 _product = p_hat**( D )*(1-p_hat)**(1-D)
3 prob_given_model_1 = _product.prod()
4 print "probability of observations, model 1: %.10f"%prob_given_model_1
NameError: name 'p_hat' is not defined
The second model we are comparing against is that , i.e. there is no relationship between probabilites and temperature. We perform the same calculations as above. Notice that beta=0
here in the below PyMC code:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-4-2fd082360270> in <module>()
1 beta = 0
----> 2 alpha = mc.Normal( "alpha", 0, 0.001, value = 0 )
3
4 @mc.deterministic
5 def p( temp = temperature, alpha = alpha, beta = beta):
NameError: name 'mc' is not defined
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-de7da8dccc70> in <module>()
1 #compute the probability of observations given the model_2
2
----> 3 _product = p_hat**( D )*(1-p_hat)**(1-D)
4 prob_given_model_2 = _product.prod()
5 print "probability of observations, model 2: %.10f"%prob_given_model_2
NameError: name 'p_hat' is not defined
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-c40e6307b245> in <module>()
----> 1 print "Bayes factor = %.3f"%(prob_given_model_1/prob_given_model_2)
NameError: name 'prob_given_model_1' is not defined
Is this good? Below is a chart that can be used to compare the computed Bayes factors to a degree of confidence in M1.
Bayes factor | Supporting M1 |
---|---|
<1:1 | Negative (supports M2) |
1:1 to 3:1 | Barely worth mentioning |
3:1 to 10:1 | Substantial |
10:1 to 30:1 | Strong |
30:1 to 100:1 | Very strong |
> 100:1 | Decisive |
We are not done yet. If you recall, we selected only one model, but recall we actually have a distribution of possible models (sequential pairs of () from the posterior distributions). So to be correct we need to average over samples from the posterior:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-5e4bf3a1b066> in <module>()
----> 1 p = logistic( temperature[None,:], beta_samples, alpha_samples )
2 _product = p**( D )*(1-p)**(1-D)
3 prob_given_model_1 = _product.prod(axis=1).mean()
4 print "expected prob. of obs., given model 1: %.10f"%prob_given_model_1
5
NameError: name 'logistic' is not defined
It looks we have a pretty good model, and temperature is significant. Look at you, you're a rocket scientist now.