# the notebook assumes pylab mode (figsize, plotting names already in the namespace)
figsize(12.5, 4)
import scipy.stats as stats

Chapter 11


More Hacking with PyMC

This chapter introduces useful and advanced PyMC techniques, including building your own stochastic variables, user-defined step methods, and more.

Example: Real-time Github Popularity Measures

Most of you are likely familiar with the git-repository website Github. An observed phenomenon on Github is the scale-free nature of repository popularity. Here, for lack of a better measure, we use the number of stars and forks to measure popularity. This is not a bad measure, but it ignores page-views and downloads, and tends to overemphasize older repositories. Since we will be studying all repositories and not a single one, the absence of these measures is not as relevant.

Contained in this folder is a Python script for scraping data from Github on the popularity of repos. The script requires the Requests and BeautifulSoup libraries, but if you don't have those installed, the ./data folder contains the same data from a previous date (February 18, 2013 at last pull). The data is the fraction of repositories with stars greater than or equal to $2^k,\; k = 0, \dots, 15$, and the fraction of repositories with forks greater than or equal to $2^k,\; k = 0, \dots, 15$.
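If you'd rather not hit Github at all, the same counts are easy to derive from any array of raw per-repo star totals. A minimal sketch, where `raw_star_counts` is a hypothetical stand-in for real data:

# hypothetical raw data: one star count per repository
raw_star_counts = np.random.geometric(0.05, size=100000) - 1

# thresholds 0, 2^0, 2^1, ..., 2^15, matching the scraped dataset
stars_to_explore = np.concatenate([[0], 2 ** np.arange(16)])

# number of repos with at least each threshold of stars
repo_with_stars = np.array([(raw_star_counts >= k).sum()
                            for k in stars_to_explore])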

%run github_datapull.py
Scrapping data from Github. Sorry Github...
The data is contained in variables `foo_to_explore` and `repo_with_foo`
stars first...
number of repos with greater than or equal to 0 stars: 2738541
number of repos with greater than or equal to 1 stars: 1704779
number of repos with greater than or equal to 2 stars: 493529
number of repos with greater than or equal to 4 stars: 212099
number of repos with greater than or equal to 8 stars: 106973
number of repos with greater than or equal to 16 stars: 58101
number of repos with greater than or equal to 32 stars: 31877
number of repos with greater than or equal to 64 stars: 17370
number of repos with greater than or equal to 128 stars: 9239
number of repos with greater than or equal to 256 stars: 4578
number of repos with greater than or equal to 512 stars: 2150
number of repos with greater than or equal to 1024 stars: 872
number of repos with greater than or equal to 2048 stars: 286
number of repos with greater than or equal to 4096 stars: 84
number of repos with greater than or equal to 8192 stars: 22
number of repos with greater than or equal to 16384 stars: 5
number of repos with greater than or equal to 32768 stars: 1
forks second...
number of repos with greater than or equal to 0 forks: 2738548
number of repos with greater than or equal to 1 forks: 334539
number of repos with greater than or equal to 2 forks: 159206
number of repos with greater than or equal to 4 forks: 74836
number of repos with greater than or equal to 8 forks: 36532
number of repos with greater than or equal to 16 forks: 17948
number of repos with greater than or equal to 32 forks: 8394
number of repos with greater than or equal to 64 forks: 3841
number of repos with greater than or equal to 128 forks: 1580
number of repos with greater than or equal to 256 forks: 605
number of repos with greater than or equal to 512 forks: 222
number of repos with greater than or equal to 1024 forks: 69
number of repos with greater than or equal to 2048 forks: 17
number of repos with greater than or equal to 4096 forks: 4
number of repos with greater than or equal to 8192 forks: 2
number of repos with greater than or equal to 16384 forks: 0
number of repos with greater than or equal to 32768 forks: 0
plt.plot(stars_to_explore, repo_with_stars, label="stars")
plt.plot(forks_to_explore, repo_with_forks, label="forks")
plt.legend(loc="lower right")
plt.title("Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$K$")
plt.ylabel("number of repos with stars/forks $\geq K$")
plt.xlim(-200, 35000);

Clearly, we need to adjust the scale of this plot, as most of the action is hidden: the number of repos falls off very quickly. We will put it on a log-log plot.

plt.plot(log2(stars_to_explore + 1), log2(repo_with_stars + 1), 'o-', label="stars")
plt.plot(log2(forks_to_explore + 1), log2(repo_with_forks + 1), 'o-', label="forks")
plt.legend(loc="upper right")
plt.title("Log-Log plot of Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$\log_2{K}$")
plt.ylabel("$\log_2$(number of repos with stars/forks $\geq K$)");

Both characteristics look like straight lines when plotted on a log-log scale. What does this mean? Denote by $P(k)$ the fraction of repos with greater than or equal to $k$ stars (or forks). So in the above plot, $\log{P(k)}$ is on the y-axis and $\log{k}$ is on the x-axis. The linear relationship can be written as:

$$\log_2{P(k)} = \beta\log_2{k} + \alpha$$

Rearranging, by raising 2 to the power of each side:

$$P(k) = 2^{\alpha} k^{\beta} = C k^{\beta}, \;\; C = 2^{\alpha}$$
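Before doing anything Bayesian, a quick sanity check on $\beta$ is an ordinary least-squares fit of the straight line above; a sketch, reusing the arrays plotted earlier (the +1 guards against $\log{0}$, as in the plot):

# crude least-squares estimate of the power-law exponent on the log-log scale
beta_hat, alpha_hat = np.polyfit(log2(stars_to_explore + 1),
                                 log2(repo_with_stars + 1), 1)
print "crude estimate of beta:", beta_hat
print "crude estimate of C = 2**alpha:", 2 ** alpha_hat

np.polyfit returns the slope first, so beta_hat is the exponent. As argued below, such naive estimates of $\beta$ should be treated with suspicion.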

This relationship is very interesting. It is called a power-law, and it occurs very frequently in social datasets. Why does it occur so frequently? It has much to do with a "winner-take-all" or "winner-take-most" effect: winners in a power-law environment are components that take a disproportionate amount of the popularity, and keep winning. In terms of repo popularity, winning repos are repos of very good quality (they are initially winners) that are shared and talked about often (they keep winning).

The above plot also tells us that the majority of repos have very few stars and forks, only a handful have hundreds, and an incredibly small number have thousands. This is not so obvious after browsing Github's website, where you see some repos with 36000+ stars but fail to see the millions that have no stars at all (as they are not popular, they won't be common on your tour of the site).

Distributions like this are also said to have fat tails: the probability does not drop off quickly as we move into the tail of the distribution, even though most of the probability mass remains centered near zero.

The heaviness of the tail and the strength of the "winner-take-all" effect are both controlled by the $\beta$ parameter: the smaller the $\beta$, the more pronounced these effects (a short sketch after the table illustrates this). Below is a list of phenomena that follow a power-law, with an approximate $\beta$ exponent for each [1]. Recall though that we never observe these numbers; we must infer them from the data.

| Phenomenon | Assumed Exponent |
| --- | --- |
| Frequency of word use | -1.2 |
| Number of hits on website | -1.4 |
| US book sales | -1.5 |
| Intensity of wars | -0.8 |
| Net worth of Americans | -1.1 |
| Github Stars | ?? |
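To see the effect of $\beta$ concretely, here is a short sketch comparing two survival curves $P(k) = k^{\beta}$ from the table (taking $C = 1$ purely for illustration):

k = np.arange(1, 1001)
plt.plot(log2(k), log2(k ** -0.8), label=r"intensity of wars, $\beta = -0.8$")
plt.plot(log2(k), log2(k ** -1.5), label=r"US book sales, $\beta = -1.5$")
plt.legend(loc="lower left")
plt.xlabel("$\log_2{k}$")
plt.ylabel("$\log_2{P(k)}$");

The shallower line ($\beta = -0.8$) keeps substantial probability far into the tail, which is exactly the "winner-take-most" regime.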

The estimation problem

It is very easy to overestimate the true parameter $\beta$. This is because tail events (repos with 500+ stars) are very rare. For example, suppose our Github dataset contained only 100 samples. With fairly high probability (about 30%), all of these samples would have fewer than 32 stars, because approximately 99% of all repos do ((number of all repos - number of repos with 32 or more stars) / (number of all repos)). We would then have no samples in our dataset from the tail of the distribution. If I told you that there existed a repo with 36000+ stars, you would call me crazy, as it would be about 1000 times larger than the most popular repo you had observed, and you would assign a very large $\beta$ exponent to your dataset (recall that a large $\beta$ means a thinner tail). Similarly, even with a sample of 1000 there is still close to a 50% chance of seeing no repo with 512 or more stars. Taking this to its logical conclusion: how confident should we be that there does not exist a theoretical repo that could attain 72000+ stars, or 150000+ stars, one which would push the estimated $\beta$ down even further?
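These probabilities are direct computations from the scraped counts printed earlier, treating the sample as independent draws:

p_tail_32 = 31877. / 2738541    # fraction of repos with >= 32 stars
print (1 - p_tail_32) ** 100    # about 0.31: no tail repos in 100 samples

p_tail_512 = 2150. / 2738541    # fraction of repos with >= 512 stars
print (1 - p_tail_512) ** 1000  # about 0.46: none in 1000 samples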

Yule-Simon distribution

The Yule-Simon distribution is a discrete distribution whose tail follows a power-law, which makes it a natural candidate here. Its probability mass function is

$$P(K = k) = \rho B(k, \rho + 1), \quad k = 1, 2, \dots$$

where $B$ is the Beta function and $\rho > 0$; for large $k$ the mass behaves like $k^{-(\rho + 1)}$. Below we wrap this density in a PyMC stochastic variable, with a `random` method exploiting the fact that a Yule-Simon variable can be generated as a geometric whose success probability is itself randomized by an exponential draw.

from scipy.special import beta
import pymc as pm

param = pm.Exponential("param", 1)

@pm.stochastic(dtype=int, observed=True)
def yule_simon(value=repo_with_stars, rho=param):
    """test"""
    def logp(value, rho):
        return np.log(rho) + np.log(beta(value, rho + 1))

    def random(rho):
        # Yule-Simon via an exponential-geometric mixture:
        # W ~ Exponential(rho), then K ~ Geometric(exp(-W))
        W = stats.expon.rvs(scale=1. / rho)
        return stats.geom.rvs(np.exp(-W))

model = pm.Model([param, yule_simon])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 8000)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-fe84742cff9d> in <module>()
      8
      9 @mc.stochastic( dtype = int, observed = True )
---> 10 def yule_simon( value = repo_with_stars, rho = param ):
     11     """test"""
     12

c:\Python27\lib\site-packages\pymc\InstantiationDecorators.pyc in instantiate_p(__func__)
    147     def instantiate_p(__func__):
    148         value, parents = _extract(__func__, kwds, keys, 'Stochastic')
--> 149         return __class__(value=value, parents=parents, **kwds)
    150
    151     keys = ['logp','random','rseed']

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
    714         if check_logp:
    715             # Check initial value
--> 716             if not isinstance(self.logp, float):
    717                 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__())
    718

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self)
    833             logp = float(logp)
    834         except:
--> 835             raise TypeError(self.__name__ + ': computed log-probability ' + str(logp) + ' cannot be cast to float')
    836
    837         if logp != logp:

TypeError: yule_simon: computed log-probability [-20.51503062 -19.93158602 -18.31405136 -17.21386783 -16.32349938
 -15.53050299 -14.75010755 -13.96101721 -13.13877723 -12.2264853
 -11.23694781 -10.06769225  -8.63087616  -7.0237458   -5.33941252
  -3.44559118  -1.4738842 ] cannot be cast to float
# the culprit: `beta` evaluated on an array returns an array,
# so the logp above returns an array rather than a single float
def logp(value, rho):
    return np.log(rho) + np.log(beta(value, rho + 1))

beta(repo_with_stars, 1.3)
array([ 3.96781274e-09, 7.12048348e-09, 3.60230434e-08, 1.08508004e-07, 2.64859823e-07, 5.86390404e-07, 1.28195491e-06, 2.82711390e-06, 6.44529816e-06, 1.60819963e-05, 4.33570545e-05, 1.39960612e-04, 5.90762159e-04, 2.95765600e-03, 1.59980669e-02, 1.06728840e-01, 7.69230769e-01])
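One possible fix is to sum the element-wise terms so that `logp` returns a scalar; a sketch, keeping everything else from the failing cell unchanged:

@pm.stochastic(dtype=int, observed=True)
def yule_simon(value=repo_with_stars, rho=param):
    """test"""
    def logp(value, rho):
        # sum over observations so the result is a single float
        return np.sum(np.log(rho) + np.log(beta(value, rho + 1)))

    def random(rho):
        W = stats.expon.rvs(scale=1. / rho)
        return stats.geom.rvs(np.exp(-W))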

Exercises:

  1. Distributions like the Normal distribution have very skinny tails. Compare the PDFs of the Normal versus a power-law distribution.

x = np.linspace(1, 50, 200)
plt.plot(x, exp(-(x - 1) ** 2), label="Normal distribution")
plt.plot(x, x ** (-2), label=r"Power law, $\beta = -2$")
plt.plot(x, x ** (-1), label=r"Power law, $\beta = -1$")
plt.legend();

References

  1. Taleb, Nassim. The Black Swan. 1st edition. New York: Random House, 2007. Print.

from IPython.core.display import HTML

def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)

css_styling()
import pymc as pm

beta = pm.Uniform("beta", -100, 100)

@pm.observed
def survival(value=y_, beta=beta):
    # log-likelihood of binned counts: under the power-law survival
    # function S(i) = i**beta, an observation falls in bin [i, i+1)
    # with probability i**beta - (i+1)**beta
    return np.sum([value[i - 1] * np.log((i + 0.) ** beta - (i + 1.) ** beta)
                   for i in range(1, 99)])
model = pm.Model([survival, beta])
# map_ = pm.MAP(model)
# map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(50000, 40000)
[****************100%******************] 50000 of 50000 complete
from pymc.Matplot import plot as mcplot
mcplot(mcmc)
Plotting beta
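With the sampler finished, the posterior draws live in the trace; a quick numeric summary to go with the plot, using PyMC's standard trace accessor:

beta_samples = mcmc.trace("beta")[:]
print "posterior mean of beta:", beta_samples.mean()
print "posterior std. dev. of beta:", beta_samples.std()

Assuming `y_` holds the binned counts built from the Pareto(2.5) draws in the scratch cells below, the posterior for `beta` should concentrate somewhere near -2.5.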
stars_to_explore[1:];
array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768])
a = stats.pareto.rvs(2.5, size=(50000, 1));
hist(a, bins=100)
print
y = [(a >= i).sum() for i in range(1, 100)];
y_ = -np.diff(y)
print y_
print y
[41264 5572 1638 646 313 181 101 70 48 47 22 14 14 14 7 11 4 6 4 0 1 0 2 0 0 1 2 1 3 0 2 2 1 0 1 1 2 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [50000, 8736, 3164, 1526, 880, 567, 386, 285, 215, 167, 120, 98, 84, 70, 56, 49, 38, 34, 28, 24, 24, 23, 23, 21, 21, 21, 20, 18, 17, 14, 14, 12, 10, 9, 9, 8, 7, 5, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b = -2.3;
np.sum([y_[i - 1] * np.log((i + 0.) ** b - (i + 1.) ** b) for i in range(1, 7)]) + y[-1] * np.log(7);
-13526.483069774908
y_;
array([48930, 940, 103, 19, 7, 1])
np.append(y_, y[-1]);
array([48930, 940, 103, 19, 7, 1, 0])
pm.Uninformative?