# the notebook assumes pylab mode (figsize, plotting names already in the namespace)
figsize(12.5, 4)
import scipy.stats as stats

Chapter 11


More Hacking with PyMC

This chapter introduces useful and advanced PyMC techniques, including building your own stochastic variables, user-defined step methods, and more.

Example: Real-time Github Popularity Measures

Most of you are likely familiar with the git-repository website Github. An observed phenomenon on Github is the scale-free nature of repository popularity. Here, for lack of a better measure, we use the number of stars and forks to measure popularity. This is not a bad measure, but it ignores page-views and downloads, and tends to overemphasize older repositories. Since we will be studying all repositories and not a single one, the absence of these measures is not as relevant.

Contained in this folder is a Python script for scraping data from Github on the popularity of repos. The script requires the Requests and BeautifulSoup libraries, but if you don't have those installed, the ./data folder contains the same data from a previous date (February 18, 2013 at last pull). The data is the fraction of repositories with stars greater than or equal to $2^k,\; k = 0, \dots, 15$, and the fraction of repositories with forks greater than or equal to $2^k,\; k = 0, \dots, 15$.
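If you'd rather not hit Github at all, the same counts are easy to derive from any array of raw per-repo star totals. A minimal sketch, where `raw_star_counts` is a hypothetical stand-in for real data:

# hypothetical raw data: one star count per repository
raw_star_counts = np.random.geometric(0.05, size=100000) - 1

# thresholds 0, 2^0, 2^1, ..., 2^15, matching the scraped dataset
stars_to_explore = np.concatenate([[0], 2 ** np.arange(16)])

# number of repos with at least each threshold of stars
repo_with_stars = np.array([(raw_star_counts >= k).sum()
                            for k in stars_to_explore])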

%run github_datapull.py
Scrapping data from Github. Sorry Github...
The data is contained in variables `foo_to_explore` and `repo_with_foo`
stars first...
number of repos with greater than or equal to 0 stars: 2738541
number of repos with greater than or equal to 1 stars: 1704779
number of repos with greater than or equal to 2 stars: 493529
number of repos with greater than or equal to 4 stars: 212099
number of repos with greater than or equal to 8 stars: 106973
number of repos with greater than or equal to 16 stars: 58101
number of repos with greater than or equal to 32 stars: 31877
number of repos with greater than or equal to 64 stars: 17370
number of repos with greater than or equal to 128 stars: 9239
number of repos with greater than or equal to 256 stars: 4578
number of repos with greater than or equal to 512 stars: 2150
number of repos with greater than or equal to 1024 stars: 872
number of repos with greater than or equal to 2048 stars: 286
number of repos with greater than or equal to 4096 stars: 84
number of repos with greater than or equal to 8192 stars: 22
number of repos with greater than or equal to 16384 stars: 5
number of repos with greater than or equal to 32768 stars: 1
forks second...
number of repos with greater than or equal to 0 forks: 2738548
number of repos with greater than or equal to 1 forks: 334539
number of repos with greater than or equal to 2 forks: 159206
number of repos with greater than or equal to 4 forks: 74836
number of repos with greater than or equal to 8 forks: 36532
number of repos with greater than or equal to 16 forks: 17948
number of repos with greater than or equal to 32 forks: 8394
number of repos with greater than or equal to 64 forks: 3841
number of repos with greater than or equal to 128 forks: 1580
number of repos with greater than or equal to 256 forks: 605
number of repos with greater than or equal to 512 forks: 222
number of repos with greater than or equal to 1024 forks: 69
number of repos with greater than or equal to 2048 forks: 17
number of repos with greater than or equal to 4096 forks: 4
number of repos with greater than or equal to 8192 forks: 2
number of repos with greater than or equal to 16384 forks: 0
number of repos with greater than or equal to 32768 forks: 0
plt.plot(stars_to_explore, repo_with_stars, label="stars")
plt.plot(forks_to_explore, repo_with_forks, label="forks")
plt.legend(loc="lower right")
plt.title("Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$K$")
plt.ylabel("number of repos with stars/forks $\geq K$")
plt.xlim(-200, 35000);

Clearly, we need to adjust the scale of this plot, as most of the action is hidden: the number of repos falls off very quickly. We will put it on a log-log plot.

plt.plot(log2(stars_to_explore + 1), log2(repo_with_stars + 1), 'o-', label="stars")
plt.plot(log2(forks_to_explore + 1), log2(repo_with_forks + 1), 'o-', label="forks")
plt.legend(loc="upper right")
plt.title("Log-Log plot of Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$\log_2{K}$")
plt.ylabel("$\log_2$(number of repos with stars/forks $\geq K$)");

Both characteristics look like straight lines when plotted on a log-log scale. What does this mean? Denote by $P(k)$ the fraction of repos with greater than or equal to $k$ stars (or forks). So in the above plot, $\log{P(k)}$ is on the y-axis and $\log{k}$ is on the x-axis. The linear relationship can be written as:

$$\log_2{P(k)} = \beta\log_2{k} + \alpha$$

Rearranging, by raising 2 to the power of each side:

$$P(k) = 2^{\alpha} k^{\beta} = C k^{\beta}, \;\; C = 2^{\alpha}$$
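Before doing anything Bayesian, a quick sanity check on $\beta$ is an ordinary least-squares fit of the straight line above; a sketch, reusing the arrays plotted earlier (the +1 guards against $\log{0}$, as in the plot):

# crude least-squares estimate of the power-law exponent on the log-log scale
beta_hat, alpha_hat = np.polyfit(log2(stars_to_explore + 1),
                                 log2(repo_with_stars + 1), 1)
print "crude estimate of beta:", beta_hat
print "crude estimate of C = 2**alpha:", 2 ** alpha_hat

np.polyfit returns the slope first, so beta_hat is the exponent. As argued below, such naive estimates of $\beta$ should be treated with suspicion.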

This relationship is very interesting. It is called a power-law, and it occurs very frequently in social datasets. Why does it occur so frequently? It has much to do with a "winner-take-all" or "winner-take-most" effect: winners in a power-law environment are components that take a disproportionate amount of the popularity, and keep winning. In terms of repo popularity, winning repos are repos of very good quality (they are initially winners) that are shared and talked about often (they keep winning).

The above plot also tells us that the majority of repos have very few stars and forks, only a handful have hundreds, and an incredibly small number have thousands. This is not so obvious after browsing Github's website, where you see some repos with 36000+ stars but fail to see the millions that have no stars at all (as they are not popular, they won't be common on your tour of the site).

Distributions like this are also said to have fat tails: the probability does not drop off quickly as we move into the tail of the distribution, even though most of the probability mass remains centered near zero.

The heaviness of the tail and the strength of the "winner-take-all" effect are both controlled by the $\beta$ parameter: the smaller the $\beta$, the more pronounced these effects (a short sketch after the table illustrates this). Below is a list of phenomena that follow a power-law, with an approximate $\beta$ exponent for each [1]. Recall though that we never observe these numbers; we must infer them from the data.

| Phenomenon | Assumed Exponent |
| --- | --- |
| Frequency of word use | -1.2 |
| Number of hits on website | -1.4 |
| US book sales | -1.5 |
| Intensity of wars | -0.8 |
| Net worth of Americans | -1.1 |
| Github Stars | ?? |
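To see the effect of $\beta$ concretely, here is a short sketch comparing two survival curves $P(k) = k^{\beta}$ from the table (taking $C = 1$ purely for illustration):

k = np.arange(1, 1001)
plt.plot(log2(k), log2(k ** -0.8), label=r"intensity of wars, $\beta = -0.8$")
plt.plot(log2(k), log2(k ** -1.5), label=r"US book sales, $\beta = -1.5$")
plt.legend(loc="lower left")
plt.xlabel("$\log_2{k}$")
plt.ylabel("$\log_2{P(k)}$");

The shallower line ($\beta = -0.8$) keeps substantial probability far into the tail, which is exactly the "winner-take-most" regime.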

The estimation problem

It is very easy to overestimate the true parameter $\beta$. This is because tail events (repos with 500+ stars) are very rare. For example, suppose our Github dataset contained only 100 samples. With fairly high probability (about 30%), all of these samples would have fewer than 32 stars, because approximately 99% of all repos do ((number of all repos - number of repos with 32 or more stars) / (number of all repos)). We would then have no samples in our dataset from the tail of the distribution. If I told you that there existed a repo with 36000+ stars, you would call me crazy, as it would be about 1000 times larger than the most popular repo you had observed, and you would assign a very large $\beta$ exponent to your dataset (recall that a large $\beta$ means a thinner tail). Similarly, even with a sample of 1000 there is still close to a 50% chance of seeing no repo with 512 or more stars. Taking this to its logical conclusion: how confident should we be that there does not exist a theoretical repo that could attain 72000+ stars, or 150000+ stars, one which would push the estimated $\beta$ down even further?
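These probabilities are direct computations from the scraped counts printed earlier, treating the sample as independent draws:

p_tail_32 = 31877. / 2738541    # fraction of repos with >= 32 stars
print (1 - p_tail_32) ** 100    # about 0.31: no tail repos in 100 samples

p_tail_512 = 2150. / 2738541    # fraction of repos with >= 512 stars
print (1 - p_tail_512) ** 1000  # about 0.46: none in 1000 samples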

Yule-Simon distribution

The Yule-Simon distribution is a discrete distribution whose tail follows a power-law, which makes it a natural candidate here. Its probability mass function is

$$P(K = k) = \rho B(k, \rho + 1), \quad k = 1, 2, \dots$$

where $B$ is the Beta function and $\rho > 0$; for large $k$ the mass behaves like $k^{-(\rho + 1)}$. Below we wrap this density in a PyMC stochastic variable, with a `random` method exploiting the fact that a Yule-Simon variable can be generated as a geometric whose success probability is itself randomized by an exponential draw.

from scipy.special import beta
import pymc as pm

param = pm.Exponential("param", 1)

@pm.stochastic(dtype=int, observed=True)
def yule_simon(value=repo_with_stars, rho=param):
    """test"""
    def logp(value, rho):
        return np.log(rho) + np.log(beta(value, rho + 1))

    def random(rho):
        # Yule-Simon via an exponential-geometric mixture:
        # W ~ Exponential(rho), then K ~ Geometric(exp(-W))
        W = stats.expon.rvs(scale=1. / rho)
        return stats.geom.rvs(np.exp(-W))

model = pm.Model([param, yule_simon])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 8000)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-fe84742cff9d> in <module>()
      8
      9 @mc.stochastic( dtype = int, observed = True )
---> 10 def yule_simon( value = repo_with_stars, rho = param ):
     11     """test"""
     12

c:\Python27\lib\site-packages\pymc\InstantiationDecorators.pyc in instantiate_p(__func__)
    147     def instantiate_p(__func__):
    148         value, parents = _extract(__func__, kwds, keys, 'Stochastic')
--> 149         return __class__(value=value, parents=parents, **kwds)
    150
    151     keys = ['logp','random','rseed']

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
    714         if check_logp:
    715             # Check initial value
--> 716             if not isinstance(self.logp, float):
    717                 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__())
    718

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self)
    833             logp = float(logp)
    834         except:
--> 835             raise TypeError(self.__name__ + ': computed log-probability ' + str(logp) + ' cannot be cast to float')
    836
    837         if logp != logp:

TypeError: yule_simon: computed log-probability [-20.51503062 -19.93158602 -18.31405136 -17.21386783 -16.32349938
 -15.53050299 -14.75010755 -13.96101721 -13.13877723 -12.2264853
 -11.23694781 -10.06769225  -8.63087616  -7.0237458   -5.33941252
  -3.44559118  -1.4738842 ] cannot be cast to float
# the culprit: `beta` evaluated on an array returns an array,
# so the logp above returns an array rather than a single float
def logp(value, rho):
    return np.log(rho) + np.log(beta(value, rho + 1))

beta(repo_with_stars, 1.3)
array([ 3.96781274e-09, 7.12048348e-09, 3.60230434e-08, 1.08508004e-07, 2.64859823e-07, 5.86390404e-07, 1.28195491e-06, 2.82711390e-06, 6.44529816e-06, 1.60819963e-05, 4.33570545e-05, 1.39960612e-04, 5.90762159e-04, 2.95765600e-03, 1.59980669e-02, 1.06728840e-01, 7.69230769e-01])
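One possible fix is to sum the element-wise terms so that `logp` returns a scalar; a sketch, keeping everything else from the failing cell unchanged:

@pm.stochastic(dtype=int, observed=True)
def yule_simon(value=repo_with_stars, rho=param):
    """test"""
    def logp(value, rho):
        # sum over observations so the result is a single float
        return np.sum(np.log(rho) + np.log(beta(value, rho + 1)))

    def random(rho):
        W = stats.expon.rvs(scale=1. / rho)
        return stats.geom.rvs(np.exp(-W))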

Exercises:

  1. Distributions like the Normal distribution have very skinny tails. Compare the PDFs of the Normal versus a power-law distribution.

x = np.linspace(1, 50, 200)
plt.plot(x, exp(-(x - 1) ** 2), label="Normal distribution")
plt.plot(x, x ** (-2), label=r"Power law, $\beta = -2$")
plt.plot(x, x ** (-1), label=r"Power law, $\beta = -1$")
plt.legend();

References

  1. Taleb, Nassim. The Black Swan. 1st edition. New York: Random House, 2007. Print.

from IPython.core.display import HTML

def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)

css_styling()
import pymc as pm

beta = pm.Uniform("beta", -100, 100)

@pm.observed
def survival(value=y_, beta=beta):
    # log-likelihood of binned counts: under the power-law survival
    # function S(i) = i**beta, an observation falls in bin [i, i+1)
    # with probability i**beta - (i+1)**beta
    return np.sum([value[i - 1] * np.log((i + 0.) ** beta - (i + 1.) ** beta)
                   for i in range(1, 99)])
model = pm.Model([survival, beta])
# map_ = pm.MAP(model)
# map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(50000, 40000)
[****************100%******************] 50000 of 50000 complete
from pymc.Matplot import plot as mcplot
mcplot(mcmc)
Plotting beta
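With the sampler finished, the posterior draws live in the trace; a quick numeric summary to go with the plot, using PyMC's standard trace accessor:

beta_samples = mcmc.trace("beta")[:]
print "posterior mean of beta:", beta_samples.mean()
print "posterior std. dev. of beta:", beta_samples.std()

Assuming `y_` holds the binned counts built from the Pareto(2.5) draws in the scratch cells below, the posterior for `beta` should concentrate somewhere near -2.5.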
stars_to_explore[1:];
array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768])
a = stats.pareto.rvs(2.5, size=(50000, 1));
hist(a, bins=100)
print
y = [(a >= i).sum() for i in range(1, 100)];
y_ = -np.diff(y)
print y_
print y
[41264 5572 1638 646 313 181 101 70 48 47 22 14 14 14 7 11 4 6 4 0 1 0 2 0 0 1 2 1 3 0 2 2 1 0 1 1 2 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [50000, 8736, 3164, 1526, 880, 567, 386, 285, 215, 167, 120, 98, 84, 70, 56, 49, 38, 34, 28, 24, 24, 23, 23, 21, 21, 21, 20, 18, 17, 14, 14, 12, 10, 9, 9, 8, 7, 5, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b = -2.3;
np.sum([y_[i - 1] * np.log((i + 0.) ** b - (i + 1.) ** b) for i in range(1, 7)]) + y[-1] * np.log(7);
-13526.483069774908
y_;
array([48930, 940, 103, 19, 7, 1])
np.append(y_, y[-1]);
array([48930, 940, 103, 19, 7, 1, 0])
pm.Uninformative?