License: OTHER
Chapter 11
More Hacking with PyMC
This chapter introduces useful and more advanced techniques with PyMC, including building your own stochastic variables and user-defined steps.
Example: Real-time Github Popularity Measures
Most of you are likely familiar with the git-repository website Github. An observed phenomenon on Github is the scale-free nature of the popularity of repositories. Here, for lack of a better measure, we use the numbers of stars and forks to measure popularity. This is not a bad measure, but it ignores page views and downloads, and it tends to overemphasize older repositories. Since we will be studying all repositories and not a single one, the absence of these measures is not as relevant.
Contained in this folder is a Python script for scraping data from Github on the popularity of repos. The script requires the Requests and BeautifulSoup libraries, but if you don't have those installed, the ./data folder contains the same data from a previous date (February 18, 2013 at last pull). The data is the fraction of repositories with stars greater than or equal to $2^k$, and the fraction of repositories with forks greater than or equal to $2^k$, for a range of integers $k$.
Clearly, we need to adjust the scale of this plot as most of the action is hidden. The number of repos falls very quickly. We will put it on a log-log plot.
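Both characteristics look like a straight line when plotted on a log-log scale. What does this mean? Denote by $P(k)$ the fraction of repos with greater than or equal to $k$ stars (or forks). So in the above plot, $\log_2 P(k)$ is on the y-axis and $\log_2 k$ is on the x-axis. The above linear relationship can be written as:

$$\log_2 P(k) = \alpha - \beta \log_2 k$$

where $\beta > 0$ because the line slopes downward. Rearranging by raising 2 to the power of both sides:

$$P(k) = 2^{\alpha} k^{-\beta} = C k^{-\beta}, \quad C = 2^{\alpha}$$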
This relationship is very interesting. It is called a power-law, and occurs very frequently in social datasets. Why does it occur so frequently in social datasets? It has much to do with a "winner-take-all" or "winner-take-most" effect. Winners in a power-law environment are components that seem to take a disproportionate amount of the popularity, and keep winning. In terms of popularity of repos, winning repos are repos that are of very good quality (they are winners initially) and are shared and talked about often (they keep winning).
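The power-law relationship above says that, for a law of the form $P(k) = C k^{-\beta}$, the slope of the line on the log-log plot is $-\beta$ and the intercept is $\log_2 C$. A minimal sketch of reading both off by ordinary least squares follows. The data here is synthetic, generated from an assumed $\beta = 1.2$ and $C = 1$, and the function name `fit_log2_line` is hypothetical; none of it comes from the Github data.

```python
import math

def fit_log2_line(ks, fractions):
    """Least-squares fit of log2(fraction) = intercept + slope * log2(k)."""
    xs = [math.log2(k) for k in ks]
    ys = [math.log2(p) for p in fractions]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Synthetic data (an assumption for illustration, not the Github data):
# an exact power law evaluated at thresholds 1, 2, 4, ..., 2**15.
true_beta = 1.2
ks = [2 ** j for j in range(16)]
fractions = [k ** (-true_beta) for k in ks]

slope, intercept = fit_log2_line(ks, fractions)
beta_hat = -slope        # the slope of the line is -beta
C_hat = 2 ** intercept   # the intercept is log2(C)
print(beta_hat, C_hat)
```

On noise-free synthetic data the fit recovers the assumed values. On real data a straight-line fit like this is better treated as a quick diagnostic than as a reliable estimator.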
The above plot is also telling us that the majority of repos have very few stars and forks, only a handful have hundreds, and an incredibly small number have thousands. This is not so obvious after browsing Github's website, where you see some repos with 36000+ stars but fail to see the millions that do not have any stars (as they are not popular, they won't be common on your tour of the site).
Distributions like this are also said to have fat tails, i.e. the probability does not drop quickly as we extend into the tail of the distribution, but most of the probability is still concentrated near zero.
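To get a feel for how slowly a fat tail decays, the sketch below compares a power-law tail probability with an exponentially decaying one. Both functions and their parameter values (`beta=1.0`, `rate=0.1`) are assumptions chosen only for illustration.

```python
import math

def power_law_tail(k, beta=1.0):
    """P(X >= k) = k ** (-beta) for k >= 1, taking C = 1 (assumed form)."""
    return k ** (-beta)

def exponential_tail(k, rate=0.1):
    """P(X >= k) = exp(-rate * k), a thin-tailed comparison (assumed form)."""
    return math.exp(-rate * k)

# Print the two tail probabilities side by side at increasing thresholds.
for k in (1, 10, 100, 1000):
    print(k, power_law_tail(k), exponential_tail(k))
```

Close to the origin the exponential tail can be the larger of the two, but far out the power-law tail is larger by many orders of magnitude.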
The heaviness of the tail and the strength of the "winner-take-all" effect are both influenced by the parameter $\beta$. The smaller $\beta$ is, the more pronounced these effects. Below is a list of phenomena that follow a power-law, each with an approximate exponent [1]. Recall, though, that we never observe these numbers; we must infer them from the data.
Phenomenon | Assumed Exponent
---|---
Frequency of word use | -1.2
Number of hits on website | -1.4
US book sales | -1.5
Intensity of wars | -0.8
Net worth of Americans | -1.1
Github Stars | ??
The estimation problem
It is very easy to overestimate the true parameter $\beta$. This is because the tail events (the events of 500+ stars) are very rare. For example, suppose in our Github dataset we only observe 100 samples. With surprisingly high probability (about 30%), all of these samples will have fewer than 31 stars. This is because approximately 99% of all repos, that is (number of all repos - number of repos with more than 31 stars) / (number of all repos), have fewer than 31 stars. Thus, we would have no samples in our dataset from the tail of the distribution. If I then told you that there existed a repo with 36000+ stars, you would call me crazy, as it would be about 1000 times larger than your most popular observed repo. You would assign a very large $\beta$ to your dataset (recall that a large $\beta$ means thinner tails). Similarly, with the same 30% probability we would not see repos more popular than 64 stars if we had a sample of 1000. Taking this to its logical conclusion, how confident should we be that there does not exist a theoretical repo that can attain 72000+ stars, or 150000+ stars, one which would push an estimated $\beta$ down even more?
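The probability quoted above can be checked with a short calculation. Assuming the samples are drawn independently, the chance that all of them fall below the threshold is the fraction of repos below it raised to the power of the sample size. The value 0.988 below is a hypothetical fraction, included only to show how sensitive the result is to rounding in the "approximately 99%" figure; it is not taken from the data.

```python
def prob_no_tail_sample(frac_below, n_samples):
    """Probability that n independent draws all fall below the threshold."""
    return frac_below ** n_samples

# 0.990 is the rounded figure from the text; 0.988 is a hypothetical
# unrounded value used only to illustrate sensitivity to rounding.
for frac in (0.990, 0.988):
    print(frac, prob_no_tail_sample(frac, 100))
```

With exactly 99% the result is a little over one third, and a slightly smaller fraction brings it down to roughly 30%. Either way, a sample of 100 with no tail observations at all is far from unusual.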
Yule-Simon distribution
The
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-83-fe84742cff9d> in <module>()
8
9 @mc.stochastic( dtype = int, observed = True )
---> 10 def yule_simon( value = repo_with_stars, rho = param ):
11 """test"""
12
c:\Python27\lib\site-packages\pymc\InstantiationDecorators.pyc in instantiate_p(__func__)
147 def instantiate_p(__func__):
148 value, parents = _extract(__func__, kwds, keys, 'Stochastic')
--> 149 return __class__(value=value, parents=parents, **kwds)
150
151 keys = ['logp','random','rseed']
c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
714 if check_logp:
715 # Check initial value
--> 716 if not isinstance(self.logp, float):
717 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__())
718
c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self)
833 logp = float(logp)
834 except:
--> 835 raise TypeError(self.__name__ + ': computed log-probability ' + str(logp) + ' cannot be cast to float')
836
837 if logp != logp:
TypeError: yule_simon: computed log-probability [-20.51503062 -19.93158602 -18.31405136 -17.21386783 -16.32349938
-15.53050299 -14.75010755 -13.96101721 -13.13877723 -12.2264853
-11.23694781 -10.06769225 -8.63087616 -7.0237458 -5.33941252
-3.44559118 -1.4738842 ] cannot be cast to float
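The traceback above ends with PyMC reporting that the computed log-probability is an array of values that cannot be cast to a float. One plausible reading, and it is an assumption since the body of `yule_simon` is not shown, is that the function returned one log-probability per data point rather than their sum. A minimal sketch of a scalar Yule-Simon log-likelihood follows, using the standard pmf $f(k; \rho) = \rho \, B(k, \rho + 1)$ for integer $k \ge 1$. The function names are hypothetical, and treating the observations as individual counts is also an assumption about the data layout.

```python
import math

def yule_simon_logpmf(k, rho):
    """Log of the Yule-Simon pmf rho * B(k, rho + 1), for integer k >= 1, rho > 0."""
    return (math.log(rho) + math.lgamma(k) + math.lgamma(rho + 1.0)
            - math.lgamma(k + rho + 1.0))

def yule_simon_loglike(observations, rho):
    """Sum the per-observation log-probabilities into a single float."""
    return float(sum(yule_simon_logpmf(k, rho) for k in observations))

# Hypothetical observations, used only to show that the result is a scalar.
print(yule_simon_loglike([1, 2, 3], rho=1.0))
```

Returning the summed value from the body of a function like `yule_simon` would give a single float, which is what the error message says is expected.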
Exercises:
Distributions like the Normal distribution have very skinny tails. Compare the PDFs of the Normal versus a power-law distribution.
References
[1] Taleb, Nassim. The Black Swan. 1st edition. New York: Random House, 2007. Print.