📚 The CoCalc Library - books, templates and other resources
License: OTHER
Scale-Free Networks
Code examples from Think Complexity, 2nd edition.
Copyright 2016 Allen Downey, MIT License
Facebook data
The following function reads a file with one edge per line, specified by two integer node IDs.
We'll read the Facecook data downloaded from SNAP
With larger graphs, it takes too long to compute clustering coefficients and path lengths, but we can estimate them by sampling. NetworkX provides a function in its approximation module that estimates the clustering coefficient:
And I've written a function that estimates the average path length.
The average clustering coefficient is high.
The average path length is low.
WS Graph
Next I'll construct a WS graph with the same number of nodes and average degree as the Facebook network:
With p=0
we get a ring lattice.
The number of edges is a little bigger than in the dataset because we have to round k
to an integer.
The clustering coefficient is a little higher than in the dataset.
And the path length is much higher.
With p=1
we get a random graph.
The clustering coefficient is small.
And the path lengths are very small.
By trial and error, I found that p=0.05
yields a graph with about the right values for C
and L
.
The clustering coefficient is a little higher than in the data.
And the path length is a little lower.
So that seems good so far.
Degree
But let's look at the degree distribution.
The following function returns a list of degrees, one for each node:
The average degree in the WS model is about right.
But the standard deviation isn't even close:
To see what's going on, we need to look at the whole distribution.
I'll start with a very small graph:
Here's what the list of degrees looks like for this graph:
To compute the degree distribution, I'll use the Pmf
class from empiricaldist
A Pmf
object maps from each degree to the fraction of nodes with that degree.
75% of the nodes have degree 1; 25% have degree 3.
We can visualize the distribution as a histogram:
And we can use the Pmf
to compute mean and standard deviation:
We can also use the Pmf
to look up the fraction of nodes with exactly 1 neighbor.
Here's what the degree distributions look like for the Facebook data and the WS model. They don't resemble each other at all.
We can get a better view of the Facebook data by plotting the PMF on a log-log scale.
The result suggests that the degree distribution follows a power law, at least for values larger than 10 or so.
The log-log scale doesn't help the WS graph.
The discrepancy between the actual degree distribution and the WS model is the motivation for the BA model.
BA model
Here's a simplified version of the NetworkX function that generates BA graphs.
And here's the function that generates a random subset without repetition.
I'll generate a BA graph with the same number of nodes and edges as the Facebook data:
Providing a random seed means we'll get the same graph every time.
The number of edges is pretty close to what we asked for.
So the mean degree is about right.
The standard deviation of degree is pretty close, and much better than the WS model.
Let's take a look at the degree distribution.
Looking at the PMFs on a linear scale, we see one difference, which is that the BA model has no nodes with degree less than k
, which is 22.
But if we look at the PMF on a log-log scale, the BA model looks pretty good for values bigger than about 20. And it seems to follow a power law.
The characteristic path length is even smaller in the model than in the data.
But the clustering coefficient isn't even close.
In the BA model, the degree distribution is better than in the WS model, but the clustering coefficient is too low.
Cumulative distributions
Cumulative distributions are a better way to visualize distributions. The following function shows what a cumulative probability is:
The total probability for all values up to and including 11 is 0.258, so the 25th percentile is about 11.
The median degree is about 25.
And the 75th percentile is about 57. That is, about 75% of users have 57 friends or fewer.
empiricaldist
provides Cdf
, which computes cumulative distribution functions.
Here are the degree CDFs for the Facebook data, the WS model, and the BA model.
If we plot them on a log-x scale, we get a sense of how well the models fit the central part of the distribution.
The WS model is hopeless. The BA model is ok for values above the median, but not very good for smaller values.
On a log-log scale, we see that the BA model fits the tail of the distribution reasonably well.
But there is certainly room for a model that does a better job of fitting the whole distribution.
Exercises
Exercise: Data files from the Barabasi and Albert paper are available from this web page.
Their actor collaboration data is included in the repository for this book in a file named actor.dat.gz
. The following function reads the file and builds the graph.
Compute the number of actors in the graph and the number of edges.
Check whether this graph has the small world properties, high clustering and low path length.
Plot the PMF of degree on a log-log scale. Does it seem to follow a power law?
Also plot the CDF of degree on a log-x scale, to see the general shape of the distribution, and on a log-log scale, to see whether the tail follows a power law.
Note: The actor network is not connected, so you might want to use nx.connected_components
to find connected subsets of the nodes.
Exercise: NetworkX provides a function called powerlaw_cluster_graph
that implements the "Holme and Kim algorithm for growing graphs with powerlaw degree distribution and approximate average clustering". Read the documentation of this function and see if you can use it to generate a graph that has the same number of nodes as the Facebook network, the same average degree, and the same clustering coefficient. How does the degree distribution in the model compare to the actual distribution?