
Scale-Free Networks

Code examples from Think Complexity, 2nd edition.

Copyright 2016 Allen Downey, MIT License

%matplotlib inline

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import seaborn as sns

from utils import decorate, savefig

# I set the random seed so the notebook
# produces the same results every time.
np.random.seed(17)

# TODO: remove this when NetworkX is fixed
from warnings import simplefilter
import matplotlib.cbook
simplefilter("ignore", matplotlib.cbook.mplDeprecation)

Facebook data

The following function reads a file with one edge per line, specified by two integer node IDs.

def read_graph(filename):
    G = nx.Graph()
    array = np.loadtxt(filename, dtype=int)
    G.add_edges_from(array)
    return G

We'll read the Facebook data downloaded from SNAP:

# https://snap.stanford.edu/data/facebook_combined.txt.gz
fb = read_graph('../data/facebook_combined.txt.gz')

n = len(fb)
m = len(fb.edges())
n, m
(4039, 88234)

With larger graphs, it takes too long to compute clustering coefficients and path lengths, but we can estimate them by sampling. NetworkX provides a function in its approximation module that estimates the clustering coefficient:

from networkx.algorithms.approximation import average_clustering

And I've written a function that estimates the average path length.

def sample_path_lengths(G, nodes=None, trials=1000):
    """Choose random pairs of nodes and compute the path length between them.

    G: Graph
    nodes: list of nodes to choose from
    trials: number of pairs to choose

    returns: list of path lengths
    """
    if nodes is None:
        nodes = list(G)
    else:
        nodes = list(nodes)

    # Note: pairs are drawn with replacement, so occasionally a pair
    # contains the same node twice, contributing a path length of 0.
    pairs = np.random.choice(nodes, (trials, 2))
    lengths = [nx.shortest_path_length(G, *pair)
               for pair in pairs]
    return lengths

def estimate_path_length(G, nodes=None, trials=1000):
    return np.mean(sample_path_lengths(G, nodes, trials))

The average clustering coefficient is high.

C = average_clustering(fb)
C
0.615

The average path length is low.

L = estimate_path_length(fb)
L
3.717

WS Graph

Next I'll construct a WS graph with the same number of nodes and average degree as the Facebook network. Since each edge contributes to the degree of two nodes, the average degree is 2*m/n:

n = len(fb)
m = len(fb.edges())

k = int(round(2 * m / n))
k
44

With p=0 we get a ring lattice.

The number of edges is a little bigger than in the dataset because we have to round k to an integer; a ring lattice with n nodes and even degree k has n*k/2 edges.

lattice = nx.watts_strogatz_graph(n, k, p=0)
len(lattice), len(lattice.edges())
(4039, 88858)
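
We can confirm the arithmetic with a quick sketch (not in the original notebook):

# Sketch: a ring lattice with n nodes and even degree k has n*k/2 edges
n * k // 2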

The clustering coefficient is a little higher than in the dataset.

C, average_clustering(lattice)
(0.615, 0.747)

And the path length is much higher.

L, estimate_path_length(lattice)
(3.717, 47.088)

With p=1 we get a random graph.

random_graph = nx.watts_strogatz_graph(n, k, p=1)

The clustering coefficient is small.

C, average_clustering(random_graph)
(0.615, 0.007)

And the path lengths are very small.

L, estimate_path_length(random_graph)
(3.717, 2.614)

By trial and error, I found that p=0.05 yields a graph with about the right values for C and L.

ws = nx.watts_strogatz_graph(n, k, 0.05, seed=15)

The estimated clustering coefficient is almost exactly the same as in the data.

C, average_clustering(ws)
(0.615, 0.615)

And the path length is a little lower.

L, estimate_path_length(ws)
(3.717, 3.264)

So that seems good so far.
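
The trial-and-error search could be automated. Here's a minimal sketch (not in the original notebook) that sweeps a few arbitrary values of p and prints the estimated C and L for each, for comparison with the Facebook values:

# Sketch: sweep a few values of p; the candidates are arbitrary choices
for p in [0.01, 0.02, 0.05, 0.1, 0.2]:
    g = nx.watts_strogatz_graph(n, k, p, seed=15)
    print(p, average_clustering(g), estimate_path_length(g))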

Degree

But let's look at the degree distribution.

The following function returns a list of degrees, one for each node:

def degrees(G):
    """List of degrees for nodes in `G`.

    G: Graph object

    returns: list of int
    """
    return [G.degree(u) for u in G]

The average degree in the WS model is about right.

np.mean(degrees(fb)), np.mean(degrees(ws))
(43.69101262688784, 44.0)

But the standard deviation isn't even close:

np.std(degrees(fb)), np.std(degrees(ws))
(52.41411556737521, 1.4309215628189869)

To see what's going on, we need to look at the whole distribution.

I'll start with a very small graph:

G = nx.Graph()
G.add_edge(1, 0)
G.add_edge(2, 0)
G.add_edge(3, 0)

nx.draw(G)
Image in a Jupyter notebook

Here's what the list of degrees looks like for this graph:

degrees(G)
[1, 3, 1, 1]

To compute the degree distribution, I'll use the Pmf class from empiricaldist.

from empiricaldist import Pmf

A Pmf object maps from each degree to the fraction of nodes with that degree.

pmf = Pmf.from_seq(degrees(G))
pmf

75% of the nodes have degree 1; 25% have degree 3.
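
Under the hood, Pmf.from_seq is essentially a normalized counter. Here's a minimal sketch (not from the original notebook) of the same computation with a plain Counter; either way, degree 1 maps to 0.75 and degree 3 to 0.25:

# Sketch: count each degree, then normalize by the number of nodes
from collections import Counter

counter = Counter(degrees(G))
total = sum(counter.values())
{degree: count / total for degree, count in counter.items()}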

We can visualize the distribution as a histogram:

pmf.bar()
decorate(xlabel='Degree', ylabel='Pmf')
Image in a Jupyter notebook

And we can use the Pmf to compute mean and standard deviation:

pmf_fb = Pmf.from_seq(degrees(fb))
pmf_fb.mean(), pmf_fb.std()
(43.69101262688785, 52.41411556737521)
pmf_ws = Pmf.from_seq(degrees(ws))
pmf_ws.mean(), pmf_ws.std()
(44.00000000000001, 1.4309215628189869)

We can also use the Pmf to look up the fraction of nodes with exactly 1 neighbor.

pmf_fb[1], pmf_ws[1]
(0.018568952711067097, 0)

Here's what the degree distributions look like for the Facebook data and the WS model. They don't resemble each other at all.

plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
pmf_fb.plot(label='Facebook', color='C0')
decorate(xlabel='Degree', ylabel='PMF')

plt.subplot(1, 2, 2)
pmf_ws.plot(label='WS graph', color='C1')
decorate(xlabel='Degree')

savefig('figs/chap04-1')
Saving figure to file figs/chap04-1
Image in a Jupyter notebook

We can get a better view of the Facebook data by plotting the PMF on a log-log scale.

The result suggests that the degree distribution follows a power law, at least for degrees above 10 or so: a power law, PMF(k) proportional to k^(-alpha), appears as a straight line on log-log axes, and the dashed gray line shows a constant slope for comparison.

The log-log scale doesn't help the WS graph.

plt.figure(figsize=(8, 4))
options = dict(ls='', marker='.')

plt.subplot(1, 2, 1)
plt.plot([20, 1000], [5e-2, 2e-4], color='gray', linestyle='dashed')
pmf_fb.plot(label='Facebook', color='C0', **options)
decorate(xscale='log', yscale='log',
         xlabel='Degree', ylabel='PMF')

plt.subplot(1, 2, 2)
pmf_ws.plot(label='WS graph', color='C1', **options)
decorate(xlim=[35, 55], xscale='log', yscale='log',
         xlabel='Degree')

savefig('figs/chap04-2')
Saving figure to file figs/chap04-2
Image in a Jupyter notebook
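
To make "follows a power law" more quantitative, we could fit a line to the log-log PMF. This is a rough sketch (not in the original notebook), and the cutoff at degree 10 is an assumption based on the plot:

# Sketch: least-squares slope of the log-log PMF for degrees >= 10.
# A power law PMF(k) ~ k**(-alpha) would give a slope of about -alpha.
qs = np.asarray(pmf_fb.qs, dtype=float)
ps = np.asarray(pmf_fb.ps, dtype=float)
mask = qs >= 10
np.polyfit(np.log10(qs[mask]), np.log10(ps[mask]), 1)[0]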

The discrepancy between the actual degree distribution and the WS model is the motivation for the BA model.

BA model

Here's a simplified version of the NetworkX function that generates BA graphs.

# modified version of the NetworkX implementation from
# https://github.com/networkx/networkx/blob/master/networkx/generators/random_graphs.py

import random

def barabasi_albert_graph(n, k, seed=None):
    """Constructs a BA graph.

    n: number of nodes
    k: number of edges for each new node
    seed: random seed
    """
    if seed is not None:
        random.seed(seed)

    G = nx.empty_graph(k)
    targets = set(range(k))
    repeated_nodes = []

    for source in range(k, n):
        # connect the new node to k existing nodes
        G.add_edges_from(zip([source] * k, targets))

        # each node appears in repeated_nodes once per edge it has,
        # so sampling from this list implements preferential attachment
        repeated_nodes.extend(targets)
        repeated_nodes.extend([source] * k)

        targets = _random_subset(repeated_nodes, k)

    return G

And here's the function that selects a random subset without repetition. Because each node appears in repeated_nodes once for every edge it is part of, choosing uniformly from that list selects nodes with probability proportional to degree; that's the "preferential attachment" in the BA model.

def _random_subset(repeated_nodes, k):
    """Select a random subset of nodes without repeating.

    repeated_nodes: list of nodes
    k: size of set

    returns: set of nodes
    """
    targets = set()
    while len(targets) < k:
        x = random.choice(repeated_nodes)
        targets.add(x)
    return targets
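
To see the preferential attachment effect in isolation (a sketch, not in the original notebook), we can draw many subsets from a small repeated_nodes list and count how often each node is chosen; node 0, with three copies in the list, should be chosen much more often than nodes 1 and 2:

# Sketch: node 0 appears three times in the list, nodes 1 and 2 once each
from collections import Counter

counts = Counter()
for _ in range(1000):
    counts.update(_random_subset([0, 0, 0, 1, 2], 2))
counts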

I'll generate a BA graph with the same number of nodes and edges as the Facebook data. Each new node brings k new edges, so the total number of edges is about n*k, which means we want k close to m/n:

n = len(fb)
m = len(fb.edges())
k = int(round(m / n))

n, m, k
(4039, 88234, 22)

Providing a random seed means we'll get the same graph every time.

ba = barabasi_albert_graph(n, k, seed=15)

The number of edges is pretty close to what we asked for.

len(ba), len(ba.edges()), len(ba.edges())/len(ba)
(4039, 88374, 21.88016835850458)

So the mean degree is about right.

np.mean(degrees(fb)), np.mean(degrees(ba))
(43.69101262688784, 43.76033671700916)

The standard deviation of degree is pretty close, and much better than the WS model.

np.std(degrees(fb)), np.std(degrees(ba))
(52.41411556737521, 41.03760075705614)

Let's take a look at the degree distribution.

pmf_ba = Pmf.from_seq(degrees(ba))

Looking at the PMFs on a linear scale, we see one difference: the BA model has no nodes with degree less than k, which is 22.

plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
pmf_fb.plot(label='Facebook', color='C0')
decorate(xlabel='Degree', ylabel='PMF')

plt.subplot(1, 2, 2)
pmf_ba.plot(label='BA graph', color='C2')
decorate(xlabel='Degree')
Image in a Jupyter notebook
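
As a quick check (a sketch, not in the original notebook), we can compare the smallest degree in each graph directly:

# Sketch: smallest degree in the BA graph vs. the Facebook data
min(degrees(ba)), min(degrees(fb))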

But if we look at the PMF on a log-log scale, the BA model looks pretty good for values bigger than about 20. And it seems to follow a power law.

plt.figure(figsize=(8, 4))
options = dict(ls='', marker='.')

plt.subplot(1, 2, 1)
pmf_fb.plot(label='Facebook', color='C0', **options)
decorate(xlabel='Degree', ylabel='PMF',
         xscale='log', yscale='log')

plt.subplot(1, 2, 2)
pmf_ba.plot(label='BA model', color='C2', **options)
decorate(xlabel='Degree',
         xlim=[1, 1e4], xscale='log', yscale='log')

savefig('figs/chap04-3')
Saving figure to file figs/chap04-3
Image in a Jupyter notebook

The characteristic path length is even smaller in the model than in the data.

L, estimate_path_length(ba)
(3.717, 2.481)

But the clustering coefficient isn't even close.

C, average_clustering(ba)
(0.615, 0.048)

In the BA model, the degree distribution is better than in the WS model, but the clustering coefficient is too low.

Cumulative distributions

Cumulative distributions are often a better way to visualize distributions than PMFs because they are less noisy. The following function shows what a cumulative probability is:

def cumulative_prob(pmf, x):
    """Computes the cumulative probability of `x`.

    Total probability of all values <= x.

    returns: float probability
    """
    ps = [pmf[value] for value in pmf.qs if value <= x]
    return np.sum(ps)

The total probability for all values up to and including 11 is 0.258, so the 25th percentile is about 11.

cumulative_prob(pmf_fb, 11)
0.2577370636296113

The median degree is about 25.

cumulative_prob(pmf_fb, 25)
0.5060658578856152

And the 75th percentile is about 57. That is, about 75% of users have 57 friends or fewer.

cumulative_prob(pmf_fb, 57)
0.751671205743996

empiricaldist provides Cdf, which computes cumulative distribution functions.

from empiricaldist import Cdf

Here are the degree CDFs for the Facebook data, the WS model, and the BA model.

cdf_fb = Cdf.from_seq(degrees(fb), name='Facebook')
cdf_ws = Cdf.from_seq(degrees(ws), name='WS model')
cdf_ba = Cdf.from_seq(degrees(ba), name='BA model')
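
As a sanity check (a sketch, not in the original notebook), evaluating the Facebook Cdf at 25 should agree with the cumulative_prob function defined above:

# Sketch: calling a Cdf evaluates it (empiricaldist convention);
# this should match cumulative_prob(pmf_fb, 25)
cumulative_prob(pmf_fb, 25), cdf_fb(25)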

If we plot them on a log-x scale, we get a sense of how well the models fit the central part of the distribution.

The WS model is hopeless. The BA model is ok for values above the median, but not very good for smaller values.

plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
cdf_fb.plot(color='C0')
cdf_ws.plot(color='C1', alpha=0.4)
decorate(xlabel='Degree', xscale='log',
         ylabel='CDF')

plt.subplot(1, 2, 2)
cdf_fb.plot(color='C0', label='Facebook')
cdf_ba.plot(color='C2', alpha=0.4)
decorate(xlabel='Degree', xscale='log')

savefig('figs/chap04-4')
Saving figure to file figs/chap04-4
Image in a Jupyter notebook

On a log-log scale, the complementary CDF (CCDF, which is 1 - CDF) shows that the BA model fits the tail of the distribution reasonably well. For a power law, the CCDF would be a straight line on these axes.

plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
(1 - cdf_fb).plot(color='C0')
(1 - cdf_ws).plot(color='C1', alpha=0.4)
decorate(xlabel='Degree', xscale='log',
         ylabel='CCDF', yscale='log')

plt.subplot(1, 2, 2)
(1 - cdf_fb).plot(color='C0', label='Facebook')
(1 - cdf_ba).plot(color='C2', alpha=0.4)
decorate(xlabel='Degree', xscale='log', yscale='log')

savefig('figs/chap04-5')
Saving figure to file figs/chap04-5
Image in a Jupyter notebook

But there is certainly room for a model that does a better job of fitting the whole distribution.

Exercises

Exercise: Data files from the Barabasi and Albert paper are available from this web page.

Their actor collaboration data is included in the repository for this book in a file named actor.dat.gz. The following function reads the file and builds the graph.

import gzip

def read_actor_network(filename, n=None):
    """Reads graph data from a file.

    filename: string
    n: int, number of lines to read (default is all)
    """
    G = nx.Graph()
    with gzip.open(filename) as f:
        for i, line in enumerate(f):
            nodes = [int(x) for x in line.split()]
            G.add_edges_from(all_pairs(nodes))
            if n and i >= n:
                break
    return G

def all_pairs(nodes):
    """Generates all pairs of nodes."""
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if i < j:
                yield u, v

Compute the number of actors in the graph and the number of edges.

Check whether this graph has the small world properties, high clustering and low path length.

Plot the PMF of degree on a log-log scale. Does it seem to follow a power law?

Also plot the CDF of degree on a log-x scale, to see the general shape of the distribution, and on a log-log scale, to see whether the tail follows a power law.

Note: The actor network is not connected, so you might want to use nx.connected_components to find connected subsets of the nodes.

# WARNING: if you run this with larger values of `n`, you
# might run out of memory, and Jupyter does not handle that well.

%time actors = read_actor_network('../data/actor.dat.gz', n=10000)
len(actors)
CPU times: user 758 ms, sys: 36.1 ms, total: 794 ms
Wall time: 804 ms
17540
# Solution

# As expected, the average clustering is high.

average_clustering(actors, trials=10000)
0.7308
# Solution

# And in the largest connected component, the average path length is low.

for nodes in nx.connected_components(actors):
    if len(nodes) > 100:
        print(len(nodes), estimate_path_length(actors, nodes))
17270 3.492
# Solution

# Here are the mean and standard deviation of degree:

ds = degrees(actors)
np.mean(ds), np.std(ds)
(38.35541619156214, 61.53610074821354)
# Solution

# And the PMF of degree on a log-log scale.

pmf = Pmf.from_seq(ds, name='actors')
pmf.plot(**options)
decorate(xlabel='Degree', ylabel='PMF',
         xscale='log', yscale='log')
Image in a Jupyter notebook
# Solution

# Here's the CDF on a log-x scale.

cdf = Cdf.from_seq(ds, name='actors')
cdf.plot()
decorate(xlabel='Degree', ylabel='CDF', xscale='log')
Image in a Jupyter notebook
# Solution

# And the CCDF on a log-log scale.

(1 - cdf).plot()
decorate(xlabel='Degree', ylabel='CCDF',
         xscale='log', yscale='log')
Image in a Jupyter notebook
# Solution

# The PMF on a log-log scale suggests a power law.

# The CDF on a log-x scale looks like a lognormal distribution,
# possibly skewed to the right.

# The CCDF on a log-log scale does not have the straight-line behavior
# we expect from a power law, but it is consistent with a heavy-tailed
# distribution.

Exercise: NetworkX provides a function called powerlaw_cluster_graph that implements the "Holme and Kim algorithm for growing graphs with powerlaw degree distribution and approximate average clustering". Read the documentation of this function and see if you can use it to generate a graph that has the same number of nodes as the Facebook network, the same average degree, and the same clustering coefficient. How does the degree distribution in the model compare to the actual distribution?

# Solution

# Again, here are the parameters of the Facebook data.

n = len(fb)
m = len(fb.edges())
k = int(round(m / n))

n, m, k
(4039, 88234, 22)
# Solution

# Now we can make an HK graph with these parameters,
# and with the target clustering as high as possible.

hk = nx.powerlaw_cluster_graph(n, k, 1.0, seed=15)
len(hk), len(hk.edges())
(4039, 88363)
# Solution

# The average clustering is much higher than in the BA
# model, but still not as high as in the data.

C, average_clustering(hk)
(0.615, 0.281)
# Solution

# The average path length is even lower than in the data.

L, estimate_path_length(hk)
(3.717, 2.795)
# Solution

# The mean degree is about right.

np.mean(degrees(fb)), np.mean(degrees(hk))
(43.69101262688784, 43.75488982421391)
# Solution

# The standard deviation of degree is a little low.

np.std(degrees(fb)), np.std(degrees(hk))
(52.41411556737521, 43.106377456263075)
# Solution

# The degree distribution is almost identical to the BA model.

cdf_hk = Cdf.from_seq(degrees(hk), name='HK model')

cdf_fb.plot(color='C0')
cdf_ba.plot(color='C2', alpha=0.4)
cdf_hk.plot(color='C3', alpha=0.4)
decorate(xscale='log')
Image in a Jupyter notebook
# Solution

# On a log-log scale, both HK and BA are reasonable
# models for the tail behavior.

(1 - cdf_fb).plot()
(1 - cdf_ba).plot(color='C2', alpha=0.4)
(1 - cdf_hk).plot(color='C3', alpha=0.4)
decorate(xscale='log', yscale='log', loc='upper right')
Image in a Jupyter notebook