Jupyter notebook Homework/hw1/similarity.ipynb

²²⁸ views

Kernel: Python 3

Similarity measures

The following workbook is based on p. 236 of the text, exercises 2.124–127, regarding "similarity" of two sets.

Definitions of similarity

Given two sets $A$ and $B$ :

the cardinality measure of similarity is $|A \cap B|$
the Jaccard coefficient of similarity is $\frac{|A \cap B|}{|A \cup B|}$

Question: In your own words, explain why these two measures can represent the similarity of two sets, and explain the difference between the two.

The cardinality measure of similarity gives the number of items that are elements of both sets whereas the Jaccard coefficient gives the ratio of the number of items that are elements of both sets to the total number of unique elements in both sets. These two measures both represent similarity because they both give an idea of the quantity of elements that are members of both sets.

The cardinality measures gives the items that are elements of both sets without context regarding the sets themselves. The Jaccard coefficient gives a better idea of what each set looks like, in terms of similarity, by providing the context of all items that are elements of both sets. However, the Jaccard coefficient does not provide the number of elements the sets have in common.

Similarity as a metric

In class we defined a metric, which is a distance function that obeys some nice, standard properties. See the writeup distributed in Week 2 notes for more information about metrics.

Similarity, as a measure, is essentially the opposite of "distance": two sets that are similar are not very "far apart".

In fact, it turns out we can formalize this mathematically. Consider the function $\mathrm{dist}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$ Intuitively, this is the "opposite" of the Jaccard coefficient: Think of the coefficient as a percentage, because it's a fraction between $0$ and $1$ ; if the Jaccard coefficient of $A$ and $B$ says something like the sets are "20% similar," then $\mathrm{dist}(A, B)$ would say they are "80% different."

This function is in fact a metric on sets. Below, justify this by arguing that this function satisfies the four properties that a metric must satisfy (non-negativity, identity, symmetry, and triangle inequality).

Computing the two measures

Implement a python function sim_card that returns the similarity of two sets using the cardinality measure:

In [50]:

def sim_card(s1, s2):
            
    return len(s1 & s2)
    pass

Implement a python function sim_jaccard that returns the similarity of two sets using the Jaccard coefficient:

In [59]:

def sim_jaccard(s1, s2):
    top = sim_card(s1, s2)
    denom = len(set.union(s1, s2))
    
    return top / denom
    pass

You can use this space to test your functions. Just be sure to hit the "run" button on the blocks above after you make changes, so that calling them below will invoke the correct code.

In [60]:

jeff = [1, 2, 3, 4, 5, 6]
jeffrey = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

jeff = set(jeff)
jeffrey = set(jeffrey)

kevin = [11, 12, 13]
kev = [11, 12, 13]

kevin = set(kevin)
kev = set(kev)

a = sim_card(jeffrey, jeff) # cardinality is 6 
b = sim_jaccard(jeffrey, jeff)# jaccard coefficient is .6 (6 / 10)

c = sim_card(kevin, kev) # 3 
d = sim_jaccard(kevin, kev)# .5

e = sim_card(kev, jeff) # 0 
f = sim_jaccard(kev, jeff) # 0 

print("The cardinality of jeffrey and jeff is", a, " and the jaccard coefficient is", b)
print("The cardinality of kevin and kev is", c, " and the jaccard coefficient is", d)
print("The cardinality of kevin and kev is", e, " and the jaccard coefficient is", f)

Out[60]:

The cardinality of jeffrey and jeff is 6  and the jaccard coefficient is 0.6
The cardinality of kevin and kev is 3  and the jaccard coefficient is 1.0
The cardinality of kevin and kev is 0  and the jaccard coefficient is 0.0

"Reversing" similarity

Assume we have a collection of sets $A_1, A_2, \ldots, A_n$ .

Claim: Suppose that the set $A_i$ is the most similar set to $A_j$ in this collection (aside from $A_j$ itself). Then $A_j$ the set is most similar to $A_i$ (other than $A_i$ itself).

Question: Is the above claim true for the cardinality measure? (Argue why it must be true or give a counterexample.)

Suppose set A is comprised of {1, 2, 3, 4, 5} and is the most similar, in terms of cardinality, to set B which is comprised of {1, 2, 3, 4, 5, 6, 7, 8, 9}. The cardinality measure of similarity of A to B is 5.

If there is a set C which is comprised of {1, 2, 3, 4, 6, 7, 8, 9, 10}, then the cardinality measure of similarity between B and C is 8 and the cardinality measure of similarity between sets A and C is 4.

Althought set A is the most similar to set B, set B is not necessarily the most similar to set A. Using the cardinality measurement, set B is most similar to set C.

If A is a subset of B then A is most similar to B in any set using the cardinality measure of similarity (all of A's elements would also be elements of B). If B is much larger than A, B is a subset of C, and A is not a subset of C, then B would be most similar to C, rather than A.

This claim is not true for the cardinality measure of similarity.

Question: Is the above claim true for the Jaccard coefficient? (Argue why it must be true or give a counterexample.)

Suppose set A is comprised of {0, 1, 2, 3, 4}, set B is comprised of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, and set A is most similar to set B with a Jaccard coefficient of .5

If there is a set C comprised of {1, 2, 3, 6, 7, 8} then its Jaccard coefficient to set A is .375 and its Jaccard coefficient to set B is .6

Alternatively, if A and C are subsets of B and C is larger than A, then A is most similar to B however, B is most similar to C

Set A is most similar to set B however, set B is most similar to set C.

Collaborative filtering

A recommender system based on nearest-neighbor collaborative filtering works like this:

Let $U$ denote the set of users in the system.
Let $P$ denote the set of products in the system.
For each user $u \in U$ , let $S_u \subset P$ be the set of products purchased by $u$ .
To make a recommendation to a user $a$ :
- Identify the user $b \in U - \{a\}$ whose purchases $S_b$ are most similar to $S_a$ .
- Recommend products in $S_b - S_a$ .

Exercise: Your goal is to write a function recommend(m, u, n, L) that makes a recommendation to a user.

The parameters should be, in order:

m is the function (either sim_card or sim_jaccard) to use for the similarity measure. You can assume that it is a function that takes two sets and returns a number.
An integer $u$ corresponding to a user number to whom a recommendation should be made.
An integer $n$ specifying the maximum number of products to recommend.
A list of sets, where the set $S_i$ at index $i$ is the set of products purchased by user $i$ . In other words, users are numbered by their position in this list.

In order to test this function, you will need a sample set of user purchases. The following code snippet loads a set of serial numbers into the set $P$ of products. Then, P can be used as a global variable in this workbook by running this cell.

In [53]:

P = set()
with open('serial_no.txt', 'rt') as f:
    for line in f:
        P.add(line)

Here is some code that generates a list of user purchases. It randomly samples from the set $P$ above. You will probably want to modify this code to test your recommendation function under different sets of user purchases, but this is a place to start.

In [57]:

import random

N = 10 # number of users
L = [] # list of sets of purchases, where L[i] are purchases of user i

for i in range(N):
    num_purchases = random.randint(10, 15)  # the number of purchases for this user
    L.append(set(random.sample(P, num_purchases)))  # randomly sample num_purchases items from the products P

Use this space to write an implementation of the function recommend:

In [61]:

def recommend(similarity, user, n, L):
    simUser = -1
    mostSim = -1 
    
    for index in range(len(L)):
        if index == user:
            continue 
        if mostSim < similarity(L[user], L[index]) and len(L[index] - L[user]) > 0:
            mostSim = similarity(L[user], L[index])
            simUser = index 
    if simUser == -1:
        print("There are no similar users")
        return 
    
    products = L[index] - L[user]
    
    if len(products) > n:
        while(len(products) > n):
            products.pop()
        return products 
    else:
        return products

You can use this space to test your implementation.

In [63]:

recommended = recommend(sim_jaccard, 3, 4, L)

for product in recommended:
    print(product)

recommended = recommend(sim_card, 3, 4, L)

for product in recommended:
    print(product)

Out[63]:

In [ ]:

Similarity measures

Definitions of similarity

Similarity as a metric

Computing the two measures

"Reversing" similarity

Collaborative filtering

Product

Resources

Company