Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download

Jupyter notebook Homework/hw1/similarity.ipynb

228 views
Kernel: Python 3

Similarity measures

The following workbook is based on p. 236 of the text, exercises 2.124–127, regarding "similarity" of two sets.

Definitions of similarity

Given two sets AA and BB:

  • the cardinality measure of similarity is AB|A \cap B|

  • the Jaccard coefficient of similarity is ABAB\frac{|A \cap B|}{|A \cup B|}

Question: In your own words, explain why these two measures can represent the similarity of two sets, and explain the difference between the two.

The cardinality measure of similarity gives the number of items that are elements of both sets whereas the Jaccard coefficient gives the ratio of the number of items that are elements of both sets to the total number of unique elements in both sets. These two measures both represent similarity because they both give an idea of the quantity of elements that are members of both sets.

The cardinality measures gives the items that are elements of both sets without context regarding the sets themselves. The Jaccard coefficient gives a better idea of what each set looks like, in terms of similarity, by providing the context of all items that are elements of both sets. However, the Jaccard coefficient does not provide the number of elements the sets have in common.

Similarity as a metric

In class we defined a metric, which is a distance function that obeys some nice, standard properties. See the writeup distributed in Week 2 notes for more information about metrics.

Similarity, as a measure, is essentially the opposite of "distance": two sets that are similar are not very "far apart".

In fact, it turns out we can formalize this mathematically. Consider the function dist(A,B)=1ABAB. \mathrm{dist}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}. Intuitively, this is the "opposite" of the Jaccard coefficient: Think of the coefficient as a percentage, because it's a fraction between 00 and 11; if the Jaccard coefficient of AA and BB says something like the sets are "20% similar," then dist(A,B)\mathrm{dist}(A, B) would say they are "80% different."

This function is in fact a metric on sets. Below, justify this by arguing that this function satisfies the four properties that a metric must satisfy (non-negativity, identity, symmetry, and triangle inequality).

Computing the two measures

Implement a python function sim_card that returns the similarity of two sets using the cardinality measure:

def sim_card(s1, s2): return len(s1 & s2) pass

Implement a python function sim_jaccard that returns the similarity of two sets using the Jaccard coefficient:

def sim_jaccard(s1, s2): top = sim_card(s1, s2) denom = len(set.union(s1, s2)) return top / denom pass

You can use this space to test your functions. Just be sure to hit the "run" button on the blocks above after you make changes, so that calling them below will invoke the correct code.

jeff = [1, 2, 3, 4, 5, 6] jeffrey = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] jeff = set(jeff) jeffrey = set(jeffrey) kevin = [11, 12, 13] kev = [11, 12, 13] kevin = set(kevin) kev = set(kev) a = sim_card(jeffrey, jeff) # cardinality is 6 b = sim_jaccard(jeffrey, jeff)# jaccard coefficient is .6 (6 / 10) c = sim_card(kevin, kev) # 3 d = sim_jaccard(kevin, kev)# .5 e = sim_card(kev, jeff) # 0 f = sim_jaccard(kev, jeff) # 0 print("The cardinality of jeffrey and jeff is", a, " and the jaccard coefficient is", b) print("The cardinality of kevin and kev is", c, " and the jaccard coefficient is", d) print("The cardinality of kevin and kev is", e, " and the jaccard coefficient is", f)
The cardinality of jeffrey and jeff is 6 and the jaccard coefficient is 0.6 The cardinality of kevin and kev is 3 and the jaccard coefficient is 1.0 The cardinality of kevin and kev is 0 and the jaccard coefficient is 0.0

"Reversing" similarity

Assume we have a collection of sets A1,A2,,AnA_1, A_2, \ldots, A_n.

Claim: Suppose that the set AiA_i is the most similar set to AjA_j in this collection (aside from AjA_j itself). Then AjA_j the set is most similar to AiA_i (other than AiA_i itself).

Question: Is the above claim true for the cardinality measure? (Argue why it must be true or give a counterexample.)

Suppose set A is comprised of {1, 2, 3, 4, 5} and is the most similar, in terms of cardinality, to set B which is comprised of {1, 2, 3, 4, 5, 6, 7, 8, 9}. The cardinality measure of similarity of A to B is 5.

If there is a set C which is comprised of {1, 2, 3, 4, 6, 7, 8, 9, 10}, then the cardinality measure of similarity between B and C is 8 and the cardinality measure of similarity between sets A and C is 4.

Althought set A is the most similar to set B, set B is not necessarily the most similar to set A. Using the cardinality measurement, set B is most similar to set C.

If A is a subset of B then A is most similar to B in any set using the cardinality measure of similarity (all of A's elements would also be elements of B). If B is much larger than A, B is a subset of C, and A is not a subset of C, then B would be most similar to C, rather than A.

This claim is not true for the cardinality measure of similarity.

Question: Is the above claim true for the Jaccard coefficient? (Argue why it must be true or give a counterexample.)

Suppose set A is comprised of {0, 1, 2, 3, 4}, set B is comprised of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, and set A is most similar to set B with a Jaccard coefficient of .5

If there is a set C comprised of {1, 2, 3, 6, 7, 8} then its Jaccard coefficient to set A is .375 and its Jaccard coefficient to set B is .6

Alternatively, if A and C are subsets of B and C is larger than A, then A is most similar to B however, B is most similar to C

Set A is most similar to set B however, set B is most similar to set C.

Collaborative filtering

A recommender system based on nearest-neighbor collaborative filtering works like this:

  • Let UU denote the set of users in the system.

  • Let PP denote the set of products in the system.

  • For each user uUu \in U, let SuPS_u \subset P be the set of products purchased by uu.

  • To make a recommendation to a user aa:

    • Identify the user bU{a}b \in U - \{a\} whose purchases SbS_b are most similar to SaS_a.

    • Recommend products in SbSaS_b - S_a.

Exercise: Your goal is to write a function recommend(m, u, n, L) that makes a recommendation to a user.

The parameters should be, in order:

  • m is the function (either sim_card or sim_jaccard) to use for the similarity measure. You can assume that it is a function that takes two sets and returns a number.

  • An integer uu corresponding to a user number to whom a recommendation should be made.

  • An integer nn specifying the maximum number of products to recommend.

  • A list of sets, where the set SiS_i at index ii is the set of products purchased by user ii. In other words, users are numbered by their position in this list.

In order to test this function, you will need a sample set of user purchases. The following code snippet loads a set of serial numbers into the set PP of products. Then, P can be used as a global variable in this workbook by running this cell.

P = set() with open('serial_no.txt', 'rt') as f: for line in f: P.add(line)

Here is some code that generates a list of user purchases. It randomly samples from the set PP above. You will probably want to modify this code to test your recommendation function under different sets of user purchases, but this is a place to start.

import random N = 10 # number of users L = [] # list of sets of purchases, where L[i] are purchases of user i for i in range(N): num_purchases = random.randint(10, 15) # the number of purchases for this user L.append(set(random.sample(P, num_purchases))) # randomly sample num_purchases items from the products P

Use this space to write an implementation of the function recommend:

def recommend(similarity, user, n, L): simUser = -1 mostSim = -1 for index in range(len(L)): if index == user: continue if mostSim < similarity(L[user], L[index]) and len(L[index] - L[user]) > 0: mostSim = similarity(L[user], L[index]) simUser = index if simUser == -1: print("There are no similar users") return products = L[index] - L[user] if len(products) > n: while(len(products) > n): products.pop() return products else: return products

You can use this space to test your implementation.

recommended = recommend(sim_jaccard, 3, 4, L) for product in recommended: print(product) recommended = recommend(sim_card, 3, 4, L) for product in recommended: print(product)
3967863913 4458019913 5618748970 6961102807 3967863913 4458019913 5618748970 6961102807