Jupyter notebook Homework/hw1/similarity.ipynb
Similarity measures
The following workbook is based on p. 236 of the text, exercises 2.124–127, regarding "similarity" of two sets.
Definitions of similarity
Given two sets and :
the cardinality measure of similarity is
the Jaccard coefficient of similarity is
Question: In your own words, explain why these two measures can represent the similarity of two sets, and explain the difference between the two.
The cardinality measure of similarity gives the number of items that are elements of both sets whereas the Jaccard coefficient gives the ratio of the number of items that are elements of both sets to the total number of unique elements in both sets. These two measures both represent similarity because they both give an idea of the quantity of elements that are members of both sets.
The cardinality measures gives the items that are elements of both sets without context regarding the sets themselves. The Jaccard coefficient gives a better idea of what each set looks like, in terms of similarity, by providing the context of all items that are elements of both sets. However, the Jaccard coefficient does not provide the number of elements the sets have in common.
Similarity as a metric
In class we defined a metric, which is a distance function that obeys some nice, standard properties. See the writeup distributed in Week 2 notes for more information about metrics.
Similarity, as a measure, is essentially the opposite of "distance": two sets that are similar are not very "far apart".
In fact, it turns out we can formalize this mathematically. Consider the function Intuitively, this is the "opposite" of the Jaccard coefficient: Think of the coefficient as a percentage, because it's a fraction between and ; if the Jaccard coefficient of and says something like the sets are "20% similar," then would say they are "80% different."
This function is in fact a metric on sets. Below, justify this by arguing that this function satisfies the four properties that a metric must satisfy (non-negativity, identity, symmetry, and triangle inequality).
Computing the two measures
Implement a python function sim_card that returns the similarity of two sets using the cardinality measure:
Implement a python function sim_jaccard that returns the similarity of two sets using the Jaccard coefficient:
You can use this space to test your functions. Just be sure to hit the "run" button on the blocks above after you make changes, so that calling them below will invoke the correct code.
"Reversing" similarity
Assume we have a collection of sets .
Claim: Suppose that the set is the most similar set to in this collection (aside from itself). Then the set is most similar to (other than itself).
Question: Is the above claim true for the cardinality measure? (Argue why it must be true or give a counterexample.)
Suppose set A is comprised of {1, 2, 3, 4, 5} and is the most similar, in terms of cardinality, to set B which is comprised of {1, 2, 3, 4, 5, 6, 7, 8, 9}. The cardinality measure of similarity of A to B is 5.
If there is a set C which is comprised of {1, 2, 3, 4, 6, 7, 8, 9, 10}, then the cardinality measure of similarity between B and C is 8 and the cardinality measure of similarity between sets A and C is 4.
Althought set A is the most similar to set B, set B is not necessarily the most similar to set A. Using the cardinality measurement, set B is most similar to set C.
If A is a subset of B then A is most similar to B in any set using the cardinality measure of similarity (all of A's elements would also be elements of B). If B is much larger than A, B is a subset of C, and A is not a subset of C, then B would be most similar to C, rather than A.
This claim is not true for the cardinality measure of similarity.
Question: Is the above claim true for the Jaccard coefficient? (Argue why it must be true or give a counterexample.)
Suppose set A is comprised of {0, 1, 2, 3, 4}, set B is comprised of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, and set A is most similar to set B with a Jaccard coefficient of .5
If there is a set C comprised of {1, 2, 3, 6, 7, 8} then its Jaccard coefficient to set A is .375 and its Jaccard coefficient to set B is .6
Alternatively, if A and C are subsets of B and C is larger than A, then A is most similar to B however, B is most similar to C
Set A is most similar to set B however, set B is most similar to set C.
Collaborative filtering
A recommender system based on nearest-neighbor collaborative filtering works like this:
Let denote the set of users in the system.
Let denote the set of products in the system.
For each user , let be the set of products purchased by .
To make a recommendation to a user :
Identify the user whose purchases are most similar to .
Recommend products in .
Exercise: Your goal is to write a function recommend(m, u, n, L) that makes a recommendation to a user.
The parameters should be, in order:
mis the function (eithersim_cardorsim_jaccard) to use for the similarity measure. You can assume that it is a function that takes two sets and returns a number.An integer corresponding to a user number to whom a recommendation should be made.
An integer specifying the maximum number of products to recommend.
A list of sets, where the set at index is the set of products purchased by user . In other words, users are numbered by their position in this list.
In order to test this function, you will need a sample set of user purchases. The following code snippet loads a set of serial numbers into the set of products. Then, P can be used as a global variable in this workbook by running this cell.
Here is some code that generates a list of user purchases. It randomly samples from the set above. You will probably want to modify this code to test your recommendation function under different sets of user purchases, but this is a place to start.
Use this space to write an implementation of the function recommend:
You can use this space to test your implementation.