Finding Similar Items

Many problems relating to data can be expressed as finding similar items. For example, to solve the scene completion problem we would want to find images that are similar to each other in order to complete the scene.

We can generalize this problem to finding near-neighbors in a high-dimensional space. We can pretty easily come up with a naive solution: for example, we vectorize the images and then define some distance function d(x_1, x_2) which returns the distance between the vectors x_1 and x_2. We could then find all pairs of images (x_i, x_j) that are within some distance threshold, d(x_i, x_j) \leq t. However, for n images this naive solution takes O(n^2) comparisons. Luckily there is a method to do this in roughly O(n)!
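
As a quick sketch (the function name and the distance function d here are illustrative assumptions, not from any particular library), the naive all-pairs approach could look like this:

```python
import itertools

def naive_near_neighbors(vectors, d, t):
    """Compare all pairs: O(n^2) distance computations for n vectors."""
    return [(i, j)
            for (i, x_i), (j, x_j) in itertools.combinations(enumerate(vectors), 2)
            if d(x_i, x_j) <= t]
```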

Jaccard Similarity and Distance

The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:

sim(A, B) = \frac{|A \cap B|}{|A \cup B|}

The Jaccard distance is then defined as

d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}

This gives the desired behavior: if two documents are similar, then the distance between them is small.
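
For illustration, a minimal sketch of both measures on Python sets:

```python
def jaccard_similarity(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def jaccard_distance(a: set, b: set) -> float:
    return 1 - jaccard_similarity(a, b)

# Two sets sharing 1 of 5 distinct elements:
A, B = {1, 2, 3}, {3, 4, 5}
print(jaccard_similarity(A, B))  # 0.2
print(jaccard_distance(A, B))    # 0.8
```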

Todo

example and better use cases and prob separate into single page.

Overview of Finding Similar Items

Our goal is to find "near duplicate" pairs among a given large number of documents. Some possible applications are finding mirror websites, clustering articles or detecting plagiarism. When faced with this problem we immediately run into some issues:

  • What will the similarity look like if two documents have the same content but in a different order?
  • What if there are too many documents to compare all O(n^2) pairs?
  • What if the documents are too large to load into main memory?

So to solve this problem we define a series of steps:

  1. We convert documents into sets; this can be done by shingling.
  2. We convert the large sets to short signatures while preserving their similarity; we call this process min-hashing.
  3. Lastly, we only want to focus on pairs of signatures that have a high similarity, i.e. we want to decrease the number of pairs to compare; this is done with locality-sensitive hashing, LSH for short.

Shingling

Shingling is the process of converting a document into a set. We might start with some simple approaches, such as using the set of words appearing in the document or the set of important words defined by TF-IDF, but these approaches do not work well because they fail to take one thing into account: the ordering of the words.

Example

In the two approaches above, the documents "dog bites man" and "man bites dog" would be considered identical, since both contain exactly the same set of words.

The answer to this problem is the k-shingle, also commonly referred to as an n-gram. A k-shingle for a document is a sequence of k tokens, where a token could be a character, a word or something else.

Todo

Example and possibly belongs on its own page.

We can then further compress long shingles by hashing them to, for example, 4 bytes (2^{32}-1 possible values, so a small likelihood of collision).
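
A minimal sketch of both steps, assuming characters as tokens and CRC32 as an arbitrary 4-byte hash:

```python
import zlib

def shingles(text: str, k: int) -> set:
    """The set of all k-shingles, using characters as tokens."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text: str, k: int) -> set:
    """Compress each shingle to 4 bytes with CRC32."""
    return {zlib.crc32(s.encode()) for s in shingles(text, k)}

print(shingles("abcdab", 2))  # {'ab', 'bc', 'cd', 'da'}
```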

Min-Hashing

Our next step would be to encode sets as bit vectors, so we have one dimension per element in the universal set. We can then interpret the set intersection as a bitwise AND operation and the set union as a bitwise OR operation.

Todo

example

Our next step could be to go from vectors to a matrix, where each row is a shingle and each column represents a document. This is very convenient, as the column similarity corresponds to the Jaccard similarity of the two sets.
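
A minimal sketch of this matrix view (all names here are illustrative):

```python
def characteristic_matrix(sets):
    """One row per element of the universal set, one column per document."""
    universe = sorted(set().union(*sets))
    return universe, [[int(e in s) for s in sets] for e in universe]

def column_similarity(matrix, i, j):
    """Rows where both columns have a 1, over rows where at least one does;
    this equals the Jaccard similarity of the two underlying sets."""
    both = sum(1 for row in matrix if row[i] and row[j])
    either = sum(1 for row in matrix if row[i] or row[j])
    return both / either

docs = [{"ab", "bc", "cd"}, {"bc", "cd", "da"}]
_, M = characteristic_matrix(docs)
print(column_similarity(M, 0, 1))  # 0.5
```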

Our next goal is to compute a small signature for each column while keeping the property that similar columns have similar signatures. The key idea is to find a hash function such that hashing a column C yields a signature that is small and preserves the similarity. This can be done with so-called min-hashing.

Todo

Can explain a lot more about min-hashing, as it seems very popular.

The process of min-hashing can be defined in the following steps:

  1. We have k hash functions.
  2. For each column C and each hash function we keep a slot for the min-hash value, which is initially infinity.
  3. We scan the matrix row by row, looking for 1s; whenever column C has a 1 in row r, we replace each slot with the hashed value of r if that value is smaller.
Todo

example, code ?
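
A minimal sketch of these steps, assuming simple random hash functions of the form h(r) = (a·r + b) mod p; the constants and prime are illustrative choices:

```python
import random

def make_hash_fns(k, prime=2**31 - 1, seed=42):
    """k random hash functions of the form h(r) = (a*r + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(k)]

def minhash_signature(one_rows, hash_fns, prime=2**31 - 1):
    """one_rows: the indices of the rows where this column contains a 1."""
    sig = [float("inf")] * len(hash_fns)   # one slot per hash function
    for r in one_rows:                     # scan the 1s of the column
        for i, (a, b) in enumerate(hash_fns):
            h = (a * r + b) % prime        # hashed row index
            if h < sig[i]:                 # keep the minimum seen so far
                sig[i] = h
    return sig

print(minhash_signature({0, 2, 5}, make_hash_fns(3)))
```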

Locality-Sensitive Hashing

The last step is to find the signatures with a high similarity, so that only those need to be compared. The general idea is to have a function f(x, y) that tells us whether x and y are a candidate pair, given a certain threshold t.

So if t = 0.8, then x and y are a candidate pair if their signatures agree on at least 80% of their rows. However, this would again mean comparing all O(n^2) pairs of columns. Instead we once more use a hash function, this time one that (ideally) hashes only similar columns to the same bucket; the candidate pairs are then those that hash to the same bucket.

To do this we partition the signature matrix M into b bands of r rows each. For each band, we hash its portion of each column into a hash table with k buckets, where k is as large as possible. Two columns are a candidate pair if they hash to the same bucket for at least one band. The values b and r are hyperparameters, tuned to catch as many of the similar pairs and as few of the non-similar pairs as possible.
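
A minimal sketch of the banding step, where Python's built-in dictionary hashing stands in for the hash table with k buckets:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows; two columns become a
    candidate pair if they land in the same bucket for at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            buckets[tuple(sig[band * r:(band + 1) * r])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```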

Todo

the s curve