Locality-sensitive hashing: near-duplicate detection at scale with MinHash and SimHash

SimHash and MinHash compress high-dimensional document features into compact signatures that preserve similarity, letting search engines detect near-duplicates at web scale in sub-linear time.

The core idea

A web crawler indexing billions of pages needs to identify and filter out near-duplicates. Millions of pages are nearly identical: they share the same core text but differ slightly in dynamic advertisements, timestamps, page counters, or navigation menus. Indexing duplicate pages wastes computation, storage, and crawl bandwidth, and clutters results with repetitive content.

Standard cryptographic hashes such as MD5 or SHA-256 are built for integrity, not similarity. They show the avalanche effect: changing a single character produces a completely different hash, so they cannot measure how alike two documents are.

Comparing every document against every other is an $O(N^2)$ operation, impossible for billions of pages. Search engines instead use locality-sensitive hashing (LSH): hashes designed so that similar documents map to identical or similar signatures with high probability, which finds near-duplicates in sub-linear time. MinHash and SimHash are the two foundational LSH algorithms for web-scale deduplication.

Jaccard similarity and MinHash (set hashing)

MinHash, introduced by Andrei Broder in 1997, represents documents as sets of tokens and estimates the Jaccard similarity: the size of the intersection of two sets divided by the size of their union.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Jaccard similarity is 1 for identical sets, 0 for disjoint sets, and a fraction in between for overlapping sets.

The MinHash process

A document is first turned into a set of shingles. A $k$-shingle is a contiguous sequence of $k$ tokens; for 2-word shingles, \"the cat sat\" becomes the set $\{\text{the cat},\ \text{cat sat}\}$. Once documents are sets, MinHash uses a random permutation $\pi$ of the universe of shingles; the minimum hash value of a set $A$ is the smallest permuted index of any element in $A$:

$$h_{\min}(A) = \min_{x \in A} \pi(x)$$

The collision theorem states that the probability that two sets share the same minimum hash under a random permutation equals their Jaccard similarity:

$$P\big(h_{\min}(A) = h_{\min}(B)\big) = J(A, B)$$

Applying $M$ independent permutations gives each document a signature vector of length $M$. The fraction of matching slots between two signatures is an unbiased estimate of their Jaccard similarity.

MinHash pipeline: two documents are tokenized into 2-word shingles, mapped under two permutation hashes, and their minimum positions selected to produce signature vectors [1, 1] and [3, 1].

Worked example: MinHash

Let $A$ be \"the cat sat\" and $B$ be \"the cat lay\". With 2-word shingles:

  • $A = \{\text{the cat},\ \text{cat sat}\}$
  • $B = \{\text{the cat},\ \text{cat lay}\}$

The intersection is $\{\text{the cat}\}$ (size 1) and the union has size 3, so the exact Jaccard similarity is:

$$J(A, B) = \tfrac{1}{3} \approx 0.333$$

Take two permutation hashes. Under $h_1$, $A$ maps to minimum 1 and $B$ to minimum 3 (mismatch). Under $h_2$, both map to minimum 1 (match). So $\mathrm{Sig}(A) = [1, 1]$ and $\mathrm{Sig}(B) = [3, 1]$. The estimated similarity is the matching fraction $\tfrac{1}{2} = 0.5$; as the signature length $M$ grows, the estimate converges to the exact $0.333$.

Cosine similarity and SimHash (vector hashing)

SimHash, designed by Moses Charikar in 2002, maps high-dimensional term vectors to compact binary fingerprints and estimates the cosine of the angle $\theta$ between document vectors:

$$\text{similarity} = \cos(\theta) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert\, \lVert \vec{v} \rVert}$$

Unlike MinHash, which treats elements as unweighted set members, SimHash can incorporate term weights such as TF-IDF.

The SimHash process

SimHash builds an $f$-bit fingerprint (typically 64 bits):

  • Initialize an accumulator of $f$ zeros, $V = [0,\dots,0]$.
  • Hash each term to an $f$-bit value.
  • For each term, add its weight to $V[i]$ where bit $i$ of the term hash is 1, and subtract it where bit $i$ is 0.
  • Binarize: bit $i$ of the fingerprint is 1 if $V[i] > 0$, else 0.

Two fingerprints are compared by Hamming distance $D_H$, the number of differing bit positions. The expected fraction of differing bits is proportional to the angle between the vectors:

$$\frac{D_H}{f} = \frac{\theta}{\pi}$$

A Hamming distance of 0 means the vectors point the same way ($\theta = 0$); a distance of $f/2$ means they are orthogonal ($\theta = \pi/2$).

SimHash pipeline: terms with TF-IDF weights are hashed to 4-bit sequences, weights are signed per bit and accumulated to [3, 1, 3, -7], then binarized at zero to the fingerprint 1110.

Worked example: SimHash

Compute a 4-bit SimHash ($f = 4$) for $A = \{\text{cat}:3,\ \text{sat}:2,\ \text{mat}:2\}$ and $B = \{\text{cat}:3,\ \text{sat}:2,\ \text{rat}:2\}$, with term hashes $\text{cat}=1010$, $\text{sat}=1100$, $\text{mat}=0110$, $\text{rat}=1001$ (bit 1 adds the weight, bit 0 subtracts it).

Document A

  • After $\text{cat}$ (weight 3, $1010$): $V_A = [3, -3, 3, -3]$.
  • After $\text{sat}$ (weight 2, $1100$): $V_A = [5, -1, 1, -5]$.
  • After $\text{mat}$ (weight 2, $0110$): $V_A = [3, 1, 3, -7]$.
  • Binarize: $F_A = 1110$.

Document B

  • After $\text{cat}$ and $\text{sat}$: $V_B = [5, -1, 1, -5]$.
  • After $\text{rat}$ (weight 2, $1001$): $V_B = [7, -3, -1, -3]$.
  • Binarize: $F_B = 1000$.

Comparing $F_A = 1110$ and $F_B = 1000$, the XOR is $0110$, so the Hamming distance is $D_H = 2$. The differing fraction is $2/4 = 0.5$, giving an estimated angle of $\theta = \pi/2$ (90 degrees): the two documents are nearly orthogonal.

Scaling: bands versus permutation indexing

Compact fingerprints are only the first step. Comparing a query against billions of indexed fingerprints is still too slow as a linear scan, so each method has a sub-linear index.

MinHash: the LSH bands technique

A signature of length $M$ is split into $b$ bands of $r$ rows each ($M = b \times r$). Each band is hashed; two documents that collide in the same bucket in at least one band become a candidate pair. The probability that two documents with Jaccard similarity $s$ collide in at least one band is:

$$P(\text{candidate}) = 1 - (1 - s^{\,r})^{b}$$

This forms an S-curve. Tuning $b$ and $r$ shifts the threshold where the probability rises sharply, so documents above the threshold collide with high probability and those below are filtered out.

S-curve for LSH bands with b = 20 and r = 5: candidate probability rises sharply near Jaccard similarity 0.51.

SimHash: block-based permutation indexing

For SimHash the goal is to find all fingerprints within Hamming distance $k$ of a query (for example $k = 3$ at $f = 64$). By the pigeonhole principle, if a 64-bit fingerprint is split into $k + 1$ blocks, any two fingerprints within $k$ bits must match exactly in at least one block. Google's index exploits this by keeping $k + 1$ permuted tables, sorting each lexicographically by its blocks, and doing a binary search on the matching block before filtering candidates by full Hamming distance. That makes querying billions of 64-bit fingerprints sub-linear, so deduplication can run during crawling.

MinHash versus SimHash

DimensionMinHashSimHash
Mathematical basisJaccard similarity (set overlap)Cosine / angular similarity (vector angle)
InputUnweighted sets (character or word shingles)Weighted vectors (terms with TF-IDF weights)
SignatureVector of M integer hash valuesCompact f-bit binary fingerprint (e.g. 64 bits)
ComparisonSignature collision rate (matching fraction)Bitwise Hamming distance (XOR and bit count)
Scaling toolLSH bands technique (S-curve hashing)Block-based permutation indexing (Hamming lookup)
StrengthAccurate for subset containmentVery fast bitwise comparison; low memory
Typical useClustering, recommendationReal-time crawl deduplication, query clustering

Search-engine patents

Both algorithms appear by name in foundational search-engine patents.

AltaVista / DEC, US 6,119,124 (Broder, Glassman, Nelson, Manasse, Zweig; priority 1998-03-26, published 2000-09-12), \"Method for clustering closely resembling data objects\", uses MinHash: it represents documents as shingles and uses minima of random permutations as sketches for similarity clustering.

Google, US 7,158,961 (Charikar; priority 2001-12-31, published 2007-01-02), \"Methods and apparatus for estimating similarity\", uses SimHash: it projects weighted vectors onto random directions and binarizes the results into fingerprints.

Google, US 8,140,505 (Jain, Manku; priority 2005-03-31, published 2012-03-20), \"Near-duplicate document detection for web crawling\", uses SimHash at scale: 64-bit fingerprints checked for Hamming distance at most 3 using sorted block permutations, the indexing scheme described above.

Method and limits

Patent research used Google Patents, querying \"SimHash\", \"MinHash\", and \"locality sensitive hashing\" for assignees Google and AltaVista / Digital Equipment Corp. The search is not exhaustive: patents using generic phrasing (\"random projections\", \"sketch-based similarity\") would not surface, and absence of a named patent does not prove none exists.

References

1. Broder (1997). On the resemblance and containment of documents. IEEE

2. Charikar (2002). Similarity estimation techniques from rounding algorithms. STOC. ACM

3. Manku, Jain, Das Sarma (2007). Detecting near-duplicates for web crawling. WWW. ACM

No comments yet