The core idea
A web crawler indexing billions of pages needs to identify and filter out near-duplicates. Millions of pages are nearly identical: they share the same core text but differ slightly in dynamic advertisements, timestamps, page counters, or navigation menus. Indexing duplicate pages wastes computation, storage, and crawl bandwidth, and clutters results with repetitive content.
Standard cryptographic hashes such as MD5 or SHA-256 are built for integrity, not similarity. They show the avalanche effect: changing a single character produces a completely different hash, so they cannot measure how alike two documents are.
Comparing every document against every other is an $O(N^2)$ operation, impossible for billions of pages. Search engines instead use locality-sensitive hashing (LSH): hashes designed so that similar documents map to identical or similar signatures with high probability, which finds near-duplicates in sub-linear time. MinHash and SimHash are the two foundational LSH algorithms for web-scale deduplication.
Jaccard similarity and MinHash (set hashing)
MinHash, introduced by Andrei Broder in 1997, represents documents as sets of tokens and estimates the Jaccard similarity: the size of the intersection of two sets divided by the size of their union.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Jaccard similarity is 1 for identical sets, 0 for disjoint sets, and a fraction in between for overlapping sets.
The MinHash process
A document is first turned into a set of shingles. A $k$-shingle is a contiguous sequence of $k$ tokens; for 2-word shingles, \"the cat sat\" becomes the set $\{\text{the cat},\ \text{cat sat}\}$. Once documents are sets, MinHash uses a random permutation $\pi$ of the universe of shingles; the minimum hash value of a set $A$ is the smallest permuted index of any element in $A$:
$$h_{\min}(A) = \min_{x \in A} \pi(x)$$
The collision theorem states that the probability that two sets share the same minimum hash under a random permutation equals their Jaccard similarity:
$$P\big(h_{\min}(A) = h_{\min}(B)\big) = J(A, B)$$
Applying $M$ independent permutations gives each document a signature vector of length $M$. The fraction of matching slots between two signatures is an unbiased estimate of their Jaccard similarity.
Worked example: MinHash
Let $A$ be \"the cat sat\" and $B$ be \"the cat lay\". With 2-word shingles:
- $A = \{\text{the cat},\ \text{cat sat}\}$
- $B = \{\text{the cat},\ \text{cat lay}\}$
The intersection is $\{\text{the cat}\}$ (size 1) and the union has size 3, so the exact Jaccard similarity is:
$$J(A, B) = \tfrac{1}{3} \approx 0.333$$
Take two permutation hashes. Under $h_1$, $A$ maps to minimum 1 and $B$ to minimum 3 (mismatch). Under $h_2$, both map to minimum 1 (match). So $\mathrm{Sig}(A) = [1, 1]$ and $\mathrm{Sig}(B) = [3, 1]$. The estimated similarity is the matching fraction $\tfrac{1}{2} = 0.5$; as the signature length $M$ grows, the estimate converges to the exact $0.333$.
Cosine similarity and SimHash (vector hashing)
SimHash, designed by Moses Charikar in 2002, maps high-dimensional term vectors to compact binary fingerprints and estimates the cosine of the angle $\theta$ between document vectors:
$$\text{similarity} = \cos(\theta) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert\, \lVert \vec{v} \rVert}$$
Unlike MinHash, which treats elements as unweighted set members, SimHash can incorporate term weights such as TF-IDF.
The SimHash process
SimHash builds an $f$-bit fingerprint (typically 64 bits):
- Initialize an accumulator of $f$ zeros, $V = [0,\dots,0]$.
- Hash each term to an $f$-bit value.
- For each term, add its weight to $V[i]$ where bit $i$ of the term hash is 1, and subtract it where bit $i$ is 0.
- Binarize: bit $i$ of the fingerprint is 1 if $V[i] > 0$, else 0.
Two fingerprints are compared by Hamming distance $D_H$, the number of differing bit positions. The expected fraction of differing bits is proportional to the angle between the vectors:
$$\frac{D_H}{f} = \frac{\theta}{\pi}$$
A Hamming distance of 0 means the vectors point the same way ($\theta = 0$); a distance of $f/2$ means they are orthogonal ($\theta = \pi/2$).
Worked example: SimHash
Compute a 4-bit SimHash ($f = 4$) for $A = \{\text{cat}:3,\ \text{sat}:2,\ \text{mat}:2\}$ and $B = \{\text{cat}:3,\ \text{sat}:2,\ \text{rat}:2\}$, with term hashes $\text{cat}=1010$, $\text{sat}=1100$, $\text{mat}=0110$, $\text{rat}=1001$ (bit 1 adds the weight, bit 0 subtracts it).
Document A
- After $\text{cat}$ (weight 3, $1010$): $V_A = [3, -3, 3, -3]$.
- After $\text{sat}$ (weight 2, $1100$): $V_A = [5, -1, 1, -5]$.
- After $\text{mat}$ (weight 2, $0110$): $V_A = [3, 1, 3, -7]$.
- Binarize: $F_A = 1110$.
Document B
- After $\text{cat}$ and $\text{sat}$: $V_B = [5, -1, 1, -5]$.
- After $\text{rat}$ (weight 2, $1001$): $V_B = [7, -3, -1, -3]$.
- Binarize: $F_B = 1000$.
Comparing $F_A = 1110$ and $F_B = 1000$, the XOR is $0110$, so the Hamming distance is $D_H = 2$. The differing fraction is $2/4 = 0.5$, giving an estimated angle of $\theta = \pi/2$ (90 degrees): the two documents are nearly orthogonal.
Scaling: bands versus permutation indexing
Compact fingerprints are only the first step. Comparing a query against billions of indexed fingerprints is still too slow as a linear scan, so each method has a sub-linear index.
MinHash: the LSH bands technique
A signature of length $M$ is split into $b$ bands of $r$ rows each ($M = b \times r$). Each band is hashed; two documents that collide in the same bucket in at least one band become a candidate pair. The probability that two documents with Jaccard similarity $s$ collide in at least one band is:
$$P(\text{candidate}) = 1 - (1 - s^{\,r})^{b}$$
This forms an S-curve. Tuning $b$ and $r$ shifts the threshold where the probability rises sharply, so documents above the threshold collide with high probability and those below are filtered out.
SimHash: block-based permutation indexing
For SimHash the goal is to find all fingerprints within Hamming distance $k$ of a query (for example $k = 3$ at $f = 64$). By the pigeonhole principle, if a 64-bit fingerprint is split into $k + 1$ blocks, any two fingerprints within $k$ bits must match exactly in at least one block. Google's index exploits this by keeping $k + 1$ permuted tables, sorting each lexicographically by its blocks, and doing a binary search on the matching block before filtering candidates by full Hamming distance. That makes querying billions of 64-bit fingerprints sub-linear, so deduplication can run during crawling.
MinHash versus SimHash
| Dimension | MinHash | SimHash |
|---|---|---|
| Mathematical basis | Jaccard similarity (set overlap) | Cosine / angular similarity (vector angle) |
| Input | Unweighted sets (character or word shingles) | Weighted vectors (terms with TF-IDF weights) |
| Signature | Vector of M integer hash values | Compact f-bit binary fingerprint (e.g. 64 bits) |
| Comparison | Signature collision rate (matching fraction) | Bitwise Hamming distance (XOR and bit count) |
| Scaling tool | LSH bands technique (S-curve hashing) | Block-based permutation indexing (Hamming lookup) |
| Strength | Accurate for subset containment | Very fast bitwise comparison; low memory |
| Typical use | Clustering, recommendation | Real-time crawl deduplication, query clustering |
Search-engine patents
Both algorithms appear by name in foundational search-engine patents.
AltaVista / DEC, US 6,119,124 (Broder, Glassman, Nelson, Manasse, Zweig; priority 1998-03-26, published 2000-09-12), \"Method for clustering closely resembling data objects\", uses MinHash: it represents documents as shingles and uses minima of random permutations as sketches for similarity clustering.
Google, US 7,158,961 (Charikar; priority 2001-12-31, published 2007-01-02), \"Methods and apparatus for estimating similarity\", uses SimHash: it projects weighted vectors onto random directions and binarizes the results into fingerprints.
Google, US 8,140,505 (Jain, Manku; priority 2005-03-31, published 2012-03-20), \"Near-duplicate document detection for web crawling\", uses SimHash at scale: 64-bit fingerprints checked for Hamming distance at most 3 using sorted block permutations, the indexing scheme described above.
Method and limits
Patent research used Google Patents, querying \"SimHash\", \"MinHash\", and \"locality sensitive hashing\" for assignees Google and AltaVista / Digital Equipment Corp. The search is not exhaustive: patents using generic phrasing (\"random projections\", \"sketch-based similarity\") would not surface, and absence of a named patent does not prove none exists.
References
1. Broder (1997). On the resemblance and containment of documents. IEEE
2. Charikar (2002). Similarity estimation techniques from rounding algorithms. STOC. ACM
3. Manku, Jain, Das Sarma (2007). Detecting near-duplicates for web crawling. WWW. ACM
No comments yet