In information retrieval, we frequently encounter the vocabulary mismatch problem: users and authors often choose different words to describe the same concept. For example, if a user queries a search engine for "automobile maintenance", standard keyword-matching algorithms may fail to retrieve the most relevant documents if those documents happen to use terms like "car repair" or "vehicle servicing". The user is left with incomplete results, simply because they did not guess the exact vocabulary used by the document authors.
To bridge this semantic gap, search engines employ query expansion and relevance feedback techniques. Rather than treating the user's initial query as a static, unchangeable set of keywords, we can treat retrieval as an interactive process. By collecting feedback on which documents are relevant and which are not, the search system can learn to reformulate the query dynamically.
The Rocchio algorithm, developed by Joseph John Rocchio Jr. in 1971 as part of the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System at Cornell University, is the classic, foundational method for executing relevance feedback in the vector space model. By representing queries and documents as vectors in a multi-dimensional term space, Rocchio uses simple vector arithmetic to shift the query toward relevant concepts and push it away from irrelevant ones, bringing synonyms and related terms into the search space.
This iterative cycle of query submission, initial retrieval, relevance evaluation, and query reformulation is illustrated in Figure 1.
The Rocchio Formula: Vector Space Arithmetic
In a vector space model, we represent both documents and queries as high-dimensional vectors, where each dimension corresponds to a unique term in the vocabulary. The value along each dimension is typically determined by Term Frequency-Inverse Document Frequency (TF-IDF) weights.
To refine a search, the Rocchio algorithm calculates a modified query vector $\vec{q}_m$ from the original query vector $\vec{q}_0$, a set of known relevant documents $D_r$, and a set of known non-relevant documents $D_{nr}$:
$$\vec{q}_m = \max\left(0,\ \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d} - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}\right)$$
Connecting Summation to Centroids
To understand the formula intuitively, we can define the centroids (the average coordinate vectors) of the relevant and non-relevant document sets:
$$\vec{g}_r = \frac{1}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d}$$
$$\vec{g}_{nr} = \frac{1}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}$$
Using centroid notation, the Rocchio equation simplifies to:
$$\vec{q}_m = \max(0,\ \alpha \vec{q}_0 + \beta \vec{g}_r - \gamma \vec{g}_{nr})$$
Parameter Explanations
- $\vec{q}_0$: The original query vector representing the user's initial search terms.
- $\vec{g}_r$: The centroid of the relevant document set. Adding this shifts the query toward terms that frequently appear in successful matches.
- $\vec{g}_{nr}$: The centroid of the non-relevant document set. Subtracting this pushes the query away from terms that appear in off-topic documents.
- $\alpha, \beta, \gamma$: Tunable coefficients that control the balance of the shift: $\alpha$ weights the original query, ensuring the modified query does not drift too far from the user's initial intent; $\beta$ weights the positive feedback, attracting the query to relevant terms; and $\gamma$ weights the negative feedback, repelling the query from non-relevant terms.
The Non-Negativity Constraint
In standard vector space retrieval, term weights must be non-negative to remain compatible with similarity metrics like cosine similarity. Because the subtraction of the non-relevant centroid ($\gamma \vec{g}_{nr}$) can result in negative values for some dimensions, we apply the $\max(0, \cdot)$ operator. This clips any negative term weights to exactly 0, removing those terms from the reformulated query entirely.
This vector translation can be visualized geometrically in a low-dimensional term space, as shown in Figure 2.
Worked Toy Example: Vector Shift Calculation
To see Rocchio in action, let us trace a calculation in a simple 3D term space where the coordinates represent the weights for the terms `[car, engine, train]`.
Setup
Suppose the user submits the query "car". The original query vector is:
$$\vec{q}_0 = [1.0,\ 0.0,\ 0.0]$$
The system retrieves initial documents, and the user marks two documents as relevant and one document as non-relevant:
- Relevant Document 1 ($\vec{d}_1$): Contains "car" and "engine" $\rightarrow \vec{d}_1 = [1.0,\ 1.0,\ 0.0]$
- Relevant Document 2 ($\vec{d}_2$): Contains "car" and "engine" (lower weight) $\rightarrow \vec{d}_2 = [1.0,\ 0.5,\ 0.0]$
- Non-relevant Document 3 ($\vec{d}_3$): Contains "train" $\rightarrow \vec{d}_3 = [0.0,\ 0.0,\ 1.0]$
Calculating Centroids
First, we compute the centroids of the relevant and non-relevant document sets:
$$\vec{g}_r = \frac{\vec{d}_1 + \vec{d}_2}{2} = \frac{[1.0, 1.0, 0.0] + [1.0, 0.5, 0.0]}{2} = [1.0,\ 0.75,\ 0.0]$$
$$\vec{g}_{nr} = \vec{d}_3 = [0.0,\ 0.0,\ 1.0]$$
Vector Shift Arithmetic
Let us apply the Rocchio formula with standard weights $\alpha = 1.0$, $\beta = 0.8$, and $\gamma = 0.2$:
$$\vec{q}_{calc} = 1.0 \cdot [1.0,\ 0.0,\ 0.0] + 0.8 \cdot [1.0,\ 0.75,\ 0.0] - 0.2 \cdot [0.0,\ 0.0,\ 1.0]$$
$$\vec{q}_{calc} = [1.0,\ 0.0,\ 0.0] + [0.8,\ 0.6,\ 0.0] - [0.0,\ 0.0,\ 0.2]$$
$$\vec{q}_{calc} = [1.8,\ 0.6,\ -0.2]$$
Applying Non-Negativity Clipping
Since the coordinate for `train` is negative ($-0.2$), we apply the non-negativity constraint:
$$\vec{q}_m = \max(0,\ \vec{q}_{calc}) = [1.8,\ 0.6,\ 0.0]$$
Result
The modified query $\vec{q}_m$ has successfully expanded to include the term "engine" with a weight of 0.6, reflecting the context found in the relevant documents. Meanwhile, the term "train" is completely excluded (clipped to 0), ensuring that the search engine does not retrieve documents related to rail transit.
We can visualize the dynamic movement of the query vector within a three-dimensional term space in the companion Manim script `scene.py`. The animation translates the query vector along the centroids of the relevant and non-relevant document clusters within the term space. The render command is: `manim -pql _working/05-manim/scene.py RocchioVectorShift`.
Pseudo-Relevance Feedback (PRF) and Query Expansion
In practical web search scenarios, users rarely take the time to explicitly mark search results as relevant or non-relevant. To automate this benefit, modern search engines utilize Pseudo-Relevance Feedback (PRF), also known as blind relevance feedback.
Instead of waiting for user input, the system performs a two-pass retrieval process:
- First Pass: Run the user's original query to retrieve an initial set of documents.
- Assumption: Assume that the top $k$ retrieved documents (typically $k=10$ or $20$) are relevant.
- Reformulation: Apply the Rocchio formula using this assumed relevant set. Because we cannot safely assume that all lower-ranked documents are non-relevant, we set $\gamma = 0$, ignoring the negative feedback component. This eliminates the risk of subtracting useful terms.
- Second Pass: Extract the top $N$ terms with the highest weights from the modified query vector $\vec{q}_m$, append them to the query, and run the search again to return the final results to the user.
While pseudo-relevance feedback is highly effective at resolving vocabulary mismatch without user intervention, it introduces the risk of query drift. If the top-ranked results from the first pass are irrelevant or contain an off-topic dominant theme, the query will expand in the wrong direction, leading to worse final results.
Rocchio vs. Modern Successors
The Rocchio algorithm belongs to the Vector Space Model (VSM) paradigm of information retrieval. As search technology evolved toward probabilistic models and language modeling frameworks, new feedback algorithms were developed to replace vector addition.
The most notable language modeling alternatives are RM3 (Relevance Model 3) and Kullback-Leibler (KL) Divergence Query Expansion (QE).
| Feature | Rocchio | RM3 (Relevance Model 3) | KL-Divergence Query Expansion |
|---|---|---|---|
| Framework | Vector Space Model (VSM) | Language Modeling (LM) | Probabilistic / Information Theory |
| Primary Input | TF-IDF term weight vectors | Probability distributions over vocabulary | Term probability distributions |
| Mathematical Basis | Linear vector algebra | Probabilistic estimation & interpolation | Information-theoretic divergence |
| Negative Feedback | Supported natively via vector subtraction | Difficult to model probabilistically | Used primarily for selection, not subtraction |
| Query Drift Control | Parameter weighting (alpha, beta, gamma) | Linear model interpolation (lambda) | Thresholding top expansion terms |
Mathematical Formulations
RM3 (Relevance Model 3)
Under the language modeling framework, retrieval is modeled as the probability of generating a query from a document's language model. RM3 estimates a relevance language model (Relevance Model 1, or RM1) $P(w | \theta_{RM1})$ from the top retrieved documents and interpolates it with the original query model $P(w | \theta_q)$ using a parameter $\lambda \in [0, 1]$:
$$P(w | \theta_{RM3}) = \lambda P(w | \theta_q) + (1 - \lambda) P(w | \theta_{RM1})$$
This linear interpolation acts as a smoothing mechanism. By keeping a fraction of the original query probability distribution, RM3 prevents query drift.
KL-Divergence Query Expansion
In this method, we select and score query expansion terms based on how much they contribute to the divergence between the relevance model $P$ (the probability of terms in the feedback set) and the background model $Q$ (the probability of terms in the entire collection):
$$D_{\text{KL}}(P \parallel Q) = \sum_{w} P(w | R) \log \frac{P(w | R)}{P(w | C)}$$
Terms that appear frequently in the feedback set $R$ but infrequently in the general collection $C$ yield a high Kullback-Leibler (KL) divergence score, making them excellent candidates for query expansion.
Search Engine Patents & Real-World Infrastructure
While the core Rocchio algorithm is an academic method in the public domain, search engine companies have patented specific systems that implement relevance feedback principles, particularly utilizing implicit user signals.
Google US 8,661,029 B1
Title: Modifying search result ranking based on implicit user feedback
Assignee: Google Inc. (now Google LLC)
Priority Date: November 2, 2006 | Publication Date: February 25, 2014
Relation: Uses relevance feedback concepts by tracking user clicks and document viewing times (dwell time) to assign relevance weights. The system uses these implicit user signals to re-rank results for subsequent queries, functioning as an automated, implicit relevance feedback loop. View Patent Page
Google US 8,938,463 B1
Title: Modifying search result ranking based on implicit user feedback and a model of presentation bias
Assignee: Google Inc. (now Google LLC)
Priority Date: March 12, 2007 | Publication Date: January 20, 2015
Relation: Extends the implicit feedback model by incorporating a prior model of presentation bias. Since users are biased to click higher-ranked results, the system discounts this bias to calculate a more mathematically accurate relevance signal for result re-ranking. View Patent Page
Microsoft US 8,645,289 B2
Title: Structured cross-lingual relevance feedback for enhancing search results
Assignee: Microsoft Corporation
Priority Date: December 16, 2010 | Publication Date: February 4, 2014
Relation: Applies relevance feedback across languages. When a user searches in one language, the system translates the query, retrieves documents in a second language, applies relevance feedback to select translated expansion terms, and reformulates the query to improve cross-lingual search accuracy. View Patent Page
Yahoo US 9,411,886 B2
Title: Ranking advertisements with pseudo-relevance feedback and translation models
Assignee: Yahoo! Inc.
Priority Date: March 31, 2008 | Publication Date: August 9, 2016
Relation: Uses pseudo-relevance feedback in combination with translation models to improve the matching and ranking of advertisements relative to user search queries. View Patent Page
Research Methods, Search Limits & References
Research Method and Search Limits
Our patent research was conducted using Google Patents, querying the terms "relevance feedback", "pseudo-relevance feedback", and "Rocchio" for the assignees Google, Microsoft, and Yahoo with a priority date floor of 2000-01-01.
We note that this search is not exhaustive. Patents that describe similar concepts using generic terminology (e.g., "clickthrough analysis", "user interest profiles", or "query translation") might not match these exact keyword queries. Consequently, the absence of a patent explicitly naming "Rocchio" does not mean search engines do not employ its mathematical principles.
References
1. Rocchio, J. J. Jr. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), *The SMART Retrieval System: Experiments in Automatic Document Processing* (pp. 313–323). Prentice-Hall. Internet Archive Link
2. Manning, C. D., Raghavan, P., & Schütze, H. (2008). *Introduction to Information Retrieval* (Chapter 9: Relevance feedback and query expansion). Cambridge University Press. Stanford NLP Group Link
3. Zhai, C., & Lafferty, J. D. (2001). Model-based feedback in the language modeling approach to information retrieval. *Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM)*. ACM DOI Link
4. Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. *Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval*. ACM DOI Link
No comments yet