Rocchio Relevance Feedback: Vector Space Query Expansion in Search

The Rocchio algorithm optimizes search queries in vector space by adjusting the query vector to shift toward relevant documents and away from non-relevant ones, resolving vocabulary mismatch.

In information retrieval, we frequently encounter the vocabulary mismatch problem: users and authors often choose different words to describe the same concept. For example, if a user queries a search engine for "automobile maintenance", standard keyword-matching algorithms may fail to retrieve the most relevant documents if those documents happen to use terms like "car repair" or "vehicle servicing". The user is left with incomplete results, simply because they did not guess the exact vocabulary used by the document authors.

To bridge this semantic gap, search engines employ query expansion and relevance feedback techniques. Rather than treating the user's initial query as a static, unchangeable set of keywords, we can treat retrieval as an interactive process. By collecting feedback on which documents are relevant and which are not, the search system can learn to reformulate the query dynamically.

The Rocchio algorithm, developed by Joseph John Rocchio Jr. in 1971 as part of the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System at Cornell University, is the classic, foundational method for executing relevance feedback in the vector space model. By representing queries and documents as vectors in a multi-dimensional term space, Rocchio uses simple vector arithmetic to shift the query toward relevant concepts and push it away from irrelevant ones, bringing synonyms and related terms into the search space.

This iterative cycle of query submission, initial retrieval, relevance evaluation, and query reformulation is illustrated in Figure 1.

Flowchart of the relevance feedback process: User submits a query; system retrieves initial documents; user marks documents as relevant or non-relevant; system reformulates query vector using the Rocchio algorithm; system runs the expanded query and returns refined results to the user.

The Rocchio Formula: Vector Space Arithmetic

In a vector space model, we represent both documents and queries as high-dimensional vectors, where each dimension corresponds to a unique term in the vocabulary. The value along each dimension is typically determined by Term Frequency-Inverse Document Frequency (TF-IDF) weights.

To refine a search, the Rocchio algorithm calculates a modified query vector $\vec{q}_m$ from the original query vector $\vec{q}_0$, a set of known relevant documents $D_r$, and a set of known non-relevant documents $D_{nr}$:

$$\vec{q}_m = \max\left(0,\ \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d} - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}\right)$$

Connecting Summation to Centroids

To understand the formula intuitively, we can define the centroids (the average coordinate vectors) of the relevant and non-relevant document sets:

$$\vec{g}_r = \frac{1}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d}$$

$$\vec{g}_{nr} = \frac{1}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}$$

Using centroid notation, the Rocchio equation simplifies to:

$$\vec{q}_m = \max(0,\ \alpha \vec{q}_0 + \beta \vec{g}_r - \gamma \vec{g}_{nr})$$

Parameter Explanations

  • $\vec{q}_0$: The original query vector representing the user's initial search terms.
  • $\vec{g}_r$: The centroid of the relevant document set. Adding this shifts the query toward terms that frequently appear in successful matches.
  • $\vec{g}_{nr}$: The centroid of the non-relevant document set. Subtracting this pushes the query away from terms that appear in off-topic documents.
  • $\alpha, \beta, \gamma$: Tunable coefficients that control the balance of the shift: $\alpha$ weights the original query, ensuring the modified query does not drift too far from the user's initial intent; $\beta$ weights the positive feedback, attracting the query to relevant terms; and $\gamma$ weights the negative feedback, repelling the query from non-relevant terms.

The Non-Negativity Constraint

In standard vector space retrieval, term weights must be non-negative to remain compatible with similarity metrics like cosine similarity. Because the subtraction of the non-relevant centroid ($\gamma \vec{g}_{nr}$) can result in negative values for some dimensions, we apply the $\max(0, \cdot)$ operator. This clips any negative term weights to exactly 0, removing those terms from the reformulated query entirely.

This vector translation can be visualized geometrically in a low-dimensional term space, as shown in Figure 2.

2D geometric diagram of Rocchio vector adaptation: An original query vector q0 is modified by adding a positive centroid vector of relevant documents (multiplied by beta) and subtracting a centroid vector of non-relevant documents (multiplied by gamma). The final modified query vector q_m is visibly shifted closer to the relevant cluster and pushed away from the non-relevant cluster.

Worked Toy Example: Vector Shift Calculation

To see Rocchio in action, let us trace a calculation in a simple 3D term space where the coordinates represent the weights for the terms `[car, engine, train]`.

Setup

Suppose the user submits the query "car". The original query vector is:

$$\vec{q}_0 = [1.0,\ 0.0,\ 0.0]$$

The system retrieves initial documents, and the user marks two documents as relevant and one document as non-relevant:

  • Relevant Document 1 ($\vec{d}_1$): Contains "car" and "engine" $\rightarrow \vec{d}_1 = [1.0,\ 1.0,\ 0.0]$
  • Relevant Document 2 ($\vec{d}_2$): Contains "car" and "engine" (lower weight) $\rightarrow \vec{d}_2 = [1.0,\ 0.5,\ 0.0]$
  • Non-relevant Document 3 ($\vec{d}_3$): Contains "train" $\rightarrow \vec{d}_3 = [0.0,\ 0.0,\ 1.0]$

Calculating Centroids

First, we compute the centroids of the relevant and non-relevant document sets:

$$\vec{g}_r = \frac{\vec{d}_1 + \vec{d}_2}{2} = \frac{[1.0, 1.0, 0.0] + [1.0, 0.5, 0.0]}{2} = [1.0,\ 0.75,\ 0.0]$$

$$\vec{g}_{nr} = \vec{d}_3 = [0.0,\ 0.0,\ 1.0]$$

Vector Shift Arithmetic

Let us apply the Rocchio formula with standard weights $\alpha = 1.0$, $\beta = 0.8$, and $\gamma = 0.2$:

$$\vec{q}_{calc} = 1.0 \cdot [1.0,\ 0.0,\ 0.0] + 0.8 \cdot [1.0,\ 0.75,\ 0.0] - 0.2 \cdot [0.0,\ 0.0,\ 1.0]$$

$$\vec{q}_{calc} = [1.0,\ 0.0,\ 0.0] + [0.8,\ 0.6,\ 0.0] - [0.0,\ 0.0,\ 0.2]$$

$$\vec{q}_{calc} = [1.8,\ 0.6,\ -0.2]$$

Applying Non-Negativity Clipping

Since the coordinate for `train` is negative ($-0.2$), we apply the non-negativity constraint:

$$\vec{q}_m = \max(0,\ \vec{q}_{calc}) = [1.8,\ 0.6,\ 0.0]$$

Result

The modified query $\vec{q}_m$ has successfully expanded to include the term "engine" with a weight of 0.6, reflecting the context found in the relevant documents. Meanwhile, the term "train" is completely excluded (clipped to 0), ensuring that the search engine does not retrieve documents related to rail transit.

We can visualize the dynamic movement of the query vector within a three-dimensional term space in the companion Manim script `scene.py`. The animation translates the query vector along the centroids of the relevant and non-relevant document clusters within the term space. The render command is: `manim -pql _working/05-manim/scene.py RocchioVectorShift`.

Pseudo-Relevance Feedback (PRF) and Query Expansion

In practical web search scenarios, users rarely take the time to explicitly mark search results as relevant or non-relevant. To automate this benefit, modern search engines utilize Pseudo-Relevance Feedback (PRF), also known as blind relevance feedback.

Instead of waiting for user input, the system performs a two-pass retrieval process:

  • First Pass: Run the user's original query to retrieve an initial set of documents.
  • Assumption: Assume that the top $k$ retrieved documents (typically $k=10$ or $20$) are relevant.
  • Reformulation: Apply the Rocchio formula using this assumed relevant set. Because we cannot safely assume that all lower-ranked documents are non-relevant, we set $\gamma = 0$, ignoring the negative feedback component. This eliminates the risk of subtracting useful terms.
  • Second Pass: Extract the top $N$ terms with the highest weights from the modified query vector $\vec{q}_m$, append them to the query, and run the search again to return the final results to the user.

While pseudo-relevance feedback is highly effective at resolving vocabulary mismatch without user intervention, it introduces the risk of query drift. If the top-ranked results from the first pass are irrelevant or contain an off-topic dominant theme, the query will expand in the wrong direction, leading to worse final results.

Rocchio vs. Modern Successors

The Rocchio algorithm belongs to the Vector Space Model (VSM) paradigm of information retrieval. As search technology evolved toward probabilistic models and language modeling frameworks, new feedback algorithms were developed to replace vector addition.

The most notable language modeling alternatives are RM3 (Relevance Model 3) and Kullback-Leibler (KL) Divergence Query Expansion (QE).

FeatureRocchioRM3 (Relevance Model 3)KL-Divergence Query Expansion
FrameworkVector Space Model (VSM)Language Modeling (LM)Probabilistic / Information Theory
Primary InputTF-IDF term weight vectorsProbability distributions over vocabularyTerm probability distributions
Mathematical BasisLinear vector algebraProbabilistic estimation & interpolationInformation-theoretic divergence
Negative FeedbackSupported natively via vector subtractionDifficult to model probabilisticallyUsed primarily for selection, not subtraction
Query Drift ControlParameter weighting (alpha, beta, gamma)Linear model interpolation (lambda)Thresholding top expansion terms

Mathematical Formulations

RM3 (Relevance Model 3)

Under the language modeling framework, retrieval is modeled as the probability of generating a query from a document's language model. RM3 estimates a relevance language model (Relevance Model 1, or RM1) $P(w | \theta_{RM1})$ from the top retrieved documents and interpolates it with the original query model $P(w | \theta_q)$ using a parameter $\lambda \in [0, 1]$:

$$P(w | \theta_{RM3}) = \lambda P(w | \theta_q) + (1 - \lambda) P(w | \theta_{RM1})$$

This linear interpolation acts as a smoothing mechanism. By keeping a fraction of the original query probability distribution, RM3 prevents query drift.

KL-Divergence Query Expansion

In this method, we select and score query expansion terms based on how much they contribute to the divergence between the relevance model $P$ (the probability of terms in the feedback set) and the background model $Q$ (the probability of terms in the entire collection):

$$D_{\text{KL}}(P \parallel Q) = \sum_{w} P(w | R) \log \frac{P(w | R)}{P(w | C)}$$

Terms that appear frequently in the feedback set $R$ but infrequently in the general collection $C$ yield a high Kullback-Leibler (KL) divergence score, making them excellent candidates for query expansion.

Search Engine Patents & Real-World Infrastructure

While the core Rocchio algorithm is an academic method in the public domain, search engine companies have patented specific systems that implement relevance feedback principles, particularly utilizing implicit user signals.

Google US 8,661,029 B1

Title: Modifying search result ranking based on implicit user feedback
Assignee: Google Inc. (now Google LLC)
Priority Date: November 2, 2006 | Publication Date: February 25, 2014
Relation: Uses relevance feedback concepts by tracking user clicks and document viewing times (dwell time) to assign relevance weights. The system uses these implicit user signals to re-rank results for subsequent queries, functioning as an automated, implicit relevance feedback loop. View Patent Page

Google US 8,938,463 B1

Title: Modifying search result ranking based on implicit user feedback and a model of presentation bias
Assignee: Google Inc. (now Google LLC)
Priority Date: March 12, 2007 | Publication Date: January 20, 2015
Relation: Extends the implicit feedback model by incorporating a prior model of presentation bias. Since users are biased to click higher-ranked results, the system discounts this bias to calculate a more mathematically accurate relevance signal for result re-ranking. View Patent Page

Microsoft US 8,645,289 B2

Title: Structured cross-lingual relevance feedback for enhancing search results
Assignee: Microsoft Corporation
Priority Date: December 16, 2010 | Publication Date: February 4, 2014
Relation: Applies relevance feedback across languages. When a user searches in one language, the system translates the query, retrieves documents in a second language, applies relevance feedback to select translated expansion terms, and reformulates the query to improve cross-lingual search accuracy. View Patent Page

Yahoo US 9,411,886 B2

Title: Ranking advertisements with pseudo-relevance feedback and translation models
Assignee: Yahoo! Inc.
Priority Date: March 31, 2008 | Publication Date: August 9, 2016
Relation: Uses pseudo-relevance feedback in combination with translation models to improve the matching and ranking of advertisements relative to user search queries. View Patent Page

Research Methods, Search Limits & References

Research Method and Search Limits

Our patent research was conducted using Google Patents, querying the terms "relevance feedback", "pseudo-relevance feedback", and "Rocchio" for the assignees Google, Microsoft, and Yahoo with a priority date floor of 2000-01-01.

We note that this search is not exhaustive. Patents that describe similar concepts using generic terminology (e.g., "clickthrough analysis", "user interest profiles", or "query translation") might not match these exact keyword queries. Consequently, the absence of a patent explicitly naming "Rocchio" does not mean search engines do not employ its mathematical principles.

References

1. Rocchio, J. J. Jr. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), *The SMART Retrieval System: Experiments in Automatic Document Processing* (pp. 313–323). Prentice-Hall. Internet Archive Link

2. Manning, C. D., Raghavan, P., & Schütze, H. (2008). *Introduction to Information Retrieval* (Chapter 9: Relevance feedback and query expansion). Cambridge University Press. Stanford NLP Group Link

3. Zhai, C., & Lafferty, J. D. (2001). Model-based feedback in the language modeling approach to information retrieval. *Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM)*. ACM DOI Link

4. Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. *Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval*. ACM DOI Link

No comments yet