Counting links is not enough
Suppose you want the best page for a query and all you have is the link graph. The obvious move is to count incoming links: whoever is pointed to the most wins. That is close to what an early link ranker would do, and it fails in a specific way.
Jon Kleinberg gave the clean example in his 1999 paper . Rank pages for the query *java* by raw incoming-link count and you get java.sun.com and gamelan.com, which are correct, sitting next to the Amazon home page and pages advertising Caribbean vacations, which are not. Those pages collect links from all over the web for reasons that have nothing to do with Java. They are universally popular, not authoritative on the topic.
HITS (Hyperlink-Induced Topic Search) resolves that tension by giving each page two scores instead of one, and computing them not over the whole web but over the handful of pages the query itself brings into view. That second part makes HITS authority topic-scoped: the same page can be an authority for one query and a nobody for another.
HITS came out of web search directly. Kleinberg developed it at the Almaden research center of International Business Machines (IBM), where it anchored a research project named CLEVER, and IBM holds the founding patent. A few years later the Teoma search engine (later Ask Jeeves, then Ask.com) shipped a HITS-derived method it called Subject-Specific Popularity.
Two roles that define each other
HITS splits the job of being good into two roles:
- A hub is a good list. It points to many good pages. A well-kept resources page is a hub.
- An authority is a good destination. It is pointed to by many good lists. The page everyone's resource lists send you to is an authority.
The two definitions lean on each other. A page is a good hub if it points to good authorities; a page is a good authority if good hubs point to it. That looks circular, and it is, but the circularity is the feature: it lets the two scores reinforce each other until they settle.
The piece of the web one query pulls up
Before any scoring, HITS builds a small subgraph tied to the query. This is the step that makes its authority scores query-dependent.
- Root set: the top results from an ordinary text search for the query, about 200 pages in Kleinberg's setup. These are relevant but link-poor.
- Base set: grow the root set by adding the pages it links to, plus pages that link into it (capped per root page). The result is usually a few thousand pages, relevant and richly linked.
HITS runs only on this base set. One cleanup helps: drop same-site navigation links (Kleinberg calls these intrinsic links) and keep links between different sites (transverse links), since a site linking to its own pages tells you little about authority.
This is the sharpest contrast with PageRank , which scores the entire web once, independent of any query. HITS scores a fresh, query-specific neighborhood each time.
The update rule
Give every page an authority score $x_p$ and a hub score $y_p$. To update a page's authority, add up the hub scores of the pages that link to it. To update its hub score, add up the authority scores of the pages it links to.
$$x_p \;\leftarrow\; \sum_{q\,:\,q \to p} y_q \qquad\qquad y_p \;\leftarrow\; \sum_{q\,:\,p \to q} x_q$$
After each round, rescale each set of scores so the squares sum to one ($\sum_p x_p^2 = 1$ and $\sum_p y_p^2 = 1$); otherwise the numbers grow without bound. Start every page at one, alternate the two updates, and the scores settle. That is the whole algorithm.
Watch a popular page lose
Here is a small graph built to show the effect. Ten pages: three topical hubs (H1, H2, H3) point to three candidate authorities (A1, A2, A3). A popular page P is pointed to by four weak hubs (W1 to W4) that mostly point only to P; one of them, W1, also points to A1, which keeps the graph connected. The numbers below are from this constructed example, computed directly.
Count incoming links and P wins outright with four, ahead of A1 and A2 at three and A3 at two. If raw popularity were authority, P would be the answer. Now run HITS. The table tracks each candidate's authority score by round (one round is omitted for space; the trend is monotone).
| Authority score | A1 | A2 | A3 | P |
|---|---|---|---|---|
| Raw incoming links | 3 | 3 | 2 | 4 |
| After round 1 | 0.487 | 0.487 | 0.324 | 0.649 |
| After round 2 | 0.575 | 0.521 | 0.356 | 0.521 |
| After round 3 | 0.595 | 0.559 | 0.380 | 0.434 |
| After round 5 | 0.599 | 0.600 | 0.411 | 0.336 |
| Converged | 0.594 | 0.626 | 0.432 | 0.261 |
After the first round P is on top, exactly where link-counting put it. Then it slides every round and ends last. P's four backers point almost nowhere except P, so they never earn real hub scores, so the authority they pass to P keeps shrinking. A1, A2, and A3 collect their links from the topical hubs H1, H2, and H3, which each point to several strong pages, so those hubs earn high scores and keep lifting the authorities they cite. P starts first and ends last.
Optional: the same thing in matrix form
This section restates the update rule as linear algebra and explains why the scores are guaranteed to settle. The convergence result already stated does not depend on it; skip to the next section if you do not need the math.
Write the authority scores as a vector $x$ and the hub scores as a vector $y$, and let $A$ be the adjacency matrix of the base set ($A_{ij}=1$ when page $i$ links to page $j$). The two update operations are then two matrix multiplications:
$$x = A^{\top} y \qquad\qquad y = A\,x$$
Substitute one into the other and each score vector satisfies its own equation:
$$x = A^{\top} A\,x \qquad\qquad y = A\,A^{\top} y$$
So the converged authority vector is the principal eigenvector of $A^{\top}A$, and the converged hub vector is the principal eigenvector of $A\,A^{\top}$. Kleinberg proves both, and notes that the alternating updates are power iteration, the standard way to find a principal eigenvector; in his experiments the top-ranked pages stabilize within about 20 rounds. The matrix $A^{\top}A$ has a plain reading too: its $(i,j)$ entry counts how many pages link to both $i$ and $j$, so HITS rewards pages that are co-cited by the same hubs.
HITS versus PageRank
HITS and PageRank arrived within a year of each other and answer the same question, which pages are authoritative, in opposite ways (see the HITS overview ). The difference is not which is better; it is what each one measures and when.
| Method (year) | What it computes | Scope | Notes |
|---|---|---|---|
| HITS (1999) | Two scores per page: hub and authority | Per query, on a small subgraph at search time | Topic-scoped; more work per query; prone to topic drift and the tightly-knit community effect |
| PageRank (1998) | One authority score per page, from a whole-web random walk | Whole web, computed once, query-independent | Cheap at query time; not topic-aware on its own |
| Bharat and Henzinger (1998) | HITS with links and pages weighted by query relevance | Per query | Reduces HITS topic drift |
| SALSA (2000) | Hub and authority scores from random walks on two chains | Per query | Less vulnerable to the tightly-knit community effect than HITS |
PageRank computes one number per page offline, so ranking at query time is cheap. HITS builds and solves a subgraph for every query, so it costs more at query time but can tune authority to the topic. That cost-versus-specificity tradeoff is why the two are usually discussed together.
Where HITS breaks, and the fixes
HITS has two well-known failure modes, and each has a named fix.
Topic drift
Expanding the root set into the base set can pull in a dense, off-topic community whose internal links overwhelm the query's own pages, so the top scores drift to a more popular neighboring topic. The fix is Bharat and Henzinger's: weight each link and page by how well its content matches the query, so off-topic density stops dominating. Their method, described in the patent as a modified Kleinberg algorithm, runs this relevance-weighted scoring to rank search results.
The tightly-knit community effect
A small set of pages that all link to each other can pump up each other's scores out of proportion to their real authority, a pattern Lempel and Moran named the tightly-knit community (TKC) effect. Their algorithm, SALSA (Stochastic Approach for Link-Structure Analysis), keeps the hub and authority split but replaces the reinforcement loop with random walks on two separate Markov chains, one over hubs and one over authorities. SALSA is less vulnerable to the TKC effect than HITS, while keeping HITS's topic-scoped, two-score character.
Patents tied to search
HITS is well-documented in patents because it was built inside an industrial research lab. Each entry below says whether the patent uses the algorithm (its method applies it), names it (mentions hubs and authorities or Kleinberg), or cites it (references the paper). The patents were verified on their own pages.
| Patent | Assignee | Priority / publication | Tie | Search use |
|---|---|---|---|---|
| US 6,112,202 | IBM (inventor Jon M. Kleinberg) | 1997 / 2000 | Uses, names | The HITS algorithm itself: hub and authority vectors over a query-focused subgraph to surface authoritative pages |
| US 6,738,678 | Bharat and Henzinger (inventors); assigned to Digital Equipment Corp. at filing, later AltaVista, then Overture | 1998 / 2004 | Uses, names, cites | A modified Kleinberg algorithm scores hubs and authorities on a pruned result-set graph to rank search results |
| US 7,117,206 | Overture Services | 2003 / 2006 | Uses, names, cites | Continuation of US 6,738,678: the same connectivity analysis for ranking results |
| US 7,580,931 | Microsoft | 2006 / 2009 | Names, cites | Topic distillation, the task HITS targets; relevance propagation over a site tree; cites Kleinberg |
| US 8,612,453 | Microsoft | 2009 / 2013 | Names, cites | Continuation of US 7,580,931: the same subsite-retrieval topic-distillation method |
Patent pages: US 6,112,202 , US 6,738,678 , US 7,117,206 , US 7,580,931 , US 8,612,453 .
Where it was used
The clearest production use was Teoma, the search engine that became Ask Jeeves and then Ask.com. Teoma built a topic community for the query at search time and ranked within it by what it called Subject-Specific Popularity : authority measured by links from pages on the same subject, rather than by overall link count. That is HITS's topic-scoped authority, shipped to users. The Bharat and Henzinger patent and the Microsoft patents above point at the same application area: ranking and topic distillation, the task of pulling the authoritative pages for a broad topic out of a larger candidate set.
Method and limits of the patent search
The patent search was run on Google Patents and confirmed against the issued documents from the US Patent and Trademark Office, reading assignee and dates off each patent's own page and assignment record. It is not exhaustive. The automated triage step did not complete (the patent endpoint returned errors during the run), so candidates came from name search plus direct retrieval rather than a full assignee-scoped sweep. Link analysis is often described as connectivity analysis, link-based ranking, or graph propagation without naming HITS, hubs, or authorities, so a name query misses those. No Google-assigned HITS patent was confirmed; absence is not proof one does not exist, and it is consistent with Google's public position that its ranking is built on PageRank rather than HITS. Bharat and Henzinger both later worked at Google, but US 6,738,678 was assigned to Digital Equipment Corporation at filing and passed through AltaVista to Overture, not to Google.
The one thing to remember: authority is not how many pages link to you. It is whether good hubs link to you, judged inside the topic the query is about.
References
- Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5), 1999, pp. 604 to 632. PDF , doi:10.1145/324133.324140 .
- US Patent 6,112,202, identifying authoritative information resources, IBM (inventor Jon M. Kleinberg), filed 1997, granted 2000. Google Patents .
- HITS algorithm, overview and PageRank contrast. Wikipedia .
- R. Lempel and S. Moran, The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, Computer Networks 33 (2000); journal version in ACM Transactions on Information Systems 19(2), 2001, pp. 131 to 160. Summary .
- Search Engine Watch, Looking at Links (2002), on Teoma's Subject-Specific Popularity. Article .
- Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine (PageRank), 1998. Stanford .
- US Patent 6,738,678, ranking hyperlinked pages using content and connectivity analysis (Bharat and Henzinger), filed 1998, granted 2004. Google Patents .
No comments yet