HITS (Hubs and Authorities): Topic-Scoped Authority in Search

Counting links is not enough

Suppose you want the best page for a query and all you have is the link graph. The obvious move is to count incoming links: whoever is pointed to the most wins. That is close to what an early link ranker would do, and it fails in a specific way.

Jon Kleinberg gave the clean example in his 1999 paper . Rank pages for the query *java* by raw incoming-link count and you get java.sun.com and gamelan.com, which are correct, sitting next to the Amazon home page and pages advertising Caribbean vacations, which are not. Those pages collect links from all over the web for reasons that have nothing to do with Java. They are universally popular, not authoritative on the topic.

HITS (Hyperlink-Induced Topic Search) resolves that tension by giving each page two scores instead of one, and computing them not over the whole web but over the handful of pages the query itself brings into view. That second part makes HITS authority topic-scoped: the same page can be an authority for one query and a nobody for another.

Two ranked lists. Ranked by incoming links, the popular page P leads with 4, ahead of authorities A1 and A2 at 3 and A3 at 2. Ranked by HITS authority, A2 leads at 0.63, then A1 at 0.59 and A3 at 0.43, with P last at 0.26. An arrow shows P dropping from first to last.

HITS came out of web search directly. Kleinberg developed it at the Almaden research center of International Business Machines (IBM), where it anchored a research project named CLEVER, and IBM holds the founding patent. A few years later the Teoma search engine (later Ask Jeeves, then Ask.com) shipped a HITS-derived method it called Subject-Specific Popularity.

Two roles that define each other

HITS splits the job of being good into two roles:

A hub is a good list. It points to many good pages. A well-kept resources page is a hub.
An authority is a good destination. It is pointed to by many good lists. The page everyone's resource lists send you to is an authority.

The two definitions lean on each other. A page is a good hub if it points to good authorities; a page is a good authority if good hubs point to it. That looks circular, and it is, but the circularity is the feature: it lets the two scores reinforce each other until they settle.

A bipartite schematic: hubs on the left send arrows to authorities on the right, with a two-arrow loop between them. Callouts read: a good hub points to many good authorities, and a good authority is pointed to by many good hubs.

The piece of the web one query pulls up

Before any scoring, HITS builds a small subgraph tied to the query. This is the step that makes its authority scores query-dependent.

Root set: the top results from an ordinary text search for the query, about 200 pages in Kleinberg's setup. These are relevant but link-poor.
Base set: grow the root set by adding the pages it links to, plus pages that link into it (capped per root page). The result is usually a few thousand pages, relevant and richly linked.

HITS runs only on this base set. One cleanup helps: drop same-site navigation links (Kleinberg calls these intrinsic links) and keep links between different sites (transverse links), since a site linking to its own pages tells you little about authority.

A concentric set diagram: a query box feeds an inner root-set disc of text-search results, which sits inside a larger base-set ring of pages added by following links in and out. A note says HITS runs on this subgraph, one query's neighborhood.

This is the sharpest contrast with PageRank , which scores the entire web once, independent of any query. HITS scores a fresh, query-specific neighborhood each time.

The update rule

Give every page an authority score $x_p$ and a hub score $y_p$. To update a page's authority, add up the hub scores of the pages that link to it. To update its hub score, add up the authority scores of the pages it links to.

$$x_p \;\leftarrow\; \sum_{q\,:\,q \to p} y_q \qquad\qquad y_p \;\leftarrow\; \sum_{q\,:\,p \to q} x_q$$

After each round, rescale each set of scores so the squares sum to one ($\sum_p x_p^2 = 1$ and $\sum_p y_p^2 = 1$); otherwise the numbers grow without bound. Start every page at one, alternate the two updates, and the scores settle. That is the whole algorithm.

Watch a popular page lose

Here is a small graph built to show the effect. Ten pages: three topical hubs (H1, H2, H3) point to three candidate authorities (A1, A2, A3). A popular page P is pointed to by four weak hubs (W1 to W4) that mostly point only to P; one of them, W1, also points to A1, which keeps the graph connected. The numbers below are from this constructed example, computed directly.

A directed ten-node graph: three cobalt hubs H1, H2, H3 point to three teal authorities A1, A2, A3; a coral popular page P is pointed to by four gray weak hubs W1 to W4, and W1 also points to A1. Annotations note P has the most incoming links (4) and the four weak hubs mostly just point to P. No scores are shown.

Count incoming links and P wins outright with four, ahead of A1 and A2 at three and A3 at two. If raw popularity were authority, P would be the answer. Now run HITS. The table tracks each candidate's authority score by round (one round is omitted for space; the trend is monotone).

Authority score	A1	A2	A3	P
Raw incoming links	3	3	2	4
After round 1	0.487	0.487	0.324	0.649
After round 2	0.575	0.521	0.356	0.521
After round 3	0.595	0.559	0.380	0.434
After round 5	0.599	0.600	0.411	0.336
Converged	0.594	0.626	0.432	0.261

After the first round P is on top, exactly where link-counting put it. Then it slides every round and ends last. P's four backers point almost nowhere except P, so they never earn real hub scores, so the authority they pass to P keeps shrinking. A1, A2, and A3 collect their links from the topical hubs H1, H2, and H3, which each point to several strong pages, so those hubs earn high scores and keep lifting the authorities they cite. P starts first and ends last.

Optional: the same thing in matrix form

This section restates the update rule as linear algebra and explains why the scores are guaranteed to settle. The convergence result already stated does not depend on it; skip to the next section if you do not need the math.

Write the authority scores as a vector $x$ and the hub scores as a vector $y$, and let $A$ be the adjacency matrix of the base set ($A_{ij}=1$ when page $i$ links to page $j$). The two update operations are then two matrix multiplications:

$$x = A^{\top} y \qquad\qquad y = A\,x$$

Substitute one into the other and each score vector satisfies its own equation:

$$x = A^{\top} A\,x \qquad\qquad y = A\,A^{\top} y$$

So the converged authority vector is the principal eigenvector of $A^{\top}A$, and the converged hub vector is the principal eigenvector of $A\,A^{\top}$. Kleinberg proves both, and notes that the alternating updates are power iteration, the standard way to find a principal eigenvector; in his experiments the top-ranked pages stabilize within about 20 rounds. The matrix $A^{\top}A$ has a plain reading too: its $(i,j)$ entry counts how many pages link to both $i$ and $j$, so HITS rewards pages that are co-cited by the same hubs.

HITS versus PageRank

HITS and PageRank arrived within a year of each other and answer the same question, which pages are authoritative, in opposite ways (see the HITS overview ). The difference is not which is better; it is what each one measures and when.

Two panels. PageRank: a web of eight nodes, each with one score badge, computed once for the whole web, independent of the query. HITS: a magnifier pulls a small subgraph out of the web; each node carries two badges, a hub mark and an authority mark, computed at query time on a topic subgraph.

Method (year)	What it computes	Scope	Notes
HITS (1999)	Two scores per page: hub and authority	Per query, on a small subgraph at search time	Topic-scoped; more work per query; prone to topic drift and the tightly-knit community effect
PageRank (1998)	One authority score per page, from a whole-web random walk	Whole web, computed once, query-independent	Cheap at query time; not topic-aware on its own
Bharat and Henzinger (1998)	HITS with links and pages weighted by query relevance	Per query	Reduces HITS topic drift
SALSA (2000)	Hub and authority scores from random walks on two chains	Per query	Less vulnerable to the tightly-knit community effect than HITS

PageRank computes one number per page offline, so ranking at query time is cheap. HITS builds and solves a subgraph for every query, so it costs more at query time but can tune authority to the topic. That cost-versus-specificity tradeoff is why the two are usually discussed together.

Where HITS breaks, and the fixes

HITS has two well-known failure modes, and each has a named fix.

Topic drift

Expanding the root set into the base set can pull in a dense, off-topic community whose internal links overwhelm the query's own pages, so the top scores drift to a more popular neighboring topic. The fix is Bharat and Henzinger's: weight each link and page by how well its content matches the query, so off-topic density stops dominating. Their method, described in the patent as a modified Kleinberg algorithm, runs this relevance-weighted scoring to rank search results.

The tightly-knit community effect

A small set of pages that all link to each other can pump up each other's scores out of proportion to their real authority, a pattern Lempel and Moran named the tightly-knit community (TKC) effect. Their algorithm, SALSA (Stochastic Approach for Link-Structure Analysis), keeps the hub and authority split but replaces the reinforcement loop with random walks on two separate Markov chains, one over hubs and one over authorities. SALSA is less vulnerable to the TKC effect than HITS, while keeping HITS's topic-scoped, two-score character.

Patents tied to search

HITS is well-documented in patents because it was built inside an industrial research lab. Each entry below says whether the patent uses the algorithm (its method applies it), names it (mentions hubs and authorities or Kleinberg), or cites it (references the paper). The patents were verified on their own pages.

Patent	Assignee	Priority / publication	Tie	Search use
US 6,112,202	IBM (inventor Jon M. Kleinberg)	1997 / 2000	Uses, names	The HITS algorithm itself: hub and authority vectors over a query-focused subgraph to surface authoritative pages
US 6,738,678	Bharat and Henzinger (inventors); assigned to Digital Equipment Corp. at filing, later AltaVista, then Overture	1998 / 2004	Uses, names, cites	A modified Kleinberg algorithm scores hubs and authorities on a pruned result-set graph to rank search results
US 7,117,206	Overture Services	2003 / 2006	Uses, names, cites	Continuation of US 6,738,678: the same connectivity analysis for ranking results
US 7,580,931	Microsoft	2006 / 2009	Names, cites	Topic distillation, the task HITS targets; relevance propagation over a site tree; cites Kleinberg
US 8,612,453	Microsoft	2009 / 2013	Names, cites	Continuation of US 7,580,931: the same subsite-retrieval topic-distillation method

Patent pages: US 6,112,202 , US 6,738,678 , US 7,117,206 , US 7,580,931 , US 8,612,453 .

Where it was used

The clearest production use was Teoma, the search engine that became Ask Jeeves and then Ask.com. Teoma built a topic community for the query at search time and ranked within it by what it called Subject-Specific Popularity : authority measured by links from pages on the same subject, rather than by overall link count. That is HITS's topic-scoped authority, shipped to users. The Bharat and Henzinger patent and the Microsoft patents above point at the same application area: ranking and topic distillation, the task of pulling the authoritative pages for a broad topic out of a larger candidate set.

Method and limits of the patent search

The patent search was run on Google Patents and confirmed against the issued documents from the US Patent and Trademark Office, reading assignee and dates off each patent's own page and assignment record. It is not exhaustive. The automated triage step did not complete (the patent endpoint returned errors during the run), so candidates came from name search plus direct retrieval rather than a full assignee-scoped sweep. Link analysis is often described as connectivity analysis, link-based ranking, or graph propagation without naming HITS, hubs, or authorities, so a name query misses those. No Google-assigned HITS patent was confirmed; absence is not proof one does not exist, and it is consistent with Google's public position that its ranking is built on PageRank rather than HITS. Bharat and Henzinger both later worked at Google, but US 6,738,678 was assigned to Digital Equipment Corporation at filing and passed through AltaVista to Overture, not to Google.

The one thing to remember: authority is not how many pages link to you. It is whether good hubs link to you, judged inside the topic the query is about.

References

Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5), 1999, pp. 604 to 632. PDF , doi:10.1145/324133.324140 .
US Patent 6,112,202, identifying authoritative information resources, IBM (inventor Jon M. Kleinberg), filed 1997, granted 2000. Google Patents .
HITS algorithm, overview and PageRank contrast. Wikipedia .
R. Lempel and S. Moran, The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, Computer Networks 33 (2000); journal version in ACM Transactions on Information Systems 19(2), 2001, pp. 131 to 160. Summary .
Search Engine Watch, Looking at Links (2002), on Teoma's Subject-Specific Popularity. Article .
Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine (PageRank), 1998. Stanford .
US Patent 6,738,678, ranking hyperlinked pages using content and connectivity analysis (Bharat and Henzinger), filed 1998, granted 2004. Google Patents .