The core idea
Picture a graph: dots (nodes) joined by lines (edges). Some regions are densely linked; others barely connect. A community is a set of nodes with more edges inside it than you would expect by chance. The task is to find those sets automatically.
The core idea behind every algorithm here: let each node repeatedly take a cue from its neighbors, and the communities fall out of those local moves. No node sees the whole graph. Nobody sets the number of communities ahead of time. That is what makes the approach cheap enough to run on graphs with hundreds of millions of nodes.
This matters for search. Google holds a patent that runs label propagation over the web graph and describes feeding the result to a process that ranks search results. Microsoft's GraphRAG uses the Leiden algorithm to group a knowledge graph into a community hierarchy, then uses the per-community summaries to answer broad questions over a document collection. Those are two different uses: one is web-search ranking, the other is retrieval for a large language model.
A toy graph to anchor everything
The eye groups these nodes instantly: a left clump, a middle clump, a right clump. The algorithms have to reach the same answer using only local structure. The same 13-node graph (19 edges) carries every step below, so the numbers stay concrete.
Label propagation: the cheapest method
The label propagation algorithm (LPA), introduced by Raghavan, Albert, and Kumara in 2007 ( paper ), is the simplest of the three. Give every node a unique label. Then sweep through the nodes and, for each one, change its label to whichever label most of its neighbors carry. Ties are broken at random. Repeat until no node would change. Labels spread through dense regions and stall at the sparse boundaries between them, so each final label marks a community.
The update rule is just: each node takes the most common label among its neighbors. Each pass costs time proportional to the number of edges (near-linear), and the method needs only a few passes. It optimizes no global score.
Where it falls short
That simplicity has costs. The order nodes are visited and the random tie-breaks make LPA non-deterministic: two runs on the same graph can return different partitions. If every node updates at the same time (synchronous updating) rather than one at a time (asynchronous), labels can flip back and forth forever, so the original paper uses asynchronous updates. And nothing stops the labels from collapsing so that one giant label swallows most of the graph.
Modularity: a score for a partition
LPA exposes a gap. It gave two different answers on two runs, and it offers no way to say which one is better, because it never measures the quality of a partition. Modularity, introduced by Newman and Girvan in 2004 ( paper ), is that measure. It compares the fraction of edges inside communities against the fraction you would expect by chance. The chance baseline rewires the same edges at random but keeps each node's degree (its number of edges) fixed. Higher modularity means stronger community structure.
Modularity is written:
Here m is the total number of edges; Aij is 1 when nodes i and j are linked; ki is the degree of node i; and the delta term equals 1 only when i and j are in the same community, so only same-community pairs count. For the toy graph's three-way split, Q works out to 0.5554, which the figures round to 0.56.
A known limit: modularity optimization has a resolution limit. It can fail to separate communities smaller than a scale set by the size of the whole graph, so genuinely distinct small communities get merged (Fortunato and Barthelemy, 2007, paper ). It applies to any method that maximizes modularity, including the next one.
Louvain: optimize modularity, fast
The Louvain method (Blondel, Guillaume, Lambiotte, and Lefebvre, 2008, paper ) is a greedy way to push modularity up. It alternates two phases.
Phase 1 (local moving): start with every node in its own community. For each node, consider moving it into the community of each neighbor, and compute the change in modularity. Make the move that increases Q the most; repeat until stable. On the toy graph, moving node 4 from being alone into community A raises Q from 0.500 to 0.555, a gain of +0.055. The gain has a closed form, so it is cheap to evaluate:
Phase 2 (aggregation): collapse each community into a single super-node. Edges between two communities become one weighted edge between their super-nodes, and each super-node gets a self-loop whose weight is the number of edges that were inside that community. Run Phase 1 again on the smaller graph; repeating the two phases builds a hierarchy and scales to very large graphs (the paper partitioned a web graph of 118 million nodes and more than a billion links).
Where it falls short
Louvain optimizes a number, not connectivity, and that creates a specific defect. A community it reports can be badly connected, and in the worst case the nodes sharing one community label are not even connected to each other inside that community. It also inherits modularity's resolution limit.
Leiden: guarantee connected communities
Leiden (Traag, Waltman, and van Eck, 2019, paper ) targets exactly that defect. The authors show that the Louvain algorithm "may yield arbitrarily badly connected communities" and that "communities may even be disconnected."
Leiden runs three phases instead of two: local moving, then a refinement step, then aggregation built from the refined partition. The refinement step only forms a community out of nodes that are actually connected within it, so every community Leiden outputs is guaranteed to be connected. The paper also reports that Leiden runs faster than Louvain. The guarantee is connectivity, not a globally optimal partition; the resolution limit still applies.
How they relate
| Algorithm | Problem it solves | Key weakness | What the next one fixed |
|---|---|---|---|
| Label propagation | Find communities with the cheapest local rule, no parameters. | No objective to compare partitions; non-deterministic; can collapse to one giant community. | Modularity adds a score; Louvain optimizes it. |
| Louvain | Greedily maximize modularity in two phases; scales to huge graphs. | Reported communities can be badly connected or disconnected; resolution limit. | Leiden adds a refinement step that guarantees connected communities. |
| Leiden | Same modularity goal, plus a connected-community guarantee, and faster. | Still bound by modularity's resolution limit. | Adjustable-resolution and alternative quality functions address the limit. |
The line runs in one direction. Label propagation is cheapest but has no objective, so you cannot rank its outputs. Modularity supplies that objective, and Louvain climbs it greedily, but optimizing a single number lets it report communities whose members are not connected. Leiden inserts a refinement step that builds communities only from connected pieces, removing that failure while running faster.
The video
The companion video animates the toy graph through each step: label propagation updating node by node, modularity scoring a partition, Louvain's two phases, and Leiden's refinement split, then the search ties and a source list. The full explanation lives on this page; the video is the visual aid.
Patents tied to search
Label propagation appears, by name, in a granted Google patent family aimed at processing very large graphs. The patents describe representing the World Wide Web as a graph and analyzing it "to provide information to a search engine process that ranks search results," and they list spam detection among the uses.
| Patent | Title | Assignee | Relation |
|---|---|---|---|
| US 8,793,283 B1 | Label Propagation in a Distributed System | Google Inc. | Uses and names label propagation; cites Raghavan et al. 2007 |
| US 9,652,876 B1 | Label Propagation in a Distributed System | Google Inc. | Continuation; uses and names label propagation |
The inventors are the team behind Pregel, Google's large-scale graph processing framework, and the patent runs label propagation in Pregel's synchronized, bulk-synchronous iteration model rather than the one-node-at-a-time asynchronous procedure above. The "uses and names" relation is confirmed from the granted patent text; the citation to Raghavan et al. was read from the USPTO grant document and is reported on that basis.
Applications in search and retrieval
Retrieval for language models: Microsoft's GraphRAG builds a knowledge graph from a document collection, runs the Leiden algorithm to partition it into a hierarchy of communities, and writes a summary for each community. Its global search mode answers broad questions by reasoning over those community summaries ( paper , docs ). Community detection is also a standard tool for grouping search queries by intent, clustering near-duplicate pages, grouping entities in a knowledge graph, and the recommendation cold-start problem. Label propagation and Louvain ship in common graph libraries (Neo4j Graph Data Science, NetworkX, igraph).
Method and limits
Patent search used Google Patents, filtered to United States, English-language grants with a priority date on or after 2010-01-01, focused on Google and Microsoft (Bing) as assignees. The search is not exhaustive: a patent that uses these methods while only saying "graph clustering" would not surface in a name query, so absence of a named patent is not proof none exists. Algorithm definitions, the modularity and gain formulas, the 118-million-node figure, and the supersession claims were checked against the primary papers below.
References
- Raghavan, Albert, Kumara (2007). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106. https://arxiv.org/abs/0709.2938
- Blondel, Guillaume, Lambiotte, Lefebvre (2008). Fast unfolding of communities in large networks. J. Stat. Mech. P10008. https://arxiv.org/abs/0803.0476
- Traag, Waltman, van Eck (2019). From Louvain to Leiden. Scientific Reports 9, 5233. https://arxiv.org/abs/1810.08473
- Newman, Girvan (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113. https://arxiv.org/abs/cond-mat/0308217
- Fortunato, Barthelemy (2007). Resolution limit in community detection. PNAS 104(1), 36-41. https://arxiv.org/abs/physics/0607100
- Edge et al. (2024). From Local to Global: A Graph RAG Approach. https://arxiv.org/abs/2404.16130
No comments yet