Finding communities in a graph: Label Propagation, Louvain, and Leiden

The core idea

Picture a graph: dots (nodes) joined by lines (edges). Some regions are densely linked; others barely connect. A community is a set of nodes with more edges inside it than you would expect by chance. The task is to find those sets automatically.

The core idea behind every algorithm here: let each node repeatedly take a cue from its neighbors, and the communities fall out of those local moves. No node sees the whole graph. Nobody sets the number of communities ahead of time. That is what makes the approach cheap enough to run on graphs with hundreds of millions of nodes.

This matters for search. Google holds a patent that runs label propagation over the web graph and describes feeding the result to a process that ranks search results. Microsoft's GraphRAG uses the Leiden algorithm to group a knowledge graph into a community hierarchy, then uses the per-community summaries to answer broad questions over a document collection. Those are two different uses: one is web-search ranking, the other is retrieval for a large language model.

A toy graph to anchor everything

Graph of 13 grey numbered nodes in three dense clumps joined by two bridge edges (4-5 and 8-9); no communities assigned yet.

The eye groups these nodes instantly: a left clump, a middle clump, a right clump. The algorithms have to reach the same answer using only local structure. The same 13-node graph (19 edges) carries every step below, so the numbers stay concrete.

Label propagation: the cheapest method

The label propagation algorithm (LPA), introduced by Raghavan, Albert, and Kumara in 2007 ( paper ), is the simplest of the three. Give every node a unique label. Then sweep through the nodes and, for each one, change its label to whichever label most of its neighbors carry. Ties are broken at random. Repeat until no node would change. Labels spread through dense regions and stall at the sparse boundaries between them, so each final label marks a community.

Before and after of one label propagation step on node 4: it has two blue neighbors and one orange, so it takes the blue label by majority vote.

The update rule is just: each node takes the most common label among its neighbors. Each pass costs time proportional to the number of edges (near-linear), and the method needs only a few passes. It optimizes no global score.

Where it falls short

That simplicity has costs. The order nodes are visited and the random tie-breaks make LPA non-deterministic: two runs on the same graph can return different partitions. If every node updates at the same time (synchronous updating) rather than one at a time (asynchronous), labels can flip back and forth forever, so the original paper uses asynchronous updates. And nothing stops the labels from collapsing so that one giant label swallows most of the graph.

Modularity: a score for a partition

LPA exposes a gap. It gave two different answers on two runs, and it offers no way to say which one is better, because it never measures the quality of a partition. Modularity, introduced by Newman and Girvan in 2004 ( paper ), is that measure. It compares the fraction of edges inside communities against the fraction you would expect by chance. The chance baseline rewires the same edges at random but keeps each node's degree (its number of edges) fixed. Higher modularity means stronger community structure.

The toy graph colored by its natural three-way split, showing the modularity value Q = 0.56 and the modularity formula.

Modularity is written:

Q = \frac{1}{2 m} ij \sum [A_{ij} - \frac{k _{i} k _{j}}{2 m}] δ (c_{i}, c_{j})

Here m is the total number of edges; Aij is 1 when nodes i and j are linked; ki is the degree of node i; and the delta term equals 1 only when i and j are in the same community, so only same-community pairs count. For the toy graph's three-way split, Q works out to 0.5554, which the figures round to 0.56.

A known limit: modularity optimization has a resolution limit. It can fail to separate communities smaller than a scale set by the size of the whole graph, so genuinely distinct small communities get merged (Fortunato and Barthelemy, 2007, paper ). It applies to any method that maximizes modularity, including the next one.

Louvain: optimize modularity, fast

The Louvain method (Blondel, Guillaume, Lambiotte, and Lefebvre, 2008, paper ) is a greedy way to push modularity up. It alternates two phases.

Louvain's two phases: Phase 1 moves node 4 into community A with a modularity gain of +0.055; Phase 2 collapses each community into a super-node with self-loop weights 7, 5, 5 and weight-1 super-edges.

Phase 1 (local moving): start with every node in its own community. For each node, consider moving it into the community of each neighbor, and compute the change in modularity. Make the move that increases Q the most; repeat until stable. On the toy graph, moving node 4 from being alone into community A raises Q from 0.500 to 0.555, a gain of +0.055. The gain has a closed form, so it is cheap to evaluate:

Δ Q = [\frac{Σ _{in} + k _{i, in}}{2 m} - (\frac{Σ _{t o t} + k _{i}}{2 m})^{2}] - [\frac{Σ _{in}}{2 m} - (\frac{Σ _{t o t}}{2 m})^{2} - (\frac{k _{i}}{2 m})^{2}]

Phase 2 (aggregation): collapse each community into a single super-node. Edges between two communities become one weighted edge between their super-nodes, and each super-node gets a self-loop whose weight is the number of edges that were inside that community. Run Phase 1 again on the smaller graph; repeating the two phases builds a hierarchy and scales to very large graphs (the paper partitioned a web graph of 118 million nodes and more than a billion links).

Where it falls short

Louvain optimizes a number, not connectivity, and that creates a specific defect. A community it reports can be badly connected, and in the worst case the nodes sharing one community label are not even connected to each other inside that community. It also inherits modularity's resolution limit.

Leiden: guarantee connected communities

Leiden (Traag, Waltman, and van Eck, 2019, paper ) targets exactly that defect. The authors show that the Louvain algorithm "may yield arbitrarily badly connected communities" and that "communities may even be disconnected."

A Louvain community split into two disconnected pieces (left); Leiden's refinement step splits it so every reported community is internally connected (right).

Leiden runs three phases instead of two: local moving, then a refinement step, then aggregation built from the refined partition. The refinement step only forms a community out of nodes that are actually connected within it, so every community Leiden outputs is guaranteed to be connected. The paper also reports that Leiden runs faster than Louvain. The guarantee is connectivity, not a globally optimal partition; the resolution limit still applies.

How they relate

Algorithm	Problem it solves	Key weakness	What the next one fixed
Label propagation	Find communities with the cheapest local rule, no parameters.	No objective to compare partitions; non-deterministic; can collapse to one giant community.	Modularity adds a score; Louvain optimizes it.
Louvain	Greedily maximize modularity in two phases; scales to huge graphs.	Reported communities can be badly connected or disconnected; resolution limit.	Leiden adds a refinement step that guarantees connected communities.
Leiden	Same modularity goal, plus a connected-community guarantee, and faster.	Still bound by modularity's resolution limit.	Adjustable-resolution and alternative quality functions address the limit.

Supersession diagram: LPA, plus an objective (modularity), to Louvain, plus a refinement phase, to Leiden.

The line runs in one direction. Label propagation is cheapest but has no objective, so you cannot rank its outputs. Modularity supplies that objective, and Louvain climbs it greedily, but optimizing a single number lets it report communities whose members are not connected. Leiden inserts a refinement step that builds communities only from connected pieces, removing that failure while running faster.

The video

The companion video animates the toy graph through each step: label propagation updating node by node, modularity scoring a partition, Louvain's two phases, and Leiden's refinement split, then the search ties and a source list. The full explanation lives on this page; the video is the visual aid.

Community detection on the toy graph.

Patents tied to search

Label propagation appears, by name, in a granted Google patent family aimed at processing very large graphs. The patents describe representing the World Wide Web as a graph and analyzing it "to provide information to a search engine process that ranks search results," and they list spam detection among the uses.

Patent	Title	Assignee	Relation
US 8,793,283 B1	Label Propagation in a Distributed System	Google Inc.	Uses and names label propagation; cites Raghavan et al. 2007
US 9,652,876 B1	Label Propagation in a Distributed System	Google Inc.	Continuation; uses and names label propagation

The inventors are the team behind Pregel, Google's large-scale graph processing framework, and the patent runs label propagation in Pregel's synchronized, bulk-synchronous iteration model rather than the one-node-at-a-time asynchronous procedure above. The "uses and names" relation is confirmed from the granted patent text; the citation to Raghavan et al. was read from the USPTO grant document and is reported on that basis.

Applications in search and retrieval

Communities of a knowledge graph feeding two outcomes: a RANK box (label propagation over the web graph, Google patent) and a RETRIEVE box (Leiden communities to GraphRAG to LLM answers).

Retrieval for language models: Microsoft's GraphRAG builds a knowledge graph from a document collection, runs the Leiden algorithm to partition it into a hierarchy of communities, and writes a summary for each community. Its global search mode answers broad questions by reasoning over those community summaries ( paper , docs ). Community detection is also a standard tool for grouping search queries by intent, clustering near-duplicate pages, grouping entities in a knowledge graph, and the recommendation cold-start problem. Label propagation and Louvain ship in common graph libraries (Neo4j Graph Data Science, NetworkX, igraph).

Method and limits

Patent search used Google Patents, filtered to United States, English-language grants with a priority date on or after 2010-01-01, focused on Google and Microsoft (Bing) as assignees. The search is not exhaustive: a patent that uses these methods while only saying "graph clustering" would not surface in a name query, so absence of a named patent is not proof none exists. Algorithm definitions, the modularity and gain formulas, the 118-million-node figure, and the supersession claims were checked against the primary papers below.

References

Raghavan, Albert, Kumara (2007). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106. https://arxiv.org/abs/0709.2938
Blondel, Guillaume, Lambiotte, Lefebvre (2008). Fast unfolding of communities in large networks. J. Stat. Mech. P10008. https://arxiv.org/abs/0803.0476
Traag, Waltman, van Eck (2019). From Louvain to Leiden. Scientific Reports 9, 5233. https://arxiv.org/abs/1810.08473
Newman, Girvan (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113. https://arxiv.org/abs/cond-mat/0308217
Fortunato, Barthelemy (2007). Resolution limit in community detection. PNAS 104(1), 36-41. https://arxiv.org/abs/physics/0607100
Edge et al. (2024). From Local to Global: A Graph RAG Approach. https://arxiv.org/abs/2404.16130

Finding communities in a graph: Label Propagation, Louvain, and Leiden

The core idea

A toy graph to anchor everything

Label propagation: the cheapest method

Where it falls short

Modularity: a score for a partition

Louvain: optimize modularity, fast

Where it falls short

Leiden: guarantee connected communities

How they relate

The video

Patents tied to search

Applications in search and retrieval

Method and limits

References

No comments yet

Continue reading

Rocchio Relevance Feedback: Vector Space Query Expansion in Search

HITS (Hubs and Authorities): Topic-Scoped Authority in Search

Locality-sensitive hashing: near-duplicate detection at scale with MinHash and SimHash