PrefixSpan and GSP: finding frequent paths in search sessions

The core idea

Search and recommendation logs are full of sequences: a person reformulates a query, clicks a result, opens documentation, tries pricing, starts a trial, or abandons. A sequential pattern is an ordered subsequence that appears in enough of those sessions to matter. The events do not need to be adjacent, only ordered.

The key question is how to search the enormous space of possible subsequences. GSP uses a generate-and-test loop inspired by Apriori-style mining. PrefixSpan changes the search space: after choosing a prefix, it only looks at suffixes that follow that prefix. That small shift is the whole story.

A small search-session database

Five search sessions; the ordered pattern login, docs, trial is highlighted in sessions S1, S2, and S5, so its support is three of five sessions.

With minimum support set to 3 sessions, <login, docs, trial> is frequent because it appears in S1, S2, and S5. Let each row be one session:

S1: <login, pricing, docs, trial>
S2: <login, docs, api, trial>
S3: <pricing, docs, api, trial>
S4: <login, pricing, api>
S5: <login, docs, trial>

A pattern's support is the number of sequences that contain it. In the notation used by the sequential-pattern literature, support can be written as:

support (p) = ∣ {s \in D : p ⊑ s} ∣

Here D is the sequence database, p is a subsequence of s, and a session contributes at most one count to support. Agrawal and Srikant introduced this problem in their 1995 paper on mining sequential patterns, where a sequence is a list of itemsets ordered by time and a pattern is frequent when it reaches a user-specified support threshold ( paper ).

GSP: generate candidates, then count them

Generalized Sequential Patterns (GSP), from Srikant and Agrawal's 1996 paper, extends the original problem with useful constraints such as minimum and maximum gaps, sliding windows, and taxonomies ( paper ). Its basic mining strategy is still easy to see on the toy data.

A GSP flow: L1 frequent single events, C2 candidate pairs and L2 survivors, then one L3 survivor, login then docs then trial.

GSP works by levels: it uses frequent patterns from the previous level to generate candidates for the next level, scans the database to count them, and prunes anything below support. With minimum support 3, the frequent one-event patterns are <login>, <pricing>, <docs>, <api>, and <trial>. GSP joins these into candidate two-event patterns, scans all five sessions, and keeps the supported ones:

<login, docs>, support 3.
<login, trial>, support 3.
<docs, trial>, support 4.

Then GSP joins those survivors into length-3 candidates. The candidate <login, docs, trial> survives because each of its length-2 subsequences is frequent and because the full pattern appears in three sessions. No supported length-4 pattern remains.

The important cost is not in this tiny example. It appears when the database has many frequent items or many long patterns. GSP can generate a large set of candidate sequences and then rescan the original database to test them. Pei and coauthors describe this candidate-generation-and-test loop as one of the main costs PrefixSpan was designed to reduce ( paper ).

PrefixSpan: project around the prefix

PrefixSpan, short for Prefix-projected Sequential pattern mining, keeps the same support rule but changes the search strategy. Instead of generating broad candidate sets for the whole database, it takes a frequent prefix and builds the projected database of suffixes that follow that prefix ( paper ).

PrefixSpan projection for the prefix login: the sessions containing login are reduced to suffixes after login, then login docs grows to login docs trial.

For the prefix <login>, only four sessions matter. Their suffixes are:

S1: <pricing, docs, trial>
S2: <docs, api, trial>
S4: <pricing, api>
S5: <docs, trial>

Inside that projected database, docs and trial each appear in at least three suffixes, so PrefixSpan grows <login, docs> and <login, trial>. Now project again, this time for <login, docs>. The suffixes are <trial>, <api, trial>, and <trial>. The locally frequent extension is trial, so the pattern becomes <login, docs, trial>.

Pei et al. report that PrefixSpan mines the complete pattern set while reducing candidate generation, and that its projected databases get smaller as mining proceeds. They also describe pseudo-projection: when projected databases fit in memory, the algorithm can store pointers and offsets instead of copying suffixes. Apache Spark exposes PrefixSpan with parameters for minimum support, maximum pattern length, and projected database size ( Spark docs ).

How the algorithms relate

Algorithm	Search strategy	Strength	Main cost or limit
GSP	Generate candidate k-sequences from frequent (k-1)-sequences, then count candidates by scanning the database.	Clear level-wise logic. Supports practical constraints such as time gaps, sliding windows, and taxonomies.	Candidate sets and repeated full-database scans can become expensive, especially with many long frequent patterns.
PrefixSpan	Choose a frequent prefix, project the database to suffixes after that prefix, and grow patterns recursively from local support.	Reduces candidate generation and focuses counting on smaller projected databases.	Projection still has a cost. Implementations need to manage memory, copying, and projected database size.

A compact way to say it: GSP asks, "Which candidates should we test next against the full database?" PrefixSpan asks, "Given this prefix, what can still follow it in the suffixes that remain?"

The video

The companion video walks through the same toy sessions used here: support, the GSP candidate passes, the PrefixSpan projection, and the search-system tie. The full explanation lives on this page; the video is the visual aid.

Sequential pattern mining on the toy search sessions.

Where this connects to search

The direct search connection is not that a modern web search engine must literally run PrefixSpan or GSP in its ranking stack. The connection is that search systems produce ordered logs, and ordered logs are exactly the data these algorithms were built to mine.

Query recommendation

Search engines can mine query logs to suggest follow-up queries. Zhang and Nasraoui's WWW 2006 paper is a short example of query-log mining for query recommendation ( record ). The larger query-log mining survey by Silvestri reviews how query logs can be used to enhance search-engine operations, including recommendation and efficiency tasks ( survey ).

Web access recommendation

A web recommendation system can treat each session as a sequence of page visits. Niranjan, Subramanyam, and Khanaa describe using PrefixSpan on preprocessed web server logs to mine sequential web access patterns before building a recommendation structure ( chapter ).

In practice, sequential patterns are usually a feature or candidate generator, not the whole product. A pattern like <login, docs, trial> might tell a search or product team that documentation often acts as the bridge between account access and conversion. In a search interface, related patterns can feed query suggestions, next-page recommendations, or analysis of where users refine, stall, or succeed.

Patent search result

The bundled patent triage script searched Google Patents for PrefixSpan, GSP, Generalized Sequential Pattern, sequential pattern mining, and sequence pattern mining, filtered to Google and Bing assignees, US patents, English, and priority dates after January 1, 2000. It checked two result pages per assignee. The script saw 205 broad candidates and kept zero likely relevant patents for those exact terms.

That means this explainer uses an application-backed search tie rather than a patent-backed one. The limit matters: a patent could describe a related idea using different language, such as session mining, query reformulation, click-path modeling, or pattern discovery, without naming PrefixSpan or GSP.

Takeaways

Sequential pattern mining finds ordered subsequences with enough support across sessions.
GSP is the level-wise candidate-generation approach: generate, scan, prune, repeat.
PrefixSpan keeps the same support goal but moves the work into prefix-projected databases.
For search and recommendation, the useful mental model is session mining: queries, clicks, page paths, and product journeys are sequences.

References

Agrawal, Srikant (1995). Mining Sequential Patterns. ICDE. https://rsrikant.com/papers/icde95.pdf
Srikant, Agrawal (1996). Mining Sequential Patterns: Generalizations and Performance Improvements. EDBT. https://rsrikant.com/papers/edbt96_rj.pdf
Pei et al. (2001). PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE. https://www.cs.sfu.ca/~jpei/publications/span.pdf
Apache Spark. Frequent Pattern Mining - PrefixSpan. https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
Zhang, Nasraoui (2006). Mining Search Engine Query Logs for Query Recommendation. WWW. https://ir.webis.de/anthology/2006.wwwconf_conference-2006.194/
Silvestri. Mining Query Logs: Turning Search Usage Data into Knowledge. https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/paper.pdf
Niranjan, Subramanyam, Khanaa (2010). Developing a Web Recommendation System Based on Closed Sequential Patterns. ICT. https://link.springer.com/chapter/10.1007/978-3-642-15766-0_25