Topic Analysis
Lociator uses AI-powered topic analysis to automatically group your pages into topical clusters and generate meaningful topic names. This provides a content-level view of your site, revealing topical relationships beyond just link structure.
Overview
Topic analysis is automatically triggered after a crawl completes. The system extracts text content from each page, generates vector embeddings, clusters similar pages together, names each cluster using an LLM, and builds a hierarchical topic tree.
Analysis Pipeline
The full pipeline runs in 8 sequential steps:
- Clear existing data — Removes any previous topics and Pinecone vectors for the crawl (allows re-triggering).
- Fetch pages — Loads all pages with extracted text from Supabase.
- Filter pages — Removes short content (<50 chars) and date archive pages.
- Generate embeddings — Vectorizes page text using Pinecone Inference API.
- K-Means clustering — Groups similar vectors into topical clusters.
- LLM topic naming — Generates a concise name for each cluster.
- Hierarchy generation — LLM organizes topics into a parent-child tree.
- Score recalculation — Updates silo and cross-silo architecture scores.
Content Extraction & Filtering
Content is extracted during the crawl phase by the HtmlParserService using smart noise removal (up to 2,000 characters per page). Before analysis, pages are filtered:
- Pages with fewer than 50 characters of extracted text are excluded.
- Date archive pages are excluded — WordPress-style archive URLs (e.g., /2023/09/) and pages with date-only titles (e.g., "Tháng 9 2023", "December 2021").
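The filter above can be sketched as a single predicate. The exact patterns are assumptions inferred from the examples in this doc (the `Page` shape and both regexes are illustrative, not Lociator's actual code):

```typescript
// Sketch of the pre-analysis page filter.
interface Page { url: string; title: string; text: string }

const MIN_TEXT_LENGTH = 50;
// WordPress-style date archives: /2023/, /2023/09/, /2023/09/05/ (assumed pattern)
const DATE_ARCHIVE_PATH = /\/\d{4}(\/\d{1,2}){0,2}\/?$/;
// Date-only titles such as "December 2021" or "Tháng 9 2023" (assumed pattern)
const DATE_ONLY_TITLE = /^\s*\p{L}+\s+(\d{1,2}\s+)?\d{4}\s*$/u;

function isAnalyzable(page: Page): boolean {
  if (page.text.length < MIN_TEXT_LENGTH) return false; // too short to embed meaningfully
  if (DATE_ARCHIVE_PATH.test(page.url)) return false;   // date archive URL
  if (DATE_ONLY_TITLE.test(page.title)) return false;   // date-only title
  return true;
}
```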
Vector Embeddings
Page text is vectorized using the Pinecone Inference API with the multilingual-e5-large model (1,024 dimensions, cosine similarity).
- Embedding model supports multilingual content — works with any language.
- Batch size: 96 inputs per API call (model limit).
- Input type: passage with truncate: END.
- Vectors are stored in a Pinecone serverless index for the crawl.
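Because the model accepts at most 96 inputs per call, page texts have to be chunked before embedding. A minimal batching sketch, where `embedBatch` stands in for the real Pinecone client call (the helper names are illustrative):

```typescript
// Pinecone's multilingual-e5-large accepts at most 96 inputs per embed call.
const BATCH_SIZE = 96;

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all page texts batch by batch, preserving order.
// `embedBatch` is a placeholder for the actual API call
// (input_type "passage", truncate "END").
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```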
K-Means Clustering
Clusters are formed using K-Means++ on the embedding vectors:
// Dynamic K calculation
k = min(20, max(1, floor(validPages / 5)))
// Examples:
// 5 pages → 1 cluster
// 25 pages → 5 clusters
// 100 pages → 20 clusters (capped)
// 500 pages → 20 clusters (capped)
Each page is assigned to exactly one cluster based on vector similarity.
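The dynamic-K rule translates directly into a one-line function:

```typescript
// K grows with page count (one cluster per ~5 pages),
// floored at 1 and capped at 20.
function dynamicK(validPages: number): number {
  return Math.min(20, Math.max(1, Math.floor(validPages / 5)));
}
```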
LLM Topic Naming
Each cluster is named by an LLM using JSON structured output. The top 5 page titles from each cluster provide naming context:
// LLM prompt (simplified)
"Given these page titles, identify a concise topic name (1-3 words)."
// Response format (JSON schema)
{"topic": "<topic name>"}
// Extraction with robust fallback:
// 1. Parse JSON → extract "topic" field
// 2. Regex fallback → extract short phrase
// 3. Fallback → "Topic Cluster N"
Key naming rules enforced by the prompt:
- Language matching: Topic name must be in the same language as the page titles.
- Deduplication: Previously used topic names are passed to the LLM with instructions not to reuse them.
- Configurable length: Users can set topic name length to 1–3, 2–4, or 5–6 words in Settings.
- Safety net: If the LLM returns a duplicate name, a cluster suffix is appended.
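The three-tier extraction can be sketched as follows. The fallback regex is an assumption for illustration; Lociator's actual parsing may differ:

```typescript
// Extract a topic name from raw LLM output with graceful degradation.
function extractTopicName(raw: string, clusterIndex: number): string {
  // 1. Happy path: parse JSON and read the "topic" field.
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.topic === "string" && parsed.topic.trim() !== "") {
      return parsed.topic.trim();
    }
  } catch {
    // malformed JSON; fall through to the regex
  }
  // 2. Regex fallback: pull a short "topic": "…" phrase out of malformed output.
  const match = raw.match(/"topic"\s*:\s*"([^"]{1,60})"/);
  if (match) return match[1];
  // 3. Last resort: deterministic placeholder name.
  return `Topic Cluster ${clusterIndex}`;
}
```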
Topic Hierarchy
After naming, a second LLM call organizes the topics into a hierarchical tree:
- One topic is selected as the root (overarching theme).
- All other topics are assigned a parent topic.
- The hierarchy is stored via parent_id references in the topics table.
- The frontend renders this as a waterfall tree layout in the Topic Browser.
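A minimal sketch of how the flat parent_id rows could be assembled into a render tree (the row and node shapes are assumptions, not the actual frontend types):

```typescript
// Flat row as stored in the topics table (simplified).
interface TopicRow { id: number; name: string; parent_id: number | null }
// Nested node as a tree renderer would consume it.
interface TopicNode { name: string; children: TopicNode[] }

function buildTree(rows: TopicRow[]): TopicNode | null {
  const nodes = new Map<number, TopicNode>();
  for (const row of rows) nodes.set(row.id, { name: row.name, children: [] });

  let root: TopicNode | null = null;
  for (const row of rows) {
    const node = nodes.get(row.id)!;
    if (row.parent_id === null) {
      root = node; // the overarching theme
    } else {
      nodes.get(row.parent_id)?.children.push(node); // attach to parent
    }
  }
  return root;
}
```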
Topic Edges
The system also calculates inter-topic links by aggregating page-level links:
- For each link between two pages in different topics, a topic-level edge is created.
- Edges are weighted by the number of page-level links between the two topics.
- Stored in the topic_links table with source_topic_id, target_topic_id, and weight.
- Visualized as connections between topic nodes in the graph.
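The aggregation step can be sketched as follows, given a page→topic assignment. Types and names here are illustrative, not Lociator's internals:

```typescript
interface PageLink { from: string; to: string }
interface TopicEdge { source_topic_id: number; target_topic_id: number; weight: number }

// Roll page-level links up into weighted, directed topic-level edges.
function aggregateTopicEdges(
  links: PageLink[],
  pageTopic: Map<string, number> // page URL → topic id
): TopicEdge[] {
  const edges = new Map<string, TopicEdge>();
  for (const { from, to } of links) {
    const src = pageTopic.get(from);
    const dst = pageTopic.get(to);
    // Skip unassigned pages and intra-topic links; only cross-topic
    // links become edges.
    if (src === undefined || dst === undefined || src === dst) continue;
    const key = `${src}->${dst}`;
    const edge = edges.get(key) ?? { source_topic_id: src, target_topic_id: dst, weight: 0 };
    edge.weight += 1; // weight = number of page-level links between the topics
    edges.set(key, edge);
  }
  return [...edges.values()];
}
```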
Impact on Scoring
After topic analysis, two architecture sub-scores are recalculated using topic-aware metrics instead of URL-based silos:
| Score | Metric | Ideal |
|---|---|---|
| Silo Score (20%) | Topic cohesion — % of links staying within the same topic | ≥ 60% intra-topic = 100 |
| Cross-Silo Score (10%) | Cross-topic linking — % of links crossing topic boundaries | 15–30% = 100 |
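The table only specifies the ideal bands (≥ 60% intra-topic and 15–30% cross-topic both map to 100); how scores scale outside those bands is not documented here. The linear ramps below are purely an assumption to make the bands concrete, not Lociator's actual formula:

```typescript
// Silo Score: 100 at ≥60% intra-topic links; assumed linear ramp below that.
function siloScore(intraTopicPct: number): number {
  return intraTopicPct >= 60 ? 100 : Math.round((intraTopicPct / 60) * 100);
}

// Cross-Silo Score: 100 inside the 15–30% band; assumed linear
// ramps on either side (down to 0 at 0% and at 100% cross-topic).
function crossSiloScore(crossTopicPct: number): number {
  if (crossTopicPct >= 15 && crossTopicPct <= 30) return 100;
  if (crossTopicPct < 15) return Math.round((crossTopicPct / 15) * 100);
  return Math.round(Math.max(0, 100 - ((crossTopicPct - 30) / 70) * 100));
}
```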
LLM Providers
Two LLM backends are supported via the unified callLLM helper:
| Provider | Default Model | Output Format |
|---|---|---|
| Hyperbolic (default) | Meta-Llama-3.1-8B-Instruct | OpenAI-compatible JSON mode |
| Google Gemini | gemini-2.5-flash | Native Gemini API with responseMimeType: application/json |
Users select their provider in Settings and provide their own API key for Gemini.
Configuration
Topic analysis is configurable via user settings:
- LLM Provider: Choose between Hyperbolic (default) or Google Gemini.
- Gemini API Key: Required when using the Gemini provider.
- Gemini Model: Select which Gemini model to use (default: gemini-2.5-flash).
- Topic Name Length: 1–3 words (concise), 2–4 words, or 5–6 words (descriptive).