Topic Analysis
Lociator uses AI-powered topic analysis to automatically group your pages into topical clusters and generate meaningful topic names. This provides a content-level view of your site, revealing topical relationships beyond just link structure.
Overview
Topic analysis is automatically triggered after a crawl completes. The system extracts text content from each page, generates vector embeddings, clusters similar pages together, names each cluster using an LLM, and builds a hierarchical topic tree.
Analysis Pipeline
The full pipeline runs in 8 sequential steps:
- Clear existing data — Removes any previous topics and Pinecone vectors for the crawl (allows re-triggering).
- Fetch pages — Loads all pages with extracted text from Supabase.
- Filter pages — Removes short content (<50 chars) and date archive pages.
- Generate embeddings — Vectorizes page text using Pinecone Inference API.
- K-Means clustering — Groups similar vectors into topical clusters.
- LLM topic naming — Generates a concise name for each cluster.
- Hierarchy generation — LLM organizes topics into a parent-child tree.
- Score recalculation — Updates silo and cross-silo architecture scores.
Content Extraction & Filtering
Content is extracted during the crawl phase by the HtmlParserService using smart noise removal (up to 2,000 characters per page). Before analysis, pages are filtered:
- Pages with fewer than 50 characters of extracted text are excluded.
- Date archive pages are excluded — WordPress-style archive URLs (e.g., /2023/09/) and pages with date-only titles (e.g., "Tháng 9 2023", "December 2021").
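The filter above can be sketched as a single predicate. The exact patterns are assumptions inferred from the examples in this doc (the `Page` shape and both regexes are illustrative, not Lociator's actual code):

```typescript
// Sketch of the pre-analysis page filter.
interface Page { url: string; title: string; text: string }

const MIN_TEXT_LENGTH = 50;
// WordPress-style date archives: /2023/, /2023/09/, /2023/09/05/ (assumed pattern)
const DATE_ARCHIVE_PATH = /\/\d{4}(\/\d{1,2}){0,2}\/?$/;
// Date-only titles such as "December 2021" or "Tháng 9 2023" (assumed pattern)
const DATE_ONLY_TITLE = /^\s*\p{L}+\s+(\d{1,2}\s+)?\d{4}\s*$/u;

function isAnalyzable(page: Page): boolean {
  if (page.text.length < MIN_TEXT_LENGTH) return false; // too short to embed meaningfully
  if (DATE_ARCHIVE_PATH.test(page.url)) return false;   // date archive URL
  if (DATE_ONLY_TITLE.test(page.title)) return false;   // date-only title
  return true;
}
```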
Vector Embeddings
Page text is vectorized using the Pinecone Inference API with the multilingual-e5-large model (1,024 dimensions, cosine similarity).
- Embedding model supports multilingual content — works with any language.
- Batch size: 96 inputs per API call (model limit).
- Input type: passage with truncate: END.
- Vectors are stored in a Pinecone serverless index for the crawl.
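Because the model accepts at most 96 inputs per call, page texts have to be chunked before embedding. A minimal batching sketch, where `embedBatch` stands in for the real Pinecone client call (the helper names are illustrative):

```typescript
// Pinecone's multilingual-e5-large accepts at most 96 inputs per embed call.
const BATCH_SIZE = 96;

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all page texts batch by batch, preserving order.
// `embedBatch` is a placeholder for the actual API call
// (input_type "passage", truncate "END").
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```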
K-Means Clustering
Clusters are formed using K-Means++ on the embedding vectors:
// Dynamic K calculation
k = min(20, max(1, floor(validPages / 5)))
// Examples:
// 5 pages → 1 cluster
// 25 pages → 5 clusters
// 100 pages → 20 clusters (capped)
// 500 pages → 20 clusters (capped)
Each page is assigned to exactly one cluster based on vector similarity.
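The dynamic-K rule translates directly into a one-line function:

```typescript
// K grows with page count (one cluster per ~5 pages),
// floored at 1 and capped at 20.
function dynamicK(validPages: number): number {
  return Math.min(20, Math.max(1, Math.floor(validPages / 5)));
}
```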
LLM Topic Naming
Each cluster is named by an LLM using JSON structured output. The top 5 page titles from each cluster provide naming context:
// LLM prompt (simplified)
"Given these page titles, identify a concise topic name (1-3 words)."
// Response format (JSON schema)
{"topic": "<topic name>"}
// Extraction with robust fallback:
// 1. Parse JSON → extract "topic" field
// 2. Regex fallback → extract short phrase
// 3. Fallback → "Topic Cluster N"
Key naming rules enforced by the prompt:
- Language matching: Topic name must be in the same language as the page titles.
- Deduplication: Previously used topic names are passed to the LLM with instructions not to reuse them.
- Configurable length: Users can set topic name length to 1–3, 2–4, or 5–6 words in Settings.
- Safety net: If the LLM returns a duplicate name, a cluster suffix is appended.
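The three-tier extraction can be sketched as follows. The fallback regex is an assumption for illustration; Lociator's actual parsing may differ:

```typescript
// Extract a topic name from raw LLM output with graceful degradation.
function extractTopicName(raw: string, clusterIndex: number): string {
  // 1. Happy path: parse JSON and read the "topic" field.
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.topic === "string" && parsed.topic.trim() !== "") {
      return parsed.topic.trim();
    }
  } catch {
    // malformed JSON; fall through to the regex
  }
  // 2. Regex fallback: pull a short "topic": "…" phrase out of malformed output.
  const match = raw.match(/"topic"\s*:\s*"([^"]{1,60})"/);
  if (match) return match[1];
  // 3. Last resort: deterministic placeholder name.
  return `Topic Cluster ${clusterIndex}`;
}
```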
Topic Hierarchy
After naming, a second LLM call organizes the topics into a hierarchical tree:
- One topic is selected as the root (overarching theme).
- All other topics are assigned a parent topic.
- The hierarchy is stored via parent_id references in the topics table.
- The frontend renders this as a waterfall tree layout in the Topic Browser.
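A minimal sketch of how the flat parent_id rows could be assembled into a render tree (the row and node shapes are assumptions, not the actual frontend types):

```typescript
// Flat row as stored in the topics table (simplified).
interface TopicRow { id: number; name: string; parent_id: number | null }
// Nested node as a tree renderer would consume it.
interface TopicNode { name: string; children: TopicNode[] }

function buildTree(rows: TopicRow[]): TopicNode | null {
  const nodes = new Map<number, TopicNode>();
  for (const row of rows) nodes.set(row.id, { name: row.name, children: [] });

  let root: TopicNode | null = null;
  for (const row of rows) {
    const node = nodes.get(row.id)!;
    if (row.parent_id === null) {
      root = node; // the overarching theme
    } else {
      nodes.get(row.parent_id)?.children.push(node); // attach to parent
    }
  }
  return root;
}
```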
Topic Edges
The system also calculates inter-topic links by aggregating page-level links:
- For each link between two pages in different topics, a topic-level edge is created.
- Edges are weighted by the number of page-level links between the two topics.
- Stored in the topic_links table with source_topic_id, target_topic_id, and weight.
- Visualized as connections between topic nodes in the graph.
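The aggregation step can be sketched as follows, given a page→topic assignment. Types and names here are illustrative, not Lociator's internals:

```typescript
interface PageLink { from: string; to: string }
interface TopicEdge { source_topic_id: number; target_topic_id: number; weight: number }

// Roll page-level links up into weighted, directed topic-level edges.
function aggregateTopicEdges(
  links: PageLink[],
  pageTopic: Map<string, number> // page URL → topic id
): TopicEdge[] {
  const edges = new Map<string, TopicEdge>();
  for (const { from, to } of links) {
    const src = pageTopic.get(from);
    const dst = pageTopic.get(to);
    // Skip unassigned pages and intra-topic links; only cross-topic
    // links become edges.
    if (src === undefined || dst === undefined || src === dst) continue;
    const key = `${src}->${dst}`;
    const edge = edges.get(key) ?? { source_topic_id: src, target_topic_id: dst, weight: 0 };
    edge.weight += 1; // weight = number of page-level links between the topics
    edges.set(key, edge);
  }
  return [...edges.values()];
}
```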
Impact on Scoring
After topic analysis, two architecture sub-scores are recalculated using topic-aware metrics instead of URL-based silos:
| Score | Metric | Ideal |
|---|---|---|
| Silo Score (20%) | Topic cohesion — % of links staying within the same topic | ≥ 60% intra-topic = 100 |
| Cross-Silo Score (10%) | Cross-topic linking — % of links crossing topic boundaries | 15–30% = 100 |
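The table only specifies the ideal bands (≥ 60% intra-topic and 15–30% cross-topic both map to 100); how scores scale outside those bands is not documented here. The linear ramps below are purely an assumption to make the bands concrete, not Lociator's actual formula:

```typescript
// Silo Score: 100 at ≥60% intra-topic links; assumed linear ramp below that.
function siloScore(intraTopicPct: number): number {
  return intraTopicPct >= 60 ? 100 : Math.round((intraTopicPct / 60) * 100);
}

// Cross-Silo Score: 100 inside the 15–30% band; assumed linear
// ramps on either side (down to 0 at 0% and at 100% cross-topic).
function crossSiloScore(crossTopicPct: number): number {
  if (crossTopicPct >= 15 && crossTopicPct <= 30) return 100;
  if (crossTopicPct < 15) return Math.round((crossTopicPct / 15) * 100);
  return Math.round(Math.max(0, 100 - ((crossTopicPct - 30) / 70) * 100));
}
```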
LLM Providers
Two LLM backends are supported via the unified callLLM helper:
| Provider | Default Model | Output Format |
|---|---|---|
| Hyperbolic (default) | Meta-Llama-3.1-8B-Instruct | OpenAI-compatible JSON mode |
| Google Gemini | gemini-2.5-flash | Native Gemini API with responseMimeType: application/json |
Users select their provider in Settings and provide their own API key for Gemini.
Configuration
Topic analysis is configurable via user settings:
- LLM Provider: Choose between Hyperbolic (default) or Google Gemini.
- Gemini API Key: Required when using the Gemini provider.
- Gemini Model: Select which Gemini model to use (default: gemini-2.5-flash).
- Topic Name Length: 1–3 words (concise), 2–4 words, or 5–6 words (descriptive).