v1.0.7
Lociator

Topic Analysis

Lociator uses AI-powered topic analysis to automatically group your pages into topical clusters and generate meaningful topic names. This provides a content-level view of your site, revealing topical relationships beyond just link structure.

Overview

Topic analysis is automatically triggered after a crawl completes. The system extracts text content from each page, generates vector embeddings, clusters similar pages together, names each cluster using an LLM, and builds a hierarchical topic tree.

Analysis Pipeline

The full pipeline runs in 8 sequential steps:

  1. Clear existing data — Removes any previous topics and Pinecone vectors for the crawl (allows re-triggering).
  2. Fetch pages — Loads all pages with extracted text from Supabase.
  3. Filter pages — Removes short content (<50 chars) and date archive pages.
  4. Generate embeddings — Vectorizes page text using Pinecone Inference API.
  5. K-Means clustering — Groups similar vectors into topical clusters.
  6. LLM topic naming — Generates a concise name for each cluster.
  7. Hierarchy generation — LLM organizes topics into a parent-child tree.
  8. Score recalculation — Updates silo and cross-silo architecture scores.
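The eight steps above run strictly in order, since each step consumes the previous step's output. A minimal sketch of such a sequential runner (the function shape and step names are illustrative, not Lociator's actual code):

```typescript
// Illustrative sequential pipeline runner — not the real implementation.
type Step = { name: string; run: (crawlId: string) => void };

function runTopicAnalysis(crawlId: string, steps: Step[]): string[] {
  const completed: string[] = [];
  for (const step of steps) {
    step.run(crawlId); // each step depends on the previous one's output
    completed.push(step.name);
  }
  return completed;
}
```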

Content Extraction & Filtering

Content is extracted during the crawl phase by the HtmlParserService using smart noise removal (up to 2,000 characters per page). Before analysis, pages are filtered:

  • Pages with fewer than 50 characters of extracted text are excluded.
  • Date archive pages are excluded — WordPress-style archive URLs (/2023/09/) and pages with date-only titles (e.g., "Tháng 9 2023", "December 2021").
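The filter can be sketched as a single predicate. Everything here (interface shape, function name, exact regexes) is an assumption for illustration, not Lociator's actual code:

```typescript
// Hypothetical pre-analysis page filter, matching the rules described above.
interface Page {
  url: string;
  title: string;
  extractedText: string;
}

const DATE_ARCHIVE_PATH = /\/\d{4}\/\d{2}\/?$/; // e.g. /2023/09/
const DATE_ONLY_TITLE = /^(Tháng \d{1,2}|[A-Z][a-z]+)\s+\d{4}$/; // "Tháng 9 2023", "December 2021"

function isAnalyzable(page: Page): boolean {
  if (page.extractedText.length < 50) return false; // too little content
  if (DATE_ARCHIVE_PATH.test(page.url)) return false; // WordPress-style date archive
  if (DATE_ONLY_TITLE.test(page.title)) return false; // date-only title
  return true;
}
```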

Vector Embeddings

Page text is vectorized using Pinecone Inference API with the multilingual-e5-large model (1,024 dimensions, cosine similarity).

  • Embedding model supports multilingual content — works with any language.
  • Batch size: 96 inputs per API call (model limit).
  • Input type: passage with truncate: END.
  • Vectors are stored in a Pinecone serverless index for the crawl.
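Because the model accepts at most 96 inputs per call, page texts must be batched before embedding. A sketch of the batching (the helper and the commented-out Pinecone call are simplified assumptions):

```typescript
// Split inputs into batches of at most 96 for the embedding API.
const EMBED_BATCH_SIZE = 96;

function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical usage against the Pinecone client:
// for (const batch of chunk(pageTexts, EMBED_BATCH_SIZE)) {
//   await pinecone.inference.embed("multilingual-e5-large", batch, {
//     inputType: "passage",
//     truncate: "END",
//   });
// }
```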

K-Means Clustering

Clusters are formed using K-Means++ on the embedding vectors:

// Dynamic K calculation
k = min(20, max(1, floor(validPages / 5)))

// Examples:
//   5 pages  → 1 cluster
//  25 pages  → 5 clusters
// 100 pages  → 20 clusters (capped)
// 500 pages  → 20 clusters (max)

Each page is assigned to exactly one cluster based on vector similarity.
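The dynamic-K formula and the nearest-centroid assignment can be written out as runnable helpers (illustrative names; the dot product equals cosine similarity only under the assumption that vectors are normalized):

```typescript
// K grows with page count: one cluster per ~5 pages, between 1 and 20.
function dynamicK(validPages: number): number {
  return Math.min(20, Math.max(1, Math.floor(validPages / 5)));
}

// Assign a (normalized) vector to the most similar centroid.
function nearestCentroid(vec: number[], centroids: number[][]): number {
  let best = 0;
  let bestDot = -Infinity;
  centroids.forEach((c, i) => {
    const dot = c.reduce((sum, cj, j) => sum + cj * vec[j], 0);
    if (dot > bestDot) {
      bestDot = dot;
      best = i;
    }
  });
  return best;
}
```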

LLM Topic Naming

Each cluster is named by an LLM using JSON structured output. The top 5 page titles from each cluster provide naming context:

// LLM prompt (simplified)
"Given these page titles, identify a concise topic name (1-3 words)."

// Response format (JSON schema)
{"topic": "<topic name>"}

// Extraction with robust fallback:
// 1. Parse JSON → extract "topic" field
// 2. Regex fallback → extract short phrase
// 3. Fallback → "Topic Cluster N"

Key naming rules enforced by the prompt:

  • Language matching: Topic name must be in the same language as the page titles.
  • Deduplication: Previously used topic names are passed to the LLM with instructions not to reuse them.
  • Configurable length: Users can set topic name length to 1–3, 2–4, or 5–6 words in Settings.
  • Safety net: If the LLM returns a duplicate name, a cluster suffix is appended.
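The three-tier extraction fallback described above can be sketched as follows (function name, regex, and fallback phrasing are assumptions, not the actual implementation):

```typescript
// Hypothetical topic-name extraction with the three fallback tiers.
function extractTopicName(raw: string, clusterIndex: number): string {
  // 1. Try to parse the whole response as JSON and read the "topic" field.
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.topic === "string" && parsed.topic.trim()) {
      return parsed.topic.trim();
    }
  } catch {
    // Response was not clean JSON — fall through to the regex tier.
  }
  // 2. Regex fallback: pull a short quoted phrase out of messy output.
  const match = raw.match(/"topic"\s*:\s*"([^"]{1,60})"/);
  if (match) return match[1];
  // 3. Last resort: a generic per-cluster name.
  return `Topic Cluster ${clusterIndex + 1}`;
}
```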

Topic Hierarchy

After naming, a second LLM call organizes the topics into a hierarchical tree:

  • One topic is selected as the root (overarching theme).
  • All other topics are assigned a parent topic.
  • The hierarchy is stored via parent_id references in the topics table.
  • The frontend renders this as a waterfall tree layout in the Topic Browser.
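Rebuilding the tree from parent_id references is a single pass over the rows. The row and node shapes below are assumptions about the described schema, not the real topics table:

```typescript
// Hypothetical reconstruction of the topic tree from parent_id references.
interface TopicRow { id: number; name: string; parent_id: number | null }
interface TopicNode { id: number; name: string; children: TopicNode[] }

function buildTree(rows: TopicRow[]): TopicNode | null {
  const nodes = new Map<number, TopicNode>();
  for (const r of rows) nodes.set(r.id, { id: r.id, name: r.name, children: [] });
  let root: TopicNode | null = null;
  for (const r of rows) {
    const node = nodes.get(r.id)!;
    if (r.parent_id === null) root = node; // the overarching theme
    else nodes.get(r.parent_id)?.children.push(node);
  }
  return root;
}
```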

Topic Edges

The system also calculates inter-topic links by aggregating page-level links:

  • For each link between two pages in different topics, a topic-level edge is created.
  • Edges are weighted by the number of page-level links between the two topics.
  • Stored in the topic_links table with source_topic_id, target_topic_id, and weight.
  • Visualized as connections between topic nodes in the graph.
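The aggregation reduces to counting page-level links per ordered topic pair. Types and names below are assumptions based on the schema described above:

```typescript
// Hypothetical aggregation of page-level links into weighted topic edges.
interface PageLink { sourcePage: string; targetPage: string }
type TopicOf = Record<string, number>; // page URL → topic id

function aggregateTopicEdges(links: PageLink[], topicOf: TopicOf) {
  const weights = new Map<string, number>();
  for (const { sourcePage, targetPage } of links) {
    const s = topicOf[sourcePage];
    const t = topicOf[targetPage];
    if (s === undefined || t === undefined || s === t) continue; // skip intra-topic links
    const key = `${s}->${t}`;
    weights.set(key, (weights.get(key) ?? 0) + 1);
  }
  return [...weights.entries()].map(([key, weight]) => {
    const [source_topic_id, target_topic_id] = key.split("->").map(Number);
    return { source_topic_id, target_topic_id, weight };
  });
}
```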

Impact on Scoring

After topic analysis, two architecture sub-scores are recalculated using topic-aware metrics instead of URL-based silos:

  • Silo Score (20%) — Metric: topic cohesion, the % of links staying within the same topic. Ideal: ≥ 60% intra-topic = 100.
  • Cross-Silo Score (10%) — Metric: cross-topic linking, the % of links crossing topic boundaries. Ideal: 15–30% = 100.
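The input to both sub-scores is the share of links that stay inside their topic. A sketch of that ratio (types and names are illustrative; only the threshold comments come from the scoring rules above):

```typescript
// Share of links whose source and target fall in the same topic.
interface TopicLink { sourceTopic: number; targetTopic: number }

function intraTopicRatio(links: TopicLink[]): number {
  if (links.length === 0) return 0;
  const intra = links.filter((l) => l.sourceTopic === l.targetTopic).length;
  return intra / links.length;
}

// Silo Score:       100 when intraTopicRatio(links) ≥ 0.60
// Cross-Silo Score: 100 when the complementary cross-topic share is 15–30%
```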

LLM Providers

Two LLM backends are supported via the unified callLLM helper:

  • Hyperbolic (default) — default model: Meta-Llama-3.1-8B-Instruct; output format: OpenAI-compatible JSON mode.
  • Google Gemini — default model: gemini-2.5-flash; output format: native Gemini API with responseMimeType: application/json.

Users select their provider in Settings and provide their own API key for Gemini.
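The provider split mostly comes down to how the request is shaped. A hedged sketch of a dispatching helper — endpoint URLs and payloads are simplified assumptions, not Lociator's actual callLLM code:

```typescript
// Hypothetical request builder for the two supported providers.
type Provider = "hyperbolic" | "gemini";

function buildRequest(provider: Provider, prompt: string, model: string) {
  if (provider === "gemini") {
    // Native Gemini API with JSON output forced via responseMimeType.
    return {
      url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
      body: {
        contents: [{ parts: [{ text: prompt }] }],
        generationConfig: { responseMimeType: "application/json" },
      },
    };
  }
  // Hyperbolic: OpenAI-compatible chat completions with JSON mode.
  return {
    url: "https://api.hyperbolic.xyz/v1/chat/completions",
    body: {
      model,
      messages: [{ role: "user", content: prompt }],
      response_format: { type: "json_object" },
    },
  };
}
```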

Configuration

Topic analysis is configurable via user settings:

  • LLM Provider: Choose between Hyperbolic (default) or Google Gemini.
  • Gemini API Key: Required when using the Gemini provider.
  • Gemini Model: Select which Gemini model to use (default: gemini-2.5-flash).
  • Topic Name Length: 1–3 words (concise), 2–4 words, or 5–6 words (descriptive).
💡 Topic analysis can be re-triggered manually from the dashboard — it clears all existing topic data before re-running, so you can experiment with different LLM settings.