Lociator v1.0.7

Crawl Engine

The crawl engine is the core of Lociator. Built with NestJS and deployed as a stateless service behind Traefik, it uses a 3-phase crawl architecture that combines BFS link-following with sitemap discovery to build a complete and accurate graph — including orphan page detection.

Overview

Each crawl job is received via an Upstash QStash webhook and processed in three sequential phases:

  1. Phase 1 — BFS: Standard breadth-first crawl following internal links.
  2. Phase 2 — Sitemap Discovery: Fetches sitemap.xml and enqueues URLs not found by BFS.
  3. Phase 3 — Orphan Expansion: Continues BFS for sitemap-discovered pages, following their outgoing links.
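The three phases compose as a single queue that is drained twice. Below is a minimal sketch of that flow; the names (`runBfs`, `crawl`, `LinkResolver`) are illustrative rather than the actual CrawlService API, the link resolver is synchronous for brevity, and batching, URL normalization, and politeness delays are omitted:

```typescript
type QueueItem = { url: string; depth: number; parent: string | null };
type Page = QueueItem & { outLinks: string[] };
// A synchronous link resolver stands in for the real fetch-and-parse step.
type LinkResolver = (url: string) => string[];

function runBfs(
  queue: QueueItem[],
  visited: Set<string>,
  resolve: LinkResolver,
  maxPages: number,
  pages: Page[],
): void {
  while (queue.length > 0 && pages.length < maxPages) {
    const item = queue.shift()!;
    const outLinks = resolve(item.url);
    pages.push({ ...item, outLinks });
    for (const link of outLinks) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: item.depth + 1, parent: item.url });
      }
    }
  }
}

function crawl(
  rootUrl: string,
  resolve: LinkResolver,
  sitemapUrls: string[],
  maxPages = 1000,
): Page[] {
  const visited = new Set<string>([rootUrl]);
  const queue: QueueItem[] = [{ url: rootUrl, depth: 0, parent: null }];
  const pages: Page[] = [];

  // Phase 1: BFS from the root URL.
  runBfs(queue, visited, resolve, maxPages, pages);
  const maxBfsDepth = Math.max(0, ...pages.map((p) => p.depth));

  // Phase 2: enqueue sitemap URLs the BFS never reached, with no parent.
  for (const url of sitemapUrls) {
    if (!visited.has(url)) {
      visited.add(url);
      queue.push({ url, depth: maxBfsDepth + 1, parent: null });
    }
  }

  // Phase 3: resume BFS so children of orphan pages are crawled too.
  runBfs(queue, visited, resolve, maxPages, pages);
  return pages;
}
```

Note how a sitemap-only page enters with `parent: null`, while its children get a real parent and a correct depth in Phase 3.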

3-Phase Crawl Architecture

Phase 1 — BFS Crawl

The crawler starts from the root URL and uses Breadth-First Search to discover all reachable pages by following internal links. Each page tracks its BFS parent for hierarchical analysis. This phase builds the core graph structure.

// Phase 1: BFS (simplified from CrawlService)
queue = [{ url: rootUrl, depth: 0, parent: null }]
visited = new Set([normalizedRoot])
pages = []

while (queue.length > 0 && pages.length < maxPages) {
  batch = queue.splice(0, CONCURRENCY)  // 5 at a time
  outcomes = await Promise.allSettled(batch.map(fetchAndParse))

  for (outcome of outcomes) {
    if (outcome.status !== 'fulfilled') continue  // failed pages are logged, not fatal
    page = outcome.value
    pages.push(page)
    for (link of page.outLinks) {
      if (!visited.has(link.normalizedUrl)) {
        visited.add(link.normalizedUrl)
        queue.push({ url: link.url, depth: page.depth + 1, parent: page.normalizedUrl })
      }
    }
  }
  await delay(200)  // Polite inter-batch delay
}

Phase 2 — Sitemap Discovery

After BFS finishes, the SitemapParserService fetches and parses the site's sitemap. URLs found in the sitemap but not visited during BFS are added to the queue at depth = maxBfsDepth + 1 with no parent page. This is how orphan pages are discovered.

  • Fetches /sitemap.xml with fallback to /sitemap_index.xml
  • Recursively handles sitemap index files (max depth 3)
  • Filters internal-only, non-asset URLs
  • Normalizes and deduplicates all URLs
  • Caps at 10,000 URLs, 5-second timeout
  • Silently returns empty set on any error (sitemap is optional)
ℹ️ Sitemap-discovered pages enter the queue with parentNormalizedUrl: null, so they have no BFS parent. Their in_degree depends entirely on whether other crawled pages link to them.
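The recursive index handling and limits above can be sketched as follows. The `<loc>` regex and synchronous fetcher are simplifications; the real SitemapParserService uses a proper XML parser and an HTTP client with a 5-second timeout:

```typescript
type XmlFetcher = (url: string) => string; // throws on HTTP/network error

const MAX_SITEMAP_DEPTH = 3;
const MAX_SITEMAP_URLS = 10_000;

function collectSitemapUrls(
  url: string,
  fetchXml: XmlFetcher,
  out: Set<string> = new Set(),
  depth = 0,
): Set<string> {
  if (depth > MAX_SITEMAP_DEPTH || out.size >= MAX_SITEMAP_URLS) return out;
  let xml: string;
  try {
    xml = fetchXml(url);
  } catch {
    return out; // sitemap is optional: errors yield an empty or partial set
  }
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
  if (/<sitemapindex/i.test(xml)) {
    // A sitemap index lists child sitemaps, not pages: recurse into each.
    for (const child of locs) collectSitemapUrls(child, fetchXml, out, depth + 1);
  } else {
    for (const loc of locs) {
      if (out.size >= MAX_SITEMAP_URLS) break;
      out.add(loc); // normalization/dedup of each URL happens downstream
    }
  }
  return out;
}
```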

Phase 3 — Orphan Expansion

The BFS loop runs again for the newly enqueued sitemap URLs. If an orphan page contains links to other new pages, those pages get normal BFS treatment at depth + 1. This ensures child pages of orphans are crawled with correct depth values rather than being lost.

Concurrency & Politeness

| Parameter         | Value               | Purpose                              |
| ----------------- | ------------------- | ------------------------------------ |
| Concurrency       | 5 parallel requests | Fetch multiple pages per batch       |
| Inter-batch delay | 200 ms              | Prevent overwhelming target servers  |
| Request timeout   | 10 seconds          | Skip unresponsive pages              |
| Max redirects     | 5                   | Follow redirect chains to the final URL |

URL Normalization

The UrlNormalizerService normalizes URLs before comparison:

  • Lowercases the hostname
  • Removes trailing slashes (except root /)
  • Removes URL fragments (#section)
  • Strips tracking parameters: utm_*, fbclid, gclid, msclkid, mc_cid, mc_eid, ref, _ga, _gl
  • Removes default ports (80 for HTTP, 443 for HTTPS)
  • Sorts remaining query parameters for consistency
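These rules map closely onto the WHATWG URL API, which already lowercases hostnames and drops default ports. A sketch of the normalization (the actual UrlNormalizerService implementation may differ in detail):

```typescript
const TRACKING_PARAMS = new Set([
  "fbclid", "gclid", "msclkid", "mc_cid", "mc_eid", "ref", "_ga", "_gl",
]);

function normalizeUrl(raw: string): string {
  const u = new URL(raw); // lowercases the hostname, drops default ports
  u.hash = ""; // remove #fragment
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      u.searchParams.delete(key);
    }
  }
  u.searchParams.sort(); // consistent query-parameter order
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1); // trailing slash (except root /)
  }
  return u.toString();
}
```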

HTML Parsing

The HtmlParserService uses Cheerio with a two-phase approach:

  1. Phase 1 — Link Discovery (full DOM): Extracts ALL <a href> links before noise removal, including nav, header, and footer links critical for BFS.
  2. Phase 2 — Content Extraction (smart noise removal): Removes scripts, styles, nav, footer, sidebars, widgets, and aria-hidden elements. Extracts up to 2,000 characters of clean text for topic analysis.

Page titles are cleaned by stripping trailing site names (e.g., "About Us | MySite" → "About Us").
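A sketch of that title cleanup; the exact set of separator characters Lociator recognizes is an assumption here:

```typescript
// Strip one trailing " | Site" / " - Site" style suffix from a page title.
// Separators must be surrounded by whitespace so hyphenated words survive.
function cleanTitle(title: string): string {
  return title.replace(/\s+[|\-–—·]\s+[^|\-–—·]*$/, "").trim();
}
```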

Links pass through multiple filters before entering the BFS queue:

  • Skip non-navigable: mailto:, tel:, javascript:, data:, ftp:, file:, fragments.
  • Skip nofollow: Links with rel="nofollow" are excluded.
  • Internal only: Only links with the same origin as the root URL are followed.
  • Skip assets: 40+ file extensions are filtered (images, videos, fonts, documents, code).
  • Deduplicate: Each unique href is processed only once per page.
  • Exclude patterns: User-configured URL patterns are applied.

Exclude Patterns

Configure global exclude patterns in Settings. Comma-separated patterns with wildcard (*) support, matched against the full URL, pathname, or pathname+search:

/admin/*
/wp-json/*
*.pdf
/tag/*
/author/*
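One way to implement this matching is to translate each pattern into a regular expression, with `*` as a wildcard; the escaping details below are an assumption, not Lociator's exact matcher:

```typescript
// Test a URL against comma-separated wildcard patterns, matching the full
// URL, the pathname, and pathname+search, as described above.
function matchesExclude(url: string, patterns: string): boolean {
  const { pathname, search } = new URL(url);
  return patterns
    .split(",")
    .map((p) => p.trim())
    .filter(Boolean)
    .some((pattern) => {
      const re = new RegExp(
        "^" +
          pattern
            .split("*")
            .map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex chars
            .join(".*") + // each * becomes .*
          "$",
      );
      return re.test(url) || re.test(pathname) || re.test(pathname + search);
    });
}
```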

Page Limits

| Plan     | Max Pages | Crawls/Month |
| -------- | --------- | ------------ |
| Free     | 50        | 5            |
| Starter  | 200       | 20           |
| Pro      | 1,000     | Unlimited    |
| Advanced | 5,000     | Unlimited    |
| Premium  | 25,000    | Unlimited    |

Error Handling

  • HTTP 4xx/5xx: Rejected by validateStatus.
  • Redirects (3xx): Followed automatically up to 5 hops.
  • Timeouts: Pages exceeding 10 seconds are recorded as errors.
  • Non-HTML responses: Pages without text/html content-type return empty outLinks.
  • Job failure: If the crawl fails, job status is set to failed with an error message.

Errors are logged but don't stop the crawl — the BFS continues processing remaining pages.
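The `validateStatus`, timeout, and redirect behavior described above correspond naturally to an axios client configuration; the exact options Lociator passes are an assumption in this sketch:

```typescript
import axios from "axios";

// Hypothetical HTTP client config mirroring the limits described above.
const http = axios.create({
  timeout: 10_000,                             // 10 s: slower pages are recorded as errors
  maxRedirects: 5,                             // follow up to 5 redirect hops
  validateStatus: (s) => s >= 200 && s < 300,  // treat 4xx/5xx as rejected promises
});
```

Because each batch is awaited with Promise.allSettled, a rejected request surfaces as a per-page error record rather than aborting the crawl.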