Crawl Engine
The crawl engine is the core of Lociator. Built with NestJS and deployed as a stateless service behind Traefik, it uses a 3-phase crawl architecture that combines BFS link-following with sitemap discovery to build a complete and accurate graph — including orphan page detection.
Overview
Each crawl job is received via an Upstash QStash webhook and processed in three sequential phases:
- Phase 1 — BFS: Standard breadth-first crawl following internal links.
- Phase 2 — Sitemap Discovery: Fetches sitemap.xml and enqueues URLs not found by BFS.
- Phase 3 — Orphan Expansion: Continues BFS for sitemap-discovered pages, following their outgoing links.
3-Phase Crawl Architecture
Phase 1 — BFS Crawl
The crawler starts from the root URL and uses Breadth-First Search to discover all reachable pages by following internal links. Each page tracks its BFS parent for hierarchical analysis. This phase builds the core graph structure.
```
// Phase 1: BFS (simplified from CrawlService)
queue = [{ url: rootUrl, depth: 0, parent: null }]
visited = new Set([normalizedRoot])
pages = []
while (queue.length > 0 && pages.length < maxPages) {
  batch = queue.splice(0, CONCURRENCY) // 5 at a time
  settled = await Promise.allSettled(batch.map(fetchAndParse))
  for (result of settled) {
    if (result.status !== 'fulfilled') continue // failed fetches are logged, not fatal
    page = result.value
    pages.push(page)
    for (link of page.outLinks) {
      if (!visited.has(link.normalizedUrl)) {
        visited.add(link.normalizedUrl)
        queue.push({ url: link.normalizedUrl, depth: page.depth + 1, parent: page.normalizedUrl })
      }
    }
  }
  await delay(200) // Polite inter-batch delay
}
```
Phase 2 — Sitemap Discovery
After BFS finishes, the `SitemapParserService` fetches and parses the site's sitemap. URLs found in the sitemap but not visited during BFS are added to the queue at `depth = maxBfsDepth + 1` with no parent page. This is how orphan pages are discovered.
- Fetches `/sitemap.xml` with fallback to `/sitemap_index.xml`
- Recursively handles sitemap index files (max depth 3)
- Filters internal-only, non-asset URLs
- Normalizes and deduplicates all URLs
- Caps at 10,000 URLs, 5-second timeout
- Silently returns empty set on any error (sitemap is optional)
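The enqueueing step for sitemap-only URLs can be sketched as follows. This is a minimal illustration under stated assumptions: the `QueueEntry` shape and the function name are hypothetical, not the service's actual code.

```typescript
// Hypothetical sketch of Phase 2 enqueueing: sitemap URLs the BFS never
// visited enter the queue one level past the deepest BFS depth, parentless.
interface QueueEntry {
  url: string;
  depth: number;
  parent: string | null;
}

function enqueueSitemapUrls(
  sitemapUrls: string[],   // already normalized and deduplicated
  visited: Set<string>,    // normalized URLs seen during BFS
  maxBfsDepth: number,
  queue: QueueEntry[],
): number {
  let added = 0;
  for (const url of sitemapUrls) {
    if (visited.has(url)) continue; // BFS already found this page
    visited.add(url);
    queue.push({ url, depth: maxBfsDepth + 1, parent: null }); // orphan candidate
    added++;
  }
  return added;
}
```

Marking these entries as parentless is what later distinguishes orphan candidates from normally discovered pages.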
Sitemap-discovered pages are enqueued with `parentNormalizedUrl: null`, so they have no BFS parent. Their `in_degree` depends entirely on whether other crawled pages link to them.
Phase 3 — Orphan Expansion
The BFS loop runs again for the newly enqueued sitemap URLs. If an orphan page contains links to other new pages, those pages get normal BFS treatment at depth + 1. This ensures child pages of orphans are crawled with correct depth values rather than being lost.
Concurrency & Politeness
| Parameter | Value | Purpose |
|---|---|---|
| Concurrency | 5 parallel requests | Fetch multiple pages per batch |
| Inter-batch delay | 200ms | Prevent overwhelming target servers |
| Request timeout | 10 seconds | Skip unresponsive pages |
| Max redirects | 5 | Follow redirect chains to final URL |
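The concurrency and delay parameters above can be sketched as a small batching loop. Names and signature here are illustrative, not the actual CrawlService API; `fetchAndParse` is a stand-in for the real per-page fetch.

```typescript
// Sketch of the batching policy: up to 5 parallel fetches per batch,
// with a 200 ms politeness pause between batches.
const CONCURRENCY = 5;
const INTER_BATCH_DELAY_MS = 200;

const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function crawlBatches<T>(
  queue: string[], // mutated: URLs are consumed from the front
  fetchAndParse: (url: string) => Promise<T>,
): Promise<PromiseSettledResult<T>[]> {
  const settled: PromiseSettledResult<T>[] = [];
  while (queue.length > 0) {
    const batch = queue.splice(0, CONCURRENCY);
    // allSettled keeps one failed page from aborting the whole batch
    settled.push(...(await Promise.allSettled(batch.map(fetchAndParse))));
    if (queue.length > 0) await delay(INTER_BATCH_DELAY_MS);
  }
  return settled;
}
```

`Promise.allSettled` (rather than `Promise.all`) matters here: a timeout or error on one page is recorded as a rejected result instead of failing the batch.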
URL Normalization
The `UrlNormalizerService` normalizes URLs before comparison:
- Lowercases the hostname
- Removes trailing slashes (except the root `/`)
- Removes URL fragments (`#section`)
- Strips tracking parameters: `utm_*`, `fbclid`, `gclid`, `msclkid`, `mc_cid`, `mc_eid`, `ref`, `_ga`, `_gl`
- Removes default ports (80 for HTTP, 443 for HTTPS)
- Sorts remaining query parameters for consistency
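A minimal sketch of these rules using the WHATWG `URL` API; the function name and exact handling are illustrative, and the real `UrlNormalizerService` may differ.

```typescript
// Illustrative normalization following the rules listed above.
const TRACKING_PARAMS = new Set([
  "fbclid", "gclid", "msclkid", "mc_cid", "mc_eid", "ref", "_ga", "_gl",
]);

function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hostname = u.hostname.toLowerCase(); // hostnames are case-insensitive
  u.hash = "";                           // drop #fragment
  // Strip utm_* plus the explicit tracking-parameter list.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      u.searchParams.delete(key);
    }
  }
  u.searchParams.sort();                 // stable query ordering
  // Remove the trailing slash except on the bare root path.
  if (u.pathname !== "/" && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  // Note: the URL class already drops default ports (80/443) for http/https.
  return u.toString();
}
```

Normalizing before every `visited`-set lookup is what keeps `/about`, `/about/`, and `/about?utm_source=x` from being crawled as three separate pages.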
HTML Parsing
The `HtmlParserService` uses Cheerio with a two-phase approach:
- Phase 1 — Link Discovery (full DOM): Extracts ALL `<a href>` links before noise removal, including nav, header, and footer links critical for BFS.
- Phase 2 — Content Extraction (smart noise removal): Removes scripts, styles, nav, footer, sidebars, widgets, and aria-hidden elements. Extracts up to 2,000 characters of clean text for topic analysis.
Page titles are cleaned by stripping trailing site names (e.g., "About Us | MySite" → "About Us").
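That cleanup can be sketched with a single regex. The separator set (`|` and `-`) and the function name are assumptions for illustration, not the parser's actual code.

```typescript
// Hypothetical title cleanup: strip a trailing "| SiteName" or "- SiteName"
// segment. Falls back to the raw title if stripping would empty it.
function cleanTitle(raw: string): string {
  return raw.replace(/\s*[|\-]\s*[^|\-]*$/, "").trim() || raw.trim();
}
```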
Link Filtering
Links pass through multiple filters before entering the BFS queue:
- Skip non-navigable: `mailto:`, `tel:`, `javascript:`, `data:`, `ftp:`, `file:`, fragments.
- Skip nofollow: Links with `rel="nofollow"` are excluded.
- Internal only: Only links with the same origin as the root URL are followed.
- Skip assets: 40+ file extensions are filtered (images, videos, fonts, documents, code).
- Deduplicate: Each unique href is processed only once per page.
- Exclude patterns: User-configured URL patterns are applied.
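The filter chain can be condensed into a single predicate, as a sketch. The scheme and extension lists here are abbreviated examples, not the full production sets, and `shouldFollow` is a hypothetical name.

```typescript
// Illustrative link filter: scheme check, nofollow, same-origin, asset skip.
const SKIP_SCHEMES = ["mailto:", "tel:", "javascript:", "data:", "ftp:", "file:"];
const ASSET_EXTENSIONS = [".png", ".jpg", ".svg", ".mp4", ".woff2", ".pdf", ".zip"]; // 40+ in practice

function shouldFollow(href: string, rootOrigin: string, rel = ""): boolean {
  const lower = href.toLowerCase();
  if (SKIP_SCHEMES.some((s) => lower.startsWith(s))) return false; // non-navigable
  if (lower.startsWith("#")) return false;                         // fragment-only
  if (rel.split(/\s+/).includes("nofollow")) return false;         // rel="nofollow"
  let url: URL;
  try {
    url = new URL(href, rootOrigin); // resolve relative hrefs against the root
  } catch {
    return false;                    // unparseable href
  }
  if (url.origin !== rootOrigin) return false; // internal links only
  if (ASSET_EXTENSIONS.some((ext) => url.pathname.toLowerCase().endsWith(ext)))
    return false;                              // skip asset files
  return true;
}
```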
Exclude Patterns
Configure global exclude patterns in Settings as comma-separated patterns with wildcard (`*`) support. Each pattern is matched against the full URL, the pathname, or the pathname plus query string:
```
/admin/*
/wp-json/*
*.pdf
/tag/*
/author/*
```
Page Limits
| Plan | Max Pages | Crawls/Month |
|---|---|---|
| Free | 50 | 5 |
| Starter | 200 | 20 |
| Pro | 1,000 | Unlimited |
| Advanced | 5,000 | Unlimited |
| Premium | 25,000 | Unlimited |
Error Handling
- HTTP 4xx/5xx: Rejected by `validateStatus`.
- Redirects (3xx): Followed automatically up to 5 hops.
- Timeouts: Pages exceeding 10 seconds are recorded as errors.
- Non-HTML responses: Pages without a `text/html` content-type return empty outLinks.
- Job failure: If the crawl fails, job status is set to `failed` with an error message.
Errors are logged but don't stop the crawl — the BFS continues processing remaining pages.
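As a sketch, the request policy above maps onto an Axios-style config object. The field names follow Axios conventions (`timeout`, `maxRedirects`, `validateStatus`); the actual wiring inside the service may differ.

```typescript
// Illustrative per-request policy matching the error-handling rules above.
const requestConfig = {
  timeout: 10_000,   // 10 s per request; slower pages are recorded as errors
  maxRedirects: 5,   // follow redirect chains up to 5 hops
  // Treat only 2xx as success, so 4xx/5xx reject and surface as page errors.
  validateStatus: (status: number) => status >= 200 && status < 300,
};

function isHtml(contentType: string | undefined): boolean {
  // Non-HTML responses yield no outLinks for the BFS.
  return (contentType ?? "").toLowerCase().includes("text/html");
}
```

Because each page's failure is caught per request, a single timeout or 500 never aborts the crawl; the BFS simply moves on to the next batch.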