Crawl Engine
The crawl engine is the core of Lociator. Built with NestJS and deployed as a stateless service behind Traefik, it uses a 3-phase crawl architecture that combines BFS link-following with sitemap discovery to build a complete and accurate graph — including orphan page detection.
Overview
Each crawl job is received via an Upstash QStash webhook and processed in three sequential phases:
- Phase 1 — BFS: Standard breadth-first crawl following internal links.
- Phase 2 — Sitemap Discovery: Fetches sitemap.xml and enqueues URLs not found by BFS.
- Phase 3 — Orphan Expansion: Continues BFS for sitemap-discovered pages, following their outgoing links.
3-Phase Crawl Architecture
Phase 1 — BFS Crawl
The crawler starts from the root URL and uses Breadth-First Search to discover all reachable pages by following internal links. Each page tracks its BFS parent for hierarchical analysis. This phase builds the core graph structure.
```
// Phase 1: BFS (simplified from CrawlService)
queue = [{ url: rootUrl, depth: 0, parent: null }]
visited = new Set([normalizedRoot])
pages = []
while (queue.length > 0 && pages.length < maxPages) {
  batch = queue.splice(0, CONCURRENCY) // 5 at a time
  settled = await Promise.allSettled(batch.map(fetchAndParse))
  for (result of settled) {
    if (result.status !== 'fulfilled') continue // failed fetches are logged, not fatal
    page = result.value
    pages.push(page)
    for (link of page.outLinks) {
      if (!visited.has(link.normalizedUrl)) {
        visited.add(link.normalizedUrl)
        queue.push({ url: link.normalizedUrl, depth: page.depth + 1, parent: page.normalizedUrl })
      }
    }
  }
  await delay(200) // Polite inter-batch delay
}
```
Phase 2 — Sitemap Discovery
After BFS finishes, the `SitemapParserService` fetches and parses the site's sitemap. URLs found in the sitemap but not visited during BFS are added to the queue at `depth = maxBfsDepth + 1` with no parent page. This is how orphan pages are discovered.
- Fetches `/sitemap.xml` with fallback to `/sitemap_index.xml`
- Recursively handles sitemap index files (max depth 3)
- Filters internal-only, non-asset URLs
- Normalizes and deduplicates all URLs
- Caps at 10,000 URLs, 5-second timeout
- Silently returns empty set on any error (sitemap is optional)
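The enqueueing step for sitemap-only URLs can be sketched as follows. This is a minimal illustration under stated assumptions: the `QueueEntry` shape and the function name are hypothetical, not the service's actual code.

```typescript
// Hypothetical sketch of Phase 2 enqueueing: sitemap URLs the BFS never
// visited enter the queue one level past the deepest BFS depth, parentless.
interface QueueEntry {
  url: string;
  depth: number;
  parent: string | null;
}

function enqueueSitemapUrls(
  sitemapUrls: string[],   // already normalized and deduplicated
  visited: Set<string>,    // normalized URLs seen during BFS
  maxBfsDepth: number,
  queue: QueueEntry[],
): number {
  let added = 0;
  for (const url of sitemapUrls) {
    if (visited.has(url)) continue; // BFS already found this page
    visited.add(url);
    queue.push({ url, depth: maxBfsDepth + 1, parent: null }); // orphan candidate
    added++;
  }
  return added;
}
```

Marking these entries as parentless is what later distinguishes orphan candidates from normally discovered pages.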
Sitemap-discovered pages are enqueued with `parentNormalizedUrl: null`, so they have no BFS parent. Their `in_degree` depends entirely on whether other crawled pages link to them.
Phase 3 — Orphan Expansion
The BFS loop runs again for the newly enqueued sitemap URLs. If an orphan page contains links to other new pages, those pages get normal BFS treatment at depth + 1. This ensures child pages of orphans are crawled with correct depth values rather than being lost.
Concurrency & Politeness
| Parameter | Value | Purpose |
|---|---|---|
| Concurrency | 5 parallel requests | Fetch multiple pages per batch |
| Inter-batch delay | 200ms | Prevent overwhelming target servers |
| Request timeout | 10 seconds | Skip unresponsive pages |
| Max redirects | 5 | Follow redirect chains to final URL |
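The concurrency and delay parameters above can be sketched as a small batching loop. Names and signature here are illustrative, not the actual CrawlService API; `fetchAndParse` is a stand-in for the real per-page fetch.

```typescript
// Sketch of the batching policy: up to 5 parallel fetches per batch,
// with a 200 ms politeness pause between batches.
const CONCURRENCY = 5;
const INTER_BATCH_DELAY_MS = 200;

const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function crawlBatches<T>(
  queue: string[], // mutated: URLs are consumed from the front
  fetchAndParse: (url: string) => Promise<T>,
): Promise<PromiseSettledResult<T>[]> {
  const settled: PromiseSettledResult<T>[] = [];
  while (queue.length > 0) {
    const batch = queue.splice(0, CONCURRENCY);
    // allSettled keeps one failed page from aborting the whole batch
    settled.push(...(await Promise.allSettled(batch.map(fetchAndParse))));
    if (queue.length > 0) await delay(INTER_BATCH_DELAY_MS);
  }
  return settled;
}
```

`Promise.allSettled` (rather than `Promise.all`) matters here: a timeout or error on one page is recorded as a rejected result instead of failing the batch.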
URL Normalization
The `UrlNormalizerService` normalizes URLs before comparison:
- Lowercases the hostname
- Removes trailing slashes (except the root `/`)
- Removes URL fragments (`#section`)
- Strips tracking parameters: `utm_*`, `fbclid`, `gclid`, `msclkid`, `mc_cid`, `mc_eid`, `ref`, `_ga`, `_gl`
- Removes default ports (80 for HTTP, 443 for HTTPS)
- Sorts remaining query parameters for consistency
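A minimal sketch of these rules using the WHATWG `URL` API; the function name and exact handling are illustrative, and the real `UrlNormalizerService` may differ.

```typescript
// Illustrative normalization following the rules listed above.
const TRACKING_PARAMS = new Set([
  "fbclid", "gclid", "msclkid", "mc_cid", "mc_eid", "ref", "_ga", "_gl",
]);

function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hostname = u.hostname.toLowerCase(); // hostnames are case-insensitive
  u.hash = "";                           // drop #fragment
  // Strip utm_* plus the explicit tracking-parameter list.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      u.searchParams.delete(key);
    }
  }
  u.searchParams.sort();                 // stable query ordering
  // Remove the trailing slash except on the bare root path.
  if (u.pathname !== "/" && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  // Note: the URL class already drops default ports (80/443) for http/https.
  return u.toString();
}
```

Normalizing before every `visited`-set lookup is what keeps `/about`, `/about/`, and `/about?utm_source=x` from being crawled as three separate pages.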
HTML Parsing
The `HtmlParserService` uses Cheerio with a two-phase approach:
- Phase 1 — Link Discovery (full DOM): Extracts ALL `<a href>` links before noise removal, including nav, header, and footer links critical for BFS.
- Phase 2 — Content Extraction (smart noise removal): Removes scripts, styles, nav, footer, sidebars, widgets, and aria-hidden elements. Extracts up to 2,000 characters of clean text for topic analysis.
Page titles are cleaned by stripping trailing site names (e.g., "About Us | MySite" → "About Us").
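That cleanup can be sketched with a single regex. The separator set (`|` and `-`) and the function name are assumptions for illustration, not the parser's actual code.

```typescript
// Hypothetical title cleanup: strip a trailing "| SiteName" or "- SiteName"
// segment. Falls back to the raw title if stripping would empty it.
function cleanTitle(raw: string): string {
  return raw.replace(/\s*[|\-]\s*[^|\-]*$/, "").trim() || raw.trim();
}
```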
Link Filtering
Links pass through multiple filters before entering the BFS queue:
- Skip non-navigable: `mailto:`, `tel:`, `javascript:`, `data:`, `ftp:`, `file:`, fragments.
- Skip nofollow: Links with `rel="nofollow"` are excluded.
- Internal only: Only links with the same origin as the root URL are followed.
- Skip assets: 40+ file extensions are filtered (images, videos, fonts, documents, code).
- Deduplicate: Each unique href is processed only once per page.
- Exclude patterns: User-configured URL patterns are applied.
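The filter chain can be condensed into a single predicate, as a sketch. The scheme and extension lists here are abbreviated examples, not the full production sets, and `shouldFollow` is a hypothetical name.

```typescript
// Illustrative link filter: scheme check, nofollow, same-origin, asset skip.
const SKIP_SCHEMES = ["mailto:", "tel:", "javascript:", "data:", "ftp:", "file:"];
const ASSET_EXTENSIONS = [".png", ".jpg", ".svg", ".mp4", ".woff2", ".pdf", ".zip"]; // 40+ in practice

function shouldFollow(href: string, rootOrigin: string, rel = ""): boolean {
  const lower = href.toLowerCase();
  if (SKIP_SCHEMES.some((s) => lower.startsWith(s))) return false; // non-navigable
  if (lower.startsWith("#")) return false;                         // fragment-only
  if (rel.split(/\s+/).includes("nofollow")) return false;         // rel="nofollow"
  let url: URL;
  try {
    url = new URL(href, rootOrigin); // resolve relative hrefs against the root
  } catch {
    return false;                    // unparseable href
  }
  if (url.origin !== rootOrigin) return false; // internal links only
  if (ASSET_EXTENSIONS.some((ext) => url.pathname.toLowerCase().endsWith(ext)))
    return false;                              // skip asset files
  return true;
}
```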
Exclude Patterns
Configure global exclude patterns in Settings as comma-separated patterns with wildcard (`*`) support. Each pattern is matched against the full URL, the pathname, or the pathname plus query string:
```
/admin/*
/wp-json/*
*.pdf
/tag/*
/author/*
```
Page Limits
| Plan | Max Pages | Crawls/Month |
|---|---|---|
| Free | 50 | 5 |
| Starter | 200 | 20 |
| Pro | 1,000 | Unlimited |
| Advanced | 5,000 | Unlimited |
| Premium | 25,000 | Unlimited |
Error Handling
- HTTP 4xx/5xx: Rejected by `validateStatus`.
- Redirects (3xx): Followed automatically up to 5 hops.
- Timeouts: Pages exceeding 10 seconds are recorded as errors.
- Non-HTML responses: Pages without a `text/html` content-type return empty outLinks.
- Job failure: If the crawl fails, job status is set to `failed` with an error message.
Errors are logged but don't stop the crawl — the BFS continues processing remaining pages.
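As a sketch, the request policy above maps onto an Axios-style config object. The field names follow Axios conventions (`timeout`, `maxRedirects`, `validateStatus`); the actual wiring inside the service may differ.

```typescript
// Illustrative per-request policy matching the error-handling rules above.
const requestConfig = {
  timeout: 10_000,   // 10 s per request; slower pages are recorded as errors
  maxRedirects: 5,   // follow redirect chains up to 5 hops
  // Treat only 2xx as success, so 4xx/5xx reject and surface as page errors.
  validateStatus: (status: number) => status >= 200 && status < 300,
};

function isHtml(contentType: string | undefined): boolean {
  // Non-HTML responses yield no outLinks for the BFS.
  return (contentType ?? "").toLowerCase().includes("text/html");
}
```

Because each page's failure is caught per request, a single timeout or 500 never aborts the crawl; the BFS simply moves on to the next batch.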