
# Features

Kreuzcrawl is a Rust-native web crawling engine with deep extraction capabilities. This page covers every major feature area and includes a competitive comparison with other tools in the space.


## Core Crawling

The `CrawlEngine` is built with the builder pattern and validates all configuration at construction time. Invalid configs fail fast, before any network requests are made.

| Feature | Description |
| --- | --- |
| `CrawlEngine` builder | Fluent `.builder().config(...).build()` pattern with strict serde validation |
| Concurrent fetching | `JoinSet` + `Semaphore` for parallel requests (default: 10 concurrent) |
| Multiple strategies | BFS, DFS, BestFirst, and Adaptive traversal via the `CrawlStrategy` trait |
| Batch crawling | Multi-seed `batch_crawl()` and `batch_scrape()` for processing URL lists |
| Streaming events | Real-time `crawl_stream()` returning `CrawlEvent` items as pages are processed |
| URL discovery | Sitemap parsing (XML, gzip, sitemap index) combined with link extraction |
| Redirect handling | HTTP 3xx, `Refresh` header, and meta refresh handling, with redirect-loop detection |
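A minimal usage sketch, assuming the public API matches the names on this page (`CrawlEngine`, `.builder()`, `.config(...)`, `.build()`); the `CrawlConfig` fields and the `crawl()` call are illustrative, so check the crate docs for the real signatures:

```rust
use kreuzcrawl::{CrawlConfig, CrawlEngine};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Field names here are illustrative, not the crate's documented API.
    let config = CrawlConfig {
        max_depth: 2,
        max_pages: 50,
        ..Default::default()
    };

    // Invalid configuration fails here, at build time,
    // before any network request is made.
    let engine = CrawlEngine::builder().config(config).build()?;

    // Method and result field names are assumptions for illustration.
    let result = engine.crawl("https://example.com").await?;
    println!("crawled {} pages", result.pages.len());
    Ok(())
}
```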

## Metadata Extraction

Every scraped page yields a `PageMetadata` struct with 40+ fields, plus separate collections for links, images, feeds, and structured data.

| Feature | Description |
| --- | --- |
| Open Graph | `og:title`, `og:description`, `og:image`, `og:type`, `og:url`, and more |
| Twitter Card | `twitter:card`, `twitter:title`, `twitter:description`, `twitter:image` |
| Dublin Core | `dc.title`, `dc.creator`, `dc.date`, `dc.subject` |
| Article metadata | `article:published_time`, `article:author`, `article:section` |
| JSON-LD | Full JSON-LD extraction from `<script type="application/ld+json">` blocks |
| Link extraction | 4 link types: Internal, External, Anchor, Document |
| Image extraction | All sources: `<img>`, `<picture>`, `og:image`, `srcset` |
| Feed discovery | RSS, Atom, and JSON Feed detection from `<link>` elements |
| Favicons | Extraction and canonicalization of site icons |
| hreflang | Language and region variant links for internationalized pages |
| Headings | H1-H6 extraction with hierarchy preservation |
| Response metadata | HTTP headers, content type, charset detection, body size |

## Markdown Conversion

HTML-to-markdown conversion runs automatically on every page via `html-to-markdown-rs`. Results are available in the `MarkdownResult` struct.

| Feature | Description |
| --- | --- |
| Always-on conversion | Every page includes a `markdown` field with the converted content |
| Document structure | Optional structured document tree with semantic nodes |
| Table extraction | Structured table data preserved alongside the markdown output |
| Link-to-citations | Numbered reference conversion (e.g., [1], [2]) with a `CitationResult` containing all references |
| Fit markdown | Content pruning and heuristic-based truncation optimized for LLM consumption |
| Warnings | Non-fatal processing warnings surfaced in `MarkdownResult.warnings` |

## AI and LLM Integration

**Feature gate:** requires the `ai` feature: `kreuzcrawl = { version = "0.1", features = ["ai"] }`

| Feature | Description |
| --- | --- |
| `LlmExtractor` | Multi-provider LLM extraction powered by `liter-llm` |
| JSON schema extraction | Pass a JSON schema and receive structured data matching it |
| Cost tracking | `ExtractionMeta` includes the estimated USD cost, prompt tokens, and completion tokens |
| Model metadata | The model identifier used for each extraction is recorded |
| Chunk tracking | Number of content chunks processed by the LLM |
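A sketch of what schema-driven extraction could look like; the `LlmExtractor` constructor, method names, and metadata fields below are assumptions for illustration, not the crate's documented API:

```rust
use kreuzcrawl::LlmExtractor; // requires features = ["ai"]
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical constructor; the model identifier format is an assumption.
    let extractor = LlmExtractor::new("openai/gpt-4o-mini");

    // JSON schema describing the structured data we want back.
    let schema = json!({
        "type": "object",
        "properties": {
            "title":  { "type": "string" },
            "author": { "type": "string" }
        }
    });

    // Hypothetical method name; per the table above, the result carries
    // ExtractionMeta with cost and token usage.
    let extraction = extractor.extract("https://example.com/post", &schema).await?;
    println!("estimated cost: ${:.4}", extraction.meta.cost_usd);
    Ok(())
}
```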

## Anti-Bot and Browser Automation

**Feature gate:** requires the `browser` feature: `kreuzcrawl = { version = "0.1", features = ["browser"] }`

| Feature | Description |
| --- | --- |
| WAF detection | 8 vendors: Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, Sucuri, F5 |
| Browser fallback | Headless Chrome via `chromiumoxide` with a configurable `BrowserMode` (Auto, Always, Never) |
| `BrowserPool` | Multi-browser management with health checks and crash recovery |
| Browser wait strategies | NetworkIdle, Selector (wait for a CSS selector), and Fixed duration |
| Browser profiles | Named persistent sessions preserving cookies and localStorage |
| JS rendering detection | Heuristic-based detection of pages requiring JavaScript rendering |
| Screenshot capture | PNG screenshot capture when using the browser (via `capture_screenshot`) |
| User-Agent rotation | `UaRotationLayer` Tower middleware for UA header diversity |

## Network and Caching

| Feature | Description |
| --- | --- |
| Per-domain rate limiting | `PerDomainRateLimitLayer` Tower middleware with configurable delays (default: 200 ms) |
| HTTP caching | ETag and Last-Modified conditional requests via `CrawlCacheLayer` |
| Disk cache | blake3-hashed file storage with TTL and automatic eviction |
| Proxy support | HTTP, HTTPS, and SOCKS5 proxies via `ProxyConfig` |
| User-Agent rotation | Configurable list of UA strings rotated across requests |
| Cookie handling | Cookie tracking with deduplication and persistence |
| Authentication | Basic, Bearer, and custom-header authentication via `AuthConfig` |
| Configurable timeouts | Per-request timeout (default: 30 s), max redirects (default: 10, max: 100) |
| Retry logic | Configurable retry count with specific HTTP status codes as triggers |
| Body size limits | Optional `max_body_size` to cap response payloads |

## Content Filtering and Relevance

| Feature | Description |
| --- | --- |
| BM25 scoring | `Bm25Filter` for adaptive relevance evaluation of crawled pages |
| Adaptive crawling | `AdaptiveStrategy` with term-saturation detection for early termination |
| Main content extraction | `main_content_only` strips boilerplate, leaving the primary page content |
| Tag removal | `remove_tags` accepts CSS selectors for elements to strip before processing |
| Path filtering | `include_paths` and `exclude_paths` with regex pattern matching |
| Domain scoping | `stay_on_domain` with optional `allow_subdomains` |
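A scoping sketch using the option names from the table above; the surrounding `CrawlConfig` struct and the exact field types are assumptions, not the crate's documented API:

```rust
use kreuzcrawl::CrawlConfig;

// Illustrative only: the option names come from the table above,
// but the field types are assumptions.
fn scoped_config() -> CrawlConfig {
    CrawlConfig {
        // Only crawl blog and docs URLs...
        include_paths: vec![r"^/blog/.*".into(), r"^/docs/.*".into()],
        // ...but skip tag archive pages.
        exclude_paths: vec![r"^/blog/tag/.*".into()],
        // Stay on the seed domain, while allowing e.g. docs.example.com.
        stay_on_domain: true,
        allow_subdomains: true,
        ..Default::default()
    }
}
```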

## Document Downloads

| Feature | Description |
| --- | --- |
| Non-HTML documents | Download PDFs, DOCX files, images, and code files via `download_documents` (enabled by default) |
| Asset downloads | CSS, JS, and images via `download_assets` with category filtering |
| Size limits | `document_max_size` (default: 50 MB) and `max_asset_size` caps |
| MIME filtering | `document_mime_types` allowlist controlling which document types to download |
| Content hashing | SHA-256 digest computed for every downloaded document |
| Filename extraction | Parsed from `Content-Disposition` headers or the URL path |

## Compliance and Standards

| Feature | Description |
| --- | --- |
| robots.txt | RFC 9309 compliant, with user-agent prefix matching and crawl-delay support |
| Sitemap parsing | XML, gzip-compressed, and sitemap index file support |
| noindex / nofollow | Detection of `<meta>` robots directives and `X-Robots-Tag` headers |
| Charset detection | Automatic encoding detection from HTTP headers and HTML meta tags |
| Binary/PDF skipping | Content-type-aware filtering to avoid processing non-HTML content |
| Config validation | serde with `deny_unknown_fields`: typos in config keys are rejected at parse time |

## WARC Output

**Feature gate:** requires the `warc` feature: `kreuzcrawl = { version = "0.1", features = ["warc"] }`

| Feature | Description |
| --- | --- |
| WARC output | Standards-compliant WARC archiving for entire crawl sessions |
| Archive format | Web ARChive (WARC) format with complete HTTP request/response pairs |
| File storage | Written to disk via the `warc_output` configuration path |

## MCP and REST API

**Feature gates:** MCP server: `features = ["mcp"]`; REST API: `features = ["api"]`

| Feature | Description |
| --- | --- |
| MCP server | Model Context Protocol server for AI agent integration |
| REST API | Axum-based HTTP API with OpenAPI documentation via `utoipa` |
| Page interaction | Execute action sequences on browser pages (feature-gated: `interact`) |

## Extensibility

Kreuzcrawl exposes 7 pluggable traits that let you replace any component of the crawl pipeline:

| Trait | Purpose | Default implementation |
| --- | --- | --- |
| `Frontier` | URL queue and deduplication | `InMemoryFrontier` (`VecDeque` + `HashSet`) |
| `RateLimiter` | Per-domain request throttling | `PerDomainThrottle` (200 ms delay) |
| `CrawlStore` | Result storage backend | `NoopStore` (results returned, not stored) |
| `EventEmitter` | Lifecycle event callbacks | `NoopEmitter` |
| `CrawlStrategy` | Traversal algorithm and URL scoring | `BfsStrategy` |
| `ContentFilter` | Page relevance evaluation | `NoopFilter` (accept all) |
| `CrawlCache` | HTTP response caching | `NoopCache` |
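To illustrate what the default `InMemoryFrontier` is described as doing (a `VecDeque` for FIFO ordering plus a `HashSet` for deduplication), here is a self-contained sketch; the type and method names are ours, not the crate's:

```rust
use std::collections::{HashSet, VecDeque};

// Illustrative stand-in for the default frontier: FIFO queue + seen-set.
struct SimpleFrontier {
    queue: VecDeque<String>,
    seen: HashSet<String>,
}

impl SimpleFrontier {
    fn new() -> Self {
        Self { queue: VecDeque::new(), seen: HashSet::new() }
    }

    /// Enqueue a URL unless it was seen before; returns true if enqueued.
    fn push(&mut self, url: &str) -> bool {
        if self.seen.insert(url.to_string()) {
            self.queue.push_back(url.to_string());
            true
        } else {
            false
        }
    }

    /// Pop the next URL in FIFO (BFS) order.
    fn pop(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}

fn main() {
    let mut frontier = SimpleFrontier::new();
    assert!(frontier.push("https://example.com/"));
    assert!(!frontier.push("https://example.com/")); // duplicate rejected
    frontier.push("https://example.com/about");
    assert_eq!(frontier.pop().as_deref(), Some("https://example.com/"));
}
```

Swapping in a `BinaryHeap` keyed by a relevance score would give BestFirst ordering instead of BFS, which is the kind of substitution the `Frontier` and `CrawlStrategy` traits exist for.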

The Tower service stack composes these traits into a layered pipeline:

```text
CrawlStrategy --> Frontier --> CrawlTracingLayer --> UaRotationLayer
    --> CrawlCacheLayer --> PerDomainRateLimitLayer --> HttpFetchService
```

## CLI

The `kreuzcrawl` CLI provides three core commands:

```shell
# Scrape a single page
kreuzcrawl scrape https://example.com

# Crawl with depth limiting
kreuzcrawl crawl https://example.com --depth 2 --max-pages 50 --format markdown

# Discover URLs via sitemap + crawling
kreuzcrawl map https://example.com --respect-robots-txt
```

Output formats: `json` (full `CrawlResult` with all metadata) and `markdown` (`MarkdownResult` with citations).


## Competitive Comparison

### Overview

| | kreuzcrawl | spider | firecrawl | crawl4ai | webclaw | ScrapeGraphAI | CRW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Language | Rust | Rust | TypeScript | Python | Rust | Python | Rust |
| License | Elastic-2.0 | MIT | AGPL-3.0 | Apache-2.0 | AGPL-3.0 | MIT | AGPL-3.0 |
| Distribution | Library + CLI | Library + CLI + SaaS | SaaS + Self-hosted | Library + CLI + API | Library + CLI + MCP | Library + SaaS API | CLI + MCP + API |
| Headless browser | chromiumoxide | chromey / WebDriver | Playwright | Playwright | None (TLS fingerprint) | Playwright | LightPanda / Chrome |

### Crawling

| | kreuzcrawl | spider | firecrawl | crawl4ai | webclaw | ScrapeGraphAI | CRW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS | BFS | BFS, DFS, BestFirst | BFS | LLM-driven graph | BFS |
| Concurrent fetching | JoinSet + Semaphore | Tokio multi-thread + AIMD | Bull queue workers | asyncio browser pool | Tokio | asyncio | Tokio |
| Streaming events | Real-time | Subscriber channels | SSE / polling | Yes | -- | -- | -- |
| Batch operations | `batch_crawl()` | -- | Async API | Deep crawl | Yes | -- | Yes |
| Sitemap parsing | XML, gzip, index | Yes | Yes | -- | Yes | -- | Yes |
| robots.txt | RFC 9309 | With caching | Yes | Basic | Yes | -- | Yes |

### Extraction and Content

| | kreuzcrawl | spider | firecrawl | crawl4ai | webclaw | ScrapeGraphAI | CRW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Markdown conversion | Always-on + structure | Yes | Primary output | Yes | Yes | Yes | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | -- | -- | BM25/LLM-based | Token-optimized | -- | -- |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Basic | Basic | Basic | Moderate | -- | Basic |
| JSON-LD extraction | Full | -- | -- | -- | Data islands | -- | -- |
| Feed discovery | RSS, Atom, JSON Feed | -- | -- | -- | -- | -- | -- |
| Link-to-citations | Numbered refs | -- | -- | Yes | -- | -- | -- |
| LLM extraction | Multi-provider (liter-llm) | OpenAI, Gemini | 10+ providers | litellm | Ollama (local) | LangChain (core) | Claude, OpenAI |
| Cost tracking | USD + tokens | -- | Yes | Yes | -- | Token counting | -- |

### Architecture and Extensibility

| | kreuzcrawl | spider | firecrawl | crawl4ai | webclaw | ScrapeGraphAI | CRW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pluggable traits | 7 traits | -- | -- | Partial (strategies) | -- | Graph nodes | -- |
| Middleware stack | Tower services | -- | -- | -- | -- | -- | -- |
| Config validation | serde strict | -- | -- | -- | -- | -- | -- |
| BM25 relevance scoring | Yes | -- | -- | Yes | -- | -- | -- |
| Adaptive crawling | Term saturation | -- | -- | Pattern learning | -- | -- | -- |
| Asset download + dedup | SHA-256 | -- | -- | -- | -- | -- | -- |
| Language SDKs | 11 languages | Rust, Python, Node.js | Python, JS, Go, Java, Elixir, Rust | Python | Rust | Python, Node.js | Rust |

### License Details

| License | Tools | Commercial use | Hosting restriction |
| --- | --- | --- | --- |
| Elastic-2.0 | kreuzcrawl | Yes | Cannot provide as a managed service |
| MIT | spider, ScrapeGraphAI | Yes | None |
| Apache-2.0 | crawl4ai | Yes | None |
| AGPL-3.0 | firecrawl, webclaw, CRW | Yes | Must open-source modifications if hosting |