Skip to content

Features

Features

Kreuzcrawl is a Rust-native web crawling engine. Every feature below is wired through the public surface of the Rust crate (pub use from the crate root) or the equivalent surface in the native bindings, unless explicitly marked as an internal preview.

Native bindings ship for 14 languages — Rust, Python, TypeScript (Node + WebAssembly), Go, Java, Kotlin, C#, Ruby, PHP, Elixir, Dart, Swift, Zig — plus a stable C FFI surface for everything else.

The public Rust surface from kreuzcrawl::* is six free functions over an opaque CrawlEngineHandle: create_engine, scrape, crawl, map_urls, batch_scrape, batch_crawl. The same six operations are exposed by every binding, plus serve_api and start_mcp_server under the api and mcp cargo features.


Core Crawling

Feature Description
Engine construction create_engine(config) — opaque CrawlEngineHandle; all configuration validated up front via serde.
Concurrent fetching Tokio JoinSet + Semaphore for parallel requests, bounded by CrawlConfig::max_concurrent.
Sequential crawl crawl(&engine, url) follows links from a seed up to max_depth / max_pages.
Batch operations batch_crawl(&engine, urls) and batch_scrape(&engine, urls) for multi-seed processing.
URL discovery map_urls(&engine, url) returns a MapResult from sitemap parsing and link extraction.
Redirect handling HTTP 3xx, Refresh header, and meta-refresh detection with loop detection (max_redirects, default 10).

Four traversal strategies — BFS (default), DFS, BestFirst, and Adaptive — are implemented internally; the public surface always uses BFS. A configuration knob for strategy selection is on the roadmap.


Metadata Extraction

Every scraped page yields a PageMetadata struct alongside separate collections for links, images, feeds, and structured data.

Feature Description
Open Graph og:title, og:description, og:image, og:type, og:url, and more.
Twitter Card twitter:card, twitter:title, twitter:description, twitter:image.
Dublin Core dc.title, dc.creator, dc.date, dc.subject.
Article metadata article:published_time, article:author, article:section.
JSON-LD Full extraction from <script type="application/ld+json"> blocks.
Link extraction 4 categories: Internal, External, Anchor, Document.
Image extraction <img>, <picture>, og:image, and srcset sources.
Feed discovery RSS, Atom, and JSON Feed <link> elements.
Favicons Extraction and canonicalisation of site icons.
hreflang Language and region variants for internationalised pages.
Headings H1–H6 with hierarchy preservation.
Response metadata HTTP headers, content type, charset detection, body size.

Markdown Conversion

HTML→Markdown conversion runs automatically on every page via html-to-markdown. Results land in the MarkdownResult struct attached to every page.

Feature Description
Always-on conversion Every page result includes a markdown field with converted content.
Document structure Optional structured tree of semantic nodes alongside the rendered Markdown.
Table extraction Structured table data preserved alongside Markdown output.
Link-to-citations Numbered references ([1], [2]) with a CitationResult carrying every reference.
Fit Markdown Heuristic-based pruning and truncation optimised for LLM consumption (MarkdownResult.fit_content).
Warnings Non-fatal processing warnings surfaced in MarkdownResult.warnings.

Browser Fallback

Feature gate

Requires the browser feature: kreuzcrawl = { version = "0.3", features = ["browser"] }

Feature Description
Headless Chrome chromiumoxide-driven Chrome with BrowserMode::Auto / Always / Never.
WAF detection 8 vendors auto-detected on HTTP and browser responses: Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, Sucuri, F5.
Auto fallback In Auto mode, WAF-blocked or JS-render-required responses retry through Chrome with a legitimate browser fingerprint.
Wait strategies NetworkIdle (default), Selector (wait for CSS selector), Fixed (fixed duration); plus optional extra_wait after the wait condition.
CDP endpoint Point at an already-running browser via BrowserConfig::endpoint instead of launching one locally.
Persistent profiles Named profiles via CrawlConfig::browser_profile / save_browser_profile. Profile names validated against path-traversal.
Screenshot capture PNG screenshot captured when capture_screenshot is enabled and the browser is used.
JS-render detection SPA-shell and noscript-warning heuristics flag pages that need browser rendering.

Network and Caching

Feature Description
Per-domain rate limiting Default 200 ms delay per origin; configurable in CrawlConfig.
HTTP caching ETag and Last-Modified conditional requests with an on-disk cache.
Proxy support HTTP, HTTPS, and SOCKS5 via ProxyConfig.
User-Agent rotation Configurable list rotated across requests.
Cookie handling Tracking, deduplication, and persistence across requests.
Authentication Basic, Bearer, and custom-header authentication via AuthConfig.
Timeouts Per-request timeout (default 30 s); max_redirects default 10 (cap 100).
Retry logic Configurable retry count with explicit status-code triggers (retry_codes).
Body-size limits Optional CrawlConfig::max_body_size to cap response payloads.

Content Processing

Feature Description
Preprocessing presets content.preprocessing_preset accepts "minimal", "standard" (default), or "aggressive" (boilerplate-stripping).
Tag removal remove_tags takes CSS selectors stripped before extraction.
Path filtering include_paths and exclude_paths accept regex patterns; excludes take priority.
Domain scoping stay_on_domain with optional allow_subdomains.

Document Downloads

Feature Description
Non-HTML documents Download PDFs, DOCX, images, and code files via download_documents (enabled by default).
Asset downloads CSS, JS, images via download_assets with asset_types category filtering.
Size limits document_max_size (default 50 MB) and max_asset_size caps.
MIME filtering document_mime_types allowlist for permitted document types.
Content hashing SHA-256 digest computed for every downloaded document.
Filename extraction Parsed from Content-Disposition or the URL path.

Compliance and Standards

Feature Description
robots.txt RFC 9309 compliant with user-agent prefix matching and Crawl-delay support.
Sitemap parsing XML, gzip-compressed, and sitemap-index files.
noindex / nofollow Detection of <meta> robots directives and X-Robots-Tag headers.
Charset detection Automatic from HTTP headers and HTML meta tags.
Config validation serde with deny_unknown_fields — typos in config keys fail at parse time.

WARC Output

Feature gate

Requires the warc feature: kreuzcrawl = { version = "0.3", features = ["warc"] }

Feature Description
WARC 1.1 Standards-compliant warcinfo + per-page response records, written to CrawlConfig::warc_output.
Header safety Header names and values validated against CR/LF injection before being written.

REST API, MCP, and CLI

Feature Description
REST API serve_api(config).await starts an Axum-based, Firecrawl v1-compatible HTTP API (feature api). OpenAPI schema is generated by utoipa.
MCP server start_mcp_server(...) starts a Model Context Protocol server for AI-agent integration (feature mcp).
CLI The kreuzcrawl binary exposes scrape, crawl, map, and serve subcommands.

CLI quickstart

# Scrape a single page
kreuzcrawl scrape https://example.com

# Crawl with depth limiting
kreuzcrawl crawl https://example.com --depth 2 --max-pages 50 --format markdown

# Discover URLs via sitemap and link extraction
kreuzcrawl map https://example.com --respect-robots-txt

Output formats: json (full CrawlResult / ScrapeResult / MapResult) and markdown (MarkdownResult content with citations).

Edit this page on GitHub