Architecture
Kreuzcrawl is a Rust core crate (kreuzcrawl) with a small public surface, surrounded by polyglot bindings that all wrap the same core. The runtime is Tokio; HTTP fetching uses reqwest; browser-backed rendering and interaction can use either the chromiumoxide CDP backend or the in-process native backend.
Public surface
The crate root exports seven free functions over an opaque handle, plus two feature-gated server entry points and the serialisable configuration and result types:
| Symbol | Purpose |
|---|---|
| `create_engine(config: Option<CrawlConfig>) -> Result<CrawlEngineHandle, CrawlError>` | Build an engine from a validated `CrawlConfig`. |
| `scrape(&engine, url) -> Result<ScrapeResult, _>` | Fetch and extract a single page. |
| `crawl(&engine, url) -> Result<CrawlResult, _>` | Follow links from a seed up to `max_depth` / `max_pages`. |
| `map_urls(&engine, url) -> Result<MapResult, _>` | Discover URLs via sitemaps and link extraction. |
| `interact(&engine, url, actions) -> Result<InteractionResult, _>` | Navigate once and run ordered page actions. |
| `batch_scrape(&engine, urls) -> Result<Vec<BatchScrapeResult>, _>` | Scrape many URLs concurrently. |
| `batch_crawl(&engine, urls) -> Result<Vec<BatchCrawlResult>, _>` | Crawl many seeds concurrently. |
| `serve_api(...)` (feature `api`) / `start_mcp_server(...)` (feature `mcp`) | Long-running REST and MCP servers backed by the same engine. |
All other items in the source tree are internal — the public crate surface is intentionally narrow.
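The "free functions over an opaque handle" shape can be sketched as follows. This is a minimal illustration, not the real crate: the types below are stand-ins that only borrow the documented names, the `max_depth` / `max_pages` fields and their defaults are assumptions, the validation rule is invented for the example, and `page_budget` is a hypothetical helper. The functions are synchronous here for brevity; the real ones may well be async given the Tokio runtime.

```rust
// Stand-in types for illustration only; NOT the real kreuzcrawl definitions.

/// Hypothetical subset of a crawl configuration (fields and defaults assumed).
#[derive(Debug, Clone)]
pub struct CrawlConfig {
    pub max_depth: u32,
    pub max_pages: u32,
}

impl Default for CrawlConfig {
    fn default() -> Self {
        Self { max_depth: 2, max_pages: 100 }
    }
}

#[derive(Debug, PartialEq)]
pub enum CrawlError {
    InvalidConfig(String),
}

/// Opaque handle: the field is private, so outside this module callers can
/// only obtain one from `create_engine` and pass it back to free functions.
pub struct CrawlEngineHandle {
    config: CrawlConfig,
}

/// Validate the configuration once, up front, and wrap it in a handle.
/// `None` falls back to the default configuration.
pub fn create_engine(config: Option<CrawlConfig>) -> Result<CrawlEngineHandle, CrawlError> {
    let config = config.unwrap_or_default();
    // Assumed validation rule, chosen only to show the error path.
    if config.max_pages == 0 {
        return Err(CrawlError::InvalidConfig(
            "max_pages must be at least 1".to_string(),
        ));
    }
    Ok(CrawlEngineHandle { config })
}

/// Hypothetical free function that borrows the handle, mirroring the
/// `scrape(&engine, url)` calling convention.
pub fn page_budget(engine: &CrawlEngineHandle) -> u32 {
    engine.config.max_pages
}
```

Because every operation borrows the handle rather than consuming it, one engine can be reused across many calls, which is also what makes it straightforward to expose through language bindings.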
Data flow
```mermaid
graph LR
    A[create_engine] --> B[CrawlEngineHandle]
    B --> S[scrape / crawl / map_urls / interact / batch_*]
    S --> H[HTTP fetch + middleware]
    H --> R{Content type}
    R -->|HTML| E[Extraction pipeline]
    R -->|PDF / binary| D[DownloadedDocument]
    E --> M[Markdown conversion]
    M --> O[ScrapeResult / CrawlResult / MapResult]
    D --> O
```
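The content-type branch in the diagram could look roughly like the sketch below. The `Routed` enum and `route` function are hypothetical names, and the exact MIME types treated as HTML, as well as the handling of a missing header, are assumptions rather than the crate's documented behaviour.

```rust
/// Hypothetical outcome of the "Content type" decision node.
#[derive(Debug, PartialEq)]
pub enum Routed {
    /// Goes to the extraction pipeline, then Markdown conversion.
    Html,
    /// Kept as a downloaded document (PDF or other binary payloads).
    Download,
}

/// Decide which branch a response takes from its Content-Type header value.
pub fn route(content_type: Option<&str>) -> Routed {
    // Drop parameters such as "; charset=utf-8" and normalise case,
    // since MIME types are case-insensitive.
    let mime = content_type
        .and_then(|value| value.split(';').next())
        .map(|value| value.trim().to_ascii_lowercase())
        .unwrap_or_default();

    match mime.as_str() {
        "text/html" | "application/xhtml+xml" => Routed::Html,
        // Assumption: anything else, including a missing header,
        // is treated as a download.
        _ => Routed::Download,
    }
}
```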
The middleware stack between the engine and the network applies per-domain rate limiting, conditional caching, and User-Agent rotation, plus optional tracing spans. WAF responses can trigger an automatic browser fallback when BrowserMode::Auto is set.

interact() bypasses the crawl/extraction pipeline and keeps one browser page open while it executes PageAction values such as click, type, wait, screenshot, JavaScript evaluation, and scrape. Chromiumoxide screenshots are compositor captures; native screenshots are deterministic PNG snapshots derived from the post-action HTML and are intended for inspection, not pixel-perfect Chrome parity.

The extraction pipeline is described in detail in Content Extraction.
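To make the per-domain rate limiting step of the middleware stack more concrete, here is one way such a limiter might be written. It is a sketch of a simple "minimum interval per domain" policy, not the crate's actual middleware: `DomainRateLimiter` and `reserve` are hypothetical names, and the real implementation may use a different algorithm (for example a token bucket). The current time is passed in explicitly so the logic stays deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical limiter enforcing a minimum gap between requests per domain.
pub struct DomainRateLimiter {
    min_interval: Duration,
    /// Earliest instant at which the next request to each domain may start.
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Reserve the next request slot for `domain` and return how long the
    /// caller should wait (possibly zero) before sending the request.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // If the domain is still cooling down, queue behind the existing slot;
        // otherwise the request can go out immediately.
        let slot = match self.next_allowed.get(domain) {
            Some(&at) if at > now => at,
            _ => now,
        };
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot - now
    }
}
```

Keying the state by domain means a slow or strict host does not delay requests to unrelated hosts, which matters for the batch and crawl entry points that fan out across many URLs.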
Bindings
Every binding consumes the same Rust core via FFI. The per-binding glue is generated by alef from the core types and a binding manifest (alef.toml); generated code lives under packages/<lang>/ and crates/kreuzcrawl-<binding>/. Binding-level differences (async runtimes, naming conventions, type marshalling) are handled by the generator — the core itself stays language-agnostic.
| Binding crate | Distribution | Mechanism |
|---|---|---|
| `crates/kreuzcrawl-py` | PyPI `kreuzcrawl` | PyO3 + maturin |
| `crates/kreuzcrawl-node` | npm `@kreuzberg/kreuzcrawl` | NAPI-RS |
| `crates/kreuzcrawl-php` | Composer `kreuzberg-dev/kreuzcrawl` | ext-php-rs |
| `crates/kreuzcrawl-wasm` | npm `@kreuzberg/kreuzcrawl-wasm` | wasm-bindgen |
| `crates/kreuzcrawl-ffi` | Shared library + cbindgen header | C FFI |
| `packages/ruby/ext/...` | RubyGems `kreuzcrawl` | Magnus + rb-sys |
| `packages/elixir/native/...` | Hex `kreuzcrawl` | Rustler NIF |
| `packages/go` | Go module `github.com/kreuzberg-dev/kreuzcrawl/packages/go` | cgo over C FFI |
| `packages/java` | Maven Central `dev.kreuzberg.kreuzcrawl:kreuzcrawl` | Java 25 Panama FFM |
| `packages/kotlin-android` | Maven Central `dev.kreuzberg.kreuzcrawl:kreuzcrawl-android` | Android AAR with JNI `.so` libraries |
| `packages/csharp` | NuGet `Kreuzcrawl` | .NET 10 P/Invoke |
| `packages/dart` | pub.dev `kreuzcrawl` | Dart FFI |
| `packages/swift` | Swift Package Manager | Swift over C FFI |
| `packages/zig` | `zig fetch --save` | Zig over C FFI |
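Several rows in the table go through a C FFI layer. The general pattern behind such a layer, an opaque pointer plus C-ABI functions with explicit ownership, can be sketched as below. Every name here (`DemoEngine`, `demo_engine_new`, and so on) is hypothetical and not taken from `crates/kreuzcrawl-ffi`; the real glue is generated by alef and may look quite different.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical engine state hidden behind an opaque pointer.
pub struct DemoEngine {
    max_pages: u32,
}

// Note: a real export would also carry a no-mangle attribute so the symbol
// name is stable for C callers; it is left out to keep the sketch compiling
// unchanged across Rust editions.

/// Allocate an engine and hand ownership to the caller as a raw pointer.
pub extern "C" fn demo_engine_new(max_pages: u32) -> *mut DemoEngine {
    Box::into_raw(Box::new(DemoEngine { max_pages }))
}

/// Read a field through the opaque pointer; returns 0 for a null handle.
///
/// # Safety
/// `engine` must be null or a live pointer from `demo_engine_new`.
pub unsafe extern "C" fn demo_engine_max_pages(engine: *const DemoEngine) -> u32 {
    if engine.is_null() {
        return 0;
    }
    unsafe { (*engine).max_pages }
}

/// Marshal a C string into Rust: 1 if it looks like an HTTP(S) URL,
/// 0 if it does not, -1 on null pointers or invalid UTF-8.
///
/// # Safety
/// `url` must be null or point to a NUL-terminated string.
pub unsafe extern "C" fn demo_engine_accepts(
    engine: *const DemoEngine,
    url: *const c_char,
) -> i32 {
    if engine.is_null() || url.is_null() {
        return -1;
    }
    let c_str = unsafe { CStr::from_ptr(url) };
    match c_str.to_str() {
        Ok(s) => (s.starts_with("http://") || s.starts_with("https://")) as i32,
        Err(_) => -1,
    }
}

/// Reclaim ownership and drop the engine. Passing null is a no-op.
///
/// # Safety
/// `engine` must not be used again after this call.
pub unsafe extern "C" fn demo_engine_free(engine: *mut DemoEngine) {
    if !engine.is_null() {
        drop(unsafe { Box::from_raw(engine) });
    }
}
```

Returning integer status codes instead of Rust `Result` values keeps the surface expressible in a C header, which is what lets cgo, Swift, Zig, and similar callers share one library.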
Feature gates
Cargo features keep the default build minimal — the default feature set is empty. The user-facing features are:
| Feature | Capability |
|---|---|
| `browser` | Headless-Chrome fallback for JS-heavy or WAF-protected pages. |
| `browser-native` | In-process native browser backend for rendering and page interaction. |
| `interact` | Compatibility alias for browser-backed page interaction. The public API is always compiled. |
| `tracing` | OpenTelemetry-compatible request spans. |
| `api` | `serve_api(...)`: Firecrawl v1-compatible REST server. |
| `mcp` | `start_mcp_server(...)`: Model Context Protocol server for AI-agent integration. |
| `mcp-http` | MCP over HTTP transport (implies `mcp` + `api`). |
| `warc` | WARC 1.1 output via `CrawlConfig::warc_output`. |
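As a small illustration of how Cargo feature gates work in Rust generally, the sketch below checks the feature names from the table at compile time. `enabled_features` and `api_only_demo` are hypothetical helpers written for this example; they are not part of the crate, and the crate's own gating is not shown in this document.

```rust
/// Report which of the documented feature names were switched on when this
/// code was compiled. `cfg!` expands to a constant `true` or `false`.
pub fn enabled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(feature = "browser") {
        enabled.push("browser");
    }
    if cfg!(feature = "browser-native") {
        enabled.push("browser-native");
    }
    if cfg!(feature = "interact") {
        enabled.push("interact");
    }
    if cfg!(feature = "tracing") {
        enabled.push("tracing");
    }
    if cfg!(feature = "api") {
        enabled.push("api");
    }
    if cfg!(feature = "mcp") {
        enabled.push("mcp");
    }
    if cfg!(feature = "mcp-http") {
        enabled.push("mcp-http");
    }
    if cfg!(feature = "warc") {
        enabled.push("warc");
    }
    enabled
}

/// With the attribute form, the item is removed from the build entirely
/// unless the feature is enabled, so its dependencies need not be compiled.
#[cfg(feature = "api")]
pub fn api_only_demo() -> &'static str {
    "compiled only when the `api` feature is on"
}
```

With no features enabled, `enabled_features()` returns an empty list, matching an empty default feature set. The attribute form is the one that actually keeps optional dependencies out of a minimal build.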