Kreuzcrawl
kreuzcrawl¶
High-performance web crawling and scraping with a Rust core and native bindings for 14 languages. Always-on HTML→Markdown conversion, structured metadata, and a headless-Chrome fallback for WAF-protected pages — usable as a library, CLI, REST API, MCP server, or Docker image.
Why Kreuzcrawl¶
- Flexible Crawling
BFS, DFS, BestFirst, and Adaptive traversal with concurrent fetching, streaming events, and batch crawl/scrape.
- Always-On Markdown
Every fetched page is converted to clean, LLM-ready Markdown with citation tracking and content pruning.
- Rich Metadata
PageMetadata carries Open Graph, Twitter Card, Dublin Core, article fields, JSON-LD, links, images, feeds, favicons, and hreflang.
- Browser Fallback
Headless Chrome via chromiumoxide. Auto-detects WAF blocks across 8 vendors (Cloudflare, Akamai, Imperva, DataDome, PerimeterX, Sucuri, F5, AWS-WAF) and retries through a legitimate Chrome fingerprint.
- MCP & REST Servers
Built-in MCP server for AI agents, REST API for service deployments, both gated behind cargo features.
- WARC Output
Standards-compliant WARC archiving for web preservation and audit-grade compliance pipelines.
- 14 Language Bindings
Native bindings for Rust, Python, TypeScript, Go, Java, Kotlin (Android), C#, Ruby, PHP, Elixir, Dart, Swift, Zig, and WebAssembly — plus a C FFI surface for everything else.
Language Support¶
| Language | Package | Docs |
|---|---|---|
| Rust | cargo add kreuzcrawl |
API Reference |
| Python | pip install kreuzcrawl |
API Reference |
| TypeScript / Node | npm install @kreuzberg/kreuzcrawl |
API Reference |
| WebAssembly | npm install @kreuzberg/kreuzcrawl-wasm |
API Reference |
| Go | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go |
API Reference |
| Java | Maven Central dev.kreuzberg.kreuzcrawl:kreuzcrawl |
API Reference |
| Kotlin (Android) | Maven Central dev.kreuzberg.kreuzcrawl:kreuzcrawl-android |
API Reference |
| C# | dotnet add package Kreuzcrawl |
API Reference |
| Ruby | gem install kreuzcrawl |
API Reference |
| PHP | composer require kreuzberg-dev/kreuzcrawl |
API Reference |
| Elixir | {:kreuzcrawl, "~> 0.3.0-rc.19"} |
API Reference |
| Dart / Flutter | dart pub add kreuzcrawl |
API Reference |
| Swift | Swift Package Manager | API Reference |
| Zig | zig fetch --save from GitHub |
API Reference |
| C (FFI) | Shared library + header | API Reference |
| CLI | cargo install kreuzcrawl-cli |
CLI Guide |
| Docker | ghcr.io/kreuzberg-dev/kreuzcrawl |
Docker Guide |
Choosing between TypeScript packages
@kreuzberg/kreuzcrawl — Native NAPI-RS bindings. Use for Node.js servers and CLI tools. Full feature set including the browser fallback.
@kreuzberg/kreuzcrawl-wasm — Pure WebAssembly. Use for browsers, Cloudflare Workers, Deno, Bun, and serverless. No headless-Chrome support.
Quick Example¶
use kreuzcrawl::{CrawlConfig, ContentConfig, create_engine, crawl};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = CrawlConfig {
max_depth: Some(2),
max_pages: Some(50),
content: ContentConfig::default(),
..Default::default()
};
let engine = create_engine(Some(config))?;
let result = crawl(&engine, "https://example.com").await?;
for page in &result.pages {
let title = page.metadata.title.as_deref().unwrap_or("(no title)");
println!("{} — {}", page.url, title);
}
Ok(())
}
import asyncio
from kreuzcrawl import CrawlConfig, create_engine, crawl
async def main():
engine = create_engine(CrawlConfig(max_depth=2, max_pages=50))
result = await crawl(engine, "https://example.com")
for page in result.pages:
print(f"{page.url} — {page.metadata.title or '(no title)'}")
asyncio.run(main())
Part of kreuzberg.dev¶
Document intelligence — text, tables, and metadata from 91+ file formats with optional OCR.
Managed document-extraction API with SDKs, dashboards, and observability built in.
The HTML→Markdown engine powering Kreuzcrawl's always-on conversion. Use it stand-alone for static HTML.
Multi-provider LLM orchestration with cost and token accounting.
306 tree-sitter grammars and code-intelligence primitives. Used downstream when crawled pages contain code.
Join the Kreuzberg community for help, roadmap discussion, and announcements.
Explore the Docs¶
- Get Started
Install Kreuzcrawl and run your first crawl in under five minutes.
- Guides
Crawling, scraping, URL discovery, browser automation, WARC output, and deployment.
- Concepts
Public surface, data flow, the binding matrix, feature gates, and the content-extraction pipeline.
- Reference
Per-language API docs, the configuration schema, type catalogue, and error matrix.
- CLI & Servers
The kreuzcrawl CLI, REST API server, and MCP server for AI agents.
- Features
Complete feature breakdown: crawl strategies, metadata extraction, browser fallback, WARC, MCP, REST.
Getting Help¶
- Bugs & feature requests — Open an issue on GitHub
- Community chat — Join the Discord
- Contributing — Read the contributor guide