Kreuzcrawl

High-performance web crawling and scraping with a Rust core and native bindings for 14 languages. Always-on HTML→Markdown conversion, structured metadata, and a headless-Chrome fallback for WAF-protected pages — usable as a library, CLI, REST API, MCP server, or Docker image.

Why Kreuzcrawl

Flexible Crawling

BFS, DFS, BestFirst, and Adaptive traversal with concurrent fetching, streaming events, and batch crawl/scrape.
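
To make the difference between BFS and DFS traversal concrete, here is a minimal Python sketch over a hypothetical in-memory link graph. It does not use Kreuzcrawl's API or reflect its implementation: `LINKS` and `crawl_order` are illustrative names, and the `max_depth` and `max_pages` parameters simply mirror the limits shown in the Quick Example.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (assumption for illustration).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}

def crawl_order(start, strategy="bfs", max_depth=2, max_pages=50):
    """Return the order in which pages would be visited."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier and len(order) < max_pages:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

print(crawl_order("/", "bfs"))  # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl_order("/", "dfs"))  # ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
```

BFS reaches every page at one depth before going deeper, which suits broad site surveys; DFS follows one branch to the depth limit first. A best-first strategy would replace the deque with a priority queue ordered by a relevance score.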

Always-On Markdown

Every fetched page is converted to clean, LLM-ready Markdown with citation tracking and content pruning.

Rich Metadata

PageMetadata carries Open Graph, Twitter Card, Dublin Core, article fields, JSON-LD, links, images, feeds, favicons, and hreflang.
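
As a rough illustration of what extracting this kind of metadata involves, the sketch below pulls the page title plus Open Graph and Twitter Card tags out of a sample document using only Python's standard library. It is not how Kreuzcrawl builds `PageMetadata`; the `MetaExtractor` class and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects <title> text and Open Graph / Twitter Card <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.open_graph = {}
        self.twitter = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses property="og:..."; Twitter Cards use name="twitter:...".
            key = attrs.get("property") or attrs.get("name") or ""
            content = attrs.get("content")
            if content is None:
                return
            if key.startswith("og:"):
                self.open_graph[key[len("og:"):]] = content
            elif key.startswith("twitter:"):
                self.twitter[key[len("twitter:"):]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

SAMPLE = """<html><head>
<title>Example Domain</title>
<meta property="og:title" content="Example">
<meta property="og:type" content="website">
<meta name="twitter:card" content="summary">
</head><body></body></html>"""

parser = MetaExtractor()
parser.feed(SAMPLE)
print(parser.title, parser.open_graph, parser.twitter)
```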

Browser Fallback

Headless Chrome via chromiumoxide. Auto-detects WAF blocks across 8 vendors (Cloudflare, Akamai, Imperva, DataDome, PerimeterX, Sucuri, F5, AWS-WAF) and retries through a legitimate Chrome fingerprint.
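
The detect-then-retry flow can be pictured with the following Python sketch. Everything in it is an assumption for illustration: the status codes and body markers are generic heuristics rather than Kreuzcrawl's actual detection rules, and `Response`, `looks_blocked`, and `fetch_with_fallback` are hypothetical names, with stub fetchers standing in for the HTTP client and the browser.

```python
from dataclasses import dataclass

# Illustrative markers only; real WAF detection is vendor-specific and changes over time.
BLOCK_STATUSES = {403, 429, 503}
BODY_MARKERS = ("captcha", "access denied", "checking your browser")

@dataclass
class Response:
    status: int
    body: str = ""

def looks_blocked(resp):
    """Heuristic: a blocking status code combined with a challenge-like body."""
    if resp.status not in BLOCK_STATUSES:
        return False
    text = resp.body.lower()
    return any(marker in text for marker in BODY_MARKERS)

def fetch_with_fallback(url, http_fetch, browser_fetch):
    """Try the cheap HTTP path first; escalate to a browser only when blocked."""
    resp = http_fetch(url)
    if looks_blocked(resp):
        return browser_fetch(url), "browser"
    return resp, "http"

# Stub fetchers standing in for a real HTTP client and a headless browser.
def fake_http(url):
    return Response(403, "<h1>Access denied</h1>")

def fake_browser(url):
    return Response(200, "<h1>Welcome</h1>")

resp, via = fetch_with_fallback("https://example.com", fake_http, fake_browser)
print(via, resp.status)  # browser 200
```

Keeping the browser as a fallback rather than the default means the expensive rendering path is paid for only by pages that actually need it.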

MCP & REST Servers

A built-in MCP server for AI agents and a REST API for service deployments, each gated behind its own Cargo feature.

WARC Output

Standards-compliant WARC archiving for web preservation and audit-grade compliance pipelines.
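
For readers unfamiliar with the format, this Python sketch serializes a single WARC/1.1 `resource` record by hand to show the on-disk layout: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It is a simplified illustration, not Kreuzcrawl's writer, and `warc_resource_record` is a hypothetical helper. A crawler archiving HTTP traffic would more typically emit `request` and `response` records.

```python
import uuid
from datetime import datetime, timezone

def warc_resource_record(url, payload, content_type="text/html"):
    """Serialize one WARC/1.1 'resource' record as bytes."""
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {url}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines are CRLF-terminated; an empty line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"

record = warc_resource_record("https://example.com/", b"<html>hello</html>")
print(record.decode("utf-8"))
```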

14 Language Bindings

Native bindings for Rust, Python, TypeScript, Go, Java, Kotlin (Android), C#, Ruby, PHP, Elixir, Dart, Swift, Zig, and WebAssembly — plus a C FFI surface for everything else.

→ See all features

Language Support

Language	Package	Docs
Rust	`cargo add kreuzcrawl`	API Reference
Python	`pip install kreuzcrawl`	API Reference
TypeScript / Node	`npm install @kreuzberg/kreuzcrawl`	API Reference
WebAssembly	`npm install @kreuzberg/kreuzcrawl-wasm`	API Reference
Go	`go get github.com/kreuzberg-dev/kreuzcrawl/packages/go`	API Reference
Java	Maven Central `dev.kreuzberg.kreuzcrawl:kreuzcrawl`	API Reference
Kotlin (Android)	Maven Central `dev.kreuzberg.kreuzcrawl:kreuzcrawl-android`	API Reference
C#	`dotnet add package Kreuzcrawl`	API Reference
Ruby	`gem install kreuzcrawl`	API Reference
PHP	`composer require kreuzberg-dev/kreuzcrawl`	API Reference
Elixir	`{:kreuzcrawl, "~> 0.3.0-rc.19"}`	API Reference
Dart / Flutter	`dart pub add kreuzcrawl`	API Reference
Swift	Swift Package Manager	API Reference
Zig	`zig fetch --save` from GitHub	API Reference
C (FFI)	Shared library + header	API Reference
CLI	`cargo install kreuzcrawl-cli`	CLI Guide
Docker	`ghcr.io/kreuzberg-dev/kreuzcrawl`	Docker Guide

Choosing between TypeScript packages

@kreuzberg/kreuzcrawl — Native NAPI-RS bindings. Use for Node.js servers and CLI tools. Full feature set including the browser fallback.

@kreuzberg/kreuzcrawl-wasm — Pure WebAssembly. Use for browsers, Cloudflare Workers, Deno, Bun, and serverless. No headless-Chrome support.

Quick Example

src/main.rs

use kreuzcrawl::{CrawlConfig, ContentConfig, create_engine, crawl};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = CrawlConfig {
        max_depth: Some(2),
        max_pages: Some(50),
        content: ContentConfig::default(),
        ..Default::default()
    };
    let engine = create_engine(Some(config))?;

    let result = crawl(&engine, "https://example.com").await?;
    for page in &result.pages {
        let title = page.metadata.title.as_deref().unwrap_or("(no title)");
        println!("{} — {}", page.url, title);
    }
    Ok(())
}

main.py

import asyncio
from kreuzcrawl import CrawlConfig, create_engine, crawl

async def main():
    engine = create_engine(CrawlConfig(max_depth=2, max_pages=50))
    result = await crawl(engine, "https://example.com")
    for page in result.pages:
        print(f"{page.url} — {page.metadata.title or '(no title)'}")

asyncio.run(main())

index.ts

import { createEngine, crawl } from "@kreuzberg/kreuzcrawl";

const engine = createEngine({ maxDepth: 2, maxPages: 50 });
const result = await crawl(engine, "https://example.com");

for (const page of result.pages) {
  console.log(`${page.url} — ${page.metadata.title ?? "(no title)"}`);
}

Part of kreuzberg.dev

Kreuzberg

Document intelligence — text, tables, and metadata from 91+ file formats with optional OCR.

Kreuzberg Cloud

Managed document-extraction API with SDKs, dashboards, and observability built in.

html-to-markdown

The HTML→Markdown engine powering Kreuzcrawl's always-on conversion. Use it stand-alone for static HTML.

liter-llm

Multi-provider LLM orchestration with cost and token accounting.

tree-sitter-language-pack

306 tree-sitter grammars and code-intelligence primitives. Used downstream when crawled pages contain code.

Discord

Join the Kreuzberg community for help, roadmap discussion, and announcements.

Explore the Docs

Get Started

Install Kreuzcrawl and run your first crawl in under five minutes.

Quick Start

Guides

Crawling, scraping, URL discovery, browser automation, WARC output, and deployment.

All Guides

Concepts

Public surface, data flow, the binding matrix, feature gates, and the content-extraction pipeline.

Architecture

Reference

Per-language API docs, the configuration schema, type catalogue, and error matrix.

References

CLI & Servers

The kreuzcrawl CLI, REST API server, and MCP server for AI agents.

CLI Usage

Features

Complete feature breakdown: crawl strategies, metadata extraction, browser fallback, WARC, MCP, REST.

Features

Getting Help

Bugs & feature requests — Open an issue on GitHub
Community chat — Join the Discord
Contributing — Read the contributor guide
