Skip to content

Kreuzcrawl

kreuzcrawl

High-performance web crawling and scraping with a Rust core and native bindings for 14 languages. Always-on HTML→Markdown conversion, structured metadata, and a headless-Chrome fallback for WAF-protected pages — usable as a library, CLI, REST API, MCP server, or Docker image.


Why Kreuzcrawl

  • Flexible Crawling

BFS, DFS, BestFirst, and Adaptive traversal with concurrent fetching, streaming events, and batch crawl/scrape.

  • Always-On Markdown

Every fetched page is converted to clean, LLM-ready Markdown with citation tracking and content pruning.

  • Rich Metadata

PageMetadata carries Open Graph, Twitter Card, Dublin Core, article fields, JSON-LD, links, images, feeds, favicons, and hreflang.

  • Browser Fallback

Headless Chrome via chromiumoxide. Auto-detects WAF blocks across 8 vendors (Cloudflare, Akamai, Imperva, DataDome, PerimeterX, Sucuri, F5, AWS-WAF) and retries through a legitimate Chrome fingerprint.

  • MCP & REST Servers

Built-in MCP server for AI agents, REST API for service deployments, both gated behind cargo features.

  • WARC Output

Standards-compliant WARC archiving for web preservation and audit-grade compliance pipelines.

  • 14 Language Bindings

Native bindings for Rust, Python, TypeScript, Go, Java, Kotlin (Android), C#, Ruby, PHP, Elixir, Dart, Swift, Zig, and WebAssembly — plus a C FFI surface for everything else.

See all features


Language Support

Language Package Docs
Rust cargo add kreuzcrawl API Reference
Python pip install kreuzcrawl API Reference
TypeScript / Node npm install @kreuzberg/kreuzcrawl API Reference
WebAssembly npm install @kreuzberg/kreuzcrawl-wasm API Reference
Go go get github.com/kreuzberg-dev/kreuzcrawl/packages/go API Reference
Java Maven Central dev.kreuzberg.kreuzcrawl:kreuzcrawl API Reference
Kotlin (Android) Maven Central dev.kreuzberg.kreuzcrawl:kreuzcrawl-android API Reference
C# dotnet add package Kreuzcrawl API Reference
Ruby gem install kreuzcrawl API Reference
PHP composer require kreuzberg-dev/kreuzcrawl API Reference
Elixir {:kreuzcrawl, "~> 0.3.0-rc.19"} API Reference
Dart / Flutter dart pub add kreuzcrawl API Reference
Swift Swift Package Manager API Reference
Zig zig fetch --save from GitHub API Reference
C (FFI) Shared library + header API Reference
CLI cargo install kreuzcrawl-cli CLI Guide
Docker ghcr.io/kreuzberg-dev/kreuzcrawl Docker Guide

Choosing between TypeScript packages

@kreuzberg/kreuzcrawl — Native NAPI-RS bindings. Use for Node.js servers and CLI tools. Full feature set including the browser fallback.

@kreuzberg/kreuzcrawl-wasm — Pure WebAssembly. Use for browsers, Cloudflare Workers, Deno, Bun, and serverless. No headless-Chrome support.


Quick Example

src/main.rs
use kreuzcrawl::{CrawlConfig, ContentConfig, create_engine, crawl};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = CrawlConfig {
        max_depth: Some(2),
        max_pages: Some(50),
        content: ContentConfig::default(),
        ..Default::default()
    };
    let engine = create_engine(Some(config))?;

    let result = crawl(&engine, "https://example.com").await?;
    for page in &result.pages {
        let title = page.metadata.title.as_deref().unwrap_or("(no title)");
        println!("{} — {}", page.url, title);
    }
    Ok(())
}
main.py
import asyncio
from kreuzcrawl import CrawlConfig, create_engine, crawl

async def main():
    engine = create_engine(CrawlConfig(max_depth=2, max_pages=50))
    result = await crawl(engine, "https://example.com")
    for page in result.pages:
        print(f"{page.url}{page.metadata.title or '(no title)'}")

asyncio.run(main())
index.ts
import { createEngine, crawl } from "@kreuzberg/kreuzcrawl";

const engine = createEngine({ maxDepth: 2, maxPages: 50 });
const result = await crawl(engine, "https://example.com");

for (const page of result.pages) {
  console.log(`${page.url}${page.metadata.title ?? "(no title)"}`);
}

Part of kreuzberg.dev

Document intelligence — text, tables, and metadata from 91+ file formats with optional OCR.

Managed document-extraction API with SDKs, dashboards, and observability built in.

The HTML→Markdown engine powering Kreuzcrawl's always-on conversion. Use it stand-alone for static HTML.

Multi-provider LLM orchestration with cost and token accounting.

306 tree-sitter grammars and code-intelligence primitives. Used downstream when crawled pages contain code.

Join the Kreuzberg community for help, roadmap discussion, and announcements.


Explore the Docs

  • Get Started

Install Kreuzcrawl and run your first crawl in under five minutes.

Quick Start

  • Guides

Crawling, scraping, URL discovery, browser automation, WARC output, and deployment.

All Guides

  • Concepts

Public surface, data flow, the binding matrix, feature gates, and the content-extraction pipeline.

Architecture

  • Reference

Per-language API docs, the configuration schema, type catalogue, and error matrix.

References

  • CLI & Servers

The kreuzcrawl CLI, REST API server, and MCP server for AI agents.

CLI Usage

  • Features

Complete feature breakdown: crawl strategies, metadata extraction, browser fallback, WARC, MCP, REST.

Features


Getting Help

Edit this page on GitHub