
Competitive Landscape

How Kreuzcrawl compares to other web crawling and scraping tools.



Kreuzcrawl vs Firecrawl

Firecrawl is a TypeScript-based web scraping platform built for SaaS-first delivery. It provides a hosted API, self-hosted Docker deployment, and SDKs for Python, JavaScript, Go, Java, Elixir, and Rust. Firecrawl focuses on making web data accessible through a managed service with built-in page interaction, LLM extraction, and stealth capabilities.

Kreuzcrawl is a Rust library and CLI designed to be embedded directly in your application. It provides a trait-based engine with a Tower middleware stack, giving developers fine-grained control over every stage of the crawl pipeline.

The fundamental difference: Firecrawl is a service you call; Kreuzcrawl is a library you embed.


Feature Comparison

| Feature | Kreuzcrawl | Firecrawl |
| --- | --- | --- |
| Language | Rust | TypeScript |
| License | Elastic-2.0 | AGPL-3.0 |
| Distribution | Library + CLI | SaaS + Self-hosted |
| Headless browser | chromiumoxide | Playwright |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS |
| Concurrent fetching | JoinSet + Semaphore | Bull queue workers |
| Streaming events | Real-time | SSE / polling |
| Batch operations | batch_crawl() | Async API |
| Sitemap parsing | XML, gzip, index | Yes |
| robots.txt | RFC 9309 | Yes |
| Markdown conversion | Always-on + structure preservation | Primary output |
| Fit markdown (LLM-pruned) | BM25 + heuristic | No |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Basic |
| JSON-LD extraction | Full | No |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | No |
| LLM extraction | Multi-provider (liter-llm) | 10+ providers |
| Cost tracking | USD + tokens | Yes |
| PDF extraction | No | Yes (FirePDF) |
| WAF detection | 8 vendors | Cloud only |
| Stealth / anti-detect | UA rotation | Stealth injection |
| Proxy support | HTTP/HTTPS/SOCKS5 | Rotating proxies |
| Screenshot capture | Stub | Yes |
| Page interaction | No | Click, scroll, type |
| REST API server | No | Yes (primary interface) |
| MCP server | No | Yes |
| CLI | scrape/crawl/map | No |
| Language SDKs | Rust only | Python, JS, Go, Java, Elixir, Rust |
| Disk cache | blake3 + TTL | Redis |
| Per-domain rate limiting | Tower layer | Global RPS |
| HTTP caching (ETag) | Yes | No |
| Pluggable traits | 7 traits | No |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | No |
| Adaptive crawling | Term saturation | No |
| Search integration | No | Yes |

Where Kreuzcrawl Wins

Performance and resource efficiency. Kreuzcrawl is compiled Rust with zero garbage collection overhead. For high-volume crawling workloads, this translates to lower memory usage and higher throughput per CPU core.

Architectural extensibility. Seven pluggable traits (Frontier, RateLimiter, CrawlStore, EventEmitter, CrawlStrategy, ContentFilter, CrawlCache) let you replace any component of the crawl pipeline. The Tower middleware stack allows composable request/response processing. Firecrawl has no equivalent extension mechanism.

Crawl intelligence. BM25 relevance scoring and adaptive crawling with term saturation detection allow Kreuzcrawl to prioritize high-value pages and terminate early when content becomes repetitive. Firecrawl offers only BFS traversal.
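To make the relevance-scoring idea concrete, here is a minimal, std-only sketch of Okapi BM25 over a handful of page texts. It is not Kreuzcrawl's implementation: the function names `tokenize` and `bm25_scores`, the whitespace/punctuation tokenizer, and the smoothed IDF variant are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score each document against `query` with Okapi BM25.
/// `k1` and `b` are the usual tuning parameters (1.2 and 0.75 are common).
fn bm25_scores(docs: &[&str], query: &str, k1: f64, b: f64) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    // Guard against division by zero when every document is empty.
    let avg_len = if total_len == 0 { 1.0 } else { total_len as f64 / n };
    let query_terms = tokenize(query);

    // Document frequency: number of documents containing each query term.
    let mut df: HashMap<String, f64> = HashMap::new();
    for term in &query_terms {
        let count = tokenized.iter().filter(|d| d.contains(term)).count() as f64;
        df.insert(term.clone(), count);
    }

    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    let n_t = df[term];
                    // Smoothed IDF variant that never goes negative.
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

A best-first crawler could use scores like these to order its queue, visiting pages (or links whose anchor text) score highest for the query first. Documents that contain none of the query terms score exactly zero, and shorter documents with the same term counts score higher because of the length normalization controlled by `b`.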

Rich metadata extraction. Kreuzcrawl extracts 40+ metadata fields including full JSON-LD, Dublin Core, feed discovery, and hreflang -- significantly more than Firecrawl's basic metadata output.

Per-domain rate limiting. Kreuzcrawl's Tower-based rate limiter operates per domain with backoff, while Firecrawl uses a global RPS limit. Per-domain limiting is more respectful to target sites and less likely to trigger blocks.
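The difference between per-domain and global limiting can be sketched with a small std-only scheduler. This is an illustration under stated assumptions, not Kreuzcrawl's Tower layer: `DomainLimiter`, its methods, and the doubling backoff policy are hypothetical, and the current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain limiter: every host has its own schedule, so a
/// slow or throttled host does not hold back requests to other hosts.
struct DomainLimiter {
    base_delay: Duration,
    max_delay: Duration,
    // host -> (earliest time the next request may start, current delay)
    state: HashMap<String, (Instant, Duration)>,
}

impl DomainLimiter {
    fn new(base_delay: Duration, max_delay: Duration) -> Self {
        Self { base_delay, max_delay, state: HashMap::new() }
    }

    /// Returns how long the caller should wait before requesting from
    /// `host`, and reserves the following slot for that host.
    fn acquire(&mut self, host: &str, now: Instant) -> Duration {
        let base = self.base_delay;
        let entry = self.state.entry(host.to_string()).or_insert((now, base));
        let wait = entry.0.saturating_duration_since(now);
        // The next slot opens one delay after this request is allowed to start.
        entry.0 = now + wait + entry.1;
        wait
    }

    /// Double the delay used for future slots on `host` (for example after
    /// an HTTP 429 response), capped at `max_delay`.
    fn backoff(&mut self, host: &str) {
        let max = self.max_delay;
        if let Some(entry) = self.state.get_mut(host) {
            entry.1 = (entry.1 * 2).min(max);
        }
    }

    /// Restore the base delay for `host` after it recovers.
    fn reset(&mut self, host: &str) {
        let base = self.base_delay;
        if let Some(entry) = self.state.get_mut(host) {
            entry.1 = base;
        }
    }
}
```

With a global limit, a burst against one host would consume the budget for all hosts; here a second request to the same host is delayed while a request to a different host proceeds immediately.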

No cloud dependency. Kreuzcrawl runs entirely in your infrastructure with no external service calls. This matters for air-gapped environments, data sovereignty requirements, and cost predictability.


Where Firecrawl Wins

Managed infrastructure. Firecrawl's cloud service handles browser management, proxy rotation, and scaling. You do not need to provision headless browsers or manage infrastructure.

Page interaction. Firecrawl supports clicking, scrolling, typing, and other browser interactions. This is essential for SPAs, infinite scroll pages, and sites that require form submission before revealing content. Kreuzcrawl does not currently support page interaction.

Broad LLM provider support. Firecrawl integrates with 10+ LLM providers out of the box. While Kreuzcrawl supports multiple providers through liter-llm, Firecrawl's ecosystem is more mature here.

Language SDK breadth. Firecrawl provides official SDKs for Python, JavaScript, Go, Java, Elixir, and Rust. Kreuzcrawl is currently Rust-only (polyglot bindings are in development).

PDF extraction. FirePDF handles PDF content extraction natively. Kreuzcrawl does not currently extract PDF content.

REST API and MCP. Firecrawl's primary interface is a REST API, making integration trivial for any language. It also provides an MCP server for AI agent workflows.

Screenshot capture. Firecrawl captures full-page screenshots. Kreuzcrawl has only a stub implementation.


When to Use Which

Choose Kreuzcrawl when:

  • You need to embed crawling directly in a Rust application
  • You require custom crawl strategies, storage backends, or middleware
  • You want per-domain rate limiting and HTTP caching (ETag/Last-Modified)
  • You need rich metadata extraction (JSON-LD, Dublin Core, feeds)
  • You want relevance-based crawling with BM25 scoring
  • You are building a pipeline where you control the full stack
  • Data sovereignty or air-gapped deployment is a requirement

Choose Firecrawl when:

  • You want a managed service with minimal infrastructure management
  • You need page interaction (clicking, scrolling, form filling)
  • You need SDKs in Python, JavaScript, Go, or Java today
  • You need PDF extraction
  • You want a REST API that any language can call immediately
  • You prefer SaaS pricing over self-managed infrastructure costs

License Comparison

| | Kreuzcrawl | Firecrawl |
| --- | --- | --- |
| License | Elastic-2.0 | AGPL-3.0 |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes, but must open-source if hosting |
| Hosting restriction | Cannot provide as a managed crawling service | Must open-source all modifications when hosting |
| Embedding in proprietary software | Yes | Requires AGPL compliance (copyleft) |

Firecrawl's AGPL-3.0 license requires that any modifications to the software be open-sourced when the software is offered as a network service. This has significant implications for companies that want to self-host a modified version. Kreuzcrawl's Elastic-2.0 license allows proprietary modifications but prohibits offering Kreuzcrawl itself as a managed service.



Kreuzcrawl vs Spider

Spider is a mature Rust web crawling library (v2.48+) with a broad feature set including WARC output, search provider integration, AIMD-based concurrency control, and SDKs for Python and Node.js. Spider has been in development longer and has a wider deployment base.

Both Kreuzcrawl and Spider are Rust-native crawling libraries, making this the most direct architectural comparison in this series. The key differences lie in extensibility approach, content intelligence, and middleware design.


Feature Comparison

| Feature | Kreuzcrawl | Spider |
| --- | --- | --- |
| Language | Rust | Rust |
| License | Elastic-2.0 | MIT |
| Distribution | Library + CLI | Library + CLI + SaaS |
| Headless browser | chromiumoxide | chromey / WebDriver |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS |
| Concurrent fetching | JoinSet + Semaphore | Tokio multi-thread + AIMD |
| Streaming events | Real-time | Subscriber channels |
| Batch operations | batch_crawl() | No |
| Sitemap parsing | XML, gzip, index | Yes |
| robots.txt | RFC 9309 | Yes, with caching |
| Markdown conversion | Always-on + structure preservation | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | No |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Basic |
| JSON-LD extraction | Full | No |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | No |
| LLM extraction | Multi-provider (liter-llm) | OpenAI, Gemini |
| Cost tracking | USD + tokens | No |
| PDF extraction | No | No |
| WAF detection | 8 vendors | Smart mode (auto-escalate) |
| Stealth / anti-detect | UA rotation | ua_generator |
| Proxy support | HTTP/HTTPS/SOCKS5 | HTTP/HTTPS/SOCKS + Cloud |
| User-Agent rotation | Tower layer | Yes |
| Screenshot capture | Stub | Yes |
| Page interaction | No | Agent automation |
| REST API server | No | No |
| MCP server | No | Yes |
| CLI | scrape/crawl/map | Yes |
| Language SDKs | Rust only | Rust, Python, Node.js |
| Disk cache | blake3 + TTL | SQLite |
| Per-domain rate limiting | Tower layer | Token bucket + auto-throttle |
| HTTP caching (ETag) | Yes | Yes |
| Pluggable traits | 7 traits | No |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | No |
| Adaptive crawling | Term saturation | No |
| Asset download + dedup | SHA-256 | No |
| Search integration | No | Serper, Brave, Bing, Tavily |
| WARC output | No | Yes |

Where Kreuzcrawl Wins

Pluggable architecture. Kreuzcrawl defines 7 traits (Frontier, RateLimiter, CrawlStore, EventEmitter, CrawlStrategy, ContentFilter, CrawlCache) that let you replace any component. Spider does not expose equivalent extension points. If you need a custom URL queue, a custom storage backend, or a custom traversal algorithm, Kreuzcrawl's trait system makes this straightforward.
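As a rough picture of what a pluggable URL queue looks like, the sketch below defines a frontier trait and swaps a breadth-first queue for a depth-first stack without touching the crawl loop. Only the name `Frontier` comes from the list above; the method signatures, the `BfsFrontier` and `DfsFrontier` types, and the `crawl_order` loop are hypothetical and may not match Kreuzcrawl's actual trait.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical shape of a frontier: the engine only pushes and pops URLs.
trait Frontier {
    fn push(&mut self, url: String, depth: usize);
    fn pop(&mut self) -> Option<(String, usize)>;
}

/// FIFO queue: pages are visited in breadth-first order.
#[derive(Default)]
struct BfsFrontier {
    queue: VecDeque<(String, usize)>,
}

impl Frontier for BfsFrontier {
    fn push(&mut self, url: String, depth: usize) {
        self.queue.push_back((url, depth));
    }
    fn pop(&mut self) -> Option<(String, usize)> {
        self.queue.pop_front()
    }
}

/// LIFO stack: pages are visited in depth-first order.
#[derive(Default)]
struct DfsFrontier {
    stack: Vec<(String, usize)>,
}

impl Frontier for DfsFrontier {
    fn push(&mut self, url: String, depth: usize) {
        self.stack.push((url, depth));
    }
    fn pop(&mut self) -> Option<(String, usize)> {
        self.stack.pop()
    }
}

/// Engine loop that is unaware of which frontier it drives. `links` stands
/// in for fetching a page and extracting its outgoing links.
fn crawl_order(
    frontier: &mut dyn Frontier,
    seed: &str,
    links: impl Fn(&str) -> Vec<String>,
    max_depth: usize,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut order = Vec::new();
    seen.insert(seed.to_string());
    frontier.push(seed.to_string(), 0);
    while let Some((url, depth)) = frontier.pop() {
        order.push(url.clone());
        if depth >= max_depth {
            continue;
        }
        for next in links(&url) {
            // Only enqueue URLs that have not been seen before.
            if seen.insert(next.clone()) {
                frontier.push(next, depth + 1);
            }
        }
    }
    order
}
```

Because the loop depends only on the trait, a priority-queue frontier for best-first traversal or a persistent queue backed by a database could be dropped in the same way.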

Tower middleware stack. Kreuzcrawl's request/response pipeline is built on Tower services -- the same abstraction used by Axum and Tonic. This means you can compose layers for tracing, caching, rate limiting, and UA rotation in a standard, well-understood pattern. Spider handles these concerns internally without a composable middleware abstraction.
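The layering idea can be illustrated without pulling in Tower itself. The sketch below uses a simplified, synchronous `Service` trait as a stand-in: Tower's real `Service` trait is asynchronous and also has a readiness check, so treat every name here (`Service`, `Fetcher`, `Cache`, `Trace`) as hypothetical rather than as Tower's or Kreuzcrawl's API.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a Tower-style service: request in, response out.
trait Service {
    fn call(&mut self, url: &str) -> String;
}

/// Innermost service: pretends to fetch the page and counts real fetches.
struct Fetcher {
    fetches: usize,
}

impl Service for Fetcher {
    fn call(&mut self, url: &str) -> String {
        self.fetches += 1;
        format!("<html>{}</html>", url)
    }
}

/// Caching layer: answers repeated URLs without calling the inner service.
struct Cache<S> {
    inner: S,
    store: HashMap<String, String>,
}

impl<S: Service> Service for Cache<S> {
    fn call(&mut self, url: &str) -> String {
        if let Some(hit) = self.store.get(url) {
            return hit.clone();
        }
        let body = self.inner.call(url);
        self.store.insert(url.to_string(), body.clone());
        body
    }
}

/// Tracing layer: records every request before passing it inward.
struct Trace<S> {
    inner: S,
    log: Vec<String>,
}

impl<S: Service> Service for Trace<S> {
    fn call(&mut self, url: &str) -> String {
        self.log.push(url.to_string());
        self.inner.call(url)
    }
}
```

Each layer wraps any inner service, so a stack such as `Trace { inner: Cache { inner: Fetcher { .. }, .. }, .. }` is assembled by nesting, and layers can be added, removed, or reordered without changing the others. In this arrangement a repeated URL is logged twice but fetched only once.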

Crawl intelligence. Four traversal strategies (BFS, DFS, BestFirst, Adaptive) versus Spider's BFS-only approach. BM25 relevance scoring lets Kreuzcrawl prioritize pages that match a query, and adaptive crawling terminates early when term saturation is detected.
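One way to picture term-saturation stopping is a detector that tracks how many previously unseen terms each new page contributes. The sketch below is an assumption about how such a check could work, not Kreuzcrawl's algorithm: `SaturationDetector`, the `min_new_ratio` threshold, and the `window` of consecutive low-novelty pages are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical detector: the crawl is considered saturated once `window`
/// consecutive pages each contribute fewer than `min_new_ratio` new terms.
struct SaturationDetector {
    seen: HashSet<String>,
    min_new_ratio: f64,
    window: usize,
    low_streak: usize,
}

impl SaturationDetector {
    fn new(min_new_ratio: f64, window: usize) -> Self {
        Self { seen: HashSet::new(), min_new_ratio, window, low_streak: 0 }
    }

    /// Record one page's text and return true when the crawl looks saturated.
    fn observe(&mut self, text: &str) -> bool {
        let terms: HashSet<String> = text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
            .collect();
        // Fraction of this page's distinct terms that were never seen before.
        let ratio = if terms.is_empty() {
            0.0
        } else {
            let new_terms = terms.iter().filter(|t| !self.seen.contains(*t)).count();
            new_terms as f64 / terms.len() as f64
        };
        self.seen.extend(terms);
        if ratio < self.min_new_ratio {
            self.low_streak += 1;
        } else {
            // A sufficiently novel page resets the streak.
            self.low_streak = 0;
        }
        self.low_streak >= self.window
    }
}
```

A crawl loop would call `observe` after each page and stop fetching once it returns true, which is what allows early termination on repetitive sites.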

Metadata depth. Kreuzcrawl extracts 40+ metadata fields including full JSON-LD, Dublin Core, RSS/Atom/JSON Feed discovery, and hreflang links. Spider provides basic metadata extraction.

Fit markdown. BM25-based content pruning produces LLM-optimized markdown that reduces token consumption. Spider does not offer content pruning.

Strict configuration. Kreuzcrawl uses serde with deny_unknown_fields for configuration validation, so typos and unknown fields are rejected when the configuration is deserialized rather than silently ignored.


Where Spider Wins

Maturity and stability. Spider is at v2.48+ with a longer track record in production. It has had more time to handle edge cases across the wide variety of websites in the real world.

WARC output. Spider supports WARC (Web ARChive) format, which is the standard for web archival and replay. Kreuzcrawl does not currently produce WARC output.

Search provider integration. Spider integrates with Serper, Brave, Bing, and Tavily for search-driven crawling. This is useful for discovery workflows where you do not have a starting URL. Kreuzcrawl has no search integration.

AIMD concurrency control. Spider uses Additive Increase/Multiplicative Decrease (the algorithm behind TCP congestion control) to dynamically adjust concurrency based on server response. This is a sophisticated approach to avoiding rate limiting. Kreuzcrawl uses static semaphore-based concurrency with per-domain rate limiting.
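For readers unfamiliar with AIMD, the control rule itself is small. The following is a generic sketch of additive increase / multiplicative decrease applied to a concurrency limit; it is not Spider's code, and the `Aimd` struct, its fields, and the example parameters are assumptions.

```rust
/// Generic AIMD controller for a concurrency limit.
struct Aimd {
    limit: f64,
    min: f64,
    max: f64,
    /// Added to the limit after each successful response.
    increase: f64,
    /// Multiplied into the limit after a throttling signal (0 < factor < 1).
    decrease: f64,
}

impl Aimd {
    /// Additive increase: probe for more capacity slowly.
    fn on_success(&mut self) {
        self.limit = (self.limit + self.increase).min(self.max);
    }

    /// Multiplicative decrease: back off quickly when the server pushes back
    /// (for example with HTTP 429 or 503).
    fn on_throttle(&mut self) {
        self.limit = (self.limit * self.decrease).max(self.min);
    }

    /// Number of requests that may be in flight right now.
    fn concurrency(&self) -> usize {
        self.limit.floor() as usize
    }
}
```

Starting from a limit of 4 with `increase = 1.0` and `decrease = 0.5`, three successes raise the limit to 7 and a single throttling response halves it to 3.5, i.e. three concurrent requests. A static semaphore, by contrast, keeps the same limit regardless of how the server responds.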

Page interaction. Spider supports agent-based browser automation for interacting with page elements. Kreuzcrawl does not currently support page interaction.

Language SDKs. Spider provides official Python and Node.js bindings in addition to Rust. Kreuzcrawl is currently Rust-only.

MCP server. Spider provides an MCP server for AI agent integration. Kreuzcrawl does not.

MIT license. Spider's MIT license places no restrictions on commercial use, modification, or hosting beyond preserving the copyright and license notice. It is the most permissive license in this comparison.


When to Use Which

Choose Kreuzcrawl when:

  • You need to customize crawl components (frontier, strategy, storage, filtering)
  • You want Tower middleware composition for request/response processing
  • You need multiple traversal strategies (DFS, BestFirst, Adaptive)
  • You want BM25 relevance scoring or adaptive early termination
  • You need deep metadata extraction (JSON-LD, Dublin Core, feeds)
  • You want LLM-optimized content pruning (fit markdown)
  • You need strict configuration validation

Choose Spider when:

  • You need WARC output for web archival
  • You need search provider integration (Serper, Brave, Bing, Tavily)
  • You need Python or Node.js bindings today
  • You need page interaction / browser automation
  • You want AIMD-based dynamic concurrency control
  • You prefer MIT licensing with no restrictions
  • You need MCP server support for AI agent workflows

License Comparison

| | Kreuzcrawl | Spider |
| --- | --- | --- |
| License | Elastic-2.0 | MIT |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes |
| Hosting restriction | Cannot provide as a managed crawling service | None |
| Embedding in proprietary software | Yes | Yes |

Spider's MIT license is highly permissive: apart from keeping the copyright and license notice, there are no restrictions on use, modification, or distribution. Kreuzcrawl's Elastic-2.0 license allows commercial use and modification but prohibits offering Kreuzcrawl itself as a managed crawling service. For most use cases (embedding in applications, internal tooling, commercial products), both licenses work. The distinction matters only if you plan to offer a hosted crawling API as a product.



Kreuzcrawl vs Crawl4AI

Crawl4AI is a Python-based web crawling library optimized for AI and LLM workflows. With over 50,000 GitHub stars, it is one of the most popular open-source crawling tools. Crawl4AI provides a Python-native API, Playwright-based browser automation, BM25 content scoring, and a growing ecosystem of extraction strategies.

The core difference is the language runtime: Kreuzcrawl is compiled Rust with zero GC overhead; Crawl4AI is Python with Playwright for browser automation. This creates distinct trade-offs around performance, memory usage, and ecosystem integration.


Feature Comparison

| Feature | Kreuzcrawl | Crawl4AI |
| --- | --- | --- |
| Language | Rust | Python |
| License | Elastic-2.0 | Apache-2.0 |
| Distribution | Library + CLI | Library + CLI + API |
| Headless browser | chromiumoxide | Playwright |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS, DFS, BestFirst |
| Concurrent fetching | JoinSet + Semaphore | asyncio browser pool |
| Streaming events | Real-time | Yes |
| Batch operations | batch_crawl() | Deep crawl |
| Sitemap parsing | XML, gzip, index | No |
| robots.txt | RFC 9309 | Basic |
| Markdown conversion | Always-on + structure preservation | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | BM25/LLM-based |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Basic |
| JSON-LD extraction | Full | No |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | Yes |
| LLM extraction | Multi-provider (liter-llm) | litellm |
| Cost tracking | USD + tokens | Yes |
| PDF extraction | No | Yes |
| WAF detection | 8 vendors | 3-tier detection |
| Stealth / anti-detect | UA rotation | Playwright Stealth |
| Proxy support | HTTP/HTTPS/SOCKS5 | Yes, with escalation |
| User-Agent rotation | Tower layer | Yes |
| Screenshot capture | Stub | Yes |
| Page interaction | No | JS execution |
| REST API server | No | FastAPI |
| MCP server | No | Yes |
| CLI | scrape/crawl/map | crwl |
| Language SDKs | Rust only | Python |
| Disk cache | blake3 + TTL | SQLite |
| Per-domain rate limiting | Tower layer | Adaptive |
| HTTP caching (ETag) | Yes | No |
| Pluggable traits | 7 traits | Partial (strategies) |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | Yes |
| Adaptive crawling | Term saturation | Pattern learning |
| Search integration | No | Google |

Where Kreuzcrawl Wins

Performance. Rust compiles to native code with no garbage collector, no interpreter overhead, and no GIL. For CPU-bound extraction tasks -- HTML parsing, metadata extraction, markdown conversion -- Kreuzcrawl should be substantially faster per core. Memory usage is also lower and more predictable, which matters for large-scale crawls.

Architectural extensibility. Kreuzcrawl's 7 pluggable traits provide a formal extension contract. You can replace the URL frontier, rate limiter, storage backend, event system, traversal strategy, content filter, or cache without modifying the engine. Crawl4AI offers partial strategy customization but does not expose the same breadth of extension points.

Tower middleware composition. The Tower service stack allows you to compose request/response processing layers (tracing, caching, rate limiting, UA rotation) using a standard Rust ecosystem pattern. This is a fundamentally different approach from Crawl4AI's monolithic pipeline.

Metadata depth. Kreuzcrawl extracts 40+ fields including full JSON-LD, Dublin Core, feed discovery, and hreflang. Crawl4AI extracts basic metadata.

Per-domain rate limiting. Kreuzcrawl's Tower-based rate limiter operates per domain with backoff. This is more respectful to target sites than global rate limiting.

HTTP caching. ETag and Last-Modified conditional request support reduces unnecessary re-fetching. Crawl4AI does not implement HTTP-level caching.
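Conditional requests work by sending the validators stored with a cached page and reusing the cached body when the server answers 304 Not Modified. The sketch below shows that decision logic only, with no networking. `If-None-Match` and `If-Modified-Since` are the standard HTTP request headers; the `CachedPage` type and both function names are hypothetical and are not Kreuzcrawl's API.

```rust
/// A cached response together with the validators the server sent for it.
#[derive(Clone, Debug, PartialEq)]
struct CachedPage {
    etag: Option<String>,
    last_modified: Option<String>,
    body: String,
}

/// Build the conditional request headers for revalidating a cached page.
fn conditional_headers(cached: &CachedPage) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &cached.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(last_modified) = &cached.last_modified {
        headers.push(("If-Modified-Since", last_modified.clone()));
    }
    headers
}

/// Decide what the cache should hold after the server responds.
fn apply_response(
    cached: Option<CachedPage>,
    status: u16,
    etag: Option<String>,
    last_modified: Option<String>,
    body: String,
) -> Option<CachedPage> {
    match status {
        // Not Modified: the server sent no body, so keep the cached copy.
        304 => cached,
        // Fresh content: replace the cache entry and its validators.
        200 => Some(CachedPage { etag, last_modified, body }),
        // Anything else: drop the entry rather than serve stale content.
        _ => None,
    }
}
```

On a recrawl of mostly unchanged pages, the 304 path means only headers cross the network, which is where the saving in re-fetching comes from.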

Sitemap parsing. Kreuzcrawl parses XML sitemaps, gzip-compressed sitemaps, and sitemap index files for comprehensive URL discovery. Crawl4AI does not parse sitemaps.


Where Crawl4AI Wins

Community and ecosystem. With 50,000+ GitHub stars and an active community, Crawl4AI has extensive documentation, tutorials, community support, and battle-tested real-world usage. Finding help, examples, and integrations is easier.

Python-native. If your stack is Python -- and many AI/ML pipelines are -- Crawl4AI integrates without any FFI boundary. You can use it directly in Jupyter notebooks, combine it with pandas, pass results to scikit-learn or PyTorch, and debug with standard Python tools.

Page interaction. Crawl4AI supports JavaScript execution through Playwright, enabling interaction with dynamic pages, SPAs, and content behind client-side rendering. Kreuzcrawl does not currently support page interaction.

PDF extraction. Crawl4AI handles PDF content extraction. Kreuzcrawl does not.

Screenshot capture. Full-page screenshot capture is supported. Kreuzcrawl has only a stub.

REST API and MCP. Crawl4AI provides a FastAPI-based REST API and an MCP server for AI agent workflows. Kreuzcrawl provides neither.

Adaptive crawling with pattern learning. Crawl4AI's adaptive crawling learns content patterns during a crawl and adjusts its strategy accordingly. Kreuzcrawl's adaptive strategy uses term saturation, which is effective but less dynamic.

Search integration. Crawl4AI integrates with Google search for discovery-driven crawling.


When to Use Which

Choose Kreuzcrawl when:

  • Performance is critical (high-volume crawling, tight latency requirements)
  • Memory efficiency matters (large-scale crawls, constrained environments)
  • You need deep architectural customization (custom frontier, storage, middleware)
  • You are building a Rust application or polyglot system with Rust at the core
  • You need comprehensive metadata extraction (JSON-LD, Dublin Core, feeds)
  • You need sitemap-based URL discovery
  • You want composable Tower middleware

Choose Crawl4AI when:

  • Your stack is Python and you want zero-friction integration
  • You need page interaction or JavaScript execution
  • You need PDF extraction
  • You want an active community with extensive examples and support
  • You need a REST API or MCP server
  • You are prototyping or working in Jupyter notebooks
  • You need search-driven crawling via Google

License Comparison

| | Kreuzcrawl | Crawl4AI |
| --- | --- | --- |
| License | Elastic-2.0 | Apache-2.0 |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes |
| Hosting restriction | Cannot provide as a managed crawling service | None |
| Patent grant | No | Yes (explicit patent grant) |
| Embedding in proprietary software | Yes | Yes |

Crawl4AI's Apache-2.0 license is highly permissive with an explicit patent grant, meaning contributors cannot later assert patent claims against users. Kreuzcrawl's Elastic-2.0 license allows all commercial use and modification but prohibits offering Kreuzcrawl itself as a managed crawling service. For embedding in applications or internal use, both licenses work well.



Kreuzcrawl vs Webclaw

Webclaw is a Rust web crawling tool that deliberately avoids headless browsers. Instead, Webclaw uses TLS fingerprinting and HTTP-level stealth to bypass anti-bot protections. This browserless approach yields very fast extraction times (a reported 3.2 ms per page) with no browser overhead. Webclaw also includes brand extraction and MCP server support.

The core architectural difference: Webclaw is browserless by design; Kreuzcrawl uses browser fallback when HTTP-only fetching is insufficient. This creates distinct trade-offs around speed, compatibility, and resource usage.


Feature Comparison

| Feature | Kreuzcrawl | Webclaw |
| --- | --- | --- |
| Language | Rust | Rust |
| License | Elastic-2.0 | AGPL-3.0 |
| Distribution | Library + CLI | Library + CLI + MCP |
| Headless browser | chromiumoxide | None (TLS fingerprint) |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS |
| Concurrent fetching | JoinSet + Semaphore | Tokio |
| Streaming events | Real-time | No |
| Batch operations | batch_crawl() | Yes |
| Sitemap parsing | XML, gzip, index | Yes |
| robots.txt | RFC 9309 | Yes |
| Markdown conversion | Always-on + structure preservation | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | Token-optimized |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Moderate |
| JSON-LD extraction | Full | Data islands |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | No |
| LLM extraction | Multi-provider (liter-llm) | Ollama (local) |
| Cost tracking | USD + tokens | No |
| PDF extraction | No | Yes |
| WAF detection | 8 vendors | No |
| Stealth / anti-detect | UA rotation | TLS fingerprinting |
| Proxy support | HTTP/HTTPS/SOCKS5 | Yes |
| User-Agent rotation | Tower layer | No |
| Screenshot capture | Stub | No |
| Page interaction | No | No |
| REST API server | No | No |
| MCP server | No | Yes |
| CLI | scrape/crawl/map | Yes |
| Language SDKs | Rust only | Rust |
| Disk cache | blake3 + TTL | No |
| Per-domain rate limiting | Tower layer | No |
| HTTP caching (ETag) | Yes | No |
| Pluggable traits | 7 traits | No |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | No |
| Adaptive crawling | Term saturation | No |
| Asset download + dedup | SHA-256 | No |
| Search integration | No | API key |

Where Kreuzcrawl Wins

Browser fallback. Kreuzcrawl can fall back to headless Chrome (via chromiumoxide) when HTTP-only fetching fails. This handles JavaScript-rendered SPAs, Cloudflare challenges, and other scenarios where a real browser is required. Webclaw has no browser capability -- if TLS fingerprinting fails against a particular WAF, there is no fallback path.

Pluggable architecture. Seven traits and a Tower middleware stack provide deep customization. Webclaw is a more opinionated tool without equivalent extension points.

Crawl intelligence. Four traversal strategies, BM25 relevance scoring, and adaptive crawling with term saturation detection. Webclaw offers BFS only.

Rich metadata. 40+ metadata fields versus Webclaw's moderate extraction. Full JSON-LD parsing, Dublin Core, feed discovery, and hreflang links are available in Kreuzcrawl.

WAF detection. Kreuzcrawl detects 8 WAF vendors (Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, Sucuri, F5) and can route to browser fallback. Webclaw does not detect WAFs -- it relies on TLS fingerprinting to avoid triggering them in the first place.

Per-domain rate limiting and caching. Tower-based per-domain rate limiting, ETag/Last-Modified HTTP caching, and blake3-hashed disk cache with TTL. Webclaw has none of these.

Streaming events. Real-time event streaming during crawls. Webclaw does not support streaming.


Where Webclaw Wins

Raw speed. Without browser overhead, Webclaw achieves a reported extraction time of 3.2 ms per page. This is at least an order of magnitude faster than browser-based approaches. For bulk extraction of static or server-rendered content, Webclaw's pure HTTP approach is extremely efficient.

Zero browser dependencies. No Chromium binary, no browser process management, no browser crashes to recover from. This simplifies deployment, reduces Docker image size, and eliminates an entire class of operational issues.

TLS fingerprinting. Webclaw's stealth approach operates at the TLS handshake level, which is harder for WAFs to detect than browser-level stealth injection. For many sites, this is sufficient to avoid blocks without the weight of a full browser.

Brand extraction. Webclaw includes built-in brand/company extraction from web pages -- a specialized feature for business intelligence use cases.

PDF extraction. Webclaw handles PDF content extraction. Kreuzcrawl does not.

MCP server. Webclaw provides an MCP server for AI agent integration. Kreuzcrawl does not.

Lower resource footprint. No browser means dramatically less memory and CPU per crawl worker. A single machine can run many more concurrent Webclaw workers than browser-based crawlers.

Search integration. Webclaw supports search-driven discovery via API key integration.


When to Use Which

Choose Kreuzcrawl when:

  • You need to handle JavaScript-rendered content or SPAs
  • You need browser fallback for sites that block HTTP-only requests
  • You want pluggable architecture with custom components
  • You need multiple traversal strategies and relevance scoring
  • You need comprehensive metadata extraction
  • You need per-domain rate limiting and HTTP caching
  • You want streaming events during crawls

Choose Webclaw when:

  • Speed is the top priority and browser rendering is not required
  • You are crawling server-rendered or static content at high volume
  • You want minimal deployment dependencies (no Chromium)
  • You need brand extraction for business intelligence
  • You need PDF extraction
  • You want MCP server support for AI agent workflows
  • Resource efficiency (memory, CPU) is a primary constraint

License Comparison

| | Kreuzcrawl | Webclaw |
| --- | --- | --- |
| License | Elastic-2.0 | AGPL-3.0 |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes, but must open-source if hosting |
| Hosting restriction | Cannot provide as a managed crawling service | Must open-source all modifications when hosting |
| Embedding in proprietary software | Yes | Requires AGPL compliance (copyleft) |

Webclaw's AGPL-3.0 license has strong copyleft requirements: if you modify Webclaw and offer it as a network service, you must release your modifications under AGPL. Under some interpretations, this also applies to software that links against Webclaw. Kreuzcrawl's Elastic-2.0 license allows proprietary modifications but prohibits offering Kreuzcrawl itself as a managed service. For embedding in a proprietary application, Elastic-2.0 is less restrictive than AGPL.



Kreuzcrawl vs CRW

CRW is a Rust-based web crawling tool that emphasizes simplicity and Firecrawl API compatibility. CRW provides a Firecrawl-compatible REST API, making it a drop-in replacement for teams migrating from Firecrawl's hosted service. It supports LightPanda as an alternative lightweight browser and includes MCP server support.

The key distinction: CRW focuses on being a simple, Firecrawl-compatible server; Kreuzcrawl focuses on being a deeply extensible crawling library. CRW optimizes for ease of deployment and API compatibility, while Kreuzcrawl optimizes for architectural flexibility and content intelligence.


Feature Comparison

| Feature | Kreuzcrawl | CRW |
| --- | --- | --- |
| Language | Rust | Rust |
| License | Elastic-2.0 | AGPL-3.0 |
| Distribution | Library + CLI | CLI + MCP + API |
| Headless browser | chromiumoxide | LightPanda / Chrome |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | BFS |
| Concurrent fetching | JoinSet + Semaphore | Tokio |
| Streaming events | Real-time | No |
| Batch operations | batch_crawl() | Yes |
| Sitemap parsing | XML, gzip, index | Yes |
| robots.txt | RFC 9309 | Yes |
| Markdown conversion | Always-on + structure preservation | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | No |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | Basic |
| JSON-LD extraction | Full | No |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | No |
| LLM extraction | Multi-provider (liter-llm) | Claude, OpenAI |
| Cost tracking | USD + tokens | No |
| PDF extraction | No | Yes |
| WAF detection | 8 vendors | No |
| Stealth / anti-detect | UA rotation | Stealth injection |
| Proxy support | HTTP/HTTPS/SOCKS5 | HTTP/SOCKS5 |
| User-Agent rotation | Tower layer | Yes |
| Screenshot capture | Stub | No |
| Page interaction | No | No |
| REST API server | No | Yes (Firecrawl-compatible) |
| MCP server | No | Yes |
| CLI | scrape/crawl/map | Yes |
| Language SDKs | Rust only | Rust |
| Disk cache | blake3 + TTL | No |
| Per-domain rate limiting | Tower layer | Global RPS |
| HTTP caching (ETag) | Yes | No |
| Pluggable traits | 7 traits | No |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | No |
| Adaptive crawling | Term saturation | No |
| Asset download + dedup | SHA-256 | No |
| Search integration | No | API key |

Where Kreuzcrawl Wins

Deep extensibility. Seven pluggable traits and a Tower middleware stack allow you to customize every aspect of the crawl pipeline. CRW is designed as a turnkey server -- you configure it, but you do not extend it at the architectural level.

Crawl intelligence. BFS, DFS, BestFirst, and Adaptive traversal strategies with BM25 relevance scoring and term saturation-based early termination. CRW supports BFS only.

Content quality. Fit markdown with BM25-based content pruning, 40+ metadata fields, full JSON-LD extraction, feed discovery, and link-to-citations conversion. CRW produces basic metadata and standard markdown.
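Link-to-citations conversion can be pictured as replacing inline markdown links with numbered markers and collecting the URLs into a reference list. The sketch below is a minimal assumption about the output shape (`text[1]` plus an ordered URL list); the function name `links_to_citations` is hypothetical, and it deliberately ignores images, reference-style links, and URLs that contain parentheses.

```rust
/// Rewrite inline links `[text](url)` as `text[n]` and return the rewritten
/// text together with the URLs in order of first appearance.
fn links_to_citations(markdown: &str) -> (String, Vec<String>) {
    let mut out = String::new();
    let mut refs: Vec<String> = Vec::new();
    let mut rest = markdown;

    while let Some(open) = rest.find('[') {
        let after = &rest[open + 1..];
        // Try to parse "text](url)" immediately after the opening bracket.
        let parsed = after.find("](").and_then(|mid| {
            let text = &after[..mid];
            let tail = &after[mid + 2..];
            tail.find(')').map(|end| (text, &tail[..end], &tail[end + 1..]))
        });
        match parsed {
            // Reject matches whose text spans another '[' or a line break.
            Some((text, url, remaining)) if !text.contains('[') && !text.contains('\n') => {
                out.push_str(&rest[..open]);
                // Reuse the number of a URL that was already cited.
                let existing = refs.iter().position(|r| r == url);
                let index = match existing {
                    Some(i) => i,
                    None => {
                        refs.push(url.to_string());
                        refs.len() - 1
                    }
                };
                out.push_str(text);
                out.push_str(&format!("[{}]", index + 1));
                rest = remaining;
            }
            // Not a link: emit the bracket literally and keep scanning.
            _ => {
                out.push_str(&rest[..=open]);
                rest = after;
            }
        }
    }
    out.push_str(rest);
    (out, refs)
}
```

Moving URLs out of the body text keeps the prose compact for an LLM while preserving every source in the reference list, and repeated links share a single number.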

WAF detection. Kreuzcrawl detects 8 WAF vendors and can route to browser fallback. CRW does not detect WAFs.

Caching infrastructure. blake3-hashed disk cache with TTL eviction, ETag/Last-Modified HTTP caching, and per-domain rate limiting via Tower layers. CRW has no disk cache, no HTTP caching, and only global RPS limiting.

Streaming events. Real-time event streaming during crawls for monitoring and progress tracking. CRW does not support streaming.

Cost tracking. Kreuzcrawl tracks LLM extraction costs in USD and tokens. CRW does not track costs.


Where CRW Wins

Firecrawl API compatibility. CRW implements a Firecrawl-compatible REST API, making it a drop-in replacement for existing Firecrawl integrations. If you have applications that call the Firecrawl API, you can point them at CRW without code changes. Kreuzcrawl does not provide a REST API.

LightPanda browser. CRW supports LightPanda as an alternative to Chromium. LightPanda is a lightweight browser engine that uses significantly less memory than full Chromium, making it viable for resource-constrained deployments.

Simple deployment. CRW is designed as a server you deploy and call via HTTP. For teams that want a crawling service without writing Rust code, CRW is more accessible. Kreuzcrawl requires Rust integration (until polyglot bindings are available).

MCP server. CRW provides an MCP server for AI agent workflows. Kreuzcrawl does not.

PDF extraction. CRW handles PDF content extraction. Kreuzcrawl does not.

Search integration. CRW supports search-driven discovery via API key integration.


When to Use Which

Choose Kreuzcrawl when:

  • You are building a Rust application that needs embedded crawling
  • You need custom crawl components (frontier, strategy, storage, middleware)
  • You need relevance-based crawling with BM25 scoring
  • You want deep metadata extraction and LLM-optimized content
  • You need per-domain rate limiting and HTTP caching
  • You need streaming events for real-time monitoring
  • You want cost tracking for LLM extraction

Choose CRW when:

  • You want a Firecrawl-compatible API server with no code changes
  • You are migrating from Firecrawl and want API compatibility
  • You want a simple deployment model (server + HTTP API)
  • You need a lightweight browser option (LightPanda)
  • You need PDF extraction
  • You need MCP server support for AI agents
  • You do not need custom crawl logic or extensibility

License Comparison

| | Kreuzcrawl | CRW |
|---|---|---|
| License | Elastic-2.0 | AGPL-3.0 |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes, but must open-source if hosting |
| Hosting restriction | Cannot provide as a managed crawling service | Must open-source all modifications when hosting |
| Embedding in proprietary software | Yes | Requires AGPL compliance (copyleft) |

CRW's AGPL-3.0 license requires that modifications be open-sourced when the software is offered as a network service. Since CRW is primarily a server, this means any customizations to a hosted CRW instance must be released under AGPL. Kreuzcrawl's Elastic-2.0 license allows proprietary modifications but prohibits offering Kreuzcrawl itself as a managed crawling service. For embedding in proprietary applications, Elastic-2.0 is less restrictive.



Kreuzcrawl vs ScrapeGraphAI

ScrapeGraphAI is a Python library that puts LLMs at the center of the extraction pipeline. Instead of writing CSS selectors or XPath queries, you describe what you want in natural language and ScrapeGraphAI uses LLM-driven graph execution to extract it. It integrates deeply with LangChain and supports multiple LLM providers.

The fundamental difference: ScrapeGraphAI is LLM-native -- the LLM is the extraction engine. Kreuzcrawl is extraction-native -- structured parsing runs first, with LLM as an optional enhancement. This creates very different cost profiles, latency characteristics, and reliability models.


Feature Comparison

| Feature | Kreuzcrawl | ScrapeGraphAI |
|---|---|---|
| Language | Rust | Python |
| License | Elastic-2.0 | MIT |
| Distribution | Library + CLI | Library + SaaS API |
| Headless browser | chromiumoxide | Playwright |
| Traversal strategies | BFS, DFS, BestFirst, Adaptive | LLM-driven graph |
| Concurrent fetching | JoinSet + Semaphore | asyncio |
| Streaming events | Real-time | No |
| Batch operations | batch_crawl() | No |
| Sitemap parsing | XML, gzip, index | No |
| robots.txt | RFC 9309 | No |
| Markdown conversion | Always-on + structure preservation | Yes |
| Fit markdown (LLM-pruned) | BM25 + heuristic | No |
| Metadata fields | 40+ (OG, DC, Twitter, Article, JSON-LD) | No |
| JSON-LD extraction | Full | No |
| Feed discovery | RSS, Atom, JSON Feed | No |
| Link-to-citations | Numbered refs | No |
| LLM extraction | Multi-provider (liter-llm) | LangChain (core) |
| Cost tracking | USD + tokens | Token counting |
| PDF extraction | No | Yes |
| WAF detection | 8 vendors | No |
| Stealth / anti-detect | UA rotation | Undetected Playwright |
| Proxy support | HTTP/HTTPS/SOCKS5 | Via Playwright |
| User-Agent rotation | Tower layer | No |
| Screenshot capture | Stub | Yes |
| Page interaction | No | No |
| REST API server | No | SaaS API |
| MCP server | No | Yes (via Toolhouse) |
| CLI | scrape/crawl/map | No |
| Language SDKs | Rust only | Python, Node.js |
| Disk cache | blake3 + TTL | No |
| Per-domain rate limiting | Tower layer | No |
| HTTP caching (ETag) | Yes | No |
| Pluggable traits | 7 traits | Graph nodes |
| Middleware stack | Tower services | No |
| Config validation | serde strict | No |
| BM25 relevance scoring | Yes | No |
| Adaptive crawling | Term saturation | No |
| Asset download + dedup | SHA-256 | No |
| Search integration | No | DuckDuckGo |

Where Kreuzcrawl Wins

No LLM required for basic extraction. Kreuzcrawl extracts metadata, converts to markdown, discovers links, and parses sitemaps without calling any LLM. This means zero API costs, zero LLM latency, and deterministic output for standard extraction tasks. ScrapeGraphAI requires an LLM for virtually every operation.

Cost efficiency at scale. Crawling 10,000 pages with Kreuzcrawl costs nothing beyond compute. The same workload with ScrapeGraphAI would incur significant LLM API costs -- potentially hundreds of dollars depending on the provider and model. Kreuzcrawl's LLM extraction is opt-in and feature-gated, so you pay only when you choose to use it.

Performance. Kreuzcrawl is compiled Rust with no LLM round-trip on the critical path; ScrapeGraphAI is interpreted Python and waits on an LLM call for each page. Kreuzcrawl's extraction pipeline runs in microseconds per field, while ScrapeGraphAI's LLM calls add seconds per page at minimum.

Deterministic output. Kreuzcrawl's HTML parsing produces the same output for the same input every time. LLM-based extraction is inherently non-deterministic -- the same page may produce different results across runs. For pipelines that require reproducibility, this matters.

Crawling infrastructure. Kreuzcrawl is a full crawling engine with four traversal strategies (BFS, DFS, BestFirst, Adaptive), per-domain rate limiting, sitemap parsing, robots.txt compliance, HTTP caching, and streaming events. ScrapeGraphAI is primarily an extraction tool with basic fetching; it does not provide crawl orchestration, rate limiting, or compliance features.
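
To make the difference between traversal strategies concrete, the sketch below shows a toy frontier whose pop order depends on the chosen strategy. The `Strategy`, `Candidate`, and `Frontier` types are hypothetical and written for this example only; they are not Kreuzcrawl's traits, and the Adaptive strategy is left out because the document does not describe how it orders URLs.

```rust
use std::collections::VecDeque;

/// Hypothetical strategy names mirroring three of the four strategies above.
#[derive(Debug, Clone, Copy)]
pub enum Strategy {
    Bfs,
    Dfs,
    BestFirst,
}

/// A URL waiting to be fetched, with a relevance score used only by BestFirst.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub score: f64,
}

pub struct Frontier {
    strategy: Strategy,
    queue: VecDeque<Candidate>,
}

impl Frontier {
    pub fn new(strategy: Strategy) -> Self {
        Self { strategy, queue: VecDeque::new() }
    }

    pub fn push(&mut self, url: &str, score: f64) {
        self.queue.push_back(Candidate { url: url.to_string(), score });
    }

    /// Returns the next URL to visit according to the strategy.
    pub fn pop(&mut self) -> Option<Candidate> {
        match self.strategy {
            // Oldest discovered URL first.
            Strategy::Bfs => self.queue.pop_front(),
            // Most recently discovered URL first.
            Strategy::Dfs => self.queue.pop_back(),
            Strategy::BestFirst => {
                // Linear scan for the highest score; a BinaryHeap would scale better.
                let best = self
                    .queue
                    .iter()
                    .enumerate()
                    .max_by(|a, b| a.1.score.total_cmp(&b.1.score))
                    .map(|(i, _)| i)?;
                self.queue.remove(best)
            }
        }
    }
}
```

With the same three URLs pushed in the same order, BFS returns the first one pushed, DFS the last, and BestFirst the one with the highest score.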

Rich metadata extraction. Kreuzcrawl extracts 40+ metadata fields, including full JSON-LD, Dublin Core, Open Graph, Twitter Card, feed discovery, and hreflang. ScrapeGraphAI does not extract structured metadata; it relies on the LLM to identify relevant information.

WAF detection and browser fallback. Kreuzcrawl detects 8 WAF vendors and can route to headless Chrome. ScrapeGraphAI relies on Undetected Playwright but does not detect or classify WAFs.
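
The general idea of detect-then-fall-back can be sketched as follows. This is an assumption-laden illustration, not Kreuzcrawl's detection logic: the document does not list the eight vendors, so the three vendors, the header fingerprints (commonly observed ones such as `cf-ray` for Cloudflare), the status codes, and the names `Waf`, `detect_waf`, and `should_use_browser` are all chosen for the example.

```rust
/// Hypothetical subset of vendors; the document does not name the eight it detects.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Waf {
    Cloudflare,
    Sucuri,
    Imperva,
}

/// Guesses the WAF vendor from response headers given as (name, value) pairs.
/// Header names are matched case-insensitively.
pub fn detect_waf(headers: &[(&str, &str)]) -> Option<Waf> {
    for (name, value) in headers {
        let name = name.to_ascii_lowercase();
        let value = value.to_ascii_lowercase();
        match name.as_str() {
            "cf-ray" => return Some(Waf::Cloudflare),
            "server" if value.contains("cloudflare") => return Some(Waf::Cloudflare),
            "x-sucuri-id" => return Some(Waf::Sucuri),
            "x-iinfo" => return Some(Waf::Imperva),
            _ => {}
        }
    }
    None
}

/// Suggests retrying with a headless browser when a blocking status
/// (403, 429, or 503 in this sketch) comes from a recognized WAF.
pub fn should_use_browser(status: u16, headers: &[(&str, &str)]) -> bool {
    matches!(status, 403 | 429 | 503) && detect_waf(headers).is_some()
}
```

Real detection would typically also inspect cookies and response bodies, since header fingerprints alone can be stripped or spoofed.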


Where ScrapeGraphAI Wins

Natural language extraction. Describe what you want in plain English and ScrapeGraphAI extracts it. No CSS selectors, no XPath, no schema definitions needed for ad-hoc extraction tasks. This dramatically lowers the barrier to entry for non-technical users and exploratory workflows.

Schema-free extraction. ScrapeGraphAI can extract structured data from pages without predefined schemas. The LLM infers the structure from context. Kreuzcrawl's LLM extraction requires you to define a JSON schema for the output you want.

LangChain integration. ScrapeGraphAI integrates deeply with the LangChain ecosystem, so it fits naturally into LLM application pipelines that already use LangChain.

Graph-based execution. ScrapeGraphAI's graph execution model allows complex multi-step extraction workflows where the output of one node feeds into another. This is powerful for extraction tasks that require reasoning across multiple page elements.

PDF extraction. ScrapeGraphAI can extract content from PDFs. Kreuzcrawl cannot.

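Screenshot capture. ScrapeGraphAI offers full screenshot support for visual extraction workflows. Kreuzcrawl's screenshot capture is currently a stub.

MIT license. ScrapeGraphAI's MIT license places no restrictions on use, modification, or hosting beyond retaining the copyright and license notice.

Node.js SDK. ScrapeGraphAI is available in both Python and Node.js. Kreuzcrawl currently ships a Rust SDK only.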


When to Use Which

Choose Kreuzcrawl when:

  • You need to crawl at scale without LLM costs
  • Deterministic, reproducible output is required
  • You need a full crawling engine (rate limiting, robots.txt, sitemaps)
  • You want structured metadata extraction without LLM dependency
  • Performance and resource efficiency are priorities
  • You need per-domain rate limiting and HTTP caching
  • You want LLM extraction as an optional enhancement, not a requirement
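
The per-domain rate limiting mentioned in the list above can be pictured with a small scheduler that spaces out requests to the same host. This is a simplified sketch using only the standard library; `DomainRateLimiter` and `reserve` are hypothetical names, and Kreuzcrawl's actual Tower layer may work quite differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Enforces a minimum delay between requests to the same domain.
pub struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Returns how long the caller should wait before hitting `domain`,
    /// and reserves that slot so later callers queue up behind it.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // The request can go out now, or at the domain's next free slot if later.
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(domain.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

Passing `now` in explicitly keeps the logic easy to test; a caller would sleep for the returned duration before sending the request. Requests to different domains do not delay each other.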

Choose ScrapeGraphAI when:

  • You need ad-hoc extraction described in natural language
  • You do not know the page structure in advance
  • You are already using LangChain in your pipeline
  • You need multi-step reasoning across page elements
  • LLM costs are acceptable for your volume
  • You need PDF extraction
  • You want the lowest possible barrier to entry for extraction tasks
  • You prefer MIT licensing

License Comparison

| | Kreuzcrawl | ScrapeGraphAI |
|---|---|---|
| License | Elastic-2.0 | MIT |
| Commercial use | Yes | Yes |
| Modification | Yes | Yes |
| Hosting restriction | Cannot provide as a managed crawling service | None |
| Embedding in proprietary software | Yes | Yes |

ScrapeGraphAI's MIT license is highly permissive: it allows use, modification, and hosting, and asks only that the copyright and license notice be retained. Kreuzcrawl's Elastic-2.0 license allows commercial use and modification but prohibits offering Kreuzcrawl itself as a managed crawling service. For most use cases, both licenses are compatible with commercial development.


Cost Model Comparison

The cost difference between these tools deserves special attention because their architectures create fundamentally different cost profiles:

| Scenario | Kreuzcrawl | ScrapeGraphAI |
|---|---|---|
| 10,000 pages, metadata only | Compute only (~$0) | ~$50-200 in LLM API costs |
| 10,000 pages, structured extraction | Compute + LLM costs (opt-in) | ~$50-200 in LLM API costs |
| 100,000 pages, basic crawl | Compute only (~$0) | ~$500-2,000 in LLM API costs |
| Latency per page | Milliseconds (no LLM) | Seconds (LLM round-trip) |

These are rough estimates that vary significantly by LLM provider, model, and page complexity. The key point is that for standard extraction Kreuzcrawl costs only the compute needed to fetch and parse pages, while ScrapeGraphAI adds an LLM API charge for every page, so its bill grows linearly with page count.
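
The arithmetic behind the table can be checked with a short sketch. The ScrapeGraphAI figures for 10,000 pages imply roughly $0.005 to $0.02 per page, and applying that rate to 100,000 pages gives the $500 to $2,000 range in the third row. The function names below are illustrative, and the model assumes exactly one LLM call per page at a flat rate.

```rust
/// Estimated LLM spend for a crawl where every page triggers one LLM call
/// billed at a flat per-page rate (a simplifying assumption).
pub fn llm_cost_usd(pages: u64, usd_per_page: f64) -> f64 {
    pages as f64 * usd_per_page
}

/// Per-page rate implied by a total, such as a row from the table above.
pub fn implied_rate(total_usd: f64, pages: u64) -> f64 {
    total_usd / pages as f64
}
```

For example, `implied_rate(50.0, 10_000)` is about 0.005, and `llm_cost_usd(100_000, 0.005)` is about 500. Actual per-page cost depends on token counts and model pricing, so the flat rate is only a planning aid.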