Changelog
Unreleased
Features
- Core: Published `interact()` with `PageAction`, `ActionResult`, and `InteractionResult` for backend-neutral page interaction. Exposed across every language binding (Python, Node, Ruby, PHP, Go, Java, C#, Elixir, WASM, Dart, Kotlin/Android, Swift, Zig, C).
- Core: Streaming crawl APIs: `CrawlEngineHandle::crawl_stream` and `batch_crawl_stream` yield `CrawlEvent::Page/Error/Complete` as pages are processed. Per-language adapters: Python `AsyncIterator`, Node async iterators, Ruby `Enumerator`, Java `Stream`, Go channels, C# `IAsyncEnumerable`, Elixir `Stream.unfold`, PHP `Generator`, plus C FFI handle-based polling. WASM is excluded (no Tokio multi-thread runtime).
- Browser (native): Added native `interact()` execution for click, type, press, scroll, wait, JavaScript, scrape, and screenshot actions. Native screenshots are deterministic PNG snapshots derived from the post-action HTML rather than Chrome compositor captures.
- Browser (native): Worker-pool isolation for the in-process native backend; concurrent crawls no longer contend on a single browser instance.
Fixes
- Core: `interact()` now runs `BrowserConfig.eval_script` after navigation and before page actions.
- Core: Page-action validation now rejects invalid wait and scroll selectors before navigation.
- MCP: The MCP `interact` tool now delegates to the public engine API instead of returning a placeholder message.
Breaking Changes
- JSON wire format: `CrawlEvent` now serializes as an internally-tagged enum (`{"type":"page","result":{...}}`) instead of externally-tagged (`{"Page":{...}}`). Clients parsing the streaming wire format directly must update their decoders. The Rust enum API is unchanged.
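For clients that decode the streaming wire format by hand, the migration might look like the following Python sketch. It relies only on the two shapes quoted in the entry above; the `decode_event` helper, the lowercasing of the legacy variant name, and the sample payload fields are hypothetical and not part of the published bindings.

```python
import json
from typing import Any, Tuple


def decode_event(raw: str) -> Tuple[str, Any]:
    """Return (kind, body) for a serialized CrawlEvent.

    Hypothetical helper. For the new internally-tagged shape the body is
    every field except "type"; for the legacy externally-tagged shape the
    body is the value stored under the single variant key.
    """
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("expected a JSON object")

    if "type" in event:
        # New format: the discriminator lives in the "type" field.
        body = {key: value for key, value in event.items() if key != "type"}
        return event["type"], body

    if len(event) == 1:
        # Legacy format: the only key is the variant name, e.g. "Page".
        variant, body = next(iter(event.items()))
        return variant.lower(), body

    raise ValueError("unrecognized CrawlEvent shape")
```

Accepting both shapes during a transition lets one decoder read events from engines on either side of the change; once every producer is upgraded, the legacy branch can be dropped.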
Tooling
- Pinned `alef_version = "0.17.0"`. Upstream fixes that flowed in: streaming FFI emits `_to_json`/`_free` for enum item types, NAPI Box-deref in tagged-enum conversions, Java CPD threshold lift, ktlint multiline-expression-wrapping disabled for ktfmt parity, PHP dedup of binding→core From impls across tagged-enum variant payloads.
0.3.0 - 2026-05-18
Highlights
- Native browser backend: a Servo/Deno-derived in-process browser (`browser.backend = "native"`) joins the existing chromiumoxide backend. Stealth TLS fingerprinting via BoringSSL, proxy with credentials, URL-pattern request blocking, JavaScript evaluation, selector-wait, robots-txt awareness, cookie forwarding, and full network-event capture.
- Five new language bindings: Kotlin Android (AAR), Dart (Flutter Rust Bridge), Swift (XCFramework), Zig (native module), and C (cdylib + header) bring total binding coverage from 11 to 16 languages.
- Generator migration to alef: the entire bindings/docs/READMEs/e2e surface is now produced by alef (pinned via `alef.toml`). Replaces the prior hand-rolled codegen scripts and dramatically tightens cross-language consistency.
Breaking Changes
- Rust API: `kreuzcrawl::CrawlEngineBuilder`, strategy types, and default trait implementations are once again exported at the crate root (reverting the 0.1.2 lockdown); downstream code that worked around the prior restriction can drop its wrappers.
- Browser config: `BrowserConfig.backend` is now an enum (`"chromium"|"native"`) instead of a free-form string. Existing configs that omitted the field continue to default to chromium.
- Bindings: `DownloadedDocument.content` and `ScrapeResult.screenshot` (binary fields) are hidden from non-Rust bindings; call the dedicated download/screenshot helpers instead of reading the field.
- Bindings: `CrawlConfig.browser_pool` is no longer exposed across the FFI boundary (was unusable from non-Rust callers anyway).
- WASM: Generated class names use the configured `wasm_type_prefix` (default `Wasm`) consistently across builders and tagged-enum constructors. Calls into `WasmAuthConfig` are now field-based instead of enum-based.
- Errors: Network errors are tagged with a stable `[network:<kind>]` prefix in their message; assertions that grepped on raw reqwest text need updating.
- C# / Java / Go: Tagged unions emit as sealed interfaces / records / discriminated structs (was nested classes / raw maps). Per-binding migration notes are in each package README.
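Tests that previously matched raw reqwest text can switch to the stable tag instead. The Python sketch below shows one way to extract it; the `network_kind` helper, the characters allowed in `<kind>`, and the `timeout` kind used in the usage note are assumptions, since the changelog does not enumerate the kinds.

```python
import re
from typing import Optional

# Matches a leading "[network:<kind>]" tag. The character class for <kind>
# is an assumption; the changelog does not specify the allowed characters.
_NETWORK_TAG = re.compile(r"^\[network:(?P<kind>[A-Za-z0-9_-]+)\]")


def network_kind(message: str) -> Optional[str]:
    """Return <kind> from a leading [network:<kind>] tag, or None if absent."""
    match = _NETWORK_TAG.match(message)
    return match.group("kind") if match else None
```

With this helper, an assertion such as `network_kind(str(err)) == "timeout"` no longer depends on the wording of the underlying HTTP client's message.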
Features
- Browser (native): `BrowserExtras` on `ScrapeResult.browser` carries `eval_result`, `network_events`, and `cookies` populated by the native backend.
- Browser (native): `browser-native` Cargo feature added to the `full` feature set.
- Core: New `full` feature aggregates every optional capability (api, browser, mcp, native browser, etc.) for one-flag installs.
- Core: `CrawlConfig.soft_http_errors` unifies 404 handling; opt in to treat soft-404 HTML as success instead of error.
- Core: `scrape()` follows redirects with cycle detection and a soft stop on `max_redirects`.
- Core: `batch_scrape`/`batch_crawl` now return `Result` and reject empty input rather than silently succeeding.
- Core: Network errors carry a stable `[network:<kind>]` tag for programmatic dispatch.
- Core: `CrawlConfig::validate()` enforces `max_depth`, `max_body_size`, proxy URL, and auth-config shape at build time; browser endpoint URL scheme is validated.
- Core: `CrawlConfig.max_depth` defaults to unbounded when unset (was 1).
- Core: Browser fallback is now restricted to `WafBlocked` and `Forbidden` errors (was overly broad).
- CLI: `--config` flag accepts a JSON `CrawlConfig` for both `scrape` and `crawl`.
- Bindings: WASM getters/setters on enum-typed fields use strings for JS interop; `asset_types` accepts string arrays.
- Bindings: C language binding added (cdylib + cbindgen header, full e2e parity).
- Bindings: Dart bridge via flutter_rust_bridge with full e2e parity.
- Bindings: Swift package with XCFramework distribution + full e2e parity.
- Bindings: Zig native module + full e2e parity.
- Bindings: Kotlin replaced JVM facade with Kotlin-Android (AAR + Android Gradle Plugin) — drops desktop-JVM target; mobile-only.
- Bindings (JNI): Migrated from `jni = "0.21"` to `jni = "0.22"` for the FFI-safe `EnvUnowned<'frame>` API.
- API: `CachedPage` re-exported from crate root.
- Tooling: All bindings, docs, READMEs, and e2e suites are generated by alef, with the version pinned in `alef.toml`. The `task alef:bump` / `task alef:regen` / `task rebuild` cycle replaces the prior hand-maintained scripts.
- Tooling: New publish targets: Kotlin Android (Maven Central), Dart (pub.dev), Swift (Swift Package Index), Zig (build registry).
- Tooling: Per-language `task update` / `task upgrade` split (within-major vs latest).
- Tooling: Adopted `gh-actions-updater` (Goldziher) for GHA pin maintenance.
- CI: Split monolithic `ci.yaml` into kreuzberg-topology workflows (`ci-rust`, `ci-e2e`, `ci-docs`, `ci-mobile`, `publish`).
- CI: New `ci-mobile` workflow runs Android (AAR) and iOS (XCFramework) cargo checks.
- CI: Discord release announcements wired into `publish.yaml`.
- Docs: Canonical Material/Zensical docs site at `docs.kreuzcrawl.kreuzberg.dev` aligned with sibling Kreuzberg.dev properties (shared CSS, base template, GA, ecosystem grid, llms.txt).
- Docs: All concepts, guides, getting-started, features, and reference pages rewritten against the public binding surface only.
- Repo: `CITATION.cff` generated from `[workspace.citation]` in `alef.toml`.
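The redirect entry in the list above (cycle detection plus a soft stop on `max_redirects`) can be illustrated with a small Python sketch. This is not the engine's implementation: the `follow_redirects` function, the `next_hop` callback, the status strings, and the reading of "soft stop" as "return the last URL reached instead of raising" are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def follow_redirects(
    start: str,
    next_hop: Callable[[str], Optional[str]],
    max_redirects: int = 10,
) -> Tuple[str, str]:
    """Follow redirects from `start` and return (last_url, status).

    `next_hop(url)` returns the redirect target, or None when the URL does
    not redirect. Status is "ok", "cycle", or "max_redirects".
    """
    seen = {start}
    current = start
    hops = 0
    while True:
        target = next_hop(current)
        if target is None:
            return current, "ok"
        if target in seen:
            # The target was already visited: stop instead of looping forever.
            return current, "cycle"
        if hops >= max_redirects:
            # Soft stop: report the last URL reached rather than raising.
            return current, "max_redirects"
        seen.add(target)
        current = target
        hops += 1
```

A plain dictionary can stand in for the network when trying this out: `follow_redirects("a", {"a": "b", "b": "c"}.get)` walks the chain through the dictionary's `get` method.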
Fixes
- Elixir: `:force_build` now respects `config :rustler_precompiled, :force_build, kreuzcrawl: true` in addition to the `KREUZCRAWL_BUILD` env var, fixing the documented workaround that was previously ignored when users hit precompiled checksum errors (#7).
- Browser (chromium): Filter snap-incompatible flags + drop `enable-blink-features` default arg; fixes startup under snap-packaged Chromium on Ubuntu noble.
- Browser (chromium): Re-wired `browser_fetch` into the engine scrape pipeline (lost during the Tower refactor).
- Core: `quick-xml` 0.40 `xml_content` API migration; sitemap parsing keeps working with the upgraded dep.
- Core: WAF detection tests are deterministic on macOS.
- WASM: Capture response headers in `HttpResponse` for wasm builds (was empty).
- WASM: Crate compiles for `wasm32-unknown-unknown` without `mio` (gated out under wasm32).
- WASM: Structured error objects `{code, message}` (was plain string).
- Bindings (PHP): PSR-4 namespace escaping in `composer.json`.
- Bindings (Java): Added `jspecify` dep for `@Nullable` annotations; `Optional` wrapping for nullable returns.
- Bindings (Ruby): `sorbet-runtime` declared as gemspec dep; `html-to-markdown-rs` 3.4 sig change handled.
- Bindings (C#): Sealed-union and exception deserialization corrected.
- Bindings (Go): Enum values use serde rename (`og:image`), batch functions return error instead of panic.
- Docs: Stale install snippets, version strings, and binding sample code reconciled with the actual public APIs.
See Cargo.toml for the full dependency graph; alef.toml for the generator pin.
0.2.0
Breaking Changes
- Config: `MarkdownConfig` and `PlainTextConfig` replaced by unified `ContentConfig` on `CrawlConfig.content`
- Config: `main_content_only` removed from `CrawlConfig`; use `content.preprocessing_preset: "aggressive"` instead
- Config: `CrawlConfig.remove_tags` now forwarded to h2m's `exclude_selectors` instead of kreuzcrawl's DOM manipulation
- Results: `plain_text` field removed from `ScrapeResult`; set `content.output_format: "plain"` to get plain text in the `markdown` result field
Features
- Config: `ContentConfig` with `output_format` supporting `"markdown"` (default), `"plain"`, `"djot"`, all powered by html-to-markdown-rs
- Config: `ContentConfig.exclude_selectors` for CSS selector-based element exclusion (`.class`, `#id`, `[attr]`); replaces the buggy `apply_remove_tags` DOM manipulation
- Config: `ContentConfig.preprocessing_preset` (`"minimal"`, `"standard"`, `"aggressive"`) for controlling noise removal aggressiveness
- Config: Full h2m configuration exposed: `strip_tags`, `preserve_tags`, `skip_images`, `max_depth`, `wrap`, `wrap_width`
- Encoding: Non-UTF-8 charset detection and re-decoding via `encoding_rs` (fixes Shift_JIS, EUC-JP, etc.)
- WAF: Expanded WAF detection: AWS CloudFront, `awswaf.com` challenge scripts, "Verifying your connection" interstitials, "Just a moment" Cloudflare pages
- WAF: WAF detection on HTTP 200 responses with challenge content (catches false-positive 200s from AWS WAF, Cloudflare)
- Browser: Browser fallback on `Forbidden`, `Connection`, and generic errors (was `WafBlocked` only); catches TLS fingerprint blocks and bot-detection responses
- Browser: Unique user data directories per Chrome launch; prevents `SingletonLock` conflicts when multiple instances run concurrently or after crashes
- Benchmark: Full benchmark harness at `tools/benchmark-harness/` with:
  - Scrape-evals dataset (1000 fixtures from HuggingFace) with TF1 multiset F1 scoring
  - Reachability benchmark (16 domains across e-commerce, social, professional, review) with content verification and false-positive detection
  - CPU flamegraphs via pprof, real-time memory/CPU monitoring
  - Per-fixture output saving, baseline comparison reports
  - CLI with download, run, profile, report, validate commands
- CI: Pinned alef to v0.5.3 in CI workflow
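The benchmark entry above mentions multiset F1 scoring without defining it. One common reading is a token-level F1 in which repeated tokens count as many times as they appear in both texts; the Python sketch below follows that reading. Whitespace tokenization and the `multiset_f1` name are assumptions, and the harness may tokenize or normalize differently.

```python
from collections import Counter


def multiset_f1(predicted: str, reference: str) -> float:
    """Token-level F1 where duplicate tokens are matched by count."""
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Counting duplicates matters for scraped pages: an extractor that emits a navigation label five times when the reference contains it once is penalized on precision, which a set-based overlap would not detect.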
Fixes
- Content: Removed buggy `apply_remove_tags` that corrupted DOM on pages with repeated structural patterns; CSS selector exclusion is now handled correctly by h2m during its DOM walk
- Content: Removed duplicate plain text extraction; h2m's `OutputFormat::Plain` handles this natively with proper preprocessing
- Browser: Clean up temporary user data directories after browser teardown and on launch failure
0.1.2
Breaking Changes
- Rust API: Public surface restricted to binding API only. `CrawlEngine`, `CrawlEngineBuilder`, all traits (`Frontier`, `RateLimiter`, `CrawlStore`, `EventEmitter`, `CrawlStrategy`, `ContentFilter`, `CrawlCache`), and all default implementations are now `pub(crate)`. Use `create_engine`, `scrape`, `crawl`, `map_urls`, `batch_scrape`, `batch_crawl` instead.
- Rust API: `BrowserPool`, `BrowserPoolConfig`, `PooledPage` no longer re-exported
- Rust API: `CachedPage`, `InteractionResult`, `ActionResult`, `CrawlEvent` removed from bindings
Features
- Config: Added `rate_limit_ms: Option<u64>` to `CrawlConfig` for per-domain rate limiting across all languages
- CLI: Added `--browser-mode` and `--browser-endpoint` flags to `scrape` and `crawl` subcommands
- CLI: Browser fallback now works in the crawl path (was scrape-only)
- CLI: `--timeout` propagated to browser page-load timeout
- CLI: `--browser-endpoint` validated as `ws://` or `wss://` URL
- CLI: Refactored to use only the public binding API (`create_engine`, `scrape`, `crawl`, `map_urls`, `batch_crawl`)
- Rust API: `serve_api` and `start_mcp_server` re-exported at crate root for server deployments
- Bindings: TypeScript discriminated unions for `AuthConfig` and `CrawlEvent`
- Bindings: TypeScript non-optional fields are now required (no `?`) in `.d.ts`
- Bindings: JSDoc on all TypeScript types, functions, enums, and fields
- Bindings: Javadoc on all Java records, enums, and builders (with HTML escaping)
- Bindings: Go uses `json.RawMessage` for JSON value fields (was `interface{}`)
- Bindings: Elixir enum modules generated for all 9 enums
- Bindings: WASM rustdoc on all generated types and functions
- Bindings: WASM structured error objects `{code, message}` (was plain strings)
- Infra: Workspace-level Cargo lints (`clippy::all`, `unsafe_code`) inherited by all crates
- Docs: Lychee link checker added to CI docs workflow
- E2E: 48 new test fixtures covering batch_crawl, downloads, interaction, WARC, proxy, browser crawl, and more
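The `--browser-endpoint` check listed above accepts only `ws://` or `wss://` URLs. Callers that build the endpoint programmatically may want to pre-validate it before invoking the CLI; the Python sketch below shows one way. The `is_valid_browser_endpoint` helper is hypothetical, and the additional requirement of a non-empty host is an assumption that goes beyond what the changelog states.

```python
from urllib.parse import urlparse


def is_valid_browser_endpoint(endpoint: str) -> bool:
    """Accept only ws:// or wss:// URLs that also name a host."""
    parsed = urlparse(endpoint)
    # urlparse lowercases the scheme, so "WSS://host" is accepted as well.
    return parsed.scheme in ("ws", "wss") and bool(parsed.netloc)
```

For example, `is_valid_browser_endpoint("ws://localhost:9222/devtools/browser")` passes, while an `http://` URL or a bare `localhost:9222` is rejected.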
Fixes
- Bindings: Python mypy passes without `ignore-errors` (enum lookups, return type imports)
- Bindings: Go enum values use serde rename (e.g., `og:image` not `og_image`)
- Bindings: Go batch functions return error instead of panic
- Bindings: TypeScript string enum values use correct casing (snake_case)
- Bindings: TypeScript `format!("{:?}")` replaced with `.to_string()` for string fields
- Bindings: Elixir NIF crate name `kreuzcrawl_nif` (was `kreuzcrawl_rustler`)
- Docs: Removed internal type references (`LlmExtractor`, strategy names) from generated API reference
- Docs: Fixed broken links to deleted repos in comparisons page
- Docs: Added `CONTRIBUTING.md` at project root
- Docs: Snippet validator skips bare Ruby method signatures
- Docs: Snippet validator skips bare TypeScript method signatures
- CI: Homebrew formula sha256 handles single-quoted values
- CI: Docs workflow setup-go@v6 for Go 1.26 toolchain
- CI: Ruby `Gemfile.lock` synced for v0.1.1
- Alef: `alef docs` now uses filtered IR matching binding surface
- Alef: Deterministic C# NativeMethods.cs ordering (sorted DllImport entries)
- Alef: `alef.toml` uses exclude blacklist instead of include whitelist
0.1.1
Fixes
- WASM: Added `getrandom` with `wasm_js` feature for wasm32 target compatibility
- Java: Downgraded Maven compiler and source plugins from beta to stable (4.0.0-beta → 3.x)
- Elixir: NIF scaffold lib.path + `MIX_ENV=prod` for Hex publish (from v0.1.0)
- CI: Fixed PEP 440 version conversion for stable releases (0.1.0 no longer becomes 0.10)
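The PEP 440 fix in the list above concerns turning a SemVer-style tag into a Python package version without mangling stable releases. The project's actual conversion script is not shown in this changelog; the Python sketch below is a hypothetical `to_pep440` helper that handles only `alpha`, `beta`, and `rc` pre-release labels and passes stable versions through untouched.

```python
import re

# MAJOR.MINOR.PATCH with an optional "-<label>.<n>" pre-release suffix.
# Only alpha/beta/rc labels are handled; this is an assumption for the sketch.
_SEMVER = re.compile(
    r"^(?P<release>\d+\.\d+\.\d+)(?:-(?P<label>alpha|beta|rc)\.(?P<num>\d+))?$"
)
_PEP440_LABEL = {"alpha": "a", "beta": "b", "rc": "rc"}


def to_pep440(version: str) -> str:
    """Convert e.g. "0.1.0-rc.10" to "0.1.0rc10"; keep "0.1.0" unchanged."""
    match = _SEMVER.match(version)
    if match is None:
        raise ValueError(f"unsupported version: {version!r}")
    release = match.group("release")
    label = match.group("label")
    if label is None:
        # Stable release: return the release segment untouched.
        return release
    return f"{release}{_PEP440_LABEL[label]}{match.group('num')}"
```

Matching the whole string with an anchored pattern, rather than applying substring replacements, is what keeps a stable `0.1.0` from being rewritten by rules meant for the pre-release suffix.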
0.1.0
First stable release. High-performance web crawling engine with bindings for 11 languages.
Highlights
- Rust core with async Tokio runtime, configurable crawl depth/concurrency/rate limiting
- REST API server (Firecrawl v1-compatible) with OpenAPI 3.1 spec
- MCP (Model Context Protocol) server for AI agent integration
- Docker image (Alpine, multi-arch amd64/arm64)
- CLI with Homebrew tap (`brew install kreuzberg-dev/tap/kreuzcrawl`)
Language Bindings
Python (PyPI), Node.js (npm), Ruby (RubyGems), Go (pkg.go.dev), Java (Maven Central), C# (NuGet), PHP (Packagist), Elixir (Hex.pm), WebAssembly (npm), C FFI (GitHub Releases), Rust (crates.io)
Changes Since rc.10
- Fixed Elixir Hex publish (NIF lib.path + MIX_ENV=prod)
- Fixed version.rb sync regex (pre-release suffix matching)
- Fixed Ruby native scaffold missing lib.path
- Clean prek run (all hooks pass)
- Idempotent `alef verify` via blake3 output content hashing
0.1.0-rc.10
Features
- Go: Added FFI download pattern: `go generate` downloads prebuilt libraries from GitHub releases, enabling standalone `go get` without local C build
- API: Added schemathesis property-based contract tests (12 tests covering all endpoints)
- CLI: Added Homebrew installation instructions (`brew install kreuzberg-dev/tap/kreuzcrawl`)
Fixes
- Go: Fixed non-opaque struct methods using `r.ptr`; now marshals to JSON via `_from_json` FFI
- Go e2e: Pass `nil` to `CreateEngine` when no config specified
- Python stubs: Removed docstrings from `.pyi` files (ruff PYI021 compliance)
- WASM e2e: Quote hyphenated keys in object literals, use `WasmCrawlConfig` class construction
- Brew e2e: Fixed jq `| length` pipe syntax (was `.length`), skip output capture for all-skipped assertions
- Python e2e: Wrap long `CrawlConfig` lines for E501 compliance
- Rust e2e: Removed `[workspace]` from generated Cargo.toml (conflicts with parent workspace)
- Elixir: Fixed long line formatting in `native.ex` scaffold
- PHP: Unified Packagist package name to `kreuzberg-dev/kreuzcrawl`
- CI: Removed prepare job gate that skipped release events
- Docs: Fixed stale version references in Java/Elixir READMEs and installation guide
- Pre-commit: Replaced local sync-versions hook with `alef-verify` + `alef-sync-versions`
0.1.0-rc.9
Fixes
- WASM: Remove wasm-pack-generated `.gitignore` from `pkg/` subdirectories after build; npm respects nested `.gitignore` and was excluding compiled WASM artifacts even with `files` field set
0.1.0-rc.8
Fixes
- WASM: Removed `pkg/` from `.gitignore` so npm publish includes compiled WASM artifacts
- Ruby: Fixed gem version format in test_apps (`0.1.0.pre.rc.3` instead of `0.1.0.rc3`)
0.1.0-rc.6
Fixes
- WASM: Fixed npm package publishing: added `files`, `main`, `module`, `types` fields to package.json so compiled artifacts are included instead of raw Rust source
- WASM e2e: Added `tsconfig.json` to generated test_app (prevents Vite from walking to root tsconfig)
- Elixir: Removed non-existent `Cargo.lock` from mix.exs files list (NIF crate uses workspace lockfile)
- Rust toolchain: Switched from pinned 1.91 to `stable` (transitive dep `constant_time_eq` 0.4.3 requires 1.95)
Features
- Docker: Added `publish-docker.yaml` workflow with Alpine CLI image, multi-arch builds (amd64/arm64)
0.1.0-rc.5
Fixes
- Version sync: All workspace member Cargo.toml files now synced (binding crates were stuck at rc.2)
- Ruby: Fixed Duration conversion in validate method (`.map()` on `u64`)
- Browser: Re-wired `browser_fetch` into engine scrape pipeline (lost during Tower refactor)
- Brew e2e: Implemented 5 missing assertion types (greater_than_or_equal, contains_all, is_empty, less_than, not_contains)
0.1.0-rc.4
Fixes
- Node: Added missing `serde` dependency to Node binding crate; fixes compilation failure
- Elixir: Added missing `serde` dependency to NIF crate + serde derives on enums; fixes compilation failure
- Ruby: Fixed conflicting `Default` implementations; derive vs manual impl no longer collide
- Ruby: Fixed enum conversion codegen; enum fields now use pattern matching instead of dot access
- Ruby: Fixed `Box<T>` deref in enum tuple variant conversion (`CrawlEvent::Page`)
- Version sync: Added root `package.json` and `kreuzcrawl-node/package.json` to sync-versions extra_paths
0.1.0-rc.3
Fixes
- Go: Fixed module path to `github.com/kreuzberg-dev/kreuzcrawl/packages/go` for proper Go module resolution
- Java: Added extract-from-JAR native library loading; published Maven artifact now works standalone without manual `java.library.path` configuration
- Elixir: Switched to `RustlerPrecompiled` with GitHub release URLs for precompiled NIF binaries
- PHP: Fixed `createEngineFromJson()`; now uses `CrawlConfig` object construction matching the binding API
- PHP: Fixed risky test warning for fixtures with all skipped assertions
- NuGet: Use `PackageLicenseFile` instead of `PackageLicenseExpression` (Elastic-2.0 not OSI-approved)
- Docker (musl): Source cargo env before build (PATH not inherited on ARM)
- Ruby (macOS): Removed `setup-openssl` action that caused OpenSSL conflicts
Features
- Test apps: Added test_apps for all 11 languages (Rust, Python, Node, Go, Java, C#, PHP, Ruby, Elixir, WASM, Homebrew CLI)
- Brew generator: New shell-script e2e test generator for Homebrew CLI testing
- WASM: Full e2e test support; removed incorrect language skips from all fixtures
- WASM codegen: Fixed `mock_url` and `handle` argument handling in generated tests
- Go: Updated to Go 1.26
- Idempotency: All 14 registry publish jobs check for existing packages before publishing
Infrastructure
- Publish workflow: 66/76 jobs succeeded (0 failures, 10 skipped) on rc.2
- Shared actions: Upstreamed `setup-openssl` fix, leveraged shared build/publish actions from `kreuzberg-dev/actions`
- Fixtures: Removed all language skip blocks; all bindings are full crawlers
0.1.0-rc.2
- Initial multi-registry publish (crates.io, PyPI, npm, RubyGems, Maven Central, NuGet, Packagist, Hex.pm, Go, WASM, CLI binaries, Docker, Homebrew)
- Published kreuzcrawl and kreuzcrawl-cli to crates.io
- Created Homebrew formula in homebrew-tap repo
0.1.0-rc.1
- Initial release candidate