Skip to content

Configuration Reference

Configuration Reference

This page documents all configuration types and their defaults across all languages.

ExtractionMeta

Metadata about an LLM extraction pass.

Field Type Default Description
cost float \| None None Estimated cost of the LLM call in USD.
prompt_tokens int \| None None Number of prompt (input) tokens consumed.
completion_tokens int \| None None Number of completion (output) tokens generated.
model str \| None None The model identifier used for extraction.
chunks_processed int Number of content chunks sent to the LLM.

ProxyConfig

Proxy configuration for HTTP requests.

Field Type Default Description
url str Proxy URL (e.g. "http://proxy:8080", "socks5://proxy:1080").
username str \| None None Optional username for proxy authentication.
password str \| None None Optional password for proxy authentication.

ContentConfig

Content extraction and conversion configuration.

Controls how HTML is converted to the output format. Uses html-to-markdown-rs as the conversion engine for all formats (markdown, plain text, djot).

Field Type Default Description
output_format str "markdown" Output format: "markdown" (default), "plain", "djot".
preprocessing_preset str "standard" Preprocessing aggressiveness: "minimal", "standard" (default), "aggressive". - Minimal: only scripts/styles removed. - Standard: also removes nav, nav-hinted headers/footers/asides, forms. - Aggressive: removes all footers/asides unconditionally.
remove_navigation bool True Remove navigation elements (nav, breadcrumbs, menus). Default: True.
remove_forms bool True Remove form elements. Default: True.
strip_tags list[str] [] HTML tag names to strip (render children only, remove the tag wrapper). Default: ["noscript"].
preserve_tags list[str] [] HTML tag names to preserve as raw HTML in output.
exclude_selectors list[str] [] CSS selectors for elements to exclude entirely (element + all content). Unlike strip_tags (which removes the wrapper but keeps children), excluded elements and all descendants are dropped. Supports CSS selectors: .class, #id, [attribute], compound selectors. Example: [".cookie-banner", "#ad-container", "[role='complementary']"]
skip_images bool False Skip image elements in output. Default: False.
max_depth int \| None None Max DOM traversal depth. Prevents stack overflow on deeply nested HTML.
wrap bool False Enable line wrapping. Default: False.
wrap_width int 80 Wrap width when wrap is enabled. Default: 80.
include_document_structure bool True Include document structure tree in output. Default: True.

BrowserConfig

Browser fallback configuration.

Field Type Default Description
mode BrowserMode BrowserMode.AUTO When to use the headless browser fallback.
backend BrowserBackend BrowserBackend.CHROMIUMOXIDE Browser backend used to render JavaScript-heavy pages.
endpoint str \| None None CDP WebSocket endpoint for connecting to an external browser instance.
timeout float 30000ms Timeout for browser page load and rendering (in milliseconds when serialized).
wait BrowserWait BrowserWait.NETWORK_IDLE Wait strategy after browser navigation.
wait_selector str \| None None CSS selector to wait for when wait is Selector.
extra_wait float \| None None Extra time to wait after the wait condition is met.
stealth bool False Enable browser-realistic TLS fingerprint via the stealth HTTP client. Only honored by BrowserBackend.Native — chromiumoxide is already full-stealth via Chrome's TLS stack.
proxy ProxyConfig \| None None Proxy for browser fetches. Overrides CrawlConfig.proxy when set. Native backend supports http/https only (no SOCKS5).
block_url_patterns list[str] [] URL patterns to block before the network request fires. Supports * wildcards. Useful for skipping ads/analytics/large images. Honored by BrowserBackend.Native; chromiumoxide ignores this field today.
eval_script str \| None None JavaScript snippet evaluated after navigation completes. Scraping captures the native backend result in ScrapeResult.browser.eval_result. Interactions run this script before page actions on both browser backends but do not include the script result in InteractionResult.
robots_user_agent str \| None None User-agent used when fetching robots.txt. Defaults to BrowserConfig.user_agent (or kreuzcrawl's default) if unset. Native only.
capture_network_events bool False Capture the full network event stream into the result. Default false (only the document event is captured). Native only.
session_affinity bool True Enable session affinity: reuse chromiumoxide Pages for same-domain requests so cookies + fingerprint + solved challenges persist. Default: true. When false, each request gets a fresh Page.

CrawlConfig

Configuration for crawl, scrape, and map operations.

Field Type Default Description
max_depth int \| None None Maximum crawl depth (number of link hops from the start URL).
max_pages int \| None None Maximum number of pages to crawl.
max_concurrent int \| None None Maximum number of concurrent requests.
respect_robots_txt bool False Whether to respect robots.txt directives.
soft_http_errors bool False When true, HTTP-level error responses (404 NotFound, 403 Forbidden, WAF blocks) are surfaced as ScrapeResult records with the matching status_code rather than raised as CrawlError. Default False preserves the historical throw-on-error contract for direct fetches. Independently of this flag, 404s reached at the end of a redirect chain are always surfaced softly — the user opted into redirect-following, so receiving a 404 there is part of the normal flow rather than an unexpected error.
user_agent str \| None None Custom user-agent string.
stay_on_domain bool False Whether to restrict crawling to the same domain.
allow_subdomains bool False Whether to allow subdomains when stay_on_domain is true.
include_paths list[str] [] Regex patterns for paths to include during crawling.
exclude_paths list[str] [] Regex patterns for paths to exclude during crawling.
custom_headers dict[str, str] {} Custom HTTP headers to send with each request.
request_timeout float 30000ms Timeout for individual HTTP requests (in milliseconds when serialized).
rate_limit_ms int \| None None Per-domain rate limit in milliseconds. When set, enforces a minimum delay between requests to the same domain. Defaults to 200ms when None.
max_redirects int 10 Maximum number of redirects to follow.
retry_count int 0 Number of retry attempts for failed requests.
retry_codes list[int] [] HTTP status codes that should trigger a retry.
cookies_enabled bool False Whether to enable cookie handling.
auth AuthConfig \| None None Authentication configuration.
max_body_size int \| None None Maximum response body size in bytes.
remove_tags list[str] [] CSS selectors for tags to remove from HTML before processing.
content ContentConfig Content extraction and conversion configuration.
map_limit int \| None None Maximum number of URLs to return from a map operation.
map_search str \| None None Search filter for map results (case-insensitive substring match on URLs).
download_assets bool False Whether to download assets (CSS, JS, images, etc.) from the page.
asset_types list[AssetCategory] [] Filter for asset categories to download.
max_asset_size int \| None None Maximum size in bytes for individual asset downloads.
browser BrowserConfig Browser configuration.
proxy ProxyConfig \| None None Proxy configuration for HTTP requests.
user_agents list[str] [] List of user-agent strings for rotation. If non-empty, overrides user_agent.
capture_screenshot bool False Whether to capture a screenshot when using the browser.
download_documents bool True Whether to download non-HTML documents (PDF, DOCX, images, code, etc.) instead of skipping them.
document_max_size int \| None None Maximum size in bytes for document downloads. Defaults to 50 MB.
document_mime_types list[str] [] Allowlist of MIME types to download. If empty, uses built-in defaults.
warc_output str \| None None Path to write WARC output. If None, WARC output is disabled.
browser_profile str \| None None Named browser profile for persistent sessions (cookies, localStorage).
save_browser_profile bool False Whether to save changes back to the browser profile on exit.
dispatch str \| None None Pluggable dispatch components: bypass provider, escalation strategy, retry policy, WAF classifier, domain state, escalation budget, and max_total_attempts. When None, the engine uses its built-in defaults (no bypass, BrowserOnly strategy, SimpleRetryPolicy, built-in WAF classifier, no domain state, unlimited budget, 10 total attempt cap). Not serializable — callers construct this at runtime and skip in TOML/JSON configs.

BrowserExtras

Browser-specific extras populated when the native browser backend was used.

Available on ScrapeResult.browser when BrowserBackend.Native handled the request.

Field Type Default Description
eval_result dict[str, Any] \| None None Return value of BrowserConfig.eval_script, if provided.
network_events list[ResponseMeta] [] Network events captured during page navigation (only populated when BrowserConfig.capture_network_events is true).
cookies list[CookieInfo] [] All non-expired cookies present in the browser's cookie jar after navigation completes (includes both prior cookies and server Set-Cookie).

DownloadedDocument

A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).

When the crawler encounters non-HTML content and download_documents is enabled, it downloads the raw bytes and populates this struct instead of skipping the resource.

Field Type Default Description
url str The URL the document was fetched from.
mime_type str The MIME type from the Content-Type header.
content bytes Raw document bytes. Skipped during JSON serialization.
size int Size of the document in bytes.
filename str \| None None Filename extracted from Content-Disposition or URL path.
content_hash str SHA-256 hex digest of the content.
headers dict[str, str] {} Selected response headers.

InteractionResult

Result of executing a sequence of page interaction actions.

Field Type Default Description
action_results list[ActionResult] [] Results from each executed action.
final_html str Final page HTML after all actions completed.
final_url str Final page URL (may have changed due to navigation).
screenshot bytes \| None None Screenshot taken after all actions, if requested.

ActionResult

Result from a single page action execution.

Field Type Default Description
action_index int Zero-based index of the action in the sequence.
action_type str The type of action that was executed.
success bool Whether the action completed successfully.
data dict[str, Any] \| None None Action-specific return data (screenshot bytes, JS return value, scraped HTML).
error str \| None None Error message if the action failed.

ScrapeResult

The result of a single-page scrape operation.

Field Type Default Description
status_code int The HTTP status code of the response.
final_url str The final URL after following all redirects.
content_type str The Content-Type header value.
html str The HTML body of the response.
body_size int The size of the response body in bytes.
metadata PageMetadata Extracted metadata from the page.
links list[LinkInfo] [] Links found on the page.
images list[ImageInfo] [] Images found on the page.
feeds list[FeedInfo] [] Feed links found on the page.
json_ld list[JsonLdEntry] [] JSON-LD entries found on the page.
is_allowed bool Whether the URL is allowed by robots.txt.
crawl_delay int \| None None The crawl delay from robots.txt, in seconds.
noindex_detected bool Whether a noindex directive was detected.
nofollow_detected bool Whether a nofollow directive was detected.
x_robots_tag str \| None None The X-Robots-Tag header value, if present.
is_pdf bool Whether the content is a PDF.
was_skipped bool Whether the page was skipped (binary or PDF content).
detected_charset str \| None None The detected character set encoding.
auth_header_sent bool Whether an authentication header was sent with the request.
response_meta ResponseMeta \| None None Response metadata extracted from HTTP headers.
assets list[DownloadedAsset] [] Downloaded assets from the page.
js_render_hint bool Whether the page content suggests JavaScript rendering is needed.
browser_used bool Whether the browser fallback was used to fetch this page.
markdown MarkdownResult \| None None Markdown conversion of the page content.
extracted_data dict[str, Any] \| None None Structured data extracted by LLM. Populated when extraction is configured.
extraction_meta ExtractionMeta \| None None Metadata about the LLM extraction pass (cost, tokens, model).
screenshot bytes \| None None Screenshot of the page as PNG bytes. Populated when browser is used and capture_screenshot is enabled.
downloaded_document DownloadedDocument \| None None Downloaded non-HTML document (PDF, DOCX, image, code, etc.).
browser BrowserExtras \| None None Browser-specific extras (eval result, network events, cookies). Only populated when BrowserBackend.Native was used for this request.

CrawlPageResult

The result of crawling a single page during a crawl operation.

Field Type Default Description
url str The original URL of the page.
normalized_url str The normalized URL of the page.
status_code int The HTTP status code of the response.
content_type str The Content-Type header value.
html str The HTML body of the response.
body_size int The size of the response body in bytes.
metadata PageMetadata Extracted metadata from the page.
links list[LinkInfo] [] Links found on the page.
images list[ImageInfo] [] Images found on the page.
feeds list[FeedInfo] [] Feed links found on the page.
json_ld list[JsonLdEntry] [] JSON-LD entries found on the page.
depth int The depth of this page from the start URL.
stayed_on_domain bool Whether this page is on the same domain as the start URL.
was_skipped bool Whether this page was skipped (binary or PDF content).
is_pdf bool Whether the content is a PDF.
detected_charset str \| None None The detected character set encoding.
markdown MarkdownResult \| None None Markdown conversion of the page content.
extracted_data dict[str, Any] \| None None Structured data extracted by LLM. Populated when extraction is configured.
extraction_meta ExtractionMeta \| None None Metadata about the LLM extraction pass (cost, tokens, model).
downloaded_document DownloadedDocument \| None None Downloaded non-HTML document (PDF, DOCX, image, code, etc.).
browser_used bool Whether the browser fallback was used to fetch this page.

CrawlResult

The result of a multi-page crawl operation.

Field Type Default Description
pages list[CrawlPageResult] [] The list of crawled pages.
final_url str The final URL after following redirects.
redirect_count int The number of redirects followed.
was_skipped bool Whether any page was skipped during crawling.
error str \| None None An error message, if the crawl encountered an issue.
cookies list[CookieInfo] [] Cookies collected during the crawl.
stayed_on_domain bool Whether all crawled pages stayed on the same domain as the start URL.
browser_used bool Whether the browser fallback was used for any page in this crawl.
normalized_urls list[str] [] Normalized URLs encountered during crawling (for deduplication counting).

SitemapUrl

A URL entry from a sitemap.

Field Type Default Description
url str The URL.
lastmod str \| None None The last modification date, if present.
changefreq str \| None None The change frequency, if present.
priority str \| None None The priority, if present.

MapResult

The result of a map operation, containing discovered URLs.

Field Type Default Description
urls list[SitemapUrl] [] The list of discovered URLs.

MarkdownResult

Rich markdown conversion result from HTML processing.

Field Type Default Description
content str Converted markdown text.
document_structure dict[str, Any] \| None None Structured document tree with semantic nodes.
tables list[dict[str, Any]] [] Extracted tables with structured cell data.
warnings list[str] [] Non-fatal processing warnings.
citations bool Whether citation conversion was applied and produced at least one reference. True when the markdown contained inline links that were converted to numbered citation references. The converted content (with [N] markers) is available in content; the full reference list is accessible via generate_citations if needed separately.
fit_content str \| None None Content-filtered markdown optimized for LLM consumption.

LinkInfo

Information about a link found on a page.

Field Type Default Description
url str The resolved URL of the link.
text str The visible text of the link.
link_type LinkType LinkType.INTERNAL The classification of the link.
rel str \| None None The rel attribute value, if present.
nofollow bool Whether the link has rel="nofollow".

ImageInfo

Information about an image found on a page.

Field Type Default Description
url str The image URL.
alt str \| None None The alt text, if present.
width int \| None None The width attribute, if present and parseable.
height int \| None None The height attribute, if present and parseable.
source ImageSource ImageSource.IMG The source of the image reference.

FeedInfo

Information about a feed link found on a page.

Field Type Default Description
url str The feed URL.
title str \| None None The feed title, if present.
feed_type FeedType FeedType.RSS The type of feed.

JsonLdEntry

A JSON-LD structured data entry found on a page.

Field Type Default Description
schema_type str The @type value from the JSON-LD object.
name str \| None None The name value, if present.
raw str The raw JSON-LD string.

CookieInfo

Information about an HTTP cookie received from a response.

Field Type Default Description
name str The cookie name.
value str The cookie value.
domain str \| None None The cookie domain, if specified.
path str \| None None The cookie path, if specified.

DownloadedAsset

A downloaded asset from a page.

Field Type Default Description
url str The original URL of the asset.
content_hash str The SHA-256 content hash of the asset.
mime_type str \| None None The MIME type from the Content-Type header.
size int The size of the asset in bytes.
asset_category AssetCategory AssetCategory.IMAGE The category of the asset.
html_tag str \| None None The HTML tag that referenced this asset (e.g., "link", "script", "img").

ArticleMetadata

Article metadata extracted from article:* Open Graph tags.

Field Type Default Description
published_time str \| None None The article publication time.
modified_time str \| None None The article modification time.
author str \| None None The article author.
section str \| None None The article section.
tags list[str] [] The article tags.

HreflangEntry

An hreflang alternate link entry.

Field Type Default Description
lang str The language code (e.g., "en", "fr", "x-default").
url str The URL for this language variant.

FaviconInfo

Information about a favicon or icon link.

Field Type Default Description
url str The icon URL.
rel str The rel attribute (e.g., "icon", "apple-touch-icon").
sizes str \| None None The sizes attribute, if present.
mime_type str \| None None The MIME type, if present.

HeadingInfo

A heading element extracted from the page.

Field Type Default Description
level int The heading level (1-6).
text str The heading text content.

ResponseMeta

Response metadata extracted from HTTP headers.

Field Type Default Description
etag str \| None None The ETag header value.
last_modified str \| None None The Last-Modified header value.
cache_control str \| None None The Cache-Control header value.
server str \| None None The Server header value.
x_powered_by str \| None None The X-Powered-By header value.
content_language str \| None None The Content-Language header value.
content_encoding str \| None None The Content-Encoding header value.

PageMetadata

Metadata extracted from an HTML page's <meta> tags and <title> element.

Field Type Default Description
title str \| None None The page title from the <title> element.
description str \| None None The meta description.
canonical_url str \| None None The canonical URL from <link rel="canonical">.
keywords str \| None None Keywords from <meta name="keywords">.
author str \| None None Author from <meta name="author">.
viewport str \| None None Viewport content from <meta name="viewport">.
theme_color str \| None None Theme color from <meta name="theme-color">.
generator str \| None None Generator from <meta name="generator">.
robots str \| None None Robots content from <meta name="robots">.
html_lang str \| None None The lang attribute from the <html> element.
html_dir str \| None None The dir attribute from the <html> element.
og_title str \| None None Open Graph title.
og_type str \| None None Open Graph type.
og_image str \| None None Open Graph image URL.
og_description str \| None None Open Graph description.
og_url str \| None None Open Graph URL.
og_site_name str \| None None Open Graph site name.
og_locale str \| None None Open Graph locale.
og_video str \| None None Open Graph video URL.
og_audio str \| None None Open Graph audio URL.
og_locale_alternates list[str] \| None [] Open Graph locale alternates.
twitter_card str \| None None Twitter card type.
twitter_title str \| None None Twitter title.
twitter_description str \| None None Twitter description.
twitter_image str \| None None Twitter image URL.
twitter_site str \| None None Twitter site handle.
twitter_creator str \| None None Twitter creator handle.
dc_title str \| None None Dublin Core title.
dc_creator str \| None None Dublin Core creator.
dc_subject str \| None None Dublin Core subject.
dc_description str \| None None Dublin Core description.
dc_publisher str \| None None Dublin Core publisher.
dc_date str \| None None Dublin Core date.
dc_type str \| None None Dublin Core type.
dc_format str \| None None Dublin Core format.
dc_identifier str \| None None Dublin Core identifier.
dc_language str \| None None Dublin Core language.
dc_rights str \| None None Dublin Core rights.
article ArticleMetadata \| None None Article metadata from article:* Open Graph tags.
hreflangs list[HreflangEntry] \| None [] Hreflang alternate links.
favicons list[FaviconInfo] \| None [] Favicon and icon links.
headings list[HeadingInfo] \| None [] Heading elements (h1-h6).
word_count int \| None None Computed word count of the page body text.

CrawlStreamRequest

Request to begin a single-URL streaming crawl.

Wraps a single seed URL for delivery through the streaming-adapter binding surface. Required as a struct because alef's streaming adapter requires a named request type — primitives are not supported.

Field Type Default Description
url str The seed URL to crawl.

BatchCrawlStreamRequest

Request to begin a multi-URL streaming crawl.

Wraps a set of seed URLs for delivery through the streaming-adapter binding surface. Required as a struct because alef's streaming adapter requires a named request type — primitives are not supported.

Field Type Default Description
urls list[str] [] The seed URLs to crawl. Each URL is followed independently up to the engine's configured depth.

CitationResult

Result of citation conversion.

Field Type Default Description
content str Markdown with links replaced by numbered citations.
references list[CitationReference] [] Numbered reference list: (index, url, text).

CitationReference

A single numbered reference in a citation list — produced by the citation extractor when content uses inline [N]-style markers.

Field Type Default Description
index int 1-based reference number as it appears in the source text.
url str Resolved absolute URL for this reference.
text str Human-readable anchor text or title for the reference.

BatchScrapeResult

Result from a single URL in a batch scrape operation.

Field Type Default Description
url str The URL that was scraped.
result ScrapeResult \| None None The scrape result, if successful.
error str \| None None The error message, if the scrape failed.

BatchCrawlResult

Result from a single URL in a batch crawl operation.

Field Type Default Description
url str The seed URL that was crawled.
result CrawlResult \| None None The crawl result, if successful.
error str \| None None The error message, if the crawl failed.

BatchScrapeResults

Aggregate result of a batch scrape, exposing per-URL results plus precomputed counts.

The counts are derived once at construction so every binding language can read them as plain integer fields without re-iterating the results vector.

Field Type Default Description
results list[BatchScrapeResult] [] Per-URL scrape results, in the order URLs were submitted.
total_count int Total number of URLs in the batch (equal to results.len()).
completed_count int Number of URLs whose scrape succeeded (error is None).
failed_count int Number of URLs whose scrape failed (error is Some).

BatchCrawlResults

Aggregate result of a batch crawl, exposing per-URL results plus precomputed counts.

The counts are derived once at construction so every binding language can read them as plain integer fields without re-iterating the results vector.

Field Type Default Description
results list[BatchCrawlResult] [] Per-URL crawl results, in the order seed URLs were submitted.
total_count int Total number of seed URLs in the batch (equal to results.len()).
completed_count int Number of seed URLs whose crawl succeeded (error is None).
failed_count int Number of seed URLs whose crawl failed (error is Some).

Enums

AssetCategory

The category of a downloaded asset.

Variant Wire value Description
Document document A document file (PDF, DOC, etc.).
Image image An image file.
Audio audio An audio file.
Video video A video file.
Font font A font file.
Stylesheet stylesheet A CSS stylesheet.
Script script A JavaScript file.
Archive archive An archive file (ZIP, TAR, etc.).
Data data A data file (JSON, XML, CSV, etc.).
Other other An unrecognized asset type.

AuthConfig

Authentication configuration.

Variant Wire value Description
Basic basic HTTP Basic authentication. — Fields: username: String, password: String
Bearer bearer Bearer token authentication. — Fields: token: String
Header header Custom authentication header. — Fields: name: String, value: String

BrowserBackend

Browser backend used for JavaScript rendering.

Variant Wire value Description
Chromiumoxide chromiumoxide Existing Chromium/CDP backend powered by chromiumoxide.
Native native Kreuzcrawl-owned native browser backend derived from Obscura.

BrowserMode

When to use the headless browser fallback.

Variant Wire value Description
Auto auto Automatically detect when JS rendering is needed and fall back to browser.
Always always Always use the browser for every request.
Never never Never use the browser fallback.

BrowserWait

Wait strategy for browser page rendering.

Variant Wire value Description
NetworkIdle network_idle Wait until network activity is idle.
Selector selector Wait for a specific CSS selector to appear in the DOM.
Fixed fixed Wait for a fixed duration after navigation.

FeedType

The type of a feed (RSS, Atom, or JSON Feed).

Variant Wire value Description
Rss rss RSS feed.
Atom atom Atom feed.
JsonFeed json_feed JSON Feed.

ImageSource

The source of an image reference.

Variant Wire value Description
Img img An <img> tag.
PictureSource picture_source A <source> tag inside <picture>.
OgImage og:image An og:image meta tag.
TwitterImage twitter:image A twitter:image meta tag.

LinkType

The classification of a link.

Variant Wire value Description
Internal internal A link to the same domain.
External external A link to a different domain.
Anchor anchor A fragment-only link (e.g., #section).
Document document A link to a downloadable document (PDF, DOC, etc.).

Edit this page on GitHub