Configuration

All kreuzcrawl operations are controlled through CrawlConfig. This guide covers every public field, its default value, and validation rules.

Constructing an engine

Pass a CrawlConfig to create_engine, then call scrape / crawl / map_urls / batch_scrape / batch_crawl against the returned handle:

use kreuzcrawl::{CrawlConfig, create_engine};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(3),
    max_pages: Some(100),
    max_concurrent: Some(5),
    stay_on_domain: true,
    respect_robots_txt: true,
    ..Default::default()
}))?;

create_engine runs full validation up front: depth, page, and size bounds; include/exclude path regexes; and field-level constraints. Validation errors surface as CrawlError::InvalidConfig before any network request is made.

Convenience constructors

For most use cases, create_engine plus one of the top-level async functions is all you need:

use kreuzcrawl::{CrawlConfig, create_engine, scrape, crawl, map_urls};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(2),
    ..Default::default()
}))?;

let scrape_result = scrape(&engine, "https://example.com").await?;
let crawl_result = crawl(&engine, "https://example.com").await?;
let map_result = map_urls(&engine, "https://example.com").await?;

Passing None to create_engine uses CrawlConfig::default().

For scraping or crawling multiple URLs in one go, use the batch variants. Each URL runs concurrently; failures are captured per-URL rather than bubbling up as an error:

use kreuzcrawl::{create_engine, batch_scrape, batch_crawl};

let engine = create_engine(None)?;

let scrape_results = batch_scrape(&engine, vec![
    "https://example.com".into(),
    "https://example.org".into(),
]).await;

let crawl_results = batch_crawl(&engine, vec![
    "https://example.com".into(),
    "https://example.org".into(),
]).await;

// Each entry in the Vec has .url, .result (Option<_>), and .error (Option<String>).
for r in &scrape_results {
    if let Some(err) = &r.error {
        eprintln!("{}: {}", r.url, err);
    }
}
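
Because failures are captured per URL, a common follow-up step is to split the returned entries into successes and failures. The sketch below is standalone: `BatchEntry` and `partition_batch` are hypothetical names that only mirror the `.url`, `.result`, and `.error` fields mentioned above, not the actual kreuzcrawl types.

```rust
/// Hypothetical stand-in for one batch entry; mirrors the documented
/// `.url`, `.result`, and `.error` fields, not the real kreuzcrawl type.
#[derive(Debug, Clone, PartialEq)]
pub struct BatchEntry<T> {
    pub url: String,
    pub result: Option<T>,
    pub error: Option<String>,
}

/// Split batch entries into (url, value) successes and (url, message) failures.
/// An entry with neither a result nor an error is reported as a failure,
/// which is an assumption of this sketch.
pub fn partition_batch<T>(
    entries: Vec<BatchEntry<T>>,
) -> (Vec<(String, T)>, Vec<(String, String)>) {
    let mut ok = Vec::new();
    let mut failed = Vec::new();
    for entry in entries {
        match (entry.result, entry.error) {
            (Some(value), None) => ok.push((entry.url, value)),
            (_, Some(message)) => failed.push((entry.url, message)),
            (None, None) => {
                failed.push((entry.url, "no result and no error".to_string()))
            }
        }
    }
    (ok, failed)
}
```

With the real batch functions, the same pattern would apply to whatever entry type they return, so one bad URL does not hide the results of the others.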

Config validation

CrawlConfig::validate() is called automatically during CrawlEngineBuilder::build(). It checks:

| Rule | Error |
|---|---|
| max_concurrent must not be 0 | "max_concurrent must be > 0" |
| max_pages must not be 0 | "max_pages must be > 0" |
| max_redirects must be <= 100 | "max_redirects must be <= 100" |
| request_timeout must not be zero | "request_timeout must be > 0" |
| browser.wait_selector required when browser.wait is Selector | "browser.wait_selector required when browser.wait is Selector" |
| All include_paths must be valid regex | "invalid include_path regex '...': ..." |
| All exclude_paths must be valid regex | "invalid exclude_path regex '...': ..." |
| All retry_codes must be 100-599 | "invalid retry code: ..." |

You can also call validate() manually:

use kreuzcrawl::CrawlConfig;

let config = CrawlConfig {
    max_concurrent: Some(0),
    ..Default::default()
};

match config.validate() {
    Ok(()) => println!("Valid"),
    Err(e) => eprintln!("Invalid: {}", e),
}
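
To make the numeric rules concrete, here is a standalone sketch that mirrors them with a stand-in struct. `ConfigSketch` is hypothetical and is not the kreuzcrawl type. The order of the checks and the text after the colon in the retry-code message are assumptions. The regex and browser.wait_selector rules are left out so the sketch needs only the standard library.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the subset of CrawlConfig fields covered by the
/// numeric validation rules; this is not the kreuzcrawl type.
#[derive(Debug, Clone)]
pub struct ConfigSketch {
    pub max_concurrent: Option<usize>,
    pub max_pages: Option<usize>,
    pub max_redirects: usize,
    pub request_timeout: Duration,
    pub retry_codes: Vec<u16>,
}

impl Default for ConfigSketch {
    fn default() -> Self {
        // Defaults copied from the field reference in this guide.
        Self {
            max_concurrent: None,
            max_pages: None,
            max_redirects: 10,
            request_timeout: Duration::from_secs(30),
            retry_codes: Vec::new(),
        }
    }
}

impl ConfigSketch {
    /// Return the first violated rule as an error message.
    /// The check order is an assumption of this sketch.
    pub fn validate(&self) -> Result<(), String> {
        if self.max_concurrent == Some(0) {
            return Err("max_concurrent must be > 0".to_string());
        }
        if self.max_pages == Some(0) {
            return Err("max_pages must be > 0".to_string());
        }
        if self.max_redirects > 100 {
            return Err("max_redirects must be <= 100".to_string());
        }
        if self.request_timeout.is_zero() {
            return Err("request_timeout must be > 0".to_string());
        }
        for code in &self.retry_codes {
            if !(100..=599).contains(code) {
                // The text after the colon is an assumption of this sketch.
                return Err(format!("invalid retry code: {}", code));
            }
        }
        Ok(())
    }
}
```

Note that `None` passes the `max_concurrent` and `max_pages` checks; only an explicit `Some(0)` is rejected.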

Full field reference

Crawl scope

| Field | Type | Default | Description |
|---|---|---|---|
| max_depth | Option<usize> | None | Maximum crawl depth (link hops from seed). None means 0 (seed only). |
| max_pages | Option<usize> | None | Maximum pages to crawl. None means unlimited. Must be > 0 if set. |
| stay_on_domain | bool | false | Restrict crawling to the seed URL's domain. |
| allow_subdomains | bool | false | Allow subdomains when stay_on_domain is true. |
| include_paths | Vec<String> | [] | Regex patterns; only matching URL paths are crawled. Seed URL (depth 0) is always included. |
| exclude_paths | Vec<String> | [] | Regex patterns; matching URL paths are skipped. |
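
The interaction between stay_on_domain and allow_subdomains can be pictured as a host check. The helper below is a hypothetical sketch, not kreuzcrawl's implementation: it assumes an exact, case-insensitive host match, with dot-separated subdomains accepted only when allowed. A real crawler may compare registrable domains instead.

```rust
/// Hypothetical helper: is `candidate_host` in scope for a crawl seeded at
/// `seed_host`? Assumes exact host matching plus dot-separated subdomains.
pub fn host_in_scope(
    seed_host: &str,
    candidate_host: &str,
    stay_on_domain: bool,
    allow_subdomains: bool,
) -> bool {
    // Without stay_on_domain, every host is in scope.
    if !stay_on_domain {
        return true;
    }
    // Host names are case-insensitive.
    let seed = seed_host.to_ascii_lowercase();
    let candidate = candidate_host.to_ascii_lowercase();
    if candidate == seed {
        return true;
    }
    // Require a leading dot so "notexample.com" does not match "example.com".
    allow_subdomains && candidate.ends_with(&format!(".{}", seed))
}
```

Under these assumptions, allow_subdomains has no effect unless stay_on_domain is also set, which matches the field description.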

HTTP client

| Field | Type | Default | Description |
|---|---|---|---|
| max_concurrent | Option<usize> | None (10) | Maximum concurrent requests. |
| request_timeout | Duration | 30 seconds | Timeout for individual HTTP requests. Serialized as milliseconds in JSON. |
| max_redirects | usize | 10 | Maximum redirects to follow. Must be <= 100. |
| retry_count | usize | 0 | Number of retry attempts for failed requests. |
| retry_codes | Vec<u16> | [] | HTTP status codes that trigger a retry. Each must be 100-599. |
| max_body_size | Option<usize> | None | Maximum response body size in bytes. Larger responses are truncated. |
| user_agent | Option<String> | None | Custom User-Agent string. |
| user_agents | Vec<String> | [] | User-Agent strings for rotation. When non-empty, overrides user_agent. |
| custom_headers | HashMap<String, String> | {} | Extra HTTP headers sent with every request. |
| cookies_enabled | bool | false | Whether to collect and track cookies across requests. |
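
Two of these fields interact with others: retry_count with retry_codes, and user_agents with user_agent. The helpers below are a hypothetical sketch of those decisions, not kreuzcrawl's code. They assume a retry happens only while attempts remain and the status is listed, and that rotation is round-robin by request index.

```rust
/// Hypothetical helper: should a response with `status` be retried?
/// `retries_made` counts the retries already performed for this request.
pub fn should_retry(
    status: u16,
    retries_made: usize,
    retry_count: usize,
    retry_codes: &[u16],
) -> bool {
    retries_made < retry_count && retry_codes.contains(&status)
}

/// Hypothetical helper: pick the User-Agent for the nth request.
/// A non-empty `user_agents` list overrides `user_agent`; round-robin
/// rotation is an assumption of this sketch.
pub fn pick_user_agent<'a>(
    user_agent: Option<&'a str>,
    user_agents: &'a [String],
    request_index: usize,
) -> Option<&'a str> {
    if user_agents.is_empty() {
        user_agent
    } else {
        Some(user_agents[request_index % user_agents.len()].as_str())
    }
}
```

With the default retry_count of 0, `should_retry` never returns true, whatever retry_codes contains.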

Authentication

| Field | Type | Default | Description |
|---|---|---|---|
| auth | Option<AuthConfig> | None | Authentication configuration. |

AuthConfig variants:

// HTTP Basic authentication
AuthConfig::Basic { username: "user".into(), password: "pass".into() }

// Bearer token
AuthConfig::Bearer { token: "your-token".into() }

// Custom header
AuthConfig::Header { name: "X-API-Key".into(), value: "key-value".into() }
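
For orientation, the sketch below shows the request header each variant would conventionally translate to. `AuthSketch` is a hypothetical stand-in for AuthConfig. Basic and Bearer follow the standard HTTP Authorization schemes; whether kreuzcrawl builds its headers exactly this way is an assumption. The base64 encoder is written out by hand only to keep the example free of external crates.

```rust
/// Hypothetical stand-in for the three AuthConfig variants shown above.
pub enum AuthSketch {
    Basic { username: String, password: String },
    Bearer { token: String },
    Header { name: String, value: String },
}

/// Standard base64 (RFC 4648) with padding.
pub fn base64_encode(input: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the rest.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[((n >> 18) & 63) as usize] as char);
        out.push(TABLE[((n >> 12) & 63) as usize] as char);
        // Emit '=' padding where the input chunk was short.
        out.push(if chunk.len() > 1 { TABLE[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Map each variant to the (header name, header value) pair it
/// conventionally produces.
pub fn auth_header(auth: &AuthSketch) -> (String, String) {
    match auth {
        AuthSketch::Basic { username, password } => {
            let credentials = format!("{}:{}", username, password);
            (
                "Authorization".to_string(),
                format!("Basic {}", base64_encode(credentials.as_bytes())),
            )
        }
        AuthSketch::Bearer { token } => {
            ("Authorization".to_string(), format!("Bearer {}", token))
        }
        AuthSketch::Header { name, value } => (name.clone(), value.clone()),
    }
}
```

The Header variant is the only one that does not use the Authorization header, which makes it the natural fit for API-key style schemes.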

Proxy

| Field | Type | Default | Description |
|---|---|---|---|
| proxy | Option<ProxyConfig> | None | Proxy configuration. |

ProxyConfig example:

ProxyConfig {
    url: "http://proxy:8080".to_string(), // or "socks5://proxy:1080"
    username: Some("user".to_string()),
    password: Some("pass".to_string()),
}

Robots and compliance

| Field | Type | Default | Description |
|---|---|---|---|
| respect_robots_txt | bool | false | Fetch and honor robots.txt directives (allow/disallow, crawl-delay, sitemap). |

Content processing

| Field | Type | Default | Description |
|---|---|---|---|
| content.preprocessing_preset | String | "standard" | HTML preprocessing strength: "minimal", "standard", or "aggressive" (aggressive strips chrome/boilerplate). |
| remove_tags | Vec<String> | [] | CSS selectors for elements to remove before processing (e.g., "nav", ".sidebar"). |
| max_body_size | Option<usize> | None | Truncate HTML bodies beyond this size in bytes. None keeps the full body. |

URL discovery (map)

| Field | Type | Default | Description |
|---|---|---|---|
| map_limit | Option<usize> | None | Maximum URLs returned by the map operation. |
| map_search | Option<String> | None | Case-insensitive substring filter for map results. |
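
The effect of these two fields can be sketched as a filter followed by a cap. `filter_map_results` is a hypothetical helper, not part of kreuzcrawl, and applying the search before the limit is an assumption of this sketch.

```rust
/// Hypothetical helper mirroring `map_search` and `map_limit`: keep URLs that
/// contain the search string (case-insensitive), then cap the count.
pub fn filter_map_results(
    urls: &[String],
    map_search: Option<&str>,
    map_limit: Option<usize>,
) -> Vec<String> {
    // Lowercase the needle once; each URL is lowercased for comparison only.
    let needle = map_search.map(|s| s.to_lowercase());
    urls.iter()
        .filter(|url| match &needle {
            Some(n) => url.to_lowercase().contains(n.as_str()),
            None => true,
        })
        // None means no cap.
        .take(map_limit.unwrap_or(usize::MAX))
        .cloned()
        .collect()
}
```

The returned URLs keep their original casing and order; only the comparison is case-insensitive.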

Asset downloading

| Field | Type | Default | Description |
|---|---|---|---|
| download_assets | bool | false | Download page assets (CSS, JS, images, etc.). |
| asset_types | Vec<AssetCategory> | [] | Filter for which asset categories to download. Empty means all. |
| max_asset_size | Option<usize> | None | Maximum size per asset download in bytes. |

AssetCategory options: Document, Image, Audio, Video, Font, Stylesheet, Script, Archive, Data, Other.
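
The three asset fields combine into a single yes/no decision per asset. The sketch below uses a hypothetical `AssetKind` enum covering only a few of the categories listed above. It also assumes that an asset over max_asset_size is skipped rather than truncated, which this guide does not specify.

```rust
/// Hypothetical stand-in for a few of the AssetCategory options.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AssetKind {
    Image,
    Stylesheet,
    Script,
    Font,
}

/// Hypothetical helper: decide whether one asset should be downloaded.
pub fn should_download_asset(
    download_assets: bool,
    asset_types: &[AssetKind],
    max_asset_size: Option<usize>,
    kind: AssetKind,
    size_bytes: usize,
) -> bool {
    if !download_assets {
        return false;
    }
    // An empty filter means every category is accepted.
    let type_ok = asset_types.is_empty() || asset_types.contains(&kind);
    // None means no size cap; skipping oversized assets is an assumption.
    let size_ok = max_asset_size.map_or(true, |limit| size_bytes <= limit);
    type_ok && size_ok
}
```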

Document downloading

| Field | Type | Default | Description |
|---|---|---|---|
| download_documents | bool | true | Download non-HTML resources (PDF, DOCX, images, code files) instead of skipping them. |
| document_max_size | Option<usize> | 50 MB (52428800 bytes) | Maximum document download size in bytes. |
| document_mime_types | Vec<String> | [] | MIME type allowlist. Empty uses built-in defaults. |

Browser configuration

| Field | Type | Default | Description |
|---|---|---|---|
| browser | BrowserConfig | See below | Headless browser fallback settings. |
| capture_screenshot | bool | false | Capture PNG screenshot when browser is used. |
| browser_profile | Option<String> | None | Named browser profile for persistent sessions. |
| save_browser_profile | bool | false | Save browser profile changes on exit. |

BrowserConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| mode | BrowserMode | Auto | When to use the browser: Auto, Always, or Never. |
| endpoint | Option<String> | None | CDP WebSocket endpoint for an external browser instance. |
| timeout | Duration | 30 seconds | Browser page load timeout. |
| wait | BrowserWait | NetworkIdle | Wait strategy after navigation: NetworkIdle, Selector, or Fixed. |
| wait_selector | Option<String> | None | CSS selector to wait for (required when wait is Selector). |
| extra_wait | Option<Duration> | None | Additional wait time after the wait condition is met. |

WARC output

| Field | Type | Default | Description |
|---|---|---|---|
| warc_output | Option<PathBuf> | None | Path to write WARC output. None disables WARC output. |

Serialization

CrawlConfig implements Serialize and Deserialize with #[serde(deny_unknown_fields)], so unknown keys are rejected during deserialization. Duration fields are serialized as integer milliseconds:

{
  "max_depth": 3,
  "max_pages": 100,
  "max_concurrent": 5,
  "stay_on_domain": true,
  "respect_robots_txt": true,
  "request_timeout": 30000,
  "max_redirects": 10,
  "retry_count": 2,
  "retry_codes": [429, 503],
  "include_paths": ["^/docs/"],
  "exclude_paths": ["/admin/"],
  "content": {
    "preprocessing_preset": "standard"
  },
  "cookies_enabled": false,
  "download_assets": false,
  "download_documents": true,
  "document_max_size": 52428800,
  "capture_screenshot": false,
  "save_browser_profile": false,
  "browser": {
    "mode": "auto",
    "timeout": 30000,
    "wait": "network_idle"
  }
}
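
The numeric values in this JSON can be cross-checked against the defaults with the standard library alone. The helper names below are hypothetical; they only illustrate the millisecond convention for Duration fields and the byte count behind the 50 MB document limit.

```rust
use std::time::Duration;

/// Convert a Duration to the integer-millisecond form used in the JSON above.
/// The cast is safe for any realistic timeout value.
pub fn duration_to_millis(d: Duration) -> u64 {
    d.as_millis() as u64
}

/// Convert an integer-millisecond JSON value back into a Duration.
pub fn millis_to_duration(ms: u64) -> Duration {
    Duration::from_millis(ms)
}

/// 50 * 1024 * 1024 bytes, matching the `document_max_size` value above.
pub const DOCUMENT_MAX_SIZE_BYTES: usize = 50 * 1024 * 1024;
```

So a 30 second request_timeout appears as 30000, and a 45 second timeout would be written as 45000.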

Loading config from a file

Since CrawlConfig implements Deserialize, you can load it from JSON, TOML, or YAML using the appropriate serde crate.

Default values summary

| Field | Default |
|---|---|
| max_depth | None (0, seed only) |
| max_pages | None (unlimited) |
| max_concurrent | None (10) |
| respect_robots_txt | false |
| user_agent | None |
| stay_on_domain | false |
| allow_subdomains | false |
| request_timeout | 30 seconds |
| max_redirects | 10 |
| retry_count | 0 |
| cookies_enabled | false |
| content.preprocessing_preset | "standard" |
| download_assets | false |
| download_documents | true |
| document_max_size | 50 MB |
| capture_screenshot | false |
| save_browser_profile | false |
| browser.mode | Auto |
| browser.timeout | 30 seconds |
| browser.wait | NetworkIdle |
