Configuration¶

All kreuzcrawl operations are controlled through CrawlConfig. This guide covers every public field, its default value, and validation rules.

Constructing an engine¶

Pass a CrawlConfig to create_engine, then call scrape / crawl / map_urls / batch_scrape / batch_crawl against the returned handle:

use kreuzcrawl::{CrawlConfig, create_engine};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(3),
    max_pages: Some(100),
    max_concurrent: Some(5),
    stay_on_domain: true,
    respect_robots_txt: true,
    ..Default::default()
}))?;

create_engine runs full validation up front: depth / page / size bounds, regex paths, and field-level constraints. Validation errors surface as CrawlError::InvalidConfig before any network request is made.

Convenience constructors¶

For most use cases, create_engine plus one of the top-level async functions is all you need:

use kreuzcrawl::{CrawlConfig, create_engine, scrape, crawl, map_urls};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(2),
    ..Default::default()
}))?;

let scrape_result = scrape(&engine, "https://example.com").await?;
let crawl_result = crawl(&engine, "https://example.com").await?;
let map_result = map_urls(&engine, "https://example.com").await?;

Passing None to create_engine uses CrawlConfig::default().

For scraping or crawling multiple URLs in one go, use the batch variants. Each URL runs concurrently; failures are captured per-URL rather than bubbling up as an error:

use kreuzcrawl::{create_engine, batch_scrape, batch_crawl};

let engine = create_engine(None)?;

let scrape_results = batch_scrape(&engine, vec![
    "https://example.com".into(),
    "https://example.org".into(),
]).await;

let crawl_results = batch_crawl(&engine, vec![
    "https://example.com".into(),
    "https://example.org".into(),
]).await;

// Each entry in the Vec has .url, .result (Option<_>), and .error (Option<String>).
for r in &scrape_results {
    if let Some(err) = &r.error {
        eprintln!("{}: {}", r.url, err);
    }
}

Config validation¶

CrawlConfig::validate() is called automatically during CrawlEngineBuilder::build(). It checks:

Rule	Error
`max_concurrent` must not be 0	`"max_concurrent must be > 0"`
`max_pages` must not be 0	`"max_pages must be > 0"`
`max_redirects` must be <= 100	`"max_redirects must be <= 100"`
`request_timeout` must not be zero	`"request_timeout must be > 0"`
`browser.wait_selector` required when `browser.wait` is `Selector`	`"browser.wait_selector required when browser.wait is Selector"`
All `include_paths` must be valid regex	`"invalid include_path regex '...': ..."`
All `exclude_paths` must be valid regex	`"invalid exclude_path regex '...': ..."`
All `retry_codes` must be 100-599	`"invalid retry code: ..."`

You can also call validate() manually:

let config = CrawlConfig {
    max_concurrent: Some(0),
    ..Default::default()
};

match config.validate() {
    Ok(()) => println!("Valid"),
    Err(e) => eprintln!("Invalid: {}", e),
}

Full field reference¶

Crawl scope¶

Field	Type	Default	Description
`max_depth`	`Option<usize>`	`None`	Maximum crawl depth (link hops from seed). `None` means 0 (seed only).
`max_pages`	`Option<usize>`	`None`	Maximum pages to crawl. `None` means unlimited. Must be > 0 if set.
`stay_on_domain`	`bool`	`false`	Restrict crawling to the seed URL's domain.
`allow_subdomains`	`bool`	`false`	Allow subdomains when `stay_on_domain` is true.
`include_paths`	`Vec<String>`	`[]`	Regex patterns -- only matching URL paths are crawled. Seed URL (depth 0) is always included.
`exclude_paths`	`Vec<String>`	`[]`	Regex patterns -- matching URL paths are skipped.

HTTP client¶

Field	Type	Default	Description
`max_concurrent`	`Option<usize>`	`None` (10)	Maximum concurrent requests.
`request_timeout`	`Duration`	30 seconds	Timeout for individual HTTP requests. Serialized as milliseconds in JSON.
`max_redirects`	`usize`	`10`	Maximum redirects to follow. Must be <= 100.
`retry_count`	`usize`	`0`	Number of retry attempts for failed requests.
`retry_codes`	`Vec<u16>`	`[]`	HTTP status codes that trigger a retry. Each must be 100-599.
`max_body_size`	`Option<usize>`	`None`	Maximum response body size in bytes. Responses are truncated.
`user_agent`	`Option<String>`	`None`	Custom User-Agent string.
`user_agents`	`Vec<String>`	`[]`	User-Agent strings for rotation. When non-empty, overrides `user_agent`.
`custom_headers`	`HashMap<String, String>`	`{}`	Extra HTTP headers sent with every request.
`cookies_enabled`	`bool`	`false`	Whether to collect and track cookies across requests.

Authentication¶

Field	Type	Default	Description
`auth`	`Option<AuthConfig>`	`None`	Authentication configuration.

AuthConfig variants:

// HTTP Basic authentication
AuthConfig::Basic { username: "user".into(), password: "pass".into() }

// Bearer token
AuthConfig::Bearer { token: "your-token".into() }

// Custom header
AuthConfig::Header { name: "X-API-Key".into(), value: "key-value".into() }

Proxy¶

Field	Type	Default	Description
`proxy`	`Option<ProxyConfig>`	`None`	Proxy configuration.

ProxyConfig {
    url: "http://proxy:8080".to_string(), // or "socks5://proxy:1080"
    username: Some("user".to_string()),
    password: Some("pass".to_string()),
}

Robots and compliance¶

Field	Type	Default	Description
`respect_robots_txt`	`bool`	`false`	Fetch and honor robots.txt directives (allow/disallow, crawl-delay, sitemap).

Content processing¶

Field	Type	Default	Description
`content.preprocessing_preset`	`String`	`"standard"`	HTML preprocessing strength: `"minimal"`, `"standard"`, or `"aggressive"` (aggressive strips chrome/boilerplate).
`remove_tags`	`Vec<String>`	`[]`	CSS selectors for elements to remove before processing (e.g., `"nav"`, `".sidebar"`).
`max_body_size`	`Option<usize>`	`None`	Truncate HTML bodies beyond this size in bytes. `None` keeps the full body.

URL discovery (map)¶

Field	Type	Default	Description
`map_limit`	`Option<usize>`	`None`	Maximum URLs returned by the map operation.
`map_search`	`Option<String>`	`None`	Case-insensitive substring filter for map results.

Asset downloading¶

Field	Type	Default	Description
`download_assets`	`bool`	`false`	Download page assets (CSS, JS, images, etc.).
`asset_types`	`Vec<AssetCategory>`	`[]`	Filter for which asset categories to download. Empty means all.
`max_asset_size`	`Option<usize>`	`None`	Maximum size per asset download in bytes.

AssetCategory options: Document, Image, Audio, Video, Font, Stylesheet, Script, Archive, Data, Other.

Document downloading¶

Field	Type	Default	Description
`download_documents`	`bool`	`true`	Download non-HTML resources (PDF, DOCX, images, code files) instead of skipping.
`document_max_size`	`Option<usize>`	50 MB	Maximum document download size in bytes.
`document_mime_types`	`Vec<String>`	`[]`	MIME type allowlist. Empty uses built-in defaults.

Browser configuration¶

Field	Type	Default	Description
`browser`	`BrowserConfig`	See below	Headless browser fallback settings.
`capture_screenshot`	`bool`	`false`	Capture PNG screenshot when browser is used.
`browser_profile`	`Option<String>`	`None`	Named browser profile for persistent sessions.
`save_browser_profile`	`bool`	`false`	Save browser profile changes on exit.

BrowserConfig fields¶

Field	Type	Default	Description
`mode`	`BrowserMode`	`Auto`	When to use the browser: `Auto`, `Always`, or `Never`.
`endpoint`	`Option<String>`	`None`	CDP WebSocket endpoint for an external browser instance.
`timeout`	`Duration`	30 seconds	Browser page load timeout.
`wait`	`BrowserWait`	`NetworkIdle`	Wait strategy after navigation: `NetworkIdle`, `Selector`, or `Fixed`.
`wait_selector`	`Option<String>`	`None`	CSS selector to wait for (required when `wait` is `Selector`).
`extra_wait`	`Option<Duration>`	`None`	Additional wait time after the wait condition is met.

WARC output¶

Field	Type	Default	Description
`warc_output`	`Option<PathBuf>`	`None`	Path to write WARC output. `None` disables WARC output.

Serialization¶

CrawlConfig implements Serialize and Deserialize with #[serde(deny_unknown_fields)]. Duration fields are serialized as milliseconds:

{
  "max_depth": 3,
  "max_pages": 100,
  "max_concurrent": 5,
  "stay_on_domain": true,
  "respect_robots_txt": true,
  "request_timeout": 30000,
  "max_redirects": 10,
  "retry_count": 2,
  "retry_codes": [429, 503],
  "include_paths": ["^/docs/"],
  "exclude_paths": ["/admin/"],
  "content": {
    "preprocessing_preset": "standard"
  },
  "cookies_enabled": false,
  "download_assets": false,
  "download_documents": true,
  "document_max_size": 52428800,
  "capture_screenshot": false,
  "save_browser_profile": false,
  "browser": {
    "mode": "auto",
    "timeout": 30000,
    "wait": "network_idle"
  }
}

Loading config from a file

Since CrawlConfig implements Deserialize, you can load it from JSON, TOML, or YAML using the appropriate serde crate.

Default values summary¶

Field	Default
`max_depth`	`None` (0 -- seed only)
`max_pages`	`None` (unlimited)
`max_concurrent`	`None` (10)
`respect_robots_txt`	`false`
`user_agent`	`None`
`stay_on_domain`	`false`
`allow_subdomains`	`false`
`request_timeout`	30 seconds
`max_redirects`	`10`
`retry_count`	`0`
`cookies_enabled`	`false`
`content.preprocessing_preset`	`"standard"`
`download_assets`	`false`
`download_documents`	`true`
`document_max_size`	50 MB
`capture_screenshot`	`false`
`save_browser_profile`	`false`
`browser.mode`	`Auto`
`browser.timeout`	30 seconds
`browser.wait`	`NetworkIdle`

Edit this page on GitHub