Configuration

All kreuzcrawl operations are controlled through CrawlConfig. This guide covers every field, its default value, validation rules, and the builder pattern for constructing engines.

Builder pattern

The CrawlEngineBuilder provides a fluent API for constructing a CrawlEngine with custom configuration and trait implementations:

use kreuzcrawl::{CrawlEngine, CrawlConfig, BestFirstStrategy, Bm25Filter};

let engine = CrawlEngine::builder()
    .config(CrawlConfig {
        max_depth: Some(3),
        max_pages: Some(100),
        max_concurrent: Some(5),
        stay_on_domain: true,
        respect_robots_txt: true,
        ..Default::default()
    })
    .strategy(BestFirstStrategy)
    .content_filter(Bm25Filter::new("rust programming", 0.3))
    .build()?;

Builder methods

| Method | Controls | Default |
| --- | --- | --- |
| .config(config) | Crawl configuration (CrawlConfig) | CrawlConfig::default() |
| .frontier(impl Frontier) | URL queue and deduplication | InMemoryFrontier |
| .rate_limiter(impl RateLimiter) | Per-domain throttling | PerDomainThrottle (200ms) |
| .store(impl CrawlStore) | Result persistence | NoopStore |
| .event_emitter(impl EventEmitter) | Lifecycle event handler | NoopEmitter |
| .strategy(impl CrawlStrategy) | URL selection and scoring | BfsStrategy |
| .content_filter(impl ContentFilter) | Post-extraction page filter | NoopFilter |
| .cache(impl CrawlCache) | HTTP response cache | NoopCache |

Every builder method is optional. A component that is not set falls back to its default implementation from the defaults module.

The .build() call validates the configuration and returns Result<CrawlEngine, CrawlError>.

Convenience constructors

For simple use cases that do not need custom trait implementations, use the binding functions:

use kreuzcrawl::{create_engine, crawl, map_urls, scrape, CrawlConfig};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(2),
    ..Default::default()
}))?;

let scrape_result = scrape(&engine, "https://example.com").await?;
let crawl_result = crawl(&engine, "https://example.com").await?;
let map_result = map_urls(&engine, "https://example.com").await?;

Passing None to create_engine uses CrawlConfig::default().

Config validation

CrawlConfig::validate() is called automatically during CrawlEngineBuilder::build(). It checks:

| Rule | Error |
| --- | --- |
| max_concurrent must not be 0 | "max_concurrent must be > 0" |
| max_pages must not be 0 | "max_pages must be > 0" |
| max_redirects must be <= 100 | "max_redirects must be <= 100" |
| request_timeout must not be zero | "request_timeout must be > 0" |
| browser.wait_selector required when browser.wait is Selector | "browser.wait_selector required when browser.wait is Selector" |
| All include_paths must be valid regex | "invalid include_path regex '...': ..." |
| All exclude_paths must be valid regex | "invalid exclude_path regex '...': ..." |
| All retry_codes must be 100-599 | "invalid retry code: ..." |

You can also call validate() manually:

use kreuzcrawl::CrawlConfig;

// A max_concurrent of 0 violates the "must be > 0" rule.
let config = CrawlConfig {
    max_concurrent: Some(0),
    ..Default::default()
};

match config.validate() {
    Ok(()) => println!("Valid"),
    Err(e) => eprintln!("Invalid: {}", e),
}
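
The numeric rules in the table can be sketched without the crate. The block below is a minimal illustration, not the library's validate() implementation: NumericLimits and check_limits are hypothetical names, the order of the checks is an assumption, and so is the text after the colon in the retry-code message.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the numeric fields of `CrawlConfig`.
/// This is not the library's own type.
#[derive(Debug, Clone)]
struct NumericLimits {
    max_concurrent: Option<usize>,
    max_pages: Option<usize>,
    max_redirects: usize,
    request_timeout: Duration,
    retry_codes: Vec<u16>,
}

impl Default for NumericLimits {
    fn default() -> Self {
        // Defaults copied from the field reference in this guide.
        Self {
            max_concurrent: None,
            max_pages: None,
            max_redirects: 10,
            request_timeout: Duration::from_secs(30),
            retry_codes: Vec::new(),
        }
    }
}

/// Returns the first violated rule (check order is an assumption).
fn check_limits(cfg: &NumericLimits) -> Result<(), String> {
    if cfg.max_concurrent == Some(0) {
        return Err("max_concurrent must be > 0".to_string());
    }
    if cfg.max_pages == Some(0) {
        return Err("max_pages must be > 0".to_string());
    }
    if cfg.max_redirects > 100 {
        return Err("max_redirects must be <= 100".to_string());
    }
    if cfg.request_timeout.is_zero() {
        return Err("request_timeout must be > 0".to_string());
    }
    // Every retry code must be a plausible HTTP status (100-599).
    if let Some(code) = cfg.retry_codes.iter().find(|c| !(100..=599).contains(*c)) {
        // The exact wording after the colon is assumed.
        return Err(format!("invalid retry code: {}", code));
    }
    Ok(())
}
```

Note that None passes both Option checks: only an explicit Some(0) is rejected.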

Full field reference

Crawl scope

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_depth | Option<usize> | None | Maximum crawl depth (link hops from seed). None means 0 (seed only). |
| max_pages | Option<usize> | None | Maximum pages to crawl. None means unlimited. Must be > 0 if set. |
| stay_on_domain | bool | false | Restrict crawling to the seed URL's domain. |
| allow_subdomains | bool | false | Allow subdomains when stay_on_domain is true. |
| include_paths | Vec<String> | [] | Regex patterns; only matching URL paths are crawled. The seed URL (depth 0) is always included. |
| exclude_paths | Vec<String> | [] | Regex patterns; matching URL paths are skipped. |
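
One way to read how stay_on_domain and allow_subdomains combine is sketched below. host_in_scope is a hypothetical helper, and the matching rule (exact host, or a suffix after a dot boundary) is an assumption; the crate may compare registrable domains instead.

```rust
/// Decides whether `candidate_host` is in scope for a crawl seeded at
/// `seed_host`. Hypothetical helper; the matching rule is an assumption.
fn host_in_scope(
    seed_host: &str,
    candidate_host: &str,
    stay_on_domain: bool,
    allow_subdomains: bool,
) -> bool {
    // Without stay_on_domain, every host is in scope.
    if !stay_on_domain {
        return true;
    }
    // Host names are case-insensitive.
    let seed = seed_host.to_ascii_lowercase();
    let candidate = candidate_host.to_ascii_lowercase();
    if candidate == seed {
        return true;
    }
    // Require a dot boundary so "notexample.com" is not treated as a
    // subdomain of "example.com".
    allow_subdomains && candidate.ends_with(&format!(".{}", seed))
}
```

Under this reading, allow_subdomains has no effect unless stay_on_domain is true, which matches the field description.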

HTTP client

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_concurrent | Option<usize> | None (uses 10) | Maximum concurrent requests. |
| request_timeout | Duration | 30 seconds | Timeout for individual HTTP requests. Serialized as milliseconds in JSON. |
| max_redirects | usize | 10 | Maximum redirects to follow. Must be <= 100. |
| retry_count | usize | 0 | Number of retry attempts for failed requests. |
| retry_codes | Vec<u16> | [] | HTTP status codes that trigger a retry. Each must be 100-599. |
| max_body_size | Option<usize> | None | Maximum response body size in bytes. Larger responses are truncated. |
| user_agent | Option<String> | None | Custom User-Agent string. |
| user_agents | Vec<String> | [] | User-Agent strings for rotation. When non-empty, this list overrides user_agent. |
| custom_headers | HashMap<String, String> | {} | Extra HTTP headers sent with every request. |
| cookies_enabled | bool | false | Whether to collect and track cookies across requests. |
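
The interplay of retry_count with retry_codes, and of user_agent with user_agents, might look like the sketch below. Both functions are hypothetical helpers. Round-robin rotation is an assumption (the crate could pick agents randomly), and the retry check covers only status-code retries, not network failures.

```rust
/// Returns true when a response with `status` should be retried.
/// `attempts_made` counts the retries already performed for this request.
fn should_retry(
    status: u16,
    attempts_made: usize,
    retry_count: usize,
    retry_codes: &[u16],
) -> bool {
    attempts_made < retry_count && retry_codes.contains(&status)
}

/// Picks the User-Agent for the n-th request. A non-empty `user_agents`
/// list takes precedence over the single `user_agent` value.
fn pick_user_agent<'a>(
    request_index: usize,
    user_agent: Option<&'a str>,
    user_agents: &'a [String],
) -> Option<&'a str> {
    if user_agents.is_empty() {
        user_agent
    } else {
        // Round-robin over the pool (assumed rotation policy).
        Some(user_agents[request_index % user_agents.len()].as_str())
    }
}
```

With the defaults (retry_count of 0 and an empty retry_codes list), should_retry never returns true, so retries are opt-in.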

Authentication

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| auth | Option<AuthConfig> | None | Authentication configuration. |

AuthConfig variants:

// HTTP Basic authentication
AuthConfig::Basic { username: "user".into(), password: "pass".into() }

// Bearer token
AuthConfig::Bearer { token: "your-token".into() }

// Custom header
AuthConfig::Header { name: "X-API-Key".into(), value: "key-value".into() }
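
For orientation, the sketch below shows the request header each variant conventionally corresponds to: HTTP Basic sends the base64 of "username:password", and Bearer sends the token verbatim. Auth, auth_header, and base64_encode are hypothetical stand-ins written with the standard library only; the crate's own header construction is not shown in this guide and may differ.

```rust
/// Hypothetical mirror of the three `AuthConfig` variants.
enum Auth {
    Basic { username: String, password: String },
    Bearer { token: String },
    Header { name: String, value: String },
}

const B64: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the rest.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[((n >> 18) & 63) as usize] as char);
        out.push(B64[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { B64[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Returns the (header name, header value) pair for an auth setting.
fn auth_header(auth: &Auth) -> (String, String) {
    match auth {
        Auth::Basic { username, password } => {
            let credentials = format!("{}:{}", username, password);
            (
                "Authorization".to_string(),
                format!("Basic {}", base64_encode(credentials.as_bytes())),
            )
        }
        Auth::Bearer { token } => {
            ("Authorization".to_string(), format!("Bearer {}", token))
        }
        // A custom header is passed through unchanged.
        Auth::Header { name, value } => (name.clone(), value.clone()),
    }
}
```

Basic credentials are only encoded, not encrypted, so they should be sent over HTTPS.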

Proxy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| proxy | Option<ProxyConfig> | None | Proxy configuration. |

Example ProxyConfig value:

ProxyConfig {
    url: "http://proxy:8080".to_string(), // or "socks5://proxy:1080"
    username: Some("user".to_string()),
    password: Some("pass".to_string()),
}

Robots and compliance

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| respect_robots_txt | bool | false | Fetch and honor robots.txt directives (allow/disallow, crawl-delay, sitemap). |

Content processing

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| main_content_only | bool | false | Extract only the primary content, stripping navigation and boilerplate. |
| remove_tags | Vec<String> | [] | CSS selectors for elements to remove before processing (e.g., "nav", ".sidebar"). |

URL discovery (map)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| map_limit | Option<usize> | None | Maximum URLs returned by the map operation. |
| map_search | Option<String> | None | Case-insensitive substring filter for map results. |
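
A minimal sketch of how the two map fields could combine is shown below. filter_map_results is a hypothetical helper, and applying the search filter before the limit is an assumption about ordering.

```rust
/// Applies a case-insensitive substring filter, then caps the result count.
/// Hypothetical helper; filter-before-limit ordering is assumed.
fn filter_map_results(
    urls: &[String],
    map_search: Option<&str>,
    map_limit: Option<usize>,
) -> Vec<String> {
    // Lowercase the needle once instead of once per URL.
    let needle = map_search.map(|s| s.to_lowercase());
    urls.iter()
        .filter(|url| match &needle {
            Some(n) => url.to_lowercase().contains(n.as_str()),
            None => true,
        })
        // None means no cap.
        .take(map_limit.unwrap_or(usize::MAX))
        .cloned()
        .collect()
}
```

With this ordering, map_limit caps the number of matching URLs rather than the number of URLs inspected.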

Asset downloading

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| download_assets | bool | false | Download page assets (CSS, JS, images, etc.). |
| asset_types | Vec<AssetCategory> | [] | Filter for which asset categories to download. Empty means all. |
| max_asset_size | Option<usize> | None | Maximum size per asset download in bytes. |

AssetCategory options: Document, Image, Audio, Video, Font, Stylesheet, Script, Archive, Data, Other.

Document downloading

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| download_documents | bool | true | Download non-HTML resources (PDF, DOCX, images, code files) instead of skipping them. |
| document_max_size | Option<usize> | 50 MB (52428800 bytes) | Maximum document download size in bytes. |
| document_mime_types | Vec<String> | [] | MIME type allowlist. Empty uses built-in defaults. |

Browser configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| browser | BrowserConfig | See below | Headless browser fallback settings. |
| capture_screenshot | bool | false | Capture a PNG screenshot when the browser is used. |
| browser_profile | Option<String> | None | Named browser profile for persistent sessions. |
| save_browser_profile | bool | false | Save browser profile changes on exit. |

BrowserConfig fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | BrowserMode | Auto | When to use the browser: Auto, Always, or Never. |
| endpoint | Option<String> | None | CDP WebSocket endpoint for an external browser instance. |
| timeout | Duration | 30 seconds | Browser page load timeout. |
| wait | BrowserWait | NetworkIdle | Wait strategy after navigation: NetworkIdle, Selector, or Fixed. |
| wait_selector | Option<String> | None | CSS selector to wait for (required when wait is Selector). |
| extra_wait | Option<Duration> | None | Additional wait time after the wait condition is met. |

WARC output

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| warc_output | Option<PathBuf> | None | Path to write WARC output. None disables WARC output. |

Serialization

CrawlConfig implements Serialize and Deserialize with #[serde(deny_unknown_fields)], so deserialization rejects any key that is not a CrawlConfig field. Duration fields are serialized as milliseconds:

{
  "max_depth": 3,
  "max_pages": 100,
  "max_concurrent": 5,
  "stay_on_domain": true,
  "respect_robots_txt": true,
  "request_timeout": 30000,
  "max_redirects": 10,
  "retry_count": 2,
  "retry_codes": [429, 503],
  "include_paths": ["^/docs/"],
  "exclude_paths": ["/admin/"],
  "main_content_only": false,
  "cookies_enabled": false,
  "download_assets": false,
  "download_documents": true,
  "document_max_size": 52428800,
  "capture_screenshot": false,
  "save_browser_profile": false,
  "browser": {
    "mode": "auto",
    "timeout": 30000,
    "wait": "network_idle"
  }
}
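
The numeric values in the JSON above map back to Rust values as sketched below. The helper names are hypothetical, and truncating sub-millisecond precision on output is an assumption about the crate's serializer.

```rust
use std::time::Duration;

/// 50 MB as used for document_max_size: 50 * 1024 * 1024 bytes.
const DOCUMENT_MAX_SIZE_DEFAULT: usize = 50 * 1024 * 1024;

/// Converts a Duration to whole milliseconds, dropping any
/// sub-millisecond remainder (assumed behavior).
fn duration_to_millis(d: Duration) -> u64 {
    u64::try_from(d.as_millis()).unwrap_or(u64::MAX)
}

/// Rebuilds a Duration from a millisecond count read from JSON.
fn duration_from_millis(ms: u64) -> Duration {
    Duration::from_millis(ms)
}
```

So "request_timeout": 30000 is the 30 second default, and "document_max_size": 52428800 is the 50 MB default expressed in bytes.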

Loading config from a file

Because CrawlConfig implements Deserialize, you can load it from JSON, TOML, or YAML with the matching serde format crate (for example serde_json or toml).

Default values summary

| Field | Default |
| --- | --- |
| max_depth | None (0, seed only) |
| max_pages | None (unlimited) |
| max_concurrent | None (10) |
| respect_robots_txt | false |
| user_agent | None |
| stay_on_domain | false |
| allow_subdomains | false |
| request_timeout | 30 seconds |
| max_redirects | 10 |
| retry_count | 0 |
| cookies_enabled | false |
| main_content_only | false |
| download_assets | false |
| download_documents | true |
| document_max_size | 50 MB |
| capture_screenshot | false |
| save_browser_profile | false |
| browser.mode | Auto |
| browser.timeout | 30 seconds |
| browser.wait | NetworkIdle |

Default trait implementations

When not overridden via the builder, the engine uses these defaults:

| Trait | Default implementation | Behavior |
| --- | --- | --- |
| Frontier | InMemoryFrontier | In-memory HashSet for URL deduplication. |
| RateLimiter | PerDomainThrottle | 200ms minimum interval between requests to the same domain. Respects robots.txt crawl-delay. |
| CrawlStore | NoopStore | Discards all storage calls. |
| EventEmitter | NoopEmitter | Discards all events. |
| CrawlStrategy | BfsStrategy | Breadth-first URL selection. |
| ContentFilter | NoopFilter | Passes all pages through. |
| CrawlCache | NoopCache | No caching. |
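
To make the per-domain throttling behavior concrete, here is a minimal sketch of a minimum-interval limiter. DomainThrottle and reserve are hypothetical names and do not implement the crate's RateLimiter trait, whose signature is not shown in this guide. Taking the larger of the crawl-delay and the minimum interval is also an assumption.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain limiter: one request slot per interval per domain.
struct DomainThrottle {
    min_interval: Duration,
    /// Earliest instant at which the next request to each domain may start.
    next_allowed: HashMap<String, Instant>,
}

impl DomainThrottle {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Reserves the next slot for `domain` and returns how long the caller
    /// should wait before sending. `crawl_delay` is the robots.txt value,
    /// if any; the larger of it and `min_interval` applies (assumed).
    fn reserve(
        &mut self,
        domain: &str,
        now: Instant,
        crawl_delay: Option<Duration>,
    ) -> Duration {
        let interval = crawl_delay.map_or(self.min_interval, |d| d.max(self.min_interval));
        // Use the reserved slot if it is still in the future, else start now.
        let slot = match self.next_allowed.get(domain) {
            Some(&at) if at > now => at,
            _ => now,
        };
        self.next_allowed.insert(domain.to_string(), slot + interval);
        slot - now
    }
}
```

Passing now as a parameter keeps the limiter deterministic in tests. A caller would sleep for the returned duration before issuing the request, and requests to different domains never delay each other.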