Browser Automation

Kreuzcrawl includes a headless Chrome/Chromium integration for rendering JavaScript-heavy pages. The browser subsystem is gated behind the `browser` feature and talks to Chrome over the Chrome DevTools Protocol (CDP) via the `chromiumoxide` crate.

Browser modes

The BrowserMode enum controls when the headless browser is used instead of a plain HTTP fetch.

Mode	Behaviour
`Auto` (default)	Kreuzcrawl first tries a plain HTTP fetch. If the response looks like it needs JS rendering (e.g. a WAF challenge page), it falls back to the browser automatically.
`Always`	Every request goes through the headless browser. Useful for single-page applications or sites that rely entirely on client-side rendering.
`Never`	The browser is never launched. Only plain HTTP fetches are performed.

Set the mode in CrawlConfig:

use kreuzcrawl::{BrowserConfig, BrowserMode, CrawlConfig};

// Route every request through the headless browser.
let config = CrawlConfig {
    browser: BrowserConfig {
        mode: BrowserMode::Always,
        ..Default::default()
    },
    ..Default::default()
};
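
The decision the three modes describe can be sketched as a small pure function. This is an illustration only, not Kreuzcrawl's implementation: the `BrowserMode` enum below is a local stand-in that mirrors the variant names in the table, while `FetchPath`, `choose_path`, and the `needs_js` flag are hypothetical names standing in for whatever heuristic the crate applies to an HTTP response.

```rust
/// Local stand-in for the crate's `BrowserMode`; variant names follow the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum BrowserMode {
    #[default]
    Auto,
    Always,
    Never,
}

/// Hypothetical type: the path that ultimately produces the page content.
#[derive(Debug, PartialEq, Eq)]
enum FetchPath {
    Http,
    Browser,
}

/// Decide which path serves a request.
///
/// `needs_js` is an assumed flag meaning "the plain HTTP response looked like
/// it requires JavaScript rendering" (for example, a WAF challenge page).
/// It is only consulted in `Auto` mode.
fn choose_path(mode: BrowserMode, needs_js: bool) -> FetchPath {
    match mode {
        BrowserMode::Always => FetchPath::Browser,
        BrowserMode::Never => FetchPath::Http,
        // Auto: the HTTP fetch runs first; the browser is a fallback.
        BrowserMode::Auto if needs_js => FetchPath::Browser,
        BrowserMode::Auto => FetchPath::Http,
    }
}
```

Under these assumptions, `choose_path(BrowserMode::default(), false)` stays on the HTTP path, and only a response flagged as needing rendering moves an `Auto` request to the browser.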

Browser pooling

A single Chrome instance is kept alive across requests; tabs are handed out lazily, the pool auto-recovers if Chrome crashes, and concurrent tabs are bounded by CrawlConfig::max_concurrent. No additional configuration is required.
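
To make "bounded by `CrawlConfig::max_concurrent`" concrete, here is a minimal sketch of a counting limiter built from the standard library. It is not the pool Kreuzcrawl ships, and an async crate would more plausibly use an async semaphore. `TabLimiter`, `TabPermit`, `acquire`, and `active` are hypothetical names chosen for this illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical limiter: at most `max` permits (tabs) exist at once.
struct TabLimiter {
    in_use: Mutex<usize>,
    freed: Condvar,
    max: usize,
}

/// Hypothetical RAII permit; dropping it releases the slot.
struct TabPermit {
    limiter: Arc<TabLimiter>,
}

impl TabLimiter {
    /// `max` must be at least 1, otherwise `acquire` would block forever.
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self {
            in_use: Mutex::new(0),
            freed: Condvar::new(),
            max,
        })
    }

    /// Block until a slot is free, then hand out a permit.
    fn acquire(self: &Arc<Self>) -> TabPermit {
        let mut in_use = self.in_use.lock().unwrap();
        while *in_use >= self.max {
            // Releases the lock while waiting and re-acquires it on wake-up.
            in_use = self.freed.wait(in_use).unwrap();
        }
        *in_use += 1;
        TabPermit {
            limiter: Arc::clone(self),
        }
    }

    /// Number of permits currently held.
    fn active(&self) -> usize {
        *self.in_use.lock().unwrap()
    }
}

impl Drop for TabPermit {
    fn drop(&mut self) {
        // Free the slot, then wake one waiter (if any).
        *self.limiter.in_use.lock().unwrap() -= 1;
        self.limiter.freed.notify_one();
    }
}
```

Tying the release to `Drop` means a slot is returned even when the code holding the permit exits early with an error.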

Connecting to an external browser

Point the engine at an already-running Chrome via its CDP WebSocket endpoint instead of launching one locally:

use kreuzcrawl::{BrowserConfig, CrawlConfig};

let config = CrawlConfig {
    browser: BrowserConfig {
        endpoint: Some("ws://127.0.0.1:9222/devtools/browser/...".into()),
        ..Default::default()
    },
    ..Default::default()
};

This is the recommended pattern when running Chrome in a sidecar container or a remote debugging session.

Browser profiles

Persistent browser profiles retain cookies, localStorage, and other browser state across crawl sessions. Configure them through CrawlConfig::browser_profile (named profile to attach) and CrawlConfig::save_browser_profile (persist changes on exit):

use kreuzcrawl::CrawlConfig;

let config = CrawlConfig {
    browser_profile: Some("my-session".into()),
    save_browser_profile: true,
    ..Default::default()
};

Profile names are validated to prevent path traversal: only ASCII alphanumerics, hyphens, underscores, and dots are allowed (max 255 characters). Profiles are stored under <data_dir>/kreuzcrawl/profiles/<name> and, on Unix, are created with mode 0o700.
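
The naming rule above is easy to check before handing a name to the crawler. The following is a sketch of such a check, not the crate's own validator; `is_valid_profile_name` and `MAX_PROFILE_NAME_LEN` are hypothetical names. One assumption goes beyond the stated rule: because `.` and `..` consist only of allowed characters, the sketch rejects them (and the empty string) explicitly.

```rust
/// Maximum profile-name length, as stated in the text above.
const MAX_PROFILE_NAME_LEN: usize = 255;

/// Hypothetical helper: returns true if `name` satisfies the documented rule.
fn is_valid_profile_name(name: &str) -> bool {
    // Assumption (not stated in the docs): reject empty names and the
    // special directory entries "." and "..".
    if name.is_empty() || name == "." || name == ".." {
        return false;
    }
    // Byte length equals character count once every character is ASCII.
    name.len() <= MAX_PROFILE_NAME_LEN
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.'))
}
```

With this check, `my-session` and `team_a.v2` pass, while `../secrets` and `a/b` fail because `/` is not in the allowed set.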

WAF detection

Kreuzcrawl detects when a response is blocked by a Web Application Firewall and returns a CrawlError::WafBlocked error with the identified vendor. Detection runs on both HTTP responses and browser-rendered pages.

Detected vendors:

Vendor	Detection signal
Cloudflare	`Server: cloudflare`, `cf-browser-verification`, `cf-chl-` body markers
Akamai	`Server: AkamaiGHost`
Imperva (Incapsula)	`incapsula`, `_incap_ses_` body markers
DataDome	`datadome` body marker, `x-datadome` header
PerimeterX	`perimeterx`, `px-captcha` body markers, `x-px-*` headers
Sucuri	`sucuri` body marker, `x-sucuri-id` header
F5 BIG-IP	`Server: big-ip`
AWS WAF	`awselb`, `x-amzn-waf` body markers, `x-amzn-waf-action` header

In Auto browser mode, a WAF challenge triggers an automatic browser fallback so that JavaScript challenges can be solved client-side.
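
As a rough illustration of how the signals in the table could be matched, the sketch below checks response headers and body text against them. It is not Kreuzcrawl's detector: `WafVendor`, `detect_waf`, and the helper functions are hypothetical, and three details are assumptions of this sketch rather than documented behaviour: matching is case-insensitive, header names are matched by prefix, and the first vendor to match (in table order) wins.

```rust
/// Hypothetical vendor enum mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WafVendor {
    Cloudflare,
    Akamai,
    Imperva,
    DataDome,
    PerimeterX,
    Sucuri,
    F5BigIp,
    AwsWaf,
}

/// True if the lowercased body contains any of the markers.
fn body_has(body: &str, markers: &[&str]) -> bool {
    markers.iter().any(|m| body.contains(*m))
}

/// True if a `Server` header value contains `needle`.
fn server_is(headers: &[(String, String)], needle: &str) -> bool {
    headers
        .iter()
        .any(|(k, v)| k.as_str() == "server" && v.contains(needle))
}

/// True if any header name starts with `prefix`.
fn header_starts_with(headers: &[(String, String)], prefix: &str) -> bool {
    headers.iter().any(|(k, _)| k.starts_with(prefix))
}

/// Return the first vendor whose signals appear in the response.
fn detect_waf(headers: &[(&str, &str)], body: &str) -> Option<WafVendor> {
    // Assumption: compare everything in lowercase.
    let body = body.to_ascii_lowercase();
    let headers: Vec<(String, String)> = headers
        .iter()
        .map(|(k, v)| (k.to_ascii_lowercase(), v.to_ascii_lowercase()))
        .collect();

    if server_is(&headers, "cloudflare")
        || body_has(&body, &["cf-browser-verification", "cf-chl-"])
    {
        Some(WafVendor::Cloudflare)
    } else if server_is(&headers, "akamaighost") {
        Some(WafVendor::Akamai)
    } else if body_has(&body, &["incapsula", "_incap_ses_"]) {
        Some(WafVendor::Imperva)
    } else if body_has(&body, &["datadome"]) || header_starts_with(&headers, "x-datadome") {
        Some(WafVendor::DataDome)
    } else if body_has(&body, &["perimeterx", "px-captcha"])
        || header_starts_with(&headers, "x-px-")
    {
        Some(WafVendor::PerimeterX)
    } else if body_has(&body, &["sucuri"]) || header_starts_with(&headers, "x-sucuri-id") {
        Some(WafVendor::Sucuri)
    } else if server_is(&headers, "big-ip") {
        Some(WafVendor::F5BigIp)
    } else if body_has(&body, &["awselb", "x-amzn-waf"])
        || header_starts_with(&headers, "x-amzn-waf-action")
    {
        Some(WafVendor::AwsWaf)
    } else {
        None
    }
}
```

Plain substring matching like this can produce false positives (an article that merely mentions "sucuri", for instance), so a production detector would likely combine signals with the response status code.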

Wait strategies

After the browser navigates to a URL, it needs to wait for the page to finish rendering. The BrowserWait enum controls this behaviour.

Strategy	Behaviour	Default wait
`NetworkIdle` (default)	Waits for a 500 ms settle period after initial page load, giving client-side JS time to execute.	500 ms
`Selector`	Waits until a specific CSS selector appears in the DOM. Falls back to 500 ms if no `wait_selector` is configured.	Varies
`Fixed`	Waits a fixed 2-second duration after navigation completes.	2 s

Configure in BrowserConfig:

use kreuzcrawl::{BrowserConfig, BrowserWait};

let browser = BrowserConfig {
    wait: BrowserWait::Selector,
    wait_selector: Some("#main-content".into()),
    extra_wait: Some(std::time::Duration::from_millis(200)),
    timeout: std::time::Duration::from_secs(30),
    ..Default::default()
};

The extra_wait field adds a further sleep after the wait condition is met. The timeout field is the hard cap on the entire navigation-plus-wait cycle; if it is exceeded, CrawlError::BrowserTimeout is returned.
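
The interaction between a wait condition, `extra_wait`, and `timeout` can be sketched as a polling loop with a deadline. This is a standalone illustration, not the crate's code: `wait_with_cap`, `WaitOutcome`, and the 10 ms poll interval are hypothetical, and the sketch assumes that `extra_wait` counts toward the same hard cap, which the text implies but does not spell out.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical result type; `TimedOut` plays the role of a timeout error.
#[derive(Debug, PartialEq, Eq)]
enum WaitOutcome {
    Ready,
    TimedOut,
}

/// Poll `is_ready` until it returns true, then sleep for `extra_wait`,
/// keeping the whole cycle within `timeout`.
fn wait_with_cap<F: FnMut() -> bool>(
    mut is_ready: F,
    extra_wait: Option<Duration>,
    timeout: Duration,
) -> WaitOutcome {
    let deadline = Instant::now() + timeout;
    // Assumption: a fixed poll interval chosen for this sketch.
    let poll_interval = Duration::from_millis(10);

    while !is_ready() {
        if Instant::now() >= deadline {
            return WaitOutcome::TimedOut;
        }
        thread::sleep(poll_interval);
    }

    if let Some(extra) = extra_wait {
        // Assumption: the extra sleep must also fit inside the cap.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if extra > remaining {
            return WaitOutcome::TimedOut;
        }
        thread::sleep(extra);
    }
    WaitOutcome::Ready
}
```

In a `Selector`-style wait, `is_ready` would be a check that the configured selector matches an element; here it is any closure, which keeps the sketch independent of a running browser.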

Choosing a strategy

Use NetworkIdle for most sites. Switch to Selector when you know the exact element that signals the page is ready (e.g. a data table or main content div). Use Fixed only as a last resort for unpredictable sites.
