Browser Automation
Kreuzcrawl includes a headless Chrome/Chromium integration for rendering JavaScript-heavy pages.
The browser subsystem is feature-gated behind the browser feature and uses the Chrome DevTools Protocol (CDP)
via the chromiumoxide crate.
Browser modes
The BrowserMode enum controls when the headless browser is used instead of a plain HTTP fetch.
| Mode | Behaviour |
|---|---|
| Auto (default) | Kreuzcrawl first tries an HTTP fetch. If the response looks like it needs JS rendering (e.g. WAF challenge page), it automatically falls back to the browser. |
| Always | Every request goes through the headless browser. Useful for single-page applications or sites that rely entirely on client-side rendering. |
| Never | The browser is never launched. Only plain HTTP fetches are performed. |
Set the mode in CrawlConfig:
use kreuzcrawl::{BrowserConfig, BrowserMode, CrawlConfig};
let config = CrawlConfig {
    browser: BrowserConfig {
        mode: BrowserMode::Always,
        ..Default::default()
    },
    ..Default::default()
};
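The decision rule in the mode table can be written as two small pure functions. This is only a sketch: the local `Mode` enum and the `http_needs_js` flag are hypothetical stand-ins, not kreuzcrawl's own types or its actual detection heuristics.

```rust
/// Hypothetical stand-in for the three modes described above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Auto,
    Always,
    Never,
}

/// True when the request should skip the plain HTTP fetch entirely.
fn start_with_browser(mode: Mode) -> bool {
    mode == Mode::Always
}

/// True when an HTTP response that appears to need JS rendering
/// should be retried through the browser. Only `Auto` ever falls back.
fn fall_back_to_browser(mode: Mode, http_needs_js: bool) -> bool {
    mode == Mode::Auto && http_needs_js
}
```

Under this reading, `Never` is the only mode in which a challenge page is returned to the caller without any browser attempt.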
BrowserPool management
When crawling many pages, launching a fresh Chrome process per request is expensive.
BrowserPool keeps a single Chrome instance alive and hands out individual pages (tabs),
limiting concurrency with a semaphore.
use std::sync::Arc;
use kreuzcrawl::BrowserPoolConfig;
use kreuzcrawl::BrowserPool;
let pool = BrowserPool::new(BrowserPoolConfig {
    max_pages: 8,           // up to 8 concurrent tabs
    browser_endpoint: None, // launch a local Chrome
    chrome_args: Vec::new(),
    launch_timeout: std::time::Duration::from_secs(30),
});
// Optionally warm the pool so the first request doesn't pay startup cost.
pool.warm().await?;
// Acquire a page, use it, then close.
let page = pool.acquire_page().await?;
// ... use page.page() for CDP operations ...
page.close().await;
Key behaviours:
- Lazy start -- Chrome is not launched until the first acquire_page() or warm() call.
- Auto-recovery -- If Chrome crashes, the next acquire_page() call relaunches it automatically.
- Graceful shutdown -- pool.shutdown().await closes the browser and rejects further requests.
- Health probe -- pool.is_healthy() is a lock-free atomic read, safe for liveness checks.
Each PooledPage holds a semaphore permit. When the page is closed (or dropped), the permit
is released so another caller can open a tab.
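The permit-per-page idea can be illustrated with a guard type that returns its permit in `Drop`. The sketch below uses a blocking semaphore built from `std` primitives rather than the async semaphore a real pool would use, and `Semaphore` and `PageSlot` are hypothetical names, not kreuzcrawl types.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Counting semaphore built from std primitives (illustration only).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Arc<Self> {
        Arc::new(Self {
            permits: Mutex::new(permits),
            available: Condvar::new(),
        })
    }

    /// Number of permits currently free.
    fn free(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}

/// Hypothetical page handle: owning one means owning one permit.
struct PageSlot {
    sem: Arc<Semaphore>,
}

impl PageSlot {
    /// Blocks until a permit is free, then takes it.
    fn acquire(sem: &Arc<Semaphore>) -> Self {
        let mut free = sem.permits.lock().unwrap();
        while *free == 0 {
            free = sem.available.wait(free).unwrap();
        }
        *free -= 1;
        PageSlot { sem: Arc::clone(sem) }
    }
}

impl Drop for PageSlot {
    /// Returns the permit and wakes one waiter, whether the slot was
    /// dropped explicitly or went out of scope on an early return.
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}
```

Tying the release to `Drop` means a caller that returns early with `?` cannot leak a permit, which is the property the paragraph above describes.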
Connecting to an external browser
Instead of launching a local Chrome process, you can connect to an already-running browser via its CDP WebSocket endpoint:
use kreuzcrawl::{CrawlConfig, BrowserConfig};
let config = CrawlConfig {
    browser: BrowserConfig {
        endpoint: Some("ws://127.0.0.1:9222/devtools/browser/...".into()),
        ..Default::default()
    },
    ..Default::default()
};
Or with BrowserPoolConfig:
use kreuzcrawl::BrowserPoolConfig;
let pool_config = BrowserPoolConfig {
    browser_endpoint: Some("ws://127.0.0.1:9222/devtools/browser/...".into()),
    ..Default::default()
};
This is useful when running Chrome in a sidecar container or a remote debugging session.
Browser profiles (persistent sessions)
BrowserProfile lets you persist cookies, localStorage, and other browser state across
crawl sessions by pointing Chrome at a stable --user-data-dir.
use kreuzcrawl::{BrowserPoolConfig, BrowserProfile};
// Create a profile handle (does not touch disk yet).
let profile = BrowserProfile::new("my-session")?;
// Create the directory if it doesn't exist.
if !profile.exists() {
    profile.create()?;
}
// Pass the profile's Chrome args to BrowserPoolConfig.
let pool_config = BrowserPoolConfig {
    chrome_args: profile.chrome_args(),
    ..Default::default()
};
Profile names are validated to prevent path-traversal attacks. Only ASCII alphanumerics,
hyphens, underscores, and dots are allowed (max 255 characters). Profiles are stored under
<data_dir>/kreuzcrawl/profiles/<name>.
Manage profiles:
// List all profiles on disk.
let names = BrowserProfile::list_all()?;
// Delete a profile (refuses to follow symlinks).
profile.delete()?;
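The profile-name rule stated earlier in this section (ASCII alphanumerics, hyphens, underscores, and dots, at most 255 characters) can be sketched as a standalone check. The function name is hypothetical, and rejecting empty names and the special entries `.` and `..` is an assumption about what path-traversal protection would require, not something the text above spells out.

```rust
/// Illustrative validator for the naming rule; not kreuzcrawl's own code.
fn is_valid_profile_name(name: &str) -> bool {
    // Assumption: empty names and the directory entries "." and ".."
    // are rejected even though they pass the character check.
    if name.is_empty() || name == "." || name == ".." {
        return false;
    }
    // Length is measured in bytes; for names that pass the ASCII-only
    // character check below, bytes and characters are the same count.
    if name.len() > 255 {
        return false;
    }
    name.bytes()
        .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'-' | b'_' | b'.'))
}
```

Because path separators are not in the allowed set, a name such as `../etc` fails the character check before it can be joined onto the profiles directory.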
Unix permissions
On Unix, profile.create() sets the directory to mode 0o700 (owner-only access).
Never store profiles in world-readable locations.
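For reference, an owner-only directory can be created with the standard library alone. The sketch below is Unix-only and shows one way to get mode 0o700; `create_private_dir` is a hypothetical helper, and whether kreuzcrawl uses this exact call is an assumption.

```rust
use std::fs::DirBuilder;
use std::io;
use std::os::unix::fs::DirBuilderExt;
use std::path::Path;

/// Creates `dir` with owner-only permissions (rwx for the owner,
/// nothing for group or other). Unix-only: `DirBuilderExt` does not
/// exist on other platforms.
fn create_private_dir(dir: &Path) -> io::Result<()> {
    DirBuilder::new()
        .recursive(true) // create missing parents; an existing dir is not an error
        .mode(0o700)     // requested mode for newly created directories
        .create(dir)
}
```

The mode applies only to directories this call creates, so a directory that already exists with looser permissions is left unchanged.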
WAF detection
Kreuzcrawl detects when a response is blocked by a Web Application Firewall and returns a
CrawlError::WafBlocked error with the identified vendor. Detection runs on both HTTP
responses and browser-rendered pages.
Detected vendors:
| Vendor | Detection signal |
|---|---|
| Cloudflare | Server: cloudflare, cf-browser-verification, cf-chl- body markers |
| Akamai | Server: AkamaiGHost |
| Imperva (Incapsula) | incapsula, _incap_ses_ body markers |
| DataDome | datadome body marker, x-datadome header |
| PerimeterX | perimeterx, px-captcha body markers, x-px-* headers |
| Sucuri | sucuri body marker, x-sucuri-id header |
| F5 BIG-IP | Server: big-ip |
| AWS WAF | awselb, x-amzn-waf body markers, x-amzn-waf-action header |
In Auto browser mode, a WAF challenge triggers an automatic browser fallback so that
JavaScript challenges can be solved client-side.
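A header-and-body check of the kind the vendor table describes might look like the following. It covers only a few vendors, the `Waf` enum and `detect_waf` function are hypothetical, and a production detector would probably also weigh the status code to avoid flagging every page served through a CDN; that last point is an assumption, not a statement about kreuzcrawl.

```rust
/// Hypothetical subset of vendors, for illustration.
#[derive(Debug, PartialEq)]
enum Waf {
    Cloudflare,
    Akamai,
    DataDome,
    Sucuri,
}

/// Checks response headers first, then body markers, using signals
/// taken from the vendor table. `headers` holds (name, value) pairs.
fn detect_waf(headers: &[(&str, &str)], body: &str) -> Option<Waf> {
    for (name, value) in headers {
        // Header names are case-insensitive, so compare in lowercase.
        let name = name.to_ascii_lowercase();
        let value = value.to_ascii_lowercase();
        match (name.as_str(), value.as_str()) {
            ("server", "cloudflare") => return Some(Waf::Cloudflare),
            ("server", "akamaighost") => return Some(Waf::Akamai),
            ("x-datadome", _) => return Some(Waf::DataDome),
            ("x-sucuri-id", _) => return Some(Waf::Sucuri),
            _ => {}
        }
    }
    let body = body.to_ascii_lowercase();
    if body.contains("cf-browser-verification") || body.contains("cf-chl-") {
        Some(Waf::Cloudflare)
    } else if body.contains("datadome") {
        Some(Waf::DataDome)
    } else if body.contains("sucuri") {
        Some(Waf::Sucuri)
    } else {
        None
    }
}
```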
Wait strategies
After the browser navigates to a URL, it needs to wait for the page to finish rendering.
The BrowserWait enum controls this behaviour.
| Strategy | Behaviour | Default wait |
|---|---|---|
| NetworkIdle (default) | Waits for a 500 ms settle period after initial page load, giving client-side JS time to execute. | 500 ms |
| Selector | Waits until a specific CSS selector appears in the DOM. Falls back to 500 ms if no wait_selector is configured. | Varies |
| Fixed | Waits a fixed 2-second duration after navigation completes. | 2 s |
Configure in BrowserConfig:
use kreuzcrawl::{BrowserConfig, BrowserWait};
let browser = BrowserConfig {
    wait: BrowserWait::Selector,
    wait_selector: Some("#main-content".into()),
    extra_wait: Some(std::time::Duration::from_millis(200)),
    timeout: std::time::Duration::from_secs(30),
    ..Default::default()
};
The extra_wait field adds additional sleep time after the wait condition is met.
The timeout field is the hard cap on the entire navigation-plus-wait cycle; if exceeded,
CrawlError::BrowserTimeout is returned.
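How a wait condition, an extra wait, and a hard timeout interact can be sketched with a blocking poll loop. Everything here is hypothetical (`wait_for`, `TimedOut`, the polling interval), the real implementation is asynchronous, and counting the extra wait against the timeout is an assumption drawn from the phrase "entire navigation-plus-wait cycle".

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Stand-in for a timeout error.
#[derive(Debug, PartialEq)]
struct TimedOut;

/// Polls `ready` until it returns true, then sleeps for `extra_wait`.
/// Assumption: the whole cycle, extra wait included, must fit in `timeout`.
fn wait_for(
    mut ready: impl FnMut() -> bool,
    poll_every: Duration,
    extra_wait: Duration,
    timeout: Duration,
) -> Result<(), TimedOut> {
    let deadline = Instant::now() + timeout;
    while !ready() {
        if Instant::now() >= deadline {
            return Err(TimedOut);
        }
        sleep(poll_every);
    }
    // The extra wait only starts once the condition has been met.
    if Instant::now() + extra_wait > deadline {
        return Err(TimedOut);
    }
    sleep(extra_wait);
    Ok(())
}
```

With a Selector-style wait, `ready` would be a check for the element's presence; with a Fixed-style wait it would return true immediately and `extra_wait` would carry the whole delay.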
Choosing a strategy
Use NetworkIdle for most sites. Switch to Selector when you know the exact element
that signals the page is ready (e.g. a data table or main content div). Use Fixed
only as a last resort for unpredictable sites.