Crawling¶
Crawling follows links from a seed URL, building a collection of pages up to a configured depth and page limit. Kreuzcrawl provides both collected and streaming interfaces, batch operations, and pluggable strategies for URL selection.
Basic crawl¶
The simplest crawl fetches a single seed URL and follows its links:
```rust
use kreuzcrawl::{CrawlEngine, CrawlConfig};

let engine = CrawlEngine::builder()
    .config(CrawlConfig {
        max_depth: Some(2),
        max_pages: Some(50),
        ..Default::default()
    })
    .build()?;

let result = engine.crawl("https://example.com").await?;
for page in &result.pages {
    println!("{} (depth {}): {} bytes", page.url, page.depth, page.body_size);
}
```
`CrawlResult` contains the full list of crawled pages, the final URL after redirect resolution, the redirect count, collected cookies, and any error encountered.
Depth and page limits¶
| Field | Type | Default | Description |
|---|---|---|---|
| `max_depth` | `Option<usize>` | `None` (0, seed only) | Maximum number of link hops from the seed URL. `None` means depth 0, which fetches only the seed page. |
| `max_pages` | `Option<usize>` | `None` (unlimited) | Maximum number of pages to include in the result. The engine stops spawning fetch tasks once this limit is reached and aborts any in-flight tasks. |
> **Depth 0 means seed only**
>
> When `max_depth` is `None` or `Some(0)`, the engine fetches the seed URL but does not follow any links. Set `max_depth: Some(1)` to crawl one hop out.
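The depth rule can be sketched as a small predicate. Note that `should_follow` is an illustrative helper, not part of the kreuzcrawl API:

```rust
// Sketch of the depth rule: a link at `link_depth` hops from the seed
// is followed only when max_depth allows it. `should_follow` is a
// hypothetical helper, not part of kreuzcrawl.
fn should_follow(link_depth: usize, max_depth: Option<usize>) -> bool {
    // None behaves like Some(0): only the depth-0 seed is fetched.
    link_depth <= max_depth.unwrap_or(0)
}

fn main() {
    assert!(!should_follow(1, None));    // seed only
    assert!(!should_follow(1, Some(0))); // seed only
    assert!(should_follow(1, Some(1)));  // one hop out
    assert!(!should_follow(2, Some(1))); // two hops exceeds the limit
    println!("depth checks passed");
}
```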
Concurrent fetching¶
Control parallelism with `max_concurrent`:
| Field | Type | Default | Description |
|---|---|---|---|
| `max_concurrent` | `Option<usize>` | `None` (10) | Maximum number of simultaneous HTTP requests. A tokio `Semaphore` enforces this limit across all in-flight fetch tasks. |
> **Tip**
>
> The default of 10 concurrent requests is a good starting point. Lower it when crawling sites with strict rate limits; raise it for high-throughput internal crawls.

The engine also applies per-domain rate limiting through the `RateLimiter` trait. The default `PerDomainThrottle` enforces a 200 ms delay between requests to the same domain, and automatically respects `Crawl-delay` directives from robots.txt when `respect_robots_txt` is enabled.
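The core of a per-domain throttle can be sketched with a map of last-request times. The struct and method names below are illustrative assumptions, not the actual `PerDomainThrottle` internals:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Sketch of a per-domain throttle in the spirit of PerDomainThrottle:
// remember when each domain was last requested and report how long the
// next request to that domain must wait.
struct Throttle {
    delay: Duration,
    last: HashMap<String, Instant>,
}

impl Throttle {
    fn new(delay: Duration) -> Self {
        Throttle { delay, last: HashMap::new() }
    }

    // Returns the wait before `domain` may be fetched again, then
    // records the projected request time.
    fn wait_for(&mut self, domain: &str) -> Duration {
        let now = Instant::now();
        let wait = match self.last.get(domain) {
            Some(prev) => self.delay.saturating_sub(now.duration_since(*prev)),
            None => Duration::ZERO,
        };
        self.last.insert(domain.to_string(), now + wait);
        wait
    }
}

fn main() {
    let mut throttle = Throttle::new(Duration::from_millis(200));
    // First hit on a domain is immediate; an immediate second hit waits.
    assert_eq!(throttle.wait_for("example.com"), Duration::ZERO);
    let wait = throttle.wait_for("example.com");
    assert!(wait <= Duration::from_millis(200));
    // A different domain is unaffected.
    assert_eq!(throttle.wait_for("other.org"), Duration::ZERO);
    println!("second request to the same domain waits up to {:?}", wait);
}
```

The real implementation also has to integrate with the async runtime so waiting tasks sleep rather than block; this sketch only captures the bookkeeping.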
Domain scoping¶
Keep the crawl within the seed domain or allow subdomains:
```rust
CrawlConfig {
    stay_on_domain: true,
    allow_subdomains: false, // only exact domain match
    ..Default::default()
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `stay_on_domain` | `bool` | `false` | When true, only follow links whose host matches the seed URL's host. |
| `allow_subdomains` | `bool` | `false` | When true and `stay_on_domain` is true, also follow links to subdomains of the seed host (e.g., blog.example.com when the seed is example.com). |
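The host-matching rule reduces to an exact comparison plus an optional dotted-suffix check. The `in_scope` helper below is an illustrative sketch, not the engine's actual code:

```rust
// Sketch of the domain-scoping rule: exact host match, optionally
// extended to subdomains. `in_scope` is a hypothetical helper.
fn in_scope(seed_host: &str, link_host: &str, allow_subdomains: bool) -> bool {
    link_host == seed_host
        || (allow_subdomains && link_host.ends_with(&format!(".{}", seed_host)))
}

fn main() {
    assert!(in_scope("example.com", "example.com", false));
    assert!(!in_scope("example.com", "blog.example.com", false));
    assert!(in_scope("example.com", "blog.example.com", true));
    // A bare suffix match is not enough: the dot boundary matters.
    assert!(!in_scope("example.com", "badexample.com", true));
    println!("scope checks passed");
}
```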
Path filtering with regex¶
Include or exclude URL paths using regex patterns:
```rust
CrawlConfig {
    include_paths: vec![r"^/docs/".to_string(), r"^/blog/".to_string()],
    exclude_paths: vec![r"/admin/".to_string(), r"\.pdf$".to_string()],
    ..Default::default()
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `include_paths` | `Vec<String>` | `[]` | Regex patterns matched against the URL path. When non-empty, only URLs matching at least one pattern are crawled. The depth-0 seed URL is always included regardless of this filter. |
| `exclude_paths` | `Vec<String>` | `[]` | Regex patterns matched against the URL path. URLs matching any pattern are skipped. Exclude patterns take priority over include patterns. |
The engine compiles these patterns once at the start of the crawl and validates them during `CrawlConfig::validate()`. Invalid regex patterns produce a `CrawlError::InvalidConfig` error.
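The precedence between the two filters can be sketched with plain predicate functions standing in for compiled regexes. The helper names are illustrative, and letting the seed bypass only the include filter is an assumption based on the table above:

```rust
// Sketch of the path-filtering precedence, with fn-pointer predicates
// standing in for compiled regexes. Not the kreuzcrawl internals.
fn path_allowed(
    path: &str,
    depth: usize,
    include: &[fn(&str) -> bool],
    exclude: &[fn(&str) -> bool],
) -> bool {
    // Exclude patterns take priority over include patterns.
    if exclude.iter().any(|p| p(path)) {
        return false;
    }
    // The depth-0 seed bypasses the include filter.
    if depth == 0 || include.is_empty() {
        return true;
    }
    include.iter().any(|p| p(path))
}

fn main() {
    let include: &[fn(&str) -> bool] =
        &[|p| p.starts_with("/docs/"), |p| p.starts_with("/blog/")];
    let exclude: &[fn(&str) -> bool] =
        &[|p| p.contains("/admin/"), |p| p.ends_with(".pdf")];
    assert!(path_allowed("/docs/intro", 1, include, exclude));
    assert!(!path_allowed("/about", 1, include, exclude));           // matches no include
    assert!(!path_allowed("/docs/manual.pdf", 1, include, exclude)); // exclude wins
    assert!(path_allowed("/", 0, include, exclude));                 // seed always included
    println!("filter checks passed");
}
```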
Streaming events¶
For real-time processing as pages are crawled, use `crawl_stream`:
```rust
use tokio_stream::StreamExt;
use kreuzcrawl::CrawlEvent;

let mut stream = engine.crawl_stream("https://example.com");
while let Some(event) = stream.next().await {
    match event {
        CrawlEvent::Page(page) => {
            println!("Crawled: {} ({} bytes)", page.url, page.body_size);
        }
        CrawlEvent::Error { url, error } => {
            eprintln!("Failed: {} - {}", url, error);
        }
        CrawlEvent::Complete { pages_crawled } => {
            println!("Done: {} pages", pages_crawled);
        }
    }
}
```
The stream uses a buffered channel sized at `max_concurrent * 16`. Dropping the stream receiver signals the engine to cancel remaining work.
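The effect of that bounded buffer can be demonstrated with a std `sync_channel` standing in for the async channel the engine actually uses (the capacity arithmetic is the same):

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // A bounded channel like the one backing crawl_stream, sized
    // max_concurrent * 16 (here 2 * 16 = 32). std's sync_channel
    // stands in for the tokio channel; the backpressure idea carries over.
    let max_concurrent = 2;
    let (tx, rx) = sync_channel::<u32>(max_concurrent * 16);
    for i in 0..32 {
        tx.try_send(i).unwrap(); // fills the buffer without blocking
    }
    // The 33rd send fails (or would block, with a blocking send) until
    // the receiver drains the buffer.
    assert!(tx.try_send(32).is_err());
    assert_eq!(rx.recv().unwrap(), 0);
    println!("buffered up to {} events", max_concurrent * 16);
}
```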
CrawlEvent variants¶
| Variant | Fields | Description |
|---|---|---|
| `Page` | `Box<CrawlPageResult>` | A page was successfully fetched and extracted. Contains the full page result with HTML, metadata, links, markdown, etc. |
| `Error` | `url: String`, `error: String` | A fetch or extraction error occurred for the given URL. |
| `Complete` | `pages_crawled: usize` | The crawl has finished. |
Batch crawling¶
Crawl multiple seed URLs concurrently, each following links independently:
```rust
let results = engine.batch_crawl(&[
    "https://example.com",
    "https://other-site.org",
]).await;

for (seed_url, result) in results {
    match result {
        Ok(crawl) => println!("{}: {} pages", seed_url, crawl.pages.len()),
        Err(e) => eprintln!("{}: {}", seed_url, e),
    }
}
```
`batch_crawl` respects the same `max_concurrent` limit across all seed URLs. Each seed URL produces an independent `CrawlResult`.

For streaming across multiple seeds, use `batch_crawl_stream`:
```rust
let mut stream = engine.batch_crawl_stream(&[
    "https://example.com",
    "https://other-site.org",
]);
while let Some(event) = stream.next().await {
    // Events from all crawls are interleaved
}
```
There is also `batch_scrape` for scraping multiple individual URLs without link following:
```rust
let results = engine.batch_scrape(&[
    "https://example.com/page1",
    "https://example.com/page2",
]).await;
```
Crawl strategies¶
The engine uses a pluggable `CrawlStrategy` trait to control URL selection order. Four built-in strategies are available:
BfsStrategy (default)¶
Breadth-first: always selects the oldest (first) entry from the working set. This explores all pages at depth N before moving to depth N+1.
DfsStrategy¶
Depth-first: always selects the newest (last) entry, giving LIFO behavior. This dives deep along a single path before backtracking.
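The difference between the two strategies comes down to which end of the working set they take from. A toy sketch with a `VecDeque` standing in for the engine's candidate list:

```rust
use std::collections::VecDeque;

// BFS takes from the front (FIFO, oldest first); DFS takes from the
// back (LIFO, newest first). The VecDeque here is illustrative; the
// real strategies operate on the engine's internal working set.
fn main() {
    let mut frontier: VecDeque<&str> = VecDeque::from(vec!["a", "b", "c"]);
    // BfsStrategy-style selection: oldest entry first.
    let bfs_pick = frontier.pop_front().unwrap();
    assert_eq!(bfs_pick, "a");
    // DfsStrategy-style selection: newest entry first.
    let dfs_pick = frontier.pop_back().unwrap();
    assert_eq!(dfs_pick, "c");
    println!("bfs picked {}, dfs picked {}", bfs_pick, dfs_pick);
}
```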
BestFirstStrategy¶
Selects the candidate with the highest priority score. By default, priority is `1.0 / (depth + 1.0)`, favoring shallower pages. Override `score_url` for custom ranking.
```rust
use kreuzcrawl::BestFirstStrategy;

let engine = CrawlEngine::builder()
    .strategy(BestFirstStrategy)
    .build()?;
```
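The default scoring rule and the highest-score selection can be sketched as follows (a toy loop, not the engine's internals):

```rust
// The default BestFirstStrategy priority: 1.0 / (depth + 1.0).
// Depth 0 scores 1.0, depth 1 scores 0.5, and so on.
fn score(depth: usize) -> f64 {
    1.0 / (depth as f64 + 1.0)
}

fn main() {
    // Candidates as (url, depth) pairs; the names are illustrative.
    let candidates = [("seed/child", 1), ("seed", 0), ("deep/page", 3)];
    let best = candidates
        .iter()
        .max_by(|a, b| score(a.1).partial_cmp(&score(b.1)).unwrap())
        .unwrap();
    assert_eq!(best.0, "seed"); // depth 0 scores 1.0, the maximum
    println!("best candidate: {} (score {})", best.0, score(best.1));
}
```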
AdaptiveStrategy¶
Tracks unique terms across crawled pages and stops when content saturation is detected -- when the rate of new term discovery drops below a threshold. This is useful for crawling sites where you want comprehensive coverage without redundant pages.
```rust
use kreuzcrawl::AdaptiveStrategy;

// Stop when the new-term rate drops below 5% over a 10-page window
let engine = CrawlEngine::builder()
    .strategy(AdaptiveStrategy::new(10, 0.05))
    .build()?;
```
| Parameter | Default | Description |
|---|---|---|
| `window_size` | `10` | Number of recent pages to consider for saturation detection. |
| `saturation_threshold` | `0.05` | Stop when the ratio of new terms per page drops below this value (0.0 to 1.0). |
The adaptive strategy continues crawling unconditionally until at least `window_size` pages have been processed, ensuring enough data for a meaningful saturation signal.
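The saturation mechanism can be sketched with a term set and a sliding window of per-page new-term ratios. The struct and method names below are illustrative assumptions, not the `AdaptiveStrategy` API:

```rust
use std::collections::HashSet;

// Sketch of saturation detection: track every term seen so far and the
// fraction of new terms each recent page contributed.
struct Saturation {
    seen: HashSet<String>,
    window: Vec<f64>, // per-page new-term ratios, most recent last
    window_size: usize,
    threshold: f64,
}

impl Saturation {
    fn new(window_size: usize, threshold: f64) -> Self {
        Saturation { seen: HashSet::new(), window: Vec::new(), window_size, threshold }
    }

    // Record a page's terms; returns true when the crawl should stop.
    fn observe(&mut self, terms: &[&str]) -> bool {
        let new = terms.iter().filter(|t| self.seen.insert(t.to_string())).count();
        let ratio = if terms.is_empty() { 0.0 } else { new as f64 / terms.len() as f64 };
        self.window.push(ratio);
        if self.window.len() > self.window_size {
            self.window.remove(0);
        }
        // Keep crawling unconditionally until the window is full.
        if self.window.len() < self.window_size {
            return false;
        }
        let avg = self.window.iter().sum::<f64>() / self.window.len() as f64;
        avg < self.threshold
    }
}

fn main() {
    let mut sat = Saturation::new(2, 0.05);
    assert!(!sat.observe(&["rust", "crawl"])); // all new, window not yet full
    assert!(!sat.observe(&["rust", "crawl"])); // window full, avg ratio 0.5 > 0.05
    assert!(sat.observe(&["rust", "crawl"]));  // avg ratio 0.0 < 0.05: saturated
    println!("saturation detected");
}
```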
Content filtering¶
The `ContentFilter` trait allows post-extraction filtering of pages. The built-in `Bm25Filter` scores pages against a query using BM25 TF-saturation:
```rust
use kreuzcrawl::Bm25Filter;

let engine = CrawlEngine::builder()
    .content_filter(Bm25Filter::new("rust programming", 0.3))
    .build()?;
```
Pages scoring below the threshold are excluded from crawl results but still contribute to link discovery -- their outgoing links are followed normally.
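The TF-saturation idea can be illustrated with the classic BM25 term-frequency component, `tf * (k1 + 1) / (tf + k1)`: repeated occurrences of a query term contribute with diminishing returns. This is a simplified sketch, not the actual `Bm25Filter` scoring, which would also account for document length and term rarity:

```rust
// BM25-style TF saturation: the contribution of a term grows with its
// frequency but flattens out, controlled by k1.
fn tf_saturation(tf: f64, k1: f64) -> f64 {
    tf * (k1 + 1.0) / (tf + k1)
}

// Toy page score: sum the saturated frequency of each query term.
fn score(query: &[&str], doc: &[&str], k1: f64) -> f64 {
    query
        .iter()
        .map(|q| {
            let tf = doc.iter().filter(|&&w| w == *q).count() as f64;
            tf_saturation(tf, k1)
        })
        .sum()
}

fn main() {
    let doc = ["rust", "is", "a", "systems", "programming", "language", "rust"];
    let on_topic = score(&["rust", "programming"], &doc, 1.2);
    let off_topic = score(&["cooking", "recipes"], &doc, 1.2);
    assert!(on_topic > off_topic);
    assert_eq!(off_topic, 0.0); // no query term appears in the page
    println!("on-topic {:.2}, off-topic {:.2}", on_topic, off_topic);
}
```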
Redirect handling¶
The engine resolves redirects before starting the crawl loop, following HTTP 3xx redirects, `Refresh` headers, and `<meta http-equiv="refresh">` tags. Redirect loops and excessive redirects are detected and reported.
| Field | Type | Default | Description |
|---|---|---|---|
| `max_redirects` | `usize` | `10` | Maximum number of redirects to follow before reporting an error. Must be <= 100. |
CrawlResult reference¶
| Field | Type | Description |
|---|---|---|
| `pages` | `Vec<CrawlPageResult>` | All successfully crawled pages. |
| `final_url` | `String` | The URL after resolving initial redirects from the seed. |
| `redirect_count` | `usize` | Number of redirects followed during initial resolution. |
| `was_skipped` | `bool` | Whether any page was skipped (binary or PDF content). |
| `error` | `Option<String>` | Error message if the crawl encountered a fatal issue. |
| `cookies` | `Vec<CookieInfo>` | Cookies collected during the crawl (when `cookies_enabled` is true). |
CrawlPageResult reference¶
Each page in the crawl result contains:
| Field | Type | Description |
|---|---|---|
| `url` | `String` | The original fetched URL. |
| `normalized_url` | `String` | URL after normalization (for deduplication). |
| `status_code` | `u16` | HTTP response status code. |
| `content_type` | `String` | The Content-Type header value. |
| `html` | `String` | The response body. |
| `body_size` | `usize` | Size of the response body in bytes. |
| `metadata` | `PageMetadata` | Extracted metadata (title, description, OG tags, etc.). |
| `links` | `Vec<LinkInfo>` | Links found on the page. |
| `images` | `Vec<ImageInfo>` | Images found on the page. |
| `feeds` | `Vec<FeedInfo>` | RSS/Atom/JSON feed links. |
| `json_ld` | `Vec<JsonLdEntry>` | JSON-LD structured data entries. |
| `depth` | `usize` | Distance from the seed URL in link hops. |
| `stayed_on_domain` | `bool` | Whether this page is on the same domain as the seed. |
| `markdown` | `Option<MarkdownResult>` | Markdown conversion (always populated for HTML pages). |
| `extracted_data` | `Option<Value>` | LLM-extracted structured data (when using `LlmExtractor`). |
| `extraction_meta` | `Option<ExtractionMeta>` | LLM extraction cost and token metadata. |