Skip to content

Crawling

Crawling follows links from a seed URL, building a collection of pages up to a configured depth and page limit. Kreuzcrawl provides both collected and streaming interfaces, batch operations, and pluggable strategies for URL selection.

Basic crawl

The simplest crawl fetches a single seed URL and follows its links:

use kreuzcrawl::{CrawlConfig, create_engine, crawl};

let engine = create_engine(Some(CrawlConfig {
    max_depth: Some(2),
    max_pages: Some(50),
    ..Default::default()
}))?;

let result = crawl(&engine, "https://example.com").await?;

for page in &result.pages {
    println!("{} (depth {}): {} bytes", page.url, page.depth, page.body_size);
}

CrawlResult contains the full list of crawled pages, the final URL after redirect resolution, redirect count, collected cookies, and any error encountered.

Depth and page limits

Field Type Default Description
max_depth Option<usize> None (0 -- seed only) Maximum number of link hops from the seed URL. None means depth 0, which fetches only the seed page.
max_pages Option<usize> None (unlimited) Maximum number of pages to include in the result. The engine stops spawning fetch tasks once this limit is reached and aborts any in-flight tasks.

Depth 0 means seed only

When max_depth is None or Some(0), the engine fetches the seed URL but does not follow any links. Set max_depth: Some(1) to crawl one hop out.

Concurrent fetching

Control parallelism with max_concurrent:

CrawlConfig {
    max_concurrent: Some(5),
    ..Default::default()
}
Field Type Default Description
max_concurrent Option<usize> None (10) Maximum number of simultaneous HTTP requests. A tokio Semaphore enforces this limit across all in-flight fetch tasks.

Tip

The default of 10 concurrent requests is a good starting point. Lower it when crawling sites with strict rate limits; raise it for high-throughput internal crawls.

A per-domain rate limit also applies: the engine enforces a 200 ms minimum interval between requests to the same domain, and automatically respects Crawl-delay directives from robots.txt when respect_robots_txt is enabled.

Domain scoping

Keep the crawl within the seed domain or allow subdomains:

CrawlConfig {
    stay_on_domain: true,
    allow_subdomains: false, // only exact domain match
    ..Default::default()
}
Field Type Default Description
stay_on_domain bool false When true, only follow links whose host matches the seed URL's host.
allow_subdomains bool false When true and stay_on_domain is true, also follow links to subdomains of the seed host (e.g., blog.example.com when the seed is example.com).

Path filtering with regex

Include or exclude URL paths using regex patterns:

CrawlConfig {
    include_paths: vec![r"^/docs/".to_string(), r"^/blog/".to_string()],
    exclude_paths: vec![r"/admin/".to_string(), r"\.pdf$".to_string()],
    ..Default::default()
}
Field Type Default Description
include_paths Vec<String> [] Regex patterns matched against the URL path. When non-empty, only URLs matching at least one pattern are crawled. The depth-0 seed URL is always included regardless of this filter.
exclude_paths Vec<String> [] Regex patterns matched against the URL path. URLs matching any pattern are skipped. Exclude patterns take priority over include patterns.

The engine compiles these patterns once at the start of the crawl and validates them during CrawlConfig::validate(). Invalid regex patterns produce a CrawlError::InvalidConfig error.

Batch crawling

Crawl multiple seed URLs concurrently, each following links independently:

use kreuzcrawl::batch_crawl;

let results = batch_crawl(
    &engine,
    vec![
        "https://example.com".to_owned(),
        "https://other-site.org".to_owned(),
    ],
).await?;

for entry in &results {
    match (&entry.result, &entry.error) {
        (Some(crawl), _) => println!("{}: {} pages", entry.url, crawl.pages.len()),
        (None, Some(e)) => eprintln!("{}: {}", entry.url, e),
        _ => {}
    }
}

batch_crawl respects the same max_concurrent limit across all seed URLs. Each seed URL produces an independent BatchCrawlResult with either a populated result or an error message.

There is also batch_scrape for scraping multiple individual URLs without link following:

use kreuzcrawl::batch_scrape;

let results = batch_scrape(
    &engine,
    vec![
        "https://example.com/page1".to_owned(),
        "https://example.com/page2".to_owned(),
    ],
).await?;

Redirect handling

The engine resolves redirects before starting the crawl loop, following HTTP 3xx redirects, Refresh headers, and <meta http-equiv="refresh"> tags. Redirect loops and excessive redirects are detected and reported.

Field Type Default Description
max_redirects usize 10 Maximum number of redirects to follow before reporting an error. Must be <= 100.

CrawlResult reference

Field Type Description
pages Vec<CrawlPageResult> All successfully crawled pages.
final_url String The URL after resolving initial redirects from the seed.
redirect_count usize Number of redirects followed during initial resolution.
was_skipped bool Whether any page was skipped (binary or PDF content).
error Option<String> Error message if the crawl encountered a fatal issue.
cookies Vec<CookieInfo> Cookies collected during the crawl (when cookies_enabled is true).

CrawlPageResult reference

Each page in the crawl result contains:

Field Type Description
url String The original fetched URL.
normalized_url String URL after normalization (for deduplication).
status_code u16 HTTP response status code.
content_type String The Content-Type header value.
html String The response body.
body_size usize Size of the response body in bytes.
metadata PageMetadata Extracted metadata (title, description, OG tags, etc.).
links Vec<LinkInfo> Links found on the page.
images Vec<ImageInfo> Images found on the page.
feeds Vec<FeedInfo> RSS/Atom/JSON feed links.
json_ld Vec<JsonLdEntry> JSON-LD structured data entries.
depth usize Distance from the seed URL in link hops.
stayed_on_domain bool Whether this page is on the same domain as the seed.
markdown Option<MarkdownResult> Markdown conversion (always populated for HTML pages).
extracted_data Option<Value> LLM-extracted structured data, when extraction is configured.
extraction_meta Option<ExtractionMeta> LLM extraction cost and token metadata.

Edit this page on GitHub