Scraping¶

Scraping fetches a single URL and runs the full extraction pipeline: metadata, links, images, feeds, JSON-LD, robots.txt compliance, markdown conversion, and optionally LLM-powered structured extraction.

Single-page scrape¶

use kreuzcrawl::{CrawlConfig, create_engine, scrape};

let engine = create_engine(Some(CrawlConfig::default()))?;

let result = scrape(&engine, "https://example.com").await?;

println!("Status: {}", result.status_code);
println!("Title: {:?}", result.metadata.title);
println!("Links: {}", result.links.len());
println!("Body size: {} bytes", result.body_size);

The scrape request routes through the engine's Tower service stack, which applies per-domain rate limiting, HTTP response caching, and user-agent rotation before the actual HTTP fetch.

Page interaction¶

Use interact() when a page must be changed before HTML is captured: click a button, type into an input, wait for a selector, run JavaScript, take a screenshot, or scrape the current DOM.

use kreuzcrawl::{
    BrowserBackend, BrowserConfig, BrowserMode, CrawlConfig, PageAction, create_engine, interact,
};

let engine = create_engine(Some(CrawlConfig {
    browser: BrowserConfig {
        backend: BrowserBackend::Chromiumoxide,
        mode: BrowserMode::Always,
        ..BrowserConfig::default()
    },
    ..CrawlConfig::default()
}))?;

let result = interact(
    &engine,
    "https://example.com",
    vec![
        PageAction::Click {
            selector: "#show-more".to_string(),
        },
        PageAction::Wait {
            milliseconds: None,
            selector: Some("#expanded".to_string()),
        },
        PageAction::Scrape,
    ],
)
.await?;

println!("Final URL: {}", result.final_url);
println!("HTML bytes: {}", result.final_html.len());

interact() validates the action list before navigation and returns one ActionResult per action. Failed actions are recorded in the result and later actions still run; navigation/setup failures are returned as CrawlError.

Both browser backends support click, type, press, scroll, wait, screenshot, JavaScript execution, and scrape actions. Chromiumoxide screenshots are captured from Chrome. Native screenshots are deterministic PNG snapshots derived from the post-action HTML, so they are useful for inspection but are not pixel-perfect Chrome compositor captures.

ScrapeResult fields¶

The ScrapeResult struct contains everything extracted from a single page:

Core response data¶

Field	Type	Description
`status_code`	`u16`	HTTP response status code.
`content_type`	`String`	The Content-Type header value.
`html`	`String`	The response body (possibly truncated by `content.max_body_size`).
`body_size`	`usize`	Size of the response body in bytes.
`detected_charset`	`Option<String>`	Character encoding detected from Content-Type header or HTML meta tags.
`is_pdf`	`bool`	Whether the content was detected as PDF.
`was_skipped`	`bool`	Whether extraction was skipped (binary or PDF content).

Robots and directives¶

Field	Type	Description
`is_allowed`	`bool`	Whether the URL is permitted by robots.txt (always `true` when `respect_robots_txt` is `false`).
`crawl_delay`	`Option<u64>`	The Crawl-delay value from robots.txt, in seconds.
`noindex_detected`	`bool`	Whether a `noindex` directive was found in meta robots or X-Robots-Tag.
`nofollow_detected`	`bool`	Whether a `nofollow` directive was found in meta robots or X-Robots-Tag.
`x_robots_tag`	`Option<String>`	The raw X-Robots-Tag header value, if present.

Extracted content¶

Field	Type	Description
`metadata`	`PageMetadata`	Rich metadata from meta tags, OG, Twitter, Dublin Core, and more.
`links`	`Vec<LinkInfo>`	All links found on the page, classified by type.
`images`	`Vec<ImageInfo>`	All images found, including OG and Twitter images.
`feeds`	`Vec<FeedInfo>`	RSS, Atom, and JSON Feed links.
`json_ld`	`Vec<JsonLdEntry>`	JSON-LD structured data entries.
`markdown`	`Option<MarkdownResult>`	Markdown conversion with document structure, tables, citations, and fit content.

Engine state¶

Field	Type	Description
`auth_header_sent`	`bool`	Whether an authentication header was sent.
`response_meta`	`Option<ResponseMeta>`	HTTP headers: ETag, Last-Modified, Cache-Control, Server, etc.
`assets`	`Vec<DownloadedAsset>`	Downloaded page assets (when `download_assets` is enabled).
`js_render_hint`	`bool`	Whether the page content suggests JavaScript rendering is needed.
`browser_used`	`bool`	Whether the headless browser fallback was used.
`screenshot`	`Option<Vec<u8>>`	PNG screenshot bytes (when browser is used and `capture_screenshot` is enabled).
`downloaded_document`	`Option<DownloadedDocument>`	Non-HTML document data (PDF, DOCX, etc.) when `download_documents` is enabled.

LLM extraction¶

Field	Type	Description
`extracted_data`	`Option<Value>`	Structured JSON extracted by an LLM, when LLM extraction is configured.
`extraction_meta`	`Option<ExtractionMeta>`	LLM cost tracking: estimated cost in USD, prompt/completion tokens, model name.

Metadata extraction¶

The PageMetadata struct extracts 40+ fields from HTML meta tags:

Standard meta¶

title -- from <title> element
description -- from <meta name="description">
canonical_url -- from <link rel="canonical">
keywords, author, viewport, theme_color, generator, robots
html_lang, html_dir -- from the <html> element's lang and dir attributes

Open Graph¶

og_title, og_type, og_image, og_description, og_url
og_site_name, og_locale, og_video, og_audio, og_locale_alternates

Twitter Card¶

twitter_card, twitter_title, twitter_description, twitter_image
twitter_site, twitter_creator

Dublin Core¶

dc_title, dc_creator, dc_subject, dc_description, dc_publisher
dc_date, dc_type, dc_format, dc_identifier, dc_language, dc_rights

Structured data¶

article -- ArticleMetadata from article:* OG tags (published_time, modified_time, author, section, tags)
hreflangs -- alternate language links
favicons -- icon links with sizes and MIME types
headings -- all h1-h6 elements with level and text
word_count -- computed word count of the page body text

Link extraction¶

Each LinkInfo includes:

Field	Type	Description
`url`	`String`	The resolved absolute URL.
`text`	`String`	The visible link text.
`link_type`	`LinkType`	Classification: `Internal`, `External`, `Anchor`, or `Document`.
`rel`	`Option<String>`	The `rel` attribute value.
`nofollow`	`bool`	Whether the link has `rel="nofollow"`.

Image extraction¶

Each ImageInfo includes:

Field	Type	Description
`url`	`String`	The image URL.
`alt`	`Option<String>`	Alt text.
`width`	`Option<u32>`	Width attribute.
`height`	`Option<u32>`	Height attribute.
`source`	`ImageSource`	Where the image was found: `Img`, `PictureSource`, `OgImage`, or `TwitterImage`.

Feed extraction¶

Discovered RSS, Atom, and JSON Feed links:

Field	Type	Description
`url`	`String`	The feed URL.
`title`	`Option<String>`	The feed title from the link element.
`feed_type`	`FeedType`	`Rss`, `Atom`, or `JsonFeed`.

JSON-LD extraction¶

Each JsonLdEntry contains:

Field	Type	Description
`schema_type`	`String`	The `@type` value (e.g., `"Article"`, `"Product"`).
`name`	`Option<String>`	The `name` field, if present.
`raw`	`String`	The raw JSON-LD string for full access.

Robots.txt compliance¶

When respect_robots_txt is set to true, the engine fetches and parses robots.txt before scraping:

CrawlConfig {
    respect_robots_txt: true,
    user_agent: Some("MyBot/1.0".to_string()),
    ..Default::default()
}

The scrape result includes is_allowed, crawl_delay, and any noindex/nofollow directives detected from both meta tags and X-Robots-Tag headers.

Note

When respect_robots_txt is false (the default), is_allowed is always true and robots.txt is not fetched.

Aggressive content pruning¶

Strip navigation, sidebars, and boilerplate before extraction via the content preset:

use kreuzcrawl::{CrawlConfig, ContentConfig};

CrawlConfig {
    content: ContentConfig {
        preprocessing_preset: "aggressive".to_owned(),
        ..Default::default()
    },
    ..Default::default()
}

preprocessing_preset accepts "minimal", "standard" (default), or "aggressive". The aggressive preset runs the main-content extractor before the metadata and link pipeline, so the resulting Markdown contains the primary content only.

Removing specific tags¶

Strip specific elements by CSS selector before processing:

CrawlConfig {
    remove_tags: vec![
        "nav".to_string(),
        ".sidebar".to_string(),
        "#cookie-banner".to_string(),
    ],
    ..Default::default()
}

Tag removal runs before main content extraction and before the metadata pipeline.

Response metadata¶

The ResponseMeta struct captures HTTP response headers:

Field	Type	Description
`etag`	`Option<String>`	ETag header for cache validation.
`last_modified`	`Option<String>`	Last-Modified header.
`cache_control`	`Option<String>`	Cache-Control directives.
`server`	`Option<String>`	Server software identifier.
`x_powered_by`	`Option<String>`	X-Powered-By header.
`content_language`	`Option<String>`	Content-Language header.
`content_encoding`	`Option<String>`	Content-Encoding header.

Authentication¶

Scrape pages behind authentication:

use kreuzcrawl::AuthConfig;

CrawlConfig {
    auth: Some(AuthConfig::Bearer {
        token: "your-token".to_string(),
    }),
    ..Default::default()
}

Three authentication modes are supported:

Mode	Fields	Header sent
`Basic`	`username`, `password`	`Authorization: Basic <base64>`
`Bearer`	`token`	`Authorization: Bearer <token>`
`Header`	`name`, `value`	Custom header with the specified name and value

Document downloads¶

When download_documents is enabled (the default), the engine downloads non-HTML resources like PDFs, DOCX files, and images instead of skipping them:

CrawlConfig {
    download_documents: true,          // default
    document_max_size: Some(50 * 1024 * 1024), // 50 MB default
    document_mime_types: vec![],       // empty = built-in defaults
    ..Default::default()
}

Downloaded documents are available in the downloaded_document field as a DownloadedDocument with raw bytes, MIME type, filename, size, and a SHA-256 content hash.

Edit this page on GitHub