Scraping

Scraping fetches a single URL and runs the full extraction pipeline: metadata, links, images, feeds, JSON-LD, robots.txt compliance, markdown conversion, and optionally LLM-powered structured extraction.

Single-page scrape

use kreuzcrawl::{CrawlEngine, CrawlConfig};

let engine = CrawlEngine::builder()
    .config(CrawlConfig::default())
    .build()?;

let result = engine.scrape("https://example.com").await?;

println!("Status: {}", result.status_code);
println!("Title: {:?}", result.metadata.title);
println!("Links: {}", result.links.len());
println!("Body size: {} bytes", result.body_size);

The scrape request routes through the engine's Tower service stack, which applies per-domain rate limiting, HTTP response caching, and user-agent rotation before the actual HTTP fetch.
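
To make the rate-limiting step more concrete, here is a minimal std-only sketch of per-domain throttling. It is not the engine's implementation: `DomainRateLimiter`, `reserve`, and the fixed `min_interval` policy are hypothetical names and choices made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain limiter: at most one request per `min_interval` per host.
pub struct DomainRateLimiter {
    min_interval: Duration,
    /// Earliest instant at which each host may be contacted again.
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Returns how long the caller should wait before contacting `host`
    /// and reserves the following slot for that host.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // The next free slot is never in the past.
        let slot = self.next_allowed.get(host).copied().unwrap_or(now).max(now);
        self.next_allowed.insert(host.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

Passing `now` explicitly keeps the sketch deterministic. A caller would sleep for the returned duration before issuing the request, and different hosts never delay each other.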

ScrapeResult fields

The ScrapeResult struct contains everything extracted from a single page:

Core response data

Field Type Description
status_code u16 HTTP response status code.
content_type String The Content-Type header value.
html String The response body (possibly truncated by max_body_size or filtered by main_content_only).
body_size usize Size of the response body in bytes.
detected_charset Option<String> Character encoding detected from Content-Type header or HTML meta tags.
is_pdf bool Whether the content was detected as PDF.
was_skipped bool Whether extraction was skipped (binary or PDF content).

Robots and directives

Field Type Description
is_allowed bool Whether the URL is permitted by robots.txt (always true when respect_robots_txt is false).
crawl_delay Option<u64> The Crawl-delay value from robots.txt, in seconds.
noindex_detected bool Whether a noindex directive was found in meta robots or X-Robots-Tag.
nofollow_detected bool Whether a nofollow directive was found in meta robots or X-Robots-Tag.
x_robots_tag Option<String> The raw X-Robots-Tag header value, if present.
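
As an illustration of how flags like noindex_detected and nofollow_detected could be derived from a raw X-Robots-Tag or meta robots value, the following sketch parses a comma-separated directive list. `RobotsDirectives` and `parse_robots_directives` are hypothetical names, and the library's own parser may handle more cases (for example user-agent-scoped directives).

```rust
/// Directives of interest parsed from a robots header or meta value.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RobotsDirectives {
    pub noindex: bool,
    pub nofollow: bool,
}

/// Parses a value such as "noindex, nofollow" (case-insensitive).
/// "none" is treated as shorthand for noindex plus nofollow.
/// Simplification: forms scoped to one crawler ("googlebot: noindex") are ignored.
pub fn parse_robots_directives(value: &str) -> RobotsDirectives {
    let mut out = RobotsDirectives::default();
    for token in value.split(',') {
        match token.trim().to_ascii_lowercase().as_str() {
            "noindex" => out.noindex = true,
            "nofollow" => out.nofollow = true,
            "none" => {
                out.noindex = true;
                out.nofollow = true;
            }
            _ => {} // other directives (noarchive, nosnippet, ...) are not tracked
        }
    }
    out
}
```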

Extracted content

Field Type Description
metadata PageMetadata Rich metadata from meta tags, OG, Twitter, Dublin Core, and more.
links Vec<LinkInfo> All links found on the page, classified by type.
images Vec<ImageInfo> All images found, including OG and Twitter images.
feeds Vec<FeedInfo> RSS, Atom, and JSON Feed links.
json_ld Vec<JsonLdEntry> JSON-LD structured data entries.
markdown Option<MarkdownResult> Markdown conversion with document structure, tables, citations, and fit content.

Engine state

Field Type Description
main_content_only bool Whether main content extraction was active.
auth_header_sent bool Whether an authentication header was sent.
response_meta Option<ResponseMeta> HTTP headers: ETag, Last-Modified, Cache-Control, Server, etc.
assets Vec<DownloadedAsset> Downloaded page assets (when download_assets is enabled).
js_render_hint bool Whether the page content suggests JavaScript rendering is needed.
browser_used bool Whether the headless browser fallback was used.
screenshot Option<Vec<u8>> PNG screenshot bytes (when browser is used and capture_screenshot is enabled).
downloaded_document Option<DownloadedDocument> Non-HTML document data (PDF, DOCX, etc.) when download_documents is enabled.

LLM extraction

Field Type Description
extracted_data Option<Value> Structured JSON extracted by LLM (populated when using LlmExtractor).
extraction_meta Option<ExtractionMeta> LLM cost tracking: estimated cost in USD, prompt/completion tokens, model name.

Metadata extraction

The PageMetadata struct holds 40+ fields extracted from HTML meta tags and related elements:

Standard meta

  • title -- from <title> element
  • description -- from <meta name="description">
  • canonical_url -- from <link rel="canonical">
  • keywords, author, viewport, theme_color, generator, robots
  • html_lang, html_dir -- from the <html> element's lang and dir attributes

Open Graph

  • og_title, og_type, og_image, og_description, og_url
  • og_site_name, og_locale, og_video, og_audio, og_locale_alternates

Twitter Card

  • twitter_card, twitter_title, twitter_description, twitter_image
  • twitter_site, twitter_creator

Dublin Core

  • dc_title, dc_creator, dc_subject, dc_description, dc_publisher
  • dc_date, dc_type, dc_format, dc_identifier, dc_language, dc_rights

Structured data

  • article -- ArticleMetadata from article:* OG tags (published_time, modified_time, author, section, tags)
  • hreflangs -- alternate language links
  • favicons -- icon links with sizes and MIME types
  • headings -- all h1-h6 elements with level and text
  • word_count -- computed word count of the page body text
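
Because several of these fields overlap (title, og_title, twitter_title), callers often want a single best value. The sketch below shows one possible fallback chain. `TitleFields` is a hypothetical stand-in for the relevant part of PageMetadata, the field types are assumed to be `Option<String>`, and the preference order is an arbitrary choice.

```rust
/// Hypothetical subset of the title-like metadata fields.
pub struct TitleFields {
    pub title: Option<String>,
    pub og_title: Option<String>,
    pub twitter_title: Option<String>,
}

/// Trims a value and discards it if nothing is left.
fn non_empty(value: &Option<String>) -> Option<&str> {
    value.as_deref().map(str::trim).filter(|s| !s.is_empty())
}

/// Picks the first non-empty title, preferring Open Graph, then <title>,
/// then the Twitter Card value.
pub fn best_title(meta: &TitleFields) -> Option<&str> {
    non_empty(&meta.og_title)
        .or_else(|| non_empty(&meta.title))
        .or_else(|| non_empty(&meta.twitter_title))
}
```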

Link extraction

Each LinkInfo includes:

Field Type Description
url String The resolved absolute URL.
text String The visible link text.
link_type LinkType Classification: Internal, External, Anchor, or Document.
rel Option<String> The rel attribute value.
nofollow bool Whether the link has rel="nofollow".
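
The four link_type categories can be approximated with a few string checks. The sketch below is an assumption about how such a classifier might look, not the library's actual rules: `LinkKind`, `classify_link`, `host_of`, and the `DOC_EXTENSIONS` list are all hypothetical, and the host parsing ignores IPv6 literals.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LinkKind {
    Internal,
    External,
    Anchor,
    Document,
}

/// Extracts the host from an absolute URL without external crates.
fn host_of(url: &str) -> Option<&str> {
    let (_, rest) = url.split_once("://")?;
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    let host_port = authority.rsplit('@').next()?; // drop any userinfo
    let host = host_port.split(':').next()?; // drop any port
    if host.is_empty() { None } else { Some(host) }
}

/// Classifies a raw href relative to the page it was found on.
pub fn classify_link(href: &str, page_url: &str) -> LinkKind {
    // Assumed list of document extensions, for illustration only.
    const DOC_EXTENSIONS: [&str; 4] = [".pdf", ".docx", ".xlsx", ".pptx"];

    if href.starts_with('#') {
        return LinkKind::Anchor;
    }
    // Compare the path only, ignoring the query string and fragment.
    let path = href
        .split(|c: char| matches!(c, '?' | '#'))
        .next()
        .unwrap_or(href)
        .to_ascii_lowercase();
    if DOC_EXTENSIONS.iter().any(|ext| path.ends_with(*ext)) {
        return LinkKind::Document;
    }
    // Relative URLs have no host and therefore count as internal.
    match (host_of(href), host_of(page_url)) {
        (Some(a), Some(b)) if !a.eq_ignore_ascii_case(b) => LinkKind::External,
        _ => LinkKind::Internal,
    }
}
```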

Image extraction

Each ImageInfo includes:

Field Type Description
url String The image URL.
alt Option<String> Alt text.
width Option<u32> Width attribute.
height Option<u32> Height attribute.
source ImageSource Where the image was found: Img, PictureSource, OgImage, or TwitterImage.

Feed extraction

Discovered RSS, Atom, and JSON Feed links:

Field Type Description
url String The feed URL.
title Option<String> The feed title from the link element.
feed_type FeedType Rss, Atom, or JsonFeed.
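
Feed links are conventionally advertised with `<link rel="alternate" type="...">`, so the feed type can be inferred from the MIME type. The following sketch shows that mapping. `FeedKind` and `feed_kind_from_mime` are hypothetical names, and the library may recognize additional MIME types.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FeedKind {
    Rss,
    Atom,
    JsonFeed,
}

/// Maps the `type` attribute of a feed link element to a feed kind.
pub fn feed_kind_from_mime(type_attr: &str) -> Option<FeedKind> {
    // Strip parameters such as "; charset=utf-8" before comparing.
    let mime = type_attr
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match mime.as_str() {
        "application/rss+xml" => Some(FeedKind::Rss),
        "application/atom+xml" => Some(FeedKind::Atom),
        "application/feed+json" => Some(FeedKind::JsonFeed),
        _ => None,
    }
}
```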

JSON-LD extraction

Each JsonLdEntry contains:

Field Type Description
schema_type String The @type value (e.g., "Article", "Product").
name Option<String> The name field, if present.
raw String The raw JSON-LD string for full access.

Robots.txt compliance

When respect_robots_txt is set to true, the engine fetches and parses robots.txt before scraping:

// Opt in to robots.txt checks and identify the crawler by name.
let config = CrawlConfig {
    respect_robots_txt: true,
    user_agent: Some("MyBot/1.0".to_string()),
    ..Default::default()
};

The scrape result includes is_allowed, crawl_delay, and any noindex/nofollow directives detected from both meta tags and X-Robots-Tag headers.

Note

When respect_robots_txt is false (the default), is_allowed is always true and robots.txt is not fetched.
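
A caller can combine is_allowed and crawl_delay to decide whether, and how soon, to fetch the next URL on the same host. This is a minimal sketch under stated assumptions: `RobotsOutcome` is a hypothetical mirror of those two result fields, and `polite_delay` with its `default_delay` parameter is not part of the library.

```rust
use std::time::Duration;

/// Hypothetical mirror of the two robots-related result fields used here.
pub struct RobotsOutcome {
    pub is_allowed: bool,
    pub crawl_delay: Option<u64>, // seconds, as reported by robots.txt
}

/// Returns None when the URL must not be fetched; otherwise the pause to
/// apply before the next request to the same host.
pub fn polite_delay(outcome: &RobotsOutcome, default_delay: Duration) -> Option<Duration> {
    if !outcome.is_allowed {
        return None;
    }
    let requested = outcome
        .crawl_delay
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(0));
    // Never go faster than the caller's own default.
    Some(requested.max(default_delay))
}
```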

Main content only mode

Extract only the primary content, stripping navigation, sidebars, and boilerplate:

// Keep only the primary content of each page.
let config = CrawlConfig {
    main_content_only: true,
    ..Default::default()
};

When enabled, the engine runs main content extraction on the HTML before the metadata/link extraction pipeline. The html field in the result contains only the extracted main content.

Removing specific tags

Strip specific elements by CSS selector before processing:

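// Each entry is a CSS selector; matching elements are removed before extraction.
let config = CrawlConfig {
    remove_tags: vec![
        "nav".to_string(),            // by element name
        ".sidebar".to_string(),       // by class
        "#cookie-banner".to_string(), // by id
    ],
    ..Default::default()
};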

Tag removal runs before main content extraction and before the metadata pipeline.

Response metadata

The ResponseMeta struct captures HTTP response headers:

Field Type Description
etag Option<String> ETag header for cache validation.
last_modified Option<String> Last-Modified header.
cache_control Option<String> Cache-Control directives.
server Option<String> Server software identifier.
x_powered_by Option<String> X-Powered-By header.
content_language Option<String> Content-Language header.
content_encoding Option<String> Content-Encoding header.
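
The etag and last_modified values are what a client needs in order to revalidate a cached copy with a conditional request. The sketch below builds the corresponding request headers. `CachedMeta` is a hypothetical mirror of those two fields and `conditional_headers` is not a library function.

```rust
/// Hypothetical mirror of the two validator fields of ResponseMeta.
pub struct CachedMeta {
    pub etag: Option<String>,
    pub last_modified: Option<String>,
}

/// Builds the request headers for revalidating a previously fetched page.
/// A server that still has the same version answers 304 Not Modified.
pub fn conditional_headers(meta: &CachedMeta) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &meta.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(last_modified) = &meta.last_modified {
        headers.push(("If-Modified-Since", last_modified.clone()));
    }
    headers
}
```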

Authentication

Scrape pages behind authentication:

use kreuzcrawl::{AuthConfig, CrawlConfig};

// Send "Authorization: Bearer your-token" with each request.
let config = CrawlConfig {
    auth: Some(AuthConfig::Bearer {
        token: "your-token".to_string(),
    }),
    ..Default::default()
};

Three authentication modes are supported:

  • Basic -- fields: username, password; header sent: Authorization: Basic <base64>
  • Bearer -- field: token; header sent: Authorization: Bearer <token>
  • Header -- fields: name, value; header sent: a custom header with the specified name and value
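
To show what each mode puts on the wire, here is a std-only sketch that turns an auth setting into a header name and value. `AuthMode` is a hypothetical mirror of AuthConfig, and `auth_header` and `base64_encode` are illustrative helpers rather than library functions. The Basic case follows the standard scheme of Base64-encoding `username:password`.

```rust
/// Hypothetical mirror of the three authentication modes.
pub enum AuthMode {
    Basic { username: String, password: String },
    Bearer { token: String },
    Header { name: String, value: String },
}

/// Standard Base64 with padding, written out to avoid external crates.
fn base64_encode(input: &[u8]) -> String {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // Missing input bytes are represented by '=' padding.
        out.push(if chunk.len() > 1 { ALPHABET[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Returns the (header name, header value) pair for the given mode.
pub fn auth_header(mode: &AuthMode) -> (String, String) {
    match mode {
        AuthMode::Basic { username, password } => {
            let credentials = format!("{}:{}", username, password);
            (
                "Authorization".to_string(),
                format!("Basic {}", base64_encode(credentials.as_bytes())),
            )
        }
        AuthMode::Bearer { token } => {
            ("Authorization".to_string(), format!("Bearer {}", token))
        }
        AuthMode::Header { name, value } => (name.clone(), value.clone()),
    }
}
```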

Document downloads

When download_documents is enabled (the default), the engine downloads non-HTML resources like PDFs, DOCX files, and images instead of skipping them:

// These values match the defaults and are spelled out for clarity.
let config = CrawlConfig {
    download_documents: true,                  // default
    document_max_size: Some(50 * 1024 * 1024), // 50 MB default
    document_mime_types: vec![],               // empty = built-in defaults
    ..Default::default()
};

Downloaded documents are available in the downloaded_document field as a DownloadedDocument with raw bytes, MIME type, filename, size, and a SHA-256 content hash.