Scraping¶
Scraping fetches a single URL and runs the full extraction pipeline: metadata, links, images, feeds, JSON-LD, robots.txt compliance, markdown conversion, and optionally LLM-powered structured extraction.
Single-page scrape¶
use kreuzcrawl::{CrawlEngine, CrawlConfig};
let engine = CrawlEngine::builder()
.config(CrawlConfig::default())
.build()?;
let result = engine.scrape("https://example.com").await?;
println!("Status: {}", result.status_code);
println!("Title: {:?}", result.metadata.title);
println!("Links: {}", result.links.len());
println!("Body size: {} bytes", result.body_size);
The scrape request routes through the engine's Tower service stack, which applies per-domain rate limiting, HTTP response caching, and user-agent rotation before the actual HTTP fetch.
ScrapeResult fields¶
The ScrapeResult struct contains everything extracted from a single page:
Core response data¶
| Field | Type | Description |
|---|---|---|
status_code |
u16 |
HTTP response status code. |
content_type |
String |
The Content-Type header value. |
html |
String |
The response body (possibly truncated by max_body_size or filtered by main_content_only). |
body_size |
usize |
Size of the response body in bytes. |
detected_charset |
Option<String> |
Character encoding detected from Content-Type header or HTML meta tags. |
is_pdf |
bool |
Whether the content was detected as PDF. |
was_skipped |
bool |
Whether extraction was skipped (binary or PDF content). |
Robots and directives¶
| Field | Type | Description |
|---|---|---|
is_allowed |
bool |
Whether the URL is permitted by robots.txt (always true when respect_robots_txt is false). |
crawl_delay |
Option<u64> |
The Crawl-delay value from robots.txt, in seconds. |
noindex_detected |
bool |
Whether a noindex directive was found in meta robots or X-Robots-Tag. |
nofollow_detected |
bool |
Whether a nofollow directive was found in meta robots or X-Robots-Tag. |
x_robots_tag |
Option<String> |
The raw X-Robots-Tag header value, if present. |
Extracted content¶
| Field | Type | Description |
|---|---|---|
metadata |
PageMetadata |
Rich metadata from meta tags, OG, Twitter, Dublin Core, and more. |
links |
Vec<LinkInfo> |
All links found on the page, classified by type. |
images |
Vec<ImageInfo> |
All images found, including OG and Twitter images. |
feeds |
Vec<FeedInfo> |
RSS, Atom, and JSON Feed links. |
json_ld |
Vec<JsonLdEntry> |
JSON-LD structured data entries. |
markdown |
Option<MarkdownResult> |
Markdown conversion with document structure, tables, citations, and fit content. |
Engine state¶
| Field | Type | Description |
|---|---|---|
main_content_only |
bool |
Whether main content extraction was active. |
auth_header_sent |
bool |
Whether an authentication header was sent. |
response_meta |
Option<ResponseMeta> |
HTTP headers: ETag, Last-Modified, Cache-Control, Server, etc. |
assets |
Vec<DownloadedAsset> |
Downloaded page assets (when download_assets is enabled). |
js_render_hint |
bool |
Whether the page content suggests JavaScript rendering is needed. |
browser_used |
bool |
Whether the headless browser fallback was used. |
screenshot |
Option<Vec<u8>> |
PNG screenshot bytes (when browser is used and capture_screenshot is enabled). |
downloaded_document |
Option<DownloadedDocument> |
Non-HTML document data (PDF, DOCX, etc.) when download_documents is enabled. |
LLM extraction¶
| Field | Type | Description |
|---|---|---|
extracted_data |
Option<Value> |
Structured JSON extracted by LLM (populated when using LlmExtractor). |
extraction_meta |
Option<ExtractionMeta> |
LLM cost tracking: estimated cost in USD, prompt/completion tokens, model name. |
Metadata extraction¶
The PageMetadata struct extracts 40+ fields from HTML meta tags:
Standard meta¶
title-- from<title>elementdescription-- from<meta name="description">canonical_url-- from<link rel="canonical">keywords,author,viewport,theme_color,generator,robotshtml_lang,html_dir-- from the<html>element'slanganddirattributes
Open Graph¶
og_title,og_type,og_image,og_description,og_urlog_site_name,og_locale,og_video,og_audio,og_locale_alternates
Twitter Card¶
twitter_card,twitter_title,twitter_description,twitter_imagetwitter_site,twitter_creator
Dublin Core¶
dc_title,dc_creator,dc_subject,dc_description,dc_publisherdc_date,dc_type,dc_format,dc_identifier,dc_language,dc_rights
Structured data¶
article--ArticleMetadatafromarticle:*OG tags (published_time, modified_time, author, section, tags)hreflangs-- alternate language linksfavicons-- icon links with sizes and MIME typesheadings-- all h1-h6 elements with level and textword_count-- computed word count of the page body text
Link extraction¶
Each LinkInfo includes:
| Field | Type | Description |
|---|---|---|
url |
String |
The resolved absolute URL. |
text |
String |
The visible link text. |
link_type |
LinkType |
Classification: Internal, External, Anchor, or Document. |
rel |
Option<String> |
The rel attribute value. |
nofollow |
bool |
Whether the link has rel="nofollow". |
Image extraction¶
Each ImageInfo includes:
| Field | Type | Description |
|---|---|---|
url |
String |
The image URL. |
alt |
Option<String> |
Alt text. |
width |
Option<u32> |
Width attribute. |
height |
Option<u32> |
Height attribute. |
source |
ImageSource |
Where the image was found: Img, PictureSource, OgImage, or TwitterImage. |
Feed extraction¶
Discovered RSS, Atom, and JSON Feed links:
| Field | Type | Description |
|---|---|---|
url |
String |
The feed URL. |
title |
Option<String> |
The feed title from the link element. |
feed_type |
FeedType |
Rss, Atom, or JsonFeed. |
JSON-LD extraction¶
Each JsonLdEntry contains:
| Field | Type | Description |
|---|---|---|
schema_type |
String |
The @type value (e.g., "Article", "Product"). |
name |
Option<String> |
The name field, if present. |
raw |
String |
The raw JSON-LD string for full access. |
Robots.txt compliance¶
When respect_robots_txt is set to true, the engine fetches and parses robots.txt before scraping:
CrawlConfig {
respect_robots_txt: true,
user_agent: Some("MyBot/1.0".to_string()),
..Default::default()
}
The scrape result includes is_allowed, crawl_delay, and any noindex/nofollow directives detected from both meta tags and X-Robots-Tag headers.
Note
When respect_robots_txt is false (the default), is_allowed is always true and robots.txt is not fetched.
Main content only mode¶
Extract only the primary content, stripping navigation, sidebars, and boilerplate:
When enabled, the engine runs main content extraction on the HTML before the metadata/link extraction pipeline. The html field in the result contains only the extracted main content.
Removing specific tags¶
Strip specific elements by CSS selector before processing:
CrawlConfig {
remove_tags: vec![
"nav".to_string(),
".sidebar".to_string(),
"#cookie-banner".to_string(),
],
..Default::default()
}
Tag removal runs before main content extraction and before the metadata pipeline.
Response metadata¶
The ResponseMeta struct captures HTTP response headers:
| Field | Type | Description |
|---|---|---|
etag |
Option<String> |
ETag header for cache validation. |
last_modified |
Option<String> |
Last-Modified header. |
cache_control |
Option<String> |
Cache-Control directives. |
server |
Option<String> |
Server software identifier. |
x_powered_by |
Option<String> |
X-Powered-By header. |
content_language |
Option<String> |
Content-Language header. |
content_encoding |
Option<String> |
Content-Encoding header. |
Authentication¶
Scrape pages behind authentication:
use kreuzcrawl::AuthConfig;
CrawlConfig {
auth: Some(AuthConfig::Bearer {
token: "your-token".to_string(),
}),
..Default::default()
}
Three authentication modes are supported:
| Mode | Fields | Header sent |
|---|---|---|
Basic |
username, password |
Authorization: Basic <base64> |
Bearer |
token |
Authorization: Bearer <token> |
Header |
name, value |
Custom header with the specified name and value |
Document downloads¶
When download_documents is enabled (the default), the engine downloads non-HTML resources like PDFs, DOCX files, and images instead of skipping them:
CrawlConfig {
download_documents: true, // default
document_max_size: Some(50 * 1024 * 1024), // 50 MB default
document_mime_types: vec![], // empty = built-in defaults
..Default::default()
}
Downloaded documents are available in the downloaded_document field as a DownloadedDocument with raw bytes, MIME type, filename, size, and a SHA-256 content hash.