# LLM Extraction
Kreuzcrawl can extract structured data from crawled pages using LLM providers. The `LlmExtractor` integrates as a `ContentFilter`, processing each page through an LLM to produce typed JSON output with full cost tracking.
> **Feature flag required**
>
> LLM extraction requires the `ai` feature flag:
>
> ```toml
> kreuzcrawl = { version = "...", features = ["ai"] }
> ```
## LlmExtractor setup

```rust
use kreuzcrawl::{CrawlEngine, CrawlConfig, LlmExtractor};
use serde_json::json;

let schema = json!({
    "type": "object",
    "properties": {
        "title": { "type": "string" },
        "summary": { "type": "string" },
        "topics": {
            "type": "array",
            "items": { "type": "string" }
        }
    },
    "required": ["title", "summary"]
});

let extractor = LlmExtractor::new(
    "your-api-key",        // API key
    "openai/gpt-4o-mini",  // Model identifier
    Some(schema),          // JSON schema (optional)
    Some("Extract the article title, a brief summary, and main topics.".into()), // Instruction (optional)
    None,                  // Custom prompt template (optional)
)?;

let engine = CrawlEngine::builder()
    .config(CrawlConfig {
        max_depth: Some(1),
        max_pages: Some(10),
        ..Default::default()
    })
    .content_filter(extractor)
    .build()?;

let result = engine.crawl("https://example.com/blog").await?;

for page in &result.pages {
    if let Some(ref data) = page.extracted_data {
        println!("{}: {}", page.url, serde_json::to_string_pretty(data)?);
    }
}
```
### Constructor parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `api_key` | `&str` | Yes | API key for the LLM provider. |
| `model` | `&str` | Yes | Model identifier in `provider/model` format. |
| `schema` | `Option<Value>` | No | JSON Schema for structured extraction. When provided, the LLM is constrained to output conforming JSON. |
| `instruction` | `Option<String>` | No | Natural-language instruction describing what to extract. |
| `prompt_template` | `Option<String>` | No | Custom Jinja2 template for the prompt. Overrides the default template when provided. |
## Multi-provider support

The `LlmExtractor` uses liter-llm for provider routing. Model identifiers follow the `provider/model` format:

| Provider | Example model | Notes |
|---|---|---|
| OpenAI | `openai/gpt-4o-mini` | Supports JSON schema response format with strict mode. |
| Anthropic | `anthropic/claude-sonnet-4-20250514` | |
| Google | `google/gemini-2.0-flash` | |
| Mistral | `mistral/mistral-large-latest` | |

The provider is inferred from the model string prefix. The API key must be valid for the specified provider.
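liter-llm's actual routing code is not shown here, but the prefix inference can be sketched as a split on the first slash (a hypothetical standalone helper, not the crate's real API):

```rust
/// Hypothetical sketch of provider inference from a `provider/model`
/// identifier. Splits on the first '/' so model names containing
/// further slashes survive intact.
fn split_model_id(model: &str) -> Option<(&str, &str)> {
    model.split_once('/')
}

fn main() {
    assert_eq!(
        split_model_id("openai/gpt-4o-mini"),
        Some(("openai", "gpt-4o-mini"))
    );
    assert_eq!(
        split_model_id("anthropic/claude-sonnet-4-20250514"),
        Some(("anthropic", "claude-sonnet-4-20250514"))
    );
    // An identifier without a provider prefix cannot be routed.
    assert_eq!(split_model_id("gpt-4o-mini"), None);
}
```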
## JSON schema extraction

When a JSON schema is provided, the extractor enables structured output mode:

```rust
let schema = json!({
    "type": "object",
    "properties": {
        "product_name": { "type": "string" },
        "price": { "type": "number" },
        "currency": { "type": "string" },
        "in_stock": { "type": "boolean" }
    },
    "required": ["product_name", "price"]
});

let extractor = LlmExtractor::new(
    api_key,
    "openai/gpt-4o-mini",
    Some(schema),
    Some("Extract product information from this page.".into()),
    None,
)?;
```
The schema is passed to the LLM as a `response_format` constraint with `strict: true`, ensuring the output conforms to the specified structure. If the LLM response cannot be parsed as JSON, it is wrapped in a `Value::String`.
## Custom prompt templates

The default prompt template is a Jinja2 template that includes the extraction instruction, JSON schema, and page content. You can override it with a custom template:

````rust
let custom_template = r#"You are analyzing a web page.

URL: {{ url }}
{% if title %}Page title: {{ title }}{% endif %}

{% if instruction %}
Task: {{ instruction }}
{% endif %}

{% if schema %}
Output JSON schema:
```json
{{ schema }}
```
{% endif %}

Page content: {{ content }}"#;

let extractor = LlmExtractor::new(
    api_key,
    "openai/gpt-4o-mini",
    Some(schema),
    Some("Extract key data points.".into()),
    Some(custom_template.to_string()),
)?;
````
### Template variables
| Variable | Type | Description |
|---|---|---|
| `content` | `String` | Page content (Markdown if available, otherwise HTML). Truncated to 100,000 characters. |
| `schema` | `Option<String>` | Pretty-printed JSON schema, if provided. |
| `instruction` | `Option<&str>` | The extraction instruction, if provided. |
| `url` | `&str` | The page URL. |
| `title` | `Option<&str>` | The page title from metadata, if available. |
## Cost tracking with ExtractionMeta
Every page processed by the `LlmExtractor` includes cost and usage metadata:
```rust
for page in &result.pages {
    if let Some(ref meta) = page.extraction_meta {
        println!("Model: {:?}", meta.model);
        println!("Cost: ${:.6}", meta.cost.unwrap_or(0.0));
        println!("Prompt tokens: {:?}", meta.prompt_tokens);
        println!("Completion tokens: {:?}", meta.completion_tokens);
        println!("Chunks processed: {}", meta.chunks_processed);
    }
}
```
### ExtractionMeta fields

| Field | Type | Description |
|---|---|---|
| `cost` | `Option<f64>` | Estimated cost of the LLM call in USD. |
| `prompt_tokens` | `Option<u64>` | Number of input tokens consumed. |
| `completion_tokens` | `Option<u64>` | Number of output tokens generated. |
| `model` | `Option<String>` | The model identifier used for extraction. |
| `chunks_processed` | `usize` | Number of content chunks sent to the LLM (currently always 1). |
### Aggregating costs across a crawl

```rust
let total_cost: f64 = result.pages.iter()
    .filter_map(|p| p.extraction_meta.as_ref())
    .filter_map(|m| m.cost)
    .sum();

let total_tokens: u64 = result.pages.iter()
    .filter_map(|p| p.extraction_meta.as_ref())
    .filter_map(|m| m.prompt_tokens.zip(m.completion_tokens).map(|(p, c)| p + c))
    .sum();

println!("Total cost: ${:.4}", total_cost);
println!("Total tokens: {}", total_tokens);
```
## Content handling

The extractor uses the best available content representation:

- **Markdown** (preferred) -- if the page has a `MarkdownResult`, its `content` field is used
- **HTML** (fallback) -- if Markdown is not available, the raw HTML is used

Content is truncated to 100,000 characters to avoid exceeding LLM context windows. Truncation respects UTF-8 character boundaries.
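Boundary-safe truncation can be sketched with `str::is_char_boundary` (a hypothetical helper; the crate's exact implementation may differ):

```rust
/// Sketch of UTF-8-safe truncation: back off from the byte limit until
/// the cut lands on a character boundary, so no multi-byte character
/// is split in half.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // 'é' occupies bytes 3..5; cutting at byte 4 would split it,
    // so the cut backs off to byte 3.
    assert_eq!(truncate_utf8("caféine", 4), "caf");
    // Byte 5 is a valid boundary, so 'é' survives intact.
    assert_eq!(truncate_utf8("caféine", 5), "café");
    // Short strings pass through unchanged.
    assert_eq!(truncate_utf8("abc", 100), "abc");
}
```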
## Integration as ContentFilter

The `LlmExtractor` implements the `ContentFilter` trait, which means it runs during the crawl pipeline after page extraction. This has several implications:

- Every crawled page passes through the LLM (subject to the filter returning `Some`)
- The extractor never drops pages -- it always returns `Some(page)` with the `extracted_data` and `extraction_meta` fields populated
- Pages filtered out by other means (path exclusion, robots.txt) never reach the extractor
- LLM calls are made concurrently, bounded by `max_concurrent`
> **Cost awareness**
>
> Each page in a crawl triggers an LLM API call. A crawl with `max_pages: Some(100)` using GPT-4o-mini will make 100 API calls. Monitor costs using the `ExtractionMeta` fields and set appropriate `max_pages` limits.
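A rough pre-crawl budget can be computed from the expected page count and token usage. The per-token prices below are illustrative placeholders, not current provider rates:

```rust
/// Back-of-the-envelope cost estimate for a crawl.
/// All inputs are assumptions you supply; the rates here are
/// hypothetical, not actual provider pricing.
fn estimate_crawl_cost(
    pages: u64,
    avg_prompt_tokens: u64,
    avg_completion_tokens: u64,
    usd_per_prompt_token: f64,
    usd_per_completion_token: f64,
) -> f64 {
    pages as f64
        * (avg_prompt_tokens as f64 * usd_per_prompt_token
            + avg_completion_tokens as f64 * usd_per_completion_token)
}

fn main() {
    // 100 pages, ~8k prompt tokens and ~300 completion tokens each,
    // at placeholder rates of $0.15 / $0.60 per million tokens.
    let cost = estimate_crawl_cost(100, 8_000, 300, 0.15e-6, 0.60e-6);
    println!("estimated cost: ${:.4}", cost);
    assert!((cost - 0.138).abs() < 1e-9);
}
```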
## Error handling

LLM errors (network failures, rate limits, invalid API keys) are propagated as `CrawlError::Other`. The error message includes context from the LLM client:

```rust
match engine.crawl("https://example.com").await {
    Ok(result) => { /* process pages */ }
    Err(e) => eprintln!("Crawl failed: {}", e),
}
```

Template rendering errors (invalid Jinja2 syntax in custom templates) also produce `CrawlError::Other` with a descriptive message.