Markdown Conversion
Kreuzcrawl converts every HTML page to Markdown automatically. The conversion produces a rich MarkdownResult that includes the Markdown text, a document structure tree, extracted tables, numbered citations, and LLM-optimized fit content.
Always-on conversion
Markdown conversion runs on every scrape and crawl operation. There is no configuration flag to enable it: the markdown field on ScrapeResult and CrawlPageResult is always populated for HTML content.
let result = engine.scrape("https://example.com").await?;
if let Some(ref md) = result.markdown {
    println!("{}", md.content);
}
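The conversion runs in a blocking task (tokio::task::spawn_blocking) so that synchronous HTML parsing does not stall the async runtime. The underlying HTML parser is not Send, so it cannot be held across .await points in a task that may move between threads.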
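The offloading pattern can be pictured with a small sketch. It is not Kreuzcrawl's implementation: it uses std::thread::spawn as a stand-in for tokio::task::spawn_blocking so that it runs without external crates, and FakeParser and convert_off_thread are hypothetical names. The point it illustrates is that a non-Send value can be created and used entirely inside the offloaded closure, so only owned Strings cross the thread boundary.

```rust
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for a parser that is not `Send` (it holds an `Rc`).
struct FakeParser {
    html: Rc<String>,
}

impl FakeParser {
    fn new(html: String) -> Self {
        FakeParser { html: Rc::new(html) }
    }

    /// Toy "conversion" that turns the trimmed input into a heading.
    fn to_markdown(&self) -> String {
        format!("# {}", self.html.trim())
    }
}

/// Runs the conversion on another thread. Only the owned `String` input
/// and the `String` result cross the thread boundary.
fn convert_off_thread(html: String) -> String {
    thread::spawn(move || {
        // The non-Send parser is created, used, and dropped inside the closure.
        let parser = FakeParser::new(html);
        parser.to_markdown()
    })
    .join()
    .expect("conversion thread panicked")
}

fn main() {
    println!("{}", convert_off_thread(" Hello ".to_string()));
}
```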
MarkdownResult structure
| Field | Type | Description |
|---|---|---|
| `content` | `String` | The converted Markdown text. |
| `document_structure` | `Option<Value>` | JSON representation of the document's semantic structure tree. |
| `tables` | `Vec<Value>` | Extracted tables as structured JSON with cell data. |
| `warnings` | `Vec<String>` | Non-fatal processing warnings from the conversion. |
| `citations` | `Option<CitationResult>` | Content with inline links replaced by numbered citations. |
| `fit_content` | `Option<String>` | Pruned Markdown optimized for LLM consumption. |
All fields are populated in a single pass during conversion.
Document structure tree
The document_structure field contains a JSON tree representing the semantic structure of the HTML document. This is generated by the html-to-markdown-rs library with include_document_structure: true.
if let Some(ref md) = result.markdown {
    if let Some(ref structure) = md.document_structure {
        println!("{}", serde_json::to_string_pretty(structure)?);
    }
}
The structure captures headings, sections, and their nesting relationships, which is useful for building tables of contents or understanding document hierarchy.
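As a rough illustration of the table-of-contents use case, the sketch below walks a nested section tree and emits one indented bullet per section. The Section type and toc_lines function are hypothetical: the actual JSON schema of document_structure is not described here and may look different, so a real implementation would first map the JSON value onto a shape like this.

```rust
/// Hypothetical node shape for illustration only; the real
/// `document_structure` JSON schema may differ.
struct Section {
    title: String,
    children: Vec<Section>,
}

/// Walks the tree depth-first and emits one indented bullet per section.
fn toc_lines(nodes: &[Section], depth: usize, out: &mut Vec<String>) {
    for node in nodes {
        out.push(format!("{}- {}", "  ".repeat(depth), node.title));
        toc_lines(&node.children, depth + 1, out);
    }
}

fn main() {
    let tree = vec![Section {
        title: "Markdown Conversion".to_string(),
        children: vec![
            Section { title: "Citations".to_string(), children: vec![] },
            Section { title: "Warnings".to_string(), children: vec![] },
        ],
    }];
    let mut lines = Vec::new();
    toc_lines(&tree, 0, &mut lines);
    println!("{}", lines.join("\n"));
}
```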
Table extraction
Tables found in the HTML are extracted as structured JSON and available in the tables field:
if let Some(ref md) = result.markdown {
    for (i, table) in md.tables.iter().enumerate() {
        println!("Table {}: {}", i, serde_json::to_string_pretty(table)?);
    }
}
Each table preserves its cell data in a structured format beyond what appears in the Markdown text, making it accessible for programmatic use.
Citations
The citations field transforms inline Markdown links into numbered references, producing a format optimized for LLM consumption where URLs are separated from the flowing text.
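Conversion example
Input Markdown with inline links such as `[Rust](https://www.rust-lang.org)` and `[Tokio](https://tokio.rs)` becomes text with numbered markers, `Rust[1]` and `Tokio[2]`, with a reference list:
| Index | URL | Text |
|---|---|---|
| 1 | https://www.rust-lang.org | Rust |
| 2 | https://tokio.rs | Tokio |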
CitationResult structure
| Field | Type | Description |
|---|---|---|
| `content` | `String` | Markdown with links replaced by numbered citations (e.g., `text[1]`). |
| `references` | `Vec<CitationReference>` | The numbered reference list. |
CitationReference fields
| Field | Type | Description |
|---|---|---|
| `index` | `usize` | The 1-based citation number. |
| `url` | `String` | The link URL. |
| `text` | `String` | The original link text. |
Citation behavior
- Duplicate URLs share the same citation number. If two links point to the same URL, they both reference the same index.
- Images (`![alt](url)`) are preserved unchanged and not converted to citations.
- Parentheses in URLs are handled correctly, including Wikipedia-style URLs like https://en.wikipedia.org/wiki/Rust_(programming_language).
if let Some(ref md) = result.markdown {
    if let Some(ref citations) = md.citations {
        println!("{}", citations.content);
        println!("\nReferences:");
        for r in &citations.references {
            println!("[{}]: {} ({})", r.index, r.url, r.text);
        }
    }
}
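To make the citation behavior concrete, here is a minimal standalone sketch of such a transformation. It is not the library's code: Reference, find_url_end, and to_citations are hypothetical names, and the scanner is deliberately simple (for instance, it does not handle nested brackets in link text). It does show the three documented behaviors: duplicate URLs reuse one index, images are copied through untouched, and parentheses inside URLs are balanced before the link is closed.

```rust
#[derive(Debug, PartialEq)]
struct Reference {
    index: usize, // 1-based citation number
    url: String,
    text: String,
}

/// Returns the position of the `)` that closes a URL starting at `start`,
/// counting nested parentheses so Wikipedia-style URLs stay intact.
fn find_url_end(chars: &[char], start: usize) -> Option<usize> {
    let mut depth = 0usize;
    for (offset, &c) in chars[start..].iter().enumerate() {
        match c {
            '(' => depth += 1,
            ')' if depth == 0 => return Some(start + offset),
            ')' => depth -= 1,
            _ => {}
        }
    }
    None
}

/// Replaces `[text](url)` with `text[n]` and collects the references.
fn to_citations(markdown: &str) -> (String, Vec<Reference>) {
    let chars: Vec<char> = markdown.chars().collect();
    let mut out = String::new();
    let mut refs: Vec<Reference> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '[' {
            let is_image = i > 0 && chars[i - 1] == '!';
            if let Some(rel) = chars[i + 1..].iter().position(|&c| c == ']') {
                let close = i + 1 + rel;
                if chars.get(close + 1) == Some(&'(') {
                    if let Some(end) = find_url_end(&chars, close + 2) {
                        if is_image {
                            // Images are copied verbatim (the `!` is already emitted).
                            out.extend(chars[i..=end].iter());
                        } else {
                            let text: String = chars[i + 1..close].iter().collect();
                            let url: String = chars[close + 2..end].iter().collect();
                            // Reuse the index if this URL was seen before.
                            let existing = refs.iter().find(|r| r.url == url).map(|r| r.index);
                            let index = match existing {
                                Some(idx) => idx,
                                None => {
                                    let idx = refs.len() + 1;
                                    refs.push(Reference { index: idx, url, text: text.clone() });
                                    idx
                                }
                            };
                            out.push_str(&format!("{}[{}]", text, index));
                        }
                        i = end + 1;
                        continue;
                    }
                }
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    (out, refs)
}

fn main() {
    let (content, refs) =
        to_citations("Built with [Rust](https://www.rust-lang.org) and [Tokio](https://tokio.rs).");
    println!("{}", content);
    for r in &refs {
        println!("[{}]: {} ({})", r.index, r.url, r.text);
    }
}
```

In this sketch a repeated URL keeps the link text of its first occurrence in the reference list; whether Kreuzcrawl does the same is not stated in this page.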
Fit content (LLM-optimized pruning)
The fit_content field contains a pruned version of the Markdown, optimized for LLM context windows by removing low-value content. This is generated by heuristic-based pruning rules.
if let Some(ref md) = result.markdown {
    if let Some(ref fit) = md.fit_content {
        // Use this for LLM prompts instead of the full content
        println!("Fit content ({} chars):", fit.len());
        println!("{}", fit);
    }
}
Pruning rules
The pruning algorithm applies these heuristics line by line:
| Rule | Description |
|---|---|
| Navigation link removal | Lines where more than 70% of characters are part of Markdown links (and the line is longer than 20 characters) are removed. |
| Short line removal | Non-heading lines shorter than 5 characters are removed (catches breadcrumbs, separators). |
| Boilerplate detection | Lines containing common boilerplate phrases are removed. |
| Code block preservation | Content inside fenced code blocks (``` or ~~~) is always preserved, even if individual lines are short. |
| Heading preservation | Lines starting with # are always kept regardless of length. |
| Paragraph spacing | Consecutive empty lines are collapsed to a single blank line. |
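The rules in the table can be approximated with a short line-by-line filter. The sketch below is an assumption-laden reading of those rules, not the actual algorithm: link_density and prune are hypothetical names, line length is measured in characters after trimming, the order in which rules are checked is a guess, and boilerplate phrase detection is left out to keep the example short.

```rust
/// Fraction of a line's characters that sit inside `[text](url)` spans.
fn link_density(line: &str) -> f64 {
    let chars: Vec<char> = line.chars().collect();
    if chars.is_empty() {
        return 0.0;
    }
    let mut linked = 0usize;
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '[' {
            // Find the "](" that separates link text from the URL.
            let last = chars.len().saturating_sub(1);
            if let Some(mid) = (i + 1..last).find(|&j| chars[j] == ']' && chars[j + 1] == '(') {
                if let Some(end) = (mid + 2..chars.len()).find(|&j| chars[j] == ')') {
                    linked += end - i + 1;
                    i = end + 1;
                    continue;
                }
            }
        }
        i += 1;
    }
    linked as f64 / chars.len() as f64
}

/// Applies the structural pruning rules line by line.
fn prune(markdown: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut in_code = false;
    for line in markdown.lines() {
        let trimmed = line.trim();
        // Fence lines toggle code-block state and are always kept.
        if trimmed.starts_with("```") || trimmed.starts_with("~~~") {
            in_code = !in_code;
            kept.push(line);
            continue;
        }
        if in_code {
            kept.push(line); // code is preserved even if the line is short
            continue;
        }
        if trimmed.is_empty() {
            // Collapse runs of blank lines into a single blank line.
            if kept.last().map_or(false, |l| !l.trim().is_empty()) {
                kept.push("");
            }
            continue;
        }
        if trimmed.starts_with('#') {
            kept.push(line); // headings are kept regardless of length
            continue;
        }
        let len = trimmed.chars().count();
        if len < 5 {
            continue; // short non-heading line
        }
        if len > 20 && link_density(trimmed) > 0.7 {
            continue; // navigation-style line made mostly of links
        }
        kept.push(line);
    }
    kept.join("\n")
}

fn main() {
    let page = "# Title\n[Home](/home) [About](/about) [Contact](/contact)\n>>\nReal paragraph text.";
    println!("{}", prune(page));
}
```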
Detected boilerplate phrases
The following phrases trigger line removal (case-insensitive):
- "cookie policy", "cookie consent", "use cookies", "uses cookies"
- "privacy policy", "terms of service"
- "all rights reserved", "copyright"
- "subscribe to", "sign up for"
- "follow us", "share this"
- "powered by", "back to top"
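A case-insensitive check against this phrase list might look like the sketch below. BOILERPLATE_PHRASES and is_boilerplate are hypothetical names, and the sketch assumes simple substring matching on the lowercased line; how the real check interacts with heading and code-block preservation is not covered here.

```rust
/// Phrase list copied from the documentation above, stored in lowercase.
const BOILERPLATE_PHRASES: &[&str] = &[
    "cookie policy",
    "cookie consent",
    "use cookies",
    "uses cookies",
    "privacy policy",
    "terms of service",
    "all rights reserved",
    "copyright",
    "subscribe to",
    "sign up for",
    "follow us",
    "share this",
    "powered by",
    "back to top",
];

/// Returns true if the line contains any boilerplate phrase, ignoring case.
fn is_boilerplate(line: &str) -> bool {
    let lower = line.to_lowercase();
    BOILERPLATE_PHRASES.iter().any(|phrase| lower.contains(*phrase))
}

fn main() {
    for line in ["All Rights Reserved.", "Tokio is an async runtime."] {
        println!("{:?} -> boilerplate: {}", line, is_boilerplate(line));
    }
}
```

Substring matching is blunt: a sentence that merely mentions "copyright" in passing would also be dropped, which is worth keeping in mind when fit_content seems to be missing a line.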
When to use fit content
| Use case | Recommended field |
|---|---|
| Full-fidelity conversion | content |
| LLM prompts and summarization | fit_content |
| RAG indexing | fit_content or citations.content |
| Link analysis | content (preserves inline links) |
| Citation-style references | citations.content + citations.references |
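Because fit_content is an Option, code that builds LLM prompts usually wants a fallback to the full text. The sketch below shows one way to express that choice. MarkdownView is a hypothetical stand-in that mirrors only two fields of MarkdownResult so the example compiles on its own, and llm_text is an illustrative helper, not part of the library.

```rust
/// Hypothetical stand-in mirroring two fields of `MarkdownResult`.
struct MarkdownView {
    content: String,
    fit_content: Option<String>,
}

/// Prefers the pruned text for LLM prompts, falling back to the full content.
fn llm_text(md: &MarkdownView) -> &str {
    md.fit_content.as_deref().unwrap_or(md.content.as_str())
}

fn main() {
    let md = MarkdownView {
        content: "Full page text with navigation and footer".to_string(),
        fit_content: Some("Main article text".to_string()),
    };
    println!("{}", llm_text(&md));
}
```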
Warnings
The warnings field captures non-fatal issues encountered during HTML-to-Markdown conversion:
if let Some(ref md) = result.markdown {
    for warning in &md.warnings {
        eprintln!("Conversion warning: {}", warning);
    }
}
These are informational and do not indicate conversion failure. Common warnings include malformed HTML elements or unsupported constructs.