Markdown Conversion

Kreuzcrawl converts every HTML page to Markdown automatically. The conversion produces a rich MarkdownResult that includes the Markdown text, a document structure tree, extracted tables, numbered citations, and LLM-optimized fit content.

Always-on conversion

Markdown conversion runs on every scrape and crawl operation. There is no configuration flag to enable it: the markdown field on ScrapeResult and CrawlPageResult is always populated for HTML content.

let result = engine.scrape("https://example.com").await?;

if let Some(ref md) = result.markdown {
    println!("{}", md.content);
}

The conversion runs in a blocking task (tokio::task::spawn_blocking) so that it does not stall the async runtime. Because the underlying HTML parser is not Send, it is created and used entirely within that task.

MarkdownResult structure

Field	Type	Description
`content`	`String`	The converted Markdown text.
`document_structure`	`Option<Value>`	JSON representation of the document's semantic structure tree.
`tables`	`Vec<Value>`	Extracted tables as structured JSON with cell data.
`warnings`	`Vec<String>`	Non-fatal processing warnings from the conversion.
`citations`	`Option<CitationResult>`	Content with inline links replaced by numbered citations.
`fit_content`	`Option<String>`	Pruned Markdown optimized for LLM consumption.

All fields are populated in a single pass during conversion.

Document structure tree

The document_structure field contains a JSON tree representing the semantic structure of the HTML document. This is generated by the html-to-markdown-rs library with include_document_structure: true.

if let Some(ref md) = result.markdown {
    if let Some(ref structure) = md.document_structure {
        println!("{}", serde_json::to_string_pretty(structure)?);
    }
}

The structure captures headings, sections, and their nesting relationships, which is useful for building tables of contents or understanding document hierarchy.
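As a rough illustration of the table-of-contents use case, the sketch below derives a heading outline. The JSON schema of document_structure is not described on this page, so the sketch works from the heading lines of the Markdown content instead; the Heading type and the headings and render_toc functions are hypothetical names for this example, not part of Kreuzcrawl.

```rust
/// A heading found in Markdown text (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct Heading {
    pub level: usize,
    pub title: String,
}

/// Collects ATX headings (`#` through `######`), skipping fenced code blocks.
pub fn headings(markdown: &str) -> Vec<Heading> {
    let mut in_code = false;
    let mut found = Vec::new();
    for line in markdown.lines() {
        let trimmed = line.trim_start();
        if trimmed.starts_with("```") || trimmed.starts_with("~~~") {
            in_code = !in_code;
            continue;
        }
        if in_code {
            continue;
        }
        let level = trimmed.chars().take_while(|&c| c == '#').count();
        // A heading needs 1 to 6 '#' characters followed by a space.
        if (1..=6).contains(&level) && trimmed[level..].starts_with(' ') {
            found.push(Heading {
                level,
                title: trimmed[level..].trim().to_string(),
            });
        }
    }
    found
}

/// Renders the headings as an indented list, two spaces per nesting level.
pub fn render_toc(headings: &[Heading]) -> String {
    headings
        .iter()
        .map(|h| format!("{}- {}", "  ".repeat(h.level - 1), h.title))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Passing md.content to headings and the result to render_toc gives a nested outline. Code that reads document_structure directly would replace the line scan with a walk over the JSON tree.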

Table extraction

Tables found in the HTML are extracted as structured JSON and available in the tables field:

if let Some(ref md) = result.markdown {
    for (i, table) in md.tables.iter().enumerate() {
        println!("Table {}: {}", i, serde_json::to_string_pretty(table)?);
    }
}

Each table keeps its cell data in structured form, independent of how the table is rendered in the Markdown text, so it can be processed programmatically.

Citations

The citations field holds a version of the content in which inline Markdown links are replaced by numbered references. Separating URLs from the flowing text yields a format better suited to LLM consumption.

Conversion example

Input Markdown:

Visit [Rust](https://www.rust-lang.org) and [Tokio](https://tokio.rs) for more.

Becomes:

Visit Rust[1] and Tokio[2] for more.

With a reference list:

Index	URL	Text
1	`https://www.rust-lang.org`	Rust
2	`https://tokio.rs`	Tokio

CitationResult structure

Field	Type	Description
`content`	`String`	Markdown with links replaced by numbered citations (e.g., `text[1]`).
`references`	`Vec<CitationReference>`	The numbered reference list.

CitationReference fields

Field	Type	Description
`index`	`usize`	The 1-based citation number.
`url`	`String`	The link URL.
`text`	`String`	The original link text.

Citation behavior

Duplicate URLs share the same citation number. If two links point to the same URL, they both reference the same index.
Images (![alt](url)) are preserved unchanged and not converted to citations.
Parentheses in URLs are handled correctly, including Wikipedia-style URLs like https://en.wikipedia.org/wiki/Rust_(programming_language).

if let Some(ref md) = result.markdown {
    if let Some(ref citations) = md.citations {
        println!("{}", citations.content);
        println!("\nReferences:");
        for r in &citations.references {
            println!("[{}]: {} ({})", r.index, r.url, r.text);
        }
    }
}
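To make the citation behavior concrete, here is a minimal sketch of how such a conversion could work. It is not Kreuzcrawl's implementation: the to_citations and find_closing_paren functions are hypothetical, the CitationReference struct only mirrors the fields documented above, and keeping the first link's text for a duplicated URL is an assumption.

```rust
/// Mirrors the documented CitationReference fields (stand-in for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct CitationReference {
    pub index: usize,
    pub url: String,
    pub text: String,
}

/// Returns the index of the `)` that closes a link destination beginning at
/// `start`, counting nested parentheses so Wikipedia-style URLs stay intact.
fn find_closing_paren(chars: &[char], start: usize) -> Option<usize> {
    let mut depth = 0usize;
    for (offset, &c) in chars[start..].iter().enumerate() {
        match c {
            '(' => depth += 1,
            ')' if depth == 0 => return Some(start + offset),
            ')' => depth -= 1,
            _ => {}
        }
    }
    None
}

/// Replaces `[text](url)` with `text[n]` and collects the reference list.
pub fn to_citations(markdown: &str) -> (String, Vec<CitationReference>) {
    let chars: Vec<char> = markdown.chars().collect();
    let mut out = String::new();
    let mut refs: Vec<CitationReference> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Images (`![alt](url)`) are copied through unchanged.
        let is_image = i > 0 && chars[i - 1] == '!';
        if chars[i] == '[' && !is_image {
            if let Some(rel) = chars[i + 1..].iter().position(|&c| c == ']') {
                let close = i + 1 + rel;
                if chars.get(close + 1) == Some(&'(') {
                    if let Some(end) = find_closing_paren(&chars, close + 2) {
                        let text: String = chars[i + 1..close].iter().collect();
                        let url: String = chars[close + 2..end].iter().collect();
                        // Duplicate URLs reuse the number assigned earlier.
                        let existing = refs.iter().find(|r| r.url == url).map(|r| r.index);
                        let index = match existing {
                            Some(n) => n,
                            None => {
                                let n = refs.len() + 1;
                                refs.push(CitationReference { index: n, url, text: text.clone() });
                                n
                            }
                        };
                        out.push_str(&text);
                        out.push_str(&format!("[{}]", index));
                        i = end + 1;
                        continue;
                    }
                }
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    (out, refs)
}
```

Applied to the input from the conversion example, to_citations returns the text "Visit Rust[1] and Tokio[2] for more." together with two references. The sketch ignores edge cases such as links nested inside images or reference-style links.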

Fit content (LLM-optimized pruning)

The fit_content field contains a pruned version of the Markdown, optimized for LLM context windows by removing low-value content. This is generated by heuristic-based pruning rules.

if let Some(ref md) = result.markdown {
    if let Some(ref fit) = md.fit_content {
        // Use this for LLM prompts instead of the full content
        println!("Fit content ({} chars):", fit.len());
        println!("{}", fit);
    }
}

Pruning rules

The pruning algorithm applies these heuristics line by line:

Rule	Description
Navigation link removal	Lines where more than 70% of characters are part of Markdown links (and the line is longer than 20 characters) are removed.
Short line removal	Non-heading lines shorter than 5 characters are removed (catches breadcrumbs, separators).
Boilerplate detection	Lines containing common boilerplate phrases are removed.
Code block preservation	Content inside fenced code blocks (``` or ~~~) is always preserved, even if individual lines are short.
Heading preservation	Lines starting with `#` are always kept regardless of length.
Paragraph spacing	Consecutive empty lines are collapsed to a single blank line.
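The rules in this table can be sketched as a single line-by-line pass. The code below is an illustration under stated assumptions, not the library's pruning code: the prune and link_char_ratio functions are hypothetical, the order in which rules are checked is a guess, lengths are measured on trimmed lines, and the phrase list is copied from the boilerplate phrases documented on this page.

```rust
/// Boilerplate phrases, matched case-insensitively.
const BOILERPLATE: [&str; 12] = [
    "cookie policy", "cookie consent", "use cookies", "uses cookies",
    "privacy policy", "terms of service", "subscribe to", "sign up for",
    "follow us", "share this", "powered by", "back to top",
];

/// Fraction of the characters in `line` that belong to `[text](url)` links.
fn link_char_ratio(line: &str) -> f64 {
    let chars: Vec<char> = line.chars().collect();
    if chars.is_empty() {
        return 0.0;
    }
    let mut linked = 0usize;
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '[' {
            if let Some(mid) = chars[i..].iter().position(|&c| c == ']').map(|p| i + p) {
                if chars.get(mid + 1) == Some(&'(') {
                    if let Some(end) = chars[mid..].iter().position(|&c| c == ')').map(|p| mid + p) {
                        linked += end - i + 1;
                        i = end + 1;
                        continue;
                    }
                }
            }
        }
        i += 1;
    }
    linked as f64 / chars.len() as f64
}

/// Applies the pruning heuristics to Markdown text.
pub fn prune(markdown: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut in_code = false;
    for line in markdown.lines() {
        let trimmed = line.trim();
        // Fence lines toggle code mode; code content is always kept.
        if trimmed.starts_with("```") || trimmed.starts_with("~~~") {
            in_code = !in_code;
            kept.push(line);
            continue;
        }
        if in_code {
            kept.push(line);
            continue;
        }
        if trimmed.is_empty() {
            // Collapse runs of blank lines into one.
            if kept.last().map_or(false, |l| !l.trim().is_empty()) {
                kept.push("");
            }
            continue;
        }
        if trimmed.starts_with('#') {
            kept.push(line); // headings are kept regardless of length
            continue;
        }
        let len = trimmed.chars().count();
        if len < 5 {
            continue; // short line removal
        }
        if len > 20 && link_char_ratio(trimmed) > 0.7 {
            continue; // navigation link removal
        }
        let lower = trimmed.to_lowercase();
        if BOILERPLATE.iter().any(|p| lower.contains(*p)) {
            continue; // boilerplate detection
        }
        kept.push(line);
    }
    kept.join("\n").trim().to_string()
}
```

A navigation bar such as "[Home](/) | [Docs](/docs) | [Blog](/blog)" is dropped by this sketch because 35 of its 41 characters sit inside links, while a one-character line inside a fenced block survives.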

Detected boilerplate phrases

The following phrases trigger line removal (case-insensitive):

"cookie policy", "cookie consent", "use cookies", "uses cookies"
"privacy policy", "terms of service"
"subscribe to", "sign up for"
"follow us", "share this"
"powered by", "back to top"

When to use fit content

Use case	Recommended field
Full-fidelity conversion	`content`
LLM prompts and summarization	`fit_content`
RAG indexing	`fit_content` or `citations.content`
Link analysis	`content` (preserves inline links)
Citation-style references	`citations.content` + `citations.references`
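The choices in this table can be wrapped in a small helper. The sketch below uses simplified stand-in structs that carry only the fields it needs, and a hypothetical UseCase enum and select_text function; falling back to content when an optional field is None is an assumption of this example rather than documented behavior.

```rust
/// Simplified stand-ins for the documented result types (sketch only).
pub struct CitationResult {
    pub content: String,
}

pub struct MarkdownResult {
    pub content: String,
    pub citations: Option<CitationResult>,
    pub fit_content: Option<String>,
}

/// Hypothetical use cases, following the rows of the table above.
pub enum UseCase {
    FullFidelity,
    LlmPrompt,
    CitationStyle,
}

/// Picks the text variant for a use case, falling back to `content`
/// whenever the preferred optional field is absent.
pub fn select_text(md: &MarkdownResult, use_case: UseCase) -> &str {
    match use_case {
        UseCase::FullFidelity => &md.content,
        UseCase::LlmPrompt => md.fit_content.as_deref().unwrap_or(&md.content),
        UseCase::CitationStyle => md
            .citations
            .as_ref()
            .map(|c| c.content.as_str())
            .unwrap_or(&md.content),
    }
}
```

Centralizing the fallback keeps callers from unwrapping the Option fields at every call site.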

Warnings

The warnings field captures non-fatal issues encountered during HTML-to-Markdown conversion:

if let Some(ref md) = result.markdown {
    for warning in &md.warnings {
        eprintln!("Conversion warning: {}", warning);
    }
}

These are informational and do not indicate conversion failure. Common warnings include malformed HTML elements or unsupported constructs.
