Markdown Conversion

Kreuzcrawl converts every HTML page to Markdown automatically. The conversion produces a rich MarkdownResult that includes the Markdown text, a document structure tree, extracted tables, numbered citations, and LLM-optimized fit content.

Always-on conversion

Markdown conversion runs on every scrape and crawl operation. There is no configuration flag to enable it: the markdown field on ScrapeResult and CrawlPageResult is populated whenever the fetched content is HTML.

let result = engine.scrape("https://example.com").await?;

if let Some(ref md) = result.markdown {
    println!("{}", md.content);
}

The conversion runs in a blocking task (tokio::task::spawn_blocking) so that parsing does not stall the async runtime. The underlying HTML parser is not Send, so it is used entirely inside that task and never crosses a thread boundary.
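To illustrate what keeping a non-Send parser on one thread looks like, here is a minimal sketch using only the standard library. It uses std::thread::spawn as a stand-in for tokio::task::spawn_blocking so that it needs no dependencies. LocalParser and convert_off_thread are hypothetical names, not part of Kreuzcrawl.

```rust
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for an HTML parser that is not `Send`.
/// The `Rc` field is what makes the type `!Send`.
struct LocalParser {
    source: Rc<str>,
}

impl LocalParser {
    fn new(html: &str) -> Self {
        LocalParser { source: Rc::from(html) }
    }

    /// Toy "conversion": drops everything between '<' and '>'.
    fn to_text(&self) -> String {
        let mut out = String::new();
        let mut in_tag = false;
        for ch in self.source.chars() {
            match ch {
                '<' => in_tag = true,
                '>' => in_tag = false,
                c if !in_tag => out.push(c),
                _ => {}
            }
        }
        out
    }
}

/// Only `Send` values (`String` in, `String` out) cross the thread boundary.
/// The parser is created and dropped inside the worker thread.
fn convert_off_thread(html: String) -> String {
    thread::spawn(move || LocalParser::new(&html).to_text())
        .join()
        .expect("conversion thread panicked")
}

fn main() {
    let text = convert_off_thread("<h1>Title</h1><p>Body</p>".to_string());
    println!("{text}");
}
```

The same shape applies with spawn_blocking: the closure owns the input string, builds the parser locally, and returns an owned result.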

MarkdownResult structure

| Field | Type | Description |
|-------|------|-------------|
| content | String | The converted Markdown text. |
| document_structure | Option<Value> | JSON representation of the document's semantic structure tree. |
| tables | Vec<Value> | Extracted tables as structured JSON with cell data. |
| warnings | Vec<String> | Non-fatal processing warnings from the conversion. |
| citations | Option<CitationResult> | Content with inline links replaced by numbered citations. |
| fit_content | Option<String> | Pruned Markdown optimized for LLM consumption. |

All fields are populated in a single pass during conversion.

Document structure tree

The document_structure field contains a JSON tree representing the semantic structure of the HTML document. It is generated by the html-to-markdown-rs library with the include_document_structure option set to true.

if let Some(ref md) = result.markdown {
    if let Some(ref structure) = md.document_structure {
        println!("{}", serde_json::to_string_pretty(structure)?);
    }
}

The structure captures headings, sections, and their nesting relationships, which is useful for building tables of contents or understanding document hierarchy.

Table extraction

Tables found in the HTML are extracted as structured JSON and available in the tables field:

if let Some(ref md) = result.markdown {
    for (i, table) in md.tables.iter().enumerate() {
        println!("Table {}: {}", i, serde_json::to_string_pretty(table)?);
    }
}

Each table preserves its cell data in a structured format beyond what appears in the Markdown text, making it accessible for programmatic use.

Citations

The citations field holds a version of the Markdown in which inline links are replaced by numbered references. URLs are moved out of the running text into a separate reference list, a format that suits LLM consumption.

Conversion example

Input Markdown:

Visit [Rust](https://www.rust-lang.org) and [Tokio](https://tokio.rs) for more.

Becomes:

Visit Rust[1] and Tokio[2] for more.

With a reference list:

| Index | URL | Text |
|-------|-----|------|
| 1 | https://www.rust-lang.org | Rust |
| 2 | https://tokio.rs | Tokio |

CitationResult structure

| Field | Type | Description |
|-------|------|-------------|
| content | String | Markdown with links replaced by numbered citations (e.g., text[1]). |
| references | Vec<CitationReference> | The numbered reference list. |

CitationReference fields

| Field | Type | Description |
|-------|------|-------------|
| index | usize | The 1-based citation number. |
| url | String | The link URL. |
| text | String | The original link text. |

Citation behavior

  • Duplicate URLs share the same citation number. If two links point to the same URL, they both reference the same index.
  • Images (![alt](url)) are preserved unchanged and not converted to citations.
  • Parentheses in URLs are handled correctly, including Wikipedia-style URLs like https://en.wikipedia.org/wiki/Rust_(programming_language).

To print the rewritten content and its reference list:

if let Some(ref md) = result.markdown {
    if let Some(ref citations) = md.citations {
        println!("{}", citations.content);
        println!("\nReferences:");
        for r in &citations.references {
            println!("[{}]: {} ({})", r.index, r.url, r.text);
        }
    }
}
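The citation behaviors listed above can be sketched in plain Rust. This is not Kreuzcrawl's implementation. The function names parse_link and to_citations are hypothetical, the CitationReference struct is a local mirror of the documented fields, and keeping the first link's text when several links share a URL is an assumption. Nested brackets and escaped characters are not handled.

```rust
use std::collections::HashMap;

/// Local mirror of the documented reference fields (not the crate's type).
#[derive(Debug, PartialEq)]
struct CitationReference {
    index: usize,
    url: String,
    text: String,
}

/// Tries to parse `[text](url)` at the start of `s`.
/// Returns (text, url, bytes consumed). Parentheses inside the URL must balance.
fn parse_link(s: &str) -> Option<(&str, &str, usize)> {
    let rest = s.strip_prefix('[')?;
    let close = rest.find(']')?;
    let text = &rest[..close];
    let after = rest[close + 1..].strip_prefix('(')?;
    let mut depth = 1usize;
    for (i, ch) in after.char_indices() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    // '[' + text + "](" + url + ')'
                    return Some((text, &after[..i], 1 + close + 2 + i + 1));
                }
            }
            _ => {}
        }
    }
    None
}

fn to_citations(markdown: &str) -> (String, Vec<CitationReference>) {
    let mut out = String::with_capacity(markdown.len());
    let mut refs: Vec<CitationReference> = Vec::new();
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut i = 0;
    while i < markdown.len() {
        let rest = &markdown[i..];
        // Images are copied through unchanged.
        if rest.starts_with("![") {
            if let Some((_, _, len)) = parse_link(&rest[1..]) {
                out.push_str(&rest[..1 + len]);
                i += 1 + len;
                continue;
            }
        }
        if let Some((text, url, len)) = parse_link(rest) {
            // Duplicate URLs reuse the index assigned on first sight.
            let existing = seen.get(url).copied();
            let index = match existing {
                Some(n) => n,
                None => {
                    let n = refs.len() + 1;
                    refs.push(CitationReference {
                        index: n,
                        url: url.to_string(),
                        text: text.to_string(),
                    });
                    seen.insert(url.to_string(), n);
                    n
                }
            };
            out.push_str(text);
            out.push_str(&format!("[{index}]"));
            i += len;
            continue;
        }
        // Copy one character and advance by its UTF-8 width.
        let ch = rest.chars().next().unwrap();
        out.push(ch);
        i += ch.len_utf8();
    }
    (out, refs)
}

fn main() {
    let input = "Visit [Rust](https://www.rust-lang.org) and [Tokio](https://tokio.rs) for more.";
    let (content, refs) = to_citations(input);
    println!("{content}");
    for r in &refs {
        println!("[{}]: {} ({})", r.index, r.url, r.text);
    }
}
```

Counting parenthesis depth, rather than stopping at the first closing parenthesis, is what lets the Wikipedia-style URL survive intact.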

Fit content (LLM-optimized pruning)

The fit_content field contains a pruned version of the Markdown, optimized for LLM context windows by removing low-value content. It is produced by the heuristic pruning rules described below.

if let Some(ref md) = result.markdown {
    if let Some(ref fit) = md.fit_content {
        // Use this for LLM prompts instead of the full content
        println!("Fit content ({} chars):", fit.len());
        println!("{}", fit);
    }
}

Pruning rules

The pruning algorithm applies these heuristics line by line:

| Rule | Description |
|------|-------------|
| Navigation link removal | Lines where more than 70% of characters are part of Markdown links (and the line is longer than 20 characters) are removed. |
| Short line removal | Non-heading lines shorter than 5 characters are removed (catches breadcrumbs, separators). |
| Boilerplate detection | Lines containing common boilerplate phrases are removed. |
| Code block preservation | Content inside fenced code blocks (``` or ~~~) is always preserved, even if individual lines are short. |
| Heading preservation | Lines starting with # are always kept regardless of length. |
| Paragraph spacing | Consecutive empty lines are collapsed to a single blank line. |
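As a worked example of the navigation link rule, the sketch below computes the share of a line's characters that sit inside [text](url) spans. How Kreuzcrawl counts link characters is not specified here, so counting the whole span, brackets and URL included, is an assumption. The helper names link_end, link_chars and is_navigation_line are hypothetical.

```rust
/// Returns the index of the closing ')' of a link that starts at `start`.
/// Simplification: nested brackets or parentheses are not handled.
fn link_end(chars: &[char], start: usize) -> Option<usize> {
    let close = (start + 1..chars.len()).find(|&j| chars[j] == ']')?;
    if chars.get(close + 1) != Some(&'(') {
        return None;
    }
    (close + 2..chars.len()).find(|&j| chars[j] == ')')
}

/// Counts characters that fall inside `[text](url)` spans on one line.
fn link_chars(line: &str) -> usize {
    let chars: Vec<char> = line.chars().collect();
    let mut count = 0;
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '[' {
            if let Some(end) = link_end(&chars, i) {
                count += end - i + 1;
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    count
}

/// Applies the documented thresholds: longer than 20 characters and
/// more than 70% of them inside links.
fn is_navigation_line(line: &str) -> bool {
    let total = line.chars().count();
    if total <= 20 {
        return false;
    }
    link_chars(line) as f64 / total as f64 > 0.7
}

fn main() {
    let nav = "[Home](/) | [Docs](/docs) | [Blog](/blog) | [About](/about)";
    let prose = "See the [guide](/guide) for a complete walkthrough of the setup.";
    // 50 of the 59 characters in `nav` are link characters (about 85%).
    println!("nav: {}", is_navigation_line(nav));
    println!("prose: {}", is_navigation_line(prose));
}
```

Under this counting, the menu-style line is classified as navigation and the prose sentence with a single link is not. A bare [Home](/) is kept because it is no longer than 20 characters.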

Detected boilerplate phrases

The following phrases trigger line removal (case-insensitive):

  • "cookie policy", "cookie consent", "use cookies", "uses cookies"
  • "privacy policy", "terms of service"
  • "all rights reserved", "copyright"
  • "subscribe to", "sign up for"
  • "follow us", "share this"
  • "powered by", "back to top"

When to use fit content

| Use case | Recommended field |
|----------|-------------------|
| Full-fidelity conversion | content |
| LLM prompts and summarization | fit_content |
| RAG indexing | fit_content or citations.content |
| Link analysis | content (preserves inline links) |
| Citation-style references | citations.content + citations.references |
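Because fit_content is an Option<String>, code that prefers it for LLM prompts needs a fallback. A minimal sketch follows, using a local struct that mirrors only the two documented fields it needs. The struct is a stand-in for the crate's type, and prompt_text is a hypothetical helper.

```rust
/// Local mirror of the two fields this sketch needs (not the crate's type).
struct MarkdownResult {
    content: String,
    fit_content: Option<String>,
}

/// Prefers the pruned text and falls back to the full conversion.
fn prompt_text(md: &MarkdownResult) -> &str {
    md.fit_content.as_deref().unwrap_or(md.content.as_str())
}

fn main() {
    let md = MarkdownResult {
        content: "# Title\n\nBody\n\nPowered by Example".to_string(),
        fit_content: Some("# Title\n\nBody".to_string()),
    };
    println!("{}", prompt_text(&md));
}
```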

Warnings

The warnings field captures non-fatal issues encountered during HTML-to-Markdown conversion:

if let Some(ref md) = result.markdown {
    for warning in &md.warnings {
        eprintln!("Conversion warning: {}", warning);
    }
}

Warnings are informational and do not indicate that conversion failed. Typical causes are malformed HTML elements and unsupported constructs.