C API Reference v0.3.0-rc.46

Functions

kcrawl_generate_citations()

Convert markdown links to numbered citations.

For example, [Example](https://example.com) becomes Example[1], with [1]: <https://example.com> added to the reference list. Images (![alt](url)) are preserved unchanged.

Signature:

KcrawlCitationResult* kcrawl_generate_citations(const char* markdown);

Parameters:

Name Type Required Description
markdown const char* Yes The markdown text to convert

Returns: KcrawlCitationResult
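
The rewrite described above can be illustrated without the library. The following is a minimal, standalone sketch of the link-to-citation transformation, not the implementation behind kcrawl_generate_citations(). The helper names (sketch_citations, parse_link, put), the 64-reference cap, and the choice to append the reference list to the same output buffer are all assumptions made for illustration, and nested brackets or parentheses inside a link are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum { SKETCH_MAX_REFS = 64 };  /* arbitrary cap chosen for this sketch */

/* If s starts with "[text](url)", reports both pieces and returns the
 * length of the whole link; otherwise returns 0. */
static size_t parse_link(const char *s, const char **text, size_t *text_len,
                         const char **url, size_t *url_len)
{
    const char *close, *end;
    if (s[0] != '[') return 0;
    close = strchr(s + 1, ']');
    if (close == NULL || close[1] != '(') return 0;
    end = strchr(close + 2, ')');
    if (end == NULL) return 0;
    *text = s + 1;
    *text_len = (size_t)(close - (s + 1));
    *url = close + 2;
    *url_len = (size_t)(end - (close + 2));
    return (size_t)(end - s) + 1;
}

/* Appends n bytes to out, keeping it NUL-terminated; returns 0 if full. */
static int put(char *out, size_t cap, size_t *pos, const char *src, size_t n)
{
    if (*pos + n + 1 > cap) return 0;
    memcpy(out + *pos, src, n);
    *pos += n;
    out[*pos] = '\0';
    return 1;
}

/* Hypothetical helper (not a kcrawl function). Returns the number of
 * references written, or -1 if out is too small or there are too many links. */
int sketch_citations(const char *md, char *out, size_t cap)
{
    const char *urls[SKETCH_MAX_REFS];
    size_t url_lens[SKETCH_MAX_REFS];
    const char *text = NULL, *url = NULL;
    size_t text_len = 0, url_len = 0, len, pos = 0;
    char num[32];
    int count = 0, i, n;

    assert(md != NULL && out != NULL);
    if (cap == 0) return -1;
    out[0] = '\0';

    for (const char *p = md; *p != '\0'; ) {
        if (p[0] == '!' &&
            (len = parse_link(p + 1, &text, &text_len, &url, &url_len)) > 0) {
            /* Image: copy "![alt](url)" through unchanged. */
            if (!put(out, cap, &pos, p, len + 1)) return -1;
            p += len + 1;
        } else if ((len = parse_link(p, &text, &text_len, &url, &url_len)) > 0) {
            /* Link: emit "text[n]" and remember the URL for the list. */
            if (count == SKETCH_MAX_REFS) return -1;
            urls[count] = url;
            url_lens[count] = url_len;
            n = snprintf(num, sizeof num, "[%d]", ++count);
            if (!put(out, cap, &pos, text, text_len) ||
                !put(out, cap, &pos, num, (size_t)n)) return -1;
            p += len;
        } else {
            if (!put(out, cap, &pos, p, 1)) return -1;
            p++;
        }
    }
    /* Reference list: one "[n]: <url>" line per collected link. */
    for (i = 0; i < count; i++) {
        n = snprintf(num, sizeof num, "\n[%d]: <", i + 1);
        if (!put(out, cap, &pos, num, (size_t)n) ||
            !put(out, cap, &pos, urls[i], url_lens[i]) ||
            !put(out, cap, &pos, ">", 1)) return -1;
    }
    return count;
}
```

Called on "See [Example](https://example.com).", the sketch writes "See Example[1]." followed by a "[1]: <https://example.com>" line. The real function returns the content and the references as separate fields of KcrawlCitationResult.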


kcrawl_create_engine()

Create a new crawl engine with the given configuration.

If config is NULL, uses CrawlConfig.default(). Returns an error if the configuration is invalid.

Signature:

KcrawlCrawlEngineHandle* kcrawl_create_engine(KcrawlCrawlConfig* config);

Parameters:

Name Type Required Description
config KcrawlCrawlConfig* No The configuration options

Returns: KcrawlCrawlEngineHandle

Errors: Returns NULL on error.


kcrawl_scrape()

Scrape a single URL, returning extracted page data.

Signature:

KcrawlScrapeResult* kcrawl_scrape(KcrawlCrawlEngineHandle engine, const char* url);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
url const char* Yes The URL to fetch

Returns: KcrawlScrapeResult

Errors: Returns NULL on error.


kcrawl_crawl()

Crawl a website starting from url, following links up to the configured depth.

Signature:

KcrawlCrawlResult* kcrawl_crawl(KcrawlCrawlEngineHandle engine, const char* url);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
url const char* Yes The URL to fetch

Returns: KcrawlCrawlResult

Errors: Returns NULL on error.


kcrawl_map_urls()

Discover all pages on a website by following links and sitemaps.

Signature:

KcrawlMapResult* kcrawl_map_urls(KcrawlCrawlEngineHandle engine, const char* url);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
url const char* Yes The URL to fetch

Returns: KcrawlMapResult

Errors: Returns NULL on error.


kcrawl_interact()

Execute browser actions on a single page.

Signature:

KcrawlInteractionResult* kcrawl_interact(KcrawlCrawlEngineHandle engine, const char* url, KcrawlPageAction* actions);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
url const char* Yes The URL to fetch
actions KcrawlPageAction* Yes The sequence of page actions to execute

Returns: KcrawlInteractionResult

Errors: Returns NULL on error.


kcrawl_batch_scrape()

Scrape multiple URLs concurrently.

Signature:

KcrawlBatchScrapeResults* kcrawl_batch_scrape(KcrawlCrawlEngineHandle engine, const char** urls);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
urls const char** Yes The URLs to scrape

Returns: KcrawlBatchScrapeResults

Errors: Returns NULL on error.


kcrawl_batch_crawl()

Crawl multiple seed URLs concurrently, each following links up to the configured depth.

Signature:

KcrawlBatchCrawlResults* kcrawl_batch_crawl(KcrawlCrawlEngineHandle engine, const char** urls);

Parameters:

Name Type Required Description
engine KcrawlCrawlEngineHandle Yes The crawl engine handle
urls const char** Yes The seed URLs to crawl

Returns: KcrawlBatchCrawlResults

Errors: Returns NULL on error.


Types

KcrawlActionResult

Result from a single page action execution.

Field Type Default Description
action_index uintptr_t Zero-based index of the action in the sequence.
action_type const char* The type of action that was executed.
success bool Whether the action completed successfully.
data void** NULL Action-specific return data (screenshot bytes, JS return value, scraped HTML).
error const char** NULL Error message if the action failed.

KcrawlArticleMetadata

Article metadata extracted from article:* Open Graph tags.

Field Type Default Description
published_time const char** NULL The article publication time.
modified_time const char** NULL The article modification time.
author const char** NULL The article author.
section const char** NULL The article section.
tags const char** NULL The article tags.

KcrawlBatchCrawlResult

Result from a single URL in a batch crawl operation.

Field Type Default Description
url const char* The seed URL that was crawled.
result KcrawlCrawlResult* NULL The crawl result, if successful.
error const char** NULL The error message, if the crawl failed.

KcrawlBatchCrawlResults

Aggregate result of a batch crawl, exposing per-URL results plus precomputed counts.

The counts are derived once at construction so every binding language can read them as plain integer fields without re-iterating the results vector.

Field Type Default Description
results KcrawlBatchCrawlResult* NULL Per-URL crawl results, in the order seed URLs were submitted.
total_count uintptr_t Total number of seed URLs in the batch (equal to the number of entries in results).
completed_count uintptr_t Number of seed URLs whose crawl succeeded (error is NULL).
failed_count uintptr_t Number of seed URLs whose crawl failed (error is non-NULL).
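
The relationship between the three counts can be sketched as below. This is an illustration under stated assumptions, not the library's code: SketchBatchEntry is a reduced, hypothetical stand-in for KcrawlBatchCrawlResult that keeps only url and a plain error string, and the caller is assumed to know the entry count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for one per-URL result. */
typedef struct {
    const char *url;
    const char *error;  /* NULL when this URL succeeded */
} SketchBatchEntry;

typedef struct {
    size_t total_count;
    size_t completed_count;
    size_t failed_count;
} SketchBatchCounts;

/* Derives the three counts in a single pass over the entries.
 * An entry counts as completed when its error is NULL, failed otherwise. */
SketchBatchCounts sketch_count_batch(const SketchBatchEntry *entries, size_t n)
{
    SketchBatchCounts c = { n, 0, 0 };
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (entries[i].error == NULL) {
            c.completed_count++;
        } else {
            c.failed_count++;
        }
    }
    return c;
}
```

By construction, completed_count + failed_count equals total_count, which is a cheap sanity check when reading the precomputed fields.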

KcrawlBatchCrawlStreamRequest

Request to begin a multi-URL streaming crawl.

Wraps a set of seed URLs for delivery through the streaming-adapter binding surface. Required as a struct because alef's streaming adapter requires a named request type — primitives are not supported.

Field Type Default Description
urls const char** NULL The seed URLs to crawl. Each URL is followed independently up to the engine's configured depth.

KcrawlBatchScrapeResult

Result from a single URL in a batch scrape operation.

Field Type Default Description
url const char* The URL that was scraped.
result KcrawlScrapeResult* NULL The scrape result, if successful.
error const char** NULL The error message, if the scrape failed.

KcrawlBatchScrapeResults

Aggregate result of a batch scrape, exposing per-URL results plus precomputed counts.

The counts are derived once at construction so every binding language can read them as plain integer fields without re-iterating the results vector.

Field Type Default Description
results KcrawlBatchScrapeResult* NULL Per-URL scrape results, in the order URLs were submitted.
total_count uintptr_t Total number of URLs in the batch (equal to the number of entries in results).
completed_count uintptr_t Number of URLs whose scrape succeeded (error is NULL).
failed_count uintptr_t Number of URLs whose scrape failed (error is non-NULL).

KcrawlBrowserConfig

Browser fallback configuration.

Field Type Default Description
mode KcrawlBrowserMode KCRAWL_AUTO When to use the headless browser fallback.
backend KcrawlBrowserBackend KCRAWL_CHROMIUMOXIDE Browser backend used to render JavaScript-heavy pages.
endpoint const char** NULL CDP WebSocket endpoint for connecting to an external browser instance.
timeout uint64_t 30000ms Timeout for browser page load and rendering (in milliseconds when serialized).
wait KcrawlBrowserWait KCRAWL_NETWORK_IDLE Wait strategy after browser navigation.
wait_selector const char** NULL CSS selector to wait for when wait is KCRAWL_SELECTOR.
extra_wait uint64_t* NULL Extra time to wait after the wait condition is met.
stealth bool false Enable browser-realistic TLS fingerprint via the stealth HTTP client. Only honored by BrowserBackend.Native — chromiumoxide is already full-stealth via Chrome's TLS stack.
proxy KcrawlProxyConfig* NULL Proxy for browser fetches. Overrides CrawlConfig.proxy when set. Native backend supports http/https only (no SOCKS5).
block_url_patterns const char** NULL URL patterns to block before the network request fires. Supports * wildcards. Useful for skipping ads/analytics/large images. Honored by BrowserBackend.Native; chromiumoxide ignores this field today.
eval_script const char** NULL JavaScript snippet evaluated after navigation completes. Scraping captures the native backend result in ScrapeResult.browser.eval_result. Interactions run this script before page actions on both browser backends but do not include the script result in InteractionResult.
robots_user_agent const char** NULL User-agent used when fetching robots.txt. Defaults to BrowserConfig.user_agent (or kreuzcrawl's default) if unset. Native only.
capture_network_events bool false Capture the full network event stream into the result. Default false (only the document event is captured). Native only.
session_affinity bool true Enable session affinity: reuse chromiumoxide Pages for same-domain requests so cookies + fingerprint + solved challenges persist. Default: true. When false, each request gets a fresh Page.
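
The `*` wildcard mentioned for block_url_patterns can be sketched as follows. The reference does not say whether patterns are anchored or matched as substrings, so this example assumes the pattern must match the whole URL. sketch_wildcard_match is a hypothetical helper, not a kcrawl function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when text matches pattern in full. '*' matches any run of
 * characters (including none); every other character matches itself.
 * Uses the usual single-backtrack-point approach: remember the last '*'
 * and, on a mismatch, let it absorb one more character of text. */
bool sketch_wildcard_match(const char *pattern, const char *text)
{
    const char *star = NULL;   /* position of the most recent '*' */
    const char *retry = NULL;  /* text position that '*' resumes from */

    assert(pattern != NULL && text != NULL);
    while (*text != '\0') {
        if (*pattern == '*') {
            star = pattern++;
            retry = text;
        } else if (*pattern == *text) {
            pattern++;
            text++;
        } else if (star != NULL) {
            pattern = star + 1;
            text = ++retry;
        } else {
            return false;
        }
    }
    /* Text is exhausted: only trailing '*' may remain in the pattern. */
    while (*pattern == '*') {
        pattern++;
    }
    return *pattern == '\0';
}
```

Under this assumption, a pattern such as "*://ads.example.com/*" blocks every URL on that host regardless of scheme or path, while "*.png" matches only URLs that end in ".png".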

Methods

kcrawl_default()

Signature:

KcrawlBrowserConfig kcrawl_default();

KcrawlBrowserExtras

Browser-specific extras populated when the native browser backend was used.

Available on ScrapeResult.browser when BrowserBackend.Native handled the request.

Field Type Default Description
eval_result void** NULL Return value of BrowserConfig.eval_script, if provided.
network_events KcrawlResponseMeta* NULL Network events captured during page navigation (only populated when BrowserConfig.capture_network_events is true).
cookies KcrawlCookieInfo* NULL All non-expired cookies present in the browser's cookie jar after navigation completes (includes both prior cookies and server Set-Cookie).

KcrawlCitationReference

A single numbered reference in a citation list — produced by the citation extractor when content uses inline [N]-style markers.

Field Type Default Description
index uintptr_t 1-based reference number as it appears in the source text.
url const char* Resolved absolute URL for this reference.
text const char* Human-readable anchor text or title for the reference.

KcrawlCitationResult

Result of citation conversion.

Field Type Default Description
content const char* Markdown with links replaced by numbered citations.
references KcrawlCitationReference* NULL Numbered reference list: (index, url, text).

KcrawlContentConfig

Content extraction and conversion configuration.

Controls how HTML is converted to the output format. Uses html-to-markdown-rs as the conversion engine for all formats (markdown, plain text, djot).

Field Type Default Description
output_format const char* "markdown" Output format: "markdown" (default), "plain", "djot".
preprocessing_preset const char* "standard" Preprocessing aggressiveness: "minimal", "standard" (default), or "aggressive". "minimal" removes only scripts and styles; "standard" also removes nav, nav-hinted headers/footers/asides, and forms; "aggressive" removes all footers and asides unconditionally.
remove_navigation bool true Remove navigation elements (nav, breadcrumbs, menus). Default: true.
remove_forms bool true Remove form elements. Default: true.
strip_tags const char** NULL HTML tag names to strip (render children only, remove the tag wrapper). Default: ["noscript"].
preserve_tags const char** NULL HTML tag names to preserve as raw HTML in output.
exclude_selectors const char** NULL CSS selectors for elements to exclude entirely (element + all content). Unlike strip_tags (which removes the wrapper but keeps children), excluded elements and all descendants are dropped. Supports CSS selectors: .class, #id, [attribute], compound selectors. Example: [".cookie-banner", "#ad-container", "[role='complementary']"]
skip_images bool false Skip image elements in output. Default: false.
max_depth uintptr_t* NULL Max DOM traversal depth. Prevents stack overflow on deeply nested HTML.
wrap bool false Enable line wrapping. Default: false.
wrap_width uintptr_t 80 Wrap width when wrap is enabled. Default: 80.
include_document_structure bool true Include document structure tree in output. Default: true.

Methods

kcrawl_default()

Signature:

KcrawlContentConfig kcrawl_default();

KcrawlCookieInfo

Information about an HTTP cookie received from a response.

Field Type Default Description
name const char* The cookie name.
value const char* The cookie value.
domain const char** NULL The cookie domain, if specified.
path const char** NULL The cookie path, if specified.

KcrawlCrawlConfig

Configuration for crawl, scrape, and map operations.

Field Type Default Description
max_depth uintptr_t* NULL Maximum crawl depth (number of link hops from the start URL).
max_pages uintptr_t* NULL Maximum number of pages to crawl.
max_concurrent uintptr_t* NULL Maximum number of concurrent requests.
respect_robots_txt bool false Whether to respect robots.txt directives.
soft_http_errors bool false When true, HTTP-level error responses (404 NotFound, 403 Forbidden, WAF blocks) are surfaced as ScrapeResult records with the matching status_code rather than raised as CrawlError. Default false preserves the historical throw-on-error contract for direct fetches. Independently of this flag, 404s reached at the end of a redirect chain are always surfaced softly — the user opted into redirect-following, so receiving a 404 there is part of the normal flow rather than an unexpected error.
user_agent const char** NULL Custom user-agent string.
stay_on_domain bool false Whether to restrict crawling to the same domain.
allow_subdomains bool false Whether to allow subdomains when stay_on_domain is true.
include_paths const char** NULL Regex patterns for paths to include during crawling.
exclude_paths const char** NULL Regex patterns for paths to exclude during crawling.
custom_headers void* NULL Custom HTTP headers to send with each request.
request_timeout uint64_t 30000ms Timeout for individual HTTP requests (in milliseconds when serialized).
rate_limit_ms uint64_t* NULL Per-domain rate limit in milliseconds. When set, enforces a minimum delay between requests to the same domain. Defaults to 200ms when NULL.
max_redirects uintptr_t 10 Maximum number of redirects to follow.
retry_count uintptr_t 0 Number of retry attempts for failed requests.
retry_codes uint16_t* NULL HTTP status codes that should trigger a retry.
cookies_enabled bool false Whether to enable cookie handling.
auth KcrawlAuthConfig* NULL Authentication configuration.
max_body_size uintptr_t* NULL Maximum response body size in bytes.
remove_tags const char** NULL CSS selectors for tags to remove from HTML before processing.
content KcrawlContentConfig Content extraction and conversion configuration.
map_limit uintptr_t* NULL Maximum number of URLs to return from a map operation.
map_search const char** NULL Search filter for map results (case-insensitive substring match on URLs).
download_assets bool false Whether to download assets (CSS, JS, images, etc.) from the page.
asset_types KcrawlAssetCategory* NULL Filter for asset categories to download.
max_asset_size uintptr_t* NULL Maximum size in bytes for individual asset downloads.
browser KcrawlBrowserConfig Browser configuration.
proxy KcrawlProxyConfig* NULL Proxy configuration for HTTP requests.
user_agents const char** NULL List of user-agent strings for rotation. If non-empty, overrides user_agent.
capture_screenshot bool false Whether to capture a screenshot when using the browser.
download_documents bool true Whether to download non-HTML documents (PDF, DOCX, images, code, etc.) instead of skipping them.
document_max_size uintptr_t* NULL Maximum size in bytes for document downloads. Defaults to 50 MB.
document_mime_types const char** NULL Allowlist of MIME types to download. If empty, uses built-in defaults.
warc_output const char** NULL Path to write WARC output. If NULL, WARC output is disabled.
browser_profile const char** NULL Named browser profile for persistent sessions (cookies, localStorage).
save_browser_profile bool false Whether to save changes back to the browser profile on exit.
dispatch const char** NULL Pluggable dispatch components: bypass provider, escalation strategy, retry policy, WAF classifier, domain state, escalation budget, and max_total_attempts. When NULL, the engine uses its built-in defaults (no bypass, BrowserOnly strategy, SimpleRetryPolicy, built-in WAF classifier, no domain state, unlimited budget, 10 total attempt cap). Not serializable — callers construct this at runtime and skip in TOML/JSON configs.
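
Many fields in this table are optional pointers, where NULL means "use the default". As a minimal sketch of that convention, the helper below resolves rate_limit_ms to its documented 200 ms default and computes how long to wait before the next request to the same domain. The function name, the millisecond timestamps, and the handling of a clock that runs backwards are assumptions for illustration; the engine's own scheduling is not documented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the kcrawl API.
 * rate_limit_ms:   the optional config field; NULL selects the 200 ms default.
 * last_request_ms: when the previous request to this domain was sent.
 * now_ms:          current time on the same clock.
 * Returns the remaining delay in milliseconds (0 if a request may go now). */
uint64_t sketch_wait_ms(const uint64_t *rate_limit_ms,
                        uint64_t last_request_ms, uint64_t now_ms)
{
    uint64_t limit = (rate_limit_ms != NULL) ? *rate_limit_ms : 200u;
    /* Treat a clock that appears to go backwards as "no time elapsed". */
    uint64_t elapsed = (now_ms >= last_request_ms) ? now_ms - last_request_ms : 0u;
    return (elapsed >= limit) ? 0u : limit - elapsed;
}
```

With rate_limit_ms left NULL, a request sent 50 ms after the previous one would wait a further 150 ms in this sketch.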

Methods

kcrawl_default()

Signature:

KcrawlCrawlConfig kcrawl_default();

kcrawl_validate()

Validate the configuration, returning an error if any values are invalid.

Signature:

void kcrawl_validate();

KcrawlCrawlEngineHandle

Opaque handle to a configured crawl engine.

Constructed via kcrawl_create_engine() with an optional KcrawlCrawlConfig. Default implementations for all pluggable components are used internally.

Methods

kcrawl_crawl_stream()

Stream a single-URL crawl, yielding CrawlEvents as pages are processed.

Returns an async stream that emits one event per crawled page, plus a terminal Complete event. On per-URL failure during the crawl, emits an Error event followed by Complete. The stream item type is wrapped in a Result to surface transport-level errors; today every emit is Ok.

Signature:

const char* kcrawl_crawl_stream(KcrawlCrawlStreamRequest req);

kcrawl_batch_crawl_stream()

Stream a multi-URL crawl, yielding CrawlEvents across all seeds.

Returns an async stream that emits one event per crawled page across all seeds, plus terminal Complete and Error events as appropriate. The stream item type is wrapped in a Result to surface transport-level errors; today every emit is Ok.

Signature:

const char* kcrawl_batch_crawl_stream(KcrawlBatchCrawlStreamRequest req);

KcrawlCrawlPageResult

The result of crawling a single page during a crawl operation.

Field Type Default Description
url const char* The original URL of the page.
normalized_url const char* The normalized URL of the page.
status_code uint16_t The HTTP status code of the response.
content_type const char* The Content-Type header value.
html const char* The HTML body of the response.
body_size uintptr_t The size of the response body in bytes.
metadata KcrawlPageMetadata Extracted metadata from the page.
links KcrawlLinkInfo* NULL Links found on the page.
images KcrawlImageInfo* NULL Images found on the page.
feeds KcrawlFeedInfo* NULL Feed links found on the page.
json_ld KcrawlJsonLdEntry* NULL JSON-LD entries found on the page.
depth uintptr_t The depth of this page from the start URL.
stayed_on_domain bool Whether this page is on the same domain as the start URL.
was_skipped bool Whether this page was skipped (binary or PDF content).
is_pdf bool Whether the content is a PDF.
detected_charset const char** NULL The detected character set encoding.
markdown KcrawlMarkdownResult* NULL Markdown conversion of the page content.
extracted_data void** NULL Structured data extracted by LLM. Populated when extraction is configured.
extraction_meta KcrawlExtractionMeta* NULL Metadata about the LLM extraction pass (cost, tokens, model).
downloaded_document KcrawlDownloadedDocument* NULL Downloaded non-HTML document (PDF, DOCX, image, code, etc.).
browser_used bool Whether the browser fallback was used to fetch this page.

KcrawlCrawlResult

The result of a multi-page crawl operation.

Field Type Default Description
pages KcrawlCrawlPageResult* NULL The list of crawled pages.
final_url const char* The final URL after following redirects.
redirect_count uintptr_t The number of redirects followed.
was_skipped bool Whether any page was skipped during crawling.
error const char** NULL An error message, if the crawl encountered an issue.
cookies KcrawlCookieInfo* NULL Cookies collected during the crawl.
stayed_on_domain bool Whether all crawled pages stayed on the same domain as the start URL.
browser_used bool Whether the browser fallback was used for any page in this crawl.
normalized_urls const char** NULL Normalized URLs encountered during crawling (for deduplication counting).

Methods

kcrawl_unique_normalized_urls()

Returns the count of unique normalized URLs encountered during crawling.

Signature:

uintptr_t kcrawl_unique_normalized_urls();
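
Counting unique normalized URLs amounts to counting distinct strings. The sketch below shows one way to do that over a plain array, by sorting a copy of the pointers and counting changes between neighbours. It is an illustration only: sketch_count_unique is a hypothetical helper, and the library may well use a different approach internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C-string pointers. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper, not part of the kcrawl API.
 * Returns the number of distinct strings among urls[0..n-1], or
 * (size_t)-1 if the temporary copy cannot be allocated. The input
 * array is left untouched; only a copy of the pointers is sorted. */
size_t sketch_count_unique(const char *const *urls, size_t n)
{
    const char **sorted;
    size_t unique = 1;

    if (n == 0) return 0;
    assert(urls != NULL);
    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL) return (size_t)-1;
    memcpy(sorted, urls, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_str);

    /* After sorting, duplicates are adjacent: count value changes. */
    for (size_t i = 1; i < n; i++) {
        if (strcmp(sorted[i - 1], sorted[i]) != 0) {
            unique++;
        }
    }
    free(sorted);
    return unique;
}
```

Comparing this count with the number of pages crawled gives a rough measure of how many fetched URLs collapsed to the same normalized form.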

KcrawlCrawlStreamRequest

Request to begin a single-URL streaming crawl.

Wraps a single seed URL for delivery through the streaming-adapter binding surface. Required as a struct because alef's streaming adapter requires a named request type — primitives are not supported.

Field Type Default Description
url const char* The seed URL to crawl.

KcrawlDownloadedAsset

A downloaded asset from a page.

Field Type Default Description
url const char* The original URL of the asset.
content_hash const char* The SHA-256 content hash of the asset.
mime_type const char** NULL The MIME type from the Content-Type header.
size uintptr_t The size of the asset in bytes.
asset_category KcrawlAssetCategory KCRAWL_IMAGE The category of the asset.
html_tag const char** NULL The HTML tag that referenced this asset (e.g., "link", "script", "img").

KcrawlDownloadedDocument

A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).

When the crawler encounters non-HTML content and download_documents is enabled, it downloads the raw bytes and populates this struct instead of skipping the resource.

Field Type Default Description
url const char* The URL the document was fetched from.
mime_type const char* The MIME type from the Content-Type header.
content const uint8_t* Raw document bytes. Skipped during JSON serialization.
size uintptr_t Size of the document in bytes.
filename const char** NULL Filename extracted from Content-Disposition or URL path.
content_hash const char* SHA-256 hex digest of the content.
headers void* NULL Selected response headers.

KcrawlExtractionMeta

Metadata about an LLM extraction pass.

Field Type Default Description
cost double* NULL Estimated cost of the LLM call in USD.
prompt_tokens uint64_t* NULL Number of prompt (input) tokens consumed.
completion_tokens uint64_t* NULL Number of completion (output) tokens generated.
model const char** NULL The model identifier used for extraction.
chunks_processed uintptr_t Number of content chunks sent to the LLM.

KcrawlFaviconInfo

Information about a favicon or icon link.

Field Type Default Description
url const char* The icon URL.
rel const char* The rel attribute (e.g., "icon", "apple-touch-icon").
sizes const char** NULL The sizes attribute, if present.
mime_type const char** NULL The MIME type, if present.

KcrawlFeedInfo

Information about a feed link found on a page.

Field Type Default Description
url const char* The feed URL.
title const char** NULL The feed title, if present.
feed_type KcrawlFeedType KCRAWL_RSS The type of feed.

KcrawlHeadingInfo

A heading element extracted from the page.

Field Type Default Description
level uint8_t The heading level (1-6).
text const char* The heading text content.

KcrawlHreflangEntry

An hreflang alternate link entry.

Field Type Default Description
lang const char* The language code (e.g., "en", "fr", "x-default").
url const char* The URL for this language variant.

KcrawlImageInfo

Information about an image found on a page.

Field Type Default Description
url const char* The image URL.
alt const char** NULL The alt text, if present.
width uint32_t* NULL The width attribute, if present and parseable.
height uint32_t* NULL The height attribute, if present and parseable.
source KcrawlImageSource KCRAWL_IMG The source of the image reference.

KcrawlInteractionResult

Result of executing a sequence of page interaction actions.

Field Type Default Description
action_results KcrawlActionResult* NULL Results from each executed action.
final_html const char* Final page HTML after all actions completed.
final_url const char* Final page URL (may have changed due to navigation).
screenshot const uint8_t** NULL Screenshot taken after all actions, if requested.

KcrawlJsonLdEntry

A JSON-LD structured data entry found on a page.

Field Type Default Description
schema_type const char* The @type value from the JSON-LD object.
name const char** NULL The name value, if present.
raw const char* The raw JSON-LD string.

KcrawlLinkInfo

Information about a link found on a page.

Field Type Default Description
url const char* The resolved URL of the link.
text const char* The visible text of the link.
link_type KcrawlLinkType KCRAWL_INTERNAL The classification of the link.
rel const char** NULL The rel attribute value, if present.
nofollow bool Whether the link has rel="nofollow".

KcrawlMapResult

The result of a map operation, containing discovered URLs.

Field Type Default Description
urls KcrawlSitemapUrl* NULL The list of discovered URLs.

KcrawlMarkdownResult

Rich markdown conversion result from HTML processing.

Field Type Default Description
content const char* Converted markdown text.
document_structure void** NULL Structured document tree with semantic nodes.
tables void** NULL Extracted tables with structured cell data.
warnings const char** NULL Non-fatal processing warnings.
citations bool Whether citation conversion was applied and produced at least one reference. true when the markdown contained inline links that were converted to numbered citation references. The converted content (with [N] markers) is available in content; the full reference list is accessible via kcrawl_generate_citations() if needed separately.
fit_content const char** NULL Content-filtered markdown optimized for LLM consumption.

KcrawlPageMetadata

Metadata extracted from an HTML page's <meta> tags and <title> element.

Field Type Default Description
title const char** NULL The page title from the <title> element.
description const char** NULL The meta description.
canonical_url const char** NULL The canonical URL from <link rel="canonical">.
keywords const char** NULL Keywords from <meta name="keywords">.
author const char** NULL Author from <meta name="author">.
viewport const char** NULL Viewport content from <meta name="viewport">.
theme_color const char** NULL Theme color from <meta name="theme-color">.
generator const char** NULL Generator from <meta name="generator">.
robots const char** NULL Robots content from <meta name="robots">.
html_lang const char** NULL The lang attribute from the <html> element.
html_dir const char** NULL The dir attribute from the <html> element.
og_title const char** NULL Open Graph title.
og_type const char** NULL Open Graph type.
og_image const char** NULL Open Graph image URL.
og_description const char** NULL Open Graph description.
og_url const char** NULL Open Graph URL.
og_site_name const char** NULL Open Graph site name.
og_locale const char** NULL Open Graph locale.
og_video const char** NULL Open Graph video URL.
og_audio const char** NULL Open Graph audio URL.
og_locale_alternates const char*** NULL Open Graph locale alternates.
twitter_card const char** NULL Twitter card type.
twitter_title const char** NULL Twitter title.
twitter_description const char** NULL Twitter description.
twitter_image const char** NULL Twitter image URL.
twitter_site const char** NULL Twitter site handle.
twitter_creator const char** NULL Twitter creator handle.
dc_title const char** NULL Dublin Core title.
dc_creator const char** NULL Dublin Core creator.
dc_subject const char** NULL Dublin Core subject.
dc_description const char** NULL Dublin Core description.
dc_publisher const char** NULL Dublin Core publisher.
dc_date const char** NULL Dublin Core date.
dc_type const char** NULL Dublin Core type.
dc_format const char** NULL Dublin Core format.
dc_identifier const char** NULL Dublin Core identifier.
dc_language const char** NULL Dublin Core language.
dc_rights const char** NULL Dublin Core rights.
article KcrawlArticleMetadata* NULL Article metadata from article:* Open Graph tags.
hreflangs KcrawlHreflangEntry** NULL Hreflang alternate links.
favicons KcrawlFaviconInfo** NULL Favicon and icon links.
headings KcrawlHeadingInfo** NULL Heading elements (h1-h6).
word_count uintptr_t* NULL Computed word count of the page body text.

KcrawlProxyConfig

Proxy configuration for HTTP requests.

Field Type Default Description
url const char* Proxy URL (e.g. "http://proxy:8080", "socks5://proxy:1080").
username const char** NULL Optional username for proxy authentication.
password const char** NULL Optional password for proxy authentication.

KcrawlResponseMeta

Response metadata extracted from HTTP headers.

Field Type Default Description
etag const char** NULL The ETag header value.
last_modified const char** NULL The Last-Modified header value.
cache_control const char** NULL The Cache-Control header value.
server const char** NULL The Server header value.
x_powered_by const char** NULL The X-Powered-By header value.
content_language const char** NULL The Content-Language header value.
content_encoding const char** NULL The Content-Encoding header value.

KcrawlScrapeResult

The result of a single-page scrape operation.

Field Type Default Description
status_code uint16_t The HTTP status code of the response.
final_url const char* The final URL after following all redirects.
content_type const char* The Content-Type header value.
html const char* The HTML body of the response.
body_size uintptr_t The size of the response body in bytes.
metadata KcrawlPageMetadata Extracted metadata from the page.
links KcrawlLinkInfo* NULL Links found on the page.
images KcrawlImageInfo* NULL Images found on the page.
feeds KcrawlFeedInfo* NULL Feed links found on the page.
json_ld KcrawlJsonLdEntry* NULL JSON-LD entries found on the page.
is_allowed bool Whether the URL is allowed by robots.txt.
crawl_delay uint64_t* NULL The crawl delay from robots.txt, in seconds.
noindex_detected bool Whether a noindex directive was detected.
nofollow_detected bool Whether a nofollow directive was detected.
x_robots_tag const char** NULL The X-Robots-Tag header value, if present.
is_pdf bool Whether the content is a PDF.
was_skipped bool Whether the page was skipped (binary or PDF content).
detected_charset const char** NULL The detected character set encoding.
auth_header_sent bool Whether an authentication header was sent with the request.
response_meta KcrawlResponseMeta* NULL Response metadata extracted from HTTP headers.
assets KcrawlDownloadedAsset* NULL Downloaded assets from the page.
js_render_hint bool Whether the page content suggests JavaScript rendering is needed.
browser_used bool Whether the browser fallback was used to fetch this page.
markdown KcrawlMarkdownResult* NULL Markdown conversion of the page content.
extracted_data void** NULL Structured data extracted by LLM. Populated when extraction is configured.
extraction_meta KcrawlExtractionMeta* NULL Metadata about the LLM extraction pass (cost, tokens, model).
screenshot const uint8_t** NULL Screenshot of the page as PNG bytes. Populated when browser is used and capture_screenshot is enabled.
downloaded_document KcrawlDownloadedDocument* NULL Downloaded non-HTML document (PDF, DOCX, image, code, etc.).
browser KcrawlBrowserExtras* NULL Browser-specific extras (eval result, network events, cookies). Only populated when BrowserBackend.Native was used for this request.

KcrawlSitemapUrl

A URL entry from a sitemap.

Field Type Default Description
url const char* The URL.
lastmod const char** NULL The last modification date, if present.
changefreq const char** NULL The change frequency, if present.
priority const char** NULL The priority, if present.

Enums

KcrawlBrowserMode

When to use the headless browser fallback.

Value Description
KCRAWL_AUTO Automatically detect when JS rendering is needed and fall back to browser.
KCRAWL_ALWAYS Always use the browser for every request.
KCRAWL_NEVER Never use the browser fallback.

KcrawlBrowserWait

Wait strategy for browser page rendering.

Value Description
KCRAWL_NETWORK_IDLE Wait until network activity is idle.
KCRAWL_SELECTOR Wait for a specific CSS selector to appear in the DOM.
KCRAWL_FIXED Wait for a fixed duration after navigation.

KcrawlBrowserBackend

Browser backend used for JavaScript rendering.

Value Description
KCRAWL_CHROMIUMOXIDE Existing Chromium/CDP backend powered by chromiumoxide.
KCRAWL_NATIVE Kreuzcrawl-owned native browser backend derived from Obscura.

KcrawlAuthConfig

Authentication configuration.

Value Description
KCRAWL_BASIC HTTP Basic authentication. — Fields: username: const char*, password: const char*
KCRAWL_BEARER Bearer token authentication. — Fields: token: const char*
KCRAWL_HEADER Custom authentication header. — Fields: name: const char*, value: const char*
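
Because each KcrawlAuthConfig variant carries its own fields, C code typically handles it as a tagged union and switches on the variant. The following is a minimal sketch of that pattern, not the library's actual layout: SketchAuthConfig, SketchAuthKind, and sketch_auth_describe are hypothetical names, and the enum is redeclared locally as a stand-in for whatever the real header provides. It builds a log-safe summary that never copies the password, token, or header value.

```c
#include <stddef.h>
#include <stdio.h>

/* Local stand-in for the documented variants (assumption: the real
 * header may number or lay these out differently). */
typedef enum { KCRAWL_BASIC, KCRAWL_BEARER, KCRAWL_HEADER } SketchAuthKind;

/* Hypothetical tagged union mirroring the documented per-variant fields. */
typedef struct {
    SketchAuthKind kind;
    union {
        struct { const char *username; const char *password; } basic;
        struct { const char *token; } bearer;
        struct { const char *name; const char *value; } header;
    } as;
} SketchAuthConfig;

/* Write a summary of auth into buf without exposing any secret.
 * Returns the snprintf result (length that would have been written). */
int sketch_auth_describe(const SketchAuthConfig *auth, char *buf, size_t len) {
    switch (auth->kind) {
    case KCRAWL_BASIC:
        return snprintf(buf, len, "basic user=%s", auth->as.basic.username);
    case KCRAWL_BEARER:
        return snprintf(buf, len, "bearer (token redacted)");
    case KCRAWL_HEADER:
        return snprintf(buf, len, "header %s (value redacted)",
                        auth->as.header.name);
    }
    return snprintf(buf, len, "unknown");
}
```

Switching on the variant first and only then reading the matching union member avoids reading fields that belong to a different variant.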

KcrawlLinkType

The classification of a link.

Value Description
KCRAWL_INTERNAL A link to the same domain.
KCRAWL_EXTERNAL A link to a different domain.
KCRAWL_ANCHOR A fragment-only link (e.g., #section).
KCRAWL_DOCUMENT A link to a downloadable document (PDF, DOC, etc.).

KcrawlImageSource

The source of an image reference.

Value Description
KCRAWL_IMG An <img> tag.
KCRAWL_PICTURE_SOURCE A <source> tag inside <picture>.
KCRAWL_OG_IMAGE An og:image meta tag.
KCRAWL_TWITTER_IMAGE A twitter:image meta tag.

KcrawlFeedType

The type of a feed (RSS, Atom, or JSON Feed).

Value Description
KCRAWL_RSS RSS feed.
KCRAWL_ATOM Atom feed.
KCRAWL_JSON_FEED JSON Feed.

KcrawlAssetCategory

The category of a downloaded asset.

Value Description
KCRAWL_DOCUMENT A document file (PDF, DOC, etc.).
KCRAWL_IMAGE An image file.
KCRAWL_AUDIO An audio file.
KCRAWL_VIDEO A video file.
KCRAWL_FONT A font file.
KCRAWL_STYLESHEET A CSS stylesheet.
KCRAWL_SCRIPT A JavaScript file.
KCRAWL_ARCHIVE An archive file (ZIP, TAR, etc.).
KCRAWL_DATA A data file (JSON, XML, CSV, etc.).
KCRAWL_OTHER An unrecognized asset type.

KcrawlCrawlEvent

An event emitted during a streaming crawl operation.

Not available on wasm32 targets — streaming requires native concurrency primitives (tokio channels, JoinSet) that are not supported on wasm32.

Delivered to bindings via alef's streaming-adapter pattern. The crawl_stream / batch_crawl_stream binding wrappers in bindings.rs expose this as the per-language streaming idiom (Python AsyncIterator, Ruby Enumerator, PHP Generator, Elixir Stream.unfold, etc.).

Value Description
KCRAWL_PAGE A single page has been crawled. — Fields: result: KcrawlCrawlPageResult
KCRAWL_ERROR An error occurred while crawling a URL. — Fields: url: const char*, error: const char*
KCRAWL_COMPLETE The crawl has completed. — Fields: pages_crawled: uintptr_t
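
A consumer of the three event variants above usually dispatches on the variant and keeps running totals. The sketch below assumes a tagged-union representation; SketchCrawlEvent, SketchTally, and sketch_on_event are hypothetical names, the enum is a local stand-in, and the page payload is reduced to an opaque pointer because the layout of KcrawlCrawlPageResult is not relied on here. How events actually reach C callers is not specified in this section.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in for the documented event variants (assumption). */
typedef enum { KCRAWL_PAGE, KCRAWL_ERROR, KCRAWL_COMPLETE } SketchEventKind;

/* Hypothetical tagged union carrying the documented per-variant fields. */
typedef struct {
    SketchEventKind kind;
    union {
        struct { const void *result; } page; /* opaque stand-in for the page result */
        struct { const char *url; const char *error; } error;
        struct { uintptr_t pages_crawled; } complete;
    } as;
} SketchCrawlEvent;

/* Running totals kept by the consumer. */
typedef struct {
    size_t pages;
    size_t errors;
    uintptr_t reported_total;
    bool done;
} SketchTally;

/* Update the tally for one event; errors are also logged to stderr. */
void sketch_on_event(SketchTally *tally, const SketchCrawlEvent *ev) {
    switch (ev->kind) {
    case KCRAWL_PAGE:
        tally->pages++;
        break;
    case KCRAWL_ERROR:
        tally->errors++;
        fprintf(stderr, "crawl error for %s: %s\n",
                ev->as.error.url, ev->as.error.error);
        break;
    case KCRAWL_COMPLETE:
        tally->reported_total = ev->as.complete.pages_crawled;
        tally->done = true;
        break;
    }
}
```

Comparing the locally counted pages with the pages_crawled value reported by the completion event is a cheap way to notice dropped events.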

KcrawlPageAction

A single page interaction action.

Actions are serialized with a type tag using camelCase naming, except ExecuteJs which is explicitly renamed to "executeJs".

Value Description
KCRAWL_CLICK Click on an element matching the given CSS selector. — Fields: selector: const char*
KCRAWL_TYPE_TEXT Type text into an element matching the given CSS selector. — Fields: selector: const char*, text: const char*
KCRAWL_PRESS Press a keyboard key (e.g. "Enter", "Tab", "Escape"). — Fields: key: const char*
KCRAWL_SCROLL Scroll the page or a specific element. — Fields: direction: KcrawlScrollDirection, selector: const char*, amount: int64_t
KCRAWL_WAIT Wait for a duration or for an element to appear. — Fields: milliseconds: int64_t, selector: const char*
KCRAWL_SCREENSHOT Take a screenshot of the current page. — Fields: full_page: bool
KCRAWL_EXECUTE_JS Execute arbitrary JavaScript in the page context. Safety: The script runs with full page privileges in the browser context. Only execute scripts from trusted sources. — Fields: script: const char*
KCRAWL_SCRAPE Scrape the current page HTML.
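
The camelCase rule described above can be illustrated by mapping each variant to the type tag it would serialize with. This is a sketch under assumptions: the tag strings are inferred from the variant names and the camelCase rule (only "executeJs" is stated explicitly in this reference), the enum is a local stand-in, and sketch_action_tag is a hypothetical helper rather than a library function.

```c
#include <stddef.h>

/* Local stand-in for the documented action variants (assumption). */
typedef enum {
    KCRAWL_CLICK,
    KCRAWL_TYPE_TEXT,
    KCRAWL_PRESS,
    KCRAWL_SCROLL,
    KCRAWL_WAIT,
    KCRAWL_SCREENSHOT,
    KCRAWL_EXECUTE_JS,
    KCRAWL_SCRAPE
} SketchPageActionKind;

/* Return the assumed camelCase type tag for an action variant,
 * or NULL for a value outside the enum. */
const char *sketch_action_tag(SketchPageActionKind kind) {
    switch (kind) {
    case KCRAWL_CLICK:      return "click";
    case KCRAWL_TYPE_TEXT:  return "typeText";
    case KCRAWL_PRESS:      return "press";
    case KCRAWL_SCROLL:     return "scroll";
    case KCRAWL_WAIT:       return "wait";
    case KCRAWL_SCREENSHOT: return "screenshot";
    case KCRAWL_EXECUTE_JS: return "executeJs"; /* the one tag stated explicitly */
    case KCRAWL_SCRAPE:     return "scrape";
    }
    return NULL;
}
```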

KcrawlScrollDirection

Direction for a scroll action.

Value Description
KCRAWL_UP Scroll upward.
KCRAWL_DOWN Scroll downward.

Errors

KcrawlCrawlError

Errors that can occur during crawling, scraping, or mapping operations.

Variant Description
KCRAWL_NOT_FOUND The requested page was not found (HTTP 404).
KCRAWL_UNAUTHORIZED The request was unauthorized (HTTP 401).
KCRAWL_FORBIDDEN The request was forbidden (HTTP 403).
KCRAWL_WAF_BLOCKED The request was blocked by a WAF or bot protection (HTTP 403 with WAF indicators). The vendor field is the lowercase identifier of the detected WAF (e.g. "cloudflare", "datadome"), or "unknown" when the engine cannot identify it. The message field is a free-form description for logs and human readers. The stable error tag remains "forbidden: waf/blocked: MESSAGE", so existing log-grep patterns and cross-language bindings continue to work; vendor is surfaced separately for structured consumers.
KCRAWL_TIMEOUT The request timed out.
KCRAWL_RATE_LIMITED The request was rate-limited (HTTP 429).
KCRAWL_SERVER_ERROR A server error occurred (HTTP 5xx).
KCRAWL_BAD_GATEWAY A bad gateway error occurred (HTTP 502).
KCRAWL_GONE The resource is permanently gone (HTTP 410).
KCRAWL_CONNECTION A connection error occurred.
KCRAWL_DNS A DNS resolution error occurred.
KCRAWL_SSL An SSL/TLS error occurred.
KCRAWL_DATA_LOSS Data was lost or truncated during transfer.
KCRAWL_BROWSER_ERROR The browser failed to launch, connect, or navigate.
KCRAWL_BROWSER_TIMEOUT The browser page load or rendering timed out.
KCRAWL_INVALID_CONFIG The provided configuration is invalid.
KCRAWL_UNSUPPORTED The requested capability is not supported by the active backend or build.
KCRAWL_OTHER An unclassified error occurred.
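
Callers often need to decide which of these variants are worth retrying. The sketch below shows one possible policy, not behavior defined by the library: the choice of retryable variants is an assumption, the enum is redeclared locally as a stand-in with assumed ordering, and sketch_is_retryable is a hypothetical helper.

```c
#include <stdbool.h>

/* Local stand-in for the documented error variants (assumption: the
 * real header may assign different numeric values). */
typedef enum {
    KCRAWL_NOT_FOUND,
    KCRAWL_UNAUTHORIZED,
    KCRAWL_FORBIDDEN,
    KCRAWL_WAF_BLOCKED,
    KCRAWL_TIMEOUT,
    KCRAWL_RATE_LIMITED,
    KCRAWL_SERVER_ERROR,
    KCRAWL_BAD_GATEWAY,
    KCRAWL_GONE,
    KCRAWL_CONNECTION,
    KCRAWL_DNS,
    KCRAWL_SSL,
    KCRAWL_DATA_LOSS,
    KCRAWL_BROWSER_ERROR,
    KCRAWL_BROWSER_TIMEOUT,
    KCRAWL_INVALID_CONFIG,
    KCRAWL_UNSUPPORTED,
    KCRAWL_OTHER
} SketchCrawlError;

/* Assumed policy: transient network and server conditions are retried;
 * client errors, configuration errors, and unsupported features are not. */
bool sketch_is_retryable(SketchCrawlError err) {
    switch (err) {
    case KCRAWL_TIMEOUT:
    case KCRAWL_RATE_LIMITED:
    case KCRAWL_SERVER_ERROR:
    case KCRAWL_BAD_GATEWAY:
    case KCRAWL_CONNECTION:
    case KCRAWL_DATA_LOSS:
    case KCRAWL_BROWSER_TIMEOUT:
        return true;
    default:
        return false;
    }
}
```

A real retry loop would normally pair this check with a backoff delay, especially for KCRAWL_RATE_LIMITED.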
