MCP Tools Reference¶
Kreuzcrawl implements a Model Context Protocol (MCP) server that exposes web scraping, crawling, and mapping capabilities as tools for LLM agents. The server uses stdio transport and is started via `kreuzcrawl mcp` or programmatically with `kreuzcrawl::mcp::start_mcp_server()`.
Server Info¶
| Field | Value |
|---|---|
| Name | kreuzcrawl-mcp |
| Title | Kreuzcrawl Web Crawling MCP Server |
| Transport | stdio |
| Capabilities | Tools |
Tools¶
scrape¶
Scrape a single URL and extract content as markdown or JSON.
Annotations: read_only_hint = true, idempotent_hint = true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | -- | URL to scrape (must start with `http://` or `https://`) |
| `format` | string | No | `"markdown"` | Output format: `"markdown"` or `"json"` |
| `use_browser` | boolean | No | `false` | Force browser rendering instead of an HTTP fetch (requires the `browser` feature) |
Returns: Text content containing the page content in the requested format. Markdown format includes the converted page content. JSON format includes the full `ScrapeResult` as structured JSON.
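As a concrete illustration, an MCP client could invoke this tool with a standard `tools/call` request (the `id` and argument values here are arbitrary):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape",
    "arguments": {
      "url": "https://example.com",
      "format": "markdown"
    }
  }
}
```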
crawl¶
Crawl a website following links up to a configured depth.
Annotations: read_only_hint = true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | -- | Starting URL for the crawl |
| `max_depth` | integer | No | Engine default | Maximum link depth from the start URL (max: 100) |
| `max_pages` | integer | No | Engine default | Maximum number of pages to crawl (1--100,000) |
| `format` | string | No | `"markdown"` | Output format: `"markdown"` or `"json"` |
| `stay_on_domain` | boolean | No | Engine default | Restrict crawling to the same domain |
Returns: Text content with results for all discovered pages. In markdown format, pages are separated by `---`. In JSON format, each page is a structured object.
Validation:

- `max_depth` must be <= 100
- `max_pages` must be between 1 and 100,000
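For example, the `params` object of a `tools/call` request for this tool might look like the following (illustrative values):

```json
{
  "name": "crawl",
  "arguments": {
    "url": "https://example.com",
    "max_depth": 2,
    "max_pages": 50,
    "stay_on_domain": true
  }
}
```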
map¶
Discover all pages on a website via links and sitemaps.
Annotations: read_only_hint = true, idempotent_hint = true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | -- | URL of the website to map |
| `limit` | integer | No | No limit | Maximum number of URLs to return |
| `search` | string | No | -- | Case-insensitive substring filter for discovered URLs |
| `respect_robots_txt` | boolean | No | Engine default | Whether to respect robots.txt directives |
Returns: Text content with the list of discovered URLs and their sitemap metadata (last modified, change frequency, priority).
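For instance, the `params` of a `tools/call` request that maps a site and keeps only URLs containing "blog" might look like (illustrative values):

```json
{
  "name": "map",
  "arguments": {
    "url": "https://example.com",
    "limit": 100,
    "search": "blog"
  }
}
```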
batch_scrape¶
Scrape multiple URLs concurrently.
Annotations: read_only_hint = true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | -- | List of URLs to scrape (must not be empty) |
| `format` | string | No | `"markdown"` | Output format: `"markdown"` or `"json"` |
| `concurrency` | integer | No | Engine default | Maximum number of concurrent requests |
Returns: Text content with results for each URL, separated by `---`. Failed URLs include an error message.
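An illustrative `tools/call` `params` object for this tool (the URLs and concurrency value are placeholders):

```json
{
  "name": "batch_scrape",
  "arguments": {
    "urls": ["https://example.com/a", "https://example.com/b"],
    "format": "json",
    "concurrency": 4
  }
}
```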
download¶
Download a document from a URL.
Annotations: read_only_hint = true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | -- | URL to download |
| `max_size` | integer | No | 50 MB | Maximum document size in bytes |
Returns: JSON text with document metadata:
```json
{
  "url": "https://example.com/doc.pdf",
  "mime_type": "application/pdf",
  "size": 102400,
  "filename": "doc.pdf",
  "content_hash": "abc123..."
}
```
If the URL returns HTML instead of a downloadable document, the response includes a `note` field explaining this.
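As an example, a `tools/call` `params` object that caps the download at 10 MB might look like (illustrative values):

```json
{
  "name": "download",
  "arguments": {
    "url": "https://example.com/doc.pdf",
    "max_size": 10485760
  }
}
```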
get_version¶
Get the current kreuzcrawl library version.
Annotations: read_only_hint = true, idempotent_hint = true
Parameters: None (an empty object `{}`).

Returns: JSON text containing the library version.
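Since the tool takes no parameters, a `tools/call` `params` object for it is simply:

```json
{
  "name": "get_version",
  "arguments": {}
}
```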
Planned Tools (Not Yet Implemented)¶
The following tools are defined but return placeholder messages. They require specific feature flags to be enabled at compile time.
screenshot¶
Capture a screenshot of a URL.
Requires: browser feature
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to capture |
| `full_page` | boolean | No | Capture the full page rather than only the viewport |
interact¶
Execute browser actions on a page (click, type, scroll, etc.).
Requires: interact feature
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to navigate to before executing actions |
| `actions` | object[] | Yes | Sequence of browser actions |
research¶
AI-driven research across multiple pages.
Requires: ai feature
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | string | Yes | Research query or question |
| `max_depth` | integer | No | Maximum crawl depth per seed URL |
| `max_pages` | integer | No | Maximum total pages to visit |
| `seed_urls` | string[] | No | Optional seed URLs to start from |
crawl_status¶
Check the status of a crawl job.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `job_id` | string | No | Job ID to check status for |
Error Handling¶
MCP tool errors are mapped from `CrawlError` variants:
| `CrawlError` Variant | MCP Error Code | Description |
|---|---|---|
| `InvalidConfig` | -32602 (`INVALID_PARAMS`) | Invalid configuration parameter |
| All other variants | -32603 (`INTERNAL_ERROR`) | Network, browser, or server errors with descriptive context |
Error messages preserve the original context to aid debugging.
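A rejected call therefore surfaces to the client as an ordinary JSON-RPC error response; for example (the message text here is illustrative, not the exact wording the server emits):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -32602,
    "message": "Invalid configuration parameter: max_depth must be <= 100"
  }
}
```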
Usage with Claude Desktop¶
Add to your `claude_desktop_config.json`:
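A minimal entry might look like this (assuming the `kreuzcrawl` binary is on your `PATH`; the server key name is arbitrary):

```json
{
  "mcpServers": {
    "kreuzcrawl": {
      "command": "kreuzcrawl",
      "args": ["mcp"]
    }
  }
}
```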