API Server¶
Kreuzcrawl includes a Firecrawl v1-compatible REST API server built on Axum. The server
is feature-gated behind the `api` Cargo feature.
Starting the server¶
CLI¶
| Flag | Default | Description |
|---|---|---|
| `--host` | `0.0.0.0` | IP address to bind to. |
| `--port` | `3000` | TCP port to listen on. |
Programmatic¶
```rust
use std::sync::Arc;

use kreuzcrawl::{CrawlConfig, CrawlEngine};
use kreuzcrawl::api::serve;

// Inside an async context (e.g. a #[tokio::main] function):
let engine = CrawlEngine::builder()
    .config(CrawlConfig::default())
    .build()?;

serve("0.0.0.0", 3000, Arc::new(engine)).await?;
```
Or with the convenience wrapper that builds the engine internally:
```rust
use kreuzcrawl::api::serve_with_config;
use kreuzcrawl::CrawlConfig;

serve_with_config("0.0.0.0", 3000, CrawlConfig::default()).await?;
```
Middleware stack¶
The server applies the following middleware (outermost first):
- Request ID -- Generates and propagates `X-Request-Id` headers (UUID v4).
- Sensitive headers -- Redacts the `Authorization` header from logs.
- Request timeout -- 5-minute hard cap per request (returns 408 on timeout).
- Body size limit -- Maximum 10 MB request body.
- CORS -- Permissive (allows any origin, method, and headers).
- Compression -- Transparent response compression.
- Panic recovery -- Catches panics and returns 500 instead of crashing.
- Tracing -- HTTP request/response logging.
Endpoints¶
Scrape (synchronous)¶
POST /v1/scrape -- Scrape a single URL and return extracted content.
Request:
```json
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "includeTags": [".article"],
  "excludeTags": [".sidebar"],
  "timeout": 30000
}
```
Response:
```json
{
  "success": true,
  "data": {
    "status_code": 200,
    "content_type": "text/html",
    "body_size": 15234,
    "metadata": { "title": "Example", "description": "..." },
    "markdown": { "content": "# Example\n\nHello world..." },
    "html": "<html>...</html>",
    "links": [...],
    "images": [...]
  }
}
```
Crawl (asynchronous)¶
POST /v1/crawl -- Start an asynchronous crawl job.
Request:
```json
{
  "url": "https://example.com",
  "maxDepth": 2,
  "maxPages": 100,
  "includePaths": ["/docs/.*"],
  "excludePaths": ["/blog/.*"],
  "onlyMainContent": true
}
```
Response (202 Accepted):
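The 202 body is not reproduced here; a representative payload, following the Firecrawl v1 convention (field names and the job ID are illustrative assumptions), looks like:

```json
{
  "success": true,
  "id": "abc123",
  "url": "https://example.com/v1/crawl/abc123"
}
```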
GET /v1/crawl/{id} -- Poll the status of a crawl job.
Response (in progress):
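As a sketch, mirroring the completed example below (values illustrative; whether partial results appear while in progress is not specified here):

```json
{
  "status": "in_progress",
  "total": 25,
  "completed": 10
}
```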
Response (completed):
```json
{
  "status": "completed",
  "total": 25,
  "completed": 25,
  "data": [
    { "url": "https://example.com/", "status_code": 200, "depth": 0, ... },
    { "url": "https://example.com/about", "status_code": 200, "depth": 1, ... }
  ]
}
```
DELETE /v1/crawl/{id} -- Cancel a pending or in-progress crawl job.
Response:
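The cancellation body is not shown here; a plausible shape, consistent with the other responses in this document (field names are assumptions), would be:

```json
{
  "success": true,
  "status": "cancelled"
}
```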
Map (synchronous)¶
POST /v1/map -- Discover all URLs on a website via links and sitemaps.
Request:
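The request body is not reproduced here; at minimum it takes the target URL, in the same style as the other endpoints (any additional filter fields are not documented in this section):

```json
{
  "url": "https://example.com"
}
```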
Response:
```json
{
  "success": true,
  "data": {
    "urls": [
      { "url": "https://example.com/docs/intro", "lastmod": "2026-01-15" },
      { "url": "https://example.com/docs/api", "priority": "0.8" }
    ]
  }
}
```
Batch scrape (asynchronous)¶
POST /v1/batch/scrape -- Scrape multiple URLs concurrently.
Request:
```json
{
  "urls": ["https://example.com", "https://example.org"],
  "formats": ["markdown"],
  "onlyMainContent": true
}
```
Response (202 Accepted):
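The 202 body is not reproduced here; a representative payload, assuming the same shape as the crawl endpoint (field names and ID illustrative), would be:

```json
{
  "success": true,
  "id": "def456",
  "url": "https://example.com/v1/batch/scrape/def456"
}
```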
GET /v1/batch/scrape/{id} -- Poll batch job status. Same response shape as crawl status.
Download (synchronous)¶
POST /v1/download -- Download a document from a URL.
Request:
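The request body is not shown here; since the endpoint wraps the scrape pipeline, a minimal request presumably takes just the document URL (illustrative):

```json
{
  "url": "https://example.com/report.pdf"
}
```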
Uses the scrape pipeline internally, returning document metadata and content.
Operational endpoints¶
GET /health -- Health check.
GET /version -- Version information.
GET /openapi.json -- OpenAPI schema (generated by utoipa).
Async job lifecycle¶
Crawl and batch scrape operations follow this lifecycle:
```
POST request --> pending --> in_progress --> completed
                    |             |
                    +------.------+
                           |
                           v
                 failed / cancelled (via DELETE)
```

Terminal jobs (completed, failed, or cancelled) are evicted from the registry after the job TTL expires.
- pending -- Job accepted, not yet started.
- in_progress -- Worker is actively fetching pages.
- completed -- All pages fetched; data available in the status response.
- failed -- A fatal error occurred; the `error` field contains the message.
- cancelled -- User cancelled via DELETE. Only pending and in-progress jobs can be cancelled.
Job TTL and eviction¶
Jobs are stored in an in-memory DashMap registry. A background task runs every 60 seconds
and evicts jobs older than the configured maximum age.
| Setting | Value |
|---|---|
| Default job TTL | 1 hour |
| Eviction interval | 60 seconds |
Once evicted, a job ID returns 404. There is no persistent job storage; jobs are lost on server restart.
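The sweep described above can be sketched as a simplified, single-threaded model (the real server uses a concurrent DashMap and a background task; the `JobRegistry` type and its methods here are hypothetical):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Simplified model of the in-memory job registry.
struct JobRegistry {
    ttl: Duration,
    jobs: HashMap<String, Instant>, // job ID -> creation time
}

impl JobRegistry {
    fn new(ttl: Duration) -> Self {
        Self { ttl, jobs: HashMap::new() }
    }

    fn insert(&mut self, id: &str, created_at: Instant) {
        self.jobs.insert(id.to_string(), created_at);
    }

    // Drop every job older than the TTL; the server runs this every 60 s.
    fn evict_expired(&mut self, now: Instant) {
        self.jobs
            .retain(|_, created| now.duration_since(*created) < self.ttl);
    }

    fn contains(&self, id: &str) -> bool {
        self.jobs.contains_key(id) // a missing ID maps to HTTP 404
    }
}

fn main() {
    let mut registry = JobRegistry::new(Duration::from_secs(3600)); // 1-hour TTL
    let start = Instant::now();

    registry.insert("old-job", start);
    registry.insert("new-job", start + Duration::from_secs(3000));

    // Simulate the sweep firing 61 minutes after start.
    registry.evict_expired(start + Duration::from_secs(3660));

    assert!(!registry.contains("old-job")); // evicted -> 404
    assert!(registry.contains("new-job")); // still within TTL
}
```

Because the registry lives only in process memory, this model also makes the restart behaviour obvious: a fresh `HashMap` means every previously issued job ID returns 404.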
Error responses¶
All error responses follow the same structure:
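The structure itself is not reproduced here; a representative body, using a code from the table below (the exact nesting and field names are assumptions), would be:

```json
{
  "success": false,
  "error": {
    "code": "NOT_FOUND",
    "message": "job not found"
  }
}
```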
Error codes map to HTTP status codes:
| Error code | HTTP status | Condition |
|---|---|---|
| `BAD_REQUEST` | 400 | Invalid input (missing URL, bad config) |
| `NOT_FOUND` | 404 | Job or resource not found |
| `UNAUTHORIZED` | 401 | Authentication required |
| `FORBIDDEN` | 403 | Access denied |
| `WAF_BLOCKED` | 403 | Blocked by WAF/bot protection |
| `TIMEOUT` | 504 | Request or browser timed out |
| `RATE_LIMITED` | 429 | Rate limit exceeded |
| `SERVER_ERROR` | 502 | Upstream server error |
| `INTERNAL_ERROR` | 500 | Unexpected internal error |
Embedding the router¶
The create_router() function returns an Axum Router that can be embedded in your
own application or used with Tower's oneshot for testing. Pass a shared
Arc<CrawlEngine> to configure the crawl behaviour.