
API Server

Kreuzcrawl includes a Firecrawl v1-compatible REST API server built on Axum. The server is gated behind the api Cargo feature.

Starting the server

CLI

kreuzcrawl serve --host 0.0.0.0 --port 3000
Flag     Default    Description
--host   0.0.0.0    IP address to bind to.
--port   3000       TCP port to listen on.

Programmatic

use std::sync::Arc;
use kreuzcrawl::{CrawlConfig, CrawlEngine};
use kreuzcrawl::api::serve;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = CrawlEngine::builder()
        .config(CrawlConfig::default())
        .build()?;

    // Blocks until the server shuts down.
    serve("0.0.0.0", 3000, Arc::new(engine)).await?;
    Ok(())
}

Or, from an async context, use the convenience wrapper that builds the engine internally:

use kreuzcrawl::api::serve_with_config;
use kreuzcrawl::CrawlConfig;

serve_with_config("0.0.0.0", 3000, CrawlConfig::default()).await?;

Middleware stack

The server applies the following middleware (outermost first):

  • Request ID -- Generates and propagates X-Request-Id headers (UUID v4).
  • Sensitive headers -- Redacts the Authorization header from logs.
  • Request timeout -- 5-minute hard cap per request (returns 408 on timeout).
  • Body size limit -- Maximum 10 MB request body.
  • CORS -- Permissive (allows any origin, method, and headers).
  • Compression -- Transparent response compression.
  • Panic recovery -- Catches panics and returns 500 instead of crashing.
  • Tracing -- HTTP request/response logging.
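"Outermost first" means the first layer listed touches the request first and the response last: each layer wraps everything beneath it. The toy sketch below demonstrates that wrapping order with plain closures; the names and wiring are illustrative, not the tower-http types the server actually uses.

```rust
// A handler takes a shared event log (standing in for a request) and
// returns a response string. Each layer wraps an inner handler.
type Handler = Box<dyn Fn(&mut Vec<String>) -> String>;

fn layer(name: &'static str, inner: Handler) -> Handler {
    Box::new(move |log| {
        log.push(format!("{name}: before")); // sees the request on the way in
        let resp = inner(log);
        log.push(format!("{name}: after")); // sees the response on the way out
        resp
    })
}

fn main() {
    let endpoint: Handler = Box::new(|log: &mut Vec<String>| {
        log.push("handler".into());
        "200 OK".into()
    });
    // Wrapping innermost-to-outermost mirrors the documented list in reverse:
    // "request-id" is outermost, so it runs first and finishes last.
    let stack = layer("request-id", layer("timeout", endpoint));

    let mut log = Vec::new();
    assert_eq!(stack(&mut log), "200 OK");
    assert_eq!(
        log,
        ["request-id: before", "timeout: before", "handler", "timeout: after", "request-id: after"]
    );
}
```

This is why, for example, panic recovery sits inside the tracing layer here: a caught panic still produces a 500 response that the outer layers can log and compress.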

Endpoints

Scrape (synchronous)

POST /v1/scrape -- Scrape a single URL and return extracted content.

Request:

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "includeTags": [".article"],
  "excludeTags": [".sidebar"],
  "timeout": 30000
}

Response:

{
  "success": true,
  "data": {
    "status_code": 200,
    "content_type": "text/html",
    "body_size": 15234,
    "metadata": { "title": "Example", "description": "..." },
    "markdown": { "content": "# Example\n\nHello world..." },
    "html": "<html>...</html>",
    "links": [...],
    "images": [...]
  }
}

Crawl (asynchronous)

POST /v1/crawl -- Start an asynchronous crawl job.

Request:

{
  "url": "https://example.com",
  "maxDepth": 2,
  "maxPages": 100,
  "includePaths": ["/docs/.*"],
  "excludePaths": ["/blog/.*"],
  "onlyMainContent": true
}

Response (202 Accepted):

{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000"
}

GET /v1/crawl/{id} -- Poll the status of a crawl job.

Response (in progress):

{
  "status": "in_progress",
  "total": 0,
  "completed": 12
}

Response (completed):

{
  "status": "completed",
  "total": 25,
  "completed": 25,
  "data": [
    { "url": "https://example.com/", "status_code": 200, "depth": 0, ... },
    { "url": "https://example.com/about", "status_code": 200, "depth": 1, ... }
  ]
}
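The status endpoint is designed for polling until a terminal state appears. A sketch of the client-side loop, with the HTTP GET abstracted as a closure so the control flow stays visible; poll_until_done and Status are illustrative names, not part of the crate.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status { Pending, InProgress, Completed, Failed, Cancelled }

// Call `fetch_status` (in practice: GET /v1/crawl/{id}, deserialize the
// "status" field) until the job reaches a terminal state or we give up.
fn poll_until_done(mut fetch_status: impl FnMut() -> Status, max_polls: usize) -> Option<Status> {
    for _ in 0..max_polls {
        match fetch_status() {
            // Non-terminal: sleep between polls in a real client.
            Status::Pending | Status::InProgress => continue,
            terminal => return Some(terminal),
        }
    }
    None // still running after max_polls; keep the job id and retry later
}

fn main() {
    // Simulate a job that completes on the third poll.
    let mut responses = vec![Status::Completed, Status::InProgress, Status::Pending];
    let result = poll_until_done(move || responses.pop().unwrap(), 10);
    assert_eq!(result, Some(Status::Completed));
}
```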

DELETE /v1/crawl/{id} -- Cancel a pending or in-progress crawl job.

Response:

{
  "success": true,
  "data": "cancelled"
}

Map (synchronous)

POST /v1/map -- Discover all URLs on a website via links and sitemaps.

Request:

{
  "url": "https://example.com",
  "limit": 100,
  "search": "docs"
}

Response:

{
  "success": true,
  "data": {
    "urls": [
      { "url": "https://example.com/docs/intro", "lastmod": "2026-01-15" },
      { "url": "https://example.com/docs/api", "priority": "0.8" }
    ]
  }
}

Batch scrape (asynchronous)

POST /v1/batch/scrape -- Scrape multiple URLs concurrently.

Request:

{
  "urls": ["https://example.com", "https://example.org"],
  "formats": ["markdown"],
  "onlyMainContent": true
}

Response (202 Accepted):

{
  "success": true,
  "id": "660e9500-f39c-52e5-b827-557766551111"
}

GET /v1/batch/scrape/{id} -- Poll batch job status. Same response shape as crawl status.

Download (synchronous)

POST /v1/download -- Download a document from a URL.

Request:

{
  "url": "https://example.com/report.pdf",
  "maxSize": 10485760
}

Uses the scrape pipeline internally, returning document metadata and content.
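The maxSize value above is bytes (10485760 is 10 MiB). One standard way to enforce such a cap without buffering an unbounded body is to read at most one byte past the limit and reject if that extra byte arrives; read_capped below is a hypothetical helper illustrating the pattern, not the server's actual code.

```rust
use std::io::Read;

// Read at most `max_size` bytes from `src`; error out if the source is
// larger. The `+ 1` lets us distinguish "exactly max_size" from "too big".
fn read_capped<R: Read>(src: R, max_size: u64) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    src.take(max_size + 1)
        .read_to_end(&mut buf)
        .map_err(|e| e.to_string())?;
    if buf.len() as u64 > max_size {
        return Err("document exceeds maxSize".into());
    }
    Ok(buf)
}

fn main() {
    // &[u8] implements Read, so a byte slice stands in for the HTTP body.
    assert_eq!(read_capped(&[1u8, 2, 3][..], 10).unwrap(), vec![1, 2, 3]);
    assert!(read_capped(&[0u8; 20][..], 10).is_err());
}
```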

Operational endpoints

GET /health -- Health check.

{ "status": "ok", "version": "0.1.0-rc.1" }

GET /version -- Version information.

{ "version": "0.1.0-rc.1" }

GET /openapi.json -- OpenAPI schema (generated by utoipa).

Async job lifecycle

Crawl and batch scrape operations follow this lifecycle:

POST request  -->  pending  -->  in_progress  -->  completed
                      |              |     |             |
                      |              |     v             v
                      |              |  failed    (evicted after TTL)
                      v              v
                   cancelled (via DELETE)
  1. pending -- Job accepted, not yet started.
  2. in_progress -- Worker is actively fetching pages.
  3. completed -- All pages fetched; data available in the status response.
  4. failed -- A fatal error occurred; the error field contains the message.
  5. cancelled -- User cancelled via DELETE. Only pending and in-progress jobs can be cancelled.
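The transition rules above reduce to a small state machine. The enum and helpers below are illustrative, not the crate's actual types.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum JobState { Pending, InProgress, Completed, Failed, Cancelled }

// Only pending and in-progress jobs accept DELETE /v1/crawl/{id}.
fn can_cancel(state: JobState) -> bool {
    matches!(state, JobState::Pending | JobState::InProgress)
}

// Terminal states never transition again; they only age out via the TTL.
fn is_terminal(state: JobState) -> bool {
    matches!(state, JobState::Completed | JobState::Failed | JobState::Cancelled)
}

fn main() {
    assert!(can_cancel(JobState::Pending));
    assert!(can_cancel(JobState::InProgress));
    assert!(!can_cancel(JobState::Completed));
    assert!(is_terminal(JobState::Cancelled));
}
```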

Job TTL and eviction

Jobs are stored in an in-memory DashMap registry. A background task runs every 60 seconds and evicts jobs older than the configured maximum age.

Setting              Value
Default job TTL      1 hour
Eviction interval    60 seconds

Once evicted, a job ID returns 404. There is no persistent job storage; jobs are lost on server restart.
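The sweep itself amounts to a retain over the registry. A minimal model using a plain HashMap keyed by job id, storing each job's age directly (the real registry is a concurrent DashMap storing creation times and comparing against the current time on each sweep):

```rust
use std::collections::HashMap;
use std::time::Duration;

// Drop every job whose age has reached the TTL; everything else survives
// until the next sweep.
fn evict_expired(jobs: &mut HashMap<String, Duration>, ttl: Duration) {
    jobs.retain(|_, age| *age < ttl);
}

fn main() {
    let mut jobs = HashMap::new();
    jobs.insert("fresh".to_string(), Duration::from_secs(60));
    jobs.insert("stale".to_string(), Duration::from_secs(2 * 3600));
    evict_expired(&mut jobs, Duration::from_secs(3600)); // 1-hour TTL
    assert!(jobs.contains_key("fresh"));
    assert!(!jobs.contains_key("stale")); // this job id now returns 404
}
```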

Error responses

All error responses follow the same structure:

{
  "success": false,
  "error": {
    "code": "BAD_REQUEST",
    "message": "url is required"
  }
}

Error codes map to HTTP status codes:

Error code       HTTP status   Condition
BAD_REQUEST      400           Invalid input (missing URL, bad config)
UNAUTHORIZED     401           Authentication required
FORBIDDEN        403           Access denied
WAF_BLOCKED      403           Blocked by WAF/bot protection
NOT_FOUND        404           Job or resource not found
RATE_LIMITED     429           Rate limit exceeded
INTERNAL_ERROR   500           Unexpected internal error
SERVER_ERROR     502           Upstream server error
TIMEOUT          504           Request or browser timed out
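For clients that branch on the machine-readable code rather than the HTTP status, the table reduces to a lookup; http_status below is a hypothetical client-side helper, not part of the crate.

```rust
// Map a Kreuzcrawl error code string to its documented HTTP status.
fn http_status(code: &str) -> u16 {
    match code {
        "BAD_REQUEST" => 400,
        "UNAUTHORIZED" => 401,
        "FORBIDDEN" | "WAF_BLOCKED" => 403,
        "NOT_FOUND" => 404,
        "RATE_LIMITED" => 429,
        "SERVER_ERROR" => 502,
        "TIMEOUT" => 504,
        // INTERNAL_ERROR, plus any code this client doesn't recognize.
        _ => 500,
    }
}

fn main() {
    assert_eq!(http_status("WAF_BLOCKED"), 403);
    assert_eq!(http_status("TIMEOUT"), 504);
    assert_eq!(http_status("SOMETHING_NEW"), 500);
}
```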

Embedding the router

The create_router() function returns an Axum Router that can be embedded in your own application or used with Tower's oneshot for testing. Pass a shared Arc<CrawlEngine> to configure the crawl behaviour.