Skip to content

OpenTelemetry Observability

Kreuzcrawl emits W3C-standard traces and metrics via OpenTelemetry, aligned with semantic conventions and kreuzberg-cloud's existing naming. Integrate your observability backend (Jaeger, Prometheus, Grafana, Datadog) to track crawl performance, identify bottlenecks, and diagnose WAF issues.

Why OpenTelemetry

OpenTelemetry provides a vendor-neutral API for instrumenting applications. Kreuzcrawl emits distributed traces (spans with W3C TraceContext headers) and metrics (counters, histograms, up-down counters) compatible with the W3C TraceContext specification. This enables end-to-end observability: trace a single request through your service layer into kreuzcrawl's engine, see exactly where time is spent, and correlate with WAF signals across your infrastructure.

Quick start with init_otlp

Enable the telemetry-init feature and call the one-line helper to attach both tracer and meter providers:

use kreuzcrawl::telemetry::{TelemetryConfig, init_otlp};
use kreuzcrawl::{batch_crawl, create_engine};

let _guard = init_otlp(TelemetryConfig {
    service_name: "my-crawler".into(),
    service_version: Some(env!("CARGO_PKG_VERSION").into()),
    otlp_endpoint: "http://localhost:4317".into(),
    resource_attrs: vec![
        ("deployment.environment".into(), "staging".into()),
        ("service.namespace".into(), "web-team".into()),
    ],
})?;

// Now run your crawl — spans and metrics will export to the collector
let engine = create_engine(None)?;
let results = batch_crawl(&engine, seeds).await?;
// Guard flushes on drop
drop(_guard);

The TelemetryGuard flushes all pending spans and metrics on drop, ensuring clean shutdown.

Bring-your-own SDK

If your application already initializes OpenTelemetry (as in kreuzberg-cloud's crates/observability/src/telemetry.rs:135-152), kreuzcrawl automatically emits spans and metrics to the global tracer and meter without any init call. Dependency versions must align:

Crate Version
opentelemetry 0.32 (features: trace, metrics)
opentelemetry_sdk 0.32 (features: rt-tokio)
opentelemetry-otlp 0.32 (features: grpc-tonic, trace, metrics)
opentelemetry-semantic-conventions 0.32
tracing-opentelemetry 0.33
tracing-subscriber 0.3.23

Mixing major versions will break propagator injection and cause spans to drop. Verify your Cargo.lock matches the table above.

W3C TraceContext propagation

Kreuzcrawl preserves W3C TraceContext headers across HTTP requests, making it easy to correlate crawl activity with upstream services. Rust callers can use:

  • with_traceparent(traceparent: &str, callback: impl Fn() -> R) -> R — execute a callback with the given trace context active, so child spans become descendants.
  • current_traceparent() -> Option<String> — extract the current trace context as a W3C traceparent header value (format: 00-<trace_id>-<span_id>-<flags>).

Generated language bindings do not expose these helpers today because with_traceparent requires a Rust callback. Propagate trace context at the host-service layer with that language's OpenTelemetry SDK, and pass request headers normally into services that wrap kreuzcrawl.

use kreuzcrawl::telemetry::{current_traceparent, with_traceparent};

let incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
let _traceparent = with_traceparent(incoming, || {
    current_traceparent()
});

Span catalogue

Every span is emitted via tracing::info_span! and automatically bridged to OpenTelemetry by the tracing_opentelemetry::layer(). Attribute keys use W3C semantic conventions where available (from opentelemetry_semantic_conventions), or kreuzcrawl-specific extensions (prefixed crawl.).

Span Name Site Attributes
crawl.engine.start CrawlEngine::build entry crawl.seed_count, crawl.max_depth, crawl.max_pages, crawl.strategy, crawl.browser_mode
crawl.engine.batch batch_crawl() entry crawl.seed_count
crawl.loop.iteration Crawl loop dequeue crawl.depth, crawl.frontier_size, crawl.pages_completed
crawl.page.discover Link extraction (per discovered URL) url.full, url.domain, crawl.parent_url, crawl.depth, crawl.link_type
crawl.page.fetch Tower HTTP layer http.request.method, url.full, server.address, http.response.status_code, http.response.body_size, crawl.tier, crawl.final_url, crawl.mime_type
crawl.browser.session Browser fetch entry (chromiumoxide or native) crawl.browser.backend, crawl.browser.session_id, crawl.pages_rendered
crawl.robots.check robots.txt predicate url.domain, crawl.host, crawl.allowed
crawl.document.download Document materialization url.full, crawl.mime_type, crawl.size_bytes

Metric catalogue

Kreuzcrawl emits 10 OTel instruments via kreuzcrawl::telemetry::metrics::registry(). Names align with kreuzberg-cloud's existing emission points, so cloud can consolidate duplicate metrics in a follow-up.

Metric Name Instrument Labels Description
crawl_pages_total Counter (u64) status{ok, http_error, timeout, blocked} Pages fetched, partitioned by terminal status
crawl_documents_discovered_total Counter (u64) mime_type Non-HTML documents discovered (PDF, DOCX, etc.)
crawl_robots_blocked_total Counter (u64) Requests rejected by robots.txt
crawl_waf_blocks_total Counter (u64) vendor{cloudflare, datadome, …} WAF challenges detected, per vendor
crawl_backend_escalations_total Counter (u64) from_tier, to_tier, reason Tier escalations (e.g., HTTP→Browser)
crawl_bypass_requests_total Counter (u64) vendor, mode{managed, byo} Requests routed through bypass provider
crawl_bypass_failures_total Counter (u64) vendor, reason Bypass provider failures
crawl_duration_seconds Histogram (f64) output_mode, status End-to-end crawl duration
crawl_pages_duration_seconds Histogram (f64) host Per-page fetch duration
crawl_browser_sessions_active Up-down Counter (i64) Active headless-browser sessions

Grafana panel example

Copy this JSON into a Grafana dashboard to visualize crawl success rate:

{
  "title": "Crawl Success Rate",
  "targets": [
    {
      "expr": "rate(crawl_pages_total{status=\"ok\"}[5m])",
      "legendFormat": "OK (pages/sec)",
      "refId": "A"
    },
    {
      "expr": "rate(crawl_pages_total{status=\"http_error\"}[5m])",
      "legendFormat": "HTTP Errors",
      "refId": "B"
    },
    {
      "expr": "rate(crawl_pages_total{status=\"timeout\"}[5m])",
      "legendFormat": "Timeouts",
      "refId": "C"
    },
    {
      "expr": "rate(crawl_pages_total{status=\"blocked\"}[5m])",
      "legendFormat": "WAF Blocked",
      "refId": "D"
    }
  ],
  "type": "graphPanel",
  "yaxes": [
    {
      "format": "short",
      "label": "Pages per second"
    }
  ]
}

Stack this with crawl_pages_duration_seconds_bucket (latency percentiles) to correlate performance with success rate.

Cloud alignment note

Kreuzberg-cloud currently emits ~12 duplicate crawl_* metrics and one manual crawl.engine.batch span at services/worker/src/observability/metrics.rs:131-272 and crawl_handler.rs:281-286. Once kreuzcrawl's observability lands, cloud will delete its duplicate emitters in a follow-up PR. No user-facing changes — cloud's observability dashboards and alerts continue to work unchanged.

Edit this page on GitHub