Tower Middleware¶
Kreuzcrawl's HTTP pipeline is built on the Tower Service abstraction. Each middleware layer wraps the next, forming a composable stack that processes CrawlRequest values and produces CrawlResponse results.
Service Contract¶
All layers operate on the same request/response types:
pub struct CrawlRequest {
pub url: String,
pub headers: HashMap<String, String>,
}
pub struct CrawlResponse {
pub status: u16,
pub content_type: String,
pub body: String,
pub body_bytes: Vec<u8>,
pub headers: HashMap<String, Vec<String>>,
}
Every service in the stack implements Service<CrawlRequest, Response = CrawlResponse, Error = CrawlError>.
Layer Composition Order¶
The stack is assembled in CrawlEngine::build_service using Tower's ServiceBuilder. Layers are listed outermost to innermost -- a request passes through them top-to-bottom, and the response bubbles back up:
graph TD
A[CrawlRequest] --> B["CrawlTracingLayer (feature: tracing)"]
B --> C[PerDomainRateLimitLayer]
C --> D[CrawlCacheLayer]
D --> E[UaRotationLayer]
E --> F[HttpFetchService]
F --> G[CrawlResponse]
CrawlTracingLayeris only included when thetracingfeature is enabled. All other layers are always present.
1. CrawlTracingLayer (feature: tracing)¶
Wraps each request in a tracing::info_span! with OpenTelemetry-compatible fields:
otel.kind = "client"url.full-- the full request URLserver.address-- the domain extracted from the URLhttp.request.method = "GET"http.response.status_code-- filled after the response arriveshttp.response.body.size-- response body length
Emits an info event on fetch completion with status code and body size.
2. PerDomainRateLimitLayer¶
Delegates to the engine's RateLimiter trait implementation. Before forwarding a request, it calls rate_limiter.acquire(domain) which blocks until the domain's rate limit window permits another request. After receiving a response, it calls rate_limiter.record_response(domain, status) to feed adaptive backoff logic.
The default PerDomainThrottle implementation:
- Maintains per-domain state: last request time, crawl delay, and consecutive success count.
- Enforces a configurable minimum delay between requests to the same domain (default: 200ms).
- Respects
robots.txtcrawl-delay viaset_crawl_delay. - Adaptive backoff on 429: doubles the delay on rate-limit responses, up to a 60-second maximum.
- Decay on success: after 5 consecutive successful responses, halves the backoff delay (floored at the robots.txt delay or the default).
// Domain state tracked per host
struct DomainState {
last_request: Instant,
crawl_delay: Option<Duration>, // adaptive backoff
robots_delay: Option<Duration>, // floor from robots.txt
consecutive_success: u32,
}
3. CrawlCacheLayer¶
Intercepts requests and checks the CrawlCache trait for cached responses before forwarding to the inner service.
Cache hit: returns the cached CrawlResponse immediately, reconstructing headers from stored etag and last_modified values. No HTTP request is made.
Cache miss: forwards to the inner service, then stores successful responses (status 200-299) in the cache as CachedPage entries:
pub struct CachedPage {
pub url: String,
pub status_code: u16,
pub content_type: String,
pub body: String,
pub etag: Option<String>,
pub last_modified: Option<String>,
pub cached_at: u64, // Unix timestamp
}
The default NoopCache always misses, effectively disabling caching. For production use, DiskCache provides filesystem-backed caching with:
- blake3 key hashing for deterministic, collision-resistant filenames.
- TTL-based expiry (default: 1 hour).
- LRU eviction when the entry count exceeds the configured maximum (default: 10,000).
- Atomic writes via temp file + rename.
4. UaRotationLayer¶
Injects a rotating User-Agent header into each request using an AtomicUsize counter:
let idx = self.index.fetch_add(1, Ordering::Relaxed) % self.user_agents.len();
req.headers.insert("user-agent", self.user_agents[idx].clone());
The rotation counter is shared across all service rebuilds (the UaRotationLayer is stored on CrawlEngine and cloned into each stack build). If the user-agent list is empty, no header is injected and the HttpFetchService falls back to the default kreuzcrawl/{version} user-agent.
5. HttpFetchService¶
The innermost service that performs the actual HTTP request using reqwest::Client. Responsibilities:
- User-Agent fallback: if no UA header was set by the rotation layer, uses
config.user_agentor the defaultkreuzcrawl/{version}string. - Authentication: supports
Basic,Bearer, and customHeaderauth viaAuthConfig. - Custom headers: merges
config.custom_headersand request-level headers from middleware. - Error classification: maps HTTP status codes to typed
CrawlErrorvariants (401->Unauthorized,403->Forbidden/WafBlocked,404->NotFound,429->RateLimited,5xx->ServerError/BadGateway). - WAF detection: on 403 responses, inspects the
Serverheader and body for WAF signatures (Cloudflare, Akamai, etc.). - Content-length validation: detects truncated responses by comparing
Content-Lengthheader to actual body size. - Retry with exponential backoff: retries on server errors, rate limits, and configurable status codes. Delay starts at 100ms and doubles per attempt (
100ms * 2^attempt).
Extending the Stack¶
Custom middleware can be added by implementing Tower's Layer and Service traits for the CrawlRequest/CrawlResponse types. However, the current stack is assembled internally by CrawlEngine::build_service and is not directly extensible from the public API. Customization is achieved through the trait system (custom RateLimiter, CrawlCache, etc.) rather than adding new Tower layers.