Skip to content

SSRF Defense

Kreuzcrawl refuses outbound HTTP requests targeting internal infrastructure, cloud metadata endpoints, and unsupported schemes. The policy is on by default and applies to every crawl, scrape, sitemap fetch, robots.txt fetch, asset download, and link-following enqueue.

What is refused

Category Ranges
Loopback 127.0.0.0/8, ::1
Private (RFC1918) 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
Link-local 169.254.0.0/16 (incl. AWS/GCP metadata 169.254.169.254), fe80::/10
Unspecified 0.0.0.0/8
Multicast 224.0.0.0/4, ff00::/8
IPv6 unique-local fc00::/7
Non-http/https schemes file, ftp, gopher, …

DNS rebinding is mitigated: if a hostname resolves to a mix of public and denied IPs, the request is refused.

Each 30x Location is re-resolved and re-validated against the same policy before the next hop is taken. Up to SsrfPolicy::max_redirects (default 5) hops are followed.

Opting out

Two equivalent paths:

Environment variable — applies to every crawler in the process:

export KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1

Per-config builder — applies to a single CrawlConfig:

use kreuzcrawl::CrawlConfigBuilder;

let config = CrawlConfigBuilder::default()
    .allow_private_networks(true)
    .build();

When opt-out is on, the policy permits private IPs but still refuses non-http(s) schemes. The redirect cap and per-hop re-validation also stay in effect.

Host allowlists

Allowlist specific hosts while keeping the rest of the policy strict:

use kreuzcrawl::{CrawlConfigBuilder, HostMatcher};

let config = CrawlConfigBuilder::default()
    .ssrf_allowlist_host(HostMatcher::Suffix(".internal.kreuzberg.dev".into()))
    .ssrf_allowlist_host(HostMatcher::Cidr("10.42.0.0/16".into()))
    .build();
Matcher Matches
Exact("api.example.com") the exact hostname, case-insensitive
Suffix(".example.com") api.example.com, example.com — but not notexample.com
Cidr("10.42.0.0/16") resolved IPs inside the CIDR; also permits literal-IP URLs whose IP is inside

Allowlist entries permit access regardless of the default denylist. A mismatch between hostname allowlist and resolved IPs (e.g. Exact("svc.internal") resolves to a public IP) still permits the request — the allowlist trusts the host string.

What happens when a request is refused

Errors are typed:

pub enum CrawlError {
    SsrfPolicyViolation { url: String, reason: String },
    /* … */
}

url is the refused URL (original input or the redirect target that failed). reason is one of "loopback", "private_network", "link_local", "unique_local", "multicast", "unspecified", or "disallowed scheme: <scheme>".

The default retry policy classifies SsrfPolicyViolation as permanent — the crawler will not retry the request.

For link-following inside the crawl loop, refused targets are dropped from the queue and a tracing::warn! is emitted with structured fields (url, reason) so operators can see what was blocked.

Browser layer parity

The headless browser layer (kreuzcrawl-browser) shares the same policy core and applies it to every JS-initiated fetch() and every navigation. Two browser-specific extras are kept:

  • file:// is permitted in the browser process so test pages can use local fixtures.
  • A localhost/.localhost string short-circuit runs before DNS to mitigate rebinding through the browser's resolver.

This is the same mitigation chain that fixed GHSA-8v6v-g4rh-jmcm.

Configuration reference

See the SsrfPolicy rustdoc for the full type signature.

Edit this page on GitHub