SSRF Defense¶

Kreuzcrawl refuses outbound HTTP requests targeting internal infrastructure, cloud metadata endpoints, and unsupported schemes. The policy is on by default and applies to every crawl, scrape, sitemap fetch, robots.txt fetch, asset download, and link-following enqueue.

What is refused¶

Category	Ranges
Loopback	127.0.0.0/8, ::1
Private (RFC1918)	10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
Link-local	169.254.0.0/16 (incl. AWS/GCP metadata 169.254.169.254), fe80::/10
Unspecified	0.0.0.0/8
Multicast	224.0.0.0/4, ff00::/8
IPv6 unique-local	fc00::/7
Non-http/https schemes	file, ftp, gopher, …

DNS rebinding is mitigated: if a hostname resolves to a mix of public and denied IPs, the request is refused.

Each 30x Location is re-resolved and re-validated against the same policy before the next hop is taken. Up to SsrfPolicy::max_redirects (default 5) hops are followed.

Opting out¶

Two equivalent paths:

Environment variable — applies to every crawler in the process:

export KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1

Per-config builder — applies to a single CrawlConfig:

use kreuzcrawl::CrawlConfigBuilder;

let config = CrawlConfigBuilder::default()
    .allow_private_networks(true)
    .build();

When opt-out is on, the policy permits private IPs but still refuses non-http(s) schemes. The redirect cap and per-hop re-validation also stay in effect.

Host allowlists¶

Allowlist specific hosts while keeping the rest of the policy strict:

use kreuzcrawl::{CrawlConfigBuilder, HostMatcher};

let config = CrawlConfigBuilder::default()
    .ssrf_allowlist_host(HostMatcher::Suffix(".internal.kreuzberg.dev".into()))
    .ssrf_allowlist_host(HostMatcher::Cidr("10.42.0.0/16".into()))
    .build();

Matcher	Matches
`Exact("api.example.com")`	the exact hostname, case-insensitive
`Suffix(".example.com")`	`api.example.com`, `example.com` — but not `notexample.com`
`Cidr("10.42.0.0/16")`	resolved IPs inside the CIDR; also permits literal-IP URLs whose IP is inside

Allowlist entries permit access regardless of the default denylist. A mismatch between hostname allowlist and resolved IPs (e.g. Exact("svc.internal") resolves to a public IP) still permits the request — the allowlist trusts the host string.

What happens when a request is refused¶

Errors are typed:

pub enum CrawlError {
    SsrfPolicyViolation { url: String, reason: String },
    /* … */
}

url is the refused URL (original input or the redirect target that failed). reason is one of "loopback", "private_network", "link_local", "unique_local", "multicast", "unspecified", or "disallowed scheme: <scheme>".

The default retry policy classifies SsrfPolicyViolation as permanent — the crawler will not retry the request.

For link-following inside the crawl loop, refused targets are dropped from the queue and a tracing::warn! is emitted with structured fields (url, reason) so operators can see what was blocked.

Browser layer parity¶

The headless browser layer (kreuzcrawl-browser) shares the same policy core and applies it to every JS-initiated fetch() and every navigation. Two browser-specific extras are kept:

file:// is permitted in the browser process so test pages can use local fixtures.
A localhost/.localhost string short-circuit runs before DNS to mitigate rebinding through the browser's resolver.

This is the same mitigation chain that fixed GHSA-8v6v-g4rh-jmcm.

Configuration reference¶

See the SsrfPolicy rustdoc for the full type signature.

Edit this page on GitHub