PT-2026-42667 · Pypi · Crawlee
Published: 2026-05-21 · Updated: 2026-05-26 · CVE-2026-46497
CVSS v4.0: 2.3 (Low)
Vector: AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N
Overview
- Vulnerability type: Blind SSRF
- Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
- Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).
Two-layer SSRF via sitemap-derived URLs:
1) Cross-host HTTP SSRF (base case, affects every HTTP client). Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.
2) Non-HTTP scheme SSRF (escalation, only CurlImpersonateHttpClient). Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this lets gopher://, file://, dict://, ftp://, etc., through.
Root cause
Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).
Two parts of the sitemap pipeline sidestepped this property in different ways:
1) Sitemap-derived URLs were enqueued without any host policy
SitemapRequestLoader took every <urlset><url><loc> entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither imposed any host check against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.
2) Nested sitemap fetching bypassed the Request chokepoint entirely
When XmlSitemapParser encountered <sitemapindex><sitemap><loc>…</loc></sitemap></sitemapindex>, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:

    async with http_client.stream(
        sitemap_url,
        method='GET',
        headers=SITEMAP_HEADERS,
        proxy_info=proxy_info,
        timeout=timeout,
    ) as response:
        ...

No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.
The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
Vulnerable paths
Layer 1 — cross-host HTTP (all HTTP clients)
- Source: an attacker-controlled sitemap that lists internal URLs under <urlset><url><loc> or <sitemapindex><sitemap><loc>, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
- Sink: the configured HTTP client issues GET requests against those URLs, either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.
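As a rough illustration of the layer 1 source, the sketch below parses a sitemap body with the standard library and flags every <loc> whose hostname differs from that of the parent sitemap. It is not Crawlee's code: the function name cross_host_entries and the sample URLs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def cross_host_entries(parent_url: str, sitemap_xml: str) -> list[str]:
    """Return <loc> values whose hostname differs from the parent sitemap's.

    Hypothetical helper for illustration only. The stdlib XML parser is used
    for brevity; it is not hardened against malicious XML.
    """
    parent_host = urlsplit(parent_url).hostname
    root = ET.fromstring(sitemap_xml)
    flagged = []
    # iter() walks the whole tree, so this covers both
    # <urlset><url><loc> and <sitemapindex><sitemap><loc>.
    for loc in root.iter(f'{SITEMAP_NS}loc'):
        url = (loc.text or '').strip()
        if urlsplit(url).hostname != parent_host:
            flagged.append(url)
    return flagged


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page</loc></url>
  <url><loc>http://internal.corp/admin</loc></url>
</urlset>"""

print(cross_host_entries('http://example.com/sitemap.xml', SAMPLE))
# -> ['http://internal.corp/admin']
```

Before 1.7.0, nothing equivalent to this comparison ran between parsing and enqueueing, which is why the second entry would have reached the HTTP client.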
Layer 2 — non-HTTP schemes (CurlImpersonateHttpClient only)
- Source: a nested <sitemap><loc> entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
- Sink: CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.
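The layer 2 sink can be closed by refusing anything other than http(s) before the URL string reaches the backend. The sketch below shows that idea with the standard library only. require_http_url is a hypothetical helper, not Crawlee's validate_http_url (which relies on Pydantic), and the sample URLs are made up.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https'}


def require_http_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise ValueError.

    Hypothetical helper for illustration; not part of Crawlee.
    """
    parts = urlsplit(url)  # urlsplit lower-cases the scheme for us
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        raise ValueError(f'refusing non-http(s) URL: {url!r}')
    return url


for candidate in (
    'https://example.com/sitemap-2.xml',
    'gopher://127.0.0.1:6379/_INFO',
    'file:///etc/passwd',
    'dict://127.0.0.1:11211/stats',
):
    try:
        require_http_url(candidate)
        print('allowed ', candidate)
    except ValueError as exc:
        print('rejected', exc)
```

Only the first candidate passes. Placing such a check immediately before the backend call means it applies no matter which code path produced the URL.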
Hardening in 1.7.0 was added at both producer and consumer ends — see Remediation.
Exploitation preconditions
- The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load / parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
- The attacker controls the body of a sitemap or robots.txt that the crawler fetches, typically by being the target site, or by getting a target site to publish a malicious sitemap.
- The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
- The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.
For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.
Impact
Layer 1 — cross-host HTTP (any client)
The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind: Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external. Direct impact is therefore mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).
Layer 2 — non-HTTP escalation (only CurlImpersonateHttpClient)
Under the affected client, attackers gain the libcurl scheme set:
- gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network. That is enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
- file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
- dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.
In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.
Remediation
Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:
- Producer-side filtering in the sitemap and robots.txt loaders (PR #1864). SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. This applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and it rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
- Consumer-side validation at the HTTP-client boundary (PR #1862). validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.
After these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.
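To show what the host policy means in practice, the following sketch compares the three strategy names mentioned in this advisory. It is a simplification and not Crawlee's filter_url: the helper names are hypothetical, and the 'same-domain' branch naively compares the last two hostname labels, whereas a real implementation would need a public-suffix-aware comparison.

```python
from urllib.parse import urlsplit


def _registrable_guess(host: str) -> str:
    # Naive assumption: the registrable domain is the last two labels.
    # This is wrong for suffixes such as co.uk; illustration only.
    return '.'.join(host.split('.')[-2:])


def is_allowed(parent_url: str, candidate_url: str, strategy: str = 'same-hostname') -> bool:
    """Decide whether a sitemap-derived URL may be enqueued (hypothetical helper)."""
    parent, candidate = urlsplit(parent_url), urlsplit(candidate_url)
    # Scheme check first: non-http(s) URLs are rejected under every strategy.
    if candidate.scheme not in ('http', 'https') or not candidate.hostname:
        return False
    if strategy == 'all':
        return True
    if strategy == 'same-hostname':
        return candidate.hostname == parent.hostname
    if strategy == 'same-domain':
        return _registrable_guess(candidate.hostname) == _registrable_guess(parent.hostname or '')
    raise ValueError(f'unknown strategy: {strategy!r}')


parent = 'https://sitemaps.example.com/index.xml'
for strategy in ('same-hostname', 'same-domain', 'all'):
    print(strategy, is_allowed(parent, 'https://www.example.com/page', strategy))
# same-hostname False
# same-domain True
# all True
```

In this sketch, 'all' keeps only the scheme check, so a loosened strategy gives up the cross-host protection of layer 1 while the non-HTTP rejection of layer 2 stays in place.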
Behaviour change for upgraders
SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now default to enqueue_strategy='same-hostname'. Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on sitemaps.example.com that points to content on www.example.com) must opt in explicitly with enqueue_strategy='same-domain' or enqueue_strategy='all'.
Finder credits
- @r0otsu
- @Yuremin (Zhengmin Yu)
- @FORIMOC
- @invoke1442 (Ethan Carter)
- @Arturo0x90 (Arturo Melgarejo)
Affected Products
Crawlee