PT-2026-50472 · Pypi · Vllm

Published: 2026-06-17 · Updated: 2026-06-17 · CVE-2026-53923

CVSS v4.0: 5.3 (Medium)

Vector: AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N

Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.

Root Cause

The to_cuda_ggml_t function pointer type at ggml-common.h:1067 declares its element count parameter as int (32-bit):
```cpp
using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,       // 32-bit
                                cudaStream_t stream);
```
All dequantize kernel functions (dequantize_block_cuda, dequantize_row_q2_K_cuda, etc. in dequantize.cuh) inherit this int k parameter and use it to compute the kernel launch grid size:
```cpp
static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
  const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
  dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
```
In ggml_dequantize() at gguf_kernel.cu:85, the caller passes m * n (an int64_t product) to this int k parameter:
```cpp
at::Tensor DW = torch::empty({m, n}, options);  // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream); // line 85: m*n truncated to int
```
When m * n > INT_MAX, the truncated k is smaller than the actual tensor size (it can even be zero or negative). The kernel processes at most k elements, so the remaining elements of DW are never written and retain stale GPU memory.
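The arithmetic behind the truncation can be sketched in plain C++. This is an illustration rather than vLLM code: narrow_to_int and unwritten_elements are hypothetical helper names, the sketch assumes a 32-bit int and that m * n itself fits in int64_t, and it treats a non-positive truncated count as "nothing written", which is an assumption about kernel behavior that the advisory does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Mimics passing an int64_t product to a parameter declared as `int k`.
// The conversion keeps only the low 32 bits (guaranteed since C++20, and
// the behavior of mainstream compilers before that).
inline int narrow_to_int(std::int64_t count) {
    return static_cast<int>(count);
}

// Number of output elements that a kernel bounded by the truncated count
// would never write. Assumption: a non-positive count writes nothing.
inline std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    const std::int64_t total = m * n;          // full 64-bit element count
    const int k = narrow_to_int(total);        // what the kernel actually sees
    const std::int64_t written = k > 0 ? k : 0;
    return total - written;
}
```

Under these assumptions, a [65536, 65536] tensor has 4,294,967,296 elements, narrow_to_int returns 0 for that count, and unwritten_elements(65536, 65536) reports the whole tensor as unwritten.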
This is a single root cause -- the int type on the k parameter in to_cuda_ggml_t -- with a single fix: change int k to int64_t k. All dequantize functions inherit this type through the same typedef.
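A minimal host-side sketch of why widening the count matters for the launch configuration. It reuses the block-count formula from dequantize_block_cuda, templated on the count type so that the 32-bit and 64-bit variants can be compared side by side. The name num_blocks_for and the block size of 256 are assumptions made for illustration; the advisory does not state the value of CUDA_DEQUANTIZE_BLOCK_SIZE, and this is not the code of the actual patch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed block size, for illustration only.
constexpr int kAssumedBlockSize = 256;

// Same ceiling division as in dequantize_block_cuda, parameterized on the
// integer type used for the element count.
template <typename Count>
Count num_blocks_for(Count k) {
    assert(k >= 0);
    constexpr Count per_block = static_cast<Count>(2 * kAssumedBlockSize);
    return (k + per_block - 1) / per_block;
}
```

For a count of 4,294,967,296, num_blocks_for<int> applied to the truncated value (0) yields 0 blocks, while num_blocks_for<std::int64_t> applied to the full count yields 8,388,608 blocks.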

Affected Functions

All in csrc/quantization/gguf/gguf_kernel.cu:

| Function | Line | Allocation | Info Disclosure? |
|---|---|---|---|
| ggml_dequantize | 74 | torch::empty({m, n}) at line 80 | Yes -- m*n truncated to int k at line 85 |
| ggml_mul_mat_vec_a8 | 91 | torch::empty({vecs, row}) at line 99 | Yes -- int col = X.sizes()[1] at line 94 |
| ggml_mul_mat_a8 | 207 | torch::empty({batch, row}) at line 215 | Yes -- int col = X.sizes()[1] at line 210 |
| ggml_moe_a8 | 279 | torch::empty({tokens*top_k, row}) at line 289 | Yes -- int col = X.sizes()[1] at line 285 |
All four functions allocate output tensors with torch::empty (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.
ggml_moe_a8_vec (line 382) uses torch::zeros instead of torch::empty, so it is not affected by the info disclosure variant.

Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:
  1. The output tensor DW is allocated with torch::empty -- the buffer contains whatever was previously in that GPU memory region
  2. The dequantize kernel fills only a truncated portion of the buffer
  3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests
  4. The contaminated tensor proceeds through the model computation
  5. No error or warning is generated -- the partial fill is silent
This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.
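The sequence above can be imitated on the CPU with a small model. This is a simplified analogy under stated assumptions, not GPU code: the buffer is pre-filled with a marker value (kStale) to stand in for data left behind by an earlier request, allocate_uninitialized stands in for torch::empty, and dequantize_first_k stands in for a kernel that stops after k elements. All of these names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marker standing in for another request's leftover data (hypothetical).
constexpr float kStale = -1.0f;

// Stand-in for torch::empty: the returned buffer still holds old contents.
inline std::vector<float> allocate_uninitialized(std::size_t total) {
    return std::vector<float>(total, kStale);
}

// Stand-in for the dequantize kernel: writes only the first k elements.
inline void dequantize_first_k(std::vector<float>& out, std::size_t k) {
    assert(k <= out.size());
    for (std::size_t i = 0; i < k; ++i) {
        out[i] = 0.5f;  // placeholder for a dequantized weight
    }
}

// Counts elements that still hold the stale marker after the "kernel" ran.
inline std::size_t count_stale(const std::vector<float>& out) {
    std::size_t stale = 0;
    for (float v : out) {
        if (v == kStale) ++stale;
    }
    return stale;
}
```

Allocating 16 elements and filling only the first 10 leaves 6 elements holding the marker, with no error raised at any point, which mirrors the silent partial fill described in step 5.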

Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., a matrix with shape [65536, 65536] gives m * n = 4,294,967,296 = 2^32, which truncates to 0 as a 32-bit int). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.
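One way to reason about which shapes trigger the truncation is an overflow-safe check on the element count. The sketch below is a hypothetical defensive check, not the fix that was merged and not an existing vLLM function: element_count_fits_int and checked_element_count are illustrative names, and the check divides instead of multiplying so that it cannot overflow itself.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// True only if m * n is representable as a (32-bit) int.
// Uses division so the test itself never overflows.
inline bool element_count_fits_int(std::int64_t m, std::int64_t n) {
    if (m < 0 || n < 0) return false;
    if (m == 0 || n == 0) return true;
    return m <= static_cast<std::int64_t>(INT_MAX) / n;
}

// Narrows to int only after the check has passed.
inline int checked_element_count(std::int64_t m, std::int64_t n) {
    assert(element_count_fits_int(m, n));
    return static_cast<int>(m * n);
}
```

With this check, a [65536, 65536] shape is rejected while a [4096, 4096] shape passes; the largest square shape that still fits is [46340, 46340].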

Fix

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971

Information Disclosure

Related Identifiers

CVE-2026-53923
GHSA-5JV2-G5WQ-CMR4

Affected Products

vLLM