PT-2026-50472 · PyPI · vLLM

Published

2026-06-17

Updated

2026-06-17

CVE-2026-53923

CVSS v4.0

5.3

Medium

Vector

AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N

Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, which constitutes information disclosure.

Root Cause

The to_cuda_ggml_t function pointer type at ggml-common.h:1067 declares its element count parameter as int (32-bit):

```cpp
using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,       // 32-bit
                                cudaStream_t stream);
```

All dequantize kernel functions (dequantize_block_cuda, dequantize_row_q2_K_cuda, etc. in dequantize.cuh) inherit this int k parameter and use it to compute the kernel launch grid size:

```cpp
static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
  const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
  dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
```

In ggml_dequantize() at gguf_kernel.cu:85, the caller passes m * n (an int64_t product) to this int k parameter:

```cpp
at::Tensor DW = torch::empty({m, n}, options);  // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream); // line 85: m*n truncated to int
```

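When m * n > INT_MAX, the value is truncated on conversion to int, so k is smaller than the actual tensor size. For example, on typical platforms where the conversion wraps modulo 2^32, m * n = 2^32 + 65536 becomes k = 65536. The kernel processes k elements. The remaining (m * n) - k elements in DW are never written and contain stale GPU memory.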
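The truncation arithmetic can be checked on the host without a GPU. The following is a minimal sketch, not vLLM code: `truncate_to_int` and `unwritten_elements` are hypothetical helper names, and the sketch assumes a 32-bit `int` with wrapping (modular) narrowing, which C++20 guarantees and mainstream compilers implement for earlier standards.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: mimics passing an int64_t product to an `int k`
// parameter. Out-of-range narrowing wraps modulo 2^32 (guaranteed since
// C++20; implementation-defined but identical on mainstream compilers before).
inline int truncate_to_int(std::int64_t numel) {
    return static_cast<int>(numel);
}

// Hypothetical helper: number of output elements left unwritten when the
// kernel only sees the truncated count. A non-positive truncated count is
// treated as "nothing processed" (an assumption of this sketch).
inline std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    const std::int64_t numel = m * n;        // full 64-bit element count
    const int k = truncate_to_int(numel);    // what the kernel would receive
    const std::int64_t processed = k > 0 ? k : 0;
    return numel - processed;
}
```

Under these assumptions, a 65536 x 65537 tensor yields k = 65536 and leaves 4,294,967,296 elements unwritten, while a 65536 x 65536 tensor truncates to k = 0. A product such as 50000 x 50000 wraps to a negative value.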

This is a single root cause (the int type on the k parameter in to_cuda_ggml_t) with a single fix: change int k to int64_t k. All dequantize functions inherit this type through the same typedef.
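To illustrate what widening k to 64 bits means for the grid-size calculation, here is a host-side sketch. It is not the actual patch: `num_blocks_for` is a hypothetical name, and the block size of 256 is an assumed stand-in for CUDA_DEQUANTIZE_BLOCK_SIZE.

```cpp
#include <cassert>
#include <cstdint>

// Assumed stand-in for CUDA_DEQUANTIZE_BLOCK_SIZE; the real value may differ.
constexpr std::int64_t kDequantizeBlockSize = 256;

// Hypothetical helper: the same ceiling division as the original launcher,
// but carried out entirely in 64-bit arithmetic so that an element count
// above INT_MAX is neither truncated nor overflowed in the rounding step.
inline std::int64_t num_blocks_for(std::int64_t k) {
    assert(k >= 0);
    const std::int64_t per_block = 2 * kDequantizeBlockSize;
    return (k + per_block - 1) / per_block;
}
```

With these assumed values, 2^32 elements map to 8,388,608 blocks, which is within the x-dimension grid limit of current CUDA devices. A complete fix would presumably also need 64-bit index arithmetic inside the kernels themselves, which this sketch does not cover.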

Affected Functions

All in csrc/quantization/gguf/gguf_kernel.cu:

| Function | Line | Allocation | Info Disclosure? |
|---|---|---|---|
| `ggml_dequantize` | 74 | `torch::empty({m, n})` at line 80 | Yes: `m*n` truncated to `int k` at line 85 |
| `ggml_mul_mat_vec_a8` | 91 | `torch::empty({vecs, row})` at line 99 | Yes: `int col = X.sizes()[1]` at line 94 |
| `ggml_mul_mat_a8` | 207 | `torch::empty({batch, row})` at line 215 | Yes: `int col = X.sizes()[1]` at line 210 |
| `ggml_moe_a8` | 279 | `torch::empty({tokens*top_k, row})` at line 289 | Yes: `int col = X.sizes()[1]` at line 285 |

All four functions allocate output tensors with torch::empty (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.

ggml_moe_a8_vec (line 382) uses torch::zeros instead of torch::empty, so it is not affected by the info disclosure variant.

Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:

1. The output tensor DW is allocated with torch::empty, so the buffer contains whatever was previously in that GPU memory region.
2. The dequantize kernel fills only a truncated portion of the buffer.
3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests.
4. The contaminated tensor proceeds through the model computation.
5. No error or warning is generated; the partial fill is silent.

This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.
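The partial-fill sequence described in this section can be modelled on the host with an ordinary buffer. This is a simplified simulation, not GPU code: all names are hypothetical, and a sentinel value stands in for another request's leftover data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel standing in for stale data left behind by an earlier request.
constexpr float kStale = -12345.0f;

// Models a torch::empty-style allocation that reuses a dirty buffer:
// nothing is cleared, so every element starts out holding stale data.
inline std::vector<float> allocate_uninitialized(std::size_t numel) {
    return std::vector<float>(numel, kStale);
}

// Models a kernel that was told to process only `k` elements.
inline void partial_dequantize(std::vector<float>& out, std::size_t k) {
    assert(k <= out.size());
    for (std::size_t i = 0; i < k; ++i) {
        out[i] = 1.0f;  // placeholder for a dequantized weight
    }
}

// Counts elements that still hold the stale sentinel after the "kernel" ran.
inline std::size_t count_stale(const std::vector<float>& out) {
    std::size_t stale = 0;
    for (float v : out) {
        if (v == kStale) ++stale;
    }
    return stale;
}
```

Filling only 300 of 1000 elements leaves 700 entries holding the sentinel, and nothing in the flow signals an error. A zero-initialized allocation (the torch::zeros approach used by ggml_moe_a8_vec) would leave zeros there instead of stale data.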

Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., a matrix with shape [65536, 65536] gives m * n = 4,294,967,296). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.
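As a defence-in-depth illustration, a loader could reject tensor shapes whose element count does not fit the kernel's 32-bit parameter. The sketch below is an assumption about how such a check might look, not part of vLLM or of the referenced fix; `numel_fits_int` is a hypothetical name, and a 32-bit `int` is assumed.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical guard: returns true when an m x n tensor's element count
// fits in a 32-bit int. Division is used instead of multiplication so the
// check itself cannot overflow, even for attacker-chosen dimensions.
inline bool numel_fits_int(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    if (m == 0 || n == 0) return true;  // empty tensor: count is zero
    return m <= static_cast<std::int64_t>(INT_MAX) / n;
}
```

The [65536, 65536] shape from the example above fails this check, while ordinary weight shapes such as [4096, 4096] pass.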

Fix

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971

Fix

Information Disclosure

Weakness Enumeration

CWE-681

CWE-200

Related Identifiers

CVE-2026-53923

GHSA-5JV2-G5WQ-CMR4

Affected Products

vLLM
