CVE-2026-7482: ‘Bleeding Llama’ Vulnerability Exposes 300,000 Ollama Instances
A critical heap out-of-bounds read in Ollama's GGUF parser allows unauthenticated remote attackers to exfiltrate API keys, environment variables, and conversation history from process memory.

On May 12, 2026, Cyera disclosed CVE-2026-7482, dubbed "Bleeding Llama," a critical vulnerability with a CVSS score of 9.1 residing in Ollama's GGUF format loader. The flaw allows an unauthenticated remote attacker to upload a maliciously crafted model file to leak the process's entire memory, potentially exfiltrating API keys, environment variables, and private conversation history. The risk is compounded by widespread exposure: approximately 300,000 Ollama servers are currently reachable via the public internet, frequently configured to listen on 0.0.0.0 without active authentication.
- The vulnerability is a heap out-of-bounds read (CVSS 9.1) in the Ollama GGUF parser, stemming from the use of Go's unsafe package within the WriteTo() function.
- Attackers can trigger the leak by uploading a crafted GGUF file with an inflated tensor shape via the /api/create endpoint; the quantization process then reads past buffer boundaries.
- Leaked data is exfiltrated through the /api/push endpoint to an attacker-controlled registry, requiring no credentials.
- An estimated 300,000 Ollama servers are internet-exposed; all versions prior to 0.17.1 are confirmed vulnerable.
"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more" - Dor Attias, Cyera security researcher
AI Model Formats: The New Enterprise Attack Vector
Until now, the security of local AI engines has focused primarily on output filtering and system policies. Bleeding Llama demonstrates that the critical path lies in binary format parsing. A crafted GGUF file bypasses logical controls because the damage occurs before the model generates a single response. Organizations utilizing Ollama to process proprietary code, legal contracts, or internal data must now treat every model file as a potential binary exploit rather than a simple package of neural weights.
Consequently, security teams must extend their threat models from the application layer to the weight supply chain. A model downloaded from a public repository or shared internally could contain altered metadata designed to trigger a leak the moment it is loaded into the Ollama backend. Until parsers implement rigorous consistency checks between declared tensor shapes and actual buffer sizes, every external file must be treated as potentially hostile input.
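Cyera's write-up does not prescribe a specific verification workflow, but one practical control, independent of any parser fix, is to pin model artifacts to known digests before import. The Go sketch below, built around a hypothetical trustedDigests allowlist, refuses to hand a GGUF file to the loader unless its SHA-256 matches a recorded value:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// trustedDigests maps model filenames to known-good SHA-256 digests.
// The entry here is a hypothetical placeholder; in practice the values
// would come from a signed manifest or an internal registry.
var trustedDigests = map[string]string{
	"llama3-8b-q4.gguf": "<sha256-of-known-good-file>", // placeholder
}

// verifyModel hashes the file at path and compares the result against
// the allowlist entry for name, refusing the import on any mismatch.
func verifyModel(path, name string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	got := hex.EncodeToString(h.Sum(nil))

	want, ok := trustedDigests[name]
	if !ok {
		return fmt.Errorf("no trusted digest recorded for %s", name)
	}
	if got != want {
		return fmt.Errorf("digest mismatch for %s: got %s", name, got)
	}
	return nil
}

func main() {
	if err := verifyModel("llama3-8b-q4.gguf", "llama3-8b-q4.gguf"); err != nil {
		fmt.Println("refusing import:", err)
		os.Exit(1)
	}
	fmt.Println("digest verified; file may be handed to the loader")
}
```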
GGUF Parsing as a Local AI Attack Surface
Ollama manages local models using GGUF, a binary container that describes weights, metadata, and tensor structures. The vulnerability is triggered during the import phase: an attacker prepares a file where the declared tensor size exceeds the actual buffer allocation. During quantization, the parser trusts this metadata and reads beyond the allocated boundaries into adjacent heap memory. Simply loading the corrupted file is enough to trigger the leak; no interaction with the chat or prior host compromise is required.
The root issue is not the neural model itself, but the loader that imports it. Ollama must convert and quantize weights to suit local hardware; during this operation, it reads tensor metadata without validating it against the actual buffer size. Attackers exploit this lack of validation by declaring an arbitrarily large shape, forcing the read loop to overshoot the assigned heap segment.
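The fix implied by this class of bug is a consistency check: before quantization, compute the byte count implied by the declared shape and reject any tensor that claims more data than its buffer actually holds. The sketch below is a minimal illustration using hypothetical type and field names, not Ollama's actual internals:

```go
package main

import (
	"errors"
	"fmt"
)

// tensorHeader is a hypothetical stand-in for GGUF tensor metadata:
// the declared shape and the element size of the stored dtype.
type tensorHeader struct {
	shape       []uint64
	elementSize uint64 // bytes per element for the declared dtype
}

// declaredBytes multiplies out the declared shape, guarding against
// integer overflow, which a crafted file could also try to exploit.
func declaredBytes(h tensorHeader) (uint64, error) {
	total := h.elementSize
	for _, dim := range h.shape {
		if dim != 0 && total > ^uint64(0)/dim {
			return 0, errors.New("tensor shape overflows uint64")
		}
		total *= dim
	}
	return total, nil
}

// validateTensor rejects any tensor whose declared size exceeds the
// buffer actually read from the file: exactly the consistency check
// whose absence permits the out-of-bounds read.
func validateTensor(h tensorHeader, buf []byte) error {
	need, err := declaredBytes(h)
	if err != nil {
		return err
	}
	if need > uint64(len(buf)) {
		return fmt.Errorf("declared %d bytes but buffer holds %d", need, len(buf))
	}
	return nil
}

func main() {
	// A crafted header claiming far more data than the file provides.
	hostile := tensorHeader{shape: []uint64{1 << 30, 4096}, elementSize: 4}
	buf := make([]byte, 1024)
	if err := validateTensor(hostile, buf); err != nil {
		fmt.Println("rejecting model:", err) // the safe outcome
	}
}
```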
How Go’s ‘Unsafe’ Package Enables Memory Leaks
The technical root cause lies in the use of Go's unsafe package within the WriteTo() function, which handles tensor conversion to 32-bit formats. This package bypasses the language's native memory safety guarantees, allowing the quantization loop to read past the end of the heap buffer. This results in an out-of-bounds read that exposes adjacent bytes within the Ollama process address space, including keys, environment variables, and conversation fragments stored in plaintext.
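Cyera has not published the vulnerable code verbatim, but the failure mode is simple to demonstrate: once unsafe reinterprets memory with a length derived from untrusted metadata, Go's bounds checking no longer applies. The following illustrative reconstruction (not Ollama's actual WriteTo() implementation) contrasts the two paths:

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	// A 16-byte heap buffer standing in for a tensor's actual payload.
	payload := make([]byte, 16)

	// Length taken from attacker-controlled tensor metadata.
	declared := 64

	// The memory-safe path: re-slicing past the capacity panics, so the
	// runtime stops the out-of-bounds access before any data can leak.
	func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("safe slicing panicked as expected:", r)
			}
		}()
		_ = payload[:declared]
	}()

	// The unsafe path: unsafe.Slice constructs a 64-byte view over the
	// 16-byte buffer without any bounds check. Every index past 15 now
	// refers to whatever the heap allocator placed next to payload.
	view := unsafe.Slice(&payload[0], declared)
	fmt.Printf("unsafe view length: %d (buffer is only %d bytes)\n",
		len(view), len(payload))
}
```

The safe re-slice panics at runtime; the unsafe.Slice call silently produces a 64-byte window over a 16-byte allocation, and any read past the real boundary returns adjacent heap contents.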
According to Cyera’s analysis, the combination of binary format parsing and a lack of sandboxing leaves the Ollama process uniquely exposed. Once adjacent memory bytes enter the data stream, they can be encapsulated and transmitted externally using the framework’s native mechanisms, such as remote registry management, without the need for additional malware.
The Exfiltration Chain: From /api/create to /api/push
No passwords are required for this exploit. An attacker uploads the malicious file via the /api/create endpoint, which is exposed without authentication on many default installations. The server processes the model, triggering the out-of-bounds read. The leaked data is then funneled out via the /api/push endpoint to an attacker-controlled registry. This allows the entire process memory—including API keys, system prompts, and user content—to leave the infrastructure without appearing in traditional logs or triggering standard network filters.
The absence of a mandatory authentication layer effectively turns the /api/create and /api/push endpoints into open doors. Anyone capable of reaching the host—internally or externally—can trigger the full chain: file upload, out-of-bounds read, and memory exfiltration. No prior privileges or knowledge of the target RAM content are necessary; the Ollama workflow itself provides the exit channel.
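Defenders can turn the same observation into a detection check: if an Ollama API answers without credentials, the full chain is available to anyone who can reach it. Below is a minimal sketch that probes the read-only /api/version route to flag unauthenticated instances, intended only for infrastructure you own:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// probe reports whether an Ollama API answers unauthenticated requests
// at the given base URL, e.g. "http://10.0.0.5:11434".
func probe(base string) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(base + "/api/version")
	if err != nil {
		fmt.Printf("%s: unreachable (%v)\n", base, err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(io.LimitReader(resp.Body, 256))
	if resp.StatusCode == http.StatusOK {
		// An unauthenticated 200 here strongly suggests /api/create and
		// /api/push are reachable too: treat the host as exposed.
		fmt.Printf("%s: EXPOSED, no auth required: %s\n", base, body)
		return
	}
	fmt.Printf("%s: responded %d (gateway or auth in front?)\n", base, resp.StatusCode)
}

func main() {
	// Hypothetical internal hosts; scan only systems you are authorized to test.
	for _, host := range []string{"http://127.0.0.1:11434"} {
		probe(host)
	}
}
```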
300,000 Exposed Servers and the Authentication Gap
Ollama has seen massive adoption, boasting over 170,000 GitHub stars and 100 million downloads on Docker Hub. However, this popularity has led to widespread misconfiguration. Cyera estimates that roughly 300,000 instances are reachable over the internet, often listening on 0.0.0.0 without API gateways or native authentication. This exposure elevates the vulnerability from a theoretical risk to a significant threat; a single public endpoint without a password is sufficient to drain corporate secrets from the inference engine's memory.
The memory of an AI inference process rarely contains only weights. It often holds pre-configured system prompts, active conversation snippets, third-party service keys, and connection parameters for databases or cloud services. As Cyera researcher Dor Attias noted, the scope of this risk is extensive.
Mitigation and Security Recommendations
- Update immediately to Ollama 0.17.1. The official advisory identifies this version as the fix; all previous releases remain vulnerable to remote exfiltration.
- Rotate credentials and secrets. Any API keys, access tokens, or environment variables residing in the memory of Ollama servers should be considered compromised and replaced immediately.
- Isolate the inference engine network. Disable listening on 0.0.0.0, restrict access via firewalls or corporate VPNs, and ensure the /api/create and /api/push endpoints are not publicly visible.
- Implement an API gateway with mandatory authentication. Since Ollama does not include robust native access control, a reverse proxy or API gateway should be used to prevent anonymous model uploads; a minimal sketch follows this list.
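Because Ollama ships without a robust authentication layer, access control has to sit in front of the daemon. The Go sketch below assumes Ollama is rebound to 127.0.0.1:11434 and that a shared secret is supplied via a hypothetical OLLAMA_PROXY_TOKEN variable; a production deployment would add TLS, rate limiting, and audit logging:

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Upstream Ollama, bound to loopback so the proxy is the only way in.
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Hypothetical shared secret; production should use a secret store.
	token := os.Getenv("OLLAMA_PROXY_TOKEN")
	if token == "" {
		log.Fatal("OLLAMA_PROXY_TOKEN must be set")
	}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reject any request lacking the bearer token, blocking the
		// anonymous /api/create and /api/push calls the exploit needs.
		got := r.Header.Get("Authorization")
		want := "Bearer " + token
		if subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// Expose only the authenticated proxy on the network interface.
	log.Fatal(http.ListenAndServe("0.0.0.0:8443", handler))
}
```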
Bleeding Llama shifts the focus from chat manipulation to infrastructure security: an attacker no longer needs to trick a model with a prompt when they can simply corrupt the file that defines it. As long as self-hosted inference engines treat authentication and sandboxing as optional, the neural weight format itself will remain a privileged vector for accessing corporate memory. While patching is urgent, the broader lesson is that the local model supply chain must be defended with the same rigor as traditional software supply chains.
Frequently Asked Questions
How does this differ from prompt injection?
Prompt injection attempts to manipulate a model's behavior via text input. CVE-2026-7482 is a memory safety flaw in the GGUF binary parser. The attacker does not interact with the chat interface but instead uses a corrupted model file to read the process memory directly.
Is a server at risk if it is behind a firewall but lacks authentication?
Yes. While the estimate of 300,000 servers refers to those internet-exposed, any host that accepts unauthenticated connections to /api/create—whether from internal users or via lateral movement within a local network—remains vulnerable to memory exfiltration.
Does version 0.17.1 fully resolve the issue?
The advisory lists 0.17.1 as the official fix. However, until independent verification rules out secondary bypasses, users are advised to update and closely monitor access to model creation endpoints and external registries.
Sources
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://letsdatascience.com/news/ollama-vulnerability-exposes-remote-process-memory-caf67e65
- https://www.csoonline.com/article/4168584/ollama-vulnerability-highlights-danger-of-ai-frameworks-with-unrestricted-access.html
- https://news.fyself.com/ollama-out-of-bounds-read-vulnerability-causes-remote-process-memory-leak/
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama
Information verified against cited sources and current as of publication.