Bleeding Llama: Why "On-Premises" Doesn't Mean "Safe" — CVE-2026-7482 and the 300,000 Exposed Servers
CVE-2026-7482 allows unauthenticated remote attackers to leak Ollama process memory via crafted GGUF files, exposing sensitive API keys, system prompts, and proprietary user data.

On May 12, 2026, researchers disclosed CVE-2026-7482, a critical out-of-bounds read vulnerability within Ollama’s GGUF loader. The flaw allows unauthenticated remote attackers to leak the entire memory space of the running process. Dubbed "Bleeding Llama" by researchers at Cyera, the vulnerability puts approximately 300,000 internet-exposed servers at immediate risk—many of which operate without authentication by default. The myth of "local-only" AI security has been shattered by an attack chain requiring just three HTTP calls to exfiltrate API keys, system prompts, and proprietary user data.
- A crafted GGUF file with an inflated tensor shape triggers a read beyond heap buffer limits during quantization, resulting in the leakage of arbitrary process memory.
- The /api/create endpoint accepts malicious models without authentication, while /api/push allows attackers to exfiltrate leaked data to an external registry.
- Approximately 300,000 Ollama servers are currently reachable via the public internet, often bound to all interfaces by default without credential protection.
- Leaked data may include environment variables, third-party API keys, system prompts, and active user conversations, jeopardizing corporate secrets and proprietary code.
The Tensor Shape Trick and GGUF Loader Memory Leaks
GGUF is the binary model format used by llama.cpp-derived runtimes such as Ollama, and it can be constructed by hand. According to Cyera, an attacker can declare an arbitrarily large tensor shape without the loader verifying that the element count matches the actual buffer size. "GGUF is just a binary format – anyone can create one manually and set the tensor’s shape to whatever they want. There’s no validation that the number of elements we’re about to read actually matches the real size of the data," the researchers explained.
During quantization, the WriteTo() function in fs/ggml/gguf.go and server/quantization.go processes the tensor without adequate boundary checks. Ollama uses Go’s unsafe package for low-level buffer operations, bypassing the language’s native memory-safety guarantees. The result is a read past the end of the heap buffer that copies adjacent bytes of process memory into the output.
This leaked memory is far from random junk; it often contains environment variables, API keys, system prompts, and active user sessions. Attackers do not require sophisticated reverse engineering; once the malicious file is provided to the /api/create endpoint, the process itself becomes the source of the leak.
From Localhost to Internet-Wide: The Hidden Attack Surface
Ollama has become a leading runtime for local model execution, boasting approximately 170,000 GitHub stars and over 100 million Docker Hub downloads. However, an architectural default poses a significant risk: upon startup, the software often listens on all network interfaces and does not enforce authentication. This can instantly transform a local development environment into a public node if an administrator exposes the host to the internet.
Current estimates suggest that roughly 300,000 Ollama servers are visible to public scanners. As noted by Echo, a CNA cited by Cybernews, "Ollama, when launched, listens on all interfaces by default with no authentication. Today, there are roughly 300,000 exposed servers on the internet." This is not a theoretical threat, but a measurable and mapped attack surface.
The combination of massive popularity and permissive default configurations makes the tool a prime target. Exploitation does not require a complex misconfiguration; simply putting an instance online with original settings is enough to grant access to anyone with a malicious GGUF file.
The Three-Step Exfiltration Chain
The attack is alarmingly straightforward. An attacker uploads a crafted GGUF file via a POST request to the /api/create endpoint, forcing Ollama to read memory outside the allocated buffer. The leaked data is then encapsulated within a model artifact. The attacker can subsequently use the /api/push endpoint to send this artifact to an external registry, completing the exfiltration without ever touching the victim's filesystem.
CVE-2026-7482 carries a CVSS score of 9.1. Described by CVE.org as "a heap out-of-bounds read vulnerability in the GGUF model loader," it does not require complex exploits. A valid binary file containing malicious metadata is sufficient to turn a server into a passive intelligence source.
"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more"
— Dor Attias, Cyera (via The Hacker News)
While no mass breaches have been confirmed yet, the simplicity of the chain raises the risk of targeted compromises. For enterprises using Ollama to prototype on cloud VMs or public containers, the margin for error has effectively disappeared.
Mitigation and Response
Restricting Ollama to the local loopback interface and enabling an authentication layer prior to any network exposure is the primary defense. If the service does not require external access, bindings must remain on 127.0.0.1 rather than 0.0.0.0.
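In practice the binding is controlled through the OLLAMA_HOST environment variable; a loopback-only configuration (shown here with the default port, 11434) looks like this:

```shell
# Bind Ollama to the loopback interface only, instead of 0.0.0.0.
# Set before starting the service (or in the systemd unit's Environment=).
export OLLAMA_HOST=127.0.0.1:11434
ollama serve
```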
Blocking remote access to the /api/create and /api/push endpoints drastically reduces the attack surface if the server must remain reachable for other services. These paths are central to the attack chain and should be restricted via firewalls or reverse proxies.
Updating to the latest available version is essential. Although one source suggests that version 0.17.1 or later addresses the vulnerability, official vendor confirmation was not available at the time of writing.
Finally, organizations should immediately rotate any API keys or secrets stored in environment variables on exposed instances. Monitoring traffic to suspicious external registries via the /api/push endpoint is also recommended, treating any previously accessible installation as potentially compromised.
The Go 'Unsafe' Package and the Failure of Modern Memory Safety
Go is designed to be a memory-safe language, but its unsafe package allows low-level pointer operations that bypass the type system and garbage collector. Ollama utilizes this in its quantization pipeline to manipulate tensors directly—a performance-driven choice that ultimately created a breach in the runtime's security boundaries.
The combination of a non-robust binary parser and unsafe memory access negates the architectural advantages of the language. This is not a vulnerability within Go itself, but a demonstration of how modern toolchains can compromise security for speed, allowing controlled external input to trigger arbitrary memory reads.
In this context, the use of unsafe is not a minor technical detail; it is the reason why a simple metadata discrepancy in a model file can expose the process's entire address space. Without this bypass, the runtime likely would have intercepted the anomaly before it became a leak.
The assumption that an AI model running locally or in a private container is inherently protected has met the reality of a global attack surface. Ollama’s memory is no longer a private enclosure; if the process is reachable over the internet, the data passing through it becomes accessible to anyone capable of crafting a malformed GGUF file. Distinguishing between "local" and "secure" is no longer an analytical luxury—it is a requirement for managing AI infrastructure.
Frequently Asked Questions
Is authentication required to exploit CVE-2026-7482?
No. An attacker can act remotely and without valid credentials by targeting endpoints that are accessible by default on exposed servers.
What specific data is at risk of exfiltration?
Environment variables, API keys, system prompts, and active user conversations currently residing in the Ollama process memory at the time of the attack.
Is an update currently available to fix the issue?
One report indicates that version 0.17.1 or later mitigates the vulnerability, though this has not been officially confirmed by vendor advisories at the time of publication.
Information has been verified against cited sources and is current as of the date of publication.
Sources
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama
- https://thomasharris6.wordpress.com/2026/05/10/ollama-out-of-bounds-read-vulnerability-allows-remote-process-memory-leak/
- https://cybernews.com/security/critical-ollama-vulnerability-leaks-user-chats/