Bleeding Llama: 300,000 Ollama Servers Exposed to Unauthenticated Memory Leak
A critical vulnerability (CVE-2026-7482) in Ollama's GGUF loader allows remote, unauthenticated attackers to exfiltrate process memory, API keys, and sensitive conversation data.

On May 10, Cyera disclosed CVE-2026-7482, a vulnerability dubbed "Bleeding Llama" affecting Ollama's GGUF loader. The flaw allows a remote attacker to leak a process's entire memory—including secrets and active conversational data—by uploading a seemingly legitimate model file through unauthenticated REST endpoints. With CVSS scores estimated at 9.1 by Cyera and 9.9 by Qualys, the vulnerability demands immediate attention from the thousands of organizations running Ollama for self-hosted Large Language Models (LLMs).
- Ollama's GGUF loader utilizes Go's unsafe package for tensor quantization operations but fails to validate tensor shape fields against actual buffer sizes.
- The /api/create and /api/push REST endpoints are exposed without authentication in upstream distributions, allowing unauthorized remote uploads and data exfiltration.
- The resulting model artifacts incorporate portions of leaked heap memory, potentially containing API keys, environment variables, system prompts, and real-time user conversation data.
- All versions prior to 0.17.1 are affected. Qualys has released detection QIDs 734196 and 5012259 for active identification of vulnerable instances.
The Attack Chain: From GGUF Files to Memory Exfiltration
The attack begins with the upload of a specially crafted GGUF file via the /api/create endpoint. In default upstream distributions, this is one of two REST channels that require no authentication. The attacker manipulates the tensor offset and size fields, deliberately declaring them larger than the actual file uploaded to the server.
Ollama accepts the artifact and, during the quantization phase, triggers the ConvertToF32 function within server/quantization.go. This function reads beyond the boundaries of the allocated heap buffer because it lacks dimensional consistency checks between the physical file and the tensor metadata. This out-of-bounds read operation embeds arbitrary bytes of process memory directly into the new quantized model saved on the server's disk.
The attacker can then invoke the /api/push endpoint to publish the modified artifact to a remote registry under their control. Because both endpoints are accessible without credentials in standard configurations, the entire exfiltration chain can be automated. The resulting model file is no longer just a collection of neural weights, but a partial snapshot of the host's active RAM.
Root Cause: unsafe.WriteTo and Inflated Tensor Shapes
The technical root cause lies in the use of Go's unsafe package within the GGUF loader's quantization routines. Cyera identified that the WriteTo() function in fs/ggml/gguf.go and server/quantization.go processes the tensor shape field without verifying that the declared number of elements matches the actual buffer size. This misalignment facilitates an input-controlled heap out-of-bounds read.
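The missing check can be illustrated with a much-simplified model of the loader's arithmetic (hypothetical names; the real logic lives in fs/ggml/gguf.go and server/quantization.go): before any unsafe reinterpretation of the buffer, the element count implied by the declared shape, multiplied by the element size, must fit within the bytes actually read from the uploaded file.

```go
package main

import (
	"errors"
	"fmt"
)

// tensorFits is a simplified stand-in for the validation added in
// 0.17.1: the element count implied by a GGUF tensor's declared
// shape, multiplied by the element size, must fit inside the buffer
// actually read from the uploaded file. Without this check, a
// quantization routine built on Go's unsafe package walks past the
// heap allocation and copies neighboring process memory into the
// output artifact.
func tensorFits(shape []uint64, elemSize uint64, buf []byte) error {
	elems := uint64(1)
	for _, d := range shape {
		elems *= d // real code must also guard this multiply against overflow
	}
	if elems*elemSize > uint64(len(buf)) {
		return errors.New("tensor shape declares more data than the file provides")
	}
	return nil
}

func main() {
	buf := make([]byte, 16)                            // 16 bytes actually uploaded
	fmt.Println(tensorFits([]uint64{2, 2}, 4, buf))    // 2*2*4 = 16 bytes: fits
	fmt.Println(tensorFits([]uint64{1 << 20}, 4, buf)) // inflated shape: rejected
}
```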
The mechanism is particularly insidious because it requires neither an executable payload nor prior server compromise; a model file with altered metadata is sufficient. The GGUF format itself becomes a transport vehicle for arbitrary data harvested from process RAM. While the attacker does not achieve Remote Code Execution (RCE), the memory leak is critical for enterprise environments managing sensitive secrets.
As of the May 10 disclosure, there are no confirmed in-the-wild exploits. The vulnerability enables neither remote code execution nor system crashes; it is, instead, a massive unauthenticated information-disclosure flaw.
Impact Analysis: 300,000 Exposed Servers and CVSS Discrepancies
Cyera and The Hacker News estimate that approximately 300,000 Ollama servers are exposed to the internet. While this figure cannot be independently verified, it reflects the massive scale of the project, which boasts over 171,000 GitHub stars and 100 million downloads on Docker Hub. This popularity creates a vast global attack surface for security researchers and potential threat actors alike.
There is a technical discrepancy between assigned CVSS scores. Cyera calculated a score of 9.1, while Qualys ThreatPROTECT reports a near-maximum 9.9. This difference stems from differing evaluations of the "scope" metric and the attack vector. Qualys places higher weight on the total absence of authentication in upstream endpoints, which significantly elevates the potential for immediate remote exploitation.
"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more"
— Dor Attias, Cyera security researcher
Mitigation and Response
- Update to Ollama 0.17.1 immediately. The patch implements validation for tensor shapes within the GGUF loader, effectively neutralizing the out-of-bounds read. Manually verify Docker instances to ensure image pulls reflect the latest stable version rather than obsolete tags.
- Isolate or Authenticate /api/create and /api/push. Upstream distributions expose these endpoints by default. Administrators should deploy a reverse proxy (such as Nginx or Apache) to enforce API key authentication or restrict access to segmented internal networks via strict firewall rules.
- Scan Infrastructure with Qualys QIDs. Use detection codes 734196 and 5012259 to systematically identify vulnerable instances across corporate inventories. Prioritize patching for nodes handling sensitive data or those directly exposed to the public internet.
- Validate GGUF Model Provenance. Treat every model file as a supply chain artifact. Only load models from verified public repositories and, where possible, implement a pre-validation phase to check tensor metadata consistency before ingestion into the Ollama runtime.
Vendor Responsibility and the Risk of "Open Defaults"
Bleeding Llama highlights a fundamental concern regarding Ollama's architectural choices: the decision to keep critical endpoints like /api/create and /api/push open by default without upstream authentication. In the modern security landscape, "secure-by-default" should be the standard, not a manual configuration requirement for external proxies. This oversight transforms a memory vulnerability into a global privacy crisis.
For organizations handling intellectual property or regulated data, this flaw represents a direct breach of confidentiality. The issue transcends technical bugs, touching on vendor accountability. AI infrastructure must be defended with the same rigor as traditional software pipelines. An inflated offset in a GGUF header should never be capable of turning a downloaded model into a data exfiltration pump.
The memory of an inference process is an active archive of secrets and conversational context. The fact that the attack surface has shifted silently from code to weight containers demonstrates the fragility of the AI supply chain. For those running Ollama in production, the question is not whether to patch, but how quickly those unintentionally exposed endpoints can be secured.
Frequently Asked Questions
Is the risk still real if Ollama only runs locally?
Yes. Even without direct internet exposure, the lack of authentication means any compromised process or container on the same local network can trigger the memory leak. While remote exposure allows for global-scale exploitation, the vulnerability remains a dangerous information disclosure vector for multi-tenant internal networks.
Does version 0.17.1 block the exfiltration or the out-of-bounds read?
The patch addresses the technical root cause. It corrects tensor shape handling in the GGUF loader, preventing the function from reading beyond the heap buffer. Without the initial anomalous read, the generated artifact contains no sensitive RAM data, rendering subsequent exfiltration via /api/push useless.
Is it possible to detect if a malicious GGUF file has already been loaded?
While there are no unique indicators of compromise (IoCs), organizations can monitor logs for anomalous /api/push calls directed at unknown external registries. Another indicator is a discrepancy between the size of declared tensors and the actual size of files uploaded to /api/create during recent sessions.
Information has been verified against cited sources and is current as of the time of publication.
Sources
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://letsdatascience.com/news/ollama-vulnerability-exposes-remote-process-memory-caf67e65
- https://news.fyself.com/ollama-out-of-bounds-read-vulnerability-causes-remote-process-memory-leak/
- https://threatprotect.qualys.com/2026/05/11/ollama-heap-out-of-bounds-read-vulnerability-leads-to-remote-process-memory-leak-cve-2026-7482/
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama