Bleeding Llama: Critical Memory Leak Exposes 300,000 Ollama Servers

CVE-2026-7482 in the Ollama framework allows remote attackers to exfiltrate heap memory via crafted GGUF files, potentially exposing API keys and private conversations.

On May 7, 2026, Cyera disclosed a critical vulnerability, designated CVE-2026-7482, in the open-source Ollama framework. Dubbed "Bleeding Llama," this heap out-of-bounds read flaw in the GGUF parser enables unauthenticated remote attackers to leak process memory from exposed servers.

The vulnerability lies in the way local AI models are loaded. Combined with Ollama's lack of default authentication, the flaw exposes API keys, environment variables, and private user conversations without triggering any immediate signs of intrusion.

The scale of the threat is significant: over 300,000 Ollama servers are currently reachable via the public internet, many of which are configured to listen on all network interfaces. Furthermore, the attack chain can be completed in just three API calls.

Key Takeaways
  • CVE-2026-7482 is a heap out-of-bounds read vulnerability with an estimated CVSS score of 9.1, affecting Ollama's GGUF format loader.
  • The exploit is triggered by sending a crafted GGUF file to the /api/create endpoint, followed by memory exfiltration via /api/push to an attacker-controlled registry.
  • Leaked heap memory can contain environment variables, API keys, system prompts, proprietary code, and private user chat history.
  • Administrators must update to Ollama 0.17.1 immediately, isolate servers from the public web, and rotate all exposed credentials and tokens.

The GGUF Parser Flaw: Malformed Tensors and Buffer Over-reads

GGUF (GPT-Generated Unified Format) is the file format Ollama uses to store quantized models locally. Metadata inside these files describes the offset, size, and structure of each tensor. Because the server blindly trusts this header, an attacker can manipulate the values to exceed the actual length of the loaded file.

During the model creation phase initiated via the /api/create endpoint, the code in fs/ggml/gguf.go and the WriteTo() function in server/quantization.go allocates a heap buffer based on this fraudulent metadata. The result is a read operation that extends beyond the buffer boundaries, "bleeding" adjacent bytes that contain environment variables, conversation fragments, and other process-level secrets.
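Conceptually, the missing safeguard is a single consistency check between the tensor metadata and the real file size. The sketch below is illustrative Python, not Ollama's Go code; the two-field header layout is an assumption made for brevity, but the validation mirrors what the patched parser must perform.

```python
import struct

def read_tensor(blob: bytes, header: bytes) -> bytes:
    # Assumed header layout for illustration: two little-endian
    # uint64 fields giving the tensor's offset and size.
    offset, size = struct.unpack("<QQ", header)
    # The fix: validate attacker-controlled metadata against the
    # real file length before allocating or reading. Without this
    # check, a native-code loader reads past the buffer into
    # adjacent heap memory (Python slicing merely truncates).
    if offset + size > len(blob):
        raise ValueError("tensor extends past end of file")
    return blob[offset:offset + size]

blob = b"\x00" * 64                    # 64-byte stand-in for a model file
ok = struct.pack("<QQ", 0, 64)         # honest metadata
bad = struct.pack("<QQ", 0, 1 << 20)   # claims 1 MiB: the over-read trigger

print(len(read_tensor(blob, ok)))      # 64
try:
    read_tensor(blob, bad)
except ValueError as exc:
    print(exc)                         # tensor extends past end of file
```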

The attacker requires no credentials. By performing a simple HTTP POST, they upload the crafted GGUF file, trigger quantization via /api/create, and force the over-read. The leaked data is then encapsulated within the model and exfiltrated to an external registry using /api/push, where the attacker can later extract it.
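For defenders writing detections, the chain can be reduced to the shape of its three requests. The sketch below only enumerates them; the blob-upload path and the digest placeholder are assumptions based on Ollama's public API, and no crafted payload is shown.

```python
def chain_requests(host: str) -> list:
    """Return (method, path) for each step of the reported three-call chain."""
    return [
        # 1. Upload the crafted GGUF file as a blob (digest left as a placeholder).
        ("POST", f"{host}/api/blobs/sha256:<digest>"),
        # 2. Create/quantize a model from the blob; parsing the malformed
        #    tensor metadata triggers the heap over-read, folding leaked
        #    bytes into the new model's data.
        ("POST", f"{host}/api/create"),
        # 3. Push the contaminated model to an attacker-controlled registry,
        #    exfiltrating the leaked memory as ordinary model traffic.
        ("POST", f"{host}/api/push"),
    ]

for method, path in chain_requests("http://victim:11434"):
    print(method, path)
```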

From Heap Memory to Corporate Secrets: Three Requests to Data Theft

Ollama's process heap is far from sterile. According to Cyera researchers, the memory space contains environment variables, API keys, system prompts, user conversations, proprietary code snippets, and even client contracts. Crucially, the memory leak does not cause visible crashes or errors; the server continues to function normally while data is siphoned off.

"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more"

Dor Attias, a researcher at Cyera, noted that the impact extends beyond a standard data breach, as AI inference becomes a vector for corporate reconnaissance. This is not a traditional database theft, but rather an open window into the organization's operational context.

The exfiltration leverages the /api/push function, typically used to distribute models to compatible registries. The attacker points the server to a repository under their control and pushes the contaminated model. This turns a legitimate administrative function into a covert data transfer channel. To the victim, the network traffic appears standard and non-threatening.

300,000 Exposed Instances: Default Configuration Amplifies Risk

Converging data suggests that over 300,000 Ollama servers are exposed to the internet, many of which are bound to 0.0.0.0 without any default authentication. The project has seen massive adoption, boasting over 171,000 stars and 16,100 forks on GitHub. According to CSO Online, containerized deployments have reached approximately 100 million downloads on Docker Hub, creating a vast attack surface.

The situation presents a security paradox. Organizations often choose Ollama to keep models and data away from public clouds to ensure data sovereignty. However, the combination of an open interface, a lack of out-of-the-box authentication, and vulnerabilities in model parsing can make self-hosted installations more dangerous than managed cloud services where perimeter controls are standard.

It is currently unknown how many servers have been actively compromised or if the exploit is being utilized in the wild. A lack of direct primary evidence—such as a formal public technical report from Cyera or a fully documented CVE.org bulletin—makes it difficult to quantify victims precisely. Current expert recommendations are based on the sheer size of the attack surface and the technical severity of the flaw.

Mitigation and Security Recommendations

Immediately update to Ollama version 0.17.1. The developers have released this version specifically to address CVE-2026-7482, patching the out-of-bounds read in the GGUF parser. Organizations unable to update immediately should use a firewall to block remote access to the /api/create and /api/push endpoints.
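When auditing a fleet, the running version can be read from Ollama's GET /api/version endpoint (which returns JSON such as {"version": "0.17.1"}) and compared against the fixed release. A minimal version-comparison sketch, ignoring pre-release suffixes for simplicity:

```python
def is_patched(version: str, fixed: str = "0.17.1") -> bool:
    """True if `version` is at or above the release fixing CVE-2026-7482."""
    def parse(v: str) -> tuple:
        # Drop a leading "v" and any pre-release suffix ("0.17.1-rc1" -> 0.17.1).
        return tuple(int(part) for part in v.lstrip("v").split("-")[0].split("."))
    return parse(version) >= parse(fixed)

print(is_patched("0.16.3"))  # False: still vulnerable
print(is_patched("0.17.1"))  # True: patched
```

In practice the check would be wrapped in an HTTP GET against each host; any instance reporting an older version should be pulled off the network before patching.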

Credential rotation is mandatory. API keys, access tokens, and any secrets stored in environment variables on exposed servers should be considered compromised. Given Cyera's warnings, if an instance was reachable via the internet, a total rotation of all memory-resident credentials must be prioritized.

Network bindings should be reconfigured to restrict listening to localhost or trusted internal interfaces only, moving away from the 0.0.0.0 configuration. Where remote access is required, a reverse proxy with robust authentication and rate limiting should be deployed in front of the inference services, treating Ollama as a mission-critical production component.
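As a minimal illustration of that pattern, the stdlib sketch below puts a token-checking gateway in front of an Ollama instance rebound to localhost. The token, port, and blanket ban on /api/create and /api/push are assumptions for the example; a hardened proxy with TLS and rate limiting is preferable in production.

```python
import http.server
import urllib.request

UPSTREAM = "http://127.0.0.1:11434"     # Ollama rebound to localhost only
TOKEN = "change-me"                     # shared secret; placeholder value
BLOCKED = ("/api/create", "/api/push")  # model-management endpoints

def allowed(path, auth_header):
    """Reject unauthenticated callers and, even for valid tokens,
    refuse the endpoints abused by the attack chain."""
    if auth_header != f"Bearer {TOKEN}":
        return False
    return not path.startswith(BLOCKED)

class Gateway(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        if not allowed(self.path, self.headers.get("Authorization")):
            self.send_error(403)
            return
        length = int(self.headers.get("Content-Length", 0))
        req = urllib.request.Request(UPSTREAM + self.path,
                                     data=self.rfile.read(length),
                                     method="POST")
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To serve:
# http.server.HTTPServer(("0.0.0.0", 8080), Gateway).serve_forever()
```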

Finally, monitor API logs for anomalous GGUF file uploads to /api/create and unauthorized pushes to external registries via /api/push. Early detection of these primitives can interrupt the attack chain before sensitive data leaves the corporate perimeter.
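A simple way to start is grepping proxy access logs for the two primitives. The pattern below assumes a common-log-style line with the method and path in quotes; adapt it to your proxy's actual format.

```python
import re

# Matches POST requests to the two endpoints abused by the attack chain.
SUSPECT = re.compile(r'"POST (/api/(?:create|push))[^"]*"')

def suspicious_lines(log_lines):
    return [line for line in log_lines if SUSPECT.search(line)]

sample = [
    '10.0.0.5 - - [07/May/2026] "POST /api/generate HTTP/1.1" 200',
    '203.0.113.9 - - [07/May/2026] "POST /api/create HTTP/1.1" 200',
    '203.0.113.9 - - [07/May/2026] "POST /api/push HTTP/1.1" 200',
]
for line in suspicious_lines(sample):
    print(line)
```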

The "Bleeding Llama" case is more than a simple parsing bug; it highlights a trend in local AI deployment where speed and ease of use often take precedence over security-by-default. Until inference frameworks integrate rigid authentication and restrictive network bindings at the installation level, the promise of private AI remains a potential liability for sensitive enterprise data.

Frequently Asked Questions

Is applying the patch enough to secure my server?

No. While version 0.17.1 fixes the GGUF parser vulnerability, if your server was previously exposed to the internet, you must rotate all credentials and ensure the service is no longer listening on public interfaces. The patch stops the leak but does not recover data that may have already been stolen.

How does a single GGUF file allow access to unauthorized memory?

The loader relies on internal file metadata to calculate tensor sizes. By manipulating this metadata, an attacker causes the WriteTo() function in server/quantization.go to read beyond the allocated heap buffer, siphoning off contiguous bytes that often contain process secrets.

Are Docker installations also at risk?

Yes. The vulnerability is within the framework's parsing logic and is independent of the host operating system. Ollama containers exposed on public ports without front-end authentication are equally susceptible to the three-step attack chain.
