CVE-2026-7482: Malicious GGUF Files Trigger Memory Leaks in Ollama
A heap out-of-bounds read vulnerability in Ollama allows unauthenticated remote attackers to exfiltrate the entire memory of the inference process. Users are advised to update to version 0.17.1 immediately.

On May 10, security researchers at Cyera disclosed a critical vulnerability in Ollama, identified as CVE-2026-7482. The flaw allows an unauthenticated remote attacker to trigger a heap out-of-bounds read during the quantization of a malicious GGUF file. This memory leak, triggered by the WriteTo() function utilizing Go's unsafe package, exposes the entire memory space of the inference process. Attackers can then exfiltrate this data via the /api/push endpoint, effectively turning exposed instances into open data sources.
- Ollama's GGUF loader fails to validate tensor dimensions against the actual file length; an attacker-controlled shape value can overrun the heap buffer during quantization.
- The WriteTo() function invokes ggml.ConvertToF32 through Go's unsafe package, bypassing standard memory safety guarantees and enabling arbitrary reads beyond allocated boundaries.
- The /api/create and /api/push REST endpoints do not require authentication in default upstream deployments, allowing remote attackers to upload malicious models and exfiltrate leaked data.
- Cyera estimates that over 300,000 servers are potentially exposed globally; at-risk data includes environment variables, API keys, system prompts, and concurrent user conversations.
The Malicious GGUF Mechanism: How the Leak is Triggered
The Ollama GGUF loader reads tensor metadata without verifying that the file's physical length supports the shape values declared in the header. An attacker can craft an archive where the shape field indicates a number of elements significantly larger than the actual allocated data portion.
When the quantization engine begins conversion, the WriteTo() function calls ggml.ConvertToF32 using q.from.Elements(), which is derived directly from that manipulated field. The heap buffer intended for the data stream is not sized to handle such an expansive request, resulting in an out-of-bounds read that crosses the boundaries of the assigned segment. Notably, the GGUF specification itself is not at fault; the defect lies entirely in the software's parsing and conversion path, which assumes metadata and payload always match.
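The missing validation described above can be sketched as a simple pre-conversion check: before trusting the element count derived from the header's shape field, verify that the payload physically contains that many elements. This is an illustrative sketch of the bug class, not Ollama's actual loader code; the function name, constant, and figures are assumptions for the example.

```go
package main

import (
	"errors"
	"fmt"
)

const bytesPerElement = 4 // an F32 element occupies 4 bytes

// validateTensorFits is the kind of check the vulnerable path lacked:
// before converting, confirm that the element count implied by the
// declared shape actually fits inside the payload that is physically
// present in the file. (Hypothetical helper, not real Ollama code.)
func validateTensorFits(shape []uint64, payloadLen uint64) error {
	var elements uint64 = 1
	for _, dim := range shape {
		// Guard the multiplication itself against integer overflow.
		if dim != 0 && elements > ^uint64(0)/dim {
			return errors.New("shape product overflows uint64")
		}
		elements *= dim
	}
	if elements > payloadLen/bytesPerElement {
		return fmt.Errorf("header declares %d elements but payload holds at most %d",
			elements, payloadLen/bytesPerElement)
	}
	return nil
}

func main() {
	// Honest file: a 2x3 F32 tensor needs 24 bytes, and 24 are present.
	fmt.Println(validateTensorFits([]uint64{2, 3}, 24))
	// Malicious file: header claims 1,000,000 elements, payload is still 24 bytes.
	fmt.Println(validateTensorFits([]uint64{1000, 1000}, 24))
}
```

Without such a check, the conversion loop simply walks shape-many elements out of a smaller buffer, which is exactly the out-of-bounds read described above.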
Bypassing the Sandbox: The Role of Go's Unsafe Package
While Go typically provides native memory safety, it offers the unsafe package as an "escape hatch" for low-level operations on pointers and memory addresses. Ollama utilizes this package specifically during GGUF tensor conversion, effectively discarding runtime protections.
As Cyera researchers explain: "The answer is the unsafe package. Go gives developers an escape hatch for low-level memory operations, and as the name suggests, all the usual safety guarantees go out the window. Unsurprisingly, the one place Ollama uses unsafe is exactly where this vulnerability lives." This implementation allows WriteTo() to handle pointers without garbage collector oversight, turning an input validation error into an arbitrary read beyond buffer limits.
This architectural choice, likely driven by performance requirements for high-dimensional numerical tensors, created a breach point exactly where the GGUF format meets native-level processing.
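To see why unsafe matters here, consider how a length taken from untrusted metadata behaves with and without it. Safe Go slicing is bounds-checked and panics; unsafe.Slice accepts any length without checking it against the original allocation. This is a minimal sketch of the bug class only, not Ollama's actual conversion code, and the view helper is hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// view builds a byte slice over buf using a length taken from untrusted
// metadata. unsafe.Slice performs no bounds check against the original
// allocation, so a lying header turns this into an out-of-bounds view.
// (Illustrative helper; dereferencing past the allocation is undefined.)
func view(buf []byte, declaredLen int) []byte {
	return unsafe.Slice(&buf[0], declaredLen)
}

func main() {
	payload := []byte{1, 2, 3, 4} // only 4 bytes actually allocated

	// Safe Go would panic here: payload[:16] exceeds the slice's capacity.
	// The unsafe path accepts the attacker-controlled length silently:
	oob := view(payload, 16) // pretend 16 came from a crafted GGUF header
	fmt.Println(len(oob))    // the view now spans memory the buffer does not own
}
```

The garbage collector and runtime never see the mismatch, which is how an input validation error becomes an arbitrary read.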
From /api/create to /api/push: The Unauthenticated Exfiltration Chain
The attack chain begins at the /api/create endpoint, which accepts custom model uploads. An attacker uploads a manipulated GGUF file and waits for Ollama to initiate the quantization routine to prepare the tensor for inference. At this stage, the process reads memory outside the assigned segment, capturing adjacent heap fragments.
The leaked data is then funneled to /api/push, where the server forwards the payload to a remote registry. In default upstream configurations, no credentials are required for this action. Qualys describes this sequence as a three-step process and confirms that standard distributions leave both endpoints open. The presence of OLLAMA_HOST=0.0.0.0 on internet-facing installations effectively turns the server into a data tap accessible from any remote source.
Exposure Estimates: 300,000 Servers and Sensitive Data at Risk
Cyera estimates that over 300,000 Ollama servers are potentially exposed worldwide. While this figure has not been independently verified via external scans, it illustrates the massive scale of the attack surface. Many of these instances belong to developers or enterprises that enabled public hosting for convenience without implementing additional authentication layers or application firewalls.
The risk extends far beyond the loaded model itself. The inference process memory retains traces of everything processed during operation. Leaked data may include environment variables, API keys, system prompts, and the conversations of other users. For organizations integrating agentic coding tools or automated pipelines, Ollama's memory acts as a container for corporate secrets, proprietary code, and internal process outputs.
"An attacker can learn basically anything about the organization from your AI inference — API keys, proprietary code, customer contracts, and much more" — Dor Attias, Cyera security researcher
Mitigation and Remediation
- Update to Ollama 0.17.1 immediately. Qualys confirms this release addresses the vulnerability. Administrators managing containerized or multi-instance infrastructures must ensure no legacy images remain in use.
- Disable Public Exposure. Avoid setting OLLAMA_HOST=0.0.0.0 on internet-facing servers. Listening should be restricted to local interfaces, private networks, or corporate VPNs to minimize the external attack surface.
- Enforce Authentication. Because the /api/create and /api/push endpoints are open by default, a reverse proxy with mandatory authentication or a web application firewall (WAF) should be used to block anonymous requests.
- Rotate Secrets and Monitor Logs. Inspect logs for suspicious /api/create requests involving unknown GGUF files. Proactively rotate any API keys or credentials that may have resided in process memory during the exposure period.
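The reverse-proxy recommendation can be sketched as a minimal nginx fragment that puts basic authentication in front of Ollama's default port. The hostname, certificate setup, and credential file path below are assumptions; adapt them to your deployment.

```nginx
# Minimal nginx sketch: require credentials before any request
# reaches Ollama, which listens only on localhost behind the proxy.
server {
    listen 443 ssl;
    server_name ollama.internal.example;   # hypothetical hostname

    location / {
        auth_basic           "Ollama";
        auth_basic_user_file /etc/nginx/ollama.htpasswd;  # assumed path
        proxy_pass           http://127.0.0.1:11434;      # Ollama default port
    }
}
```

Combined with keeping OLLAMA_HOST bound to 127.0.0.1, this ensures anonymous requests never reach /api/create or /api/push.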
This flaw serves as a reminder that the line between local inference and exposed infrastructure is thinner than many assume. The widespread adoption of OLLAMA_HOST=0.0.0.0 for convenience, combined with a single unsafe call in Go, collapsed the barrier between a downloaded model and the server's entire memory space. Organizations treating on-premise AI as a "safe zone" now have a compelling reason to re-evaluate their perimeter, authentication, and network segmentation strategies for inference workloads.
Frequently Asked Questions
Does version 0.17.1 fully resolve the issue?
Qualys identifies version 0.17.1 as the fix for this flaw. However, infrastructure managers should verify the specific version running on every node, ensure legacy container deployments are purged, and prioritize enabling authentication and network restrictions as a defense-in-depth measure.
Is OLLAMA_HOST=0.0.0.0 the only risk factor?
While the 0.0.0.0 setting increases danger by making endpoints reachable via the internet, the vulnerability exists in the code itself. Any environment where an attacker can interact with /api/create and /api/push—including internal networks—is potentially compromisable. Segmentation reduces the probability of an external attack but does not eliminate the underlying security defect.
Can I detect if my server has already been compromised?
Current sources do not indicate widespread active exploitation in the wild, making it difficult to provide definitive indicators of compromise (IoCs). Because the leak occurs in memory and exfiltration via /api/push resembles legitimate model publishing, detection is challenging. Immediate patching and the proactive rotation of sensitive credentials remain the most reliable countermeasures.
Information verified against cited sources and accurate at the time of publication.
Sources
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://news.fyself.com/ollama-out-of-bounds-read-vulnerability-causes-remote-process-memory-leak/
- https://geekfence.com/ollama-out-of-bounds-read-vulnerability-allows-remote-process-memory-leak/
- https://threatprotect.qualys.com/2026/05/11/ollama-heap-out-of-bounds-read-vulnerability-leads-to-remote-process-memory-leak-cve-2026-7482/
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama