DarkMoon: Open-Source AI Pentesting at $10 a…

DarkMoon separates LLM reasoning from execution via MCP to bypass Anthropic's safety classifiers. At roughly $10 per web-app scan, the framework integrates 50–80+ tools and 18 specialized agents, but ships with zero releases, no independent audits, and a GitHub footprint of just 110 stars.

On April 26, 2026, DarkMoon surfaced publicly: an open-source, agent-based autonomous penetration-testing framework maintained by lead developer Mehdi Boutayeb. The project tackles a problem security teams have faced for months: flagship models, led by Anthropic's Claude Opus, block offensive tasks even when those tasks are authorized, which makes pentest automation unpredictable. DarkMoon's answer is an architecture that isolates LLM reasoning from actual execution using the Model Context Protocol (MCP) and Docker containers, at a stated cost of roughly $10 per web-app scan with cloud models.

Key Takeaways

The architecture separates the LLM orchestrator (OpenCode) from tool-based execution via MCP with an explicit allow-list, running in isolated Docker containers
The platform integrates 50+ security tools (editorial sources) or 80+ (official site) and 18 specialized agents for web apps, Active Directory, Kubernetes, and network protocols
Claude Opus 4.8 triggered Anthropic's safety classifiers and halted during testing, while version 4.6 completed the assessment uninterrupted
The GitHub repository shows 110 stars, 19 forks, and zero published releases, the profile of a very early-stage project with no documented independent security audits

How DarkMoon Bypasses LLM Safety Classifiers

The problem is not theoretical. In an interview with Help Net Security, Boutayeb reported that Anthropic's Claude Opus 4.8 halted a pentest assessment after the model's built-in safety classifiers triggered. Version 4.6, by contrast, completed the same sequence. The project therefore flags Opus 4.6 as the "more stable choice" for operators.

This is not a flaw in Claude per se. Vendor LLM safety classifiers are designed to prevent unauthorized offensive use, yet professional cybersecurity requires exactly those operations — when authorized. The tension is structural: general-purpose AI safety collides with legitimate offensive-security use cases.

DarkMoon addresses this friction with a three-layer architecture. The OpenCode orchestrator interacts with the LLM to plan moves and strategy. A control layer, built on the Model Context Protocol, exposes only an explicit allow-list of authorized tools and workflows. Execution happens in isolated Docker containers that hold the security toolbox. Boutayeb framed the goal bluntly: "make execution deterministic, auditable, and constrained, rather than allowing unlimited autonomous behavior."
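To make the control-layer idea concrete, here is a minimal Python sketch of an allow-list gate that sits between a planner's proposed tool call and execution. It is not DarkMoon's code: the `ToolCall` shape, the `authorize` function, and the per-tool flag lists are hypothetical, and a real MCP server would expose tools through the protocol rather than through a plain function. The gate rejects any tool, target, or flag that is not explicitly permitted and returns a fixed argument list instead of a shell string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the LLM planner (hypothetical shape)."""
    tool: str
    target: str
    options: tuple[str, ...] = ()


# Hypothetical allow-list: tool name -> option flags that tool may receive.
ALLOW_LIST: dict[str, frozenset[str]] = {
    "nuclei": frozenset({"-severity", "-silent"}),
    "sqlmap": frozenset({"--batch", "--level"}),
}

# Conservative token pattern: no spaces, quotes, ';', '|', '&', '$', or backticks.
SAFE_TOKEN = re.compile(r"[A-Za-z0-9._:/=,-]+")


class Rejected(Exception):
    """Raised when a proposed call falls outside the allow-list or the scope."""


def authorize(call: ToolCall, in_scope: set[str]) -> list[str]:
    """Validate a proposed call and return an argv list for shell-free execution."""
    allowed_flags = ALLOW_LIST.get(call.tool)
    if allowed_flags is None:
        raise Rejected(f"tool not allow-listed: {call.tool}")
    if call.target not in in_scope:
        raise Rejected(f"target out of scope: {call.target}")
    for token in (call.target, *call.options):
        if not SAFE_TOKEN.fullmatch(token):
            raise Rejected(f"unsafe token: {token!r}")
        # Flags may arrive as "--level=3"; compare only the flag name.
        if token.startswith("-") and token.split("=", 1)[0] not in allowed_flags:
            raise Rejected(f"flag not allowed for {call.tool}: {token}")
    # Deterministic argv, intended for subprocess.run(argv) inside the sandbox, never via a shell.
    return [call.tool, "-u", call.target, *call.options]
```

For example, `authorize(ToolCall("nuclei", "http://172.19.0.3", ("-severity", "critical")), {"http://172.19.0.3"})` returns `["nuclei", "-u", "http://172.19.0.3", "-severity", "critical"]`, while a request for `bash` or for sqlmap's `--os-shell` raises `Rejected`. The design mirrors the quoted goal: the model only proposes, and a deterministic gate decides what runs and produces an argument list that can be logged for auditing.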

The Toolbox and 18 Agents: What's Under the Hood

The platform bundles industry-standard tools: Nuclei, sqlmap, BloodHound, NetExec, WPScan, Hydra, Hashcat, kubectl, Kubescape. Editorial sources cite "over 50" tools in the Docker container; the official dark-moon.org site claims 80+ tools and 18 specialized agents, with a demo instance showing 57 critical vulnerabilities detected on target 172.19.0.3 in 28.5 minutes. It is not verifiable whether the demo findings are real or simulated.

Agents cover web applications, Active Directory, Kubernetes, network protocols, CMS, GraphQL, and headless browser interactions. The system supports multiple LLM providers: OpenAI, Anthropic, OpenRouter, and local models via Ollama or llama.cpp. Boutayeb emphasized economic flexibility: "it can be completely free if you stay local, or a few dollars per assessment if you want the extra reasoning of a frontier model. Every user chooses their own balance between cost and capability."

The most-cited metric is the roughly $10 per web-app scan using Claude Opus. The figure comes from the lead maintainer's direct testimony and the announcement's RSS description, not from independent verification.

"The LLM is never treated as the source of truth. The evidence collected from the target environment remains the source of truth." — Mehdi Boutayeb, DarkMoon lead maintainer, via Help Net Security

Why the GitHub Metrics (110 Stars) Raise Questions

The repository github.com/ASCIT31/Dark-Moon, as verified in May 2026, shows 110 stars and 19 forks. Zero published releases. This metric profile is consistent with a very early-stage project, not a mature platform adopted at scale in production. No independent security audits, penetration tests, or external red-team reviews of the MCP architecture or isolated containers are documented.

The absence of external verification is a significant limitation for a tool designed, by definition, to operate offensively against authorized targets. The reliability of the allow-list, the robustness of container isolation, the possibility of escalation or bypass of the MCP layer — all these elements are documented only from inside the project, not by third-party observers.
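Since container isolation is documented only from inside the project, one thing an evaluator can check directly is the set of options passed to `docker run` for the toolbox container. The sketch below flags common weaknesses using standard Docker options (`--privileged`, host networking or PID namespace, missing `--cap-drop ALL`, missing `--security-opt no-new-privileges`, missing `--read-only`, a mounted Docker socket). The helper names are hypothetical, the check list is a common-practice assumption rather than DarkMoon's configuration, and an empty result is a starting point, not proof of isolation.

```python
# Hypothetical audit helper, not part of DarkMoon: flags weak isolation settings
# in the options passed to `docker run` (everything between `run` and the image name).


def option_values(opts: list[str], *names: str) -> list[str]:
    """Collect values given as `--flag value` or `--flag=value` for any flag in `names`."""
    values: list[str] = []
    for i, tok in enumerate(opts):
        for name in names:
            if tok == name and i + 1 < len(opts):
                values.append(opts[i + 1])
            elif tok.startswith(name + "="):
                values.append(tok[len(name) + 1:])
    return values


def audit_docker_run(opts: list[str]) -> list[str]:
    """Return warnings for common container-isolation weaknesses."""
    warnings: list[str] = []
    if "--privileged" in opts:
        warnings.append("--privileged gives the container broad access to the host")
    if "host" in option_values(opts, "--network", "--net"):
        warnings.append("host networking shares the host's network stack")
    if "host" in option_values(opts, "--pid"):
        warnings.append("host PID namespace exposes host processes")
    if not any(v.upper() == "ALL" for v in option_values(opts, "--cap-drop")):
        warnings.append("default capabilities retained (--cap-drop ALL missing)")
    if not any(v.startswith("no-new-privileges") for v in option_values(opts, "--security-opt")):
        warnings.append("no-new-privileges not set")
    if "--read-only" not in opts:
        warnings.append("writable root filesystem (--read-only missing)")
    mounts = option_values(opts, "-v", "--volume")
    if any(m.startswith("/var/run/docker.sock") for m in mounts):
        warnings.append("Docker socket mounted: the container can drive the host daemon")
    return warnings
```

Running it on `["--network", "none", "--cap-drop", "ALL", "--security-opt", "no-new-privileges", "--read-only"]` yields no warnings, while `["--privileged", "--network=host", "-v", "/var/run/docker.sock:/var/run/docker.sock"]` triggers several.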

The official site also mentions a "hardware-bound licensing model" without explaining it. It is unclear whether this implies activation constraints, distribution limits, or runtime integrity checks.

Claimed Compliance and the Verification Gap

DarkMoon says it aligns its methodology with recognized frameworks: ISO 27001, NIST SP 800-115, and MITRE ATT&CK. This claim appears across multiple editorial sources but is not backed by certifications or external assessments. The platform also includes a native bug-bounty mode with FOCUS, EXCLUDE, SEVERITY, and FORMAT=h1 flags for scoping and reporting.
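The sources name the bug-bounty flags but not their exact semantics. As an illustration of what this kind of scope filtering usually involves, here is a minimal sketch that assumes FOCUS and EXCLUDE behave like lists of hostname glob patterns and SEVERITY like an inclusive minimum threshold. `Issue`, `apply_scope`, and the severity scale are hypothetical, and the sketch does not attempt the FORMAT=h1 report output.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Assumed ordering, similar to the severity labels many scanners use.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Issue:
    host: str
    title: str
    severity: str


def apply_scope(issues: list[Issue], focus: list[str], exclude: list[str],
                min_severity: str) -> list[Issue]:
    """Keep issues on in-focus hosts, drop excluded hosts, and enforce a severity floor.

    Assumed semantics: `focus` and `exclude` are glob patterns over hostnames,
    `min_severity` is inclusive, and an empty `focus` means "all hosts".
    """
    floor = SEVERITY_ORDER.index(min_severity)
    kept: list[Issue] = []
    for issue in issues:
        if focus and not any(fnmatch(issue.host, p) for p in focus):
            continue  # outside the declared focus
        if any(fnmatch(issue.host, p) for p in exclude):
            continue  # explicitly out of scope
        if SEVERITY_ORDER.index(issue.severity) < floor:
            continue  # below the reporting threshold
        kept.append(issue)
    return kept
```

With `focus=["*.example.com"]`, `exclude=["blog.example.com"]`, and `min_severity="medium"`, a critical issue on `api.example.com` survives, while a low-severity issue on `admin.example.com`, anything on the excluded blog host, and anything on an unrelated domain are dropped.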

Findings are classified in two categories: "Confirmed," with attached evidence (commands, output, HTTP request/response, execution traces), and "Unconfirmed," for weak signals requiring human verification. Boutayeb insisted on this point: "The LLM is never treated as the source of truth." The distinction is technically sound, but its effectiveness depends on how the MCP layer is implemented, which cannot be independently verified at this stage.
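The Confirmed/Unconfirmed split can be expressed as a rule in which a finding's status derives only from attached evidence, never from the model's own claim. The sketch below assumes hypothetical `Finding` and `Evidence` types and an assumed set of evidence kinds modeled on the categories listed above; DarkMoon's actual criteria are not public and may well be stricter (for example, requiring a reproducing command together with its output).

```python
from dataclasses import dataclass, field
from enum import Enum

# Assumed evidence categories, modeled on those the article lists.
EVIDENCE_KINDS = {"command", "output", "http_exchange", "trace"}


class Status(Enum):
    CONFIRMED = "Confirmed"
    UNCONFIRMED = "Unconfirmed"


@dataclass
class Evidence:
    kind: str     # must be one of EVIDENCE_KINDS to count toward confirmation
    content: str  # raw artifact collected from the target environment


@dataclass
class Finding:
    title: str
    llm_claim: str  # the model's assertion: recorded, but never decisive
    evidence: list[Evidence] = field(default_factory=list)

    @property
    def status(self) -> Status:
        """Confirmed only if at least one recognized, non-empty evidence item exists."""
        for item in self.evidence:
            if item.kind in EVIDENCE_KINDS and item.content.strip():
                return Status.CONFIRMED
        return Status.UNCONFIRMED
```

A finding that carries only the model's claim stays Unconfirmed; attaching, say, a captured HTTP request/response flips it to Confirmed, and an item of an unrecognized kind does not count.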

What to Do Now

For security teams evaluating DarkMoon, three concrete actions follow from the available information. First, test the platform in isolated environments, not production, given the project's early-stage status and lack of official releases. Second, prefer Claude Opus 4.6 over 4.8 for complete assessments, following the lead maintainer's guidance on Anthropic's classifier blocks. Third, compare the roughly $10 cost per cloud scan against local models (Ollama, llama.cpp) to weigh cost against reasoning quality.

For AI governance leads, the case highlights a pressing tension: vendor LLM safety classifiers do not distinguish between authorized offensive use and illicit activity. DarkMoon's architectural response, separating LLM reasoning from execution via MCP, is a pattern to watch, not an established standard. None of the available sources documents a vendor exemption mechanism for professional pentesting; the split between reasoning and execution remains the only documented path.

For security-tool developers, the repository's 110 stars indicate a small user base. Anyone contributing to the project should verify the MCP implementation and allow-list controls directly rather than rely on official-site claims.

What's at Stake for the Industry

DarkMoon is not alone. Frameworks like PentestGPT (referenced in editorial comparisons) have explored similar paths. Based on available sources, the difference is the explicit focus on the vendor-classifier problem and the architectural solution via MCP. The claimed $10-per-assessment cost, if verifiable, places autonomous pentesting in an accessibility tier that traditionally demands far higher consulting budgets.

The open question is whether the language-model industry will develop "professional security" access channels with less restrictive classifiers, or whether the split between general-purpose LLMs and isolated security tooling will become the norm. DarkMoon bets on the latter. Its effectiveness, however, hinges on verifications that do not yet exist in the public record.

For security and DevSecOps teams, the project is an option to monitor, not a solution ready for critical environments. The 110-star GitHub metric is a more reliable indicator of development stage than the official site's promises of 80+ tools and 18 agents. The ratio of cost to unverified reliability risk is the calculation every organization must make for itself.

Sources

Information verified against cited sources and current as of publication.
