Validation and Verification: Testing Your Defenses Without Breaking Production

By this point in the guide, we've walked through the full arc: manual exploitation of ShopBox on 192.0.2.10, automated discovery with sqlmap, detection architecture from application logs and network telemetry, and the hardening patterns that became ShopBox 2.0. What remains is the part most teams skip—proving the fixes actually hold, continuously, without waiting for the next penetration test or breach to find out.

I'm going to show you how I institutionalized this at DeafNews: a CI pipeline that replays captured attack payloads against an anonymized clone of our schema, confirms prepared statements reject injection attempts, and tracks whether we're actually getting faster at detection. This is not theoretical. I broke production once in 2017 with a poorly scoped sqlmap run against a staging environment that had live payment processor credentials. Never again.

The Foundation: Capture-Validated Payloads from Earlier Exploitation

Everything we build here rests on artifacts from Pages 3 and 4. When we manually exploited ShopBox's login form with ' OR '1'='1 and later ran sqlmap with --batch --dump, we generated payloads that we know bypassed the original vulnerable code. These are gold for regression testing—not because we want to exploit again, but because they represent the exact attack surface we must now defend against.

I keep these in a structured format I call a payload manifest: each entry records the injection point (URL parameter, header, body field), the database backend (MySQL on 192.0.2.20 or PostgreSQL on 192.0.2.30), the payload itself, and the expected defensive outcome. For ShopBox 2.0, every expected outcome is "rejected cleanly, no query execution, HTTP 400 or sanitized error."

⚠️ Authorized, defensive use only. These payloads are stored encrypted and only decrypted in isolated CI runners with no external connectivity. Never commit live payloads to a repository that production systems can access.

Here's the manifest structure I use—adapted from our actual Jenkins pipeline at DeafNews:

# payload-manifest.yaml — truncated example
payloads:
  - id: SB-LOGIN-001
    surface: /auth/login
    method: POST
    field: username
    backend: mysql
    original_payload: "' OR '1'='1"
    expected_behavior: "prepared_statement_rejection"
    severity: critical
  - id: SB-SEARCH-003
    surface: /products/search
    method: GET
    field: q
    backend: postgresql
    original_payload: "test' UNION SELECT null,version()--"
    expected_behavior: "prepared_statement_rejection"
    severity: high

The original_payload field contains the exact string that succeeded against ShopBox 1.0. In plain terms: we're keeping a library of attacks that used to work, so we can automatically verify they no longer work.
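A manifest is only useful if every entry is complete, so it can help to check it before any payload is sent. The following is a minimal sketch that assumes the field names shown in the example above; `validate_manifest` and its allowed-value sets are hypothetical helpers for illustration, not part of the pipeline described here.

```python
# Hypothetical pre-flight check for a payload manifest (illustrative only).
REQUIRED_FIELDS = {
    "id", "surface", "method", "field", "backend",
    "original_payload", "expected_behavior", "severity",
}
ALLOWED_METHODS = {"GET", "POST"}          # assumption: only these two are used
ALLOWED_BACKENDS = {"mysql", "postgresql"}  # assumption: matches the two lab backends


def validate_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means usable."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(manifest.get("payloads", [])):
        label = entry.get("id", f"entry #{index}")
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"{label}: missing fields {sorted(missing)}")
        if "method" in entry and entry["method"] not in ALLOWED_METHODS:
            problems.append(f"{label}: unsupported method {entry['method']!r}")
        if "backend" in entry and entry["backend"] not in ALLOWED_BACKENDS:
            problems.append(f"{label}: unknown backend {entry['backend']!r}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems
```

In a pipeline, a non-empty result would typically abort the run before any request is made, since a half-specified entry tends to produce a test that passes for the wrong reason.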

Building the Isolated Test Target

Before any pipeline runs, we need a target that is structurally identical to production but contains no real data. I use a schema-anonymized clone: same table structures, indexes, and stored procedures, but with generated data and no links to external systems. The specifics of how you anonymize will depend on your database tooling; in our case, we export the CREATE TABLE and CREATE INDEX statements, strip any GRANT statements that reference production hosts, and populate the tables with faker-generated rows.

Our CI target runs at 192.0.2.100 in the lab, on a dedicated VLAN with no route to 192.0.2.10, 192.0.2.20, or 192.0.2.30. The pipeline provisions this from a Docker Compose file:

# docker-compose.ci.yml — ShopBox 2.0 test target
version: '3.8'
services:
  shopbox-app:
    build: ./shopbox-2.0
    ports:
      - "8080:8080"
    environment:
      - DB_HOST=192.0.2.100
      - DB_NAME=shopbox_test
      - DB_USER=test_runner
      - DB_PASS=${DB_TEST_PASS}  # injected from CI secrets
  shopbox-db-mysql:
    image: mysql:8.0  # check the latest release for your environment
    environment:
      - MYSQL_ROOT_PASSWORD=${DB_ROOT_PASS}
      - MYSQL_DATABASE=shopbox_test
    volumes:
      - ./schema/mysql-anonymized.sql:/docker-entrypoint-initdb.d/01-schema.sql
  shopbox-db-postgres:
    image: postgres:15  # check the latest release for your environment
    environment:
      - POSTGRES_PASSWORD=${DB_ROOT_PASS}
      - POSTGRES_DB=shopbox_test
    volumes:
      - ./schema/postgres-anonymized.sql:/docker-entrypoint-initdb.d/01-schema.sql

Why this matters: Running against a real production mirror—even with "test" data—risks credential leakage, accidental notification firing, and regulatory exposure if personal data persists. The VLAN isolation is non-negotiable. I verify it with a traceroute from the CI runner before each test suite starts.
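A traceroute is a manual check; the same idea can be expressed as a gate that the pipeline runs itself. The sketch below is one possible approach, not the check used in the pipeline described here: it tries a TCP connection to each production address and aborts if any succeeds. The port numbers (443, 3306, 5432) are assumptions based on common defaults, and `reachable` and `assert_isolated` are hypothetical names.

```python
import socket

# Assumption: production addresses from this guide, with default service ports.
PRODUCTION_HOSTS = [
    ("192.0.2.10", 443),   # ShopBox application (port is an assumption)
    ("192.0.2.20", 3306),  # MySQL default port
    ("192.0.2.30", 5432),  # PostgreSQL default port
]


def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def assert_isolated(targets=PRODUCTION_HOSTS, probe=reachable):
    """Raise RuntimeError if any production target can be reached from this runner."""
    leaks = [(host, port) for host, port in targets if probe(host, port)]
    if leaks:
        raise RuntimeError(f"CI runner can reach production hosts: {leaks}")
```

Calling `assert_isolated()` as the first step of the suite makes a routing mistake fail loudly instead of letting payloads reach a host they should never touch. A TCP probe only covers the listed ports, so it complements a route-level check rather than replacing it.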

Negative Testing: Confirming Prepared Statements Reject Injection

With the target running, we execute negative tests—verifying that known-bad inputs fail as expected. A negative test is the opposite of traditional QA: instead of confirming functionality works, we confirm that malfunction is prevented. If a prepared statement is properly implemented, the payload should be treated as a literal string value, never concatenated into the query.
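The difference between concatenation and parameter binding is easy to see in isolation. This is a minimal, self-contained illustration using Python's built-in sqlite3 module with an in-memory database; it is not ShopBox code, and the `users` table and its single row are invented for the example. MySQL and PostgreSQL drivers use different placeholder syntax, but the principle is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

payload = "' OR '1'='1"

# Vulnerable pattern: the payload becomes part of the SQL text, so the
# WHERE clause turns into  username = '' OR '1'='1'  and matches every row.
concatenated_sql = f"SELECT username FROM users WHERE username = '{payload}'"
rows_concatenated = conn.execute(concatenated_sql).fetchall()

# Parameterized pattern: the payload is bound as a value and compared
# literally against the username column, so nothing matches.
rows_parameterized = conn.execute(
    "SELECT username FROM users WHERE username = ?", (payload,)
).fetchall()

print(rows_concatenated)   # the concatenated query leaks the row
print(rows_parameterized)  # the bound query returns no rows
```

The negative tests in this section assert the second behavior from the outside: the same payload goes in, and nothing comes back.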

I use a Python test runner with pytest and requests, but the pattern transfers to any language. The key assertion is not "response contains no data"—that's fragile. It's "database query log shows parameterized execution with no string interpolation."

# test_sql_injection_negative.py — excerpt
import requests
import pytest
import yaml

with open('payload-manifest.yaml') as f:
    MANIFEST = yaml.safe_load(f)

CI_TARGET = "http://192.0.2.100:8080"


@pytest.mark.parametrize(
    "payload",
    MANIFEST['payloads'],
    ids=[p['id'] for p in MANIFEST['payloads']],  # test IDs such as [SB-LOGIN-001]
)
def test_payload_rejected(payload):
    """Verify each captured payload is rejected by ShopBox 2.0 defenses."""
    url = f"{CI_TARGET}{payload['surface']}"
    if payload['method'] == 'POST':
        response = requests.post(url, data={payload['field']: payload['original_payload']})
    else:
        response = requests.get(url, params={payload['field']: payload['original_payload']})

    # Assert: no successful data extraction (generic check)
    assert response.status_code in [400, 403, 422, 500], \
        f"Unexpected success status for {payload['id']}"

    # Assert: response body contains no database error leakage
    assert "SQL" not in response.text.upper(), \
        f"Possible error disclosure in {payload['id']}"

    # Assert: query log shows parameterized execution (requires DB access)
    # This is checked via separate audit query against performance_schema or pg_stat_statements

The third assertion is the one most teams skip. They see a 400 status and declare victory. I don't: I've seen applications return 400 while still logging the full payload to a SIEM, or worse, passing it to a downstream system that is vulnerable. The database-side verification requires read access to performance_schema.prepared_statements_instances on MySQL or pg_stat_statements on PostgreSQL, which your CI runner must be granted.

Run this locally before committing to CI:

# Run negative test suite against isolated target
pytest test_sql_injection_negative.py -v --tb=short
# illustrative output; verify on your target
# test_sql_injection_negative.py::test_payload_rejected[SB-LOGIN-001] PASSED
# test_sql_injection_negative.py::test_payload_rejected[SB-SEARCH-003] PASSED
# ...
# 12 passed in 4.32s

If a test fails here, the payload got through: a success for the attacker and a critical regression for you. I once had SB-SEARCH-003 fail because a junior developer had disabled parameterization for a "quick fix" on the search endpoint. The CI failure blocked the merge. That's the point.
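The database-side audit mentioned earlier in this section can be approximated by comparing the statement texts the database recorded during the test run against the small set of statement shapes the application is expected to issue. An injected payload that reaches the SQL text changes the shape of the statement, for example by adding a UNION branch. The sketch below assumes the statement texts have already been fetched (from pg_stat_statements or performance_schema) into a list of strings; `unexpected_statements`, the allowlist, and the table and column names are hypothetical.

```python
import re


def normalize(sql: str) -> str:
    """Collapse whitespace and case so cosmetic differences do not matter."""
    return re.sub(r"\s+", " ", sql).strip().lower()


def unexpected_statements(observed, allowed):
    """Return observed statements whose shape is not in the allowlist."""
    allowed_shapes = {normalize(statement) for statement in allowed}
    return [s for s in observed if normalize(s) not in allowed_shapes]


# Hypothetical allowlist: the one query shape the search endpoint should produce.
ALLOWED = ["SELECT id, name FROM products WHERE name LIKE $1"]

# Hypothetical statements recorded during a test run.
OBSERVED = [
    "select id, name  from products where name like $1",
    "SELECT id, name FROM products WHERE name LIKE $1 UNION SELECT null, version()",
]

suspicious = unexpected_statements(OBSERVED, ALLOWED)
```

Here `suspicious` contains only the UNION statement. An empty result after the negative suite has run is evidence that every payload stayed a bound value; any entry is worth a failed build and a manual look.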

OWASP ZAP Baseline Scan in CI

Negative tests verify specific known payloads. DAST (Dynamic Application Security Testing, essentially automated penetration testing against a running application) catches what you didn't think to include. For ShopBox 2.0, I integrate the ZAP Baseline Scan—a script included in ZAP's Docker images intended specifically for CI/CD environments.

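The baseline scan spiders your application for one minute by default, then waits for passive scanning to finish before reporting. Because it is passive, it does not send attack payloads itself: the SQL injection rules configured below are active scan rules, and they only fire when the same configuration file is passed to zap-full-scan.py, which adds an active scan on top of the spider. By default, every finding is a WARNing. You need a configuration file to escalate SQL injection rules to FAIL; otherwise your pipeline stays green while vulnerabilities accumulate.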

Here's my GitLab CI stage (adaptable to Jenkins, GitHub Actions, or any Docker-capable runner):

# .gitlab-ci.yml — ZAP baseline stage
stages:
  - deploy-test
  - security-scan
  - report

variables:
  ZAP_TARGET: "http://192.0.2.100:8080"
  ZAP_CONFIG: "zap-baseline.conf"

zap-baseline-scan:
  stage: security-scan
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker pull ghcr.io/zaproxy/zaproxy:stable  # check the latest release
    # -I: do not return a failure exit code on WARN, so only FAIL rules block the job
    - >
      docker run -v $(pwd):/zap/wrk/:rw -t ghcr.io/zaproxy/zaproxy:stable
      zap-baseline.py -t ${ZAP_TARGET} -c /zap/wrk/${ZAP_CONFIG}
      -r zap-report.html -w zap-report.md -I
  artifacts:
    reports:
      junit: zap-junit-report.xml  # requires additional conversion step
    paths:
      - zap-report.html
      - zap-report.md
  allow_failure: false  # FAIL findings block the pipeline

And the configuration file that makes this meaningful for SQL injection:

# zap-baseline.conf — escalate SQL injection to FAIL
# Format: RULE_ID ACTION [TAB] Optional message
# Rule names are informational; only IDs matter for matching
# Fields are tab-separated; confirm IDs against the ZAP alert reference for your version
40018	FAIL	SQL Injection
40019	FAIL	SQL Injection - MySQL
40020	FAIL	SQL Injection - Hypersonic SQL
40021	FAIL	SQL Injection - Oracle
40022	FAIL	SQL Injection - PostgreSQL
40024	FAIL	SQL Injection - SQLite
40027	FAIL	SQL Injection - MsSQL
# 40025 (Proxy Disclosure) and 40026 (Cross Site Scripting, DOM Based) are not SQL injection rules
# Reduce noise from rules we handle separately
10106 IGNORE Informational Disclosure
10109 IGNORE Modern Web Application

Why this matters: Without the config file, every alert ZAP raises stays at WARN, SQL injection included, and nothing forces the pipeline to stop on it. I've seen teams run this for months, congratulating themselves on "no critical issues," while the report sat unread in artifact storage. The FAIL escalation is what makes the scan gate-worthy.
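Because the gate depends entirely on that file, it is worth guarding the file itself against silent edits. The sketch below is a hypothetical pre-scan check, not a ZAP feature: it parses a config in the tab-separated `ID<TAB>ACTION<TAB>name` layout and reports any required rule ID that is not set to FAIL. The function names and the sample text are assumptions for illustration.

```python
def parse_zap_config(text: str) -> dict:
    """Map rule ID -> action for each non-comment line of a ZAP scan config."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            # Not tab-separated: skip it, so the rule counts as unconfigured.
            continue
        rules[parts[0].strip()] = parts[1].strip().upper()
    return rules


def rules_not_failing(text: str, required_ids) -> list:
    """Return the required rule IDs that are missing or not escalated to FAIL."""
    rules = parse_zap_config(text)
    return [rule_id for rule_id in required_ids if rules.get(rule_id) != "FAIL"]


# Illustrative config text; 40019 has been quietly downgraded to WARN.
SAMPLE = "# comment\n40018\tFAIL\tSQL Injection\n40019\tWARN\tSQL Injection - MySQL\n"
downgraded = rules_not_failing(SAMPLE, ["40018", "40019"])
```

In this example `downgraded` is `["40019"]`. Running a check like this before the scan turns a one-line downgrade in a merge request into a visible failure instead of a quieter report.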

The ZAP Automation Framework (an add-on for flexible ZAP automation) can replace zap-baseline.py with more complex workflows, but for ShopBox 2.0 the baseline script suffices. If you need authenticated scanning or custom spidering, that's when you reach for the framework's YAML-based job definitions.

sqlmap Regression: Controlled Replay of Automated Discovery

ZAP finds what it can generically. sqlmap finds what a dedicated attacker would find. I run sqlmap in CI against ShopBox 2.0 with the same flags we used offensively in Page 4, but with safety constraints that prevent actual data extraction:

# sqlmap regression test — safe for CI
python sqlmap.py -u "http://192.0.2.100:8080/products/search?q=test" \
  --batch \
  --level=2 --risk=1 \
  --safe-freq=2 \
  --skip-urlencode \
  --technique=BEUSTQ \
  --flush-session \
  --answers="follow=Y" \
  --string="No products found"
# illustrative output; verify on your target
# [INFO] testing connection to the target URL
# [INFO] testing if the target URL content is stable
# [WARNING] heuristic (basic) test shows that GET parameter 'q' might not be injectable
# [INFO] testing for SQL injection on GET parameter 'q'
# [WARNING] GET parameter 'q' does not seem to be injectable
# ...
# [INFO] fetched data logged to text files under '/home/ci-runner/.local/share/sqlmap/output/192.0.2.100'
# [WARNING] no parameter(s) found for testing in the provided data

The critical flags here: --batch (no interactive prompts), --safe-freq=2 (visit a known-safe URL after every 2 test requests so the target does not drop the session; it is normally paired with --safe-url, and --delay is the option to reach for if the goal is throttling), --flush-session (don't reuse cached results from prior runs), and --string="No products found" (the text sqlmap matches to decide that a test query evaluated to True). I omit --dump, --os-shell, and any data extraction flags entirely. This is verification, not exploitation.

If sqlmap reports any parameter as injectable against ShopBox 2.0, the pipeline fails. Full stop. I once had this trigger because a test environment had been deployed with DEBUG=True in the application config, which changed error handling enough to re-expose injection. The CI caught what code review missed.
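Turning "sqlmap reported something injectable" into a pipeline failure needs a small wrapper that inspects the captured console output. The following is a sketch under assumptions: the marker phrases are taken from typical sqlmap console messages and may differ between versions, so verify them against your own runs. `sqlmap_found_injection` is a hypothetical helper.

```python
import re

# Assumption: phrases sqlmap typically prints when it confirms an injection.
INJECTABLE_MARKERS = (
    re.compile(r"sqlmap identified the following injection point", re.IGNORECASE),
    re.compile(r"parameter '[^']+' is vulnerable", re.IGNORECASE),
    re.compile(r"parameter '[^']+' appears to be '.+' injectable", re.IGNORECASE),
)


def sqlmap_found_injection(output: str) -> bool:
    """Return True if any confirmed-injection marker appears in the output."""
    return any(pattern.search(output) for pattern in INJECTABLE_MARKERS)


CLEAN_RUN = """\
[WARNING] heuristic (basic) test shows that GET parameter 'q' might not be injectable
[INFO] testing for SQL injection on GET parameter 'q'
[WARNING] GET parameter 'q' does not seem to be injectable
"""

REGRESSED_RUN = """\
[INFO] GET parameter 'q' appears to be 'AND boolean-based blind - WHERE or HAVING clause' injectable
sqlmap identified the following injection point(s) with a total of 46 HTTP(s) requests:
"""
```

A CI step would capture sqlmap's stdout, pass it to `sqlmap_found_injection`, and exit non-zero when it returns True. The patterns are deliberately narrow so that "might not be injectable" and "does not seem to be injectable" do not trip the gate.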

Metrics That Matter: Tracking Defensive Capability

Running tests is pointless if you can't tell whether you're improving. I track three metrics weekly, derived from CI runs and our SOC's alert handling:

Metric	What It Measures	How I Calculate It	Target
MTTD Synthetic	Mean time from test payload execution to SOC alert generation	Timestamp of first Splunk alert for that test ID minus timestamp of the CI test run	< 5 minutes
False Positive Rate Trend	Percentage of ZAP/sqlmap findings that are non-exploitable upon manual review	Monthly: (total ZAP alerts - confirmed true positives) / total ZAP alerts	Declining month-over-month
Negative Test Pass Rate	Percentage of captured payloads correctly rejected	Daily CI: passed negative tests / total negative tests	100%

MTTD (Mean Time To Detect) is the average duration between when malicious activity begins and when your monitoring systems or analysts identify it. For synthetic attacks, I measure this precisely because I control both the attack start time and the detection timestamp. Traditional MTTD calculations can report near-zero detection time in some configurations, for example when the incident's start time is taken from the alert itself, which is misleading. I avoid this by recording the start time on the CI side and by using dedicated test IDs that bypass correlation rules that might auto-close.
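The arithmetic behind the synthetic MTTD is simple enough to sketch. This example assumes each test run yields a pair of timestamps, one recorded by CI when the payload was sent and one taken from the first matching alert; `synthetic_mttd_minutes` and the sample timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def synthetic_mttd_minutes(events):
    """Mean delay in minutes over (test_started_at, alert_raised_at) pairs."""
    delays = [
        (alert_at - started_at).total_seconds() / 60
        for started_at, alert_at in events
    ]
    if any(delay < 0 for delay in delays):
        # An alert before the test started means the pairing or the clocks are wrong.
        raise ValueError("alert timestamp precedes test start")
    return mean(delays)


# Illustrative pairs: detection delays of 3 and 5 minutes.
SAMPLE_EVENTS = [
    (datetime(2024, 1, 8, 2, 0, 0), datetime(2024, 1, 8, 2, 3, 0)),
    (datetime(2024, 1, 9, 2, 0, 0), datetime(2024, 1, 9, 2, 5, 0)),
]
weekly_mttd = synthetic_mttd_minutes(SAMPLE_EVENTS)  # 4.0 minutes
```

Subtracting in this direction (alert minus start) keeps the value positive, and the negative-delay guard catches clock skew between the CI runner and the SIEM before it distorts the average.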

The false positive rate trend is where most security programs embarrass themselves. Early in our ZAP deployment, we ran at 60% false positives—mostly from the "Informational Disclosure" rules we now ignore. Tracking this monthly forced us to tune configurations and justify each rule escalation. The trend matters more than the absolute number; a flat 10% is worse than a declining 15%.
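As a rough illustration of the trend check, the sketch below computes a monthly false positive rate as the share of alerts that were not confirmed as true positives, then tests whether a series of monthly rates is strictly declining. Both function names are hypothetical, and the strict month-over-month definition of "declining" is an assumption; a team might reasonably prefer a moving average.

```python
def false_positive_rate(total_alerts: int, confirmed_true_positives: int) -> float:
    """Share of alerts that manual review did not confirm as exploitable."""
    if total_alerts == 0:
        return 0.0  # no alerts means nothing to misclassify
    return (total_alerts - confirmed_true_positives) / total_alerts


def is_declining(monthly_rates) -> bool:
    """True if every month's rate is lower than the month before it."""
    return all(
        later < earlier
        for earlier, later in zip(monthly_rates, monthly_rates[1:])
    )


# Illustrative numbers only: 50 alerts with 20 confirmed gives a 60% rate.
first_month = false_positive_rate(50, 20)
```

Under this definition, a series such as 0.60, 0.40, 0.15 counts as declining, while a flat 0.10, 0.10, 0.10 does not, which matches the point that the direction matters more than the absolute number.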

I publish these in a simple Markdown report generated by the CI pipeline:

# generate-metrics-report.sh — appended to CI
#!/bin/bash
echo "## ShopBox 2.0 Security Metrics — $(date +%Y-%m-%d)" > metrics-report.md
echo "" >> metrics-report.md
echo "| Metric | Current | Target | Status |" >> metrics-report.md
echo "|--------|---------|--------|--------|" >> metrics-report.md

# MTTD Synthetic: pulled from the Splunk API, simplified here
MTTD=$(curl -s "https://splunk.deafnews.internal:8089/services/search/jobs/export" \
  --data-urlencode "search=search earliest=-7d@d test_id=SB-* | stats avg(detection_delay)" \
  -d "output_mode=csv" \
  -u "${SPLUNK_USER}:${SPLUNK_PASS}" | tail -1 | tr -d '"')
# illustrative output; verify against your SIEM
# 4.3
# awk -v passes the value as data; an empty value is reported as FAIL
echo "| MTTD Synthetic | ${MTTD:-N/A}m | <5m | $(awk -v v="$MTTD" 'BEGIN{print (v!="" && v+0<5)?"✅ PASS":"❌ FAIL"}') |" >> metrics-report.md

# False positives: raw count from ZAP triage history (empty if the log is missing)
FP_RATE=$(grep -c "FALSE_POSITIVE" zap-history.log 2>/dev/null)
echo "| False Positive Rate | ${FP_RATE:-N/A} | declining | trend |" >> metrics-report.md

# Negative test pass rate: last percentage printed by pytest
PASS_RATE=$(grep -oP '\d+(?=%)' pytest-report.log 2>/dev/null | tail -1)
echo "| Negative Test Pass | ${PASS_RATE:-N/A}% | 100% | $(awk -v v="$PASS_RATE" 'BEGIN{print (v==100)?"✅ PASS":"❌ FAIL"}') |" >> metrics-report.md

The Closed Loop

This page connects everything that came before. The payloads we captured in exploitation become our regression test corpus. The detection architecture from Pages 5-6 gives us the telemetry to measure MTTD. The hardened patterns from Page 7 are what we verify with negative tests and ZAP.

What I've described is not a one-time audit. It's a continuous validation system that runs on every merge request, every night, and reports weekly. The CI pipeline fails. The metrics trend. The team responds.

That's how you know the defense is real.