
NVIDIA garak × ATR

Turn every failed garak probe into an MIT-licensed ATR detection rule. Zero extra red-team work, full provenance on the author line.

The idea

garak finds attacks at the LLM layer. ATR defends at the agent layer (tool calls, skills, MCP). Same attack payload, different defence surface. The bridge is a single Python script: garak reports in, ATR proposals out, crystallisation loop does the rest.

garak (red team)                    ATR (standard)           downstream
─────────────────                    ──────────────           ──────────
Probes claude-3.7                    314 rules today          npm: agent-threat-rules
Probes gpt-5                 ──▶     +auto-crystallised       PyPI: pyatr
Probes gemini-2-pro                  from garak evidence      Cisco DefenseClaw
Writes report.jsonl                  canary 24h               OWASP Agentic Top 10
                                     auto-merge               your CI pipeline

One-time setup

  1. Email [email protected] with your org name to receive a partner API key (manual issuance during early-partner phase, no cost, MIT terms).
  2. Fetch the pipe script, which lives in the ATR repo: curl -O https://raw.githubusercontent.com/Agent-Threat-Rule/agent-threat-rules/main/scripts/garak-to-tc.py
  3. Set the key in your environment: export ATR_PARTNER_KEY=…

Run it

After a garak eval session:

# Native garak JSONL report
python3 garak-to-tc.py \
    --input ~/.local/share/garak/runs/2026-04-18-claude-3-7/report.jsonl \
    --partner-name  nvidia-airt \
    --target-model  claude-3-7-sonnet \
    --garak-version 0.14.1

# Or ATR-style eval (from scripts/eval-garak.sh in this repo)
python3 garak-to-tc.py \
    --input data/garak-benchmark/garak-eval-report.json \
    --partner-name  nvidia-airt \
    --dry-run   # inspect before submitting

Two modes

The script has two submission modes. The default (--mode drafter) is what you want.

  • drafter (default). POSTs each failed probe to /api/atr-proposals/from-payload. The server runs a tool-use LLM drafter that greps existing rules for dedup, fetches research for grounding, and writes YAML with 3+ conditions / 5+ TP / 5+ TN / 3+ evasion tests. The draft must then pass the RFC-001 quality gate and a self-test of its own regex before it is filed as a proposal. Each call takes 30-60s. This is the mode that produces real ATR rules.
  • proposal (legacy). POSTs a client-built literal draft. Fast, but the LLM reviewer typically rejects these as too narrow (literal fingerprint ≠ detection rule). Use it only if you have your own YAML-generation step upstream.
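
As a rough illustration of the drafter mode, the sketch below builds (but does not send) one POST for a failed probe. Only the endpoint path /api/atr-proposals/from-payload comes from the description above; the base URL, the bearer-token header, the JSON field names, and the probe id are hypothetical placeholders, not the script's actual wire format.

```python
import json
import os
import urllib.request

# Hypothetical base URL; the real host is not given in this document.
BASE_URL = "https://tc.example.invalid"
DRAFTER_PATH = "/api/atr-proposals/from-payload"  # path from the text above


def build_drafter_request(prompt, probe_id, partner_name, api_key):
    """Build (but do not send) a POST for one failed probe.

    The JSON field names and the bearer-token header are assumptions.
    """
    body = json.dumps({
        "payload": prompt,
        "garak_probe": probe_id,
        "partner_name": partner_name,
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + DRAFTER_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_drafter_request(
    prompt="example attack payload",
    probe_id="example.HypotheticalProbe",  # placeholder probe id
    partner_name="nvidia-airt",
    api_key=os.environ.get("ATR_PARTNER_KEY", "dummy-key"),
)
```

Passing req to urllib.request.urlopen would send it. Given the 30-60s drafting time per call, a generous timeout would be sensible.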

What happens downstream

  1. Each unique failed probe → one proposal in /api/atr-proposals with full garak provenance embedded.
  2. The TC crystallisation LLM reads the attack payload and the garak metadata, then drafts a generalised detection regex.
  3. Rule enters 24h canary. Every ATR-integrated client in the field gets a sampling dose (~10%). If any client reports false positives, the rule is quarantined automatically.
  4. Survives canary → safety gate (0 FP on 432-sample benign corpus, test_cases present, ≤10 rules per PR) → auto-merged into agent-threat-rules main → npm publish fires within minutes → every downstream consumer upgrades on next install.
  5. Every resulting rule carries author: nvidia-airt (via garak vX.Y.Z) and metadata_provenance.garak_probe: <probe-id>. Attribution is permanent and machine-readable.
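
The safety gate in step 4 can be read as three conditions on a batch of rules. The sketch below is one possible encoding, assuming a hypothetical CandidateRule record; the real gate runs server-side and its implementation is not shown in this document.

```python
from dataclasses import dataclass, field

MAX_RULES_PER_PR = 10  # from the text: at most 10 rules per PR


@dataclass
class CandidateRule:
    # Hypothetical shape for a rule that has survived canary.
    rule_id: str
    benign_false_positives: int  # hits against the benign corpus
    test_cases: list = field(default_factory=list)


def passes_safety_gate(batch):
    """Return True when the whole batch meets the three stated conditions."""
    if len(batch) > MAX_RULES_PER_PR:
        return False
    return all(
        r.benign_false_positives == 0 and len(r.test_cases) > 0
        for r in batch
    )
```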

What counts as a failed probe

The script treats a garak attempt as a rule candidate when either:

  • At least one garak detector returned a score above 0.5 (the target model fell for the probe).
  • The input is an ATR-style eval report and the prompt appears in missed_samples — garak flagged it malicious and ATR had no pattern for it yet (highest-signal candidates).
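
The first criterion can be sketched as a filter over the lines of a garak report. This assumes each attempt entry has entry_type set to "attempt" and a detector_results field mapping detector names to lists of scores; treat those field names as assumptions and check them against the report your garak version actually writes.

```python
import json

SCORE_THRESHOLD = 0.5  # from the text: a detector score above 0.5 counts as a hit


def is_failed_attempt(entry):
    """True if any detector score for this attempt exceeds the threshold.

    Assumes each attempt carries `detector_results` mapping a detector
    name to a list of per-output scores.
    """
    if entry.get("entry_type") != "attempt":
        return False
    results = entry.get("detector_results") or {}
    return any(
        score > SCORE_THRESHOLD
        for scores in results.values()
        for score in scores
    )


def failed_attempts(jsonl_lines):
    """Yield parsed attempt entries that qualify as rule candidates."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if is_failed_attempt(entry):
            yield entry


# Tiny synthetic report: only the first entry has a score above 0.5.
sample = [
    '{"entry_type": "attempt", "prompt": "p1", "detector_results": {"d": [0.0, 0.9]}}',
    '{"entry_type": "attempt", "prompt": "p2", "detector_results": {"d": [0.5]}}',
    '{"entry_type": "eval", "probe": "x"}',
]
candidates = [e["prompt"] for e in failed_attempts(sample)]
```

A score of exactly 0.5 is not treated as a failure here, matching the "above 0.5" wording.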

Rate limits & safety

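  • Default 100ms delay between POSTs. At 666 findings (the public garak in-the-wild-jailbreak set) the delays alone add up to about 67 seconds; in drafter mode, the 30-60s of server-side processing per call comes on top of that.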
  • patternHash is a sha256 of the prompt, so re-running on the same report is idempotent — TC recognises duplicate submissions and just bumps a confirmation counter, never creates duplicate rules.
  • All traffic is authenticated with your partner key. Key compromise? Email and we revoke and re-issue.
  • Every submitted proposal is reviewable and rejectable via the TC admin dashboard before it enters canary.
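
The idempotency and delay points above can be illustrated in a few lines. This is a sketch only: the text says patternHash is a sha256 of the prompt, but the UTF-8 encoding and hex output used here are assumptions, and the helper names are hypothetical.

```python
import hashlib

POST_DELAY_SECONDS = 0.1  # the default 100 ms delay from the text


def pattern_hash(prompt):
    """SHA-256 hex digest of the prompt; UTF-8 encoding is an assumption."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def dedupe_by_hash(prompts):
    """Keep the first occurrence of each prompt, keyed by its hash."""
    seen = {}
    for p in prompts:
        seen.setdefault(pattern_hash(p), p)
    return seen


def total_delay_seconds(n_findings, delay=POST_DELAY_SECONDS):
    """Time spent purely in inter-request delays, excluding server time."""
    return n_findings * delay


unique = dedupe_by_hash(["attack A", "attack B", "attack A"])
delay = total_delay_seconds(666)  # roughly 66.6 seconds of pure delay
```

Because the same prompt always hashes to the same value, a re-run produces the same keys, which is what lets the server count a resubmission as a confirmation instead of creating a new rule.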

Related