
NVIDIA garak × ATR

Turn every failed garak probe into an MIT-licensed ATR detection rule. Zero extra red-team work, full provenance on the author line.

The idea

garak finds attacks at the LLM layer. ATR defends at the agent layer (tool calls, skills, MCP). Same attack payload, different defence surface. The bridge is a single Python script: garak reports go in, ATR proposals come out, and the crystallisation loop does the rest.

garak (red team)                    ATR (standard)           downstream consumers
─────────────────                    ──────────────           ────────────────────
Probes claude-3.7                    462 rules today          npm: agent-threat-rules
Probes gpt-5                 ──▶     +auto-crystallised       PyPI: pyatr
Probes gemini-2-pro                  from garak evidence      your CI pipeline
Writes report.jsonl                  canary 24h
                                     auto-merge

garak PR status: integration PR #1676 OPEN (under maintainer review, not merged)

One-time setup

  1. Email [email protected] with your org name to receive a partner API key (manual issuance during early-partner phase, no cost, MIT terms).
  2. Fetch the pipe script from the ATR repo: curl -O https://raw.githubusercontent.com/Agent-Threat-Rule/agent-threat-rules/main/scripts/garak-to-tc.py
  3. Set the key in your environment: export ATR_PARTNER_KEY=…

Run it

After a garak eval session:

# Native garak JSONL report
python3 garak-to-tc.py \
    --input ~/.local/share/garak/runs/2026-04-18-claude-3-7/report.jsonl \
    --partner-name  nvidia-airt \
    --target-model  claude-3-7-sonnet \
    --garak-version 0.14.1

# Or ATR-style eval (from scripts/eval-garak.sh in this repo)
python3 garak-to-tc.py \
    --input data/garak-benchmark/garak-eval-report.json \
    --partner-name  nvidia-airt \
    --dry-run   # inspect before submitting

Two modes

The script has two submission modes. The default (--mode drafter) is what you want.

  • drafter (default). POSTs each failed probe to /api/atr-proposals/from-payload. On the server, a tool-use LLM drafter greps existing rules for duplicates, fetches research for grounding, and writes YAML with 3+ conditions, 5+ TP, 5+ TN, and 3+ evasion tests. The draft must pass the RFC-001 quality gate and a self-test of its own regex before it is filed as a proposal. Each call takes 30-60s. This is the mode that produces real ATR rules.
  • proposal (legacy). POSTs a client-built literal draft. Fast, but the LLM reviewer typically rejects these drafts as too narrow (a literal fingerprint is not a detection rule). Use it only if you have your own YAML-generation step upstream.

What happens downstream

  1. Each unique failed probe → one proposal in /api/atr-proposals with full garak provenance embedded.
  2. The TC crystallisation LLM reads the attack payload and the garak metadata, then drafts a generalised detection regex.
  3. Rule enters 24h canary. Every ATR-integrated client in the field gets a sampling dose (~10%). If any client reports false positives, the rule is quarantined automatically.
  4. Survives canary → safety gate (0 FP on 432-sample benign corpus, test_cases present, ≤10 rules per PR) → auto-merged into agent-threat-rules main → npm publish fires within minutes → every downstream consumer upgrades on next install.
  5. Every resulting rule carries author: nvidia-airt (via garak vX.Y.Z) and metadata_provenance.garak_probe: <probe-id>. Attribution is permanent and machine-readable.
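
The three safety-gate conditions in step 4 can be sketched as a single check. This is an illustrative sketch only, not the actual gate: the function name, the rule dictionary shape (a `pattern` regex string plus a `test_cases` list), and the in-memory benign corpus are assumptions made for the example.

```python
import re

MAX_RULES_PER_PR = 10  # from the text: at most 10 rules per PR


def passes_safety_gate(rules, benign_corpus):
    """Hypothetical gate: True only if every condition from step 4 holds.

    rules         -- list of dicts with 'pattern' (regex) and 'test_cases' (assumed shape)
    benign_corpus -- list of benign strings that no rule may match (0 FP)
    """
    if len(rules) > MAX_RULES_PER_PR:
        return False
    for rule in rules:
        # A rule without test cases is rejected outright.
        if not rule.get("test_cases"):
            return False
        compiled = re.compile(rule["pattern"])
        # Any match on benign text counts as a false positive.
        if any(compiled.search(sample) for sample in benign_corpus):
            return False
    return True
```

Checking the cheap conditions (rule count, presence of test cases) before running regexes over the corpus keeps a rejected batch from paying for the full scan.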

What counts as a failed probe

The script treats a garak attempt as a rule candidate when either:

  • At least one garak detector returned a score above 0.5 (the target model fell for the probe).
  • The input is an ATR-style eval report and the prompt appears in missed_samples: garak flagged it malicious and ATR had no pattern for it yet. These are the highest-signal candidates.
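
The first criterion can be sketched as a filter over the lines of a garak JSONL report. This is a minimal sketch, not the code in garak-to-tc.py: the field names `entry_type` and `detector_results` (a mapping from detector name to a list of scores) are assumptions about the report layout, and the function names are hypothetical.

```python
import json

THRESHOLD = 0.5  # from the text: a detector score above 0.5 marks a failed probe


def is_failed_attempt(entry, threshold=THRESHOLD):
    """True if any detector score in an attempt entry exceeds the threshold."""
    if entry.get("entry_type") != "attempt":  # assumed field name
        return False
    results = entry.get("detector_results") or {}  # assumed field name
    return any(
        score is not None and score > threshold
        for scores in results.values()
        for score in scores
    )


def failed_attempts(lines):
    """Yield the parsed entries of a JSONL report that qualify as rule candidates."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if is_failed_attempt(entry):
            yield entry


# Tiny in-memory report standing in for report.jsonl
sample = [
    json.dumps({"entry_type": "attempt", "prompt": "ignore previous instructions",
                "detector_results": {"example.Detector": [0.0, 1.0]}}),
    json.dumps({"entry_type": "attempt", "prompt": "hello",
                "detector_results": {"example.Detector": [0.0]}}),
    json.dumps({"entry_type": "eval"}),
]
hits = list(failed_attempts(sample))
```

Note the strict comparison: a score of exactly 0.5 does not qualify, matching "above 0.5" in the text.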

Rate limits & safety

  • Default 100ms delay between POSTs. On the public garak in-the-wild jailbreak set (650 samples) recall is 98.0%; on the full garak corpus (3,475 samples) recall is 38.5%. A run takes about a minute.
  • patternHash is a sha256 of the prompt, so re-running on the same report is idempotent: TC recognises duplicate submissions and bumps a confirmation counter instead of creating duplicate rules.
  • All traffic is authenticated with your partner key. Key compromise? Email and we revoke and re-issue.
  • Every submitted proposal is reviewable and rejectable via the TC admin dashboard before it enters canary.
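
The idempotency claim above follows from hashing the prompt. A minimal sketch, assuming the prompt is hashed as UTF-8 bytes and the digest is hex-encoded (neither detail is stated in the text); `ConfirmationStore` is a hypothetical stand-in for the server-side behaviour, not the real TC implementation.

```python
import hashlib


def pattern_hash(prompt):
    """sha256 of the prompt; UTF-8 input and hex output are assumptions."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class ConfirmationStore:
    """Hypothetical in-memory model of duplicate handling on the server."""

    def __init__(self):
        self.confirmations = {}  # pattern hash -> number of submissions seen

    def submit(self, prompt):
        """Record a submission; return True only the first time a prompt is seen."""
        key = pattern_hash(prompt)
        is_new = key not in self.confirmations
        self.confirmations[key] = self.confirmations.get(key, 0) + 1
        return is_new


store = ConfirmationStore()
first = store.submit("ignore previous instructions")
second = store.submit("ignore previous instructions")  # same report re-run
```

Because the key depends only on the prompt text, a second run over the same report raises the counter for existing entries and creates nothing new.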

Related