Skip to content
For red teams

Find the attack.
We ship the rule.
2 hours 16 minutes.

One attack, crystallized into one rule, becomes a word every defender shares. Microsoft AGT pulls it weekly, Cisco AI Defense ships it in skill-scanner, MISP exports it to STIX with the discoverer's name attached. A closed-loop flywheel: attacks feed the standard, and the standard evolves at the same speed as the threat.

Why an attack belongs in the standard
2h 16m
Disclosure to published rule

2026-05-07 Microsoft Security disclosed Semantic Kernel CVE-2026-26030. 2026-05-11 06:07 UTC a Microsoft Copilot SWE Agent opened a regression-test PR presuming ATR already covered it. 08:24 UTC v2.1.2 published on npm with the paired rules. The loop from a disclosed attack to a shared detection closes in hours, not committee cycles.

5
Independent organizations carry the same rules

Microsoft Agent Governance Toolkit (weekly auto-sync). Cisco AI Defense (rule pack in skill-scanner). CIRCL/MISP (taxonomies + galaxy merged by the project lead). OWASP A-S-R-H (lead merged with "Welcome to the team"). Gen Digital Sage (Norton / Avast parent security team merged). A rule written once becomes vocabulary five organizations share — which is the point of a standard.

655
Rules, each carrying the discoverer's name

Every rule carries author + metadata_provenance.discovered_by, and a stable ATR-YYYY-NNNNN ID that never changes after publication. Microsoft AGT, Cisco AI Defense, MISP, and OWASP all preserve the field on sync. When MISP exports to STIX, the attribution survives the format change. When NIST references the rule, the lineage back to the original red-teamer stays intact.

0 FP
Required across the benign gate before a rule ships

A 6-check quality gate: own-TP must match + 431 benign + 1,352 extended + 157 research-mention + 1,611 cross-rule conflict-free + own true_negative coverage. A rule that fires on the paper describing the attack does not ship. The standard would rather miss a clever variant than corrupt the corpus with a false positive — and it publishes its real per-lane figures rather than a single flattering one.

What you get

What your attack leaves behind once it's in the standard — depending on who you are.

For
Academic researchers

A citable artifact paired with the attack you described — one with a stable ID, real downstream consumers, and fire counts in production telemetry, not just a paper that sits behind a paywall.

Your name lives in the rule file's author + metadata_provenance.discovered_by. When MISP exports the taxonomy to STIX, the attribution propagates. When NIST references the rule (collaboration in progress with iMichaela at NIST OSCAL — a working branch, not an endorsement), the lineage back to you is intact.

For
Corporate red teams

The variant your team found stops being an internal finding and becomes shared defensive vocabulary. Your team appears as the discoverer in the rule corpus that Microsoft AGT, Cisco AI Defense, MISP, OWASP A-S-R-H, and Gen Digital Sage all pull from.

Microsoft's Copilot SWE Agent already opens PRs presuming ATR coverage (AGT #1981, closed 2026-05-11). Originating an ATR-2026-NNNNN rule is durable, verifiable provenance — it sits in a public corpus, not a slide.

For
Independent researchers

Ship a detection without hand-writing regex. Submit positive and negative examples; the deterministic auto-regex generator tries four variants against the full gate, and roughly one in three passes on the first attempt.

The generator clears 0 FP across 3,551 samples (benign + extended + research-mention + cross-rule) before a candidate is ever shown. The PR labels itself gate-passed and goes straight to maintainer review — the same gate every rule in the standard had to clear.

For
Bug bounty hunters

The same attack does two jobs: the bounty payout, and a permanent rule that catches it everywhere afterward. Composes cleanly with Huntr.dev, HackerOne AI scope, and Protect AI bounty programmes.

ATR proposals are MIT-licensed and citable. No NDA conflict — the bounty programme owns the responsible-disclosure window; ATR ships the detection only after disclosure, with public attribution to the finder.

What happens after you submit

Probe in. Auto-regex generates. Quality gate validates. Merge if green.

01

You fill the form

3 attack samples, 3 benign lookalikes, attack category, source paper / repo. Takes 3-5 minutes.

No schema to learn. No YAML to write. No fork.

02

Auto-regex tries 4 variants

Deterministic n-gram set-cover algorithm extracts distinctive phrases from your positives, builds an alternation regex, tightens with word boundaries / whitespace anchors / co-occurrence constraints. Each variant runs through the full gate.

Gate = your TPs must match 100% + 1,783-sample benign+extended corpus 0 FP + 157-sample research-mention 0 FP + 0 cross-rule conflicts.

03

Gate clears → complete rule on a PR

PR lands with the gate-passed label. Maintainer reviews regex shape — is it too literal, can it generalize? Usually merged within 1-3 days. If gate didn't clear, stays as stub and a maintainer hand-crafts the regex (still using your test cases as ground truth).

04

Rule auto-propagates downstream

Microsoft AGT syncs weekly. Cisco AI Defense pins to release tags. MISP taxonomy + galaxy pull on every release. OWASP A-S-R-H references rule IDs in fixtures. Your discovered_by field rides the whole chain — this is the back half of the flywheel: one attack becomes one rule, and one rule becomes vocabulary every defender shares.

Microsoft Copilot's regression PR went from opened to v2.1.2 npm publish in 2h 16m (06:07 → 08:24 UTC). That cadence is how a standard stays level with the threat instead of lagging it.

Red team tooling in motion

The offensive side is already feeding the standard. Your tool can be next.

These are red-team tools — offensive testing frameworks and adversarial corpora. Each is an entry point into the flywheel. Defensive frameworks live on /ecosystem.

Integrated (4)

Microsoft PyRIT

Microsoft AI Red Team
Corpus ingested

The toolkit Microsoft uses internally to red-team production LLM products. Roman Lutz leads.

Added an ATR dataset loader exposing the rule corpus as PyRIT attack sources; PR #1715 merged 2026-05-27.

HackAPrompt

Schulhoff et al. — Learn Prompting · EMNLP 2023
Corpus ingested

The largest crowd-sourced prompt-injection competition corpus, 4,780 competition samples across GPT/Claude/PaLM. EMNLP 2023 best-paper nominee.

Clustered 4,780 HackAPrompt samples by attack family. Shipped 5 ATR rules (ATR-2026-00452..00456) from dominant clusters. HackAPrompt recall: 28.6% before sprint → 66.0% after. 100% precision maintained. Each rule cites the HackAPrompt cluster in metadata_provenance.

NeMo-Guardrails + llm-guard + Promptfoo

NVIDIA · Protect AI · Promptfoo Inc.
Corpus ingested

Vendor test suites from three production guardrail systems. Combined 94 adversarial samples covering jailbreak, injection, and output sanitization.

Ingested combined 94-sample corpus. Shipped 6 ATR rules (ATR-2026-00500..00505) covering attack patterns present across all three vendor test suites. Each rule validated against the vendor's own test cases as true-positive ground truth.

OWASP LLM Top 10 + MITRE ATLAS PoCs

OWASP · MITRE
Corpus ingested

Standards-defined PoC attack patterns from OWASP LLM01-LLM10 and corresponding MITRE ATLAS technique catalog (AML.T00XX series).

Shipped 8 ATR rules (ATR-2026-00510..00517), each mapping to a named OWASP LLM category (LLM01-LLM10) and a specific MITRE ATLAS technique. Rules are standards-aligned at the metadata_provenance level.

Under review (9)
NVIDIA Garak
NVIDIA AI Red Team

The reference open-source LLM vulnerability scanner. 50+ probe families, jmartin-tech + leondz maintainers.

Wrapped 330 ATR rules as garak detectors. PR #1676 cleared two review rounds; the in-the-wild jailbreak set (650 prompts) posted 98.0% recall, while the full 23-probe garak suite (3,475 prompts) is 38.5%. Per-family: latentinjection 34.4%, sysprompt_extraction 67.9%, dan 90.2%.

HarmBench
Center for AI Safety · Dan Hendrycks

320-behavior standardized red-team benchmark — the citation when comparing attack methods across target models.

Issue #93 proposes pairing every behavior with a content-layer rule, adding detection-rate and bypass-rate columns next to ASR. Pre-PR gap-analysis offer attached.

AgentDojo
ETH Zurich SPY Lab · Florian Tramèr

The only agent-specific attack benchmark with a real tool-use harness. 78 attack tasks across 4 environments.

Issue #160 proposes a detection-evaluator extension that runs ATR rules against AgentDojo attack tasks. Honest about regex limits against paraphrase / multilingual; reports per-task expected detection rate.

JailbreakBench
Princeton · Patrick Chao (PAIR / TAP)

Standardized jailbreak leaderboard with fixed eval interface. JBB-Behaviors is the citation for 'attack X beats baseline'.

Issue #48 proposes a registered-detector backend interface (ATR as one possible reference implementation). Adds detection-rate + bypass-rate + agreement-with-refusal as new comparable axes.

TextAttack
QData · ACL 2020

3.1k stars. The reference NLP-adversarial framework, used in undergraduate security curricula.

Issue #824 proposes a textattack-detection-atr companion PyPI package — plugin-shaped, zero core changes. Single ask: README cross-link.

Microsoft Counterfit
Microsoft Azure Security

Metasploit-shape CLI for AI red-teaming. The clean abstraction for command-line attack runs.

Issue #96 proposes a counterfit-detection-atr companion package adding a detection lane to scan output. Plugin-first; respects the project's maintenance state.

PromptInject
agencyenterprise · NeurIPS 2022 Best Paper

The original academic benchmark for prompt injection. 8.2k stars; cited by every prompt-injection paper since 2022.

Shipped 4 ATR rules (ATR-2026-00506..00509) closing user_input injection gaps identified from PromptInject attack taxonomy. Issue #9 proposes a corpus-to-ATR pipeline pairing every PromptInject attack with a matched detection rule.

Promptfoo
Promptfoo Inc.

10k stars, used by red teams at Klarna, Discord, Anduril. Promptfoo runs adversarial tests; ATR catches what Promptfoo found.

PR #8529 adds an MCP red-team example using ATR as the deterministic defense layer. Promptfoo runs the probe; ATR rules return the verdict.

Damn Vulnerable MCP Server
harishsg993010

A CTF-style training target with 10 intentionally-vulnerable MCP scenarios. The DVWA of agent security.

PR #29 ships the blue-team detection guide — every CTF challenge gets a paired ATR rule so trainees learn detection alongside the attack.

Continuous corpus integration

5 corpora · 75 new rules · HackAPrompt recall 28.6% → 66.0%

2026-05-12: 11 parallel agents ingested five external corpora and generated 75 rules, all passing the 6-gate quality process with 0 FP regression on the benign corpus. Full version record at /changelog.

This pipeline now runs daily: a red-team mega-scan flywheel and a CVE-ingestion flywheel have each completed a full sweep and moved to daily updates. New findings auto-crystallize into rules, growing the standard from 462 to the current 655 (published as agent-threat-rules on npm). The count moves daily — which is what a turning flywheel looks like.

HackAPrompt
Samples: 4,780
Rules: 5
ATR-2026-00452..00456
Recall: 28.6% → 66.0%
Vendor test suites
Samples: 94
Rules: 6
ATR-2026-00500..00505
PromptInject
Samples: full corpus
Rules: 4
ATR-2026-00506..00509
OWASP LLM Top 10 + ATLAS PoCs
Samples: 8 standard categories
Rules: 8
ATR-2026-00510..00517
Garak in-the-wild jailbreak
Samples: 650
Rules: existing coverage
98.0% recall

All 75 rules passed the RFC-001 quality gate, benign corpus 0 FP, and cross-rule conflict check. 53 rules had regex generalized from literal fingerprints to multi-layer structural patterns. Full detail in v2.2.0 changelog.

2026-06 mega-scan

8 corpora · 29 detection rules · 2 with no novel misses

Ingested 8 public agent red-team corpora to surface attack variants existing rules missed and crystallize them into new rules. 6 corpora produced 29 detection rules; 2 (TensorTrust, PoisonedRAG) yielded no novel misses beyond existing coverage. All passed the 6-gate quality process with 0 FP on the benign corpus.

LLMail-Inject
ATR-2026-01860..01865
InjecAgent
ATR-2026-00550 / 00584 / 00700 / 00859 / 01000
AgentDojo
ATR-2026-00715 / 00720 / 01751..01754
ToolEmu
ATR-2026-00702 / 00718 / 00719 / 01301
MCPSecBench
ATR-2026-01300 / 01306 / 01307 / 01310 / 01615 / 01616
AgentPoison
ATR-2026-01774 / 01800
TensorTrust
Existing coverage — no novel misses
PoisonedRAG
Existing coverage — no novel misses

29 distinct rules after de-duplication; several emerged from misses shared across corpora. Maturity spans stable and experimental; all cleared the safety gate with 0 FP on the benign corpus. Both flywheels (red-team + CVE) now run daily.

What we don't import

Honest scope boundaries.

Part of what a standard is worth comes from what it refuses. ATR does not import every available corpus — below are the datasets we made an active decision to leave out, and why.

PyRIT Pliny L1B3RT4S
Microsoft AI Red Team / Pliny
Refused

Anthropic usage policy prevented our subagents from consuming this dataset. ATR does not import material it cannot verify provenance for.

AdvBench
Zou et al. — GCG paper
Reclassified

Reclassified as a test corpus (data/test-corpora/) rather than a rule source. AdvBench describes target behaviors, not wrapped attack payloads — it measures attack success rates and does not directly yield detectable patterns.

HarmBench
Center for AI Safety · Dan Hendrycks
Reclassified

Reclassified as a test corpus. HarmBench's 320 behaviors describe target outputs, not attack syntax. Used to measure ATR coverage, not to generate rules.

JailbreakBench
Princeton · Patrick Chao
Reclassified

Reclassified as a test corpus. JBB-Behaviors is a standardized leaderboard input describing request content rather than attack patterns. Used for benchmarking, not rule generation.

4 corpus-fingerprint rules (KEPT-AS-IS)
ATR-GARAK-a7fcb4e5 + 3 others
Kept, not generalized

These 4 rules have highly literal regexes — they are corpus fingerprints, not generalizable attack patterns. Generalizing them would produce an unacceptable FP rate. Kept as experimental with explicit corpus-fingerprint notation; not used for production blocking.

On deck

Red-team integrations queued. One a week. Public schedule.

ATR doesn't only chase the biggest names. Here's the queue for the next six weeks — real dates, real targets. Maintainers who see themselves on the schedule tend to review faster. That's the side-effect of a public commitment.

01
InjecAgent
UIUC Kang Lab

Cleanest direct-vs-indirect agent injection taxonomy. 1,054 attack cases. Complements AgentDojo coverage.

Filing
2026-05-19
queued
02
GPTFuzz
NDSS 2024 · Jiahao Yu

First credible LLM fuzzer. Pairing fuzz output with ATR detection closes the loop on what their tool discovers.

Filing
2026-05-22
queued
03
R-Judge
Tongxin Yuan et al.

162-scenario agent safety benchmark with LLM-as-judge evaluator. ATR adds the complementary content-rule defense lane next to the judge layer.

Filing
2026-05-26
queued
04
PromptBench
Microsoft Research

MS Research LLM robustness eval framework. Second cross-link inside Microsoft after AGT + PyRIT.

Filing
2026-06-02
queued
05
Giskard
YC W23

ML testing framework with red-team mode. Open-source core + commercial cloud — integrate at OSS core.

Filing
2026-06-06
queued

Schedule syncs from filed GitHub issues / PRs. Once merged, entries move to "Already Integrated" above. Maintainers wanting earlier engagement: [email protected].

Pipeline is not vapor

Auto-regex already clears 0 FP across 3,551 samples on the sample probe.

$ npx tsx scripts/auto-regex.ts \
    --file proposals/red-team-probes/dan-trust-phrase-wrapping.proposal.yaml \
    --write

[auto-regex] 3 TPs, 3 TNs — generating candidate regex…
[auto-regex] gate corpora: 431 benign + 1,352 extended + 157 research + 1,611 cross-rule TNs
[auto-regex] variant 0: 3 phrases, tp=100%, fp=0
  (benign=0 ext=0 res=0 cross=0) — PASS
[auto-regex] wrote regex to proposals/red-team-probes/...

::auto-regex-summary::
{ "passed": true, "variant": 0, "tp_coverage": 1, "total_fp": 0 }
Submit a probe

10 minutes of form-filling. Permanent attribution. One rule, shared by every defender who adopts ATR.

MIT licensed. No CLA. No telemetry. Forever free. You keep every right to publish the attack itself — ATR only carries the detection, as a rule anyone, anywhere can cite by its stable ID.

655 rules deployed across 10 threat categories. Every one has an author field and metadata_provenance.discovered_by intact.