Find the attack.
We ship the rule.
2 hours 16 minutes.
One attack, crystallized into one rule, becomes a word every defender shares. Microsoft AGT pulls it weekly, Cisco AI Defense ships it in skill-scanner, MISP exports it to STIX with the discoverer's name attached. A closed-loop flywheel: attacks feed the standard, and the standard evolves at the same speed as the threat.
On 2026-05-07, Microsoft Security disclosed Semantic Kernel CVE-2026-26030. On 2026-05-11 at 06:07 UTC, a Microsoft Copilot SWE Agent opened a regression-test PR that presumed ATR already covered it. At 08:24 UTC, v2.1.2 was published on npm with the paired rules. The loop from a disclosed attack to a shared detection closes in hours, not committee cycles.
Microsoft Agent Governance Toolkit (weekly auto-sync). Cisco AI Defense (rule pack in skill-scanner). CIRCL/MISP (taxonomies + galaxy merged by the project lead). OWASP A-S-R-H (lead merged with "Welcome to the team"). Gen Digital Sage (Norton / Avast parent security team merged). A rule written once becomes vocabulary five organizations share — which is the point of a standard.
Every rule carries author + metadata_provenance.discovered_by, and a stable ATR-YYYY-NNNNN ID that never changes after publication. Microsoft AGT, Cisco AI Defense, MISP, and OWASP all preserve the field on sync. When MISP exports to STIX, the attribution survives the format change. When NIST references the rule, the lineage back to the original red-teamer stays intact.
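The ID format and provenance fields described above lend themselves to a mechanical check. The following is a minimal sketch, assuming a rule header is available as a plain object. The `RuleHeader` type and `checkRuleHeader` function are hypothetical names for illustration, not part of ATR's published tooling. Only `author`, `metadata_provenance.discovered_by`, and the `ATR-YYYY-NNNNN` shape come from the text.

```typescript
// Hypothetical shape: only `author`, `metadata_provenance.discovered_by`,
// and the ATR-YYYY-NNNNN ID format are taken from the text above.
interface RuleHeader {
  id: string;
  author?: string;
  metadata_provenance?: { discovered_by?: string };
}

// Four-digit year, five-digit sequence number, e.g. ATR-2026-00452.
const ATR_ID = /^ATR-\d{4}-\d{5}$/;

// Returns a list of problems; an empty list means the header looks complete.
function checkRuleHeader(rule: RuleHeader): string[] {
  const problems: string[] = [];
  if (!ATR_ID.test(rule.id)) problems.push(`malformed id: ${rule.id}`);
  if (!rule.author?.trim()) problems.push("missing author");
  if (!rule.metadata_provenance?.discovered_by?.trim()) {
    problems.push("missing metadata_provenance.discovered_by");
  }
  return problems;
}
```

A downstream consumer could run a check like this on every sync to confirm that attribution has not been dropped along the way.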
A 6-check quality gate: the rule's own true positives must all match; it must stay silent on 431 benign, 1,352 extended, and 157 research-mention samples; it must be conflict-free against 1,611 cross-rule samples; and it must cover its own true negatives. A rule that fires on the paper describing the attack does not ship. The standard would rather miss a clever variant than corrupt the corpus with a false positive, and it publishes its real per-lane figures rather than a single flattering one.
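To make the gate concrete, here is a minimal sketch of how one candidate regex could be scored against the six lanes. It assumes each corpus is a plain array of strings and treats the cross-rule lane as "must not fire on other rules' samples", which is an interpretation rather than a confirmed detail. `GateInput`, `GateResult`, and `runGate` are hypothetical names, not ATR's actual implementation.

```typescript
// Hypothetical input shape; the corpus names mirror the six checks in the text.
interface GateInput {
  pattern: RegExp;
  ownTruePositives: string[];
  ownTrueNegatives: string[];
  benign: string[];
  extended: string[];
  researchMention: string[];
  crossRule: string[];
}

interface GateResult {
  passed: boolean;
  tpCoverage: number; // fraction of the rule's own true positives matched
  falsePositives: {
    ownTrueNegatives: number;
    benign: number;
    extended: number;
    researchMention: number;
    crossRule: number;
  };
  totalFp: number;
}

function runGate(input: GateInput): GateResult {
  // Copy the pattern without the stateful g/y flags so each test() is independent.
  const re = new RegExp(input.pattern.source, input.pattern.flags.replace(/[gy]/g, ""));
  const hits = (samples: string[]): number => samples.filter((s) => re.test(s)).length;

  const tpCoverage =
    input.ownTruePositives.length === 0
      ? 0
      : hits(input.ownTruePositives) / input.ownTruePositives.length;

  const falsePositives = {
    ownTrueNegatives: hits(input.ownTrueNegatives),
    benign: hits(input.benign),
    extended: hits(input.extended),
    researchMention: hits(input.researchMention),
    crossRule: hits(input.crossRule),
  };
  const totalFp =
    falsePositives.ownTrueNegatives +
    falsePositives.benign +
    falsePositives.extended +
    falsePositives.researchMention +
    falsePositives.crossRule;

  // Pass only with full true-positive coverage and zero hits in every other lane.
  return { passed: tpCoverage === 1 && totalFp === 0, tpCoverage, falsePositives, totalFp };
}
```

Keeping the per-lane counts in the result, rather than a single pass/fail bit, is what allows real per-lane figures to be published.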
What your attack leaves behind once it's in the standard — depending on who you are.
A citable artifact paired with the attack you described — one with a stable ID, real downstream consumers, and fire counts in production telemetry, not just a paper that sits behind a paywall.
Your name lives in the rule file's author + metadata_provenance.discovered_by. When MISP exports the taxonomy to STIX, the attribution propagates. When NIST references the rule (collaboration in progress with iMichaela at NIST OSCAL — a working branch, not an endorsement), the lineage back to you is intact.
The variant your team found stops being an internal finding and becomes shared defensive vocabulary. Your team appears as the discoverer in the rule corpus that Microsoft AGT, Cisco AI Defense, MISP, OWASP A-S-R-H, and Gen Digital Sage all pull from.
Microsoft's Copilot SWE Agent already opens PRs presuming ATR coverage (AGT #1981, closed 2026-05-11). Originating an ATR-2026-NNNNN rule is durable, verifiable provenance — it sits in a public corpus, not a slide.
Ship a detection without hand-writing regex. Submit positive and negative examples; the deterministic auto-regex generator tries four variants against the full gate, and roughly one in three passes on the first attempt.
The generator clears 0 FP across 3,551 samples (benign + extended + research-mention + cross-rule) before a candidate is ever shown. The PR labels itself gate-passed and goes straight to maintainer review — the same gate every rule in the standard had to clear.
The same attack does two jobs: the bounty payout, and a permanent rule that catches it everywhere afterward. Composes cleanly with Huntr.dev, HackerOne AI scope, and Protect AI bounty programmes.
ATR proposals are MIT-licensed and citable. No NDA conflict — the bounty programme owns the responsible-disclosure window; ATR ships the detection only after disclosure, with public attribution to the finder.
Probe in. Auto-regex generates. Quality gate validates. Merge if green.
You fill the form
3 attack samples, 3 benign lookalikes, attack category, source paper / repo. Takes 3-5 minutes.
No schema to learn. No YAML to write. No fork.
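For readers who want to see what the form collects, here is a minimal sketch of a submission as a typed object with a basic completeness check. The field names, the `ProbeSubmission` type, and `validateSubmission` are hypothetical. Only the four inputs (attack samples, benign lookalikes, category, source) come from the text, and the "at least 3" threshold is an assumption based on the counts quoted above.

```typescript
// Hypothetical field names; the four inputs themselves are from the form description.
interface ProbeSubmission {
  attackSamples: string[];
  benignLookalikes: string[];
  attackCategory: string;
  source: string; // paper or repository reference
}

// Returns human-readable errors; an empty list means the submission is complete.
function validateSubmission(s: ProbeSubmission): string[] {
  const errors: string[] = [];
  if (s.attackSamples.length < 3) errors.push("need at least 3 attack samples");
  if (s.benignLookalikes.length < 3) errors.push("need at least 3 benign lookalikes");
  if (!s.attackCategory.trim()) errors.push("attack category is required");
  if (!s.source.trim()) errors.push("source paper or repo is required");

  // A string cannot be both an attack and a benign lookalike.
  const attacks = new Set(s.attackSamples.map((x) => x.trim().toLowerCase()));
  for (const b of s.benignLookalikes) {
    if (attacks.has(b.trim().toLowerCase())) {
      errors.push(`sample listed as both attack and benign: ${b}`);
    }
  }
  return errors;
}
```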
Auto-regex tries 4 variants
A deterministic n-gram set-cover algorithm extracts distinctive phrases from your positives, builds an alternation regex, and tightens it with word boundaries, whitespace anchors, and co-occurrence constraints. Each variant runs through the full gate.
Gate: your TPs must match 100%, with 0 FP on the 1,783-sample benign + extended corpus, 0 FP on the 157-sample research-mention corpus, and 0 cross-rule conflicts.
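The phrase-extraction step can be illustrated with a greedy set cover over word n-grams. This is a sketch under stated assumptions, not the actual `auto-regex.ts` logic: it builds a single candidate rather than four variants, uses only word boundaries and flexible separators (no co-occurrence constraints), and the names `wordNgrams` and `buildCandidateRegex` are hypothetical.

```typescript
// Lowercased word n-grams; splitting on \W+ drops punctuation, so every
// word consists of \w characters only and needs no regex escaping.
function wordNgrams(text: string, n: number): string[] {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const grams: string[] = [];
  for (let i = 0; i + n <= words.length; i++) grams.push(words.slice(i, i + n).join(" "));
  return grams;
}

function buildCandidateRegex(positives: string[], negatives: string[], n = 3): RegExp | null {
  if (positives.length === 0) return null;

  // Any n-gram that also occurs in a negative is not distinctive.
  const negativeGrams = new Set<string>();
  negatives.forEach((t) => wordNgrams(t, n).forEach((g) => negativeGrams.add(g)));

  // phrase -> indices of the positives that contain it
  const covers = new Map<string, Set<number>>();
  positives.forEach((p, i) => {
    wordNgrams(p, n).forEach((g) => {
      if (negativeGrams.has(g)) return;
      const set = covers.get(g) ?? new Set<number>();
      set.add(i);
      covers.set(g, set);
    });
  });

  // Greedy set cover: repeatedly take the phrase covering the most uncovered
  // positives. Ties keep the first phrase seen, so the result is deterministic.
  const uncovered = new Set<number>(positives.map((_, i) => i));
  const chosen: string[] = [];
  while (uncovered.size > 0) {
    let best: string | null = null;
    let bestGain = 0;
    for (const [phrase, indices] of Array.from(covers)) {
      let gain = 0;
      indices.forEach((i) => {
        if (uncovered.has(i)) gain++;
      });
      if (gain > bestGain) {
        best = phrase;
        bestGain = gain;
      }
    }
    if (best === null) return null; // some positive has no distinctive phrase
    chosen.push(best);
    covers.get(best)!.forEach((i) => uncovered.delete(i));
  }

  // Alternation of phrases, words separated by any non-word run, word-bounded.
  const alternation = chosen.map((p) => p.split(" ").join("\\W+")).join("|");
  return new RegExp(`\\b(?:${alternation})\\b`, "i");
}
```

A `null` result corresponds to the case where no candidate can be built from the examples and a maintainer would need to hand-craft the pattern. Any candidate returned here would still have to clear the full gate before being shown.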
Gate clears → complete rule on a PR
The PR lands with the gate-passed label. A maintainer reviews the regex shape: is it too literal, can it generalize? Usually merged within 1-3 days. If the gate does not clear, the proposal stays as a stub and a maintainer hand-crafts the regex (still using your test cases as ground truth).
Rule auto-propagates downstream
Microsoft AGT syncs weekly. Cisco AI Defense pins to release tags. MISP taxonomy + galaxy pull on every release. OWASP A-S-R-H references rule IDs in fixtures. Your discovered_by field rides the whole chain — this is the back half of the flywheel: one attack becomes one rule, and one rule becomes vocabulary every defender shares.
Microsoft Copilot's regression PR went from opened to v2.1.2 npm publish in 2h 16m (06:07 → 08:24 UTC). That cadence is how a standard stays level with the threat instead of lagging it.
The offensive side is already feeding the standard. Your tool can be next.
These are red-team tools — offensive testing frameworks and adversarial corpora. Each is an entry point into the flywheel. Defensive frameworks live on /ecosystem.
HackAPrompt
The largest crowd-sourced prompt-injection competition corpus, 4,780 competition samples across GPT/Claude/PaLM. EMNLP 2023 best-paper nominee.
Clustered 4,780 HackAPrompt samples by attack family. Shipped 5 ATR rules (ATR-2026-00452..00456) from dominant clusters. HackAPrompt recall: 28.6% before sprint → 66.0% after. 100% precision maintained. Each rule cites the HackAPrompt cluster in metadata_provenance.
NeMo-Guardrails + llm-guard + Promptfoo
Vendor test suites from three production guardrail systems. Combined 94 adversarial samples covering jailbreak, injection, and output sanitization.
Ingested combined 94-sample corpus. Shipped 6 ATR rules (ATR-2026-00500..00505) covering attack patterns present across all three vendor test suites. Each rule validated against the vendor's own test cases as true-positive ground truth.
OWASP LLM Top 10 + MITRE ATLAS PoCs
Standards-defined PoC attack patterns from OWASP LLM01-LLM10 and corresponding MITRE ATLAS technique catalog (AML.T00XX series).
Shipped 8 ATR rules (ATR-2026-00510..00517), each mapping to a named OWASP LLM category (LLM01-LLM10) and a specific MITRE ATLAS technique. Rules are standards-aligned at the metadata_provenance level.
garak
The reference open-source LLM vulnerability scanner. 50+ probe families, jmartin-tech + leondz maintainers.
Wrapped 330 ATR rules as garak detectors. PR #1676 cleared two review rounds; the in-the-wild jailbreak set (650 prompts) posted 98.0% recall, while the full 23-probe garak suite (3,475 prompts) is 38.5%. Per-family: latentinjection 34.4%, sysprompt_extraction 67.9%, dan 90.2%.
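The per-family and suite-wide figures above are recall values over different denominators, which is why one corpus can post 98.0% while the full suite sits at 38.5%. A minimal sketch of the computation follows. `ProbeResult`, `recallByFamily`, and `overallRecall` are hypothetical names, and no real garak output format is assumed.

```typescript
// One record per prompt: which probe family it came from and whether any rule fired.
interface ProbeResult {
  family: string;
  detected: boolean;
}

// Recall per family = detected prompts / total prompts in that family.
function recallByFamily(results: ProbeResult[]): Record<string, number> {
  const hit: Record<string, number> = {};
  const all: Record<string, number> = {};
  for (const r of results) {
    all[r.family] = (all[r.family] ?? 0) + 1;
    hit[r.family] = (hit[r.family] ?? 0) + (r.detected ? 1 : 0);
  }
  const recall: Record<string, number> = {};
  for (const family of Object.keys(all)) {
    recall[family] = (hit[family] ?? 0) / (all[family] ?? 1);
  }
  return recall;
}

// Suite-wide recall weights every prompt equally, so large low-recall
// families pull the overall figure down.
function overallRecall(results: ProbeResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.detected).length / results.length;
}
```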
HarmBench
320-behavior standardized red-team benchmark, the citation when comparing attack methods across target models.
Issue #93 proposes pairing every behavior with a content-layer rule, adding detection-rate and bypass-rate columns next to ASR. Pre-PR gap-analysis offer attached.
AgentDojo
The only agent-specific attack benchmark with a real tool-use harness. 78 attack tasks across 4 environments.
Issue #160 proposes a detection-evaluator extension that runs ATR rules against AgentDojo attack tasks. Honest about regex limits against paraphrase / multilingual; reports per-task expected detection rate.
JailbreakBench
Standardized jailbreak leaderboard with fixed eval interface. JBB-Behaviors is the citation for 'attack X beats baseline'.
Issue #48 proposes a registered-detector backend interface (ATR as one possible reference implementation). Adds detection-rate + bypass-rate + agreement-with-refusal as new comparable axes.
TextAttack
3.1k stars. The reference NLP-adversarial framework, used in undergraduate security curricula.
Issue #824 proposes a textattack-detection-atr companion PyPI package — plugin-shaped, zero core changes. Single ask: README cross-link.
Counterfit
Metasploit-shape CLI for AI red-teaming. The clean abstraction for command-line attack runs.
Issue #96 proposes a counterfit-detection-atr companion package adding a detection lane to scan output. Plugin-first; respects the project's maintenance state.
PromptInject
The original academic benchmark for prompt injection. 8.2k stars; cited by every prompt-injection paper since 2022.
Shipped 4 ATR rules (ATR-2026-00506..00509) closing user_input injection gaps identified from PromptInject attack taxonomy. Issue #9 proposes a corpus-to-ATR pipeline pairing every PromptInject attack with a matched detection rule.
Promptfoo
10k stars, used by red teams at Klarna, Discord, Anduril. Promptfoo runs adversarial tests; ATR catches what Promptfoo found.
PR #8529 adds an MCP red-team example using ATR as the deterministic defense layer. Promptfoo runs the probe; ATR rules return the verdict.
A CTF-style training target with 10 intentionally-vulnerable MCP scenarios. The DVWA of agent security.
PR #29 ships the blue-team detection guide — every CTF challenge gets a paired ATR rule so trainees learn detection alongside the attack.
5 corpora · 75 new rules · HackAPrompt recall 28.6% → 66.0%
2026-05-12: 11 parallel agents ingested five external corpora and generated 75 rules, all passing the 6-gate quality process with 0 FP regression on the benign corpus. Full version record at /changelog.
This pipeline now runs daily: a red-team mega-scan flywheel and a CVE-ingestion flywheel have each completed a full sweep and moved to daily updates. New findings auto-crystallize into rules, growing the standard from 462 to the current 655 (published as agent-threat-rules on npm). The count moves daily — which is what a turning flywheel looks like.
All 75 rules passed the RFC-001 quality gate, benign corpus 0 FP, and cross-rule conflict check. 53 rules had regex generalized from literal fingerprints to multi-layer structural patterns. Full detail in v2.2.0 changelog.
8 corpora · 29 detection rules · 2 with no novel misses
Ingested 8 public agent red-team corpora to surface attack variants existing rules missed and crystallize them into new rules. 6 corpora produced 29 detection rules; 2 (TensorTrust, PoisonedRAG) yielded no novel misses beyond existing coverage. All passed the 6-gate quality process with 0 FP on the benign corpus.
29 distinct rules after de-duplication; several emerged from misses shared across corpora. Maturity spans stable and experimental; all cleared the safety gate with 0 FP on the benign corpus. Both flywheels (red-team + CVE) now run daily.
Honest scope boundaries.
Part of what a standard is worth comes from what it refuses. ATR does not import every available corpus — below are the datasets we made an active decision to leave out, and why.
Anthropic usage policy prevented our subagents from consuming this dataset. ATR does not import material whose provenance it cannot verify.
Reclassified as a test corpus (data/test-corpora/) rather than a rule source. AdvBench describes target behaviors, not wrapped attack payloads — it measures attack success rates and does not directly yield detectable patterns.
Reclassified as a test corpus. HarmBench's 320 behaviors describe target outputs, not attack syntax. Used to measure ATR coverage, not to generate rules.
Reclassified as a test corpus. JBB-Behaviors is a standardized leaderboard input describing request content rather than attack patterns. Used for benchmarking, not rule generation.
These 4 rules have highly literal regexes — they are corpus fingerprints, not generalizable attack patterns. Generalizing them would produce an unacceptable FP rate. Kept as experimental with explicit corpus-fingerprint notation; not used for production blocking.
Red-team integrations queued. One a week. Public schedule.
ATR doesn't only chase the biggest names. Here's the queue for the next six weeks — real dates, real targets. Maintainers who see themselves on the schedule tend to review faster. That's the side-effect of a public commitment.
Cleanest direct-vs-indirect agent injection taxonomy. 1,054 attack cases. Complements AgentDojo coverage.
First credible LLM fuzzer. Pairing fuzz output with ATR detection closes the loop on what their tool discovers.
162-scenario agent safety benchmark with LLM-as-judge evaluator. ATR adds the complementary content-rule defense lane next to the judge layer.
MS Research LLM robustness eval framework. Second cross-link inside Microsoft after AGT + PyRIT.
ML testing framework with red-team mode. Open-source core + commercial cloud — integrate at OSS core.
Schedule syncs from filed GitHub issues / PRs. Once merged, entries move to "Already Integrated" above. Maintainers wanting earlier engagement: [email protected].
Auto-regex already clears 0 FP across 3,551 samples on the sample probe.
$ npx tsx scripts/auto-regex.ts \
--file proposals/red-team-probes/dan-trust-phrase-wrapping.proposal.yaml \
--write
[auto-regex] 3 TPs, 3 TNs — generating candidate regex…
[auto-regex] gate corpora: 431 benign + 1,352 extended + 157 research + 1,611 cross-rule TNs
[auto-regex] variant 0: 3 phrases, tp=100%, fp=0
(benign=0 ext=0 res=0 cross=0) — PASS
[auto-regex] wrote regex to proposals/red-team-probes/...
::auto-regex-summary::
{ "passed": true, "variant": 0, "tp_coverage": 1, "total_fp": 0 }
10 minutes of form-filling. Permanent attribution. One rule, shared by every defender who adopts ATR.
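The `::auto-regex-summary::` marker in the transcript above is followed by a JSON object, which makes the run easy to consume from CI. Here is a minimal sketch of a parser, assuming the JSON sits on the line directly after the marker as shown. `parseAutoRegexSummary` is a hypothetical helper, and the four field names are taken from the transcript.

```typescript
// Field names as they appear in the transcript's summary object.
interface AutoRegexSummary {
  passed: boolean;
  variant: number;
  tp_coverage: number;
  total_fp: number;
}

const SUMMARY_MARKER = "::auto-regex-summary::";

// Returns the parsed summary, or null if the marker or a valid JSON line is missing.
function parseAutoRegexSummary(stdout: string): AutoRegexSummary | null {
  const lines = stdout.split(/\r?\n/);
  const at = lines.findIndex((l) => l.trim() === SUMMARY_MARKER);
  if (at === -1) return null;
  const next = lines[at + 1];
  if (next === undefined) return null;
  try {
    const parsed = JSON.parse(next);
    if (parsed === null || typeof parsed !== "object") return null;
    if (typeof parsed.passed !== "boolean" || typeof parsed.total_fp !== "number") return null;
    return parsed as AutoRegexSummary;
  } catch {
    return null; // the line after the marker was not valid JSON
  }
}
```

A CI step could then label the PR gate-passed only when `passed` is true and `total_fp` is 0.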
MIT licensed. No CLA. No telemetry. Forever free. You keep every right to publish the attack itself — ATR only carries the detection, as a rule anyone, anywhere can cite by its stable ID.
655 rules deployed across 10 threat categories. Every one has an author field and metadata_provenance.discovered_by intact.