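You find the attack.
We ship the detection rule.
2 hours 16 minutes.
Once an attack is crystallized into a rule, that rule becomes vocabulary shared by every defender. Microsoft AGT pulls it automatically every week, Cisco AI Defense's skill-scanner bundles a copy, and MISP carries the discoverer's attribution when it exports to STIX. This is a closed-loop flywheel: attacks feed the standard, and the standard evolves at the speed of the threat.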
On 2026-05-07, Microsoft Security disclosed Semantic Kernel CVE-2026-26030. On 2026-05-11 at 06:07 UTC, a Microsoft Copilot SWE Agent opened a regression-test PR presuming ATR already covered it. At 08:24 UTC, v2.1.2 was published on npm with the paired rules. The loop from a disclosed attack to a shared detection closes in hours, not committee cycles.
Microsoft Agent Governance Toolkit (weekly auto-sync). Cisco AI Defense (rule pack in skill-scanner). CIRCL/MISP (taxonomies + galaxy merged by the project lead). OWASP A-S-R-H (lead merged with "Welcome to the team"). Gen Digital Sage (Norton / Avast parent security team merged). A rule written once becomes vocabulary five organizations share — which is the point of a standard.
Every rule carries author + metadata_provenance.discovered_by, and a stable ATR-YYYY-NNNNN ID that never changes after publication. Microsoft AGT, Cisco AI Defense, MISP, and OWASP all preserve the field on sync. When MISP exports to STIX, the attribution survives the format change. When NIST references the rule, the lineage back to the original red-teamer stays intact.
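As a rough illustration of the attribution contract described above, the sketch below checks that a rule header has a well-formed ATR-YYYY-NNNNN ID and non-empty attribution, and that a sync leaves those fields untouched. Only `author` and `metadata_provenance.discovered_by` are field names from the text; the `id` key, the `RuleHeader` type, and both function names are assumptions made for this example, not the project's actual schema or tooling.

```typescript
// Hypothetical header shape. `author` and `metadata_provenance.discovered_by`
// come from the text; `id` is an assumed key for the ATR-YYYY-NNNNN identifier.
interface RuleHeader {
  id: string;
  author: string;
  metadata_provenance: { discovered_by: string };
}

// ATR-YYYY-NNNNN: four-digit year, five-digit sequence number.
const ATR_ID = /^ATR-\d{4}-\d{5}$/;

// Returns a list of problems; an empty list means the header looks well-formed.
function checkRuleHeader(rule: RuleHeader): string[] {
  const problems: string[] = [];
  if (!ATR_ID.test(rule.id)) problems.push(`bad id: ${rule.id}`);
  if (rule.author.trim() === "") problems.push("missing author");
  if (rule.metadata_provenance.discovered_by.trim() === "") {
    problems.push("missing metadata_provenance.discovered_by");
  }
  return problems;
}

// A downstream sync could use a check like this to confirm that the ID and
// attribution are byte-for-byte unchanged after a format conversion.
function attributionPreserved(before: RuleHeader, after: RuleHeader): boolean {
  return (
    before.id === after.id &&
    before.author === after.author &&
    before.metadata_provenance.discovered_by ===
      after.metadata_provenance.discovered_by
  );
}
```

A consumer that converts rules to another format could run `attributionPreserved` on the round-tripped header as a cheap regression check.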
A 6-check quality gate: own-TP must match + 431 benign + 1,352 extended + 157 research-mention + 1,611 cross-rule conflict-free + own true_negative coverage. A rule that fires on the paper describing the attack does not ship. The standard would rather miss a clever variant than corrupt the corpus with a false positive — and it publishes its real per-lane figures rather than a single flattering one.
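The gate can be pictured as one pass over six sample sets. The sketch below is a minimal interpretation, not the project's implementation: it assumes "conflict-free" means the candidate must not match the cross-rule true-negative samples, and that "own true_negative coverage" means the rule's own negatives must not match. All type and function names are hypothetical.

```typescript
// Lane names follow the text; the data layout is an assumption.
interface GateInput {
  pattern: RegExp;
  ownTruePositives: string[];
  ownTrueNegatives: string[];
  benign: string[];          // 431 samples in the text
  extended: string[];        // 1,352 samples
  researchMention: string[]; // 157 samples
  crossRule: string[];       // 1,611 samples
}

interface GateResult {
  passed: boolean;
  tpCoverage: number;
  fp: { benign: number; extended: number; research: number; cross: number; ownTn: number };
}

// Counts how many samples the pattern matches.
const hits = (re: RegExp, samples: string[]): number =>
  samples.filter((s) => re.test(s)).length;

function runGate(input: GateInput): GateResult {
  // Strip the stateful g/y flags so repeated .test() calls are independent.
  const re = new RegExp(input.pattern.source, input.pattern.flags.replace(/[gy]/g, ""));
  const total = input.ownTruePositives.length;
  const tpCoverage = total === 0 ? 0 : hits(re, input.ownTruePositives) / total;
  const fp = {
    benign: hits(re, input.benign),
    extended: hits(re, input.extended),
    research: hits(re, input.researchMention),
    cross: hits(re, input.crossRule),
    ownTn: hits(re, input.ownTrueNegatives),
  };
  const totalFp = fp.benign + fp.extended + fp.research + fp.cross + fp.ownTn;
  // Pass only with every own true positive matched and zero matches elsewhere.
  return { passed: tpCoverage === 1 && totalFp === 0, tpCoverage, fp };
}
```

Reporting the per-lane counts rather than a single boolean mirrors the text's preference for publishing real per-lane figures.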
What your attack leaves behind once it enters the standard depends on who you are.
A citable artifact paired with the attack you described — one with a stable ID, real downstream consumers, and fire counts in production telemetry, not just a paper that sits behind a paywall.
Your name lives in the rule file's author + metadata_provenance.discovered_by. When MISP exports the taxonomy to STIX, the attribution propagates. When NIST references the rule (collaboration in progress with iMichaela at NIST OSCAL — a working branch, not an endorsement), the lineage back to you is intact.
The variant your team found stops being an internal finding and becomes shared defensive vocabulary. Your team appears as the discoverer in the rule corpus that Microsoft AGT, Cisco AI Defense, MISP, OWASP A-S-R-H, and Gen Digital Sage all pull from.
Microsoft's Copilot SWE Agent already opens PRs presuming ATR coverage (AGT #1981, closed 2026-05-11). Originating an ATR-2026-NNNNN rule is durable, verifiable provenance — it sits in a public corpus, not a slide.
Ship a detection without hand-writing regex. Submit positive and negative examples; the deterministic auto-regex generator tries four variants against the full gate, and roughly one in three passes on the first attempt.
The generator clears 0 FP across 3,551 samples (benign + extended + research-mention + cross-rule) before a candidate is ever shown. The PR labels itself gate-passed and goes straight to maintainer review — the same gate every rule in the standard had to clear.
The same attack does two jobs: the bounty payout, and a permanent rule that catches it everywhere afterward. Composes cleanly with Huntr.dev, HackerOne AI scope, and Protect AI bounty programmes.
ATR proposals are MIT-licensed and citable. No NDA conflict — the bounty programme owns the responsible-disclosure window; ATR ships the detection only after disclosure, with public attribution to the finder.
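A probe comes in, auto-regex generates the rule, and it merges only when the quality gate is fully green.
You fill in the form
3 attack samples, 3 benign lookalikes, the attack category, and the source paper or repo. 3-5 minutes.
No schema to learn, no YAML to write, no repo to fork.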
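For readers who want to see what the form collects in data terms, here is a minimal sketch. The field names and the `validateSubmission` helper are hypothetical (the text only lists what the form asks for), and treating "3" as a minimum rather than an exact count is an assumption.

```typescript
// Hypothetical field names; the text only lists what the form asks for.
interface ProbeSubmission {
  attackSamples: string[];    // 3 attack samples
  benignLookalikes: string[]; // 3 benign lookalikes
  category: string;           // attack category
  source: string;             // source paper or repo
}

// Returns a list of problems; an empty list means the submission is usable.
function validateSubmission(s: ProbeSubmission): string[] {
  const errors: string[] = [];
  const nonEmpty = (xs: string[]): number => xs.filter((x) => x.trim() !== "").length;
  // Assumption: 3 is treated as a minimum, not an exact count.
  if (nonEmpty(s.attackSamples) < 3) errors.push("need at least 3 attack samples");
  if (nonEmpty(s.benignLookalikes) < 3) errors.push("need at least 3 benign lookalikes");
  if (s.category.trim() === "") errors.push("missing attack category");
  if (s.source.trim() === "") errors.push("missing source paper or repo");
  // A lookalike identical to an attack sample cannot serve as a negative example.
  const attacks = new Set(s.attackSamples.map((x) => x.trim()));
  if (s.benignLookalikes.some((b) => attacks.has(b.trim()))) {
    errors.push("a benign lookalike duplicates an attack sample");
  }
  return errors;
}
```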
Auto-regex runs 4 variants
A deterministic n-gram set-cover algorithm extracts distinctive phrases from your positive examples, builds an alternation regex, and adds word-boundary, whitespace-anchor, or co-occurrence constraints. Each variant runs the full gate.
Gate = your own TPs must match 100% + 0 FP on the 1,783-sample benign + extended corpus + 0 FP on the 157-sample research-mention corpus + 0 cross-rule conflicts.
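To make the set-cover idea concrete, here is a simplified sketch of one variant (the word-boundary one). It is an illustration under stated assumptions, not the actual `scripts/auto-regex.ts`: it uses word bigrams, a plain greedy cover, ASCII alphanumeric tokens only, and treats an n-gram as "distinctive" if it never appears in the negative examples. The function names are hypothetical.

```typescript
// Lowercased ASCII alphanumeric tokens; other scripts are ignored in this sketch.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Distinct word n-grams of a text, each joined by a single space.
function ngrams(text: string, n: number): string[] {
  const t = tokenize(text);
  const out = new Set<string>();
  for (let i = 0; i + n <= t.length; i++) out.add(t.slice(i, i + n).join(" "));
  return Array.from(out);
}

// Greedy set cover: pick n-grams until every positive contains at least one.
function buildCandidate(positives: string[], negatives: string[], n = 2): RegExp | null {
  if (positives.length === 0) return null;
  const negGrams = new Set<string>();
  negatives.forEach((neg) => ngrams(neg, n).forEach((g) => negGrams.add(g)));

  // Map each distinctive n-gram to the indices of the positives containing it.
  const cover = new Map<string, Set<number>>();
  positives.forEach((p, i) => {
    for (const g of ngrams(p, n)) {
      if (negGrams.has(g)) continue; // appears in a benign lookalike, so not distinctive
      if (!cover.has(g)) cover.set(g, new Set<number>());
      cover.get(g)!.add(i);
    }
  });

  const uncovered = new Set<number>(positives.map((_, i) => i));
  const chosen: string[] = [];
  const grams = Array.from(cover.keys()).sort(); // sorted so ties break deterministically
  while (uncovered.size > 0) {
    let best = "";
    let bestGain = 0;
    for (const g of grams) {
      let gain = 0;
      cover.get(g)!.forEach((i) => { if (uncovered.has(i)) gain++; });
      if (gain > bestGain) { best = g; bestGain = gain; }
    }
    if (bestGain === 0) return null; // some positive has no distinctive n-gram
    chosen.push(best);
    cover.get(best)!.forEach((i) => uncovered.delete(i));
  }

  // Alternation of the chosen phrases, tolerant of punctuation between words.
  const alternation = chosen.map((g) => g.split(" ").join("\\W+")).join("|");
  return new RegExp(`\\b(?:${alternation})\\b`, "i");
}
```

Sorting the candidate n-grams before the greedy loop is what keeps this sketch deterministic: the same examples always produce the same regex. A candidate built this way would still have to clear the full gate before being proposed.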
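Gate passes → the full rule goes to a PR
The PR carries the gate-passed label. A maintainer checks whether the regex shape is too literal and needs generalizing; merge usually takes 1-3 days. If the gate fails, a stub is left and a maintainer hand-writes the regex (still using your test cases).
The rule propagates downstream automatically
Microsoft AGT syncs weekly, Cisco AI Defense follows release tags, the MISP taxonomy + galaxy pull on every release, and OWASP A-S-R-H references rule IDs in its fixtures. Your discovered_by travels down the whole chain. This is the lower half of the flywheel: one attack becomes one rule, and one rule becomes vocabulary shared by every defender.
The Microsoft Copilot regression PR took 2 hours 16 minutes from PR open to the v2.1.2 npm publish (06:07 → 08:24 UTC). That cadence is what lets the standard evolve at the speed of the threat.
Offensive tools and corpora are already feeding the standard. Yours can be next.
These are real red-team tools (offensive testing) and adversarial corpora, and each one is an entry point to the flywheel. Defensive frameworks are listed at /ecosystem.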
HackAPrompt
The largest crowd-sourced prompt-injection competition corpus, 4,780 competition samples across GPT/Claude/PaLM. EMNLP 2023 best-paper nominee.
Clustered 4,780 HackAPrompt samples by attack family. Shipped 5 ATR rules (ATR-2026-00452..00456) from dominant clusters. HackAPrompt recall: 28.6% before sprint → 66.0% after. 100% precision maintained. Each rule cites the HackAPrompt cluster in metadata_provenance.
NeMo-Guardrails + llm-guard + Promptfoo
Vendor test suites from three production guardrail systems. Combined 94 adversarial samples covering jailbreak, injection, and output sanitization.
Ingested combined 94-sample corpus. Shipped 6 ATR rules (ATR-2026-00500..00505) covering attack patterns present across all three vendor test suites. Each rule validated against the vendor's own test cases as true-positive ground truth.
OWASP LLM Top 10 + MITRE ATLAS PoCs
Standards-defined PoC attack patterns from OWASP LLM01-LLM10 and corresponding MITRE ATLAS technique catalog (AML.T00XX series).
Shipped 8 ATR rules (ATR-2026-00510..00517), each mapping to a named OWASP LLM category (LLM01-LLM10) and a specific MITRE ATLAS technique. Rules are standards-aligned at the metadata_provenance level.
The reference open-source LLM vulnerability scanner. 50+ probe families, jmartin-tech + leondz maintainers.
Wrapped 330 ATR rules as garak detectors. PR #1676 cleared two review rounds. Recall on the in-the-wild jailbreak set (650 prompts) is 98.0%, while recall on the full 23-probe garak suite (3,475 prompts) is 38.5%. Per-family recall: latentinjection 34.4%, sysprompt_extraction 67.9%, dan 90.2%.
320-behavior standardized red-team benchmark — the citation when comparing attack methods across target models.
Issue #93 proposes pairing every behavior with a content-layer rule, adding detection-rate and bypass-rate columns next to ASR. Pre-PR gap-analysis offer attached.
The only agent-specific attack benchmark with a real tool-use harness. 78 attack tasks across 4 environments.
Issue #160 proposes a detection-evaluator extension that runs ATR rules against AgentDojo attack tasks. Honest about regex limits against paraphrase / multilingual; reports per-task expected detection rate.
Standardized jailbreak leaderboard with fixed eval interface. JBB-Behaviors is the citation for 'attack X beats baseline'.
Issue #48 proposes a registered-detector backend interface (ATR as one possible reference implementation). Adds detection-rate + bypass-rate + agreement-with-refusal as new comparable axes.
3.1k stars. The reference NLP-adversarial framework, used in undergraduate security curricula.
Issue #824 proposes a textattack-detection-atr companion PyPI package — plugin-shaped, zero core changes. Single ask: README cross-link.
Metasploit-shape CLI for AI red-teaming. The clean abstraction for command-line attack runs.
Issue #96 proposes a counterfit-detection-atr companion package adding a detection lane to scan output. Plugin-first; respects the project's maintenance state.
The original academic benchmark for prompt injection. 8.2k stars; cited by every prompt-injection paper since 2022.
Shipped 4 ATR rules (ATR-2026-00506..00509) closing user_input injection gaps identified from PromptInject attack taxonomy. Issue #9 proposes a corpus-to-ATR pipeline pairing every PromptInject attack with a matched detection rule.
10k stars, used by red teams at Klarna, Discord, Anduril. Promptfoo runs adversarial tests; ATR catches what Promptfoo found.
PR #8529 adds an MCP red-team example using ATR as the deterministic defense layer. Promptfoo runs the probe; ATR rules return the verdict.
A CTF-style training target with 10 intentionally-vulnerable MCP scenarios. The DVWA of agent security.
PR #29 ships the blue-team detection guide — every CTF challenge gets a paired ATR rule so trainees learn detection alongside the attack.
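5 corpora · 75 new rules · HackAPrompt recall 28.6% → 66.0%
2026-05-12: 11 parallel agents digested five external corpora and generated 75 rules, all of which passed the 6-check quality gate with 0 FP regression on the benign corpus. The detailed version record is in /changelog.
This pipeline now runs daily: both flywheels, red-team mass scanning and CVE ingestion, have completed a full cycle and moved to daily updates. New findings are automatically crystallized into rules, taking the standard from 462 rules to the current 655 (published on npm as agent-threat-rules). The rule count moves every day, which is the evidence that the flywheel is turning.
All 75 rules passed the RFC-001 quality threshold, 0 FP on the benign corpus, and the cross-rule conflict check. For 53 of them, the regex has been generalized from a literal pattern into a multi-layer structural pattern. See v2.2.0 for the detailed changelog.
8 corpora · 29 detection rules · 2 with no new misses
We digested 8 public agent red-team corpora, found attack variants the existing rules missed, and crystallized them into new rules. 6 corpora yielded 29 detection rules; 2 (TensorTrust, PoisonedRAG) produced no new misses beyond existing coverage. All rules passed the 6-check quality gate with 0 FP on the benign corpus.
The 29 are distinct rules after deduplication; some stem from misses common to several corpora. Maturity spans stable and experimental, and all passed the safety threshold with 0 FP on the benign corpus. Both flywheels (red team + CVE) now run daily.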
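An honest statement of scope limits.
Part of a standard's credibility comes from what it refuses. ATR does not ingest every possible corpus. Below are the datasets we deliberately chose not to ingest, and why.
Anthropic's usage policy does not allow our subagents to consume this dataset. ATR does not import material whose provenance cannot be verified.
Reclassified as a test corpus (data/test-corpora/) rather than a rule source. AdvBench describes target behaviors, not packaged attack payloads; it measures attack success rate and does not directly yield detectable patterns.
Reclassified as a test corpus. HarmBench's 320 behaviors describe target outputs, not attack syntax. Used to evaluate ATR coverage, not to generate rules.
Reclassified as a test corpus. JBB-Behaviors is the standard input for the jailbreak leaderboard and describes request content rather than attack patterns. Used for benchmarking, not for rule generation.
The regex in these 4 rules is very literal: they are corpus fingerprints, not generalizable attack patterns. Generalizing them would bring an unacceptable FP rate. They are kept as experimental and explicitly labeled as corpus fingerprints, not for production blocking.
Red-team integrations already queued: one per week, on a public schedule.
ATR does not integrate only with the biggest names. Below are the issues / PRs going out over the next six weeks, with real dates and real targets. A maintainer who sees their project on the schedule will prioritize review; that is a side effect of committing in public.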
Cleanest direct-vs-indirect agent injection taxonomy. 1,054 attack cases. Complements AgentDojo coverage.
First credible LLM fuzzer. Pairing fuzz output with ATR detection closes the loop on what their tool discovers.
162-scenario agent safety benchmark with LLM-as-judge evaluator. ATR adds the complementary content-rule defense lane next to the judge layer.
MS Research LLM robustness eval framework. Second cross-link inside Microsoft after AGT + PyRIT.
ML testing framework with red-team mode. Open-source core + commercial cloud — integrate at OSS core.
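The schedule is synced from GitHub issues / PRs. After merge, an entry moves from here to "Already Integrated". Maintainers who want to jump the queue: [email protected].
Auto-regex already scores 0 FP on your template across 3,551 samples.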
$ npx tsx scripts/auto-regex.ts \
--file proposals/red-team-probes/dan-trust-phrase-wrapping.proposal.yaml \
--write
[auto-regex] 3 TPs, 3 TNs — generating candidate regex…
[auto-regex] gate corpora: 431 benign + 1,352 extended + 157 research + 1,611 cross-rule TNs
[auto-regex] variant 0: 3 phrases, tp=100%, fp=0
(benign=0 ext=0 res=0 cross=0) — PASS
[auto-regex] wrote regex to proposals/red-team-probes/...
::auto-regex-summary::
{ "passed": true, "variant": 0, "tp_coverage": 1, "total_fp": 0 }
10 minutes to fill in the form → permanent attribution → one rule, shared by every defender who adopts ATR.
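If you want to consume the run output shown above in your own CI, a sketch like the following could pull the summary out of the log. It assumes the JSON object sits on the single line immediately after the `::auto-regex-summary::` marker, as in the sample output; the `parseSummary` helper and the `AutoRegexSummary` type are hypothetical and not part of the ATR tooling.

```typescript
// Field names are copied from the sample summary; the type itself is an assumption.
interface AutoRegexSummary {
  passed: boolean;
  variant: number;
  tp_coverage: number;
  total_fp: number;
}

// Finds the marker line and parses the JSON object on the line that follows it.
// Returns null when the marker is missing or the payload is malformed.
function parseSummary(log: string): AutoRegexSummary | null {
  const lines = log.split(/\r?\n/);
  const i = lines.findIndex((l) => l.trim() === "::auto-regex-summary::");
  if (i < 0) return null;
  const next = lines[i + 1];
  if (next === undefined) return null;
  try {
    const o = JSON.parse(next) as Partial<AutoRegexSummary>;
    if (
      typeof o.passed !== "boolean" ||
      typeof o.variant !== "number" ||
      typeof o.tp_coverage !== "number" ||
      typeof o.total_fp !== "number"
    ) {
      return null;
    }
    return { passed: o.passed, variant: o.variant, tp_coverage: o.tp_coverage, total_fp: o.total_fp };
  } catch {
    return null; // not valid JSON, or not an object
  }
}
```

A CI step could then fail the job unless `passed` is true and `total_fp` is 0.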
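MIT-licensed, no CLA, no telemetry, free forever. You keep all rights to publish the attack itself; ATR only turns it into a detection rule the whole world can cite.
655 rules deployed across 10 threat categories. Every rule carries author + metadata_provenance.discovered_by.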