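For red teams

You find the attack.
We ship the detection rule.
2 hours 16 minutes.

Once an attack is crystallized into a rule, that rule becomes vocabulary shared by every defender. Microsoft AGT pulls it automatically every week, Cisco AI Defense's skill-scanner bundles a copy, and MISP carries the discoverer's attribution when it exports to STIX. It is a closed-loop flywheel: attacks feed the standard, and the standard evolves at the speed of the threat.

Why an attack belongs in the standard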
2h 16m
Regression-test PR to published rule

Microsoft Security disclosed Semantic Kernel CVE-2026-26030 on 2026-05-07. On 2026-05-11 at 06:07 UTC, a Microsoft Copilot SWE Agent opened a regression-test PR that presumed ATR already covered it. By 08:24 UTC, v2.1.2 was published on npm with the paired rules. The loop from that PR to a shared detection closed in hours, not committee cycles.

5
Independent organizations carry the same rules

Microsoft Agent Governance Toolkit (weekly auto-sync). Cisco AI Defense (rule pack in skill-scanner). CIRCL/MISP (taxonomies + galaxy merged by the project lead). OWASP A-S-R-H (lead merged with "Welcome to the team"). Gen Digital Sage (Norton / Avast parent security team merged). A rule written once becomes vocabulary five organizations share — which is the point of a standard.

655
Rules, each carrying the discoverer's name

Every rule carries author + metadata_provenance.discovered_by, and a stable ATR-YYYY-NNNNN ID that never changes after publication. Microsoft AGT, Cisco AI Defense, MISP, and OWASP all preserve the field on sync. When MISP exports to STIX, the attribution survives the format change. When NIST references the rule, the lineage back to the original red-teamer stays intact.
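The attribution contract above can be sketched as a type plus two checks. This is a minimal illustration, not the project's actual schema: only `author`, `metadata_provenance.discovered_by`, and the `ATR-YYYY-NNNNN` ID shape come from the text, while `RuleAttribution`, `isValidAtrId`, and `attributionSurvives` are hypothetical names.

```typescript
// Hypothetical shape: only `author`, `metadata_provenance.discovered_by`,
// and the ATR-YYYY-NNNNN ID format are taken from the page.
interface RuleAttribution {
  id: string;
  author: string;
  metadata_provenance: { discovered_by: string };
}

// Four-digit year, five-digit sequence number. Other ID shapes that appear
// on the page (for example corpus-fingerprint IDs) are deliberately not covered.
const ATR_ID = /^ATR-(\d{4})-(\d{5})$/;

function isValidAtrId(id: string): boolean {
  return ATR_ID.test(id);
}

// A downstream sync or format conversion "preserves attribution" if the ID,
// the author, and the discoverer are all unchanged afterwards.
function attributionSurvives(before: RuleAttribution, after: RuleAttribution): boolean {
  return (
    before.id === after.id &&
    before.author === after.author &&
    before.metadata_provenance.discovered_by === after.metadata_provenance.discovered_by
  );
}
```

A consumer could run `attributionSurvives` on a rule before and after its own import step to confirm that nothing in the chain dropped the discoverer.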

0 FP
Required across the benign gate before a rule ships

A 6-check quality gate: own-TP must match + 431 benign + 1,352 extended + 157 research-mention + 1,611 cross-rule conflict-free + own true_negative coverage. A rule that fires on the paper describing the attack does not ship. The standard would rather miss a clever variant than corrupt the corpus with a false positive — and it publishes its real per-lane figures rather than a single flattering one.
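One way to read the six checks is as a single pass/fail function over per-lane counts. The sketch below assumes that "own true_negative coverage" means the rule must not fire on its own negative examples; the lane sizes are the figures quoted above, and `GateResult`, `totalGateSamples`, and `passesGate` are hypothetical names.

```typescript
// Lane sizes as quoted on the page.
const LANE_SIZES = { benign: 431, extended: 1352, research: 157, crossRule: 1611 };

// Hypothetical result record for one candidate rule.
interface GateResult {
  ownTruePositivesMatched: number;
  ownTruePositivesTotal: number;
  ownTrueNegativesFired: number; // assumption: must be 0
  falsePositives: { benign: number; extended: number; research: number; crossRule: number };
}

function totalGateSamples(): number {
  return LANE_SIZES.benign + LANE_SIZES.extended + LANE_SIZES.research + LANE_SIZES.crossRule;
}

function passesGate(r: GateResult): boolean {
  // Check 1: every one of the rule's own attack samples must match.
  const tpOk = r.ownTruePositivesTotal > 0 && r.ownTruePositivesMatched === r.ownTruePositivesTotal;
  // Checks 2-5: zero hits in each of the four corpus lanes.
  const fp = r.falsePositives;
  const lanesOk = fp.benign === 0 && fp.extended === 0 && fp.research === 0 && fp.crossRule === 0;
  // Check 6: the rule stays silent on its own negative examples.
  const tnOk = r.ownTrueNegativesFired === 0;
  return tpOk && lanesOk && tnOk;
}
```

`totalGateSamples()` returns 3551, the four lanes added together. A single hit in any lane, including the research-mention lane, is enough to fail the candidate.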

What you get

What your attack leaves behind once it enters the standard depends on who you are.

For
Academic researchers

A citable artifact paired with the attack you described — one with a stable ID, real downstream consumers, and fire counts in production telemetry, not just a paper that sits behind a paywall.

Your name lives in the rule file's author + metadata_provenance.discovered_by. When MISP exports the taxonomy to STIX, the attribution propagates. When NIST references the rule (collaboration in progress with iMichaela at NIST OSCAL — a working branch, not an endorsement), the lineage back to you is intact.

For
Corporate red teams

The variant your team found stops being an internal finding and becomes shared defensive vocabulary. Your team appears as the discoverer in the rule corpus that Microsoft AGT, Cisco AI Defense, MISP, OWASP A-S-R-H, and Gen Digital Sage all pull from.

Microsoft's Copilot SWE Agent already opens PRs presuming ATR coverage (AGT #1981, closed 2026-05-11). Originating an ATR-2026-NNNNN rule is durable, verifiable provenance — it sits in a public corpus, not a slide.

For
Independent researchers

Ship a detection without hand-writing regex. Submit positive and negative examples; the deterministic auto-regex generator tries four variants against the full gate, and roughly one in three passes on the first attempt.

The generator clears 0 FP across 3,551 samples (benign + extended + research-mention + cross-rule) before a candidate is ever shown. The PR labels itself gate-passed and goes straight to maintainer review — the same gate every rule in the standard had to clear.

For
Bug bounty hunters

The same attack does two jobs: the bounty payout, and a permanent rule that catches it everywhere afterward. Composes cleanly with Huntr.dev, HackerOne AI scope, and Protect AI bounty programmes.

ATR proposals are MIT-licensed and citable. No NDA conflict — the bounty programme owns the responsible-disclosure window; ATR ships the detection only after disclosure, with public attribution to the finder.

What happens after you submit a probe

A probe comes in, auto-regex generates candidates, and nothing merges until the quality gate is fully green.

01

You fill in the form

3 attack samples, 3 benign lookalikes, the attack category, and the source paper or repo. 3-5 minutes.

No schema to learn, no YAML to write, no repo to fork.

02

Auto-regex runs 4 variants

A deterministic n-gram set-cover algorithm extracts distinctive phrases from your positive examples, builds an alternation regex, and adds word-boundary, whitespace-anchor, or co-occurrence constraints. Each variant runs the full gate.
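The idea behind this step can be sketched with a greedy set cover over word n-grams. This is a simplified illustration under stated assumptions, not the code in `scripts/auto-regex.ts`: it uses fixed-length word n-grams, drops any n-gram that also occurs in a negative example, and emits only the word-boundary variant. `ngrams`, `coverPhrases`, and `buildRegex` are hypothetical names.

```typescript
// Lowercased word n-grams of a text.
function ngrams(text: string, n: number): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + n <= words.length; i++) out.push(words.slice(i, i + n).join(" "));
  return out;
}

function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Greedy set cover: repeatedly pick the phrase that covers the most
// still-uncovered positives. Candidates are visited in sorted order and only
// a strictly larger gain replaces the current best, so ties break the same
// way on every run.
function coverPhrases(positives: string[], negatives: string[], n = 3): string[] {
  const negNorm = negatives.map((t) => t.toLowerCase().split(/\s+/).filter(Boolean).join(" "));
  const covers = new Map<string, Set<number>>();
  positives.forEach((p, i) => {
    ngrams(p, n).forEach((g) => {
      if (negNorm.some((t) => t.includes(g))) return; // not distinctive
      if (!covers.has(g)) covers.set(g, new Set<number>());
      covers.get(g)!.add(i);
    });
  });
  const uncovered = new Set<number>(positives.map((_, i) => i));
  const chosen: string[] = [];
  const candidates = Array.from(covers.keys()).sort();
  while (uncovered.size > 0) {
    let best: string | null = null;
    let bestGain = 0;
    candidates.forEach((g) => {
      let gain = 0;
      covers.get(g)!.forEach((i) => { if (uncovered.has(i)) gain++; });
      if (gain > bestGain) { best = g; bestGain = gain; }
    });
    if (best === null) break; // some positives cannot be covered
    chosen.push(best);
    covers.get(best)!.forEach((i) => uncovered.delete(i));
  }
  return chosen;
}

// Alternation regex with word boundaries and flexible whitespace.
function buildRegex(phrases: string[]): RegExp {
  if (phrases.length === 0) throw new Error("no distinctive phrases found");
  const alt = phrases.map((p) => escapeRegex(p).replace(/ /g, "\\s+")).join("|");
  return new RegExp(`\\b(?:${alt})\\b`, "i");
}
```

Given three positives that share the phrase "ignore all previous" and negatives that do not contain it, `coverPhrases` returns that single phrase, and `buildRegex` turns it into a case-insensitive pattern that tolerates extra whitespace between words. A candidate built this way would still have to clear the full gate before anyone sees it.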

Gate = your own TPs must match 100% + 0 FP on the 1,783-sample benign + extended corpus + 0 FP on the 157-sample research-mention corpus + 0 cross-rule conflicts.

03

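Gate passed → full rule goes out as a PR

The PR carries the gate-passed label. A maintainer checks whether the regex shape is too literal and needs generalizing; merge usually takes 1-3 days. If the gate fails, a stub stays in place and a maintainer hand-writes the regex, still using your test cases.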

04

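Rules propagate downstream automatically

Microsoft AGT syncs weekly, Cisco AI Defense follows release tags, the MISP taxonomy + galaxy pull on every release, and OWASP A-S-R-H references rule IDs in its fixtures. Your discovered_by travels the whole chain. This is the lower half of the flywheel: an attack becomes a rule, and a rule becomes vocabulary shared by every defender.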

Microsoft Copilot's regression PR took 2 hours 16 minutes from PR open to the v2.1.2 npm publish (06:07 → 08:24 UTC). That cadence is what lets the standard evolve at the speed of the threat.

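Red-team integrations already running

Offensive tools and corpora are already feeding the standard. Yours can be next.

These are real red-team tools (offensive testing) and adversarial corpora. Each one is an entry point to the flywheel. Defensive frameworks are listed at /ecosystem.

Integrated (4)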

Microsoft PyRIT

Microsoft AI Red Team
Dataset ingested

The toolkit Microsoft uses internally to red-team production LLM products. Roman Lutz leads.

Added an ATR dataset loader exposing the rule corpus as PyRIT attack sources; PR #1715 merged 2026-05-27.

HackAPrompt

Schulhoff et al. — Learn Prompting · EMNLP 2023
Dataset ingested

The largest crowd-sourced prompt-injection competition corpus, 4,780 competition samples across GPT/Claude/PaLM. EMNLP 2023 best-paper nominee.

Clustered 4,780 HackAPrompt samples by attack family. Shipped 5 ATR rules (ATR-2026-00452..00456) from dominant clusters. HackAPrompt recall: 28.6% before sprint → 66.0% after. 100% precision maintained. Each rule cites the HackAPrompt cluster in metadata_provenance.
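For readers checking figures like these against their own runs, recall and precision can be computed from per-sample verdicts as below. The sketch is generic: the `Verdict` shape and the toy data are hypothetical and are not the HackAPrompt evaluation itself.

```typescript
// One evaluated sample: was it really an attack, and did any rule fire?
interface Verdict {
  isAttack: boolean;
  fired: boolean;
}

// Recall: share of real attacks that at least one rule caught.
function recall(verdicts: Verdict[]): number {
  const attacks = verdicts.filter((v) => v.isAttack);
  if (attacks.length === 0) return 0; // convention for an empty attack set
  return attacks.filter((v) => v.fired).length / attacks.length;
}

// Precision: share of firings that were real attacks.
function precision(verdicts: Verdict[]): number {
  const fired = verdicts.filter((v) => v.fired);
  if (fired.length === 0) return 1; // nothing fired, so nothing fired wrongly
  return fired.filter((v) => v.isAttack).length / fired.length;
}

// Toy data: three attacks (two caught), two benign samples (none flagged).
const toyVerdicts: Verdict[] = [
  { isAttack: true, fired: true },
  { isAttack: true, fired: true },
  { isAttack: true, fired: false },
  { isAttack: false, fired: false },
  { isAttack: false, fired: false },
];
```

On the toy data, recall is 2/3 and precision is 1. The same two functions applied before and after a rule sprint give the kind of before/after pair reported above.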

NeMo-Guardrails + llm-guard + Promptfoo

NVIDIA · Protect AI · Promptfoo Inc.
Dataset ingested

Vendor test suites from three production guardrail systems. Combined 94 adversarial samples covering jailbreak, injection, and output sanitization.

Ingested combined 94-sample corpus. Shipped 6 ATR rules (ATR-2026-00500..00505) covering attack patterns present across all three vendor test suites. Each rule validated against the vendor's own test cases as true-positive ground truth.

OWASP LLM Top 10 + MITRE ATLAS PoCs

OWASP · MITRE
Dataset ingested

Standards-defined PoC attack patterns from OWASP LLM01-LLM10 and corresponding MITRE ATLAS technique catalog (AML.T00XX series).

Shipped 8 ATR rules (ATR-2026-00510..00517), each mapping to a named OWASP LLM category (LLM01-LLM10) and a specific MITRE ATLAS technique. Rules are standards-aligned at the metadata_provenance level.

In review (9)
NVIDIA Garak
NVIDIA AI Red Team

The reference open-source LLM vulnerability scanner. 50+ probe families, jmartin-tech + leondz maintainers.

Wrapped 330 ATR rules as garak detectors. PR #1676 cleared two review rounds; the in-the-wild jailbreak set (650 prompts) posted 98.0% recall, while the full 23-probe garak suite (3,475 prompts) is 38.5%. Per-family: latentinjection 34.4%, sysprompt_extraction 67.9%, dan 90.2%.

HarmBench
Center for AI Safety · Dan Hendrycks

320-behavior standardized red-team benchmark — the citation when comparing attack methods across target models.

Issue #93 proposes pairing every behavior with a content-layer rule, adding detection-rate and bypass-rate columns next to ASR. Pre-PR gap-analysis offer attached.

AgentDojo
ETH Zurich SPY Lab · Florian Tramèr

The only agent-specific attack benchmark with a real tool-use harness. 78 attack tasks across 4 environments.

Issue #160 proposes a detection-evaluator extension that runs ATR rules against AgentDojo attack tasks. Honest about regex limits against paraphrase / multilingual; reports per-task expected detection rate.

JailbreakBench
Princeton · Patrick Chao (PAIR / TAP)

Standardized jailbreak leaderboard with fixed eval interface. JBB-Behaviors is the citation for 'attack X beats baseline'.

Issue #48 proposes a registered-detector backend interface (ATR as one possible reference implementation). Adds detection-rate + bypass-rate + agreement-with-refusal as new comparable axes.

TextAttack
QData · ACL 2020

3.1k stars. The reference NLP-adversarial framework, used in undergraduate security curricula.

Issue #824 proposes a textattack-detection-atr companion PyPI package — plugin-shaped, zero core changes. Single ask: README cross-link.

Microsoft Counterfit
Microsoft Azure Security

Metasploit-shape CLI for AI red-teaming. The clean abstraction for command-line attack runs.

Issue #96 proposes a counterfit-detection-atr companion package adding a detection lane to scan output. Plugin-first; respects the project's maintenance state.

PromptInject
agencyenterprise · NeurIPS 2022 Best Paper

The original academic benchmark for prompt injection. 8.2k stars; cited by every prompt-injection paper since 2022.

Shipped 4 ATR rules (ATR-2026-00506..00509) closing user_input injection gaps identified from PromptInject attack taxonomy. Issue #9 proposes a corpus-to-ATR pipeline pairing every PromptInject attack with a matched detection rule.

Promptfoo
Promptfoo Inc.

10k stars, used by red teams at Klarna, Discord, Anduril. Promptfoo runs adversarial tests; ATR catches what Promptfoo found.

PR #8529 adds an MCP red-team example using ATR as the deterministic defense layer. Promptfoo runs the probe; ATR rules return the verdict.

Damn Vulnerable MCP Server
harishsg993010

A CTF-style training target with 10 intentionally-vulnerable MCP scenarios. The DVWA of agent security.

PR #29 ships the blue-team detection guide — every CTF challenge gets a paired ATR rule so trainees learn detection alongside the attack.

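Continuous corpus integration

5 corpora · 75 new rules · HackAPrompt recall 28.6% → 66.0%

2026-05-12: 11 parallel agents digested five external corpora and generated 75 rules. All passed the 6-check quality gate with 0 FP regression on the benign corpus. The detailed release record is at /changelog.

This process now runs daily: both flywheels, the red-team mega-scan and CVE ingestion, have completed a full cycle and moved to daily updates. New findings crystallize into rules automatically, taking the standard from 462 rules to the current 655 (published on npm as agent-threat-rules). The rule count moves every day, which is the evidence that the flywheel is turning.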

HackAPrompt
Samples: 4,780
New rules: 5
ATR-2026-00452..00456
Recall: 28.6% → 66.0%

Vendor test suites
Samples: 94
New rules: 6
ATR-2026-00500..00505

PromptInject
Samples: full corpus
New rules: 4
ATR-2026-00506..00509

OWASP LLM Top 10 + ATLAS PoCs
Samples: 8 standard categories
New rules: 8
ATR-2026-00510..00517

Garak in-the-wild jailbreak
Samples: 650
New rules: already covered
98.0% recall

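All 75 rules passed the RFC-001 quality threshold, 0 FP on the benign corpus, and the cross-rule conflict check. The regexes of 53 rules have been generalized from literal patterns into multi-layer structural patterns. See the v2.2.0 changelog for details.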

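2026-06 mega-scan

8 corpora · 29 detection rules · 2 with no new misses

Digested 8 public agent red-team corpora, found the attack variants that existing rules missed, and crystallized them into new rules. 6 corpora produced 29 detection rules; 2 (TensorTrust, PoisonedRAG) showed no new misses beyond existing coverage. All passed the 6-check quality gate with 0 FP on the benign corpus.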

LLMail-Inject
ATR-2026-01860..01865
InjecAgent
ATR-2026-00550 / 00584 / 00700 / 00859 / 01000
AgentDojo
ATR-2026-00715 / 00720 / 01751..01754
ToolEmu
ATR-2026-00702 / 00718 / 00719 / 01301
MCPSecBench
ATR-2026-01300 / 01306 / 01307 / 01310 / 01615 / 01616
AgentPoison
ATR-2026-01774 / 01800
TensorTrust
Already covered: no new misses
PoisonedRAG
Already covered: no new misses

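The 29 are distinct rules after deduplication; some stem from misses shared across corpora. Maturity spans stable and experimental, and all passed the safety threshold with 0 FP on the benign corpus. Both flywheels (red team + CVE) now run daily.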
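The per-corpus ID lists above use a compact notation (`..` for ranges, `/` for separate IDs). A small helper, sketched here with the hypothetical name `expandIds`, expands that notation so the distinct-rule count can be checked mechanically. The six input strings are copied from the listing; the notation rules are inferred from how the page writes them.

```typescript
// Expand "ATR-2026-00715 / 00720 / 01751..01754" into individual rule IDs.
function expandIds(spec: string): string[] {
  const m = /^ATR-(\d{4})-(.+)$/.exec(spec.trim());
  if (!m) throw new Error(`unrecognized ID spec: ${spec}`);
  const year = m[1];
  const ids: string[] = [];
  for (const part of m[2].split("/")) {
    const [startRaw, endRaw] = part.trim().split("..");
    const start = parseInt(startRaw, 10);
    const end = endRaw === undefined ? start : parseInt(endRaw, 10);
    for (let n = start; n <= end; n++) {
      ids.push(`ATR-${year}-${String(n).padStart(5, "0")}`);
    }
  }
  return ids;
}

// The six corpora that produced rules, as listed on the page.
const scanSpecs = [
  "ATR-2026-01860..01865",                                    // LLMail-Inject
  "ATR-2026-00550 / 00584 / 00700 / 00859 / 01000",           // InjecAgent
  "ATR-2026-00715 / 00720 / 01751..01754",                    // AgentDojo
  "ATR-2026-00702 / 00718 / 00719 / 01301",                   // ToolEmu
  "ATR-2026-01300 / 01306 / 01307 / 01310 / 01615 / 01616",   // MCPSecBench
  "ATR-2026-01774 / 01800",                                   // AgentPoison
];

const distinctScanIds = new Set<string>();
scanSpecs.forEach((spec) => expandIds(spec).forEach((id) => distinctScanIds.add(id)));
```

`distinctScanIds.size` comes out to 29, matching the headline figure for the scan.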

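What we don't ingest

An honest statement of scope.

Part of a standard's credibility comes from what it refuses. ATR does not ingest every possible corpus. Below are the datasets we actively decided not to ingest, and why.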

PyRIT Pliny L1B3RT4S
Microsoft AI Red Team / Pliny
Declined

Anthropic's usage policy does not allow our subagents to consume this dataset. ATR does not ingest material whose provenance it cannot verify.

AdvBench
Zou et al. — GCG paper
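Reclassified

Reclassified as a test corpus (data/test-corpora/) rather than a rule source. AdvBench describes target behaviors, not packaged attack payloads: it measures attack success rate and does not directly yield detectable patterns.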

HarmBench
Center for AI Safety · Dan Hendrycks
Reclassified

Reclassified as a test corpus. HarmBench's 320 behaviors describe target outputs, not attack syntax. Used to evaluate ATR coverage, not to generate rules.

JailbreakBench
Princeton · Patrick Chao
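Reclassified

Reclassified as a test corpus. JBB-Behaviors is the standard input for the jailbreak leaderboard and describes request content rather than attack patterns. Used for benchmarking, not rule generation.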

4 corpus-fingerprint rules (KEPT-AS-IS)
ATR-GARAK-a7fcb4e5 + 3 others
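Kept, not generalized

The regexes in these 4 rules are highly literal: they are corpus fingerprints, not generalizable attack patterns. Generalizing them would bring an unacceptable FP rate. They stay experimental, are explicitly labeled as corpus fingerprints, and are not used for production blocking.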

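Next wave

Red-team integrations already scheduled: one per week, on a public schedule.

ATR does not integrate only with the biggest names. Below are the issues / PRs going out over the next six weeks, with real dates and real targets. Maintainers who see themselves on the schedule tend to prioritize review, a side effect of committing in public.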

01
InjecAgent
UIUC Kang Lab

Cleanest direct-vs-indirect agent injection taxonomy. 1,054 attack cases. Complements AgentDojo coverage.

Submission date
2026-05-19
Queued
02
GPTFuzz
NDSS 2024 · Jiahao Yu

First credible LLM fuzzer. Pairing fuzz output with ATR detection closes the loop on what their tool discovers.

Submission date
2026-05-22
Queued
03
R-Judge
Tongxin Yuan et al.

162-scenario agent safety benchmark with LLM-as-judge evaluator. ATR adds the complementary content-rule defense lane next to the judge layer.

Submission date
2026-05-26
Queued
04
PromptBench
Microsoft Research

MS Research LLM robustness eval framework. Second cross-link inside Microsoft after AGT + PyRIT.

Submission date
2026-06-02
Queued
05
Giskard
YC W23

ML testing framework with red-team mode. Open-source core + commercial cloud — integrate at OSS core.

Submission date
2026-06-06
Queued

The schedule syncs from GitHub issues / PRs. After merge, an entry moves from here to "Already Integrated". Maintainers who want to jump the queue: [email protected]

The pipeline is not vapor

Auto-regex already clears 0 FP across 3,551 samples on your template.

$ npx tsx scripts/auto-regex.ts \
    --file proposals/red-team-probes/dan-trust-phrase-wrapping.proposal.yaml \
    --write

[auto-regex] 3 TPs, 3 TNs — generating candidate regex…
[auto-regex] gate corpora: 431 benign + 1,352 extended + 157 research + 1,611 cross-rule TNs
[auto-regex] variant 0: 3 phrases, tp=100%, fp=0
  (benign=0 ext=0 res=0 cross=0) — PASS
[auto-regex] wrote regex to proposals/red-team-probes/...

::auto-regex-summary::
{ "passed": true, "variant": 0, "tp_coverage": 1, "total_fp": 0 }
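A CI step that consumes this output could read the summary line as follows. This is a sketch that assumes the summary is a single JSON line directly after the `::auto-regex-summary::` marker, as in the transcript above; `AutoRegexSummary`, `parseSummary`, and `shouldLabelGatePassed` are hypothetical names, and the four fields are the ones visible in the sample output.

```typescript
// Fields as they appear in the sample output.
interface AutoRegexSummary {
  passed: boolean;
  variant: number;
  tp_coverage: number;
  total_fp: number;
}

// Find the marker line and parse the JSON line that follows it.
// Returns null if the marker is missing or the payload is malformed.
function parseSummary(log: string): AutoRegexSummary | null {
  const lines = log.split(/\r?\n/);
  const marker = lines.findIndex((line) => line.trim() === "::auto-regex-summary::");
  if (marker === -1 || marker + 1 >= lines.length) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(lines[marker + 1]);
  } catch (_err) {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const s = parsed as Record<string, unknown>;
  if (
    typeof s.passed !== "boolean" ||
    typeof s.variant !== "number" ||
    typeof s.tp_coverage !== "number" ||
    typeof s.total_fp !== "number"
  ) {
    return null;
  }
  return { passed: s.passed, variant: s.variant, tp_coverage: s.tp_coverage, total_fp: s.total_fp };
}

// Assumed labeling rule: full TP coverage and zero false positives.
function shouldLabelGatePassed(s: AutoRegexSummary): boolean {
  return s.passed && s.tp_coverage === 1 && s.total_fp === 0;
}
```

Treating a missing or malformed summary as `null` rather than as a pass keeps a broken run from being labeled gate-passed by accident.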
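Submit a probe

10 minutes on a form → permanent attribution → one rule shared by every defender that adopts ATR.

MIT-licensed, no CLA, no telemetry, free forever. You keep every right to publish the attack itself. ATR only turns it into a detection rule the whole world can cite.

655 rules deployed across 10 threat categories. Each carries author + metadata_provenance.discovered_by.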