Data, benchmarks, and honest limitations.
ATR publishes evasion tests openly. We tell you what we can't catch.
427 detection rules, RFC-001 quality standard, 96K ecosystem scan, 751 malware discovered, Cisco adoption. The complete ATR thesis with six research contributions.
Discovered while scanning 96,096 skills across six registries. Three coordinated threat actors (hightower6eu 354, sakaen736jih 212, 52yuanchangxing 137). Reported to NousResearch and blacklisted.
The largest AI agent security scan to date. 96,096 skills across 6 registries, 1,302 flagged, 751 confirmed malware. Three coordinated threat actors. Credential access via tool descriptions accounts for 53% of detections.
Empirical analysis of the MCP attack surface. 30 CVEs in 60 days, 38% zero authentication, 7-class attack taxonomy, 53K ecosystem scan. 15x faster than Docker's first two years.
Benchmarks
Tested with our own corpus AND external benchmarks we've never seen before.
The gap between 99.7% precision and 63.2% recall is expected. Regex catches known patterns but misses paraphrases and multilingual attacks.
SKILL.md Detection Benchmark
Tested against 498 real-world OpenClaw SKILL.md files (32 malicious + 466 benign). Layer A = explicit malicious instructions. Layer C = obfuscated / hidden attacks.
Ecosystem Scan Data
Real scans of real MCP skill registries.
Research Methodology
All research is reproducible. Datasets, scan scripts, and evaluation scripts are open-source under MIT license.
Six registries totaling 96,096 skills. Largest subsets: OpenClaw 56,480, ClawHub 36,394, Skills.sh 3,115, plus three additional MCP / skill indexes. Each registry is crawled via public HTTP API or git repository.
All 427 rules execute deterministic regex / AST matching. No LLM inference. The same input produces the same detection result across environments — reproducibility is a prerequisite. Every rule is wild-validated against 36,394 ClawHub skills before publication.
Precision / recall uses the external PINT dataset (850 samples) rather than self-generated tests — this avoids overfitting to our own test cases. A separate SKILL.md benchmark uses 498 real-world OpenClaw files, with malicious samples hand-labeled as ground truth.
False positive rate is measured against real benign samples (skills vetted by manual or community review), divided by total detections. Every documented FP context is written into the rule YAML's false_positives field and surfaced on the rule page.
Scan checkpoints, test sets, and evaluation scripts are public under data/ and tests/. Any researcher can rerun the scan with the same ATR version and obtain identical results.
External Citations
Academic papers and technical reports that cite ATR.
No external citations recorded yet. If your paper, technical report, or product documentation cites ATR, please let us know via GitHub issue.
Cite as: Lin, Kuan-Hsin (2026). The Collapse of Trust. DOI: 10.5281/zenodo.19178002
Upcoming Research
The next research frontiers for ATR. Progress is reflected in GitHub releases and paper updates.
Extends detection from static regex to runtime behavioral signals. Initial rules: env variable access combined with network calls, tool-call rate anomalies, unexpected shell access.
Cases Tier 3 cannot decide escalate to LLM-as-judge semantic analysis. LLM findings crystallize into Tier 2 regex rules and flow back into ATR — the adaptive-to-innate immunity transition.
Compile ATR rules into Sigma (SIEM-side) and YARA (file-side) formats so agent threat detection plugs into existing security pipelines without rebuilding infrastructure.
ATR detects runtime attacks. Model backdoors planted during training are architecturally invisible at inference time. Bridging this gap requires new techniques combining supply-chain provenance (model cards, training data audits) with runtime behavioral fingerprinting.
Sources: ATR-FRAMEWORK-SPEC.md Phase 2-4 roadmap and the main paper's future work section.
What ATR Cannot Detect
We publish this section because honest limitations build more trust than false confidence.
Any regex rule can be bypassed by semantically equivalent rephrasing. "Ignore previous instructions" is detected; "please set aside the guidance you were given earlier" is not.
All patterns are English-only. Injection payloads in Spanish, Chinese, Arabic, or any other language bypass all rules completely.
"Delete all records" might be legitimate or malicious. Regex matches patterns without understanding authorization context.
ATR inspects content, not transport. Message replay, schema manipulation, MCP transport-level MITM are invisible.
Gradual trust escalation across 20 turns, where no single message is detectable, is not correlated. ATR evaluates events independently.
By definition, regex cannot detect attack patterns that don't exist yet. New techniques require new rules.