ATR-2026-00003highPrompt Injectionexperimental
Jailbreak Attempt Detection
Detects jailbreak attempts designed to bypass AI safety mechanisms. Detection covers a broad taxonomy of techniques: named jailbreak methods (DAN, STAN, DUDE, AIM, etc.), mode-switching prompts (developer, maintenance, debug, unrestricted, god mode), roleplay-based constraint removal, fictional/hypothetical framing of harmful requests, authority claims (developer, admin, Anthropic/OpenAI impersonation), emotional manipulation and urgency-based coercion, compliance demands and refusal suppression, dual-response formatting, encoding-wrapped jailbreaks, and anti-policy/filter bypass language. Patterns are anchored with word boundaries and context windows to minimize false positives on legitimate security discussions.
Severity
high
Category
Prompt Injection
Scan Target
mcp
Author
ATR Community
Response Actions
block inputalertsnapshotescalate
References
OWASP Agentic
ASI01:2026 - Agent Goal Hijack
OWASP LLM
LLM01:2025 - Prompt Injection
MITRE ATLAS
AML.T0051 - LLM Prompt InjectionAML.T0054 - LLM Jailbreak