ATR-2026-00003 · high · Prompt Injection · experimental

Jailbreak Attempt Detection

Detects jailbreak attempts designed to bypass AI safety mechanisms. Detection covers a broad taxonomy of techniques:

- named jailbreak methods (DAN, STAN, DUDE, AIM, etc.)
- mode-switching prompts (developer, maintenance, debug, unrestricted, god mode)
- roleplay-based constraint removal
- fictional/hypothetical framing of harmful requests
- authority claims (developer, admin, Anthropic/OpenAI impersonation)
- emotional manipulation and urgency-based coercion
- compliance demands and refusal suppression
- dual-response formatting
- encoding-wrapped jailbreaks
- anti-policy/filter bypass language

Patterns are anchored with word boundaries and context windows to minimize false positives on legitimate security discussions.
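The word-boundary and context-window anchoring described above can be sketched as follows. This is a minimal illustration, not the registry's actual rule set: the pattern list, the 80-character context window, and the function name `detect_jailbreak` are all assumptions chosen for the example.

```python
import re

# Illustrative patterns only -- the real rule covers a much broader taxonomy.
JAILBREAK_PATTERNS = [
    # Named jailbreak personas: \b word boundaries avoid matching substrings
    # (e.g. "DAN" inside "DANGER"), and a context window of up to 80 chars
    # requires a jailbreak-related term nearby to reduce false positives.
    re.compile(
        r"\b(?:DAN|STAN|DUDE|AIM)\b.{0,80}\b(?:mode|jailbreak|prompt)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    # Mode-switching prompts (developer/maintenance/debug/unrestricted/god mode).
    re.compile(
        r"\b(?:developer|maintenance|debug|unrestricted|god)\s+mode\b",
        re.IGNORECASE,
    ),
    # Refusal suppression / compliance demands.
    re.compile(
        r"\byou\s+(?:must\s+comply|cannot\s+refuse|are\s+not\s+allowed\s+to\s+refuse)\b",
        re.IGNORECASE,
    ),
]

def detect_jailbreak(text: str) -> list[str]:
    """Return the regex patterns that matched; an empty list means no detection."""
    return [p.pattern for p in JAILBREAK_PATTERNS if p.search(text)]
```

A phrase like "enable god mode" trips the mode-switching pattern, while a benign sentence discussing security concepts does not, which is the false-positive behavior the word boundaries and context windows are meant to achieve.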

Severity
high
Category
Prompt Injection
Scan target
mcp
Author
ATR Community

Recommended response

block input · alert · snapshot · escalate

References

OWASP Agentic
ASI01:2026 - Agent Goal Hijack
OWASP LLM
LLM01:2025 - Prompt Injection
MITRE ATLAS
AML.T0051 - LLM Prompt Injection
AML.T0054 - LLM Jailbreak