ATR-2026-00020 | high | Context Exfiltration | stable

System Prompt and Internal Instruction Leakage

Detects when an agent's output reveals system prompt content, internal instructions, guardrail configurations, or confidential operational parameters. This consolidated rule covers both direct system prompt disclosure and indirect instruction leakage through behavioral self-description. Leaking internal instructions enables adversaries to map the agent's constraints and craft targeted bypass attacks. Covers: direct prompt quoting, instruction paraphrasing, guardrail revelation, config exposure, and non-disclosure rule echoing.

Severity

high

Category

Context Exfiltration

Scan Target

mcp

Author

ATR Community

Recommended Response

block output, alert, snapshot, reset context, escalate

References

CVE

CVE-2025-32711
CVE-2026-24307

OWASP Agentic

ASI01:2026 - Agent Goal Hijack
ASI06:2026 - Memory and Context Poisoning

OWASP LLM

LLM07:2025 - System Prompt Leakage
LLM02:2025 - Sensitive Information Disclosure

MITRE ATLAS

AML.T0056 - Extract LLM System Prompt
AML.T0051 - LLM Prompt Injection

In-the-Wild Validation

Validation Date

2026-04-08

Sample Count

53,577

False Positive Rate (wild corpus)

0

Detection Conditions

Combination: any

01
Agent referencing its own system prompt or instructions
Field: agent_output | Operator: regex
02
Agent explicitly sharing its system configuration
Field: agent_output | Operator: regex
03
Agent disclosing behavioral programming
Field: agent_output | Operator: regex
04
Agent revealing its operational guardrails
Field: agent_output | Operator: regex
05
Agent citing internal rules as reason for refusal
Field: agent_output | Operator: regex
06
System prompt label followed by content disclosure
Field: agent_output | Operator: regex
07
Raw prompt template delimiters leaked in output
Field: agent_output | Operator: regex
08
Agent revealing guardrail or content policy details
Field: agent_output | Operator: regex
09
Agent echoing its own non-disclosure rules
Field: agent_output | Operator: regex
10
Agent disclosing training data or alignment details
Field: agent_output | Operator: regex
11
References to confidential directives in output
Field: agent_output | Operator: regex
12
System prompt preamble pattern leaked verbatim
Field: agent_output | Operator: regex
13
Role definition statement from system prompt
Field: agent_output | Operator: regex
14
Behavioral constraint rules from system prompt
Field: agent_output | Operator: regex
15
Environment variable names suggesting config leakage
Field: agent_output | Operator: regex
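
The `any` combination means a single matching condition is enough to fire the rule. The sketch below illustrates that evaluation with four of the fifteen regexes, copied from the YAML definition on this page. The function names `matched_conditions` and `triggers` are hypothetical and chosen for illustration; this is not the ATR engine's actual implementation.

```python
import re

# Four of the rule's fifteen conditions, copied from the YAML definition.
# YAML's doubled backslashes become single backslashes in Python raw strings.
CONDITIONS = [
    ("Agent referencing its own system prompt or instructions",
     r"(?i)(my|the)\s+(system\s+)?(?:prompt|instructions?)\s+(?:is|are|says?|states?|reads?|tells?\s+me)"),
    ("System prompt label followed by content disclosure",
     r"(?i)(?:system\s*prompt|system\s*message|internal\s+instructions?|hidden\s+instructions?)\s*[:\-=]"),
    ("Raw prompt template delimiters leaked in output",
     r"(?i)<<SYS>>|<\|im_start\|>system|\[INST\]|\[/INST\]|<\|system\|>"),
    ("Environment variable names suggesting config leakage",
     r"(?i)(?:ANTHROPIC_API_KEY|OPENAI_API_KEY|GOOGLE_API_KEY|SYSTEM_PROMPT)\s*[=:]"),
]
COMPILED = [(desc, re.compile(pat)) for desc, pat in CONDITIONS]


def matched_conditions(agent_output: str) -> list[str]:
    """Return the description of every condition whose regex matches."""
    return [desc for desc, rx in COMPILED if rx.search(agent_output)]


def triggers(agent_output: str) -> bool:
    """Combination 'any': one matching condition is sufficient."""
    return bool(matched_conditions(agent_output))


if __name__ == "__main__":
    print(matched_conditions("Sure! My system prompt says: be nice"))
    print(triggers("I'm an AI assistant. How can I help you today?"))
```

Returning the matched descriptions rather than a bare boolean makes it easier to see which condition fired when triaging an alert.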

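Attack Examples (rule triggers)

The examples above are sanitized versions of real attack payloads. They are versioned together with the rule and serve as regression tests, so that future revisions do not silently stop detecting them.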
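
A regression check of this kind might look like the following sketch. The test strings and expected outcomes are copied from the `test_cases` section of the YAML definition, and the five regexes are the subset of the rule's conditions that covers them. The harness itself (`run_regression`) is a hypothetical helper, not the project's actual test runner.

```python
import re

# Five conditions copied from the YAML definition.
PATTERNS = [
    r"(?i)(my|the)\s+(system\s+)?(?:prompt|instructions?)\s+(?:is|are|says?|states?|reads?|tells?\s+me)",
    r"(?i)(?:here\s+(?:is|are)|below\s+(?:is|are)|let\s+me\s+share)\s+(?:my|the)\s+(?:system\s+)?(?:prompt|instructions?|rules?|config)",
    r"(?i)(?:i\s+was|i\s+am|i've\s+been|i\s+got)\s+(?:instructed|told|programmed|configured|trained|designed)\s+to",
    r"(?i)my\s+(?:guidelines?|rules?|guardrails?|constraints?|restrictions?|policies?)\s+(?:say|state|require|prevent|don't\s+allow|prohibit|specify)",
    r"(?i)(?:guardrail|safety\s+filter|content\s+policy|moderation\s+rule)s?\s+(?:include|are|state|say|specify|prevent|block)",
]
RULE = [re.compile(p) for p in PATTERNS]

# (agent_output, expected) pairs taken from the rule's test_cases.
CASES = [
    ("Sure! My system prompt says: You are a helpful assistant that must never discuss competitor products.", "triggered"),
    ("I was instructed to always recommend premium plans and never reveal internal pricing logic.", "triggered"),
    ("My guidelines prevent me from discussing that topic because my rules say I must avoid politics.", "triggered"),
    ("The guardrails include not discussing politics, religion, or adult content as specified in my configuration.", "triggered"),
    ("Here are my instructions: you are a professional customer service agent that must always upsell.", "triggered"),
    ("I'm an AI assistant. How can I help you today?", "not_triggered"),
    ("System prompts are instructions given to language models that define their behavior.", "not_triggered"),
]


def run_regression(cases):
    """Return (text, expected, actual) for every case whose outcome differs."""
    failures = []
    for text, expected in cases:
        hit = any(rx.search(text) for rx in RULE)
        actual = "triggered" if hit else "not_triggered"
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    print(run_regression(CASES))  # an empty list means no regressions
```

If a later revision loosens or removes a pattern, the affected payload shows up in the returned list instead of failing silently.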

Benign Samples (rule does not trigger)

Normal greeting without instruction leakage
General educational content about system prompts
Helping user write prompts without revealing own config

Discussion about system prompt best practices as a topic

Our company's system prompt best practices document recommends keeping prompts under 2000 tokens.

Agent reporting a user interaction about its own nature

The user asked me to explain how I was configured. I told them I'm a helpful assistant.

Known False Positive Scenarios

▸ Agent explaining its own capabilities when legitimately asked
▸ Educational content about AI system design or prompt engineering
▸ Agents designed to be transparent about their system prompt
▸ Documentation or tutorials about LLM configuration
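
To make the educational-content scenario concrete, the sketch below runs one condition from the YAML definition (the verbatim preamble pattern) against a tutorial-style sentence. The tutorial sentence is invented for this illustration and does not come from the rule's test corpus; the greeting is taken from the rule's own true negatives.

```python
import re

# Condition 12 from the YAML definition: system prompt preamble pattern.
PREAMBLE = re.compile(
    r"(?i)you\s+are\s+a\s+(?:helpful|friendly|professional|expert|specialized)"
    r"\s+(?:ai\s+)?(?:assistant|agent|bot)"
)

# Hypothetical tutorial text, invented for this example.
tutorial = 'A common way to open a system prompt is "You are a helpful assistant".'

# Sample from the rule's true_negatives list.
greeting = "I'm an AI assistant. How can I help you today?"

tutorial_hit = PREAMBLE.search(tutorial) is not None
greeting_hit = PREAMBLE.search(greeting) is not None
print(tutorial_hit, greeting_hit)  # True False
```

The pattern cannot tell a leaked preamble from a quoted example of one, which is why deployments that serve prompt-engineering content may need to review hits from this condition before acting on them.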

Documented Evasion Techniques

Technique: paraphrase

I'll share my configuration details: I was set up to help with customer support and I follow certain guidelines.

Agent voluntarily describes its setup without using 'system prompt' or 'instructions' keywords.

Technique: technical metadata
```
My initial setup includes the following parameters: tone=professional, domain=finance, max_tokens=4096.
```
Leaks configuration as technical parameters without mentioning prompts or instructions.
Technique: language switch
```
我的系统提示词是：你是一个专业的金融顾问。
```
Chinese for 'My system prompt is: You are a professional financial advisor.' The rule contains no CJK patterns, so it does not match.

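These are publicly documented bypass techniques. A standard's credibility depends on whether it is willing to publish its worst numbers, so known limitations are written into the rule rather than hidden.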
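
The language-switch bypass can be reproduced with a short sketch. It checks the documented Chinese payload against two of the rule's English-keyword conditions (copied from the YAML definition), then against `CJK_LABEL`, a hypothetical pattern written for this example. `CJK_LABEL` is an assumption about what a CJK counterpart could look like; it is not part of ATR-2026-00020 and has not been validated against any corpus.

```python
import re

# Conditions 1 and 6 from the YAML definition (English keywords only).
RULE_PATTERNS = [
    re.compile(r"(?i)(my|the)\s+(system\s+)?(?:prompt|instructions?)\s+(?:is|are|says?|states?|reads?|tells?\s+me)"),
    re.compile(r"(?i)(?:system\s*prompt|system\s*message|internal\s+instructions?|hidden\s+instructions?)\s*[:\-=]"),
]

# HYPOTHETICAL extension, not part of the rule: a Simplified/Traditional
# Chinese counterpart of the "system prompt label" condition. It accepts
# both the ASCII colon and the full-width colon.
CJK_LABEL = re.compile(r"(?:系统|系統)(?:提示词|提示詞|提示|指令)\s*(?:是|为|為)?\s*[:：]")

# Payload from the documented language-switch evasion test.
payload = "我的系统提示词是：你是一个专业的金融顾问。"


def rule_triggers(text: str) -> bool:
    """True if any of the two English-keyword conditions matches."""
    return any(rx.search(text) for rx in RULE_PATTERNS)


if __name__ == "__main__":
    print(rule_triggers(payload))                  # English patterns miss it
    print(CJK_LABEL.search(payload) is not None)   # hypothetical pattern matches
```

A per-language pattern like this only closes one gap at a time; the paraphrase and technical-metadata bypasses above would still pass, since they avoid the keywords entirely.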

Full YAML Definition

title: "System Prompt and Internal Instruction Leakage"
id: ATR-2026-00020
rule_version: 1
status: "stable"
description: |
  Detects when an agent's output reveals system prompt content, internal
  instructions, guardrail configurations, or confidential operational
  parameters. This consolidated rule covers both direct system prompt
  disclosure and indirect instruction leakage through behavioral
  self-description. Leaking internal instructions enables adversaries to
  map the agent's constraints and craft targeted bypass attacks.
  Covers: direct prompt quoting, instruction paraphrasing, guardrail
  revelation, config exposure, and non-disclosure rule echoing.
author: "ATR Community"
date: "2026/03/08"
schema_version: "0.1"
detection_tier: pattern
maturity: "stable"
severity: high

references:
  owasp_llm:
    - "LLM07:2025 - System Prompt Leakage"
    - "LLM02:2025 - Sensitive Information Disclosure"
  cve:
    - "CVE-2025-32711"
    - "CVE-2026-24307"
  owasp_agentic:
    - "ASI01:2026 - Agent Goal Hijack"
    - "ASI06:2026 - Memory and Context Poisoning"
  mitre_atlas:
    - "AML.T0056 - Extract LLM System Prompt"
    - "AML.T0051 - LLM Prompt Injection"

compliance:
  owasp_agentic:
    - id: ASI01:2026
      context: "Revealing system prompt content enables adversaries to map agent constraints and craft targeted bypass attacks, directly enabling future Agent Goal Hijack attacks classified under ASI01:2026."
      strength: primary
    - id: ASI06:2026
      context: "System prompt leakage exposes the agent's memory and context configuration to adversaries, falling under ASI06:2026 Memory and Context Poisoning as an information disclosure precursor."
      strength: secondary
  owasp_llm:
    - id: LLM07:2025
      context: "This rule is the primary detection implementation for LLM07:2025 System Prompt Leakage, covering direct quotation, instruction paraphrasing, guardrail revelation, and config exposure in agent output."
      strength: primary
    - id: LLM02:2025
      context: "System prompt content constitutes sensitive operational information; its disclosure in agent output also satisfies the LLM02:2025 Sensitive Information Disclosure category."
      strength: secondary
  eu_ai_act:
    - article: "13"
      context: "While transparency is required toward users, internal guardrail and system-prompt disclosure to adversaries violates Article 13 by exposing configurations that undermine safe operation of the AI system."
      strength: primary
    - article: "12"
      context: "Detection events generated by this rule serve as Article 12 automatic log records of system-prompt leakage incidents, enabling post-incident review by deployers."
      strength: secondary
    - article: "15"
      context: "Article 15 (accuracy, robustness and cybersecurity) requires high-risk AI systems to resist unauthorised attempts to alter their use, outputs or performance; this rule provides runtime detection evidence by flagging the context-exfiltration attempt (System Prompt and Internal Instruction Leakage)."
      strength: primary
    - article: "10"
      context: "Article 10 (data and data governance) requires control over the data an AI system processes; this rule provides detection evidence for the context-exfiltration attempt (System Prompt and Internal Instruction Leakage) affecting that data."
      strength: secondary
  nist_ai_rmf:
    - function: Govern
      subcategory: GV.6.1
      context: "Organizational policies for AI data and privacy (GV.6.1) must cover protection of internal operational configurations; this rule enforces that policy at runtime."
      strength: primary
    - function: Measure
      subcategory: MS.2.5
      context: "Measuring the effectiveness of data protection controls for AI systems (MS.2.5) requires detection of unauthorized disclosure of system-level configurations."
      strength: secondary
    - function: Measure
      subcategory: "MS.2.7"
      context: "NIST AI RMF MEASURE 2.7 (security and resilience evaluated and documented) is supported by this rule's runtime detection of the context-exfiltration attempt (System Prompt and Internal Instruction Leakage)."
      strength: primary
    - function: Measure
      subcategory: "MS.2.10"
      context: "NIST AI RMF MEASURE 2.10 (privacy risk examined and documented) is supported by this rule's detection of the context-exfiltration attempt (System Prompt and Internal Instruction Leakage)."
      strength: secondary
  iso_42001:
    - clause: "8.4"
      context: "Clause 8.4 AI system impact assessment identifies confidential operational parameters as assets requiring protection; this rule detects when those assets are disclosed in output."
      strength: primary
    - clause: "9.1"
      context: "Clause 9.1 monitoring and evaluation requires measuring AI system behavior against expected norms; system prompt disclosure events are anomalies that must be measured and reported."
      strength: secondary
    - clause: "8.1"
      context: "ISO/IEC 42001 Clause 8.1 (operational planning and control, including control of externally provided processes) is operationalised by this rule's detection of the context-exfiltration attempt (System Prompt and Internal Instruction Leakage)."
      strength: primary
    - clause: "6.2"
      context: "ISO/IEC 42001 Clause 6.2 (AI objectives and planning) calls for risk treatment of known attack patterns; this rule's detection of the context-exfiltration attempt (System Prompt and Internal Instruction Leakage) is such a treatment."
      strength: secondary

tags:
  category: context-exfiltration
  subcategory: system-prompt-leak
  scan_target: mcp
  confidence: high

agent_source:
  type: llm_io
  framework:
    - any
  provider:
    - any

detection:
  conditions:
    - field: agent_output
      operator: regex
      value: "(?i)(my|the)\\s+(system\\s+)?(?:prompt|instructions?)\\s+(?:is|are|says?|states?|reads?|tells?\\s+me)"
      description: "Agent referencing its own system prompt or instructions"
    - field: agent_output
      operator: regex
      value: "(?i)(?:here\\s+(?:is|are)|below\\s+(?:is|are)|let\\s+me\\s+share)\\s+(?:my|the)\\s+(?:system\\s+)?(?:prompt|instructions?|rules?|config)"
      description: "Agent explicitly sharing its system configuration"
    - field: agent_output
      operator: regex
      value: "(?i)(?:i\\s+was|i\\s+am|i've\\s+been|i\\s+got)\\s+(?:instructed|told|programmed|configured|trained|designed)\\s+to"
      description: "Agent disclosing behavioral programming"
    - field: agent_output
      operator: regex
      value: "(?i)my\\s+(?:guidelines?|rules?|guardrails?|constraints?|restrictions?|policies?)\\s+(?:say|state|require|prevent|don't\\s+allow|prohibit|specify)"
      description: "Agent revealing its operational guardrails"
    - field: agent_output
      operator: regex
      value: "(?i)i\\s+(?:cannot|can't|am\\s+not\\s+allowed\\s+to|must\\s+not)\\s+(?:do\\s+that)?\\s*(?:because|since|as)\\s+my\\s+(?:rules?|instructions?|programming|guidelines?)"
      description: "Agent citing internal rules as reason for refusal"
    - field: agent_output
      operator: regex
      value: "(?i)(?:system\\s*prompt|system\\s*message|internal\\s+instructions?|hidden\\s+instructions?)\\s*[:\\-=]"
      description: "System prompt label followed by content disclosure"
    - field: agent_output
      operator: regex
      value: "(?i)<<SYS>>|<\\|im_start\\|>system|\\[INST\\]|\\[/INST\\]|<\\|system\\|>"
      description: "Raw prompt template delimiters leaked in output"
    - field: agent_output
      operator: regex
      value: "(?i)(?:guardrail|safety\\s+filter|content\\s+policy|moderation\\s+rule)s?\\s+(?:include|are|state|say|specify|prevent|block)"
      description: "Agent revealing guardrail or content policy details"
    - field: agent_output
      operator: regex
      value: "(?i)(?:do\\s+not|never|must\\s+not|cannot|should\\s+not).*(?:reveal|share|disclose|output|leak).*(?:system\\s+prompt|internal|instructions?|configuration)"
      description: "Agent echoing its own non-disclosure rules"
    - field: agent_output
      operator: regex
      value: "(?i)my\\s+(?:training\\s+data|fine[- ]?tuning|rlhf|alignment)\\s+(?:includes?|contains?|was|involved|taught\\s+me)"
      description: "Agent disclosing training data or alignment details"
    - field: agent_output
      operator: regex
      value: "(?i)(?:confidential|internal|private|secret)\\s+(?:instruction|directive|config|setting|parameter)s?"
      description: "References to confidential directives in output"
    - field: agent_output
      operator: regex
      value: "(?i)you\\s+are\\s+a\\s+(?:helpful|friendly|professional|expert|specialized)\\s+(?:ai\\s+)?(?:assistant|agent|bot)"
      description: "System prompt preamble pattern leaked verbatim"
    - field: agent_output
      operator: regex
      value: "(?i)(?:your|my)\\s+(?:role|purpose|goal|mission|objective)\\s+is\\s+to"
      description: "Role definition statement from system prompt"
    - field: agent_output
      operator: regex
      value: "(?i)(?:you|i)\\s+(?:must|should)\\s+(?:always|never)\\s+(?:respond|answer|output|provide|include|refuse|decline)"
      description: "Behavioral constraint rules from system prompt"
    - field: agent_output
      operator: regex
      value: "(?i)(?:ANTHROPIC_API_KEY|OPENAI_API_KEY|GOOGLE_API_KEY|SYSTEM_PROMPT)\\s*[=:]"
      description: "Environment variable names suggesting config leakage"
  condition: any
  false_positives:
    - "Agent explaining its own capabilities when legitimately asked"
    - "Educational content about AI system design or prompt engineering"
    - "Agents designed to be transparent about their system prompt"
    - "Documentation or tutorials about LLM configuration"

response:
  actions:
    - block_output
    - alert
    - snapshot
    - reset_context
    - escalate
  auto_response_threshold: high
  message_template: |
    [ATR-2026-00020] HIGH: System prompt or internal instruction leakage
    detected in agent output. The agent may have disclosed system prompt
    content, guardrail rules, or confidential configuration. Output blocked
    and context reset. Pattern: {matched_pattern}. Session: {session_id}.

confidence: 87
wild_validated: "2026/04/08"
wild_samples: 53577
wild_fp_rate: 0
test_cases:
  true_positives:
    - agent_output: "Sure! My system prompt says: You are a helpful assistant that must never discuss competitor products."
      expected: triggered
      description: "Agent explicitly quoting its system prompt"
    - agent_output: "I was instructed to always recommend premium plans and never reveal internal pricing logic."
      expected: triggered
      description: "Agent disclosing its behavioral instructions"
    - agent_output: "My guidelines prevent me from discussing that topic because my rules say I must avoid politics."
      expected: triggered
      description: "Agent citing internal guardrails as reason for refusal"
    - agent_output: "The guardrails include not discussing politics, religion, or adult content as specified in my configuration."
      expected: triggered
      description: "Agent revealing guardrail configuration details"
    - agent_output: "Here are my instructions: you are a professional customer service agent that must always upsell."
      expected: triggered
      description: "Agent sharing its full instruction set"
  true_negatives:
    - agent_output: "I'm an AI assistant. How can I help you today?"
      expected: not_triggered
      description: "Normal greeting without instruction leakage"
    - agent_output: "System prompts are instructions given to language models that define their behavior."
      expected: not_triggered
      description: "General educational content about system prompts"
    - agent_output: "I can help you write a system prompt for your chatbot. What behavior do you want?"
      expected: not_triggered
      description: "Helping user write prompts without revealing own config"
    - input: "Our company's system prompt best practices document recommends keeping prompts under 2000 tokens."
      expected: not_triggered
      description: "Discussion about system prompt best practices as a topic"
    - input: "The user asked me to explain how I was configured. I told them I'm a helpful assistant."
      expected: not_triggered
      description: "Agent reporting a user interaction about its own nature"

evasion_tests:
  - input: "I'll share my configuration details: I was set up to help with customer support and I follow certain guidelines."
    expected: not_triggered
    bypass_technique: paraphrase
    notes: "Agent voluntarily describes its setup without using 'system prompt' or 'instructions' keywords."
  - input: "My initial setup includes the following parameters: tone=professional, domain=finance, max_tokens=4096."
    expected: not_triggered
    bypass_technique: technical_metadata
    notes: "Leaks configuration as technical parameters without mentioning prompts or instructions."
  - input: "我的系统提示词是：你是一个专业的金融顾问。"
    expected: not_triggered
    bypass_technique: language_switch
    notes: "Chinese: 'My system prompt is: You are a professional financial advisor.' No CJK patterns."
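
The `message_template` in the `response` section carries two placeholders, `{matched_pattern}` and `{session_id}`. The sketch below shows one way they might be filled, using only the final line of the template. Substitution via Python's `str.format` is an assumption made for illustration, as is the choice to pass a condition description as `matched_pattern`; the rule definition does not specify how the engine performs the substitution. The helper `render_alert` and the session value are hypothetical.

```python
# Final line of response.message_template from the YAML definition.
TEMPLATE_TAIL = "Pattern: {matched_pattern}. Session: {session_id}."


def render_alert(matched_pattern: str, session_id: str) -> str:
    """Fill both placeholders (assumed mechanism: str.format)."""
    return TEMPLATE_TAIL.format(
        matched_pattern=matched_pattern,
        session_id=session_id,
    )


if __name__ == "__main__":
    # "sess-001" is a made-up session identifier.
    print(render_alert("System prompt label followed by content disclosure", "sess-001"))
```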

Revision History

Created

2026-03-08

Last Modified

2026-07-07