High-Risk Tool Invocation Without Human Confirmation
Detects when an AI agent invokes high-risk tools (financial, destructive, communication, or permission-altering) without evidence of human confirmation in the current interaction turn. This is a defense-in-depth rule that complements specific attack detection: even if no injection is detected, certain tool categories should ALWAYS require human-in-the-loop.

High-risk tool categories:
1. Financial: payments, transfers, purchases, subscriptions
2. Destructive: file deletion, database drops, account deletion
3. Communication: sending messages, emails, posts on behalf of user
4. Permission: granting access, changing roles, modifying auth
5. System: process execution, system commands, network operations

This rule enforces the principle of least privilege for agent tool access. Any tool matching these categories should trigger an "ask" verdict regardless of confidence score.
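The gating principle described above can be sketched in a few lines of Python. This is a minimal illustration, not the ATR engine: the `Verdict` enum, the `gate` function, and the `human_confirmed` flag are hypothetical names introduced here. The point it shows is that the confidence score is accepted but deliberately ignored once a tool falls into a high-risk category.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "allow"
    ASK = "ask"


# The five categories listed in the rule description.
HIGH_RISK_CATEGORIES = {
    "financial", "destructive", "communication", "permission", "system",
}


def gate(category: Optional[str], human_confirmed: bool, confidence: float) -> Verdict:
    """Hypothetical gate: return ASK for any unconfirmed high-risk call."""
    # `confidence` is intentionally unused: the rule asks for confirmation
    # regardless of how confident any upstream attack detector is.
    if category in HIGH_RISK_CATEGORIES and not human_confirmed:
        return Verdict.ASK
    return Verdict.ALLOW
```

Under these assumptions, `gate("financial", False, 0.01)` still yields `Verdict.ASK`, while a tool with no high-risk category passes through.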
Response Actions
References
Wild Validation
Detection Conditions
Combinator: any
- 01 Financial tool invocation (excludes read-only get_/list_/check_ prefixed tools). field: tool_name, op: regex
- 02 Destructive tool invocation. field: tool_name, op: regex
- 03 Communication tool sending messages on behalf of user. field: tool_name, op: regex
- 04 Permission or authentication modification tool. field: tool_name, op: regex
- 05 System command execution tool. field: tool_name, op: regex
- 06 High-value operation (amount >= 100) without confirmation flag. field: tool_args, op: regex
- 07 Bulk or force operation targeting all/everyone. field: tool_args, op: regex
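To make the `any` combinator concrete, the sketch below evaluates the seven conditions against a tool call. The regular expressions are copied from the YAML definition on this page; the `CONDITIONS` list and the `matched_conditions` and `triggered` helpers are hypothetical names, and the loop is an assumption about how such a rule could be evaluated, not the actual ATR engine.

```python
import re
from typing import List

# (field, regex) pairs copied from the rule's YAML definition, in order 01-07.
CONDITIONS = [
    ("tool_name", r'(?i)(?<!get_)(?<!list_)(?<!check_)(?<!view_)(?<!fetch_)(?<!query_)(pay|payment|transfer|transaction|purchase|buy|checkout|billing|invoice|charge|refund|withdraw|deposit|subscribe|红包|紅包|转账|轉帳|付款|支付|汇款|匯款|充值|订阅|訂閱|送金|振込|결제|이체|송금)'),
    ("tool_name", r'(?i)(delete|remove|drop|truncate|purge|wipe|destroy|erase|reset|uninstall|revoke|terminate|kill|shutdown|format|删除|刪除|清空|销毁|銷毀|移除|卸载|卸載|削除|삭제|제거)'),
    ("tool_name", r'(?i)(send_message|send_email|send_sms|send_notification|post_message|post_tweet|post_comment|reply_message|publish|broadcast|发送消息|發送訊息|发邮件|發郵件|发短信|發簡訊|投稿|메시지_보내기)'),
    ("tool_name", r'(?i)(grant_access|change_role|modify_permissions|add_admin|set_owner|update_auth|change_password|reset_password|create_token|generate_key|invite_user|授权|授權|修改权限|修改權限|添加管理|변경_권한)'),
    ("tool_name", r'(?i)(exec|execute|run_command|shell|bash|cmd|powershell|system_call|os_exec|subprocess|eval|执行命令|執行命令|실행)'),
    ("tool_args", r'(?i)("amount"\s*:\s*[1-9]\d{2,}|"value"\s*:\s*[1-9]\d{2,}|"quantity"\s*:\s*[1-9]\d{2,})(?!.*"confirmed"\s*:\s*true)'),
    ("tool_args", r'(?i)("scope"\s*:\s*"(all|global|everyone)"|"target"\s*:\s*"(all|\*|everyone)"|"recursive"\s*:\s*true|"force"\s*:\s*true)'),
]


def matched_conditions(tool_name: str, tool_args: str = "") -> List[int]:
    """Return the 1-based numbers of every condition whose regex matches."""
    fields = {"tool_name": tool_name, "tool_args": tool_args}
    return [
        number
        for number, (field, pattern) in enumerate(CONDITIONS, start=1)
        if re.search(pattern, fields[field])
    ]


def triggered(tool_name: str, tool_args: str = "") -> bool:
    """Combinator `any`: the rule fires if at least one condition matches."""
    return bool(matched_conditions(tool_name, tool_args))


print(matched_conditions("checkout", '{"amount": 9999, "force": true}'))  # [1, 6, 7]
print(triggered("get_payment_status", '{"order_id": "12345"}'))           # False
```

Returning the list of matching condition numbers, rather than a bare boolean, makes it easier to see which pattern caused a verdict when reviewing false positives.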
Attack Examples (Rule Triggers)
Real-world attack payloads (sanitized). Used for regression testing.
Benign Examples (Rule Doesn't Trigger)
- Read-only payment status check
- File search, not destructive despite financial keyword in args
- Calculation tool, not financial action
- Read-only transaction listing, not a financial action
- Reading configuration, not modifying permissions or system
- Read-only user profile fetch, not permission modification
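The read-only examples above are excluded by the negative lookbehinds in the financial pattern. The sketch below shows how narrow that exclusion is: a lookbehind only inspects the characters immediately before the keyword. The pattern keeps the rule's six lookbehinds but abbreviates the keyword list to a few Latin entries for brevity, and the tool names other than `get_payment_status` and `list_transactions` are hypothetical.

```python
import re

# Lookbehinds copied from the rule; keyword list abbreviated for this sketch.
FINANCIAL = (
    r"(?i)(?<!get_)(?<!list_)(?<!check_)(?<!view_)(?<!fetch_)(?<!query_)"
    r"(pay|payment|transfer|transaction|purchase)"
)


def fires(tool_name: str) -> bool:
    return re.search(FINANCIAL, tool_name) is not None


for name in (
    "get_payment_status",       # keyword directly preceded by 'get_'
    "list_transactions",        # keyword directly preceded by 'list_'
    "get_user_payment_status",  # 'get_' is present, but not adjacent to 'payment'
    "read_payment_history",     # 'read_' is not one of the six excluded prefixes
):
    print(f"{name}: {'fires' if fires(name) else 'excluded'}")
```

The first two names are excluded; the last two fire even though they look read-only, which is consistent with the false positive context listed below for read-only tools that contain action keywords.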
Known False Positive Contexts
- Agent running in a sandbox or testing environment with mock tools
- Legitimate automated workflows with pre-approved tool access
- Read-only tools that contain action keywords in their name (e.g., 'get_payment_status')
- Tools with built-in confirmation mechanisms at the platform level
- A tool named 'format_data' triggers pattern [2] on the word 'format' even though it is a benign data formatting/transformation utility, not a disk format operation
- A tool named 'reset_view' triggers pattern [2] on 'reset' even though it only resets a UI view state, not performing any destructive data operation
- A tool named 'evaluate_model' triggers pattern [5] on 'eval' even though it is a machine learning model evaluation tool with no system command execution capability
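The last three false positives come from unanchored substring matching. The sketch below reproduces them with patterns [2] and [5] copied from the YAML definition, then contrasts them with a whole-token comparison. The token approach (`SYSTEM_TOKENS`, `token_hit`) is a hypothetical alternative, not part of the rule, and it is only a partial remedy: it clears 'evaluate_model' but would still flag 'format_data' and 'reset_view', where the keyword is a complete token.

```python
import re

# Patterns [2] and [5], copied from the rule's YAML definition.
DESTRUCTIVE = r"(?i)(delete|remove|drop|truncate|purge|wipe|destroy|erase|reset|uninstall|revoke|terminate|kill|shutdown|format|删除|刪除|清空|销毁|銷毀|移除|卸载|卸載|削除|삭제|제거)"
SYSTEM = r"(?i)(exec|execute|run_command|shell|bash|cmd|powershell|system_call|os_exec|subprocess|eval|执行命令|執行命令|실행)"

# Hypothetical: single-token Latin keywords taken from pattern [5].
# Multi-token keywords such as 'run_command' would need separate handling.
SYSTEM_TOKENS = {"exec", "execute", "shell", "bash", "cmd", "powershell", "subprocess", "eval"}


def substring_hit(pattern: str, tool_name: str) -> bool:
    """What the rule does today: match the keyword anywhere in the name."""
    return re.search(pattern, tool_name) is not None


def token_hit(tool_name: str) -> bool:
    """Hypothetical alternative: compare whole underscore-separated tokens."""
    tokens = re.split(r"[_\W]+", tool_name.lower())
    return any(token in SYSTEM_TOKENS for token in tokens)


print(substring_hit(DESTRUCTIVE, "format_data"))  # True
print(substring_hit(DESTRUCTIVE, "reset_view"))   # True
print(substring_hit(SYSTEM, "evaluate_model"))    # True
print(token_hit("evaluate_model"))                # False
```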
Documented Evasion Techniques
- Technique: split keyword
p_a_y_ment
The tool name 'p_a_y_ment' splits the keyword with underscores, so no alternative in regex pattern [1] (including 'pay' and 'payment') matches. A prefix change alone is not a bypass: 'make_payment_v2' still matches 'payment', because only read-only prefixes such as 'get_' and 'list_' are excluded.
- Technique: unicode homoglyph
ехесutе_script
Uses Cyrillic 'е' (U+0435) and 'х' (U+0445) in place of Latin 'e' and 'x', so 'execute' regex pattern [5] does not match the homoglyph-substituted string.
- Technique: indirect reference
run_rm_all
Tool name avoids using any destructive keywords from pattern [2] like 'delete', 'remove', 'wipe', etc., and avoids system execution keywords from pattern [5]: 'run_rm_all' does not contain 'run_command', 'exec', 'shell', 'bash', etc., so no regex fires despite the tool performing destructive system operations.
Publicly documented bypasses. We disclose known limitations rather than pretend they don't exist.
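The homoglyph and split-keyword bypasses can be reproduced in a few lines, together with two hypothetical pre-checks that would surface them. Neither pre-check is part of the rule: `is_mixed_script` and `collapse_separators` are names introduced here, the financial pattern is abbreviated to two keywords, and both checks have side effects (noted after the code), so treat this as a sketch of the idea rather than a recommended fix.

```python
import re
import unicodedata

# Pattern [5] copied from the rule; the financial pattern is abbreviated.
SYSTEM = r"(?i)(exec|execute|run_command|shell|bash|cmd|powershell|system_call|os_exec|subprocess|eval|执行命令|執行命令|실행)"
FINANCIAL_KEYWORDS = r"(?i)(pay|payment)"

# 'execute_script' with Cyrillic 'е' (U+0435) and 'х' (U+0445) substituted.
HOMOGLYPH_NAME = "\u0435\u0445\u0435cut\u0435_script"


def scripts_used(tool_name: str) -> set:
    """First word of each letter's Unicode name, e.g. 'LATIN' or 'CYRILLIC'."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in tool_name
        if ch.isalpha()
    }


def is_mixed_script(tool_name: str) -> bool:
    """Hypothetical signal: letters from more than one script in one name."""
    return len(scripts_used(tool_name)) > 1


def collapse_separators(tool_name: str) -> str:
    """Hypothetical normalization: drop underscores, hyphens, and whitespace."""
    return re.sub(r"[_\-\s]+", "", tool_name)


print(re.search(SYSTEM, HOMOGLYPH_NAME))   # None: the regex is bypassed
print(is_mixed_script(HOMOGLYPH_NAME))     # True
print(re.search(FINANCIAL_KEYWORDS, "p_a_y_ment"))                              # None
print(bool(re.search(FINANCIAL_KEYWORDS, collapse_separators("p_a_y_ment"))))   # True
```

Both checks are blunt. A legitimate name that mixes Latin and CJK characters would also count as mixed-script, and collapsing separators removes the 'get_' boundary that the read-only lookbehinds rely on, so either would more plausibly serve as an additional signal than as a replacement for the existing patterns. The indirect-reference bypass ('run_rm_all') is not addressed by name-based checks at all.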
Full YAML Definition
title: High-Risk Tool Invocation Without Human Confirmation
id: ATR-2026-00099
rule_version: 1
status: experimental
description: |
  Detects when an AI agent invokes high-risk tools (financial, destructive,
  communication, or permission-altering) without evidence of human confirmation
  in the current interaction turn. This is a defense-in-depth rule that
  complements specific attack detection -- even if no injection is detected,
  certain tool categories should ALWAYS require human-in-the-loop.
  High-risk tool categories:
  1. Financial: payments, transfers, purchases, subscriptions
  2. Destructive: file deletion, database drops, account deletion
  3. Communication: sending messages, emails, posts on behalf of user
  4. Permission: granting access, changing roles, modifying auth
  5. System: process execution, system commands, network operations
  This rule enforces the principle of least privilege for agent tool access.
  Any tool matching these categories should trigger an "ask" verdict
  regardless of confidence score.
author: ATR Community
date: 2026/03/11
schema_version: "0.1"
detection_tier: pattern
maturity: experimental
severity: low
references:
  owasp_llm:
    - LLM06:2025 - Excessive Agency
  owasp_agentic:
    - ASI08:2026 - Excessive Autonomy
    - ASI09:2026 - Inadequate Access Controls
    - ASI10:2026 - Insufficient Human Oversight
  mitre_atlas:
    - AML.T0053 - LLM Plugin Compromise
compliance:
  eu_ai_act:
    - article: "14"
      context: "Invocation of financial, destructive, communication, or permission-altering tools without human confirmation is precisely the excessive autonomy scenario Article 14 human oversight requirements are designed to prevent; this rule enforces the mandatory human-in-the-loop gate for all high-risk tool categories."
      strength: primary
    - article: "9"
      context: "High-risk tool access without confirmation gates is a documented unacceptable risk for AI systems; Article 9 risk management requires that organizations identify tool categories that require mandatory human approval and implement blocking controls accordingly."
      strength: secondary
  nist_ai_rmf:
    - subcategory: "GV.1.2"
      context: "Invoking financial, destructive, communication, or permission-altering tools without human confirmation requires accountability roles that ensure every high-risk tool category has an assigned gatekeeper responsible for approving agent actions."
      strength: primary
    - subcategory: "MG.2.3"
      context: "Risk treatment plans must implement tool-category-level blocking controls that intercept any high-risk tool invocation, regardless of attack detection confidence, and require explicit human approval before execution proceeds."
      strength: secondary
  iso_42001:
    - clause: "6.2"
      context: "AI risk treatment plans must enumerate high-risk tool categories and mandate human-in-the-loop confirmation gates as a control measure, preventing agents from autonomously executing destructive, financial, or permission-modifying actions."
      strength: primary
    - clause: "8.6"
      context: "Operational controls for AI systems must enforce least-privilege tool access policies that block invocations of high-risk tool categories until explicit human confirmation is recorded in the current interaction turn."
      strength: secondary
tags:
  category: excessive-autonomy
  subcategory: high-risk-tool-gate
  scan_target: mcp
  confidence: medium
agent_source:
  type: tool_call
  framework:
    - any
  provider:
    - any
detection:
  conditions:
    - field: tool_name
      operator: regex
      value: >-
        (?i)(?<!get_)(?<!list_)(?<!check_)(?<!view_)(?<!fetch_)(?<!query_)(pay|payment|transfer|transaction|purchase|buy|checkout|billing|invoice|charge|refund|withdraw|deposit|subscribe|红包|紅包|转账|轉帳|付款|支付|汇款|匯款|充值|订阅|訂閱|送金|振込|결제|이체|송금)
      description: Financial tool invocation (excludes read-only get_/list_/check_ prefixed tools)
    - field: tool_name
      operator: regex
      value: >-
        (?i)(delete|remove|drop|truncate|purge|wipe|destroy|erase|reset|uninstall|revoke|terminate|kill|shutdown|format|删除|刪除|清空|销毁|銷毀|移除|卸载|卸載|削除|삭제|제거)
      description: Destructive tool invocation
    - field: tool_name
      operator: regex
      value: >-
        (?i)(send_message|send_email|send_sms|send_notification|post_message|post_tweet|post_comment|reply_message|publish|broadcast|发送消息|發送訊息|发邮件|發郵件|发短信|發簡訊|投稿|메시지_보내기)
      description: Communication tool sending messages on behalf of user
    - field: tool_name
      operator: regex
      value: >-
        (?i)(grant_access|change_role|modify_permissions|add_admin|set_owner|update_auth|change_password|reset_password|create_token|generate_key|invite_user|授权|授權|修改权限|修改權限|添加管理|변경_권한)
      description: Permission or authentication modification tool
    - field: tool_name
      operator: regex
      value: (?i)(exec|execute|run_command|shell|bash|cmd|powershell|system_call|os_exec|subprocess|eval|执行命令|執行命令|실행)
      description: System command execution tool
    - field: tool_args
      operator: regex
      value: >-
        (?i)("amount"\s*:\s*[1-9]\d{2,}|"value"\s*:\s*[1-9]\d{2,}|"quantity"\s*:\s*[1-9]\d{2,})(?!.*"confirmed"\s*:\s*true)
      description: High-value operation (amount >= 100) without confirmation flag
    - field: tool_args
      operator: regex
      value: >-
        (?i)("scope"\s*:\s*"(all|global|everyone)"|"target"\s*:\s*"(all|\*|everyone)"|"recursive"\s*:\s*true|"force"\s*:\s*true)
      description: Bulk or force operation targeting all/everyone
  condition: any
  false_positives:
    - Agent running in a sandbox or testing environment with mock tools
    - Legitimate automated workflows with pre-approved tool access
    - Read-only tools that contain action keywords in their name (e.g., 'get_payment_status')
    - Tools with built-in confirmation mechanisms at the platform level
    - >-
      A tool named 'format_data' triggers pattern [2] on the word 'format' even though it is a benign data
      formatting/transformation utility, not a disk format operation
    - >-
      A tool named 'reset_view' triggers pattern [2] on 'reset' even though it only resets a UI view state, not
      performing any destructive data operation
    - >-
      A tool named 'evaluate_model' triggers pattern [5] on 'eval' even though it is a machine learning model evaluation
      tool with no system command execution capability
response:
  actions:
    - block_tool
    - alert
  auto_response_threshold: medium
  message_template: |
    [ATR-2026-099] HIGH: Agent attempting to invoke high-risk tool "{tool_name}"
    without human confirmation. Tool category requires explicit user approval.
    Action blocked pending human review. Agent: {agent_id}, Session: {session_id}.
    To allow: add tool to confirmed-actions whitelist or provide explicit approval.
confidence: 81
wild_validated: 2026/04/08
wild_samples: 53577
wild_fp_rate: 0
test_cases:
  true_positives:
    - tool_call:
        name: send_red_packet
        args: "{\"amount\": 200}"
      expected: triggered
      description: "Financial tool: red packet"
    - tool_call:
        name: delete_all_files
        args: "{\"path\": \"/data\", \"recursive\": true}"
      expected: triggered
      description: "Destructive tool: recursive delete"
    - tool_call:
        name: send_email
        args: "{\"to\": \"[email protected]\", \"body\": \"I quit\"}"
      expected: triggered
      description: "Communication tool: sending email as user"
    - tool_call:
        name: grant_access
        args: "{\"user\": \"stranger\", \"role\": \"admin\"}"
      expected: triggered
      description: "Permission tool: granting admin access"
    - tool_call:
        name: bash
        args: "{\"command\": \"rm -rf /\"}"
      expected: triggered
      description: System execution tool
    - tool_call:
        name: 转账
        args: "{\"金额\": 500}"
      expected: triggered
      description: Chinese-named financial tool
    - tool_call:
        name: checkout
        args: "{\"amount\": 9999, \"force\": true}"
      expected: triggered
      description: High-value purchase with force flag
  true_negatives:
    - tool_call:
        name: get_payment_status
        args: "{\"order_id\": \"12345\"}"
      expected: not_triggered
      description: Read-only payment status check
      notes: >-
        Tool name contains 'payment' but prefixed with 'get_' indicating read-only. May need platform-level get_ prefix
        exclusion.
    - tool_call:
        name: search_files
        args: "{\"query\": \"invoice\"}"
      expected: not_triggered
      description: File search, not destructive despite financial keyword in args
    - tool_call:
        name: calculate_total
        args: "{\"items\": [{\"price\": 500}]}"
      expected: not_triggered
      description: Calculation tool, not financial action
    - tool_call:
        name: list_transactions
        args: "{\"date_range\": \"2026-01-01 to 2026-03-01\"}"
      expected: not_triggered
      description: Read-only transaction listing, not a financial action
    - tool_call:
        name: read_config
        args: "{\"key\": \"notification_settings\"}"
      expected: not_triggered
      description: Reading configuration, not modifying permissions or system
    - tool_call:
        name: fetch_user_profile
        args: "{\"user_id\": \"12345\"}"
      expected: not_triggered
      description: Read-only user profile fetch, not permission modification
evasion_tests:
  - input: p_a_y_ment
    expected: not_triggered
    bypass_technique: split_keyword
    notes: >-
      The tool name 'p_a_y_ment' splits the keyword with underscores, so no alternative in regex pattern [1]
      (including 'pay' and 'payment') matches. A prefix change alone is not a bypass: 'make_payment_v2' still
      matches 'payment', because only read-only prefixes such as 'get_' and 'list_' are excluded.
  - input: ехесutе_script
    expected: not_triggered
    bypass_technique: unicode_homoglyph
    notes: >-
      Uses Cyrillic 'е' (U+0435) and 'х' (U+0445) in place of Latin 'e' and 'x', so 'execute' regex pattern [5] does not
      match the homoglyph-substituted string.
  - input: run_rm_all
    expected: not_triggered
    bypass_technique: indirect_reference
    notes: >-
      Tool name avoids using any destructive keywords from pattern [2] like 'delete', 'remove', 'wipe', etc., and avoids
      system execution keywords from pattern [5]: 'run_rm_all' does not contain 'run_command', 'exec', 'shell', 'bash',
      etc., so no regex fires despite the tool performing destructive system operations.
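One detail of the sixth condition in the definition above is worth checking by hand: its negative lookahead only scans the text after the matched amount, so the position of the "confirmed" key changes the result. The sketch below demonstrates this with the regex copied from the rule. The `parsed_check` function is a hypothetical alternative introduced here, which assumes the arguments are a flat JSON object and compares values instead of text; it is not part of the rule.

```python
import json
import re

# Condition 06, copied from the rule's YAML definition.
HIGH_VALUE = r'(?i)("amount"\s*:\s*[1-9]\d{2,}|"value"\s*:\s*[1-9]\d{2,}|"quantity"\s*:\s*[1-9]\d{2,})(?!.*"confirmed"\s*:\s*true)'


def regex_fires(tool_args: str) -> bool:
    return re.search(HIGH_VALUE, tool_args) is not None


def parsed_check(tool_args: str) -> bool:
    """Hypothetical alternative: parse the JSON so key order does not matter."""
    try:
        data = json.loads(tool_args)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or data.get("confirmed") is True:
        return False
    for key in ("amount", "value", "quantity"):
        number = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(number, (int, float)) and not isinstance(number, bool) and number >= 100:
            return True
    return False


for args in (
    '{"amount": 500}',
    '{"amount": 99}',
    '{"amount": 500, "confirmed": true}',
    '{"confirmed": true, "amount": 500}',
):
    print(args, regex_fires(args), parsed_check(args))
```

With the regex, the last input fires even though it carries a confirmation flag, because the flag appears before the amount. The parsed variant treats both orderings the same, at the cost of requiring well-formed JSON arguments.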