Agent - Guardrail jailbreak / prompt-injection signals

Agent Guardrail Jailbreak Signals

Query

let window = 1d;
AppDependencies
| where TimeGenerated > ago(window)
| where isnotempty(Properties["microsoft.foundry.content_filter.results"])
| extend
    Agent     = tostring(Properties["gen_ai.agent.name"]),
    Model     = tostring(Properties["gen_ai.request.model"]),
    ConvId    = tostring(Properties["gen_ai.conversation.id"]),
    ProjectId = tostring(Properties["microsoft.foundry.project.id"]),
    Prompt    = tostring(Properties["gen_ai.input.messages"]),
    ToolName  = tostring(Properties["gen_ai.tool.name"]),
    ToolArgs  = tostring(Properties["gen_ai.tool.call.arguments"]),
    FilterArr = todynamic(tostring(Properties["microsoft.foundry.content_filter.results"]))
| mv-expand Entry = FilterArr
| extend
    SourceType = tostring(Entry.source_type),
    Blocked    = tobool(Entry.blocked),
    Filter     = todynamic(Entry.content_filter_results)
| extend
    JailbreakDetected = tobool(Filter.jailbreak.detected) or tobool(Filter.jailbreak.filtered),
    PromptShieldHit   = tobool(Filter.prompt_shield.detected) or tobool(Filter.prompt_shield.filtered),
    IndirectAttackHit = tobool(Filter.indirect_attack.detected) or tobool(Filter.indirect_attack.filtered)
| where JailbreakDetected or PromptShieldHit or IndirectAttackHit
| extend Signal = case(
    JailbreakDetected, "Jailbreak",
    PromptShieldHit,   "PromptShield",
    IndirectAttackHit, "IndirectPromptInjection",
    "Unknown")
| project
    TimeGenerated, Signal, SourceType, Blocked, Agent, Model, ProjectId, ConvId,
    ToolName, ToolArgs, Prompt
| order by TimeGenerated desc

Explanation

This query is designed to detect attempts to bypass security measures in an AI system by identifying specific signals that indicate potential security threats. Here's a simplified breakdown:

Time Frame: It looks at data from the past day (1d).
Data Source: It examines telemetry data from the AppDependencies table, specifically focusing on entries that have content filter results.
Data Extraction: It extracts various properties such as the agent name, model, conversation ID, project ID, input messages (prompts), tool name, and tool arguments.
Content Filter Analysis: It analyzes the content filter results to identify if any of the following were detected:
- Jailbreak: Attempts to override the agent's instructions.
- PromptShield: Hits on prompt shields designed to prevent unauthorized actions.
- Indirect Prompt Injection: Attempts to smuggle instructions through indirect means.
Signal Identification: If any of these detections are found, it labels them with a signal type (e.g., "Jailbreak", "PromptShield", "IndirectPromptInjection").
Output: It projects relevant details such as the time of detection, signal type, source type, whether the action was blocked, and other contextual information like the agent and tool details.
Ordering: The results are ordered by the time they were generated, with the most recent first.

The query is tagged with tactics and techniques related to defense evasion and initial access, indicating its relevance to security monitoring and threat detection.

Details

David Alonso

Released: June 8, 2026

Tables

AppDependencies

Keywords

AppDependenciesPropertiesAgentModelConvIdProjectIdPromptToolNameToolArgsFilterArrEntrySourceTypeBlockedFilterJailbreakDetectedPromptShieldHitIndirectAttackHitSignalTimeGeneratedDefenseEvasionInitialAccessTacticsTechniquesSentinelAsCodeCustomFoundryAIContentSafetyGuardrailsJailbreak

Operators

letwhereisnotemptyextendtostringtodynamicmv-expandtoboolorcaseprojectorder bydesc

Tactics

DefenseEvasionInitialAccess

MITRE Techniques

T1562 T1059

Actions

GitHub

KQL Search