Microsoft 365 Copilot - Multi-turn jailbreak escalation hunting

Copilot Jailbreak Multi Turn Hunting

Query

// Confirmed schema: per-message JailbreakDetected aggregated by ThreadId.
// A thread with two or more flagged messages is a multi-turn jailbreak
// attempt (single-turn jailbreaks are already covered by the analytic).
let window = 1d;
CopilotActivity
| where TimeGenerated > ago(window)
| where RecordType == "CopilotInteraction"
| extend ThreadId = tostring(LLMEventData.ThreadId)
| mv-expand m = LLMEventData.Messages
| extend
    MessageId = tostring(m.Id),
    IsPrompt = tobool(m.isPrompt),
    JbDetected = tobool(m.JailbreakDetected)
| summarize
    Messages = count(),
    Prompts = countif(IsPrompt),
    JailbreakHits = countif(JbDetected),
    PromptJailbreakHits = countif(JbDetected and IsPrompt),
    Agents = make_set(AgentName, 4),
    Actors = make_set(ActorName, 4),
    JbMessageIds = make_set_if(MessageId, JbDetected, 32),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated)
    by ThreadId, TenantId
| where JailbreakHits >= 2
| extend EscalationRatio = todouble(JailbreakHits) / todouble(Messages)
| order by JailbreakHits desc, EscalationRatio desc, LastSeen desc
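
Once the hunting query returns a thread, an analyst will usually want the flagged messages for that one thread. The following is a minimal drill-down sketch, not part of the original query. The variable name targetThread and its placeholder value are hypothetical. It uses only the columns and LLMEventData fields the hunting query already relies on, and it lists per-message flags rather than message text.

```kql
// Hypothetical drill-down: paste a ThreadId returned by the hunting query.
let targetThread = "<ThreadId from the hunting query>";
CopilotActivity
| where TimeGenerated > ago(1d)
| where RecordType == "CopilotInteraction"
| extend ThreadId = tostring(LLMEventData.ThreadId)
| where ThreadId == targetThread
// One row per message in the thread, as in the hunting query.
| mv-expand m = LLMEventData.Messages
| project
    TimeGenerated,
    ActorName,
    AgentName,
    MessageId = tostring(m.Id),
    IsPrompt = tobool(m.isPrompt),
    JbDetected = tobool(m.JailbreakDetected)
| order by TimeGenerated asc
```

Keep the lookback here at least as long as the window in the hunting query, otherwise earlier turns of the same thread may be missing from the result.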

Explanation

This query surfaces Microsoft 365 Copilot conversations in which more than one message was flagged as a jailbreak attempt. Here's a simplified breakdown of what the query does:

Purpose: The query hunts for "multi-turn jailbreaks," where a user starts with harmless questions, gradually escalates to role-play or persona language, and eventually attempts to bypass policies. The query does not analyze the wording of the prompts itself. It treats two or more flagged messages in the same thread as the signal of a multi-turn attempt.
Time Frame: It examines interactions within a one-day window.
Data Source: It looks at records from the CopilotActivity table, specifically those labeled as "CopilotInteraction."
Process:
- It extracts and processes each message in a conversation thread.
- It checks if each message is a prompt and whether it has been flagged as a potential jailbreak attempt.
- It aggregates data by conversation thread (ThreadId) and tenant (TenantId).
Criteria for Detection:
- A conversation is flagged if there are two or more messages in the thread that are identified as potential jailbreak attempts.
Output:
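- It calculates EscalationRatio, the number of flagged messages divided by the total number of messages in the thread (prompts and responses alike).
- It returns, per thread, up to four agent names and four actor names, up to 32 flagged message IDs, and the first and last seen timestamps.
- It orders the results, all descending, by the number of flagged messages, then by escalation ratio, then by the time of the last message.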
Objective: The goal is to surface these potentially malicious conversations for further review by an analyst, allowing them to examine the full transcript of the interaction.
Security Context: The query is mapped to the MITRE ATT&CK tactics Defense Evasion and Initial Access, and to the techniques T1562 (Impair Defenses) and T1059 (Command and Scripting Interpreter).
Tags: The query is tagged for use with Sentinel-As-Code, Custom solutions, Copilot, and AI-related activities.
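
The aggregation and threshold described above can be tried without access to the CopilotActivity table. The sketch below runs the same summarize logic over a small synthetic datatable. The table name SampleMessages and every value in it are invented for illustration. Only the column names mirror the query.

```kql
// Synthetic data only: values are invented, column names mirror the hunting query.
let SampleMessages = datatable(ThreadId:string, MessageId:string, IsPrompt:bool, JbDetected:bool)
[
    "thread-A", "a1", true,  false,
    "thread-A", "a2", false, false,
    "thread-A", "a3", true,  true,
    "thread-A", "a4", true,  true,
    "thread-B", "b1", true,  true,
    "thread-B", "b2", false, false
];
SampleMessages
| summarize
    Messages = count(),
    JailbreakHits = countif(JbDetected),
    PromptJailbreakHits = countif(JbDetected and IsPrompt),
    JbMessageIds = make_set_if(MessageId, JbDetected, 32)
    by ThreadId
// Keep only threads with two or more flagged messages.
| where JailbreakHits >= 2
| extend EscalationRatio = todouble(JailbreakHits) / todouble(Messages)
```

Only thread-A is returned, with Messages = 4, JailbreakHits = 2 and EscalationRatio = 0.5. thread-B has a single flagged message and is dropped by the threshold. Note that EscalationRatio measures how densely a thread is flagged, not the order in which the flags occurred, so a short thread with two hits ranks above a long one with the same count.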

Details

David Alonso

Released: May 20, 2026

Tables

CopilotActivity

Keywords

Microsoft, CopilotActivity, ThreadId, Messages, Prompts, Agents, Actors, TenantId, JailbreakHits, EscalationRatio, TimeGenerated

Operators

let, |, where, >, ago, ==, extend, tostring, mv-expand, tobool, summarize, count(), countif, make_set, make_set_if, min, max, by, >=, todouble, /, order by, desc

Tactics

Defense Evasion, Initial Access

MITRE Techniques

T1562 T1059
