Query Details
id: 6b9d4e75-4b7e-5c9f-ae80-1d2b3c4d5e62
name: Microsoft 365 Copilot - Toxic or unsafe agent output
description: |
  Detects Microsoft 365 Copilot responses that the platform safety classifier
  flagged (hate, sexual, self-harm, violence, weapons, malware) or
  that contain toxic / harassing language patterns. Acts as a
  defence-in-depth signal in case downstream content filters in
  the calling app fail open.
  An agent producing toxic or harmful content typically indicates
  one of: a successful jailbreak, a misconfigured / unsafe system
  prompt, model drift, or RAG poisoning. Triage by pivoting to
  the prompt and tool sequence in the same conversation.
severity: High
requiredDataConnectors:
  - connectorId: MicrosoftCopilot
    dataTypes:
      - CopilotActivity
queryFrequency: PT15M
queryPeriod: PT1H
triggerOperator: gt
triggerThreshold: 0
enabled: true
tactics:
  - Impact
relevantTechniques:
  - T1496
query: |
  let toxicCategories = dynamic([
      "hate", "sexual", "self-harm", "selfharm", "violence",
      "weapons", "malware", "harassment", "extremism"
  ]);
  let toxicMarkers = dynamic([
      "kill yourself", "subhuman", "racial slur",
      "build a bomb", "make ricin", "synthesise vx",
      "ransomware payload", "wiper script"
  ]);
  CopilotActivity
  | where TimeGenerated > ago(1h)
  | extend
      SafetyVerdict = tostring(LLMEventData.SafetyVerdict),
      SafetyCategories = tostring(LLMEventData.SafetyCategories),
      Response = tostring(LLMEventData.Response),
      ConversationId = tostring(LLMEventData.ConversationId)
  | extend LowerResponse = tolower(Response)
  | where SafetyVerdict in~ ("blocked", "flagged", "unsafe")
      or SafetyCategories has_any (toxicCategories)
      or LowerResponse has_any (toxicMarkers)
  | summarize
      UnsafeOutputCount = count(),
      Verdicts = make_set(SafetyVerdict, 16),
      Categories = make_set(SafetyCategories, 16),
      Conversations = make_set(ConversationId, 16),
      SampleResponses = make_set(Response, 8),
      ClientIPs = make_set(SrcIpAddr, 16),
      FirstSeen = min(TimeGenerated),
      LastSeen = max(TimeGenerated)
      by AgentId, AgentName, ActorName, TenantId
  | extend SrcIpAddr = tostring(ClientIPs[0])
entityMappings:
  - entityType: CloudApplication
    fieldMappings:
      - identifier: Name
        columnName: AgentName
      - identifier: AppId
        columnName: AgentId
  - entityType: Account
    fieldMappings:
      - identifier: Name
        columnName: ActorName
  - entityType: IP
    fieldMappings:
      - identifier: Address
        columnName: SrcIpAddr
eventGroupingSettings:
  aggregationKind: SingleAlert
incidentConfiguration:
  createIncident: true
  groupingConfiguration:
    enabled: true
    reopenClosedIncident: false
    lookbackDuration: PT5H
    matchingMethod: Selected
    groupByEntities:
      - CloudApplication
    groupByAlertDetails: []
    groupByCustomDetails: []
version: 1.0.0
kind: Scheduled
tags:
  - Sentinel-As-Code
  - Custom
  - Copilot
  - AI
This query detects potentially harmful or toxic output from Microsoft 365 Copilot agents. Here's a simplified breakdown of what it does:
Purpose: The query identifies Microsoft 365 Copilot responses that the platform safety classifier flagged, or that contain toxic language, covering categories such as hate speech, sexual content, self-harm, violence, weapons, malware, harassment, and extremism. It is meant as a defence-in-depth signal in case content filters in the calling app fail open.
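Detection: A response matches if any one of three conditions holds: its safety verdict is "blocked," "flagged," or "unsafe" (compared case-insensitively); its safety categories contain one of the listed toxic categories (e.g., "hate," "malware"); or the response text contains one of the listed toxic phrases (e.g., "kill yourself," "build a bomb").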
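The detection step is a logical OR over the verdict, the categories, and the response text. The Python sketch below mirrors that predicate outside of Sentinel as an illustration only; it is not the rule itself. The helper names `has_term`, `has_any`, and `is_unsafe` are hypothetical, and the regex in `has_term` only approximates the term-based matching of KQL's `has_any`.

```python
import re

UNSAFE_VERDICTS = {"blocked", "flagged", "unsafe"}
TOXIC_CATEGORIES = ["hate", "sexual", "self-harm", "selfharm", "violence",
                    "weapons", "malware", "harassment", "extremism"]
TOXIC_MARKERS = ["kill yourself", "subhuman", "racial slur",
                 "build a bomb", "make ricin", "synthesise vx",
                 "ransomware payload", "wiper script"]


def has_term(text: str, term: str) -> bool:
    """Approximation of KQL `has`: case-insensitive match on whole terms.

    The lookarounds reject matches embedded in a longer alphanumeric run,
    so "subhuman" does not match inside "subhumans".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def has_any(text: str, terms: list[str]) -> bool:
    """True if at least one term matches (approximates KQL `has_any`)."""
    return any(has_term(text, t) for t in terms)


def is_unsafe(verdict: str, categories: str, response: str) -> bool:
    """Hypothetical mirror of the rule's `where` clause: any condition suffices."""
    return (
        verdict.lower() in UNSAFE_VERDICTS          # SafetyVerdict in~ (...)
        or has_any(categories, TOXIC_CATEGORIES)    # SafetyCategories has_any (...)
        or has_any(response, TOXIC_MARKERS)         # response text has_any (...)
    )
```

Because `has_any` is already case-insensitive in KQL, the phrase check does not strictly depend on lowercasing the response first; the sketch passes the raw text for that reason.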
Data Collection: The query looks at activity from the past hour and, for each group, collects the number of unsafe outputs, the distinct safety verdicts, categories, conversation IDs, and client IP addresses (up to 16 values each), and up to 8 distinct sample responses.
Analysis: It summarizes the data by grouping on the agent (AgentId and AgentName), the user (ActorName), and the tenant (TenantId), so each result row describes one user's unsafe outputs from one agent. It also records the first and last time these unsafe outputs were seen in each group.
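The grouping and collection steps can be pictured as a group-by with a counter, size-capped sets, and a min/max over timestamps. The Python sketch below reproduces that shape on a few invented events; the function names `capped_add` and `summarize` and all sample values are hypothetical. It keeps the first distinct values it encounters, which is an assumption: `make_set` with a size limit does not promise which values are retained.

```python
from datetime import datetime


def capped_add(values: list, item, cap: int) -> None:
    """Append item if it is new and the list is still below its cap."""
    if item not in values and len(values) < cap:
        values.append(item)


def summarize(events: list[dict]) -> dict:
    """Group events by (AgentId, AgentName, ActorName, TenantId)."""
    groups: dict = {}
    for e in events:
        key = (e["AgentId"], e["AgentName"], e["ActorName"], e["TenantId"])
        g = groups.setdefault(key, {
            "UnsafeOutputCount": 0,
            "Conversations": [],
            "SampleResponses": [],
            "FirstSeen": e["TimeGenerated"],
            "LastSeen": e["TimeGenerated"],
        })
        g["UnsafeOutputCount"] += 1
        capped_add(g["Conversations"], e["ConversationId"], 16)
        capped_add(g["SampleResponses"], e["Response"], 8)
        g["FirstSeen"] = min(g["FirstSeen"], e["TimeGenerated"])
        g["LastSeen"] = max(g["LastSeen"], e["TimeGenerated"])
    return groups


# Invented sample events, for illustration only.
events = [
    {"AgentId": "a1", "AgentName": "HelpDesk", "ActorName": "alice", "TenantId": "t1",
     "ConversationId": "c1", "Response": "r1", "TimeGenerated": datetime(2026, 5, 20, 10, 0)},
    {"AgentId": "a1", "AgentName": "HelpDesk", "ActorName": "alice", "TenantId": "t1",
     "ConversationId": "c1", "Response": "r2", "TimeGenerated": datetime(2026, 5, 20, 10, 5)},
    {"AgentId": "a1", "AgentName": "HelpDesk", "ActorName": "bob", "TenantId": "t1",
     "ConversationId": "c2", "Response": "r3", "TimeGenerated": datetime(2026, 5, 20, 10, 10)},
]
result = summarize(events)
```

In this sample, the two events from alice collapse into one row with a count of 2, while bob's event forms a separate row for the same agent.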
Alerting: The rule fires whenever the query returns at least one row (trigger threshold greater than 0). Each run produces a single alert covering all returned rows, and an incident is created from that alert.
Configuration: This is a Scheduled analytics rule that runs every 15 minutes (PT15M) and looks back over the past hour of data (PT1H). It is deployed in the enabled state with severity High.
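Since the one-hour lookback is four times the 15-minute run interval, consecutive runs overlap and a single event can fall inside the window of several runs. The Python sketch below works through that arithmetic. It assumes runs land exactly on 15-minute boundaries and ignores ingestion delay and any scheduling offset the platform applies; the function name `runs_covering` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

QUERY_FREQUENCY = timedelta(minutes=15)  # queryFrequency: PT15M
QUERY_PERIOD = timedelta(hours=1)        # queryPeriod: PT1H


def runs_covering(event_time: datetime, first_run: datetime, run_count: int) -> list:
    """Return the run times whose window (run - period, run] contains the event."""
    runs = [first_run + i * QUERY_FREQUENCY for i in range(run_count)]
    return [r for r in runs if r - QUERY_PERIOD < event_time <= r]


# Invented schedule: 12 runs starting at 09:00, one event at 10:07.
first_run = datetime(2026, 5, 20, 9, 0)
event_time = datetime(2026, 5, 20, 10, 7)
covering = runs_covering(event_time, first_run, run_count=12)

# The ratio of lookback to frequency gives the number of overlapping runs.
overlap_factor = QUERY_PERIOD // QUERY_FREQUENCY
```

Under these assumptions the event is seen by the runs at 10:15, 10:30, 10:45, and 11:00, so the same unsafe output can contribute to alerts in up to four consecutive runs.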
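Entity Mapping: AgentName and AgentId map to a CloudApplication entity (Name and AppId), ActorName maps to an Account entity (Name), and SrcIpAddr maps to an IP entity (Address). SrcIpAddr holds only the first value of the ClientIPs set, so any additional client IPs appear in the ClientIPs column but not as mapped entities.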
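Incident Management: Alert grouping is enabled, so alerts that share the same CloudApplication entity within a 5-hour lookback (PT5H) are grouped into one incident. Closed incidents are not reopened; a matching alert that arrives after closure starts a new incident.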
Overall, this query acts as a safety net that alerts on potentially harmful content generated by Microsoft 365 Copilot, so that analysts can triage it by pivoting to the prompt and tool sequence in the same conversation.

David Alonso
Released: May 20, 2026