Microsoft 365 Copilot - Toxic or unsafe agent output

Copilot Toxic Or Unsafe Output

Query

// Category terms matched against the SafetyCategories field.
let toxicCategories = dynamic([
    "hate", "sexual", "self-harm", "selfharm", "violence",
    "weapons", "malware", "harassment", "extremism"
]);
// Literal phrases matched against the response text.
let toxicMarkers = dynamic([
    "kill yourself", "subhuman", "racial slur",
    "build a bomb", "make ricin", "synthesise vx",
    "ransomware payload", "wiper script"
]);
CopilotActivity
| where TimeGenerated > ago(1h)
// Pull the fields of interest out of the dynamic LLMEventData column.
| extend
    SafetyVerdict = tostring(LLMEventData.SafetyVerdict),
    SafetyCategories = tostring(LLMEventData.SafetyCategories),
    Response = tostring(LLMEventData.Response),
    ConversationId = tostring(LLMEventData.ConversationId)
| extend LowerResponse = tolower(Response)
// Keep a record if any one of the three checks matches.
| where SafetyVerdict in~ ("blocked", "flagged", "unsafe")
    or SafetyCategories has_any (toxicCategories)
    or LowerResponse has_any (toxicMarkers)
// Roll up matches per agent, user, and tenant; make_set caps each list.
| summarize
    UnsafeOutputCount = count(),
    Verdicts = make_set(SafetyVerdict, 16),
    Categories = make_set(SafetyCategories, 16),
    Conversations = make_set(ConversationId, 16),
    SampleResponses = make_set(Response, 8),
    ClientIPs = make_set(SrcIpAddr, 16),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated)
    by AgentId, AgentName, ActorName, TenantId
// Expose the first collected IP as a scalar for entity mapping.
| extend SrcIpAddr = tostring(ClientIPs[0])

Explanation

This query detects potentially harmful or toxic output from Microsoft 365 Copilot agents. A breakdown of what it does:

Purpose: The query aims to identify responses from Microsoft 365 Copilot that are flagged as unsafe or contain toxic language, such as hate speech, sexual content, self-harm, violence, weapons, malware, harassment, or extremism.
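Detection: A record matches when any one of three conditions holds: the safety verdict is "blocked", "flagged", or "unsafe" (compared case-insensitively); the safety categories contain one of the listed category terms; or the response text contains one of the listed toxic phrases (e.g., "kill yourself", "build a bomb").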
Data Collection: The query looks at activity from the past hour and, for each group, collects the number of unsafe outputs, the safety verdicts and categories seen, conversation IDs, and client IP addresses (up to 16 distinct values each), plus up to 8 sample responses.
Analysis: It groups the results by agent (AgentId and AgentName), user (ActorName), and tenant (TenantId), and records the first and last time an unsafe output was seen in each group.
Alerting: If any unsafe outputs are detected, an alert is generated. The system is set up to create incidents based on these alerts, grouping them by the cloud application involved.
Configuration: The query runs every 15 minutes and looks back over the past hour, so consecutive runs overlap and the same event can be evaluated by up to four runs. It is enabled by default and is part of a scheduled monitoring system.
Entity Mapping: The query maps certain fields to entities like CloudApplication, Account, and IP to help in identifying and managing the incidents.
Incident Management: The system is configured to create incidents for detected issues, with specific settings for grouping and managing these incidents.

Overall, this query acts as a safety net for potentially harmful content generated by Microsoft 365 Copilot: it catches responses the safety system has already flagged as well as responses containing known toxic phrases, and raises an alert so they can be reviewed.
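
To check the three detection conditions without waiting for real CopilotActivity data, the filter can be run against a handful of inline rows. The following is a minimal sketch, not part of the published rule: the SampleActivity table, its Label column, and every row value are hypothetical, and it assumes LLMEventData carries SafetyVerdict, SafetyCategories, and Response under the same names the query above reads.

```kusto
// Same keyword lists as the rule.
let toxicCategories = dynamic([
    "hate", "sexual", "self-harm", "selfharm", "violence",
    "weapons", "malware", "harassment", "extremism"
]);
let toxicMarkers = dynamic([
    "kill yourself", "subhuman", "racial slur",
    "build a bomb", "make ricin", "synthesise vx",
    "ransomware payload", "wiper script"
]);
// Hypothetical stand-in for CopilotActivity; all values are invented.
// Label names the condition each row is meant to exercise.
let SampleActivity = datatable(Label:string, LLMEventData:dynamic) [
    "verdict only",  dynamic({"SafetyVerdict":"Blocked", "SafetyCategories":"", "Response":"Request declined."}),
    "category only", dynamic({"SafetyVerdict":"allowed", "SafetyCategories":"self-harm", "Response":"Here is a support line."}),
    "marker only",   dynamic({"SafetyVerdict":"allowed", "SafetyCategories":"", "Response":"Step 1: Build a bomb using household items"}),
    "benign",        dynamic({"SafetyVerdict":"allowed", "SafetyCategories":"", "Response":"Your meeting is at 3 pm."})
];
SampleActivity
| extend
    SafetyVerdict = tostring(LLMEventData.SafetyVerdict),
    SafetyCategories = tostring(LLMEventData.SafetyCategories),
    Response = tostring(LLMEventData.Response)
// Evaluate each clause of the rule's where filter as its own column.
| extend
    VerdictHit = SafetyVerdict in~ ("blocked", "flagged", "unsafe"),
    CategoryHit = SafetyCategories has_any (toxicCategories),
    MarkerHit = tolower(Response) has_any (toxicMarkers)
| project Label, VerdictHit, CategoryHit, MarkerHit
```

Splitting the filter into one boolean column per clause makes it easy to see which condition fires for a given row, and any mismatch against the Label stands out. The same three extend columns could be added temporarily to the real query when tuning the keyword lists.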

Details

David Alonso

Released: May 20, 2026

Tables

CopilotActivity

Keywords

Microsoft, Copilot, Activity, Safety, Verdict, Categories, Response, Conversation, Agent, Actor, Tenant, ClientIP, TimeGenerated, CloudApplication, Account, Address, Alert, Incident, SentinelAsCode, Custom, AI

Operators

let, dynamic, tostring, tolower, ago, in~, has_any, summarize, count, make_set, min, max, extend, where

Severity

High

Tactics

Impact

MITRE Techniques

T1496

Frequency: PT15M

Period: PT1H
