Query Details

Copilot Toxic Or Unsafe Output

Query

id: 6b9d4e75-4b7e-5c9f-ae80-1d2b3c4d5e62
name: Microsoft 365 Copilot - Toxic or unsafe agent output
description: |
  Detects Microsoft 365 Copilot responses that the platform safety classifier
  flagged (hate, sexual, self-harm, violence, weapons, malware) or
  that contain toxic / harassing language patterns. Acts as a
  defence-in-depth signal in case downstream content filters in
  the calling app fail open.

  An agent producing toxic or harmful content typically indicates
  one of: a successful jailbreak, a misconfigured / unsafe system
  prompt, model drift, or RAG poisoning. Triage by pivoting to
  the prompt and tool sequence in the same conversation.
severity: High
requiredDataConnectors:
- connectorId: MicrosoftCopilot
  dataTypes:
  - CopilotActivity
queryFrequency: PT15M
queryPeriod: PT1H
triggerOperator: gt
triggerThreshold: 0
enabled: true
tactics:
- Impact
relevantTechniques:
- T1496
query: |
  let toxicCategories = dynamic([
      "hate", "sexual", "self-harm", "selfharm", "violence",
      "weapons", "malware", "harassment", "extremism"
  ]);
  let toxicMarkers = dynamic([
      "kill yourself", "subhuman", "racial slur",
      "build a bomb", "make ricin", "synthesise vx",
      "ransomware payload", "wiper script"
  ]);
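  CopilotActivity
  | where TimeGenerated > ago(1h)
  // Pull the safety fields out of the dynamic LLMEventData column.
  | extend
      SafetyVerdict = tostring(LLMEventData.SafetyVerdict),
      SafetyCategories = tostring(LLMEventData.SafetyCategories),
      Response = tostring(LLMEventData.Response),
      ConversationId = tostring(LLMEventData.ConversationId)
  // has_any is already case-insensitive, so the lower-cased copy is redundant but harmless.
  | extend LowerResponse = tolower(Response)
  // Match on any of: classifier verdict, classifier category, or a hard-coded phrase.
  | where SafetyVerdict in~ ("blocked", "flagged", "unsafe")
      or SafetyCategories has_any (toxicCategories)
      or LowerResponse has_any (toxicMarkers)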
  | summarize
      UnsafeOutputCount = count(),
      Verdicts = make_set(SafetyVerdict, 16),
      Categories = make_set(SafetyCategories, 16),
      Conversations = make_set(ConversationId, 16),
      SampleResponses = make_set(Response, 8),
      ClientIPs = make_set(SrcIpAddr, 16),
      FirstSeen = min(TimeGenerated),
      LastSeen = max(TimeGenerated)
      by AgentId, AgentName, ActorName, TenantId
  | extend SrcIpAddr = tostring(ClientIPs[0])
entityMappings:
- entityType: CloudApplication
  fieldMappings:
  - identifier: Name
    columnName: AgentName
  - identifier: AppId
    columnName: AgentId
- entityType: Account
  fieldMappings:
  - identifier: Name
    columnName: ActorName
- entityType: IP
  fieldMappings:
  - identifier: Address
    columnName: SrcIpAddr
eventGroupingSettings:
  aggregationKind: SingleAlert
incidentConfiguration:
  createIncident: true
  groupingConfiguration:
    enabled: true
    reopenClosedIncident: false
    lookbackDuration: PT5H
    matchingMethod: Selected
    groupByEntities:
    - CloudApplication
    groupByAlertDetails: []
    groupByCustomDetails: []
version: 1.0.0
kind: Scheduled
tags:
- Sentinel-As-Code
- Custom
- Copilot
- AI

Explanation

This query is designed to monitor and detect potentially harmful or toxic outputs from Microsoft 365 Copilot, a tool that uses AI to assist users. Here's a simplified breakdown of what the query does:

  1. Purpose: The query aims to identify responses from Microsoft 365 Copilot that are flagged as unsafe or contain toxic language, such as hate speech, sexual content, self-harm, violence, weapons, malware, harassment, or extremism.

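  2. Detection: A response matches if any one of three conditions holds: its safety verdict is "blocked," "flagged," or "unsafe" (compared case-insensitively); its safety categories include one of the listed categories; or its text contains one of a short list of hard-coded phrases (e.g., "kill yourself," "build a bomb").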

  3. Data Collection: The query looks at activity from the past hour and, for each group of matches, collects the number of unsafe outputs, the distinct safety verdicts, categories, conversation IDs, and client IP addresses (up to 16 values each), and up to 8 sample responses.

  4. Analysis: It groups the matches by agent (AgentId and AgentName), user (ActorName), and tenant (TenantId), and records the first and last time an unsafe output was seen in each group.

  5. Alerting: If the query returns any rows, each run generates a single alert covering all of them (aggregationKind: SingleAlert). Incidents are created from these alerts and grouped by the cloud application involved.

  6. Configuration: The query runs every 15 minutes and looks back over the past hour, so consecutive runs overlap and the same event can match in up to four runs. The rule is enabled by default and is of kind Scheduled.

  7. Entity Mapping: AgentName and AgentId map to a CloudApplication entity, ActorName maps to an Account entity, and the first collected client IP address maps to an IP entity, so that incidents can be linked to the agent, user, and address involved.

  8. Incident Management: Every alert creates an incident. Alerts that share the same CloudApplication entity within a 5-hour lookback are grouped into one incident, and closed incidents are not reopened.
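
The three detection conditions in step 2 can be tried in isolation. The following is a minimal sketch, not the rule itself: the rows in the datatable are hypothetical sample data, the lists are shortened copies of the rule's lists, and the Matched* column names are invented for illustration.

```kql
// Shortened copies of the rule's lists, for illustration only.
let toxicCategories = dynamic(["hate", "self-harm", "malware"]);
let toxicMarkers = dynamic(["kill yourself", "ransomware payload"]);
// Hypothetical sample rows; real CopilotActivity records carry these values inside LLMEventData.
datatable(Id:int, SafetyVerdict:string, SafetyCategories:string, Response:string)
[
    1, "Flagged", "",         "Here is the summary you asked for.",
    2, "safe",    '["hate"]', "(redacted)",
    3, "safe",    "",         "Step 1: compile the Ransomware Payload",
    4, "safe",    "",         "Your meeting is at 10:00."
]
// Evaluate each condition separately so it is visible which one fired.
| extend
    MatchedVerdict  = SafetyVerdict in~ ("blocked", "flagged", "unsafe"),
    MatchedCategory = SafetyCategories has_any (toxicCategories),
    MatchedMarker   = Response has_any (toxicMarkers)
// Same OR logic as the rule: any single condition is enough.
| where MatchedVerdict or MatchedCategory or MatchedMarker
| project Id, MatchedVerdict, MatchedCategory, MatchedMarker
```

Splitting the conditions into separate columns like this can also help when tuning the rule, since it shows whether matches come from the platform classifier or from the hard-coded phrase list.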

Overall, this query acts as a safety net that alerts on potentially harmful content generated by Microsoft 365 Copilot, so that a jailbreak, an unsafe system prompt, model drift, or RAG poisoning can be investigated even if content filters in the calling app fail open.
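
The rule description suggests triaging by pivoting to the rest of the conversation. A minimal sketch of that pivot follows, assuming the same CopilotActivity columns the rule uses. The conversation ID is a placeholder to be replaced with a value from the alert's Conversations set, and the one-day window is an assumption. Because the names of the prompt and tool fields inside LLMEventData do not appear in the rule, the whole column is projected rather than guessing at them.

```kql
// Placeholder: paste a value from the alert's Conversations set.
let targetConversation = "<conversation-id-from-alert>";
CopilotActivity
| where TimeGenerated > ago(1d)   // assumed window; widen it if the conversation is older
| where tostring(LLMEventData.ConversationId) == targetConversation
// Keep the full LLMEventData payload so prompts and tool calls can be inspected in order.
| project TimeGenerated, ActorName, AgentName, SrcIpAddr, LLMEventData
| order by TimeGenerated asc
```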

Details


David Alonso

Released: May 20, 2026

Tables

CopilotActivity

Keywords

Microsoft, Copilot, Activity, SafetyVerdict, SafetyCategories, Response, Conversation, Agent, Actor, Tenant, ClientIP, TimeGenerated, CloudApplication, Account, IP, Address, Alert, Incident, SentinelAsCode, Custom, AI

Operators

let, dynamic, tostring, tolower, ago, in~, has_any, summarize, count, make_set, min, max, extend, where
