Agent - Content-safety filter hits (hate / sexual / violence / self-harm)

Agent Content Safety Filter Hits

Query

let window = 1d;
AppDependencies
| where TimeGenerated > ago(window)
| where isnotempty(Properties["microsoft.foundry.content_filter.results"])
| extend
    Agent      = tostring(Properties["gen_ai.agent.name"]),
    Model      = tostring(Properties["gen_ai.request.model"]),
    ConvId     = tostring(Properties["gen_ai.conversation.id"]),
    ProjectId  = tostring(Properties["microsoft.foundry.project.id"]),
    Prompt     = tostring(Properties["gen_ai.input.messages"]),
    Response   = tostring(Properties["gen_ai.output.messages"]),
    FilterArr  = todynamic(tostring(Properties["microsoft.foundry.content_filter.results"]))
| mv-expand Entry = FilterArr
| extend
    SourceType = tostring(Entry.source_type),
    Blocked    = tobool(Entry.blocked),
    Filter     = todynamic(Entry.content_filter_results)
| extend
    HateSeverity      = tostring(Filter.hate.severity),
    HateFiltered      = tobool(Filter.hate.filtered),
    SexualSeverity    = tostring(Filter.sexual.severity),
    SexualFiltered    = tobool(Filter.sexual.filtered),
    ViolenceSeverity  = tostring(Filter.violence.severity),
    ViolenceFiltered  = tobool(Filter.violence.filtered),
    SelfHarmSeverity  = tostring(Filter.self_harm.severity),
    SelfHarmFiltered  = tobool(Filter.self_harm.filtered)
| where HateFiltered or SexualFiltered or ViolenceFiltered or SelfHarmFiltered
     or HateSeverity in ("low","medium","high")
     or SexualSeverity in ("low","medium","high")
     or ViolenceSeverity in ("low","medium","high")
     or SelfHarmSeverity in ("low","medium","high")
| project
    TimeGenerated, Agent, Model, ProjectId, ConvId,
    HateSeverity, SexualSeverity, ViolenceSeverity, SelfHarmSeverity,
    Prompt, Response
| order by TimeGenerated desc

Explanation

This query is designed to monitor and identify instances where an Azure AI Content Safety filter has flagged content as potentially harmful in four categories: hate, sexual content, violence, and self-harm. Here's a simplified breakdown of what the query does:

Time Frame: It looks at data from the past day (1d).
Data Source: It examines records from AppDependencies where there is a non-empty result from the content filter.
Data Extraction: For each relevant record, it extracts details such as the agent name, model used, conversation ID, project ID, and the input and output messages.
Content Filter Results: It processes the content filter results to determine if any of the four harm categories were flagged as filtered or have a severity level of low, medium, or high.
Filtering: The query keeps only the entries where at least one category (hate, sexual, violence, self-harm) was marked as filtered or has a severity of low, medium, or high; all other entries are dropped.
Output: It projects the time the event was generated, the agent name, model, project ID, conversation ID, the severity of each category, and the prompt and response messages. SourceType and Blocked are extracted but are not part of the output.
Sorting: The results are sorted by the time they were generated, in descending order.
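
The parsing and filtering steps above can be tried without real telemetry. The following is a minimal sketch, not part of the published query: it replaces AppDependencies with a hypothetical two-row datatable, and the JSON payloads are invented to match the field names the query reads (source_type, blocked, content_filter_results). The real payload shape may differ.

```kql
// Hypothetical sample rows standing in for AppDependencies; all values are invented.
datatable(TimeGenerated: datetime, Agent: string, FilterJson: string)
[
    datetime(2026-06-08 10:00:00), "agent-a",
        '[{"source_type":"prompt","blocked":true,"content_filter_results":{"hate":{"severity":"safe","filtered":false},"violence":{"severity":"medium","filtered":true}}}]',
    datetime(2026-06-08 10:05:00), "agent-b",
        '[{"source_type":"completion","blocked":false,"content_filter_results":{"hate":{"severity":"safe","filtered":false},"violence":{"severity":"safe","filtered":false}}}]'
]
// Parse the JSON string into a dynamic array, then emit one row per array entry.
| extend FilterArr = todynamic(FilterJson)
| mv-expand Entry = FilterArr
| extend Filter = Entry.content_filter_results
| extend
    HateSeverity     = tostring(Filter.hate.severity),
    ViolenceSeverity = tostring(Filter.violence.severity),
    ViolenceFiltered = tobool(Filter.violence.filtered)
// Keep an entry if it was filtered or carries a low/medium/high severity.
| where ViolenceFiltered
     or HateSeverity in ("low", "medium", "high")
     or ViolenceSeverity in ("low", "medium", "high")
| project TimeGenerated, Agent, HateSeverity, ViolenceSeverity
```

With these sample rows, only the agent-a entry should pass the filter, because the agent-b entry is neither filtered nor rated low, medium, or high. Only two categories are shown to keep the sketch short.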

This query helps analysts quickly identify and triage potentially harmful content interactions by providing detailed information about the flagged content and its context.
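
For triage at a higher level, one possible variant aggregates the same data instead of listing individual rows. This is a sketch under the same assumptions as the query above (the same Properties keys and payload shape); the output column names Entries, BlockedEntries, and the per-category hit counts are illustrative and do not come from the original.

```kql
// Sketch: count filter entries per agent and source type over the last day.
AppDependencies
| where TimeGenerated > ago(1d)
| where isnotempty(Properties["microsoft.foundry.content_filter.results"])
| extend
    Agent     = tostring(Properties["gen_ai.agent.name"]),
    FilterArr = todynamic(tostring(Properties["microsoft.foundry.content_filter.results"]))
| mv-expand Entry = FilterArr
| extend
    SourceType = tostring(Entry.source_type),
    Blocked    = tobool(Entry.blocked),
    Filter     = Entry.content_filter_results
| summarize
    Entries        = count(),                                    // all expanded entries, flagged or not
    BlockedEntries = countif(Blocked),
    HateHits       = countif(tobool(Filter.hate.filtered)),
    SexualHits     = countif(tobool(Filter.sexual.filtered)),
    ViolenceHits   = countif(tobool(Filter.violence.filtered)),
    SelfHarmHits   = countif(tobool(Filter.self_harm.filtered))
    by Agent, SourceType
| order by BlockedEntries desc
```

Unlike the detail query, this variant uses the SourceType and Blocked fields, so it can separate prompt-side hits from response-side hits if the payload distinguishes them.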

Details

David Alonso

Released: June 8, 2026

Tables

AppDependencies

Keywords

AppDependencies, Properties, Agent, Model, ConvId, ProjectId, Prompt, Response, FilterArr, Entry, SourceType, Blocked, Filter, HateSeverity, HateFiltered, SexualSeverity, SexualFiltered, ViolenceSeverity, ViolenceFiltered, SelfHarmSeverity, SelfHarmFiltered, TimeGenerated, Execution, Impact, Tactics, Techniques, Tags, SentinelAsCode, Custom, Foundry, AI, ContentSafety

Operators

let, ago, where, isnotempty, extend, tostring, todynamic, mv-expand, tobool, project, order by

Tactics

Execution, Impact

MITRE Techniques

T1059
