Safety & Evaluation Framework for AI Agents
Situation
Scope note: This piece is the agent-specific application of the enterprise AI Governance Framework. Portfolio-6 defines the organisation-wide policy — source-authority hierarchies, PII handling, the generic Safety Gate. This piece applies and extends those primitives to agentic behaviour: trajectory evaluation, tool-call scope gating, and red-teaming of multi-turn conversations.
After the AI Customer Service Agent moved into pilot, a new class of failure emerged — one that standard NLP evaluation wasn’t designed to catch.
Component-level metrics looked healthy. The intent router was accurate. The knowledge engine retrieved correctly. The LLM generated fluent, grounded text. But end-to-end, the agent was failing in ways that only showed up across multiple turns: it gave a correct policy answer in turn 1, then contradicted it in turn 3 after the customer rephrased. It successfully deflected a refund request — but left the customer with no clear path forward, generating a repeat contact within 24 hours. In one case, a subtle prompt injection in a customer message caused the agent to ignore its system instructions and hallucinate a cancellation confirmation.
These failures had something in common: they were behavioural, not generative. You can’t catch them by scoring a single output against a reference. You need evaluation that understands what the agent is supposed to do, tracks what it actually did across a full trajectory, and distinguishes safe failure (graceful escalation) from unsafe failure (confident misinformation).
The challenge was building an evaluation and safety framework purpose-built for agentic behaviour — one that could run continuously in production, not just as an offline audit.
Task
Design and implement an end-to-end safety and evaluation framework for the customer service agent covering:
- Behavioural evaluation across multi-turn conversation trajectories
- Safety gates at input, output, and tool-call levels
- Adversarial robustness — prompt injection, jailbreak, policy bypass attempts
- Red-teaming methodology for systematic failure discovery
- A continuous monitoring layer that surfaces regressions before customers encounter them
Action
1. Evaluation architecture: from component metrics to trajectory scoring
The core insight was that agent evaluation requires a different unit of analysis. For single-turn NLP tasks, the unit is an (input, output) pair. For agents, the unit is a trajectory — the full sequence of turns, tool calls, memory accesses, and outputs that constitute a conversation.
graph TD
subgraph Trajectory Evaluation
T[Full conversation trajectory]
T --> CE[Component evals per turn]
T --> BE[Behavioural evals across turns]
T --> SE[Safety evals at each decision point]
end
subgraph Component Evals - Per Turn
CE --> IR[Intent classification accuracy]
CE --> KR[Knowledge retrieval correctness]
CE --> GQ[Generation quality - NLI + BERTScore]
CE --> LC[Latency + cost per turn]
end
subgraph Behavioural Evals - Trajectory Level
BE --> CC[Consistency - does the agent contradict itself?]
BE --> CP[Completeness - did it fully resolve the query?]
BE --> ES[Escalation appropriateness]
BE --> TC[Task completion rate]
end
subgraph Safety Evals - Decision Level
SE --> PI[Prompt injection detection]
SE --> PB[Policy boundary compliance]
SE --> PII[PII leak prevention]
SE --> SC[Source citation correctness]
end
style T fill:#264653,color:#fff
style CE fill:#2a9d8f,color:#fff
style BE fill:#e76f51,color:#fff
style SE fill:#9b2226,color:#fff
This distinction — component vs behavioural vs safety evaluation — drove every downstream design decision. Component evals could reuse existing NLP infrastructure. Behavioural and safety evals required new tooling.
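To make the unit of analysis concrete, here is a minimal sketch of a trajectory record; class and field names are illustrative, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """One tool invocation proposed by the orchestrator."""
    name: str            # e.g. "get_policy"
    arguments: dict
    permitted: bool      # verdict of the tool-call gate

@dataclass
class Turn:
    """One customer/agent exchange plus the decisions behind it."""
    customer_message: str
    detected_intent: str
    tool_calls: list[ToolCall]
    retrieved_context: list[str]
    agent_response: str
    confidence: float

@dataclass
class Trajectory:
    """The unit of analysis: a whole conversation, not a single turn."""
    conversation_id: str
    turns: list[Turn] = field(default_factory=list)
    escalated: bool = False
    escalation_category: str | None = None   # e.g. "knowledge_gap"
```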
2. Behavioural evaluation suite
2a. Consistency scoring across turns
The most common multi-turn failure: the agent gives a different answer to the same question when it is rephrased. This is invisible to per-turn metrics but directly damages customer trust.
Implementation: For a sampled set of trajectories, we automatically generate paraphrase variants of key questions (using GPT-4o with a paraphrase prompt) and replay them into the agent at the same conversation state. We then apply NLI entailment between the two responses: if the answers do not mutually entail each other (entailment score below threshold), the trajectory is flagged as a potential contradiction.
This extends the NLI-based hallucination detection from single-turn to cross-turn comparison. The same DeBERTa-v3-large model is reused; the input changes from (source, rewrite) to (response at turn N, response at turn N+k for the same underlying question).
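A minimal sketch of the cross-turn check, assuming the team's fine-tuned DeBERTa-v3 NLI checkpoint is available under a placeholder name and using an illustrative entailment threshold:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder name; in practice this is the fine-tuned DeBERTa-v3-large NLI checkpoint.
MODEL_NAME = "deberta-v3-large-nli-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that `hypothesis` is entailed by `premise`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Look up the entailment class from the model config; label names depend on the checkpoint.
    label_to_idx = {v.lower(): k for k, v in model.config.id2label.items()}
    return probs[label_to_idx["entailment"]].item()

def is_consistent(answer_turn_n: str, answer_turn_n_plus_k: str, threshold: float = 0.7) -> bool:
    """Flag the pair if the two answers to the same question do not entail each other."""
    forward = entailment_score(answer_turn_n, answer_turn_n_plus_k)
    backward = entailment_score(answer_turn_n_plus_k, answer_turn_n)
    return min(forward, backward) >= threshold   # threshold is illustrative
```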
2b. Task completion rate
Did the agent actually resolve what the customer came with? CSAT correlates poorly with task completion on support queries — a customer can rate an interaction 5/5 while still needing to call back within 24 hours.
Implementation: We define task completion operationally per intent category:
| Intent | Completion signal | Failure signal |
|---|---|---|
| Policy inquiry | Customer does not ask the same question again within 24h | Repeat contact on same topic |
| Booking change | Booking state changes in the system within 1h | No state change + escalation |
| Cancellation request | Cancellation initiated or customer explicitly declines | Customer contacts human within 2h |
| Post-trip issue | Resolution logged in CRM within 48h | Open issue after 48h |
Repeat contact rate within 24 hours became the primary lagging metric for task completion — it captures failures that CSAT misses and is calculable without human annotation.
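A minimal sketch of how the 24-hour repeat contact rate can be derived from contact logs, assuming a hypothetical schema with `customer_id`, `topic`, `timestamp`, and `handled_by_agent` columns:

```python
from datetime import timedelta
import pandas as pd

def repeat_contact_rate_24h(contacts: pd.DataFrame) -> float:
    """
    Fraction of agent-handled contacts followed by another contact from the same
    customer on the same topic within 24 hours. Assumed columns:
    customer_id, topic, timestamp (tz-aware), handled_by_agent (bool).
    """
    contacts = contacts.sort_values("timestamp")
    agent_contacts = contacts[contacts["handled_by_agent"]]
    repeats = 0
    for _, row in agent_contacts.iterrows():
        follow_ups = contacts[
            (contacts["customer_id"] == row["customer_id"])
            & (contacts["topic"] == row["topic"])
            & (contacts["timestamp"] > row["timestamp"])
            & (contacts["timestamp"] <= row["timestamp"] + timedelta(hours=24))
        ]
        repeats += int(len(follow_ups) > 0)
    return repeats / max(len(agent_contacts), 1)
```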
2c. Escalation quality
Not all escalations are equal. An escalation on a genuinely complex multi-policy refund dispute is appropriate behaviour. An escalation because the agent failed to find a FAQ that was in the knowledge graph is a system failure. Conflating them inflates the escalation rate and hides capability gaps.
Implementation: Each escalation is automatically categorised at handoff time:
- Knowledge gap — the required information was not in the knowledge graph
- Confidence failure — information existed but generation confidence was below threshold
- Policy complexity — multi-condition policy that requires human judgement
- Sentiment trigger — customer expressed frustration; escalated for relationship reasons
- Explicit request — customer asked for a human
Only knowledge gap and confidence failure escalations count against agent performance. Sentiment and explicit-request escalations are evaluated separately — a high rate here is a signal to improve tone, not knowledge coverage.
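A minimal rule-based sketch of the handoff-time categorisation, assuming hypothetical fields on the escalation event; the production version reads orchestrator and gate metadata directly:

```python
def categorise_escalation(event: dict) -> str:
    """Assign one category per escalation at handoff time. Field names are illustrative."""
    if event.get("customer_requested_human"):
        return "explicit_request"
    if event.get("sentiment_triggered"):
        return "sentiment_trigger"
    if event.get("policy_conditions_matched", 0) > 1:
        return "policy_complexity"
    if not event.get("retrieval_hit"):
        return "knowledge_gap"          # counts against agent performance
    if event.get("generation_confidence", 1.0) < event.get("confidence_threshold", 0.7):
        return "confidence_failure"     # counts against agent performance
    return "unclassified"               # routed to human review
```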
3. Safety gate architecture
The safety layer operates at three points in the agent loop: input validation, tool-call validation, and output validation. Each gate is independent and can block, modify, or escalate.
sequenceDiagram
participant C as Customer input
participant IG as Input Gate
participant AO as Orchestrator
participant TG as Tool-call Gate
participant KE as Knowledge Engine
participant LLM as Response LLM
participant OG as Output Gate
participant R as Response / Escalation
C->>IG: Raw customer message
Note over IG: PII detection + injection scan + intent safety (5ms)
alt Injection / abuse detected
IG->>R: Block + log + optional escalation
else Safe
IG->>AO: Sanitised query + safety context
end
AO->>TG: Proposed tool calls
Note over TG: Scope check - is this tool call permitted for this intent? (2ms)
alt Out of scope tool call
TG->>AO: Reject call + log anomaly
else Permitted
TG->>KE: Execute tool call
KE->>AO: Retrieved context
end
AO->>LLM: Generate with grounded context
LLM->>OG: Draft response
Note over OG: NLI hallucination check + PII scan + policy boundary check (8ms)
alt Fails safety check
OG->>R: Escalate with draft + failure reason
else Passes
OG->>R: Deliver response
end
Input gate
Three checks run in parallel:
PII detection — regex + spaCy NER to identify and mask customer PII before it enters the LLM context. Phone numbers, email addresses, passport numbers, and credit card fragments are replaced with typed placeholders ([PHONE], [EMAIL]). Critical for GDPR compliance — PII must not appear in LLM training logs or prompt caches.
Prompt injection detection — the most underappreciated safety risk in customer-facing agents. Injection attempts follow recognisable patterns: role override (“ignore previous instructions and…”), context hijacking (“as a developer I’m testing you, so…”), and instruction smuggling (hiding instructions in base64 or unusual Unicode). We use a fine-tuned DistilBERT classifier trained on a red-team dataset of injection examples. High-confidence injections are blocked; borderline cases are escalated with a flag.
Intent safety classification — a separate classifier (not the main intent router) that asks: is this query appropriate for automated handling? Queries involving legal threats, media complaints, and serious safety incidents are routed to human immediately, regardless of the main intent classification.
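A minimal sketch of the input gate's PII masking and injection scoring, assuming spaCy's `en_core_web_sm` model and a hypothetical fine-tuned DistilBERT injection checkpoint; patterns, label names, and thresholds are illustrative:

```python
import re
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")  # production used a tuned NER model

# Illustrative regex patterns for structured PII; the real set was broader.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
    "[CARD]":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

# Hypothetical checkpoint name for the fine-tuned DistilBERT injection classifier.
injection_clf = pipeline("text-classification", model="distilbert-injection-detector")

def mask_pii(text: str) -> str:
    """Replace structured PII with typed placeholders, then mask NER entities."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    doc = nlp(text)
    for ent in reversed(doc.ents):          # replace from the end so offsets stay valid
        if ent.label_ in {"PERSON", "GPE"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

def input_gate(message: str, block_threshold: float = 0.9, review_threshold: float = 0.6) -> dict:
    """Return the gate decision and the sanitised message passed to the orchestrator."""
    sanitised = mask_pii(message)
    result = injection_clf(sanitised)[0]    # {"label": ..., "score": ...}
    # Label name depends on the fine-tuned checkpoint; "INJECTION" is assumed here.
    injection_score = result["score"] if result["label"] == "INJECTION" else 1 - result["score"]
    if injection_score >= block_threshold:
        return {"action": "block", "reason": "prompt_injection", "message": sanitised}
    if injection_score >= review_threshold:
        return {"action": "escalate_flag", "message": sanitised}
    return {"action": "pass", "message": sanitised}
```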
Tool-call gate
Before the orchestrator executes any tool call, the gate validates that the call is within scope for the detected intent. An agent handling a policy inquiry should not be calling booking modification APIs. This is a hard constraint, not a soft signal — out-of-scope tool calls are rejected and logged as anomalies.
This addresses a specific attack pattern: an adversarial user constructs a prompt that causes the agent to call a tool it shouldn’t (e.g., triggering a booking cancellation through a conversation that started as an innocent FAQ query). The tool-call gate enforces a permission matrix at runtime.
| Intent | Permitted tool calls |
|---|---|
| Policy inquiry | get_policy, get_faq, get_product_info |
| Booking management | above + get_booking, modify_booking (read-only modification check only) |
| Post-trip issue | above + get_booking_history, create_case |
| General browsing | get_product_info, get_tips, search_products |
| Complaint / escalation | create_case, escalate only |
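A minimal sketch of the runtime permission check, using the tool names from the table above and applying the intersection rule adopted after the red-team finding in section 4; everything else is illustrative:

```python
import logging

logger = logging.getLogger("tool_call_gate")

# Permission matrix from the table above; "above +" rows are expanded explicitly.
PERMITTED_TOOLS = {
    "policy_inquiry": {"get_policy", "get_faq", "get_product_info"},
    "booking_management": {"get_policy", "get_faq", "get_product_info",
                           "get_booking", "modify_booking"},
    "post_trip_issue": {"get_policy", "get_faq", "get_product_info",
                        "get_booking", "modify_booking",
                        "get_booking_history", "create_case"},
    "general_browsing": {"get_product_info", "get_tips", "search_products"},
    "complaint_escalation": {"create_case", "escalate"},
}

def permitted_tools(intents: list[str]) -> set[str]:
    """When several intents fire, use the intersection of their tool sets, not the union."""
    sets = [PERMITTED_TOOLS.get(i, set()) for i in intents]
    if not sets:
        return set()
    allowed = sets[0]
    for s in sets[1:]:
        allowed &= s
    return allowed

def tool_call_gate(intents: list[str], tool_name: str) -> bool:
    """Hard constraint: reject and log any tool call outside the permitted set."""
    allowed = tool_name in permitted_tools(intents)
    if not allowed:
        logger.warning("out-of-scope tool call: intents=%s tool=%s", intents, tool_name)
    return allowed
```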
Output gate
Three checks on the generated response before delivery:
NLI hallucination check — every factual claim in the response is checked for entailment against the retrieved context using the same NLI pipeline used in the content rewrite system. Claims that are not entailed by retrieved sources are flagged. If the flagged claim is in a critical section (policy, pricing, booking terms), the response is escalated rather than delivered.
PII egress scan — a second PII check on the output. The concern here is different from the input check: the LLM might inadvertently reproduce customer-specific data from context (e.g., repeating a booking number from memory) in a response that will be logged. All PII in outputs is masked before logging.
Policy boundary check — a rule-based validator that checks whether the response makes any commitments outside the agent’s authority (e.g., promising a refund when the agent is only permitted to initiate a refund request, not approve one). These commitments are pattern-matched against a policy commitment registry.
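A minimal sketch of the policy boundary check, assuming a hypothetical commitment registry of regex patterns; the hallucination and PII checks reuse the NLI and masking components sketched earlier:

```python
import re

# Hypothetical commitment registry: commitments the agent has no authority to make.
COMMITMENT_REGISTRY = [
    (re.compile(r"\byour refund (has been|is) approved\b", re.I), "refund_approval"),
    (re.compile(r"\bI have cancell?ed your booking\b", re.I), "booking_cancellation"),
    (re.compile(r"\bwe will waive (the|all) fees\b", re.I), "fee_waiver"),
]

def policy_boundary_violations(response: str) -> list[str]:
    """Return the commitment types the draft response makes outside the agent's authority."""
    return [name for pattern, name in COMMITMENT_REGISTRY if pattern.search(response)]

def output_gate_policy_check(response: str) -> dict:
    """Escalate rather than deliver if the draft commits beyond the agent's authority."""
    violations = policy_boundary_violations(response)
    if violations:
        return {"action": "escalate", "reason": "policy_boundary", "violations": violations}
    return {"action": "deliver"}
```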
4. Adversarial robustness & red-teaming
Automated safety gates catch known failure patterns. Red-teaming discovers unknown ones.
Red-team methodology:
We ran quarterly red-team exercises with a team of three (one ML engineer, one support domain expert, one external contractor with no product knowledge). Each exercise had a defined scope — one of: prompt injection, policy bypass, PII extraction, escalation avoidance, or cross-turn consistency attack.
graph LR
subgraph Red Team Process
S[Define scope + success criteria]
S --> A[Attack generation - manual + LLM-assisted]
A --> E[Execute against agent - 200 attempts per scope]
E --> T[Triage - success rate + severity classification]
T --> M[Mitigations - gate rule / classifier update / prompt change]
M --> V[Validation - rerun attacks post-mitigation]
V --> D[Document + add to regression suite]
end
style S fill:#264653,color:#fff
style T fill:#9b2226,color:#fff
style M fill:#2a9d8f,color:#fff
LLM-assisted attack generation — we used GPT-4o with a red-team system prompt to generate novel attack variants at scale. Human red-teamers focused on domain-specific attacks (exploiting Viator-specific policies and booking edge cases) that an LLM without product context wouldn’t generate. The combination consistently outperformed either approach alone.
Severity classification — not all successful attacks are equally serious:
| Severity | Definition | Example |
|---|---|---|
| Critical | Agent takes a harmful action or leaks sensitive data | Confirms a booking cancellation it has no authority to make |
| High | Agent gives materially incorrect information confidently | States wrong cancellation window with no hedge |
| Medium | Agent is manipulated but fails safely | Attempted injection causes escalation rather than the intended action |
| Low | Agent behaves unexpectedly but without harm | Unusual phrasing causes verbose response |
Critical and High findings triggered immediate hotfixes. Medium and Low went into the quarterly improvement backlog.
Key findings from red-team exercises:
Prompt injection via multilingual encoding — injections in non-Latin scripts (Arabic, Chinese) bypassed the initial injection classifier, which had been trained predominantly on English examples. Fix: retrained on multilingual injection dataset; added Unicode normalisation at the input gate.
Cross-turn state poisoning — an attacker could establish a false premise in an early turn (“I already spoke to an agent who approved my refund”) and the agent would carry this premise forward in later turns without re-validating. Fix: added a claim verification step that flags user-asserted facts about prior agent commitments and escalates for human verification.
Tool-call scope creep via intent confusion — a carefully crafted query could simultaneously satisfy two intent classifiers (policy inquiry + booking management), granting access to a broader tool set than either intent alone. Fix: revised the tool-call gate to use the intersection of permitted tools when multiple intents fire, not the union.
5. Continuous monitoring in production
Red-teaming and offline evaluation catch failures before deployment. Production monitoring catches the ones that slip through and the new failure modes that emerge as customer behaviour evolves.
Monitoring stack:
graph TB
subgraph Data Collection
L[LLM Gateway logs - every turn]
S[Safety gate decisions - every turn]
E[Escalation metadata - every handoff]
C[CSAT + repeat contact - lagging signals]
end
subgraph Real-time Alerts - under 5 min
L --> HR[Hallucination rate spike]
S --> IR[Injection attempt spike]
S --> ER[Escalation rate spike]
L --> LT[Latency P95 breach]
end
subgraph Daily Dashboards
L --> TC[Task completion by intent]
E --> EQ[Escalation quality breakdown]
L --> CC[Cross-turn consistency sample]
C --> CS[CSAT vs automated resolution rate]
end
subgraph Weekly Human Audit
HR --> HA[100 random conversations rated on 4 dimensions]
EQ --> HA
HA --> KA[Cohen kappa inter-annotator agreement]
end
subgraph Monthly Drift Detection
L --> MD[Intent distribution shift]
L --> TD[Topic drift - new query categories emerging]
L --> CD[Confidence calibration drift]
end
style HR fill:#9b2226,color:#fff
style IR fill:#9b2226,color:#fff
style ER fill:#9b2226,color:#fff
style LT fill:#9b2226,color:#fff
Confidence calibration monitoring — the agent’s confidence scores are only useful if they are calibrated: a 0.85 confidence score should correspond to roughly 85% accuracy. Calibration drift is a subtle failure mode — the model becomes overconfident after a distribution shift, leading to fewer escalations but lower-quality automated responses. We track calibration using reliability diagrams on weekly audit samples and trigger recalibration when the expected calibration error rises above a defined threshold.
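A minimal sketch of the expected calibration error computed over a weekly audit sample, assuming each audited response carries a model confidence and a human correctness label; the bin count and recalibration threshold are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """
    ECE = sum over confidence bins of (bin weight) * |bin accuracy - bin mean confidence|.
    `confidences` in [0, 1]; `correct` is 0/1 human audit labels.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        accuracy = correct[mask].mean()
        avg_conf = confidences[mask].mean()
        ece += mask.mean() * abs(accuracy - avg_conf)
    return float(ece)

# Illustrative trigger: recalibrate when ECE drifts above 0.05 on the weekly sample.
RECALIBRATION_THRESHOLD = 0.05
```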
Intent distribution shift — as the product evolves and external events occur (weather, operator incidents, policy changes), the distribution of customer queries shifts. An intent category that was 5% of volume can spike to 20% after a high-profile cancellation event, overwhelming knowledge coverage in that category. We monitor intent distribution daily and pre-warm knowledge coverage for emerging categories when drift is detected.
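A minimal sketch of the daily drift check, scoring the shift with Jensen-Shannon distance (one reasonable choice among several) against an illustrative alert threshold:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def intent_drift(baseline_counts: dict[str, int], today_counts: dict[str, int]) -> float:
    """Jensen-Shannon distance between the baseline and today's intent distributions."""
    intents = sorted(set(baseline_counts) | set(today_counts))
    p = np.array([baseline_counts.get(i, 0) for i in intents], dtype=float)
    q = np.array([today_counts.get(i, 0) for i in intents], dtype=float)
    p = (p + 1) / (p + 1).sum()   # add-one smoothing to avoid empty bins
    q = (q + 1) / (q + 1).sum()
    return float(jensenshannon(p, q))

# Illustrative alerting rule: investigate and pre-warm knowledge coverage above 0.1.
DRIFT_ALERT_THRESHOLD = 0.1
```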
Results
Introducing behavioural evaluation alongside component metrics revealed that the agent’s component-level performance significantly overstated end-to-end quality. Several failure modes that were invisible at the per-turn level — cross-turn contradiction, unresolved query loops, escalation misclassification — were surfaced for the first time and systematically addressed.
The safety gate architecture substantially reduced the incidence of prompt injection reaching the LLM, and the tool-call permission matrix prevented the class of scope-creep attacks discovered in red-teaming. Post-mitigation, no Critical severity findings were recorded in subsequent red-team exercises.
Repeat contact rate within 24 hours proved to be a more reliable quality signal than CSAT, and redirected improvement effort toward task completion rather than response fluency — a meaningful shift in what the team was optimising for.
The escalation quality breakdown revealed that a significant share of escalations were knowledge gaps rather than agent failures — redirecting engineering effort from model improvement to knowledge graph expansion, which had a larger impact on deflection rate.
Evaluation Methodology Reference
| Evaluation type | Unit of analysis | Metric | Cadence |
|---|---|---|---|
| Component — generation quality | Single turn | NLI entailment, BERTScore | Continuous (sampled) |
| Component — retrieval correctness | Single turn | Precision@k vs labelled ground truth | Weekly |
| Behavioural — consistency | Trajectory (turn pairs) | NLI entailment cross-turn | Weekly (sampled) |
| Behavioural — task completion | Conversation | 24h repeat contact rate | Daily |
| Behavioural — escalation quality | Escalation event | Category breakdown | Daily |
| Safety — input gate | Every turn | Injection block rate, PII detection F1 | Continuous |
| Safety — output gate | Every turn | Hallucination flag rate, policy violation rate | Continuous |
| Safety — adversarial | Attack corpus | Success rate by severity | Quarterly red-team |
| Human audit | Random sample | Correctness, completeness, tone, efficiency | Weekly |
| Calibration | Confidence buckets | Expected calibration error | Weekly |
Key Design Decisions & Trade-offs
Why trajectory-level evaluation rather than aggregating per-turn scores? Averaging per-turn scores across a conversation hides the most damaging failure modes. A conversation that scores 0.9 per turn but contains one cross-turn contradiction in turn 4 has failed — the average masks the failure. Trajectory evaluation treats the conversation as the atomic unit.
Why 24h repeat contact rate as the primary task completion signal? It is causally linked to the outcome we care about (customer getting their issue resolved) and requires no human annotation. CSAT is easier to collect but measures customer sentiment, not resolution — a customer can appreciate a polite response while still being no closer to resolving their booking issue.
Why a tool-call permission gate rather than relying on the LLM to stay in scope? LLMs can be prompted to stay within scope, but this is a soft constraint that can be overridden by adversarial inputs. A hard permission gate at the infrastructure layer cannot be bypassed through prompt manipulation. Defence in depth: both the LLM prompt and the gate enforce scope, independently.
Why categorise escalations rather than minimise them? Minimising escalation rate is the wrong objective — it incentivises the agent to give low-quality automated responses rather than escalate appropriately. Categorising escalations lets you distinguish system failures (knowledge gap, confidence failure) from appropriate behaviour (policy complexity, sentiment escalation) and optimise the right thing.
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Evaluation set contamination — agent prompt has seen test cases | Red-team corpus held out from all training; quarterly rotation of attack scenarios |
| Calibration drift post model update | Reliability diagram monitoring; recalibration trigger on ECE increase |
| Safety gate false positives blocking legitimate queries | Per-gate precision/recall tracking; human review queue for borderline blocks |
| Red-team scope too narrow — misses real-world attack patterns | Production injection attempt logs reviewed before each red-team to seed attack variants |
| Monitoring alert fatigue | Tiered alert system: Critical pages on-call immediately; High queues for daily review; Medium/Low in weekly digest |
Cross-Portfolio Integration
| Subsystem | Portfolio piece | How it integrates |
|---|---|---|
| NLI hallucination detection | AI Evaluation Specialist — Content Quality | Same DeBERTa-v3-large model reused for cross-turn consistency scoring and output gate hallucination checks |
| Agent under evaluation | AI Customer Service Agent | This framework was designed specifically to evaluate the behaviours of that agent in production |
| Governance framework | Enterprise AI Governance | Safety gate architecture extends the governance framework’s source authority hierarchy and PII handling policies |
| Knowledge engine | Automated FAQ Extraction | Knowledge gap escalations feed back into FAQ extraction pipeline to expand coverage |