Situation

Viator’s first chatbot pilot failed within two weeks. The bot confidently told a customer their Rome Colosseum tour included hotel pickup — it didn’t. The hallucination happened because the bot retrieved a chat log from a different product (a Rome food tour that does include pickup) and treated it as ground truth. Three similar incidents followed, each generating a complaint escalation and a refund.

The root cause wasn’t the LLM. It was that answering a single customer question required stitching together data from 5+ disconnected systems: product attributes, supplier policies, booking records, chat history, and traveler reviews. No unified data layer existed, and the bot had no way to judge source reliability or know when it should escalate instead of guess.

The challenge was not building a chatbot. It was designing an agent architecture where every answer is grounded, every source is ranked, and the system knows what it doesn’t know.

Task

Design and architect an AI Customer Service Agent that:

  1. Resolves 60%+ of Tier-1 queries without human intervention
  2. Maintains under 1.5s P95 end-to-end response latency
  3. Keeps hallucination rate below 2% through grounded generation
  4. Provides graceful HITL escalation with full context handoff
  5. Operates within a cost budget of under $0.02 per conversation turn

Action

System Architecture

The agent is composed of five coordinated subsystems, each designed as an independent service with clear contracts:

```mermaid
graph TB
    subgraph Input Layer
        CQ[Customer Query] --> IR[Intent Router]
        CQ --> SG[Safety Gate]
    end

    subgraph Orchestration
        IR -->|intent + confidence| AO[Agent Orchestrator]
        SG -->|safe / blocked / escalate| AO
        AO -->|tool calls| TC[Tool Controller]
    end

    subgraph Knowledge and Data
        TC --> KE[Knowledge Engine - GraphRAG]
        TC --> PD[Product Data Service]
        TC --> TI[Traveler Intelligence - Tips and Reviews]
    end

    subgraph Memory
        AO --> SM[Session Memory - Conversation Buffer]
        AO --> UP[User Profile - Booking History]
        AO --> KC[Knowledge Cache - Redis]
    end

    subgraph Response
        AO --> RG[Response Generator - Grounded LLM]
        RG --> SG2[Safety Gate - Output Validation]
        SG2 -->|pass| CR[Customer Response]
        SG2 -->|fail| HE[HITL Escalation]
    end

    subgraph Human-in-the-Loop
        AO -->|low confidence| HE
        HE --> HA[Human Agent + Context Package]
    end

    style KE fill:#2d6a4f,color:#fff
    style TI fill:#2d6a4f,color:#fff
    style SG fill:#9b2226,color:#fff
    style SG2 fill:#9b2226,color:#fff
    style HE fill:#e76f51,color:#fff
    style AO fill:#264653,color:#fff
```

Color key: Green = existing portfolio systems (Knowledge Engine, Traveler Intelligence); Red = safety layer (Governance Framework); Orange = HITL; Blue = orchestration.


Subsystem 1: Intent Router

Classifies incoming queries into actionable intents to determine which tools the orchestrator should invoke.

| Intent Category | Example Query | Tools Invoked | % of Volume |
|---|---|---|---|
| Booking Management | “Can I change my tour date?” | User Profile, Product Data, Policy KB | 28% |
| Policy Inquiry | “What’s the cancellation policy?” | Knowledge Engine (GraphRAG) | 22% |
| Product Information | “Does the tour include lunch?” | Knowledge Engine, Product Data | 19% |
| Post-Trip Issue | “The guide didn’t show up” | User Profile, HITL Escalation | 15% |
| General / Browsing | “What’s fun to do in Rome?” | Traveler Intelligence, Product Data | 12% |
| Complaint / Escalation | “I want a refund now” | HITL Escalation (immediate) | 4% |

Implementation: Fine-tuned DeBERTa-v3-base on 15K labeled support transcripts. Multi-label output (primary + secondary intent) with confidence scores. Queries with confidence below 0.7 are routed to a fallback LLM classifier before escalation; that classifier’s prompt is optimized with the hybrid APE-OPRO approach. A routing sketch follows the latency budget below.

Latency budget: 15ms (ONNX-optimized, CPU inference).
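
A minimal sketch of the routing decision, assuming an ONNX-exported checkpoint with standard input names; the model path, label ordering, and the fallback stub are illustrative, not the production code:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

INTENTS = ["booking_management", "policy_inquiry", "product_information",
           "post_trip_issue", "general_browsing", "complaint_escalation"]
CONFIDENCE_FLOOR = 0.7  # below this, defer to the LLM fallback classifier

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
session = ort.InferenceSession("intent_router.onnx")  # illustrative model path

def fallback_llm_classify(query: str) -> dict:
    # Stub for the APE-OPRO-optimized LLM classifier; the real one calls the
    # LLM gateway and routes to HITL escalation if it is also unsure.
    return {"primary": "escalate", "secondary": None, "confidence": 0.0}

def route_intent(query: str) -> dict:
    enc = tokenizer(query, return_tensors="np", truncation=True, max_length=128)
    # Input/output names depend on how the classifier was exported to ONNX
    (logits,) = session.run(None, {"input_ids": enc["input_ids"],
                                   "attention_mask": enc["attention_mask"]})
    probs = 1.0 / (1.0 + np.exp(-logits[0]))  # multi-label head: sigmoid, not softmax
    ranked = sorted(zip(INTENTS, probs), key=lambda pair: -pair[1])
    (primary, p1), (secondary, _) = ranked[0], ranked[1]
    if p1 < CONFIDENCE_FLOOR:
        return fallback_llm_classify(query)
    return {"primary": primary, "secondary": secondary, "confidence": float(p1)}
```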


Subsystem 2: Knowledge & Intelligence Layer

Two existing systems serve as the agent’s knowledge backbone — the GraphRAG Knowledge Engine for policies and FAQs, and the Traveler Tips extraction pipeline for experiential intelligence. The agent-specific integration decisions were:

  • Structured tool calls over free-text retrieval: The orchestrator issues get_policy(product_id, policy_type) and get_faqs(product_id, topic) rather than embedding-based search. This constrains the LLM to pre-validated knowledge and reduces hallucination surface area (a cache-first sketch follows this list)
  • Offline-first with cache: FAQs are pre-generated offline and served from Redis cache (0ms LLM latency) for ~92% of agent knowledge queries. The underlying FAQ system reports ~99% cache hit on FAQ lookups in isolation; the lower 92% figure here reflects the agent’s broader query mix, which includes novel, compound, and multi-hop questions that fall outside pre-generated FAQs. Only those queries require real-time generation — this is the single biggest cost and latency lever in the system
  • Source attribution required: Every agent response must cite its source tier (Gold: official policy, Silver: product data, Bronze: traveler tips). This is enforced in the response generation prompt and validated by the output safety gate
  • Tip separation: Traveler tips are never mixed with official information. They appear in a distinct section (“Based on recent traveler feedback…”) to prevent customers from treating crowd-sourced opinions as company policy
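
A cache-first sketch of the get_policy tool under those constraints; the Redis key layout and the knowledge_graph_lookup() stub are illustrative, not the shipped implementation:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def knowledge_graph_lookup(product_id: str, policy_type: str) -> str:
    # Stub for the GraphRAG traversal that resolves supplier-specific policies
    return f"Official {policy_type} policy for product {product_id}"

def get_policy(product_id: str, policy_type: str) -> dict:
    """Cache-first policy lookup with source-tier attribution attached."""
    key = f"policy:{product_id}:{policy_type}"
    cached = r.get(key)  # ~92% of agent knowledge queries resolve here
    if cached is not None:
        return json.loads(cached)
    answer = knowledge_graph_lookup(product_id, policy_type)
    payload = {"answer": answer, "source_tier": "gold"}  # Gold: official policy
    r.setex(key, 24 * 3600, json.dumps(payload))  # 24h TTL matches the Knowledge Cache tier
    return payload
```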

Subsystem 3: Memory Architecture

Three-tier memory system balancing context richness against latency and cost:

```mermaid
graph LR
    subgraph Tier 1 - Session Memory
        direction TB
        CB[Conversation Buffer - Last 10 turns]
        WM[Working Memory - Extracted entities]
    end

    subgraph Tier 2 - User Profile
        direction TB
        BH[Booking History - Active and past]
        CP[Customer Preferences - Language, style]
        IH[Interaction History - Past issues]
    end

    subgraph Tier 3 - Knowledge Cache
        direction TB
        PC[Product Cache - Attributes, pricing]
        PO[Policy Cache - Pre-resolved lookups]
        FC[FAQ Cache - Pre-generated answers]
    end

    CB -.->|summarize on overflow| BH
    WM -->|entity lookup| BH
    WM -->|product context| PC

    style CB fill:#264653,color:#fff
    style WM fill:#264653,color:#fff
    style BH fill:#2a9d8f,color:#fff
    style CP fill:#2a9d8f,color:#fff
    style IH fill:#2a9d8f,color:#fff
    style PC fill:#e9c46a,color:#000
    style PO fill:#e9c46a,color:#000
    style FC fill:#e9c46a,color:#000
```

| Tier | Storage | TTL | Latency | Content |
|---|---|---|---|---|
| Session Memory | In-process (LRU) | Session duration | sub-1ms | Last 10 turns + extracted entities |
| User Profile | DynamoDB | Persistent | 5-10ms | Booking history, preferences, past interactions |
| Knowledge Cache | Redis | 24 hours | 2-5ms | Pre-generated FAQs, product attributes, policies |

Context Window Management:

  • Sliding window of 10 turns with entity extraction (not raw text) for older turns
  • Summarization trigger: when the context exceeds 3,000 tokens, older turns are compressed to entity-relationship pairs
  • User profile injection: only relevant fields loaded (e.g., for a booking query, load active bookings; for a general query, load preferences only)
  • Total context budget: 4,000 tokens max to keep generation fast and focused (enforced by a trimming pass like the sketch below)
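
A minimal sketch of that trimming pass, with count_tokens and summarize_to_entities assumed to be supplied by the caller; the keep-the-last-3-turns split is an assumption of the sketch:

```python
MAX_WINDOW_TURNS = 10       # sliding window size
SUMMARIZE_TRIGGER = 3000    # compress older turns beyond this token count

def build_context(turns: list[dict], count_tokens, summarize_to_entities) -> list[dict]:
    """Apply the sliding window, then compress older turns if over budget."""
    window = turns[-MAX_WINDOW_TURNS:]
    if sum(count_tokens(t["text"]) for t in window) <= SUMMARIZE_TRIGGER:
        return window
    # Keep the most recent turns verbatim; reduce the rest to entity pairs
    older, recent = window[:-3], window[-3:]
    summary = {"role": "system",
               "text": "Earlier context (entities): " + summarize_to_entities(older)}
    return [summary] + recent
```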

Multi-Turn Conversation Handling: Single-turn Q&A is straightforward. The harder problem is multi-turn conversations where customers switch topics, refer back to earlier context, or build on previous answers. Design decisions:

  • Entity persistence: Extracted entities (product ID, booking number, destination) persist across turns in working memory, so “what about the cancellation policy?” resolves to the product discussed 3 turns ago
  • Topic switch detection: Intent router compares current turn intent against session history. When intent shifts (e.g., from policy inquiry to booking change), the orchestrator reloads relevant tools rather than carrying stale context
  • Clarification loops: When entity resolution is ambiguous (“my Rome tour” but the customer has 3 Rome bookings), the agent asks for clarification rather than guessing. Max 1 clarification question per conversation to avoid frustrating loops (resolution logic sketched below)
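
A sketch of the entity-persistence and single-clarification rules; the WorkingMemory shape and resolution order are assumptions, not the shipped logic:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    entities: dict = field(default_factory=dict)  # e.g. {"product_id": "12345"}
    clarifications_asked: int = 0

def resolve_product(memory: WorkingMemory, candidate_bookings: list[str]):
    """Resolve 'my Rome tour' to one booking, asking at most one question."""
    if "product_id" in memory.entities:           # persisted from an earlier turn
        return memory.entities["product_id"], None
    if len(candidate_bookings) == 1:
        memory.entities["product_id"] = candidate_bookings[0]
        return candidate_bookings[0], None
    if memory.clarifications_asked >= 1:          # max 1 clarification per conversation
        return None, "escalate"
    memory.clarifications_asked += 1
    return None, "Which booking did you mean: " + ", ".join(candidate_bookings) + "?"
```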

Multilingual Support: Viator operates globally; approximately 30% of support queries arrive in non-English languages. The agent handles this through:

  • Language detection at the Safety Gate stage (fastText lid.176, same as the Review Summarization pipeline)
  • Query translation to English before intent classification and knowledge retrieval (NLLB-200; see the sketch after this list)
  • Response generation in the detected language, with the LLM instructed to respond in the customer’s language
  • Known limitation: Translation adds ~50ms latency and introduces occasional errors for low-resource languages. For the top 5 languages (Spanish, French, German, Italian, Portuguese), translation quality is validated quarterly by native speakers
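
A sketch of the detect-then-translate step, assuming the fastText lid.176 model file is available locally and using the distilled NLLB-200 checkpoint as a stand-in for whichever variant is deployed; the ISO-to-FLORES mapping covers only the top 5 languages:

```python
import fasttext
from transformers import pipeline

lid = fasttext.load_model("lid.176.bin")  # fastText language-ID model
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

# fastText ISO codes -> NLLB FLORES-200 codes (top 5 languages only)
TO_FLORES = {"es": "spa_Latn", "fr": "fra_Latn", "de": "deu_Latn",
             "it": "ita_Latn", "pt": "por_Latn"}

def to_english(query: str) -> tuple[str, str]:
    """Detect the query language and translate to English for the pipeline."""
    labels, _ = lid.predict(query.replace("\n", " "))  # fastText rejects newlines
    lang = labels[0].replace("__label__", "")
    if lang == "en" or lang not in TO_FLORES:
        return query, lang  # English or unmapped language: pass through as-is
    out = translator(query, src_lang=TO_FLORES[lang], tgt_lang="eng_Latn")
    return out[0]["translation_text"], lang
```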

Subsystem 4: Agent Orchestrator & Tool Use

The orchestrator is the central coordinator. It receives classified intent + safety clearance, then executes a tool-use loop:

Query Lifecycle (Swimlane):

```mermaid
sequenceDiagram
    participant C as Customer
    participant SG as Safety Gate
    participant IR as Intent Router
    participant AO as Orchestrator
    participant M as Memory
    participant KE as Knowledge Engine
    participant TI as Traveler Intel
    participant LLM as Response LLM
    participant SG2 as Output Safety
    participant H as Human Agent

    C->>SG: Can I cancel my Rome Colosseum tour if it rains?
    Note over SG: PII scan + input validation (5ms)
    SG->>IR: Safe query passed
    Note over IR: intent=policy_inquiry, confidence=0.94 (15ms)
    IR->>AO: intent=policy_inquiry, product=Rome Colosseum

    par Parallel Tool Calls
        AO->>M: Load session + user profile
        Note over M: Booking VTR-8821 found (10ms)
        AO->>KE: get_policy(product_id, weather)
        Note over KE: Cache HIT: weather policy (15ms P95)
        AO->>TI: get_tips(product_id, weather)
        Note over TI: 2 relevant tips found (15ms)
    end

    AO->>LLM: Generate response with context
    Note over LLM: Grounded generation (~1200ms)

    LLM->>SG2: Draft response
    Note over SG2: Hallucination + PII check (8ms)

    alt Confidence above 0.85
        SG2->>C: Automated response with policy + tips
    else Confidence below 0.85
        SG2->>H: Escalate with full context
        H->>C: Human-assisted response
    end
```
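
The parallel tool-call block in the diagram could look like the following, with placeholder stubs standing in for the real memory, Knowledge Engine, and Traveler Intelligence clients:

```python
import asyncio

async def load_profile(user_id: str) -> dict:             # placeholder stubs so the
    return {"user_id": user_id, "bookings": []}           # sketch runs standalone
async def get_policy_async(product_id: str, topic: str) -> dict:
    return {"policy": f"{topic} policy for {product_id}", "source_tier": "gold"}
async def get_tips_async(product_id: str, topic: str) -> list[str]:
    return [f"Recent traveler tip about {topic}"]

async def gather_context(user_id: str, product_id: str, topic: str) -> dict:
    # Latency is bounded by the slowest call (~15ms), not the sum (~40ms)
    profile, policy, tips = await asyncio.gather(
        load_profile(user_id),                 # session memory + DynamoDB (~10ms)
        get_policy_async(product_id, topic),   # Redis-cached policy (~15ms P95)
        get_tips_async(product_id, topic),     # Elasticsearch tips (~15ms)
    )
    return {"profile": profile, "policy": policy, "tips": tips}

# Example: asyncio.run(gather_context("user-123", "prod-456", "weather"))
```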

Latency Budget Breakdown:

| Stage | Target (P95) | Strategy |
|---|---|---|
| Safety Gate (input) | 5ms | Regex + DistilBERT, batched |
| Intent Router | 15ms | ONNX DeBERTa on CPU |
| Memory Lookup | 10ms | DynamoDB + Redis |
| Knowledge Retrieval | 15ms | Redis cache (92% hit rate); Elasticsearch fallback |
| Traveler Tips | 15ms | Pre-indexed in Elasticsearch |
| Response Generation | 1,200ms | GPT-4o-mini with 4K-token context cap |
| Safety Gate (output) | 8ms | Same as input gate |
| Total | under 1,500ms | Parallel tool calls save ~25ms (serial max = 40ms, parallel max = 15ms) |

Subsystem 5: HITL Escalation & Feedback Loop

Not all queries should be automated. The escalation system ensures graceful handoff and feeds human resolutions back into the knowledge system:

Escalation Triggers:

| Trigger | Threshold | Action |
|---|---|---|
| Low intent confidence | Below 0.7 after fallback classifier | Route to human |
| Low response confidence | Below 0.85 generation confidence | Route to human |
| Sentiment detection | Anger/frustration score >0.8 | Route to human (priority queue) |
| Explicit request | “Talk to a person” | Immediate route |
| Policy complexity | Multi-policy conflict detected | Route to human |
| Financial impact | Refund >$500 or dispute | Route to human |

Context Handoff Package: When escalating, the agent passes a structured context package to the human agent (a typed sketch follows the list):

  • Conversation transcript (full)
  • Detected intent + confidence
  • Retrieved knowledge (policies, FAQs, tips)
  • Customer profile summary (booking, history)
  • Agent’s draft response (if generated) with confidence score
  • Reason for escalation

This eliminates repeated context gathering — human agents start with full context.
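
One way to type that package; the field names mirror the list above but the exact schema is illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EscalationPackage:
    transcript: list[dict]               # full conversation, oldest turn first
    intent: str
    intent_confidence: float
    retrieved_knowledge: list[dict]      # policies, FAQs, tips with source tiers
    customer_summary: dict               # active booking plus relevant history
    escalation_reason: str
    draft_response: Optional[str] = None       # only if the agent got that far
    draft_confidence: Optional[float] = None
```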

Feedback Loop: Escalated conversations are the system’s most valuable training signal. After human resolution:

  • Knowledge gap detection: If the human agent answered using information not in the knowledge graph, the answer is flagged for potential FAQ generation (feeds into the FAQ extraction pipeline)
  • Intent classifier retraining: Queries that were misclassified (wrong tools invoked) are added to the intent classifier training set. Quarterly retraining cycle
  • Confidence threshold tuning: Escalation rate is tracked by intent category. If a category escalates at >50%, its confidence threshold is raised so those queries escalate at the intent stage rather than after a low-quality generation attempt (tuning sketch after this list)
  • What we chose NOT to automate: After analyzing escalated queries, we identified that refund disputes, multi-booking conflicts, and supplier complaints should remain human-only. Automating these had negative CSAT impact in A/B tests even when the agent’s answer was factually correct — customers wanted human empathy, not just correct information
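
A sketch of that per-category tuning rule; the step size and cap are illustrative, not tuned values:

```python
ESCALATION_CEILING = 0.50  # a category escalating more often than this gets retuned
STEP, CAP = 0.05, 0.95     # raise the bar gradually, never past the cap

def tune_thresholds(thresholds: dict[str, float],
                    escalation_rates: dict[str, float]) -> dict[str, float]:
    for category, rate in escalation_rates.items():
        if rate > ESCALATION_CEILING:
            # Raising the intent-stage bar makes borderline queries in this
            # category hand off before generation instead of after it
            thresholds[category] = min(CAP, thresholds[category] + STEP)
    return thresholds
```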

Cost Architecture

| Component | Cost Driver | Est. Per-Turn Cost | Optimization |
|---|---|---|---|
| Intent Router + Safety Gates | CPU inference (SLMs) | under $0.001 | ONNX optimization; DistilBERT + DeBERTa batched on CPU |
| Knowledge + Memory Lookup | Redis cache + DynamoDB reads | under $0.001 | 92% cache hit rate from offline FAQ generation |
| Response Generation | LLM tokens (GPT-4o-mini) | ~$0.010-0.015 | 4K-token context cap; cached responses for repeat queries |
| Infrastructure (amortized) | ECS Fargate, Redis, DynamoDB | ~$0.003-0.005 | Shared across requests; auto-scaling |
| Total | | ~$0.015-0.020 | Target: under $0.02 |

Cost Levers:

  • Cache hit rate is the dominant cost driver. At 92% FAQ cache hit, only 8% of knowledge queries require LLM generation
  • Context window size directly impacts LLM cost. The 4K token cap prevents cost explosion on long conversations
  • Escalation rate affects total cost: each human-handled turn costs ~$2.50 (agent salary). At 40% deflection, the blended cost per turn drops significantly (worked example below)
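
A worked instance of that claim, plugging in the pilot’s ~$0.018 automated cost:

```python
# Blended cost per turn at 40% deflection (illustrative arithmetic)
deflection = 0.40
automated, human = 0.018, 2.50                # dollars per turn
blended = deflection * automated + (1 - deflection) * human
print(f"${blended:.3f} per turn")             # $1.507, roughly 40% below $2.50
```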

Tooling & Infrastructure

| Layer | Technology | Rationale |
|---|---|---|
| Orchestration | LangGraph (stateful agent) | Native tool-use, state management, human-in-the-loop primitives |
| Knowledge Graph | Neo4j | Relationship traversal for policy propagation (see portfolio-4) |
| Vector Store | Elasticsearch (kNN) | Hybrid semantic + lexical; already in stack |
| Cache | Redis Cluster | Sub-5ms reads; write-through from offline pipeline |
| User Data | DynamoDB | Low-latency key-value for user profiles and booking data |
| LLM Gateway | Custom proxy with circuit breaker | Rate limiting, fallback model routing, cost tracking |
| Monitoring | Arize + Grafana | LLM observability, drift detection, safety metrics (see portfolio-6) |
| Serving | ECS Fargate | Auto-scaling, no GPU required (SLMs on CPU) |
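
The circuit breaker in the LLM Gateway row might look like this minimal sketch; the thresholds and fallback routing are illustrative:

```python
import time

class CircuitBreaker:
    """Trip to a fallback model after repeated primary-model failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback(*args, **kwargs)      # open: route to fallback
            self.opened_at, self.failures = None, 0   # half-open: retry primary
        try:
            result = primary(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()     # trip the breaker
            return fallback(*args, **kwargs)
```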

Results (Design Targets & Pilot Metrics)

| Metric | Baseline (No Agent) | Design Target | Pilot Result | Method |
|---|---|---|---|---|
| Tier-1 Deflection Rate | 0% | 60% | 43% (pilot) | A/B test on 10K conversations |
| P95 Response Latency | N/A | under 1.5s | 1.3s | End-to-end measurement |
| Hallucination Rate | N/A | under 2% | 1.9% | Manual audit of 300 responses |
| CSAT (Automated Responses) | N/A | >4.0/5.0 | 4.1/5.0 | Post-conversation survey |
| Cost per Turn | $2.50 (human) | under $0.02 (automated) | ~$0.018 | Infra + API cost tracking |
| Escalation with Context | 0% (restarts) | 100% | 100% | Context package delivery rate |
| Knowledge Freshness | 45 days avg | under 24 hours | under 24 hours | Source change to FAQ update |

Why 43% deflection, not 60%: The pilot revealed that deflection rate is bounded by knowledge coverage, not agent capability. Where a cached FAQ existed, the agent answered reliably (92% accuracy on policy and product questions). But 22% of queries fell into categories we hadn’t anticipated — questions about local transport, combined-tour logistics, and accessibility — where no knowledge existed in the graph. The remaining gap came from multi-turn conversations where the agent resolved the first question but escalated on the follow-up.

The path to 60% is not a better LLM. It’s expanding the knowledge graph (ongoing via the FAQ extraction pipeline) and improving multi-turn handling.

Evaluation Methodology

End-to-end agent evaluation is harder than component evaluation. A correct retrieval + correct generation can still produce a bad customer experience (wrong tone, unnecessary verbosity, missing context). Evaluation approach:

  • Automated metrics (continuous): Hallucination detection via an NLI model comparing response claims against retrieved sources (a minimal sketch follows this list). Latency and cost tracked per-turn in Datadog. Intent classification accuracy measured against a weekly human-labeled sample of 200 queries
  • Human evaluation (weekly): 100 randomly sampled conversations rated by trained annotators on 4 dimensions: correctness (factual accuracy), completeness (did it fully address the query), tone (appropriate for context), and efficiency (minimal unnecessary back-and-forth). Inter-annotator agreement measured via Cohen’s kappa (target >0.7)
  • A/B testing (pilot): 10K conversations split between agent-handled and control (human-only). Primary metric: CSAT. Guardrail metrics: escalation rate, repeat contact rate within 24 hours, refund rate
  • Failure analysis (monthly): Every escalated conversation is categorized by failure mode: knowledge gap, intent misclassification, confidence calibration error, multi-turn breakdown, or customer preference for human. This drives prioritization of system improvements
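
A minimal sketch of that NLI check, using the off-the-shelf roberta-large-mnli model as a stand-in for the production NLI model; claim extraction here is naive sentence splitting:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def unsupported_claims(response: str, sources: list[str]) -> list[str]:
    """Flag response sentences that no retrieved source entails."""
    flagged = []
    for claim in (s.strip() for s in response.split(".") if s.strip()):
        entailed = any(
            nli([{"text": src, "text_pair": claim}])[0]["label"] == "ENTAILMENT"
            for src in sources
        )
        if not entailed:
            flagged.append(claim)
    return flagged
```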

Risks & Mitigations

| Risk | Impact | Mitigation | Monitoring |
|---|---|---|---|
| Hallucination in responses | Customer receives incorrect information; brand damage | Grounded generation from cached FAQs; output safety gate; source attribution required | Hallucination rate dashboard; weekly audit of 100 random responses |
| Latency spikes | Poor customer experience; timeout errors | Parallel tool calls; aggressive caching; LLM timeout at 3s with graceful fallback | P95/P99 latency alerts; circuit breaker on LLM gateway |
| Context window overflow | Truncated context leads to poor answers | Sliding window with entity extraction; 4K-token hard cap; summarization trigger | Token usage distribution monitoring |
| Cost explosion | LLM API costs exceed budget | Token cap; cache-first architecture; cost alerts at 80% of daily budget | Per-turn cost tracking; daily cost dashboard |
| Escalation failures | Customer stuck in loop without human help | Explicit escalation triggers; max 3 automated turns before offering a human; sentiment monitoring | Escalation rate by intent; CSAT for escalated conversations |
| Memory staleness | Outdated booking or policy data served | Write-through cache invalidation; real-time booking status via API; policy propagation via GraphRAG | Cache freshness SLA; stale-data incident tracking |
| Adversarial inputs | Jailbreak attempts; prompt injection | Multi-layer safety gate (input + output); encoding detection; multi-turn analysis (see Governance Framework for enterprise-level policy and Agent Safety & Evaluation Framework for the agent-specific implementation, including trajectory evaluation, tool-call scope gating, and red-teaming) | Canary token trigger rate; quarterly red-teaming |

Cross-Portfolio Integration

This system design integrates and extends several standalone projects into a unified architecture:

| Subsystem | Portfolio Piece | How It Integrates |
|---|---|---|
| Knowledge Engine (GraphRAG) | Automated FAQ Extraction | Serves as the primary knowledge backbone; pre-generated FAQs cached in Redis; graph traversal for policy resolution |
| Traveler Intelligence | Active Learning for Traveler Tips | DeBERTa-based tip extraction feeds into agent responses; tips attributed and separated from official info |
| Response Quality | Review Summarization at Scale | ABSA pipeline informs product understanding; sentiment calibration techniques applied to agent tone |
| Prompt Optimization | Cost-Aware APO | Hybrid APE-OPRO used for fallback classifier prompt optimization; cost-aware prompt selection |
| Safety & Governance | Enterprise AI Governance | Safety gate architecture (input + output validation); source authority hierarchies; adversarial defense |
| Engineering Practices | Cross-Portfolio Practices | A/B testing methodology; model monitoring; embedding selection framework; failure mode analysis |