AI Evaluation Specialist — Content Quality at Viator
Situation
Viator hosts hundreds of thousands of travel experience listings. Product descriptions and marketing copy are critical to conversion — but inconsistent quality, hallucinated details, and brand-voice drift were measurable problems as LLM-generated rewrites scaled up. There was no systematic evaluation framework: quality was checked manually, sporadically, and without reproducibility.
The challenge was building a rigorous, automated evaluation stack that could catch factual errors, measure semantic fidelity, track marketing copy quality, and surface model failure modes — all at pipeline speed.
Task
Design and own the end-to-end evaluation framework for LLM-rewritten product descriptions and marketing content. This meant selecting and implementing the right metrics for each content dimension, integrating evaluation into the production rewrite pipeline, and building introspection tooling to enable root-cause analysis when quality regressed.
Action
1. Metric architecture by content dimension
Rather than applying a single metric, I designed a layered evaluation stack where each layer catches a different class of failure:
Layer 1 — Lexical fidelity (ROUGE-L, BLEU)
Baseline overlap metrics to catch rewrites that drift too far from source material or drop key product features. ROUGE-L (longest common subsequence) is particularly useful for descriptions where preserving key phrases matters more than n-gram precision. Used as a fast pre-filter — if lexical overlap falls below a threshold, the rewrite is flagged before more expensive layers run.
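A minimal sketch of this pre-filter, assuming the `rouge-score` and `sacrebleu` packages; the thresholds shown are illustrative, not the production values:

```python
from rouge_score import rouge_scorer
import sacrebleu

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_prefilter(source: str, rewrite: str,
                      rouge_threshold: float = 0.35,
                      bleu_threshold: float = 10.0) -> dict:
    """Layer 1: cheap lexical-overlap check. Flags rewrites that drift
    too far from the source before the expensive layers run."""
    rouge_l = _rouge.score(source, rewrite)["rougeL"].fmeasure
    bleu = sacrebleu.sentence_bleu(rewrite, [source]).score  # 0-100 scale
    flagged = rouge_l < rouge_threshold or bleu < bleu_threshold
    return {"rouge_l": rouge_l, "bleu": bleu, "flagged": flagged}
```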
Layer 2 — Semantic fidelity (BERTScore F1, MoverScore)
Contextual embedding similarity catches paraphrases that preserve meaning but score poorly on lexical metrics. BERTScore uses BERT token-level cosine similarity; MoverScore applies Word Mover’s Distance on contextual embeddings — more robust to reordering in itinerary-style copy. Together these cover the gap between surface similarity and actual meaning preservation.
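A sketch of the BERTScore check using the `bert-score` package (MoverScore ships as a separate library and is omitted here); the F1 threshold is illustrative:

```python
from bert_score import score as bertscore

def semantic_fidelity(sources: list[str], rewrites: list[str],
                      f1_threshold: float = 0.85) -> list[dict]:
    """Layer 2: contextual-embedding similarity. Catches rewrites that
    paraphrase heavily (low ROUGE/BLEU) but still preserve meaning."""
    # Returns precision/recall/F1 tensors, one value per candidate rewrite.
    P, R, F1 = bertscore(rewrites, sources, lang="en", rescale_with_baseline=True)
    return [
        {"bertscore_f1": f1, "flagged": f1 < f1_threshold}
        for f1 in F1.tolist()
    ]
```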
Layer 3 — Factual grounding (NLI entailment)
The most critical layer for product descriptions. An NLI model (DeBERTa-v3-large fine-tuned on MNLI) scores whether each rewritten sentence is entailed by the source. Any sentence scoring as contradiction or neutral triggers a hallucination flag. Introducing this layer produced the single largest quality improvement in the pipeline — hallucination rate dropped substantially in the first week of gating.
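A sketch of the sentence-level entailment gate using a Hugging Face `transformers` text-classification pipeline; the checkpoint name, threshold, and naive sentence splitter are placeholders rather than the production choices:

```python
import re
from transformers import pipeline

# Checkpoint is illustrative; production used a DeBERTa-v3-large MNLI fine-tune.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def hallucination_flags(source: str, rewrite: str,
                        entail_threshold: float = 0.5) -> list[dict]:
    """Layer 3: each rewritten sentence must be entailed by the source.
    Sentences below the entailment threshold are flagged."""
    # Naive sentence split for illustration; production used a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", rewrite) if s.strip()]
    flags = []
    for sentence in sentences:
        scores = nli({"text": source, "text_pair": sentence}, top_k=None)
        by_label = {s["label"].lower(): s["score"] for s in scores}
        entailment = by_label.get("entailment", 0.0)
        flags.append({
            "sentence": sentence,
            "entailment": entailment,
            "flagged": entailment < entail_threshold,  # contradiction or neutral
        })
    return flags
```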
Layer 4 — Overall quality (G-Eval / LLM-as-judge)
GPT-4o as evaluator with chain-of-thought prompting across four dimensions: coherence, fluency, relevance, factual consistency. Used as the final quality gate before content goes live. In calibration studies against human raters, G-Eval scores correlated strongly with human judgement — well enough to substantially reduce manual annotation overhead while maintaining coverage.
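A simplified G-Eval-style judge call, assuming the OpenAI Python SDK; the prompt wording, dimension names, and median-of-n aggregation follow the approach described above but are illustrative rather than the production prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a rewritten travel product description.
Source:
{source}

Rewrite:
{rewrite}

Reason step by step about each dimension, then rate the rewrite 1-5 on:
coherence, fluency, relevance, factual_consistency.
Return a JSON object with exactly those four keys."""

def g_eval(source: str, rewrite: str, n_samples: int = 3) -> dict:
    """Layer 4: LLM-as-judge. Scores each item several times and keeps the
    per-dimension median to damp run-to-run variance."""
    per_dim = {"coherence": [], "fluency": [], "relevance": [], "factual_consistency": []}
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(source=source, rewrite=rewrite)}],
        )
        scores = json.loads(resp.choices[0].message.content)
        for dim in per_dim:
            per_dim[dim].append(float(scores[dim]))
    return {dim: sorted(vals)[len(vals) // 2] for dim, vals in per_dim.items()}
```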
Layer 5 — Reference-free hallucination (SelfCheckGPT)
For cases with no gold-standard source to compare against (e.g. marketing copy). Samples multiple independent completions and measures cross-consistency — inconsistent facts across samples indicate confabulation. Enabled hallucination detection without requiring a reference document.
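A SelfCheckGPT-style consistency sketch, reusing the `nli` pipeline and naive sentence splitter from the Layer 3 sketch; drawing the independent completions themselves (e.g. several calls at temperature > 0) is assumed to happen upstream, and the threshold is illustrative:

```python
def selfcheck_consistency(main_output: str, samples: list[str],
                          inconsistency_threshold: float = 0.5) -> list[dict]:
    """Layer 5: reference-free check. Each sentence of the main output is
    scored against independently sampled completions; sentences the samples
    do not support are treated as likely confabulation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", main_output) if s.strip()]
    results = []
    for sentence in sentences:
        # 1 - entailment probability against each sample, averaged across samples.
        per_sample = []
        for sample in samples:
            scores = nli({"text": sample, "text_pair": sentence}, top_k=None)
            by_label = {s["label"].lower(): s["score"] for s in scores}
            per_sample.append(1.0 - by_label.get("entailment", 0.0))
        inconsistency = sum(per_sample) / len(per_sample)
        results.append({"sentence": sentence,
                        "inconsistency": inconsistency,
                        "flagged": inconsistency > inconsistency_threshold})
    return results
```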
2. Model introspection tooling
Evaluation metrics tell you that quality has degraded — introspection tells you why. I built a triage dashboard integrating:
- Gradient saliency maps (via Captum) to identify which input tokens most influenced problematic outputs
- Attention visualisation (BertViz) to detect when the model attends to irrelevant context
- Concept activation vectors (TCAV-style probing) to test whether style concepts (luxury tone, urgency, local specificity) were linearly encoded in transformer layers — used to compare fine-tuned vs base model representations
This tooling cut root-cause triage time dramatically — what previously took hours could typically be resolved within a single focused session.
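As an illustration of the saliency component above, a minimal sketch using Captum's `LayerIntegratedGradients` over a Hugging Face classifier's embedding layer; the checkpoint and target class are stand-ins for the internal quality classifier, not the production setup:

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; in practice this was the pipeline's own classifier.
MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def token_saliency(text: str, target_class: int = 1) -> list[tuple[str, float]]:
    """Attribute the target-class logit back to input tokens via
    integrated gradients over the embedding layer."""
    enc = tokenizer(text, return_tensors="pt")

    def forward(input_ids, attention_mask):
        return model(input_ids=input_ids, attention_mask=attention_mask).logits

    lig = LayerIntegratedGradients(forward, model.get_input_embeddings())
    attributions = lig.attribute(
        inputs=enc["input_ids"],
        additional_forward_args=(enc["attention_mask"],),
        target=target_class,
        n_steps=32,
    )
    scores = attributions.sum(dim=-1).squeeze(0)  # one attribution per token
    scores = scores / torch.norm(scores)          # normalise for display
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores.tolist()))
```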
3. Pipeline integration
Integrated the full evaluation stack into the content rewrite CI/CD pipeline:
- Layers 1–3 run as fast pre-filters at batch scale via PySpark (see the sketch after this list)
- Layer 4 (G-Eval) runs asynchronously as a quality audit on a random sample plus all flagged items
- Evaluation results written to a metrics store; dashboards alert on rolling 7-day metric degradation
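A sketch of how the Layer 1 check runs at batch scale as a PySpark job; the table paths, column names, and threshold are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("rewrite-eval-prefilter").getOrCreate()

# Hypothetical input table: one row per (source, rewrite) pair.
pairs = spark.read.parquet("s3://content-eval/rewrites/")

@F.udf(returnType=BooleanType())
def lexical_flag(source: str, rewrite: str) -> bool:
    # Import inside the UDF so the dependency is available on the executors.
    from rouge_score import rouge_scorer
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(source, rewrite)["rougeL"].fmeasure
    return rouge_l < 0.35  # illustrative Layer 1 threshold

flagged = (
    pairs.withColumn("lexical_flagged", lexical_flag("source_text", "rewrite_text"))
         .filter(F.col("lexical_flagged"))
)
flagged.write.mode("overwrite").parquet("s3://content-eval/flagged/")
```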
Results
Introducing NLI entailment gating was the single biggest improvement — hallucination rate dropped substantially within the first week of deployment, and remained the most reliable signal for catching factual errors at scale.
BERTScore and ROUGE-L both improved meaningfully once semantic fidelity became an explicit optimisation target alongside lexical overlap, reflecting better preservation of source intent rather than just surface phrasing.
G-Eval scores correlated strongly with human rater judgement in calibration studies, enabling a significant reduction in manual annotation overhead. Human review was redirected toward edge cases and quarterly calibration audits rather than routine quality checks.
The introspection dashboard reduced the time to identify and fix a quality regression from hours to a single focused session — making it practical to investigate failures rather than simply re-raise them as prompt issues.
Evaluation Metric Reference
| Metric | Type | What it catches | Reference needed? |
|---|---|---|---|
| ROUGE-L | Lexical | Key phrase omission, over-abstraction | Yes |
| BLEU | Lexical | Precision-side drift in short copy | Yes |
| BERTScore F1 | Semantic | Meaning drift despite surface similarity | Yes |
| MoverScore | Semantic | Reorder-sensitive semantic gaps | Yes |
| NLI entailment | Factual | Hallucination, contradiction, unsupported claims | Yes |
| G-Eval (LLM-judge) | Holistic | Coherence, fluency, relevance, factual consistency | Optional |
| SelfCheckGPT | Factual | Confabulation (reference-free) | No |
| Saliency/Attribution | Introspection | Model attending to wrong input signals | No |
| TCAV probing | Introspection | Style/tone concept encoding in latent space | No |
Key Design Decisions & Trade-offs
Why NLI over perplexity for hallucination? Perplexity measures how surprised the model is — not whether the output is factually grounded. A confident hallucination has low perplexity. NLI entailment directly tests logical consistency against the source, which is what matters for product descriptions.
Why G-Eval over BLEURT or METEOR? BLEURT and METEOR correlate well with human judgement on translation tasks but generalise poorly to open-ended marketing copy. G-Eval’s chain-of-thought scoring on named dimensions (coherence, fluency, relevance) better mirrors how a content editor would assess a rewrite.
Why SelfCheckGPT for marketing copy? Marketing copy has no canonical reference document — you cannot compare to a “correct” version. SelfCheckGPT exploits the model’s own output variance as a hallucination signal, enabling reference-free quality control.
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| NLI false positives on legitimate paraphrases | Calibrated entailment threshold on labelled holdout pairs from the actual pipeline |
| G-Eval score variance across API calls | Each item scored multiple times at temperature=0; final score is median |
| Metric gaming (optimising prompts to score well, not be good) | Quarterly human audit with correlation check against automated scores |
| Saliency maps expensive at scale | Approximate integrated gradients in production; full computation on-demand for triage only |