Review Summarization at Scale
Situation
Summarizing thousands of redundant reviews directly with an LLM produced hallucinations and overly positive summaries. Capturing nuanced customer sentiment required a more grounded, scalable approach.
Task
Build a domain-agnostic pipeline that generates explainable, grounded summaries by extracting structured information and removing redundancy, with explicit faithfulness guarantees.
Action
Phase 1: Hierarchical Theme Identification
- Zero-Shot Generation: GPT-4 generates candidate themes from a sample of 500 reviews
- Theme Validation: Human annotators label 200 reviews against the generated themes; themes with <70% annotator agreement are discarded
- Semantic Deduplication: all-mpnet-base-v2 embeddings (768 dim); themes with pairwise cosine similarity >0.85 are merged as overlapping
- Theme Coverage Metric: % of reviews where at least one theme applies (validated against human labels)
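The deduplication and coverage steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes theme embeddings are already computed (toy 3-dimensional vectors stand in for the 768-dimensional all-mpnet-base-v2 output), and it assumes single-link merging via union-find, which the source does not specify. The function names `merge_themes` and `theme_coverage` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_themes(themes, vectors, threshold=0.85):
    """Group themes whose embeddings exceed the similarity threshold.

    Assumption: single-link merging (any pair above the threshold joins
    two groups), implemented with a small union-find.
    """
    parent = list(range(len(themes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(themes)):
        for j in range(i + 1, len(themes)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(themes):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


def theme_coverage(review_labels):
    """Fraction of reviews with at least one theme assigned."""
    if not review_labels:
        return 0.0
    return sum(1 for labels in review_labels if labels) / len(review_labels)


# Toy data: the first two themes point in nearly the same direction.
themes = ["Guide quality", "Tour guide", "Value for money"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
merged = merge_themes(themes, vectors)
coverage = theme_coverage([{"Guide quality"}, set(), {"Value for money"}, {"Guide quality"}])
```

With real embeddings, the threshold interacts with the merging strategy: single-link merging can chain loosely related themes together, so a stricter rule (for example, requiring similarity to every member of a group) may be preferable.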
Phase 2: Structured Extraction (ABSA)
Aspect-Based Sentiment Analysis extracts (Theme, Aspect, Opinion, Sentiment) tuples:
- Primary Extractor: Fine-tuned DeBERTa-v3-base on 5,000 labeled review sentences
- Secondary Validator: GPT-3.5 validates extractions; inter-model agreement: 87.3%
- Conflict Resolution: When the two models disagree, the sample is sent to a human review queue
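A rough sketch of the dual-model check and conflict routing is shown below. It assumes both models return their tuples keyed by a sentence ID and that "agreement" means an exact match on the tuple set for a sentence; the source does not define agreement, so this is an assumption. `route_extractions` is a hypothetical helper name.

```python
def route_extractions(primary, secondary):
    """Compare per-sentence tuples from two extractors.

    primary / secondary: dict mapping sentence_id -> list of
    (theme, aspect, opinion, sentiment) tuples.
    Returns (accepted, review_queue, agreement_rate).
    Assumption: agreement = exact match of the tuple sets.
    """
    accepted, review_queue = {}, []
    for sentence_id, tuples in primary.items():
        if set(tuples) == set(secondary.get(sentence_id, [])):
            accepted[sentence_id] = tuples
        else:
            review_queue.append(sentence_id)  # disagreement: human review
    agreement = len(accepted) / len(primary) if primary else 0.0
    return accepted, review_queue, agreement


primary = {
    "s1": [("Guide", "knowledge", "very knowledgeable", "positive")],
    "s2": [("Value", "price", "overpriced", "negative")],
}
secondary = {
    "s1": [("Guide", "knowledge", "very knowledgeable", "positive")],
    "s2": [("Value", "price", "overpriced", "neutral")],
}
accepted, review_queue, agreement = route_extractions(primary, secondary)
```

A looser agreement rule (for example, matching on theme and sentiment only) would raise the agreement rate and shrink the review queue, at the cost of letting aspect-level disagreements through.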
Phase 3: Opinion Clustering (Key Innovation)
- Embedding Model: all-MiniLM-L6-v2 (selected for speed; benchmarked against mpnet with <2% quality loss)
- Clustering: HDBSCAN with min_cluster_size=5, min_samples=3 (tuned via silhouette score)
- Representative Selection: Cluster medoid + 2 diverse samples per cluster
- Reduction: 5,000 reviews -> ~150 representative opinions
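Representative selection within one cluster might look like the sketch below. It assumes cluster membership is already known (for example, from HDBSCAN labels) and operates on the embeddings of a single cluster. How the "2 diverse samples" were chosen is not stated in the source; farthest-point selection is used here as one plausible reading, and `select_representatives` is a hypothetical name.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)


def select_representatives(vectors, n_diverse=2):
    """Return indices of the cluster medoid plus up to n_diverse extra points.

    The medoid is the point with the smallest total distance to all others.
    Assumption: diversity = greedy farthest-point selection, i.e. each extra
    pick maximizes its distance to the nearest already-chosen point.
    """
    n = len(vectors)
    if n == 0:
        return []
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    medoid = min(range(n), key=lambda i: sum(dist[i]))
    chosen = [medoid]
    while len(chosen) < min(n, 1 + n_diverse):
        candidate = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: min(dist[i][c] for c in chosen),
        )
        chosen.append(candidate)
    return chosen


# Toy 2-D cluster: five opinions fanned out over a small angle.
cluster = [[1, 0], [1, 0.05], [1, 0.1], [1, 0.15], [1, 0.2]]
representatives = select_representatives(cluster)
```

In this toy cluster the middle point is the medoid and the two extremes are picked as the diverse samples, which is the behavior one would want: one central opinion plus the edges of the cluster.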
Multilingual Handling
Viator operates globally; 34% of reviews are non-English:
- Language Detection: fastText lid.176 model
- Translation: NLLB-200 for non-English reviews before processing
- Validation: Native speaker spot-checks for top-5 languages (ES, FR, DE, IT, PT)
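The routing logic for non-English reviews can be sketched independently of the specific models. In the example below, `detect_language` and `translate` are injected callables standing in for the fastText and NLLB-200 wrappers; their signatures are assumptions made for illustration, not the real library APIs. Keeping the original text alongside the translation is also an assumption, added so that summaries can still be traced to the source review.

```python
def normalize_to_english(reviews, detect_language, translate):
    """Translate non-English reviews, keeping the original for traceability.

    detect_language(text) -> ISO 639-1 code such as "en" or "es" (assumed)
    translate(text, source_lang) -> English text (assumed)
    """
    normalized = []
    for text in reviews:
        lang = detect_language(text)
        if lang == "en":
            normalized.append({"text": text, "source_lang": lang, "translated": False})
        else:
            normalized.append({
                "text": translate(text, lang),
                "original": text,  # retained for audits and spot-checks
                "source_lang": lang,
                "translated": True,
            })
    return normalized


# Stub callables for demonstration only.
def fake_detect(text):
    return "es" if "excelente" in text else "en"


def fake_translate(text, source_lang):
    return f"[translated from {source_lang}] {text}"


rows = normalize_to_english(["Great tour", "Guia excelente"], fake_detect, fake_translate)
```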
Results
| Metric | Baseline (Direct LLM) | Pipeline | Evaluation Basis |
|---|---|---|---|
| Theme Coverage | ~50% | 94.2% | Human annotation on 500 review sample |
| Sentiment Accuracy | 71% | 89% | Compared to 3-annotator majority vote |
| Positivity Bias | 82% positive | 67% positive | Ground truth distribution: 65% positive |
| Hallucination Rate | 12% | 1.8% | Manual audit of 200 summaries |
| Token Usage (per product) | 45K | 8K | 82% reduction via clustering |
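For reference, the token reduction and the gap to the ground-truth positive share follow directly from the table values:

```latex
\[
\text{token reduction} = \frac{45\text{K} - 8\text{K}}{45\text{K}} = \frac{37}{45} \approx 82\%
\]
\[
\text{baseline gap} = |82 - 65| = 17 \text{ percentage points}, \qquad
\text{pipeline gap} = |67 - 65| = 2 \text{ percentage points}
\]
```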
System Design & Architecture
Hybrid ABSA + clustering + LLM summarization pipeline:
- Input: Raw reviews from product database
- Stage 1: Theme identification and validation
- Stage 2: ABSA extraction with dual-model validation
- Stage 3: Opinion clustering and representative selection
- Stage 4: LLM generates summary from representatives only (grounded)
- Output: Structured summary with sentiment distribution and source traceability
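Stage 4 can be illustrated with a prompt builder that feeds the LLM only the cluster representatives, each tagged with its source review ID. This is a sketch under assumptions: the field names (`review_id`, `theme`, `sentiment`, `cluster_size`, `opinion`), the bracketed citation format, and the prompt wording are all hypothetical and not taken from the source.

```python
def build_grounded_prompt(product_name, representatives):
    """Build a summarization prompt restricted to representative opinions.

    representatives: list of dicts with hypothetical keys
    review_id, theme, sentiment, cluster_size, opinion.
    """
    evidence = [
        f"[{r['review_id']}] ({r['theme']}, {r['sentiment']}, "
        f"n={r['cluster_size']}) {r['opinion']}"
        for r in representatives
    ]
    instructions = (
        f"Summarize customer opinions about {product_name}. "
        "Use only the evidence below. After each claim, cite the "
        "supporting review ID in square brackets. Do not add details "
        "that are not in the evidence."
    )
    return instructions + "\n\nEvidence:\n" + "\n".join(evidence)


representatives = [
    {"review_id": "r101", "theme": "Guide", "sentiment": "positive",
     "cluster_size": 42, "opinion": "The guide was very knowledgeable."},
    {"review_id": "r202", "theme": "Value", "sentiment": "negative",
     "cluster_size": 9, "opinion": "Felt overpriced for a two-hour tour."},
]
prompt = build_grounded_prompt("Example City Tour", representatives)
```

Including the cluster size gives the model a signal for how much weight each opinion should carry, which is one way the sentiment distribution in the output can stay close to the underlying review distribution.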
Sentiment Calibration Methodology
- Collected ground truth sentiment distribution from 1,000 manually labeled reviews
- Baseline LLM summaries implied 82% positive sentiment (overestimate)
- Pipeline output: 67% positive, within 2 percentage points of ground truth (65%)
- Statistical test (chi-square): the baseline distribution differs significantly from ground truth (p<0.01); the pipeline distribution does not (p=0.34)
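The comparison against ground truth can be sketched as a chi-square goodness-of-fit test on a binary positive / non-positive split. The number of summaries or opinions behind the reported p-values is not stated, so `n = 500` below is purely illustrative and the results are not expected to reproduce the figures above. The binary split is also an assumption; a three-way positive / neutral / negative test would need two degrees of freedom.

```python
from math import erfc, sqrt


def chi_square_gof_binary(observed_positive, n, expected_rate):
    """Chi-square goodness-of-fit for a positive / non-positive split.

    Returns (statistic, p_value). With two categories there is one degree
    of freedom, for which the survival function is erfc(sqrt(x / 2)).
    """
    expected_pos = n * expected_rate
    expected_neg = n - expected_pos
    observed_neg = n - observed_positive
    stat = ((observed_positive - expected_pos) ** 2 / expected_pos
            + (observed_neg - expected_neg) ** 2 / expected_neg)
    return stat, erfc(sqrt(stat / 2))


n = 500  # hypothetical sample size, not from the source
ground_truth_rate = 0.65

baseline_stat, baseline_p = chi_square_gof_binary(round(0.82 * n), n, ground_truth_rate)
pipeline_stat, pipeline_p = chi_square_gof_binary(round(0.67 * n), n, ground_truth_rate)
```

A non-significant result for the pipeline only means the test found no evidence of a difference at this sample size; it does not prove the distributions are identical.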
Risks & Mitigations
| Risk | Impact | Mitigation | Monitoring |
|---|---|---|---|
| Hallucination | Summaries contain invented details | Faithfulness check: every claim traced to source review | Weekly audit of 50 random summaries |
| Theme Overlap | Redundant or confusing themes | Dual-level frequency analysis; merge threshold tuning | Theme drift detection monthly |
| Embedding Quality | Poor clustering degrades representative selection | Benchmark 3 embedding models; select the one with the best silhouette score | Cluster quality metrics in dashboard |
| Translation Errors | Non-English reviews misprocessed | NLLB + native speaker validation for top languages | BLEU score monitoring for translation quality |
| Cost/Input Size | Thousands of reviews per product | Clustering reduces to ~150 representatives | Token usage alerts per product |
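The traceability part of the faithfulness check could be automated along the lines below, as a first filter before the manual audit. It assumes each summary sentence cites its source reviews in square brackets (for example `[r101]`), which is a hypothetical convention, and `audit_summary` is a hypothetical helper. Note that a valid citation is only a necessary condition: the sketch does not verify that the cited review actually supports the claim.

```python
import re

# Assumed citation format: review IDs in square brackets, e.g. [r101].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def audit_summary(summary_sentences, known_review_ids):
    """Return sentences that lack a citation or cite an unknown review ID."""
    flagged = []
    for sentence in summary_sentences:
        cited = CITATION.findall(sentence)
        if not cited or any(c not in known_review_ids for c in cited):
            flagged.append(sentence)
    return flagged


sentences = [
    "Guides were praised as knowledgeable [r101].",
    "Several guests found the tour overpriced [r999].",
    "Pickup was always on time.",
]
flagged = audit_summary(sentences, {"r101", "r202"})
```

Here the second sentence is flagged for citing a review that is not among the representatives, and the third for carrying no citation at all.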