Active Learning for Traveler Tips Extraction
Situation
Traveler tips (valuable, actionable advice) were buried within user reviews. A lack of labeled data made conventional model training impractical.
Task
Fine-tune a Small Language Model to extract tips with accuracy exceeding few-shot LLM baselines, while minimizing labeling cost and maintaining production-viable inference costs.
Action
Model Selection
Evaluated three SLM candidates. Although Flan-T5-small posted the highest few-shot F1, DeBERTa-v3-small offered the best balance of quality and latency for production serving and was selected.
| Model | Parameters | Inference Latency | Base F1 (few-shot) |
|---|---|---|---|
| DistilBERT | 66M | 12ms | 0.68 |
| DeBERTa-v3-small | 44M | 18ms | 0.71 |
| Flan-T5-small | 80M | 25ms | 0.74 |
Active Learning Workflow
- Initial Dataset: 4,000 samples with 50% tips (balanced for training stability). 2,000 from LLM-generated labels (GPT-3.5), 2,000 from existing weak labels.
- Teacher Model: GPT-4 as oracle for disagreement mining
- Disagreement Mining: Ran inference on 10,000 unlabeled reviews; flagged 1,200 where student prediction diverged from teacher by >0.3 confidence
- Human Review: Labeled 500 highest-disagreement samples (40 hours annotator time)
- Iteration: Repeated cycle 3 times until F1 improvement <1% per cycle
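The disagreement-mining step above can be sketched as a simple confidence comparison. This is a minimal illustration with toy scores; the real pipeline runs batched student and teacher inference over the 10,000 unlabeled reviews, and the function name and data here are hypothetical.

```python
def mine_disagreements(student_probs, teacher_probs, gap=0.3):
    """Return indices of reviews where student and teacher tip-confidence
    diverge by more than `gap` (0.3 in the workflow above)."""
    return [
        i for i, (s, t) in enumerate(zip(student_probs, teacher_probs))
        if abs(s - t) > gap
    ]

# Toy example: four reviews scored by both models.
student = [0.90, 0.20, 0.55, 0.10]
teacher = [0.85, 0.70, 0.50, 0.95]
flagged = mine_disagreements(student, teacher)  # reviews 1 and 3 diverge by >0.3
```

Flagged samples are then ranked by disagreement magnitude, and only the top of that ranking (500 samples per cycle here) goes to human annotators.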
Distribution Calibration
Training on 50/50 balanced data produces miscalibrated probabilities in production, where tips are only ~8% of reviews. Mitigations:
- Temperature Scaling: Post-hoc calibration on held-out set with true distribution
- Threshold Tuning: Operating threshold set to 0.72 (vs. default 0.5) to optimize precision-recall tradeoff
- Production Sampling: Inference outputs calibrated probabilities, not raw logits
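The calibration steps above can be sketched with a grid-search fit of the temperature on a held-out set, followed by the tuned operating threshold. The logits and labels below are hypothetical stand-ins for the held-out set drawn from the true (~8% tips) distribution; only the 0.72 threshold comes from the section above.

```python
import math

def scaled_prob(logit, temperature):
    """Sigmoid of a temperature-scaled logit (binary tip/not-tip classifier)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def fit_temperature(logits, labels, grid=None):
    """Grid-search the temperature T that minimizes negative log-likelihood
    on a held-out set; a simple stand-in for gradient-based fitting."""
    grid = grid or [t / 10 for t in range(5, 51)]  # T in [0.5, 5.0]
    def nll(t):
        eps = 1e-12
        return -sum(
            y * math.log(scaled_prob(z, t) + eps)
            + (1 - y) * math.log(1.0 - scaled_prob(z, t) + eps)
            for z, y in zip(logits, labels)
        )
    return min(grid, key=nll)

# Hypothetical held-out logits/labels; raw outputs are overconfident
# because the training data was 50/50 balanced.
logits = [3.0, 2.5, -1.0, -2.0, -3.0, 2.8, -2.5, -1.5]
labels = [1, 0, 0, 0, 0, 1, 0, 0]
T = fit_temperature(logits, labels)
THRESHOLD = 0.72  # tuned operating threshold from above
preds = [scaled_prob(z, T) >= THRESHOLD for z in logits]
```

Serving then emits `scaled_prob` values rather than raw logits, so downstream consumers can interpret scores as probabilities.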
Teacher Bias Mitigation
LLM-generated labels introduce teacher model biases:
- Bias Detection: Compared LLM labels to human labels on 200 samples; identified systematic under-labeling of negative tips (“avoid…”)
- Correction: Augmented training data with human-labeled negative tip examples
- Validation: Final model evaluated only on human-labeled test set (not on LLM labels)
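The bias-detection comparison can be sketched as a miss-rate audit over a sample labeled by both the LLM and humans. The records below are invented for illustration; the real audit used 200 human-reviewed samples and surfaced systematic under-labeling of "avoid..." tips.

```python
# Hypothetical audit records: each has the review text, the human label,
# and the LLM-generated label (1 = tip, 0 = not a tip).
audit = [
    {"text": "Avoid the main entrance at noon.", "human": 1, "llm": 0},
    {"text": "Avoid weekends if you can.",       "human": 1, "llm": 0},
    {"text": "Arrive early for the best light.", "human": 1, "llm": 1},
    {"text": "The lobby is beautiful.",          "human": 0, "llm": 0},
]

def miss_rate(records, predicate):
    """Fraction of human-confirmed tips matching `predicate` that the LLM missed."""
    subset = [r for r in records if r["human"] == 1 and predicate(r["text"])]
    if not subset:
        return 0.0
    return sum(r["llm"] == 0 for r in subset) / len(subset)

negative_miss = miss_rate(audit, lambda t: t.lower().startswith("avoid"))
overall_miss = miss_rate(audit, lambda t: True)
# A negative-tip miss rate well above the overall miss rate signals the bias;
# the fix was augmenting training data with human-labeled negative tips.
```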
Results
| Metric | Few-Shot GPT-4 | Fine-Tuned DeBERTa | Notes |
|---|---|---|---|
| F1 Score | 0.76 | 0.84 | Human-labeled test set (n=500) |
| Precision | 0.72 | 0.88 | — |
| Recall | 0.81 | 0.80 | — |
| Inference Cost (per 1K reviews) | $4.20 | $0.03 | API vs. self-hosted |
| Latency (P95) | 1.2s | 22ms | — |
The fine-tuned model achieves a 10.5% relative F1 improvement over the few-shot baseline while reducing inference cost by 99.3%.
System Design & Architecture
- Training: PyTorch + HuggingFace Transformers; 3 epochs, learning rate 2e-5, batch size 32
- Serving: ONNX-optimized model on CPU instances (cost-effective for this workload)
- Pipeline: Batch inference nightly on new reviews; results cached in Elasticsearch
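The nightly batch step can be sketched as incremental scoring against a cache. Everything here is illustrative: `classify` stubs out the ONNX-served model, and a plain dict stands in for the Elasticsearch index; none of these names are the production API.

```python
def classify(review: str) -> float:
    """Stub for the ONNX-served DeBERTa model; returns a calibrated tip probability."""
    return 0.9 if "tip:" in review.lower() else 0.1

THRESHOLD = 0.72  # calibrated operating threshold from the Action section

def nightly_batch(new_reviews: dict, cache: dict) -> None:
    """Score only reviews not already cached; write results into the cache."""
    for review_id, text in new_reviews.items():
        if review_id in cache:
            continue  # already processed on a previous night
        p = classify(text)
        cache[review_id] = {"tip_prob": p, "is_tip": p >= THRESHOLD}

cache = {}
nightly_batch({"r1": "Tip: book tickets ahead.", "r2": "Nice place."}, cache)
```

Caching keyed by review ID keeps the nightly job idempotent, so reruns after a failure do not re-score already-processed reviews.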
Risks & Mitigations
| Risk | Impact | Mitigation | Monitoring |
|---|---|---|---|
| Class Imbalance | Model biased toward majority class | Balanced training + threshold calibration | Per-class precision/recall weekly |
| Teacher Bias | LLM errors propagated to student | Human validation set; bias detection pipeline | Disagreement rate tracking |
| Diminishing Returns | Continued labeling wastes resources | Stop when F1 improvement <1% per cycle | Learning curve monitoring |
| Distribution Shift | Review language evolves over time | Quarterly re-evaluation on fresh samples | F1 trend monitoring monthly |