Situation

Valuable, actionable traveler tips were buried within user reviews, and the lack of labeled data made conventional supervised training impractical.

Task

Fine-tune a small language model (SLM) to extract tips more accurately than few-shot LLM baselines, while minimizing labeling cost and keeping inference cost viable for production.

Action

Model Selection

Evaluated three SLM candidates. Selected DeBERTa-v3-small after weighing few-shot F1 against fine-tuning headroom, architecture fit, and serving cost.

Model               Parameters   Inference Latency   Base F1 (few-shot)
DistilBERT          66M          12ms                0.68
DeBERTa-v3-small    44M          18ms                0.71
Flan-T5-small       80M          25ms                0.74

Why DeBERTa-v3-small over the higher-F1 Flan-T5-small:

  • Task fit: Tip extraction is a span/token-classification problem. Encoder-only models (DeBERTa) fine-tune faster and more stably on this formulation than seq2seq models (Flan-T5), which have to emit the span as generated text and are more prone to format drift under small-label regimes.
  • Fine-tuning headroom: The base F1 gap (0.03) is small compared to the gain expected from task-specific fine-tuning on an encoder. After fine-tuning, DeBERTa-v3-small reached 0.84 F1, exceeding what pilot runs suggested for a fine-tuned Flan-T5-small, which plateaued around 0.79 with higher variance across seeds.
  • Serving cost: DeBERTa-v3-small is ~45% smaller (44M vs 80M parameters), which translated to ~40% lower CPU-inference cost at our nightly batch-inference volume. The 7ms latency gap (18ms vs 25ms) matters at scale, even though either model would clear the offline SLO.
  • ONNX support: Encoder-only DeBERTa exports cleanly to ONNX with no graph rewrites; Flan-T5’s seq2seq export required custom handling for the decoder, adding fragility.

Active Learning Workflow

  1. Initial Dataset: 4,000 samples with 50% tips (balanced for training stability). 2,000 from LLM-generated labels (GPT-3.5), 2,000 from existing weak labels.
  2. Teacher Model: GPT-4 as oracle for disagreement mining
  3. Disagreement Mining: Ran inference on 10,000 unlabeled reviews; flagged 1,200 where the student's confidence differed from the teacher's by more than 0.3
  4. Human Review: Labeled the 500 highest-disagreement samples (40 hours of annotator time)
  5. Iteration: Repeated the cycle 3 times, stopping once F1 improvement fell below 1% per cycle
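
The disagreement-mining and review-queue steps (3 and 4) can be sketched as below. This is a minimal illustration, not the project's code: the `Scored` record, the `mine_disagreements` helper, and the toy scores are hypothetical, and it assumes the teacher's output has already been converted to a tip probability (how GPT-4 confidence was obtained is not specified here).

```python
from dataclasses import dataclass


@dataclass
class Scored:
    review_id: str
    student_p: float  # student's P(tip), assumed to be in [0, 1]
    teacher_p: float  # teacher's P(tip), assumed to be in [0, 1]


def mine_disagreements(scored, min_gap=0.3, budget=500):
    """Return review IDs for human review, largest student/teacher gap first."""
    # Keep only reviews where the two models differ by more than min_gap.
    flagged = [s for s in scored if abs(s.student_p - s.teacher_p) > min_gap]
    # Spend the labeling budget on the strongest disagreements.
    flagged.sort(key=lambda s: abs(s.student_p - s.teacher_p), reverse=True)
    return [s.review_id for s in flagged[:budget]]


# Toy scores, invented for illustration only.
batch = [
    Scored("r1", 0.90, 0.85),
    Scored("r2", 0.20, 0.95),
    Scored("r3", 0.60, 0.10),
    Scored("r4", 0.40, 0.75),
]
queue = mine_disagreements(batch, min_gap=0.3, budget=2)
```

With the figures above, `min_gap=0.3` corresponds to the flagging threshold and `budget=500` to the human-review batch per cycle.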

Distribution Calibration

Training on 50/50 balanced data miscalibrates predicted probabilities in production, where tips are only ~8% of reviews, so the raw model overestimates how likely a review is to contain a tip. Mitigation:

  • Temperature Scaling: Post-hoc calibration on held-out set with true distribution
  • Threshold Tuning: Operating threshold set to 0.72 (vs. default 0.5) to optimize precision-recall tradeoff
  • Production Output: Inference emits calibrated probabilities, not raw logits
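
A minimal sketch of temperature scaling followed by thresholding, assuming a binary tip/no-tip head that emits one logit per review. The helper names, the grid search (used here in place of a gradient-based optimizer), and the toy held-out logits are illustrative assumptions; only the 0.72 threshold comes from the text above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of logits scaled by 1/temperature."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def fit_temperature(logits, labels):
    """Pick the temperature on a 0.5..5.0 grid that minimizes held-out NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits, labels, t))


def predict(logit, temperature, threshold=0.72):
    """Return (calibrated probability, is_tip) for one review."""
    p = sigmoid(logit / temperature)
    return p, p >= threshold


# Toy held-out set, invented for illustration: confident logits, two of them wrong.
held_logits = [4.0, 3.5, 3.0, 2.5, 2.0, -4.0, -3.5, -3.0, -2.5, -2.0, 3.0, -3.0]
held_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

T = fit_temperature(held_logits, held_labels)
prob, is_tip = predict(2.0, T)
```

On overconfident logits like these, the fitted temperature comes out above 1, which pulls probabilities toward 0.5 before the threshold is applied.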

Teacher Bias Mitigation

LLM-generated labels introduce teacher model biases:

  • Bias Detection: Compared LLM labels to human labels on 200 samples; identified systematic under-labeling of negative tips (“avoid…”)
  • Correction: Augmented training data with human-labeled negative tip examples
  • Validation: Final model evaluated only on human-labeled test set (not on LLM labels)
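
One way the bias-detection comparison could be run is to measure how often the LLM labels recover human-labeled tips within each slice. The sketch below is an assumption about the mechanics, not the project's pipeline: the slice names, the `llm_recall_by_slice` helper, and the audit rows are all made up for illustration.

```python
def llm_recall_by_slice(samples):
    """samples: iterable of (slice_name, human_label, llm_label), labels are 0/1.

    Returns {slice_name: recall of the LLM labels measured against human labels}.
    """
    hits, positives = {}, {}
    for slice_name, human, llm in samples:
        if human == 1:  # only human-confirmed tips count toward recall
            positives[slice_name] = positives.get(slice_name, 0) + 1
            hits[slice_name] = hits.get(slice_name, 0) + (1 if llm == 1 else 0)
    return {s: hits[s] / positives[s] for s in positives}


# Toy audit rows, invented for illustration only.
audit = [
    ("negative_tip", 1, 0), ("negative_tip", 1, 1),
    ("negative_tip", 1, 0), ("negative_tip", 1, 0),
    ("positive_tip", 1, 1), ("positive_tip", 1, 1),
    ("positive_tip", 1, 1), ("positive_tip", 1, 0),
    ("positive_tip", 0, 0),
]
recall = llm_recall_by_slice(audit)
```

A slice whose recall sits well below the others, as the negative-tip slice does in this toy data, is a candidate for targeted human-labeled augmentation.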

Results

Metric                           Few-Shot GPT-4  Fine-Tuned DeBERTa  Method
F1 Score                         0.76            0.84                Human-labeled test set (n=500)
Precision                        0.72            0.88
Recall                           0.81            0.80
Inference Cost (per 1K reviews)  $4.20           $0.03               API vs. self-hosted
Latency (P95)                    1.2s            22ms

The fine-tuned model achieves a 10.5% relative F1 improvement over the few-shot GPT-4 baseline (0.76 to 0.84) while reducing inference cost by 99.3%.
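
The headline figures follow directly from the table values; a short check, using only numbers reported above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gpt4_f1 = f1(0.72, 0.81)      # few-shot GPT-4 row
deberta_f1 = f1(0.88, 0.80)   # fine-tuned DeBERTa row

# Relative F1 gain, computed from the rounded F1 values in the table.
relative_gain = (0.84 - 0.76) / 0.76

# Cost reduction per 1K reviews, API vs. self-hosted.
cost_reduction = (4.20 - 0.03) / 4.20
```

Both F1 values round to the reported 0.76 and 0.84, and the two ratios round to 10.5% and 99.3%.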

System Design & Architecture

  • Training: PyTorch + HuggingFace Transformers; 3 epochs, learning rate 2e-5, batch size 32
  • Serving: ONNX-optimized model on CPU instances (cost-effective for this workload)
  • Pipeline: Batch inference nightly on new reviews; results cached in Elasticsearch

Risks & Mitigations

Risk                 Impact                               Mitigation                                     Monitoring
Class Imbalance      Model biased toward majority class   Balanced training + threshold calibration      Per-class precision/recall weekly
Teacher Bias         LLM errors propagated to student     Human validation set; bias detection pipeline  Disagreement rate tracking
Diminishing Returns  Continued labeling wastes resources  Stop when F1 improvement <1% per cycle         Learning curve monitoring
Distribution Shift   Review language evolves over time    Quarterly re-evaluation on fresh samples       F1 trend monitoring monthly
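
The diminishing-returns stopping rule can be expressed as a small check over the per-cycle F1 history. This sketch assumes "<1%" means one absolute F1 point (0.01); the text does not say whether the threshold is absolute or relative, and the `should_stop` helper and example histories are hypothetical.

```python
def should_stop(f1_history, min_gain=0.01):
    """True when the latest labeling cycle improved F1 by less than min_gain.

    Assumes f1_history holds one F1 value per completed cycle, oldest first,
    and that min_gain is an absolute F1 difference (an assumption).
    """
    if len(f1_history) < 2:
        return False  # not enough cycles to measure a gain
    return (f1_history[-1] - f1_history[-2]) < min_gain


# Made-up histories for illustration, not project measurements.
keep_going = should_stop([0.70, 0.75])
stop_now = should_stop([0.70, 0.75, 0.78, 0.785])
```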