Cross-Portfolio Engineering Practices
The following practices are applied consistently across all projects:
A/B Testing Methodology
- Sample Size: Minimum detectable effect calculated pre-experiment; typical n > 500K sessions
- Duration: Minimum 2 weeks to capture weekly seasonality; 3 weeks for high-stakes changes
- Metrics: Primary metric (e.g., NDCG) + guardrail metrics (latency, error rate, revenue)
- Analysis: Bayesian analysis with 95% credible intervals; sequential testing for early stopping (see the readout sketch after this list)
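As a concrete illustration of this readout, here is a minimal sketch assuming a binary session-level success metric (e.g., click) and a flat Beta(1, 1) prior; the function names and counts are illustrative, not from a production pipeline:

```python
# Bayesian A/B readout sketch: Beta-Binomial posterior per arm.
import numpy as np
from scipy import stats

def credible_interval(successes: int, trials: int, level: float = 0.95):
    """Posterior credible interval for a conversion-style rate, Beta(1, 1) prior."""
    posterior = stats.beta(1 + successes, 1 + trials - successes)
    lo, hi = posterior.ppf([(1 - level) / 2, (1 + level) / 2])
    return lo, hi

def prob_treatment_better(s_c, n_c, s_t, n_t, draws: int = 100_000) -> float:
    """Monte Carlo estimate of P(treatment rate > control rate)."""
    rng = np.random.default_rng(0)
    control = rng.beta(1 + s_c, 1 + n_c - s_c, draws)
    treatment = rng.beta(1 + s_t, 1 + n_t - s_t, draws)
    return float((treatment > control).mean())

# Example at the scale of the sample-size guideline (~520K sessions per arm).
print(credible_interval(26_400, 520_000))
print(prob_treatment_better(26_400, 520_000, 27_100, 520_000))
```

For sequential testing, the same posterior can be recomputed at each look, with stopping rules acting on P(treatment > control) rather than a fixed-horizon p-value.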
Offline-Online Metric Alignment
Offline metric gains often fail to translate into equivalent online gains. To manage this gap, we apply the following practices:
- Calibration Studies: Quarterly analysis of offline lift vs. online lift across past experiments
- Discount Factor: Apply a 0.7x multiplier to offline gains when projecting online impact (see the projection sketch after this list)
- Hybrid Evaluation: Interleaving experiments for ranking models provide an online signal without a full traffic split
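A minimal sketch of the projection step, assuming offline lift is expressed as a relative gain on the primary metric; the constant encodes the calibration-derived 0.7x discount above, and the names are illustrative:

```python
# Project expected online impact from an observed offline gain.
OFFLINE_TO_ONLINE_DISCOUNT = 0.7  # from quarterly calibration studies

def project_online_lift(offline_lift: float,
                        discount: float = OFFLINE_TO_ONLINE_DISCOUNT) -> float:
    """Discount an offline lift to a projected online lift."""
    return offline_lift * discount

# A +2.0% offline NDCG gain projects to roughly +1.4% online.
print(f"{project_online_lift(0.02):.1%}")
```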
Model Monitoring & Drift Detection
- Feature Drift: KL divergence between training and serving feature distributions; alert at >0.1 (see the drift-check sketch after this list)
- Prediction Drift: Monitor prediction distribution shift; alert at >5% mean shift
- Performance Monitoring: Delayed labels are used to compute online metrics with a 7-day lag
- Tooling: Arize for ML observability; custom Grafana dashboards for business metrics
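A minimal sketch of the feature-drift check, assuming numeric features binned onto a shared histogram support; the KL direction (serving relative to training) and Laplace smoothing are assumptions, while the >0.1 threshold matches the alert rule above:

```python
import numpy as np
from scipy.stats import entropy

def kl_drift(train: np.ndarray, serving: np.ndarray, bins: int = 50) -> float:
    """KL(serving || training) over a shared histogram support."""
    edges = np.histogram_bin_edges(np.concatenate([train, serving]), bins=bins)
    p, _ = np.histogram(serving, bins=edges)
    q, _ = np.histogram(train, bins=edges)
    # Laplace smoothing avoids infinite KL when a bin is empty on one side.
    p = (p + 1) / (p + 1).sum()
    q = (q + 1) / (q + 1).sum()
    return float(entropy(p, q))  # scipy's entropy(p, q) is KL(p || q)

def should_alert(train: np.ndarray, serving: np.ndarray,
                 threshold: float = 0.1) -> bool:
    return kl_drift(train, serving) > threshold
```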
Data Versioning & Reproducibility
- Dataset Versioning: DVC for training data; each model tagged with data version hash
- Experiment Tracking: MLflow for hyperparameters, metrics, and model artifacts (see the tracking sketch after this list)
- Reproducibility: Docker images pinned with exact dependency versions; random seeds logged
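A minimal sketch of the tracking convention using MLflow's standard Python API; exposing the DVC data-version hash through an environment variable, and the specific tag/param names, are illustrative choices rather than a fixed schema:

```python
import os
import random

import mlflow
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

with mlflow.start_run():
    # Tie the run to its exact training data and seed for reproducibility.
    mlflow.set_tag("data_version", os.environ.get("DVC_DATA_HASH", "unknown"))
    mlflow.log_param("random_seed", SEED)
    mlflow.log_param("learning_rate", 1e-3)
    # ... train the model ...
    mlflow.log_metric("ndcg_at_10", 0.42)
```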
Embedding Model Selection Framework
When selecting embedding models, evaluate candidates against the criteria below; a scoring sketch follows the table:
| Criterion | Measurement | Threshold |
|---|---|---|
| Task Relevance | Downstream task performance (e.g., retrieval recall) | Top-2 on internal benchmark |
| Latency | P95 inference time | <20ms for online; <100ms for offline |
| Dimension/Cost | Storage and compute cost | Balance with quality needs |
| Language Coverage | Performance on non-English data | >90% of English performance |
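A minimal sketch of applying these thresholds to candidate models; the Candidate fields and example figures are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    benchmark_rank: int        # rank on the internal retrieval benchmark
    p95_latency_ms: float      # measured P95 inference time
    non_english_ratio: float   # non-English quality / English quality

def passes(c: Candidate, online: bool = True) -> bool:
    """Check a candidate against the table's hard thresholds."""
    latency_budget_ms = 20.0 if online else 100.0
    return (c.benchmark_rank <= 2
            and c.p95_latency_ms < latency_budget_ms
            and c.non_english_ratio > 0.90)

print(passes(Candidate("model-a", benchmark_rank=1,
                       p95_latency_ms=14.0, non_english_ratio=0.93)))
```

Dimension/cost has no hard cutoff in the table, so it remains a judgment call weighed against quality once the hard thresholds are met.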
Failure Mode Analysis Template
Each project includes explicit failure mode documentation:
- Identify top-5 failure modes during design phase
- Implement detection mechanisms for each failure mode
- Define an automated response for each mode (alert, fallback, rollback); see the registry sketch after this list
- Post-mortem template for production incidents
- Quarterly review of failure mode coverage
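A minimal sketch of a machine-readable registry pairing each documented failure mode with a detector and its automated response; the specific modes and detector wiring here are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Response(Enum):
    ALERT = "alert"
    FALLBACK = "fallback"
    ROLLBACK = "rollback"

@dataclass
class FailureMode:
    name: str
    detector: Callable[[], bool]  # returns True when the mode is active
    response: Response

# Hypothetical top failure modes identified at design time.
REGISTRY = [
    FailureMode("feature_pipeline_stale", lambda: False, Response.FALLBACK),
    FailureMode("prediction_drift", lambda: False, Response.ALERT),
    FailureMode("latency_spike", lambda: False, Response.ROLLBACK),
]

def sweep(registry: list[FailureMode]) -> list[tuple[str, Response]]:
    """Run every detector; return the triggered (mode, response) pairs."""
    return [(m.name, m.response) for m in registry if m.detector()]
```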
Latency-Accuracy Tradeoff Documentation
All ML systems document their Pareto frontier:
- Model Variants: Test 3+ model sizes/architectures
- Pareto Chart: Plot accuracy vs. latency and identify knee points (see the frontier sketch after this list)
- Operating Point: Document chosen tradeoff with business justification
- Fallback Tiers: Define degraded modes for latency spikes
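A minimal sketch of extracting the Pareto frontier from measured variants, assuming lower latency and higher accuracy are both preferred; the variant figures are illustrative:

```python
def pareto_frontier(variants: list[tuple[str, float, float]]) -> list[str]:
    """variants: (name, p95_latency_ms, accuracy). Returns non-dominated names."""
    frontier, best_acc = [], float("-inf")
    # Walk variants from fastest to slowest; keep any that beat the best
    # accuracy seen so far (i.e., no faster variant dominates it).
    for name, _, acc in sorted(variants, key=lambda v: v[1]):
        if acc > best_acc:
            frontier.append(name)
            best_acc = acc
    return frontier

variants = [("small", 8.0, 0.78), ("medium", 18.0, 0.84), ("large", 45.0, 0.85)]
print(pareto_frontier(variants))  # all three survive; the knee sits near "medium"
```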