The following practices are applied consistently across all projects:

A/B Testing Methodology

  • Sample Size: Minimum detectable effect calculated pre-experiment; typical n > 500K sessions
  • Duration: Minimum 2 weeks to capture weekly seasonality; 3 weeks for high-stakes changes
  • Metrics: Primary metric (e.g., NDCG) + guardrail metrics (latency, error rate, revenue)
  • Analysis: Bayesian analysis with 95% credible intervals; sequential testing for early stopping (see the sketch below)
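
A minimal sketch of the analysis step, assuming a Beta(1, 1) prior over a binary success metric (a ranking metric like NDCG would call for a different likelihood); the counts and the `credible_interval_for_lift` helper are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def credible_interval_for_lift(succ_a, n_a, succ_b, n_b,
                               draws=100_000, level=0.95):
    """Monte Carlo credible interval for B's relative lift over A
    under independent Beta(1, 1) priors on each success rate."""
    post_a = rng.beta(1 + succ_a, 1 + n_a - succ_a, draws)
    post_b = rng.beta(1 + succ_b, 1 + n_b - succ_b, draws)
    lift = (post_b - post_a) / post_a
    lo, hi = np.quantile(lift, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi, float((lift > 0).mean())

# Illustrative counts at the ~500K-sessions-per-arm scale noted above.
lo, hi, p_b_better = credible_interval_for_lift(24_300, 500_000, 24_900, 500_000)
print(f"95% credible interval for lift: [{lo:+.2%}, {hi:+.2%}]; P(B > A) = {p_b_better:.1%}")
```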

Offline-Online Metric Alignment

Offline metric improvements often fail to translate into equivalent online gains. Practices to close the gap:

  • Calibration Studies: Quarterly analysis of offline lift vs. online lift across past experiments
  • Discount Factor: Apply a 0.7x multiplier to offline gains when projecting online impact (see the calibration sketch below)
  • Hybrid Evaluation: Interleaving experiments for ranking models provide an online signal without a full traffic split
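
A minimal sketch of how the calibration and discount bullets fit together; the paired lift numbers are hypothetical, and a least-squares fit through the origin is one reasonable way to derive the multiplier:

```python
import numpy as np

# Hypothetical paired lifts from past experiments (quarterly calibration data).
offline_lifts = np.array([0.021, 0.015, 0.034, 0.008])
online_lifts  = np.array([0.014, 0.010, 0.025, 0.006])

# Least-squares slope through the origin recovers the discount multiplier.
discount = float(offline_lifts @ online_lifts / (offline_lifts @ offline_lifts))
print(f"calibrated discount: {discount:.2f}x")  # ~0.7x with these numbers

def project_online_lift(offline_lift: float) -> float:
    """Project expected online impact from an observed offline gain."""
    return discount * offline_lift

print(f"+2.0% offline -> {project_online_lift(0.02):+.2%} projected online")
```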

Model Monitoring & Drift Detection

  • Feature Drift: KL divergence between training and serving feature distributions; alert at >0.1 (see the sketch below)
  • Prediction Drift: Monitor prediction distribution shift; alert at >5% mean shift
  • Performance Monitoring: Delayed labels used to compute online metrics with a 7-day lag
  • Tooling: Arize for ML observability; custom Grafana dashboards for business metrics
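
A minimal sketch of the feature- and prediction-drift checks, assuming a shared histogram binning with additive smoothing for the KL estimate; the bin count and epsilon are illustrative choices:

```python
import numpy as np

KL_ALERT = 0.1           # feature-drift alert threshold from above
MEAN_SHIFT_ALERT = 0.05  # 5% prediction mean-shift threshold from above

def kl_divergence(train_vals, serving_vals, bins=50, eps=1e-9):
    """Estimate KL(train || serving) for one feature over a shared
    histogram binning; additive smoothing handles empty bins."""
    edges = np.histogram_bin_edges(np.concatenate([train_vals, serving_vals]), bins=bins)
    p, _ = np.histogram(train_vals, bins=edges)
    q, _ = np.histogram(serving_vals, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def feature_drift_alert(train_vals, serving_vals) -> bool:
    return kl_divergence(train_vals, serving_vals) > KL_ALERT

def prediction_drift_alert(train_preds, serving_preds) -> bool:
    shift = abs(serving_preds.mean() - train_preds.mean()) / abs(train_preds.mean())
    return shift > MEAN_SHIFT_ALERT
```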

Data Versioning & Reproducibility

  • Dataset Versioning: DVC for training data; each model tagged with data version hash
  • Experiment Tracking: MLflow for hyperparameters, metrics, and model artifacts (see the sketch below)
  • Reproducibility: Docker images pinned with exact dependency versions; random seeds logged
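
A minimal sketch of these conventions using MLflow's Python API; the `data_version` tag standing in for the DVC hash and the placeholder metric value are assumptions:

```python
import random
import numpy as np
import mlflow

def run_experiment(params: dict, data_version: str, seed: int = 7):
    """Seed everything, tag the run with the data-version hash,
    and log params/metrics so the run is fully reproducible."""
    random.seed(seed)
    np.random.seed(seed)

    with mlflow.start_run():
        mlflow.set_tag("data_version", data_version)  # e.g., the DVC hash
        mlflow.log_param("seed", seed)
        mlflow.log_params(params)

        # ... train and evaluate here; the metric value is a placeholder ...
        mlflow.log_metric("ndcg", 0.42)
```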

Embedding Model Selection Framework

When selecting embedding models, evaluate:

  Criterion         | Benchmark                                             | Threshold
  ------------------|-------------------------------------------------------|--------------------------------
  Task Relevance    | Downstream task performance (e.g., retrieval recall)  | Top-2 on internal benchmark
  Latency           | P95 inference time                                    | <20ms online; <100ms offline
  Dimension/Cost    | Storage and compute cost                              | Balanced against quality needs
  Language Coverage | Performance on non-English data                       | >90% of English performance
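
A minimal sketch of measuring the latency criterion above; `embed_fn` and `inputs` are placeholders for the candidate model's encode call and a representative sample of production payloads:

```python
import time
import numpy as np

def p95_latency_ms(embed_fn, inputs, warmup=10):
    """Measure P95 per-item inference latency in milliseconds."""
    for x in inputs[:warmup]:  # warm caches/JIT before timing
        embed_fn(x)
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        embed_fn(x)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return float(np.percentile(timings_ms, 95))

# Gate candidates against the thresholds above: <20ms online, <100ms offline.
```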

Failure Mode Analysis Template

Each project includes explicit failure mode documentation:

  1. Identify the top 5 failure modes during the design phase
  2. Implement detection mechanisms for each failure mode
  3. Define automated responses (alert, fallback, rollback); see the registry sketch below
  4. Use a standard post-mortem template for production incidents
  5. Review failure mode coverage quarterly
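
A minimal sketch of steps 2-3 as a failure-mode registry; the mode names, metric keys, and thresholds are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailureMode:
    """One documented failure mode: a detector over a metrics snapshot
    and the automated response to trigger when it fires."""
    name: str
    detect: Callable[[dict], bool]
    response: str  # "alert" | "fallback" | "rollback"

# Hypothetical registry entries for illustration.
REGISTRY = [
    FailureMode("embedding service latency spike",
                lambda m: m["p95_latency_ms"] > 200, "fallback"),
    FailureMode("null feature rate spike",
                lambda m: m["null_rate"] > 0.05, "alert"),
]

def firing_modes(metrics: dict):
    """Return (name, response) for every failure mode currently detected."""
    return [(fm.name, fm.response) for fm in REGISTRY if fm.detect(metrics)]

print(firing_modes({"p95_latency_ms": 240, "null_rate": 0.01}))
```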

Latency-Accuracy Tradeoff Documentation

All ML systems document their Pareto frontier:

  • Model Variants: Test 3+ model sizes/architectures
  • Pareto Chart: Plot accuracy vs. latency; identify knee points (see the frontier sketch below)
  • Operating Point: Document chosen tradeoff with business justification
  • Fallback Tiers: Define degraded modes for latency spikes
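
A minimal sketch of extracting the Pareto frontier from measured (latency, accuracy) pairs; the variant names and numbers are hypothetical:

```python
def pareto_frontier(variants):
    """Keep variants not dominated on (latency, accuracy): no other
    variant is both faster and at least as accurate.
    variants: list of (name, latency_ms, accuracy) tuples."""
    frontier = []
    for name, lat, acc in sorted(variants, key=lambda v: (v[1], -v[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, lat, acc))
    return frontier

# Hypothetical measured variants; "old" is dominated and drops out.
variants = [("small", 4, 0.81), ("base", 11, 0.86),
            ("old", 15, 0.84), ("large", 38, 0.88)]
print(pareto_frontier(variants))
# [('small', 4, 0.81), ('base', 11, 0.86), ('large', 38, 0.88)]
```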