The setup

The existing fraud-detection stack was rules-based: a hand-curated set of about 1,400 rules accumulated over a decade. New rules took six weeks to ship: write, train, regulatory review, deploy. Recall was plateauing, the business was losing fraud-loss ground every quarter, and merchants were demanding faster decline decisions.

The mandate: machine-learning fraud scoring at sub-10ms P99, deployed in the payment authorization path, with a credible route through regulatory review. The existing rules engine would remain as a backstop and as the source of training labels.

Lesson 1: The feature store is the hardest part

Models are easy. Feature stores are not. Half the engineering work went into building a feature pipeline that could compute the right values for each transaction in real time:

  • Hot features in Redis — recent activity per merchant, per device, per card, refreshed on every transaction
  • Warm features in Cassandra — 30-day rolling aggregates updated by streaming Kafka consumers
  • Cold features in BigQuery — full historical lookups for new entities

The feature retrieval budget was 4ms. Every percentile mattered. Hot/warm/cold separation kept the P99 stable while the data team kept enriching the model with new aggregations.
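To make the tiering concrete, here is a minimal sketch of the fan-out-with-fallback pattern in Python with asyncio. The HotStore/WarmStore/ColdStore classes, key names, and payloads are illustrative stand-ins, not our production clients; the point is the shared deadline and graceful degradation to defaults when a tier misses the budget.

    import asyncio
    import time

    # Hypothetical stand-ins for the three tiers; values are illustrative.
    class HotStore:
        """Redis tier: recent per-card / per-device / per-merchant activity."""
        async def get(self, key: str) -> dict:
            return {"txn_count_5m": 3}

    class WarmStore:
        """Cassandra tier: 30-day rolling aggregates."""
        async def get(self, key: str) -> dict:
            return {"avg_ticket_30d": 42.10}

    class ColdStore:
        """BigQuery tier: historical lookups for new entities."""
        async def get(self, key: str) -> dict:
            return {"first_seen_days": 911}

    FEATURE_BUDGET_S = 0.004  # the 4ms retrieval budget

    async def fetch_features(txn: dict, hot: HotStore, warm: WarmStore,
                             cold: ColdStore) -> dict:
        """Fan out to all tiers in parallel under one deadline; on timeout,
        fall back to defaults instead of blowing the budget."""
        deadline = time.monotonic() + FEATURE_BUDGET_S

        async def bounded(coro, default: dict) -> dict:
            try:
                remaining = max(0.0, deadline - time.monotonic())
                return await asyncio.wait_for(coro, timeout=remaining)
            except asyncio.TimeoutError:
                return default  # degrade gracefully: score with defaults

        hot_f, warm_f, cold_f = await asyncio.gather(
            bounded(hot.get(txn["card_id"]), {}),
            bounded(warm.get(txn["merchant_id"]), {}),
            bounded(cold.get(txn["entity_id"]), {}),
        )
        return {**cold_f, **warm_f, **hot_f}  # hotter tiers win on collisions

    print(asyncio.run(fetch_features(
        {"card_id": "c1", "merchant_id": "m1", "entity_id": "e1"},
        HotStore(), WarmStore(), ColdStore())))

The merge order is the design choice that matters: cold defaults first, then warm, then hot, so the freshest tier always wins when the same key appears at multiple tiers.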

Lesson 2: Don't deploy a model without an eval harness

Before training started, we built the eval harness. Holdout sets per merchant category, per geography, per card type, per fraud taxonomy. The model's performance on each was tracked separately. We could see when the model regressed for travel merchants while improving for retail.

The eval harness ran on every PR that touched the model code or features. CI failed if any cell in the matrix regressed beyond a threshold. The team caught a bad feature interaction in week three of training that would have cost ~$2M annually if shipped.
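A minimal sketch of the gate itself, with per-cell metrics keyed by (segment, metric). The cells, values, and 0.01 threshold are made up for illustration; the real harness tracked far more slices.

    import sys

    # Baseline metrics per eval cell; illustrative values only.
    BASELINE = {("travel", "recall"): 0.81, ("retail", "recall"): 0.86}
    THRESHOLD = 0.01  # max tolerated drop per cell

    def check_eval_matrix(candidate: dict[tuple[str, str], float]) -> list[str]:
        """Return one failure message per regressed cell; empty list == pass."""
        failures = []
        for cell, base in BASELINE.items():
            new = candidate.get(cell)
            if new is None or base - new > THRESHOLD:
                failures.append(f"{cell}: baseline {base:.3f} -> candidate {new}")
        return failures

    if __name__ == "__main__":
        failures = check_eval_matrix({("travel", "recall"): 0.79,
                                      ("retail", "recall"): 0.87})
        for f in failures:
            print("REGRESSION", f)
        sys.exit(1 if failures else 0)  # non-zero exit fails the CI job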

Lesson 3: Shadow before live, always

Six weeks of shadow scoring before any decision authority. The model scored every transaction in parallel with the rules engine. Decisions stayed with the rules engine; the model's scores were logged, compared, and reviewed weekly.

Shadow surfaced two production patterns the offline eval missed entirely. First, a class of merchants with regional traffic spikes that broke the time-of-day feature. Second, a chargeback-pattern interaction we hadn't seen in the holdout sets. Both were fixable in shadow; both would have been outages in live decisioning.
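The shadow wrapper itself is small; the discipline is in the failure handling. A sketch, with rules_engine and model as hypothetical interfaces standing in for our real services:

    import logging

    log = logging.getLogger("shadow")

    def authorize(txn: dict, rules_engine, model) -> str:
        """Rules engine keeps decision authority; the model scores in shadow."""
        decision = rules_engine.decide(txn)  # the only decision that counts
        try:
            score = model.score(txn)  # shadow path: log, never act
            log.info("shadow score", extra={"txn": txn["id"], "score": score,
                                            "rules_decision": decision})
        except Exception:
            log.exception("shadow scoring failed")  # shadow failures never block auth
        return decision

The try/except is the whole point: a broken shadow model must be invisible to the authorization path.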

Lesson 4: Sub-10ms means engineering, not just ML

Triton for serving (NVIDIA Triton Inference Server). XGBoost models compiled with Treelite for CPU-only inference. P99 budget allocation: 4ms feature retrieval, 3ms model scoring, 2ms decision logic + serialization, 1ms network. Every component had its own SLO and its own dashboard.

The win: 8ms P99 scoring latency, sustained in production. The discipline: every team owned its component's latency budget and was paged on any breach.
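A sketch of the per-component budget check behind the paging, using the budget numbers from above. In production this read from real latency histograms; the in-memory sample lists here are a stand-in.

    import numpy as np

    # The P99 budget allocation from the breakdown above, in milliseconds.
    BUDGET_MS = {"feature_retrieval": 4.0, "model_scoring": 3.0,
                 "decision_logic": 2.0, "network": 1.0}

    def breached_slos(samples_ms: dict[str, list[float]]) -> dict[str, float]:
        """Return {component: observed_p99} for every component over budget."""
        breaches = {}
        for component, budget in BUDGET_MS.items():
            p99 = float(np.percentile(samples_ms[component], 99))
            if p99 > budget:
                breaches[component] = p99  # page the owning team
        return breaches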

Lesson 5: Drift detection is not a nice-to-have

Models in production silently degrade. The drift detection setup we shipped:

  • Feature drift — KS test on every feature, daily (see the sketch after this list). Page if more than 3 features drift simultaneously.
  • Score drift — distribution of scores trended over rolling 7-day windows. Page on significant shifts.
  • Performance proxy — true labels (chargebacks) lag by 60-90 days, but precursor signals (decline patterns, device trust scores) shift faster. Track those weekly.
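A minimal sketch of the per-feature check, using scipy's two-sample KS test. The significance level is illustrative, not our production value; the paging rule matches the first bullet.

    import numpy as np
    from scipy.stats import ks_2samp

    ALPHA = 0.01        # illustrative significance level
    PAGE_THRESHOLD = 3  # matches the "more than 3 features" paging rule

    def drifted_features(reference: dict[str, np.ndarray],
                         live: dict[str, np.ndarray]) -> list[str]:
        """Two-sample KS test per feature: today's traffic vs. a reference window."""
        return [name for name, ref in reference.items()
                if ks_2samp(ref, live[name]).pvalue < ALPHA]

    def should_page(drifted: list[str]) -> bool:
        return len(drifted) > PAGE_THRESHOLD

    # Example: one genuinely shifted feature out of two.
    rng = np.random.default_rng(0)
    ref = {"amount": rng.normal(50, 10, 10_000), "velocity": rng.normal(2, 1, 10_000)}
    live = {"amount": rng.normal(50, 10, 10_000), "velocity": rng.normal(4, 1, 10_000)}
    print(drifted_features(ref, live))  # expect ['velocity']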

Retraining ran on an automated quarterly schedule, with out-of-cycle runs triggered when drift alerts fired. The team's job: review and approve, not write boilerplate retraining code.

Lesson 6: Explainability is a regulatory requirement, not a product feature

Every adverse action (decline) had to be explainable to a regulator. The model's top-3 contributing features for each decline, in human-readable form, captured at decision time and retained for the regulatory window. Built using SHAP value computation, cached and joined back to the decision log.
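A sketch of the top-3 extraction, assuming an XGBoost-style tree model wrapped in shap's TreeExplainer. The feature names and reason strings are invented for illustration; the real mapping was maintained with compliance.

    import numpy as np
    import shap  # shap.TreeExplainer supports XGBoost-style tree models

    # Hypothetical mapping from feature names to human-readable reason text.
    REASON_TEXT = {
        "txn_velocity_1h": "Unusually high transaction velocity on this card",
        "device_trust": "Low trust score for the device used",
        "geo_mismatch": "Location inconsistent with the card's history",
    }

    def top3_reasons(explainer: "shap.TreeExplainer",
                     feature_names: list[str], x_row: np.ndarray) -> list[str]:
        """Top-3 contributing features for one decline, ranked by |SHAP value|.
        Build the explainer once per model version and cache it."""
        shap_values = explainer.shap_values(x_row.reshape(1, -1))[0]
        top = np.argsort(-np.abs(shap_values))[:3]
        return [REASON_TEXT.get(feature_names[i], feature_names[i]) for i in top]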

This was the deal-breaker for the regulatory review board. Without it, the model wouldn't have been approved for live decisioning regardless of its accuracy.

The outcomes

  • Fraud loss reduction: 31% in the first quarter post-launch
  • P99 scoring latency: 8ms sustained
  • New model versions deploy in days, not the prior six weeks
  • Drift detection triggered three out-of-cycle retrainings in the first year, each catching regression before fraud-loss ground was lost

The lesson that's hardest to convey to teams that haven't shipped real-time ML: the model is the easy part. The platform around it — features, eval, shadow, explainability — is where 70% of the engineering goes. Plan accordingly.
