
Card fraud costs the global financial system roughly $30 billion annually. The most effective defences share a common requirement: speed. A fraud model that takes 800ms to score is too slow to block a card-not-present transaction before authorization completes. Yet the features that predict fraud best—velocity patterns, network distances, behavioural anomalies—require aggregates over historical transaction streams, computed fresh for every new event.

Architecture Overview

The streaming fraud detection stack has four planes: the event plane (inbound transactions via Kafka), the feature plane (real-time feature computation and online store), the inference plane (model serving with sub-50ms SLO), and the feedback plane (confirmed fraud labels flowing back for model retraining).

Figure: End-to-End Pipeline Latency by Stage. Per-stage latency on the inference path; p50 and p99 measured at 8,000 TPS sustained load. Target SLO: p99 < 95ms total.

The dominant latency contributors are feature store reads and model inference. Feature store latency is primarily a function of the number of aggregations required (each requiring a Redis HGETALL) and serialisation overhead. Model inference time is a function of model complexity—a gradient-boosted tree with 200 estimators at depth 6 typically takes 8–12ms; a neural network with embedding layers can take 25–40ms.
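Because each additional window adds another HGETALL, the usual mitigation is to batch all reads for a transaction into a single Redis pipeline (one round trip instead of one per window) and flatten the results into a single feature dict. A minimal sketch, assuming keys of the form `vel_15m:<pan_token>` (the exact key schema of the feature store is not shown in this article, and the helper names are illustrative):

```python
def feature_keys(pan_token, prefixes):
    """Build the online-store keys for one card across window prefixes."""
    return [f"{prefix}:{pan_token}" for prefix in prefixes]

def flatten_features(pipe_results, prefixes):
    """Flatten a batch of HGETALL results into one feature dict.

    `pipe_results` is the list of hash dicts returned by executing a
    Redis pipeline of HGETALL calls -- one network round trip total.
    Field names are prefixed with their window so they stay distinct.
    """
    features = {}
    for prefix, hash_fields in zip(prefixes, pipe_results):
        for field, value in hash_fields.items():
            features[f"{prefix}_{field}"] = float(value)
    return features
```

With redis-py, `pipe = r.pipeline()`, one `pipe.hgetall(key)` per key, then `pipe.execute()` would produce `pipe_results` in a single round trip.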

Velocity Features

The most predictive real-time features are velocity counts and aggregations over rolling time windows. These answer questions like "how many transactions has this card made in the last 15 minutes?" or "what is the ratio of this transaction amount to the 7-day average spend?"

Python / Flink — velocity feature computation
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows
from pyflink.common.time import Time

def compute_velocity_features(txn_stream):
    """Compute sliding-window velocity features per card.

    VelocityAggregator (an AggregateFunction) and FeatureStoreSink
    (a Redis sink) are application-defined classes, not Flink built-ins.
    """

    velocity_15m = (
        txn_stream
        .key_by(lambda t: t['pan_token'])
        .window(SlidingEventTimeWindows.of(
            Time.minutes(15), Time.minutes(1)
        ))
        .aggregate(VelocityAggregator(
            count=True, sum_amount=True, distinct_merchants=True
        ))
    )

    velocity_1h = (
        txn_stream
        .key_by(lambda t: t['pan_token'])
        .window(SlidingEventTimeWindows.of(
            Time.hours(1), Time.minutes(5)
        ))
        .aggregate(VelocityAggregator(
            count=True, sum_amount=True, distinct_countries=True
        ))
    )

    # Write to online feature store (Redis)
    velocity_15m.add_sink(FeatureStoreSink(prefix='vel_15m', ttl_seconds=900))
    velocity_1h.add_sink(FeatureStoreSink(prefix='vel_1h', ttl_seconds=3600))

    return velocity_15m, velocity_1h
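The `VelocityAggregator` used above is not shown. A framework-free sketch of its semantics follows; in Flink it would implement PyFlink's `AggregateFunction` contract (`create_accumulator` / `add` / `get_result` / `merge`), and the field names here are illustrative:

```python
class VelocityAccumulator:
    """Running state for one (card, window) pane."""
    def __init__(self):
        self.count = 0
        self.sum_amount = 0.0
        self.merchants = set()

class VelocityAggregator:
    """Sketch of the per-window aggregator's semantics."""

    def create_accumulator(self):
        return VelocityAccumulator()

    def add(self, txn, acc):
        # Called once per transaction assigned to the window pane.
        acc.count += 1
        acc.sum_amount += txn["amount"]
        acc.merchants.add(txn["merchant_id"])
        return acc

    def merge(self, a, b):
        # Combine partial accumulators (needed for session/merging windows).
        a.count += b.count
        a.sum_amount += b.sum_amount
        a.merchants |= b.merchants
        return a

    def get_result(self, acc):
        # Emitted downstream when the window fires.
        return {
            "txn_count": acc.count,
            "amount_sum": round(acc.sum_amount, 2),
            "distinct_merchants": len(acc.merchants),
        }
```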

Feature Engineering for Fraud

| Feature Category | Examples | Window | Typical IV |
|---|---|---|---|
| Card Velocity | txn_count_15m, amount_sum_1h, distinct_mcc_24h | 15m – 24h | 0.38 – 0.52 |
| Amount Anomaly | amount_vs_7d_avg, amount_percentile_30d | 7d – 30d | 0.28 – 0.41 |
| Geographic | geo_distance_km (last txn), country_change_flag | Last event | 0.31 – 0.44 |
| Merchant | merchant_fraud_rate_7d, new_merchant_flag | 7d | 0.24 – 0.36 |
| Behavioural | hour_of_day_anomaly, weekend_flag, channel_change | Historical baseline | 0.14 – 0.22 |
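The `geo_distance_km` feature in the table is the great-circle distance between the current transaction's location and the previous one for the same card. A minimal haversine sketch (the function name comes from the table; the implementation is an assumption):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def geo_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

An implausibly large distance since the last transaction (e.g. Paris to Singapore within an hour) is a strong fraud signal on its own, which is why this feature pairs naturally with `country_change_flag`.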

"Velocity at 15 minutes is your most powerful feature. A fraudster who steals a card will typically test it with a small purchase, then quickly scale up. That acceleration shows up immediately in the velocity count."

Precision–Recall Trade-off

Fraud detection is a rare-event classification problem with asymmetric costs. A false positive (blocking a legitimate transaction) costs €2–4 in customer service, goodwill loss, and chargeback handling. A false negative (missing a fraud) costs the full transaction amount plus chargeback penalties. The operating threshold must balance these costs, typically targeting a false positive rate of 0.3–0.8% (i.e., 3–8 blocked legitimate transactions per 1,000) while catching 80–90% of fraud.
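The asymmetric costs above can be made operational by sweeping candidate thresholds on a labelled sample and picking the one that minimises expected cost. A sketch; the flat €3 false-positive cost and the chargeback penalty are illustrative assumptions, not figures from a real deployment:

```python
def expected_cost(scores, labels, amounts, threshold,
                  fp_cost=3.0, chargeback_penalty=15.0):
    """Total cost of operating at `threshold` on a labelled sample.

    False positive (blocked legitimate txn): flat service/goodwill cost.
    False negative (missed fraud): transaction amount plus a penalty.
    """
    cost = 0.0
    for score, label, amount in zip(scores, labels, amounts):
        blocked = score >= threshold
        if blocked and label == 0:        # false positive
            cost += fp_cost
        elif not blocked and label == 1:  # false negative
            cost += amount + chargeback_penalty
    return cost

def best_threshold(scores, labels, amounts, grid):
    """Threshold from `grid` with the lowest expected cost."""
    return min(grid, key=lambda t: expected_cost(scores, labels, amounts, t))
```

Because false-negative cost scales with transaction amount, the optimal threshold is effectively amount-dependent; some systems go further and apply a lower threshold to high-value transactions.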

Figure: Precision–Recall Curve, Champion vs Challenger Model. Champion: GBM trained on 18 months of data; challenger: same architecture retrained on 24 months. Out-of-time evaluation window: July–September 2023. Fraud prevalence: 0.41%.

Champion / Challenger Deployment

Rather than a hard model swap, production fraud systems use shadow deployment: the champion model makes the blocking decision for every transaction, while the challenger scores a 10% sample of traffic in shadow mode, recording its scores without ever blocking. After 4 weeks of shadow scoring, the challenger's would-be decisions are compared to confirmed fraud labels. If the challenger outperforms on F-score at the target operating point, it is promoted to champion.
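One way to implement the shadow split is a deterministic hash of the transaction ID, so the sample is reproducible and the champion's decision path is never affected. A minimal sketch, with the two models represented as hypothetical callables returning scores:

```python
import hashlib

SHADOW_FRACTION = 0.10

def in_shadow_sample(txn_id, fraction=SHADOW_FRACTION):
    """Deterministically map a transaction ID into [0, 1) and sample it."""
    bucket = int(hashlib.sha256(txn_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

def score_transaction(txn, champion, challenger, shadow_log):
    """Champion decides; challenger only logs on the sampled slice."""
    decision = champion(txn)  # this score alone drives block/allow
    if in_shadow_sample(txn["txn_id"]):
        # Shadow score is recorded for later comparison, never acted on.
        shadow_log.append((txn["txn_id"], challenger(txn)))
    return decision
```

In production the challenger call would run asynchronously (or off the critical path entirely) so shadow scoring cannot add latency to the authorization decision.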

Figure: Fraud Rate Trend and Model Deployment Events (Monthly). Card fraud rate (% of transaction value) with key model deployment events, Q4 2021 – Q3 2023; step-downs indicate new model versions improving detection.

Infrastructure Requirements

A production real-time fraud system at moderate scale (5,000 TPS peak) requires:

| Component | Technology | Sizing (5k TPS) | Key SLO |
|---|---|---|---|
| Event Bus | Apache Kafka | 6-broker cluster, RF=3 | Produce p99 < 5ms |
| Stream Processing | Apache Flink | 12 TaskManagers, 4 slots | Processing lag < 200ms |
| Online Feature Store | Redis Cluster | 6 shards, 64 GB each | GET p99 < 2ms |
| Model Serving | Triton / custom FastAPI | 4 GPU instances (A10) | Inference p99 < 15ms |
| Offline Store | Delta Lake / Spark | 30-node Spark cluster | Daily training < 4h |