Skip to main content
Lab
LIVE
Data Engineering

Streaming Fraud Detection Pipeline

End-to-end streaming pipeline built for real-time transaction scoring at scale. From Kafka ingestion through in-stream ML inference to production monitoring — every layer designed for sub-100ms decisioning and zero-downtime deployments.

<100ms

Scoring Latency

p99 end-to-end

10K+

Throughput

events / sec

99.9%

Uptime

SLA-grade

Architecture
IngestKafkaProcessPySparkScoreMLflowStoreDelta LakeServeFastAPI

Fixed-layout pipeline topology — data flows left to right

Real-time fraud decisioning that reduces false positives, protects revenue, and scales horizontally — built on battle-tested open-source infrastructure.

Live Pipeline Metrics
Pipeline Results Dashboard
Streaming

Medallion Layer Throughput

14.2M< 5s

Bronze

12.8M< 12s

Silver

11.1M< 45s

Gold

Layers
Layer 1

Ingestion

KafkaSchema RegistryPySpark Structured Streaming
Layer 2

Feature Engineering

PySparkWindow AggregationsFeature Store
Layer 3

ML Inference

MLflow Model RegistryXGBoostA/B Routing
Layer 4

Storage

Delta LakeMinIOTime-travel Queries
Layer 5

Serving & Monitoring

FastAPIKubernetesGrafanaPrometheus
Pipeline Flow
01Ingest

Real-Time Stream Processing

Kafka topics receive raw transaction events. PySpark Structured Streaming validates schemas, deduplicates, and computes windowed aggregations with exactly-once semantics.

02Score

In-Stream ML Inference

Feature vectors feed XGBoost models versioned in MLflow. Redis caches hot features for sub-100ms scoring. Champion/challenger routing compares model versions live.

03Act

Decisioning and Alerting

Scored transactions route through rule gates. High-risk events trigger alerts, block actions, and write to Delta Lake for audit. Grafana dashboards surface drift and latency in real time.

Stack
PySparkKafkaMLflowDelta LakeKubernetesFastAPI

This pipeline demonstrates production-grade streaming architecture — from ingestion through ML inference to real-time monitoring. Every component runs on self-hosted infrastructure with zero vendor lock-in.