What does RL-Anti-Money-Laundry do?

It trains a reinforcement-learning agent to adjust the weights and threshold of an AML risk-scoring rule per case, learning from ground-truth labels or analyst human-in-the-loop feedback, instead of using fixed hand-tuned weights.

Which RL algorithms are supported?

PPO, A2C, and DQN via Stable-Baselines3, on a custom Gymnasium contextual-bandit environment.

Is RL-Anti-Money-Laundry production-ready for AML compliance?

No. It is an educational proof of concept and is not a compliant, audited AML control.

RL-Anti-Money-Laundry — Reinforcement Learning for AML Risk Scoring

What it does

Traditional AML scoring engines apply static weights to risk factors (PEP status, suspicious-activity flags, turnover, adverse media) and compare the total to a fixed threshold. RL-Anti-Money-Laundry reframes weight tuning as a single-step contextual bandit and trains an agent (PPO / A2C / DQN) to adjust those weights case-by-case, learning from ground-truth labels or analyst human-in-the-loop feedback.
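For contrast with the adaptive approach, the static baseline described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the factor names, weight values, threshold, and the `risk_score` / `decide` helpers are all hypothetical.

```python
# Hypothetical risk factors and weights, chosen only for illustration.
FIXED_WEIGHTS = {
    "pep": 3.0,
    "suspicious_activity_flags": 2.0,
    "high_turnover": 1.5,
    "adverse_media": 2.5,
}
FIXED_THRESHOLD = 4.0  # hypothetical decision threshold


def risk_score(case, weights):
    """Weighted sum of the risk-factor values present on a case."""
    return sum(weight * case.get(name, 0.0) for name, weight in weights.items())


def decide(case, weights=FIXED_WEIGHTS, threshold=FIXED_THRESHOLD):
    """Escalate when the weighted score reaches the fixed threshold."""
    return "escalate" if risk_score(case, weights) >= threshold else "clear"


case = {"pep": 1, "high_turnover": 1}
print(risk_score(case, FIXED_WEIGHTS), decide(case))
```

In this picture, the RL agent's job is to replace the constants `FIXED_WEIGHTS` and `FIXED_THRESHOLD` with values adjusted for each case.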

24 state features · 19 discrete actions · 3 RL algorithms (PPO, A2C, DQN)

Features

🎯 Adaptive scoring

An RL policy nudges weights and the decision threshold per case, clamped to safe bounds.

🧠 Multiple algorithms

PPO, A2C, and DQN via Stable-Baselines3 on a custom Gymnasium environment.

📊 Live dashboard

React + TypeScript UI with real-time accuracy/reward curves, evaluation, and data tooling.

🔍 Hyperparameter tuning

Grid search, random search, and a TPE-inspired Bayesian optimizer, all with parallel trials.

🔌 Drop-in integration

An adaptive decision node that replaces a fixed-weight LangGraph pipeline step.

⚖️ Paired evaluation

Fair RL-vs-fixed-weights comparison on an identical, re-seeded case sequence.
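The paired-evaluation idea can be sketched as follows: both policies are scored on a case sequence regenerated from the same seed, so any difference in accuracy comes from the policy rather than from the sampled cases. Everything here is an assumption made for illustration (the `make_cases` generator, the case fields, and the two stand-in policies); the project's own evaluator may work differently.

```python
import random


def make_cases(seed, n):
    """Hypothetical synthetic generator: the same seed yields the same cases."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        risk = rng.random()
        noisy = risk + rng.uniform(-0.2, 0.2)
        cases.append({"risk": risk, "suspicious": noisy > 0.6})
    return cases


def accuracy(policy, cases):
    """Fraction of cases where the policy's decision matches the label."""
    hits = sum(policy(c) == c["suspicious"] for c in cases)
    return hits / len(cases)


def paired_eval(policy_a, policy_b, seed=123, n=200):
    # Re-seed before each run so both policies face the identical sequence.
    cases_a = make_cases(seed, n)
    cases_b = make_cases(seed, n)
    return accuracy(policy_a, cases_a), accuracy(policy_b, cases_b)


def fixed_policy(case):
    """Stand-in for the fixed-weight rule."""
    return case["risk"] > 0.8


def adaptive_policy(case):
    """Stand-in for a trained RL policy."""
    return case["risk"] > 0.6


print(paired_eval(fixed_policy, adaptive_policy))
```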

The RL formulation

A one-step contextual bandit: observe a case, pick a weight adjustment, score it, receive a reward, terminate.

Element	Definition
State	24-dim vector: LOB & reason-code one-hots, red-flag counts by severity, PEP/sanctions flags, normalised turnover, adverse media, transaction ratios, account age, prior alerts/SARs.
Action	19 discrete weight/threshold nudges (single step, double-step "BIG" variants, or no-op).
Reward	+1.0 correct suspicious/escalated · +0.5 efficient non-suspicious · −0.5 missed escalation · −1.0 wrong. In online mode, the reward comes from human-in-the-loop (HITL) gate feedback instead.
Episode	Single step — one action per AML case.
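The table above can be made concrete with a toy, dependency-free sketch that mirrors the Gymnasium `reset()` / `step()` return shapes without importing the library. It is deliberately much smaller than the real environment: two features instead of 24 and five actions instead of 19. The class name, bounds, step size, and action layout are hypothetical, and the mapping of the four reward values to outcomes is one reading of the Reward row, not a confirmed specification.

```python
import random

# Hypothetical bounds and step size for the toy example.
WEIGHT_BOUNDS = (0.0, 5.0)
THRESHOLD_BOUNDS = (1.0, 8.0)
STEP = 0.5

# action id -> (parameter to nudge, delta); action 0 is the no-op.
ACTIONS = {
    0: (None, 0.0),
    1: ("pep", +STEP),
    2: ("pep", -STEP),
    3: ("threshold", +STEP),
    4: ("threshold", -2 * STEP),  # a double-step "BIG" nudge
}


def clamp(value, bounds):
    """Keep an adjusted parameter inside its safe range."""
    low, high = bounds
    return max(low, min(high, value))


def reward_fn(escalate, suspicious):
    """Assumed reading of the reward table; the real mapping may differ."""
    if escalate and suspicious:
        return 1.0   # correct escalation
    if not escalate and not suspicious:
        return 0.5   # efficient clear
    if not escalate and suspicious:
        return -0.5  # missed escalation
    return -1.0      # escalated a non-suspicious case


class ToyAmlBandit:
    """One-step episode: reset() yields a case, step() scores it and ends."""

    def __init__(self, cases, seed=0):
        self.cases = cases
        self.rng = random.Random(seed)
        self.base = {"pep": 3.0, "adverse_media": 2.5, "threshold": 4.0}
        self.current = None

    def reset(self):
        self.current = self.rng.choice(self.cases)
        return self.current["features"], {}

    def step(self, action):
        params = dict(self.base)  # nudges apply per case, not cumulatively
        name, delta = ACTIONS[action]
        if name is not None:
            bounds = THRESHOLD_BOUNDS if name == "threshold" else WEIGHT_BOUNDS
            params[name] = clamp(params[name] + delta, bounds)
        feats = self.current["features"]
        score = params["pep"] * feats[0] + params["adverse_media"] * feats[1]
        escalate = score >= params["threshold"]
        reward = reward_fn(escalate, self.current["suspicious"])
        info = {"score": score, "escalate": escalate}
        # terminated=True: every episode is a single decision.
        return feats, reward, True, False, info


cases = [
    {"features": [1.0, 1.0], "suspicious": True},
    {"features": [1.0, 0.0], "suspicious": False},
]
env = ToyAmlBandit(cases, seed=0)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(0)
print(reward, terminated, info)
```

Because every episode terminates after one step, there is no credit assignment across time: the agent only has to learn which nudge pays off for which kind of case.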

Quick start

One script creates a virtual environment, installs dependencies, and launches the backend on port 8200 and the dashboard on port 4003.

git clone https://github.com/rrahimi-uci/rl-anti-money-laundry.git
cd rl-anti-money-laundry
bash start.sh

Train & evaluate from the CLI

# Generate synthetic episodes (self-contained)
python -m data.generate_episodes --out data/episodes.jsonl --count 2000

# Train a PPO agent
python -m training.train --episodes data/episodes.jsonl --timesteps 50000

# Evaluate against the fixed-weight baseline (paired)
python -m training.evaluate --model models/aml_ppo --n-eval 200

Tech stack

Python 3.9+ · Gymnasium · Stable-Baselines3 · PyTorch · FastAPI · Uvicorn · NumPy · Pandas · Matplotlib · React 19 · TypeScript · Vite · Tailwind CSS · Recharts · pytest

Disclaimer

⚠️ This is an educational proof of concept that demonstrates a technique. It is not a compliant, audited Anti-Money-Laundering control and must not be used to make real financial-crime decisions.