Major Matters
Fraud and Risk Architecture
Module 3 of 6

Transaction-Level Fraud Detection

Real-time ML scoring, feature engineering, and the economics of false declines during payment authorisation


The Authorisation Window

A payment authorisation takes milliseconds. The customer initiates a transaction. The merchant passes the payment details to the acquirer. The acquirer sends an authorisation request to the card network. The network routes the request to the issuer. The issuer has a window, typically less than 100 milliseconds, to decide whether to approve or decline the transaction.

In that 100-millisecond window, a fraud scoring system has to gather signals, run them through a machine learning model, generate a score, and make a decision that will either result in a successful transaction or a declined card.

This is real-time machine learning at scale. A large payment processor might handle tens of thousands of transactions per second at peak across dozens of markets. Every single transaction gets scored. Every decision is made in milliseconds. The model has to be fast enough not to delay the transaction, accurate enough to catch fraud, and calibrated carefully enough not to decline legitimate customers.

The speed constraint is unforgiving. Add too much latency to authorisation processing and merchants will switch to a faster processor. Let too much fraud through and merchants will switch to a provider with better results. Fraud scoring is an optimisation problem with hard constraints on both speed and accuracy.

Where Fraud Scoring Lives

Fraud scoring happens at multiple points in the authorisation chain. The issuer (the customer's bank) runs fraud scoring on transactions from their own customers. The network (Visa, Mastercard) runs fraud scoring on transactions flowing through their network. The acquirer and merchant might run additional scoring on transactions hitting their systems.

Each scorer has different information and different incentives. The issuer has access to the customer's account history, device history, and previous transaction patterns. The network has access to population-level patterns across all issuers and merchants. The merchant has access to their own customer history and their own fraud data.

The challenge is coordination. If the issuer declines a transaction that the network would have approved, the transaction fails. The customer is frustrated. The merchant loses the sale. There is little benefit to having multiple scorers if they disagree on a meaningful share of transactions. This is driving standardisation around network-level scoring that issuers respect and that merchants can rely on.


Feature Engineering: Signals of Fraud

A fraud model is only as good as its features. Features are the signals that the model uses to predict fraud. The quality of feature engineering determines whether a model can distinguish between a fraud transaction and a legitimate one.

The features available to a fraud model fall into several categories.

Velocity Signals

How fast is the customer transacting? Velocity checks look for impossible patterns: five transactions in five different cities within 10 minutes (impossible to travel that far), 10 failed card tests in 60 seconds (clear sign of carding), multiple different merchants in rapid succession (sign of account compromise scanning for working payment methods).

Velocity signals are some of the strongest fraud indicators. A legitimate customer has natural transaction patterns. A fraudster is often trying to extract maximum value in minimum time before the attack is detected. Simple velocity checks catch a large portion of low-sophistication fraud.

The attacker's response: slow down. Modern fraud is patient. Instead of running 1,000 carding attempts in an hour, the attacker runs 100 spread across an entire day. Velocity becomes less obvious. The attacker switches from multiple merchants to a single merchant. Instead of $500 transaction amounts, they run $50 transactions that are less likely to trigger alerts. The attacker is essentially reverse-engineering the fraud model to stay below its thresholds.
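A basic velocity check can be sketched as a sliding window of timestamps per card. The limits here (three transactions per minute) are illustrative, not production values:

```python
from collections import defaultdict, deque
import time

class VelocityChecker:
    """Sliding-window velocity check: flag a card that transacts
    more than `max_txns` times within `window_seconds`."""

    def __init__(self, max_txns=5, window_seconds=60):
        self.max_txns = max_txns
        self.window = window_seconds
        self.history = defaultdict(deque)  # card_id -> recent timestamps

    def check(self, card_id, ts=None):
        ts = ts if ts is not None else time.time()
        q = self.history[card_id]
        # Drop timestamps that have aged out of the window.
        while q and ts - q[0] > self.window:
            q.popleft()
        q.append(ts)
        return len(q) > self.max_txns  # True => velocity breach

checker = VelocityChecker(max_txns=3, window_seconds=60)
flags = [checker.check("card_123", ts=t) for t in [0, 5, 10, 15, 120]]
# The first three calls fit under the limit, the fourth breaches it,
# and the fifth arrives after the window has emptied again.
```

This is exactly the check the patient attacker defeats: spread the same 100 attempts across a day and the window never fills.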

Geolocation Signals

Is the transaction happening in a location consistent with the customer's known geography? A customer who has never been outside the country suddenly making a purchase from overseas is unusual. A customer in New York making a transaction in London, then a transaction in New York three minutes later is impossible.

Geolocation comes from the IP address (approximate location), the billing address (customer's home country), the merchant location (country of the merchant), and travel patterns (are there flights or long drives between transactions that would explain the geography).

The limitation: geolocation is easy to spoof. A VPN masks the real IP address. A proxy service routes traffic through residential IPs in the right geography. The attacker knows that geography checks are in play and routes their traffic accordingly. Modern fraud often shows "correct" geography for the cardholder's location because the attacker is using a proxy service.
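The "New York then London three minutes later" case can be checked directly by computing the implied travel speed between two transactions. The 900 km/h ceiling (roughly a commercial flight) is an assumed threshold:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(txn_a, txn_b, max_speed_kmh=900):
    """Flag two transactions whose implied travel speed exceeds
    what a commercial flight could cover (assumed ~900 km/h)."""
    dist = haversine_km(txn_a["lat"], txn_a["lon"],
                        txn_b["lat"], txn_b["lon"])
    hours = abs(txn_b["ts"] - txn_a["ts"]) / 3600
    if hours == 0:
        return dist > 1  # same timestamp, materially different place
    return dist / hours > max_speed_kmh

# New York -> London three minutes apart: clearly impossible.
ny = {"lat": 40.7128, "lon": -74.0060, "ts": 0}
ldn = {"lat": 51.5074, "lon": -0.1278, "ts": 180}
```

Note that this only catches attackers who have not masked their geography; a residential proxy in the right city passes this check trivially.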

Device Fingerprint Signals

Is this transaction coming from the customer's known device or from a new device? Device fingerprinting captures dozens of signals from the browser or mobile device: operating system, browser version, screen resolution, timezone, browser plugins, fonts installed, network configuration. These combine into a fingerprint that is difficult for an attacker to spoof perfectly.

Device history is powerful. If a customer has made 500 legitimate transactions from Device A, then suddenly makes a transaction from Device B, that is a signal. But the signal has to be interpreted carefully. A customer might get a new phone. A customer might use a tablet or a computer instead of their phone. Device fingerprint is not definitive, but it is informative.

The attacker's response: steal the device. If you compromise the customer's phone through malware or phishing, you can make transactions that look like they are coming from the customer's known device. The device fingerprint will be correct because it is the correct device. Device fingerprint is still valuable because it protects against the attacker using a different device, but it does not protect against the attacker using the stolen device.
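One simple way to sketch a fingerprint is to canonicalise a fixed set of device attributes and hash them; the attribute list here is illustrative, and real systems use dozens of signals plus fuzzy matching:

```python
import hashlib

def device_fingerprint(attrs):
    """Combine browser/device attributes into a stable fingerprint.
    The attribute set here is a small illustrative subset."""
    keys = ["os", "browser", "screen", "timezone", "fonts"]
    canonical = "|".join(f"{k}={attrs.get(k, '')}" for k in keys)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

known = device_fingerprint({"os": "iOS 17", "browser": "Safari 17",
                            "screen": "1170x2532",
                            "timezone": "America/New_York",
                            "fonts": "SF Pro,Helvetica"})
observed = device_fingerprint({"os": "Windows 11", "browser": "Chrome 120",
                               "screen": "1920x1080",
                               "timezone": "Europe/Kyiv",
                               "fonts": "Arial,Segoe UI"})
is_new_device = observed != known  # treat as a risk signal, not proof
```

An exact-hash scheme like this is brittle on purpose in this sketch: a single changed attribute (a browser update, a new phone) produces a new fingerprint, which is why the signal has to be interpreted as informative rather than definitive.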

Biometric Signals

On transactions that include additional verification (like a biometric check or a one-time password), the system captures whether the biometric matched. Did the customer provide a fingerprint that matched the enrolled fingerprint? Did they pass face recognition? Did the one-time password they entered match what was generated?

These signals are valuable but limited. Not all transactions include biometric verification (many low-value transactions do not). And biometric spoofing is now possible with deepfakes and synthetic fingerprints. But a successful biometric match is still a signal worth including in the model.

Behavioral Signals

How does the customer normally behave? Behavioral signals include transaction amount, transaction frequency, merchant category (does the customer usually shop at clothing stores or gas stations?), time of day (does the customer usually shop during business hours or in the middle of the night?), and session behavior (how long do they browse before checking out?).

Behavioral signals are where machine learning becomes powerful. A human cannot track all of the subtle patterns in a customer's behaviour. But a model trained on millions of transactions can learn the combination of purchase timing, amount, and merchant category that characterises a particular customer. The model learns the customer's normal pattern and flags significant deviations.

Network Signals

Is this card being used by multiple people across different networks simultaneously? Is this card being tested at multiple merchants in parallel? Is this card part of a cluster of cards that are being used in similar ways? Network-level signals reveal coordinated fraud that looks invisible at single-transaction level.

Network signals require population-level data. A single merchant does not have enough data to detect network patterns. Only the networks, large acquiring processors, and fraud-as-a-service platforms have the scale to detect these patterns. This creates an asymmetry where larger processors have better fraud detection than smaller ones.
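The card-testing pattern described above only becomes visible with cross-merchant data. A minimal sketch, counting distinct merchants per card (the threshold of five is illustrative):

```python
from collections import defaultdict

def cards_tested_widely(txns, min_merchants=5):
    """Population-level signal: cards appearing at unusually many
    distinct merchants, invisible to any single merchant."""
    merchants_by_card = defaultdict(set)
    for t in txns:
        merchants_by_card[t["card"]].add(t["merchant"])
    return {card for card, merchants in merchants_by_card.items()
            if len(merchants) >= min_merchants}

# c1 is being tested across eight merchants; c2 shops at one.
txns = ([{"card": "c1", "merchant": f"m{i}"} for i in range(8)]
        + [{"card": "c2", "merchant": "m1"}] * 3)
```

Any one of those eight merchants sees a single unremarkable transaction on c1; only the aggregator sees the pattern, which is the asymmetry described above.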

Merchant and Business Signals

Is the merchant known? Is the merchant's fraud rate in-line with their category? Is the merchant increasing their transaction volume suspiciously? Merchant-level signals help identify both merchant fraud and merchants that are being targeted by fraud attacks.


The Machine Learning Pipeline

Building a fraud scoring model requires orchestrating several steps: data collection, feature engineering, model training, model validation, and real-time scoring.

Fraud Detection ML Pipeline:
Historical Transactions (6-12 months) → Feature Engineering (signal extraction) → Model Training (XGBoost, RF, NN) → Validation (holdout test: AUC, precision, recall) → Production Scoring (<100ms latency) → Continuous Monitoring (model drift detection, retraining)

Data Collection and Labeling

The training data comes from historical transactions. For each transaction, the system collects all features and a label: fraud or not fraud. The label is typically determined by whether the transaction was disputed, charged back, or reported as fraud by the customer or merchant.

Data quality is critical. If a transaction was not disputed, does that mean it was legitimate? What if the victim has not yet noticed the fraud? What if the victim decided the amount was too small to dispute? The label might be wrong. Training on noisy labels produces a model that learns the noise, not the actual fraud patterns.

The distribution of labels matters too. In most transaction datasets, fraud is rare: 0.05 to 0.2 percent of transactions. Training a model on imbalanced data (99.8 percent legitimate, 0.2 percent fraud) produces a model that learns to classify everything as legitimate, because that is right 99.8 percent of the time. Balancing the training data through oversampling fraud cases or applying class weights is essential.
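Inverse-frequency class weighting is one common remedy; gradient boosting libraries expose the same idea directly (XGBoost's `scale_pos_weight` is conventionally set to the negative/positive ratio). A stdlib sketch of the arithmetic:

```python
def class_weights(labels):
    """Inverse-frequency class weights for an imbalanced dataset.
    Each class is weighted so the two classes contribute equally
    to the loss overall."""
    n = len(labels)
    n_fraud = sum(labels)
    n_legit = n - n_fraud
    return {0: n / (2 * n_legit), 1: n / (2 * n_fraud)}

labels = [1] * 2 + [0] * 998  # a 0.2 percent fraud rate
w = class_weights(labels)
ratio = w[1] / w[0]  # each fraud example counts ~499x a legitimate one
```

At a 0.2 percent fraud rate, each fraud example ends up weighted 499 times a legitimate one, which is what stops the model from learning the degenerate "approve everything" classifier.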

Feature Extraction

Raw transaction data (card number, amount, merchant, timestamp) has to be transformed into features that are predictive. Extracting features requires domain knowledge and experimentation. Which merchant categories are high risk? How much should recent transactions be weighted compared to distant history? How do you combine geographic signals with device fingerprint signals?

Feature engineering is where a significant portion of fraud prevention effort goes. The model architecture matters, but the features matter more. A simple model with good features will outperform a sophisticated model with poor features.
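A sketch of what that transformation looks like: raw transaction plus card history in, model-ready features out. The specific features (amount z-score, hourly velocity, new merchant category, foreign country) are illustrative:

```python
from statistics import mean, stdev

def extract_features(txn, history):
    """Turn a raw transaction plus the card's transaction history
    into model features. Feature choices here are illustrative."""
    amounts = [h["amount"] for h in history] or [txn["amount"]]
    mu = mean(amounts)
    sigma = stdev(amounts) if len(amounts) > 1 else 1.0
    recent = [h for h in history if txn["ts"] - h["ts"] < 3600]
    return {
        # How far this amount sits from the card's usual spend.
        "amount_zscore": (txn["amount"] - mu) / (sigma or 1.0),
        "txns_last_hour": len(recent),
        "is_new_merchant_category": int(
            txn["mcc"] not in {h["mcc"] for h in history}),
        "is_foreign": int(txn["country"] != txn["home_country"]),
    }

# Four days of grocery purchases around $50, then a $900 gambling
# transaction (MCC 7995) from abroad.
history = [{"amount": a, "ts": t * 86400, "mcc": "5411"}
           for t, a in enumerate([40, 55, 48, 52])]
txn = {"amount": 900, "ts": 5 * 86400, "mcc": "7995",
       "country": "GB", "home_country": "US"}
feats = extract_features(txn, history)
```

Every feature above encodes a domain judgement (why an hour, why z-score against this card rather than the population), which is where the experimentation effort goes.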

Model Selection

What model architecture to use? The most common approaches are gradient boosted trees (XGBoost, LightGBM, CatBoost), random forests, and neural networks. Each has different tradeoffs.

Gradient boosted trees are the standard in fraud prevention. They are fast to train, relatively simple to interpret, handle mixed feature types well, and produce strong results on classification tasks. They do not require feature scaling, they handle non-linear relationships, and they are robust to outliers.

Random forests are simpler but generally perform worse than gradient boosted trees. They are more interpretable and faster to train, but have lower accuracy.

Neural networks can capture complex patterns but are slower to train, require more tuning, and are harder to interpret. They are useful for specific tasks like image classification (detecting fake ID documents), but for tabular transaction data, gradient boosted trees usually outperform neural networks.

Ensemble methods that combine multiple models (gradient boosting plus a neural network plus a logistic regression baseline) often outperform any single model. The challenge is latency: each additional model adds computational cost and latency. Production fraud scoring is often an ensemble of 2 to 5 models that collectively score within the latency budget.

Model Validation and Calibration

Before deploying a model, it has to be validated on data it has never seen before. A holdout test set of recent transactions is used to measure the model's performance on new data. The metrics tracked include precision (what fraction of flagged transactions are actually fraud), recall (what fraction of fraud is caught), AUC (the model's overall ability to rank fraud above legitimate transactions), and the false decline rate.

Calibration is critical. A model's output score (0 to 100, or 0 to 1) should reflect actual probability. If the model outputs 70, then 70 percent of transactions with that score should actually be fraud. Poor calibration leads to incorrect decision thresholds. A well-calibrated model allows the business to set a threshold that reflects their economic optimization (maximum fraud catch at acceptable false positive rate).
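Calibration can be checked empirically with a reliability table: bin transactions by predicted score and compare the mean score in each bin to the observed fraud rate. A minimal sketch on toy data:

```python
def reliability_table(scores, labels, n_bins=5):
    """Bin predicted scores and compare mean predicted score to
    observed fraud rate per bin. Large gaps between the two
    columns indicate miscalibration."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    table = []
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            fraud_rate = sum(y for _, y in b) / len(b)
            table.append((round(mean_score, 2), round(fraud_rate, 2)))
    return table

# Perfectly calibrated toy data: transactions scored 0.9 really
# are fraud 90 percent of the time.
scores = [0.1] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]
```

When the columns diverge (say, transactions scored 0.7 are fraudulent only 40 percent of the time), a recalibration step such as Platt scaling or isotonic regression is applied before thresholds are set.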

Model Drift and Retraining

A model trained on 2024 data learns the fraud patterns of 2024. By 2025, new fraud techniques have emerged, attacker behaviour has changed, and the model's performance degrades. This is called model drift. Continuous monitoring of model performance on new data reveals when retraining is needed.

Most fraud models are retrained monthly or quarterly. New fraudsters, new patterns, new devices, and new attack vectors emerge constantly. The model has to adapt. The retrain cycle is a tradeoff: more frequent retraining keeps the model fresher but requires more engineering effort and risk of deploying a worse model (if the new training data is noisy).
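One widely used drift monitor is the Population Stability Index (PSI), which compares the live score distribution against the training-time distribution. The rule of thumb that PSI above 0.2 signals meaningful drift is a convention, not a hard law:

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline score
    distribution (training time) and the live distribution."""
    def bucket(xs):
        counts = [0] * n_bins
        for x in xs:
            counts[min(int(x * n_bins), n_bins - 1)] += 1
        total = len(xs)
        return [max(c / total, 1e-6) for c in counts]  # avoid log(0)
    e, a = bucket(expected), bucket(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_scores = [i / 1000 for i in range(1000)]   # uniform baseline
live_same = [i / 1000 for i in range(1000)]      # no drift: PSI ~ 0
live_shifted = [min(0.999, 0.5 + i / 2000) for i in range(1000)]  # drifted
```

A PSI alert on the score distribution fires before labels arrive (chargebacks lag by weeks), which is why distribution monitoring complements, rather than replaces, tracking labeled performance.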


Real-Time Scoring Constraints

A production fraud model has to score transactions in less than 100 milliseconds. That includes reading the transaction details from the database, extracting features, running the model, and returning a decision.

This constraint is unforgiving. A complex model that takes 150 milliseconds is not an option, no matter how accurate it is. The transaction will timeout and the merchant will switch to a faster processor.

Solutions include pre-computing aggregate features offline and serving them from an in-memory feature store, keeping the model itself compact enough to evaluate in a few milliseconds, and running cheap rule checks before the model so the most expensive logic only runs when needed.

The best fraud detection systems pre-compute as much as possible and keep only the absolute essential computation in the real-time hot path.
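A sketch of that split, assuming a feature store refreshed by offline batch jobs: the hot path is one dictionary lookup, a handful of arithmetic operations, and a model call. The store contents and the toy model are illustrative:

```python
import time

# Aggregates precomputed offline (e.g. hourly batch jobs) and held
# in memory, so the real-time path never touches slow storage.
FEATURE_STORE = {
    "card_123": {"avg_amount_30d": 52.40, "txn_count_7d": 11,
                 "home_country": "US"},
}

def score_hot_path(txn, model):
    """Real-time path: cache lookup, cheap feature assembly, one
    model call. `model` is any callable returning a probability."""
    cached = FEATURE_STORE.get(txn["card"], {})
    features = {
        "amount_ratio": txn["amount"]
            / max(cached.get("avg_amount_30d", txn["amount"]), 0.01),
        "txn_count_7d": cached.get("txn_count_7d", 0),
        "is_foreign": int(
            txn["country"] != cached.get("home_country", txn["country"])),
    }
    start = time.perf_counter()
    score = model(features)
    latency_ms = (time.perf_counter() - start) * 1000
    return score, latency_ms

toy_model = lambda f: min(1.0, 0.1 * f["amount_ratio"] + 0.3 * f["is_foreign"])
score, ms = score_hot_path(
    {"card": "card_123", "amount": 524.0, "country": "GB"}, toy_model)
```

The tradeoff is freshness: a 30-day average refreshed hourly cannot reflect the transaction made ten seconds ago, so truly real-time signals (like in-flight velocity) still need their own low-latency counters.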


Rules vs. Machine Learning

Should fraud detection use rules (explicit if/then logic written by humans) or machine learning (patterns learned from data)? The answer is both, and the balance depends on the use case.

Rules Are Best For:

Hard constraints that must always hold (a card reported stolen, a sanctioned country), patterns simple enough to state explicitly (more than 10 failed attempts in a minute), and decisions that must be fully explainable to regulators and customers.

Machine Learning Is Best For:

Subtle patterns spread across dozens of interacting signals that no human could enumerate as rules, and adapting to evolving fraud behaviour through retraining rather than manual rule maintenance.

The best architectures use rules as a first pass (quickly catching obvious fraud) and ML as a second pass (catching subtle fraud). Rules block maybe 5 to 10 percent of transactions, with near-zero false positives. The ML model then scores the remaining 90 to 95 percent of transactions.
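The two-pass structure can be sketched as a short decision function; the specific rules and the 0.85 threshold are illustrative:

```python
def decide(txn, ml_score_fn):
    """Two-pass decision: hard rules first, ML on everything else.
    Rule conditions and the ML threshold are illustrative."""
    # Pass 1: rules with near-zero false positives.
    if txn.get("card_reported_stolen"):
        return "decline", "rule:stolen_card"
    if txn.get("failed_attempts_last_minute", 0) > 10:
        return "decline", "rule:card_testing"
    # Pass 2: ML score against a calibrated threshold.
    score = ml_score_fn(txn)
    return ("decline" if score > 0.85 else "approve"), f"ml:{score:.2f}"

fake_model = lambda txn: 0.03 if txn["amount"] < 100 else 0.92
approved = decide({"amount": 40.0}, fake_model)
blocked = decide({"amount": 40.0, "card_reported_stolen": True}, fake_model)
```

Returning the reason alongside the decision matters operationally: rule declines are trivially explainable, while ML declines carry a score that feeds review queues and regulator-facing explanations.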


Network-Level Scoring: Visa and Mastercard

Visa Advanced Authorization and Mastercard Decision Intelligence are network-level fraud scoring systems. These systems process every authorisation request flowing through the network and apply machine learning models trained on trillions of transaction signals.

The power of network-level systems is that they see fraud patterns across entire ecosystems. If a card is being tested at 500 different merchants, the network sees it. If a coordinated carding operation is running across multiple countries, the network detects it. A single merchant has no visibility into these patterns.

Issuers (the customer's bank) respect network-level signals because they are based on more data than the issuer has access to. This is driving a shift toward issuer reliance on network scores, which simplifies the ecosystem and improves coordination.


The False Decline Problem and Calibration

The fundamental challenge of fraud detection is false declines. A false decline is when a legitimate transaction gets incorrectly flagged as fraud and declined. The customer loses the transaction, the merchant loses the revenue, and both are frustrated.

A fraud scoring system that catches 95 percent of fraud but has a 5 percent false decline rate will lose customers. The economics are brutal. A customer declined on a $100 transaction has not just lost $100 revenue. They have lost trust, might switch to a competitor, might close their account. The lifetime value of the customer is at risk.

False Decline Economics

Aggressive Fraud Filter (Low Threshold): fraud catch 96%, false decline 8%. Result: high fraud catch, high customer churn.
Lenient Fraud Filter (High Threshold): fraud catch 78%, false decline 1%. Result: more fraud leaks through, low customer friction.
Risk-Based (Optimal Calibration): fraud catch 90%, false decline 2%, customer LTV optimized. Result: different customers get different friction levels based on risk; high-value customers get fewer false declines.

Risk-based calibration solves this by treating different customer segments differently. A high-value customer (verified account, long history, high balance) gets a lenient fraud threshold. They are unlikely to be a fraud case and the economic loss of losing them is high. A new customer with no history gets a stricter fraud threshold. The risk is higher and the switching cost is lower.

This requires a two-tier model: first, estimate the customer's risk and value. Second, set the fraud threshold for that customer based on their risk profile. A model that does this automatically reduces both fraud losses and false decline losses compared to a one-size-fits-all threshold.
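The second tier reduces to choosing a threshold per segment. The segment boundaries and threshold values below are illustrative:

```python
def threshold_for(customer):
    """Segment-specific decline thresholds (values illustrative).
    High-LTV, long-tenured customers get a lenient threshold;
    brand-new accounts get a strict one."""
    if customer["tenure_days"] > 365 and customer["lifetime_value"] > 5000:
        return 0.95   # lenient: decline only near-certain fraud
    if customer["tenure_days"] < 30:
        return 0.60   # strict: high risk, low switching cost
    return 0.85       # default segment

def decision(fraud_score, customer):
    return "decline" if fraud_score > threshold_for(customer) else "approve"

vip = {"tenure_days": 2000, "lifetime_value": 48000}
newcomer = {"tenure_days": 5, "lifetime_value": 0}
# The same 0.7 score approves for the VIP and declines for the
# new account: one model, different economics per segment.
```

This is why calibration (the previous section) matters so much: per-segment thresholds are only meaningful if a score of 0.7 corresponds to the same fraud probability in every segment.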

If your fraud detection system catches 95 percent of fraud but declines 5 percent of legitimate transactions, what is the economic value of catching one more percent of fraud if it requires declining an additional 2 percent of legitimate transactions?
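One way to frame that arithmetic, with openly assumed numbers for fraud loss per incident and the lifetime-value cost of a false decline:

```python
# Marginal economics of tightening the threshold. All dollar
# figures are assumptions for illustration, not industry data.
monthly_txns = 1_000_000
fraud_rate = 0.002            # 0.2 percent of transactions are fraud
avg_fraud_loss = 100.0        # assumed loss per fraudulent transaction
avg_ltv_hit = 50.0            # assumed cost per falsely declined customer

extra_fraud_caught = 0.01 * monthly_txns * fraud_rate          # +1% of fraud
extra_false_declines = 0.02 * monthly_txns * (1 - fraud_rate)  # +2% of legit

savings = extra_fraud_caught * avg_fraud_loss
cost = extra_false_declines * avg_ltv_hit
net = savings - cost  # strongly negative under these assumptions
```

Because fraud is rare, "one more percent of fraud" is a tiny absolute number of transactions while "two more percent of legitimate transactions" is enormous, so under these assumptions the tighter threshold destroys value by orders of magnitude.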

Key Terms

Fraud Score
Numeric probability that a transaction is fraudulent. Output by ML model during authorisation. Compared to threshold to decide approve/decline.
Feature
Input signal to ML model. Examples: velocity, geolocation, device fingerprint, merchant category, transaction amount, customer history.
Velocity Check
Real-time analysis of transaction frequency and volume. Flags impossible patterns like multiple transactions in different locations within seconds.
Device Fingerprint
Unique identifier of device based on hardware, OS, browser characteristics. Used to detect transactions from unfamiliar devices.
Model Drift
Degradation of model performance over time as fraud patterns evolve. Detected through continuous monitoring, triggers retraining.
False Positive
Legitimate transaction incorrectly flagged as fraud. Results in false decline and lost customer trust.
False Negative
Fraudulent transaction incorrectly approved by fraud detection. Results in fraud loss and chargeback.
AUC
Area Under Curve. Metric measuring overall ranking ability of fraud model. Ranges 0.5 (random) to 1.0 (perfect).
XGBoost
Gradient boosting framework. Most common choice for fraud detection models due to speed, interpretability, and accuracy.
Calibration
Process of ensuring model output score reflects actual probability. Well-calibrated model enables informed threshold selection.
Next Module
Advanced Topics
Model explainability, interpretability for regulators, chargeback management, and emerging attack vectors.