Major Matters
AI in Financial Services
Module 1 of 6

How AI Models Actually Work in Finance

The ML pipeline stripped of hype. Which models work where, and why production is nothing like notebooks


The Gap Between Theory and Production

Most conversations about AI in finance focus on what models can do. Build a neural network on fraud data, accuracy improves, deploy to production, problem solved. The reality is far messier. A model that achieves 95 percent accuracy in a Jupyter notebook can collapse in production. Accuracy is not the same as value. And value is what matters when real money is on the line.

The gap between a working model and a working system is where most AI projects break down. The model code might be elegant. The infrastructure might be sophisticated. But if the model decays over time, if it cannot meet latency requirements, if it makes decisions that regulators reject, or if it is so opaque that nobody knows why it said yes or no, the entire system fails.

A model that works in a notebook is a toy. A model that works in production is an asset. The difference is not code quality. It is understanding data drift, feature engineering, deployment architecture, monitoring, and the brutal constraints of real-world finance.

This module covers the production reality of AI in finance: which model types dominate real financial services use cases, why gradient boosted trees still outperform deep learning in most deployments, how the ML pipeline actually works, and what happens when models meet the real world.


The Three Paradigms: Supervised, Unsupervised, and Reinforcement Learning

All machine learning falls into three categories. Understanding which category applies to your problem determines the approach, the data requirements, and the expected performance ceiling.

Supervised Learning: Learning From Labelled Data

Supervised learning is learning from examples where we know the correct answer. A credit scoring model learns from historical loans where we know which defaulted (label: yes) and which repaid (label: no). A fraud detection model learns from transactions where we know which were fraud (label: yes) and which were legitimate (label: no). The model ingests features (transaction amount, geographic origin, customer history) and learns to predict the label.

Supervised learning splits into two categories: classification and regression. Classification predicts categories. Will this loan default (yes/no)? Is this transaction fraudulent (fraud/legitimate)? Is this account a synthetic identity (synthetic/real)? Regression predicts continuous numbers. What is the probability of default? What is the expected loss? What interest rate should we charge?

Supervised learning dominates finance because financial problems are almost always prediction problems. The label is something we observe after the fact (loan default, fraud loss, regulatory finding). We use historical data to train a model on the relationship between features and outcomes, then deploy the model to predict the outcome before it happens. The model makes a decision in real time. The system measures whether the decision was correct only later, after the outcome is revealed.

The catch: supervised learning requires labelled data. If you have 10 million transactions but only 100 are confirmed fraud, your dataset is massively imbalanced. If fraud resolution takes weeks (you do not know a transaction was fraud until the customer disputes it), your labels are delayed. If fraud patterns shift every quarter (attackers adapt), your historical labels become obsolete. These are production realities that datasets optimized for accuracy do not capture.
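To make the imbalance concrete, here is a minimal sketch (pure Python, numbers from the example above) of the weighting heuristic many gradient boosted libraries accept, such as XGBoost's scale_pos_weight: the ratio of negative to positive examples.

```python
# Sketch: quantifying class imbalance for a fraud dataset.
# With 10 million transactions and only 100 confirmed fraud labels,
# the minority class must be up-weighted during training; a common
# heuristic (used e.g. as XGBoost's scale_pos_weight) is
# n_negative / n_positive.

def imbalance_weight(n_total: int, n_positive: int) -> float:
    """Weight to apply to the rare positive class during training."""
    n_negative = n_total - n_positive
    return n_negative / n_positive

# 10 million transactions, 100 confirmed fraud cases
weight = imbalance_weight(10_000_000, 100)
print(weight)  # 99999.0
```

The weight tells the training loop that one missed fraud case costs roughly as much as tens of thousands of correctly handled legitimate transactions.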

Unsupervised Learning: Finding Patterns Without Labels

Unsupervised learning finds patterns in data without being told what the right answer is. Nobody tells the model what fraud looks like. The model computes the statistical properties of legitimate transactions, then flags transactions that deviate from the norm as anomalies. Nobody labels which customer accounts are synthetic identities. The model finds accounts with unusual registration patterns, document quality issues, and behavioural signals that do not match the profile.

Unsupervised learning takes two forms: anomaly detection and clustering. Anomaly detection (also called outlier detection) finds datapoints that do not fit the expected distribution. Clustering groups similar datapoints together and identifies groups that behave differently from the norm.

Unsupervised learning shines when labelling is expensive or impossible. You cannot label which customer will become valuable to the platform over the next five years. You can cluster customer segments by behaviour and infer that one segment has higher lifetime value than others. You do not know which users on a platform are actual humans versus bots. You can detect bot-like behaviour patterns (identical click sequences, impossible geographic jumps, predictable device patterns) and flag them as anomalies.

The downside: unsupervised models are harder to evaluate and easier to fool. A supervised model for fraud detection can report accuracy: what percent of its predictions were correct? An unsupervised model flags anomalies, but you do not know if the anomalies are actually fraud or just unusual but legitimate behaviour. A customer with a single very large purchase is an anomaly. Is it fraud or a wealthy customer making a one-time acquisition? The model cannot tell you.
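A minimal sketch of statistical anomaly detection along these lines: flag amounts far from the mean of the observed data. The data and the two-standard-deviation threshold are illustrative; production systems use more robust methods (median-based statistics, isolation forests).

```python
# Sketch: flag transactions whose amount deviates strongly from the
# mean. Note the model cannot say WHY a point is anomalous -- only
# that it does not fit the observed distribution.
import statistics

def zscore_anomalies(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    if stdev == 0:
        return []
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Mostly typical purchases, plus one very large outlier
txns = [20, 25, 30, 22, 27, 24, 26, 23, 28, 5000]
print(zscore_anomalies(txns))  # [5000]
```

The flagged $5000 transaction is exactly the ambiguous case from the text: the model detects the anomaly, but only a human or a supervised model can decide whether it is fraud or a wealthy customer.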

Reinforcement Learning: Learning From Interaction

Reinforcement learning learns by taking actions and observing rewards or penalties. A dynamic pricing engine uses reinforcement learning: try a price, observe whether the customer purchases, adjust the price up or down based on the outcome. A portfolio optimization algorithm uses reinforcement learning: allocate capital to assets, observe returns, adjust the allocation based on actual performance.

Reinforcement learning is rare in financial services but growing. The reason for rarity is that reinforcement learning requires an environment where you can take actions and observe immediate rewards. In fraud detection, you cannot experiment by allowing fraudulent transactions in production just to see what happens. In credit underwriting, you cannot systematically approve bad loans to learn from the outcomes. The cost of learning is the cost of errors, and that cost is unacceptable.

Where reinforcement learning works in finance: trading (markets reward correct bets instantly), dynamic pricing (customers give immediate feedback), marketing spend allocation (campaigns generate immediate ROI feedback), and portfolio rebalancing (you observe immediate returns). In all these cases, the feedback loop is fast, the cost of experimentation is acceptable, and the model can adapt quickly to changing conditions.


Classification and Regression: The Workhorses of Finance

Almost all operational AI in finance is either classification or regression. Classification: will this happen (yes/no)? Regression: how much will this be? Both are supervised learning. Both require labelled training data. And both have produced an arms race around which model architecture wins.

ML Model Types and Financial Services Use Cases
Classification and Regression: Where Different Models Win

Logistic Regression. Use case: credit scoring, approval odds. Why it wins: the regulatory baseline; fast, explainable, handles linear relationships.

Gradient Boosted Trees. Use case: fraud detection, AML scoring, default prediction, pricing. Why it wins: accuracy plus speed; handles non-linear patterns.

Neural Networks. Use case: document verification, NLP, image analysis. Why it wins: unstructured data (text, images) and sequence models.

RNN / LSTM (sequence models). Use case: transaction sequences, time-series forecasting. Why it wins: captures temporal dependencies.

Ensemble Methods. Use case: risk scoring, lending decisions, compliance. Why it wins: combines the strengths of multiple models.

In practice, most financial services teams use gradient boosted trees (XGBoost, LightGBM) for tabular data and neural networks for unstructured data (documents, images).

Why Gradient Boosted Trees Dominate in Finance

If you walk into a trading floor, a credit risk team, or a compliance department in a major financial institution and ask what model they use for fraud detection, credit scoring, or AML risk, the answer is almost always the same: gradient boosted trees, usually XGBoost or LightGBM.

This is not because neural networks are not powerful. Neural networks are extraordinarily powerful. It is because gradient boosted trees have properties that make them dominant for the specific shape of financial problems. First: they work on tabular data. Financial services is built on tabular data: columns of numbers representing customer history, transaction patterns, and account characteristics. Neural networks were built for unstructured data (images, text, video). They can handle tabular data, but they are not optimized for it. Gradient boosted trees are.

Second: they are fast to train and fast to score. A gradient boosted model for fraud detection can be trained on millions of transactions in minutes. A neural network on the same dataset might take hours or days. In production, scoring (making a prediction) typically takes milliseconds for gradient boosted trees, sometimes microseconds. Neural networks are slower. When you have 100,000 transactions per second hitting your system, every microsecond counts.

Third: they require less data. A neural network needs tens of thousands or millions of labelled examples to achieve high accuracy. A gradient boosted tree can often achieve the same accuracy with thousands or tens of thousands of examples. In fraud detection, where labels are scarce and expensive (requires manual review or external verification), this is critical.

Fourth: they are interpretable. When a gradient boosted tree makes a decision, you can understand why. Which features mattered? In what order did they matter? Did the model flag this transaction because of the transaction amount, or the geographic jump, or the device fingerprint? Regulators care about this. A neural network is a black box. Your model scored this customer for a loan at 65 percent approval, but nobody, including the model, knows exactly why. That is a regulatory nightmare.

The result: gradient boosted trees are the industry standard for financial risk scoring. JPMorgan, Goldman Sachs, Stripe, Adyen, and most major financial institutions use them for fraud detection, credit risk, and pricing decisions. They have proven themselves across billions of transactions.

Where Neural Networks Win

Neural networks do win in specific domains where financial services needs them: natural language processing, computer vision, and sequence models. An AML compliance team might use natural language processing to extract key information from customer documents or news articles to flag potential sanctions matches. An identity verification system uses computer vision to verify that a customer's face matches their uploaded ID document. A transaction sequence model uses recurrent neural networks to detect unusual patterns in a customer's transaction history over time.

In these cases, the unstructured data (text, images, sequences) cannot be easily converted into tabular features for a gradient boosted model. The neural network ingests the raw data and learns the representations directly. The trade-off: slower inference, harder to explain, more data required. But the accuracy improvement can justify the trade-offs.

Ensemble Methods: Combining Multiple Models

The most sophisticated financial services systems do not use a single model. They use ensemble methods that combine multiple models, each capturing different signals. A fraud detection system might combine a gradient boosted tree trained on transaction patterns, a neural network trained on device fingerprint patterns, a rules-based system for obvious fraud patterns, and network-level signals that show whether this card has been used for fraud elsewhere.

The ensemble votes. Each model produces a score or a decision. The votes are combined (weighted average, majority vote, learned meta-model) into a final decision. The benefit: the ensemble is more robust than any single model. If one model is overfit or broken, the others compensate. If one model learns a pattern that does not generalize, the ensemble weights it down. Ensembles are more expensive computationally but worth it for high-stakes decisions.
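As a sketch, here is a weighted-average ensemble over illustrative model scores. The model names, weights, and decision threshold are invented for the example; real systems learn the weights or use a meta-model.

```python
# Sketch: combining scores from several fraud models with a weighted
# average. Names, weights, and the 0.7 threshold are illustrative.

def ensemble_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-model fraud scores (each in [0, 1])."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total

scores = {
    "gbt_transactions": 0.92,   # gradient boosted tree on txn patterns
    "nn_device": 0.40,          # neural net on device fingerprints
    "rules": 1.00,              # rules engine hit an obvious pattern
}
weights = {"gbt_transactions": 0.5, "nn_device": 0.2, "rules": 0.3}

final = ensemble_score(scores, weights)
decision = "decline" if final > 0.7 else "approve"
print(round(final, 2), decision)  # 0.84 decline
```

Even though the device model is unconvinced, the transaction model and the rules engine dominate the weighted vote, which is the robustness property described above.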


The Production ML Pipeline: From Data to Deployment

The gap between notebooks and production is the gap between the ML pipeline and the models that live inside it. A notebook runs on a single dataset, once. A production system runs on continuous data streams, retrains constantly, and must handle failures gracefully. Understanding the pipeline is understanding production.

The Production ML Pipeline for Financial Services
Data Collection → Feature Engineering → Training → Validation → Deployment → Monitoring → Retraining (and back to data collection)

Data collection: raw transaction logs, customer records, outcomes.
Feature engineering: transform raw data into model-ready features.
Training: the model learns from historical data.
Validation: test on held-out data, check for overfitting.
Deployment: serve the model to real-time systems.
Monitoring: track performance, detect drift.

This is a continuous retraining cycle. Model performance degrades as data and attacks evolve; retrain weekly, daily, or hourly depending on business requirements.

Data Collection and Labelling

The pipeline starts with data. Not the cleaned, perfect data in your notebook. Real data. Messy, incomplete, biased data from production systems. A fraud detection system collects all transactions: the amount, the merchant, the card used, the geographic origin, the time, the device, any authentication challenges presented. Then it collects the outcome: was the transaction later disputed? Did the cardholder file a fraud claim? Was it investigated by the processor as fraud?

The labelling is imperfect. Some fraud goes undetected. Some transactions are labelled as fraud when they were actually legitimate. Labelling lag is common. You do not know a transaction was fraud until weeks later when the customer disputes it. Meanwhile, the model needs to make decisions now. The model trains on historical labels that are incomplete and delayed.

Feature Engineering: The Real Work

Raw data does not work. A model cannot learn from a transaction amount, timestamp, and merchant code directly. Feature engineering transforms raw data into signals the model can learn from. Raw timestamp becomes hours-since-customer-first-transaction, or day-of-week, or is-this-transaction-happening-at-an-unusual-time-for-this-customer.

Good features are the difference between a model that works and a model that fails. The best data scientists spend 80 percent of their time on feature engineering and 20 percent on model tuning. Features capture domain knowledge about the problem. In fraud detection, features might include: average transaction size in the last hour, number of transactions in the last hour, geographic distance from last transaction, device consistency, peer group comparison (is this behaviour normal for this customer segment), and network signals (has this card been used for fraud elsewhere).
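A sketch of what a few such time-based features look like in code. The feature names, the late-night cutoff, and the example data are illustrative; real systems compute hundreds of features like these.

```python
# Sketch: turning raw timestamps into model-ready features.
from datetime import datetime, timedelta

def engineer_features(txn_times: list, now: datetime) -> dict:
    """Derive time-based features from a customer's transaction history."""
    first = min(txn_times)
    one_hour_ago = now - timedelta(hours=1)
    return {
        "hours_since_first_txn": (now - first).total_seconds() / 3600,
        "txns_last_hour": sum(1 for t in txn_times if t > one_hour_ago),
        "is_unusual_hour": now.hour < 6,  # crude late-night flag
    }

history = [
    datetime(2026, 1, 10, 14, 0),
    datetime(2026, 3, 1, 2, 10),
    datetime(2026, 3, 1, 2, 40),
]
feats = engineer_features(history, now=datetime(2026, 3, 1, 3, 0))
print(feats)
```

The raw data was three timestamps; the output is three signals a model can actually learn from: account age, recent velocity, and time-of-day.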

Feature engineering is also where data leakage happens. A common mistake: including a feature that would not be available at prediction time. If your fraud detection system makes decisions at authorisation time (milliseconds), you cannot use features about the merchant's dispute rate. That data does not exist yet. If you train on it (it is correlated with fraud), the model learns the pattern. In production, the feature is always unknown, and the model makes worse decisions than you expected.
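One defensive pattern against leakage, sketched here with invented feature names: record when each feature's value becomes known, and only allow features that exist at the decision point into training.

```python
# Sketch: guarding against data leakage with a feature availability
# registry. Feature names and the registry structure are illustrative.

# Feature name -> when its value is known, relative to the moment the
# model must decide ("authorisation" = available in time).
FEATURE_AVAILABILITY = {
    "transaction_amount": "authorisation",
    "geo_distance_from_last_txn": "authorisation",
    "device_fingerprint_match": "authorisation",
    "merchant_dispute_rate_90d": "post_settlement",  # known weeks later
    "chargeback_filed": "post_settlement",           # the label itself!
}

def safe_features(candidates, decision_point="authorisation"):
    """Keep only features that exist at the moment of prediction."""
    return [f for f in candidates if FEATURE_AVAILABILITY[f] == decision_point]

print(safe_features(list(FEATURE_AVAILABILITY)))
```

Training only on the filtered list means offline accuracy reflects what the model can actually see at authorisation time, not what the data warehouse knows in hindsight.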

Model Training and Hyperparameter Tuning

Training is about learning the relationship between features and outcomes. A gradient boosted tree learns by building many small decision trees, each correcting the errors of the previous one. A neural network learns by adjusting weights through backpropagation. In both cases, the goal is to minimize the difference between predicted and actual outcomes on the training data, while also generalizing to new data the model has not seen.
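To make "each tree correcting the errors of the previous one" concrete, here is a toy boosting loop on one-dimensional data: each round fits a single-split stump to the current residuals. This is a sketch of the core idea only; real libraries (XGBoost, LightGBM) add regularisation, deeper trees, and histogram optimisations.

```python
# Toy sketch of gradient boosting for regression: each round fits a
# one-split "stump" to the residuals of the ensemble so far.

def fit_stump(xs, residuals):
    """Best single split on 1-D data, predicting each side's mean residual."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, rounds=20, lr=0.3):
    """Return a predictor built from `rounds` stumps fit to residuals."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy pattern: small transaction amounts -> low risk, large -> high
xs = [10, 20, 30, 40, 500, 600, 700, 800]
ys = [0.1, 0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 1.0]
model = boost(xs, ys)
print(round(model(25), 2), round(model(650), 2))
```

Each stump alone is a terrible model; the sum of twenty of them, each shrunk by the learning rate and fit to what the previous ones got wrong, recovers the pattern.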

Hyperparameters are the knobs you turn: how many trees in the ensemble, how deep should each tree be, what is the learning rate, how much regularization. The wrong hyperparameters lead to overfitting (the model memorises the training data and fails on new data) or underfitting (the model is too simple to capture the underlying pattern). Hyperparameter tuning is systematic search over the parameter space using grid search, random search, or Bayesian optimization.

Validation: Measuring Real Performance

Accuracy is a trap. If your dataset is 99 percent non-fraud, a model that predicts everything as non-fraud achieves 99 percent accuracy. A useless model. The validation step uses metrics that are appropriate for the problem. For fraud detection: precision (of the transactions I flagged as fraud, how many actually were?), recall (of all the fraudulent transactions, how many did I catch?), and the F1 score (the harmonic mean, balancing precision and recall).
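Those metrics, computed from raw counts. The counts are illustrative; in production they are computed continuously over sliding windows.

```python
# Precision, recall, and F1 from a confusion matrix's raw counts.

def precision(tp, fp):  # of the transactions I flagged, how many were fraud?
    return tp / (tp + fp)

def recall(tp, fn):     # of all the fraud, how much did I catch?
    return tp / (tp + fn)

def f1(p, r):           # harmonic mean, balancing precision and recall
    return 2 * p * r / (p + r)

# Illustrative counts: 120 real fraud cases, 100 transactions flagged
tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))  # 0.8 0.667 0.727
```

Note that the all-non-fraud model from the text scores tp = 0, so its recall is 0 no matter how high its accuracy: these metrics expose exactly the failure that accuracy hides.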

Validation uses a held-out test set that the model has never seen. You split your data: 70 percent training, 15 percent validation, 15 percent test. You train on the training set, tune hyperparameters on the validation set, and measure final performance on the test set. If you measure on the training set, you are measuring overfitting, not real performance.
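The 70/15/15 split, sketched in a few lines. The seed is illustrative; production pipelines often split by time rather than randomly, so the model never trains on data from the future.

```python
# Sketch: shuffled 70/15/15 train/validation/test split.
import random

def train_val_test_split(rows, seed=42):
    rows = rows[:]               # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(n * 0.70), int(n * 0.15)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```

Hyperparameters are tuned against val; test is touched exactly once, at the end, so the final number is an honest estimate of production performance.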

Deployment: Moving Models to Production

A trained model is a file: a gradient boosted tree model saved by XGBoost is a few megabytes of JSON or binary data. Deploying means taking that file and putting it in a system where it can serve predictions in real time. A fraud detection model deployed to a payment processor might need to score hundreds of thousands of transactions per second. Latency requirements are sub-100 milliseconds. The model ingests features computed in real time, returns a score, and the authorisation system decides whether to approve or decline.

Deployment requires versioning. You deploy version 1.0 of your fraud model. You monitor it, see performance degrade after two weeks, train a new version, deploy version 2.0. You might keep both versions running and gradually shift traffic from old to new (canary deployment) to ensure the new version does not break production.
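A sketch of deterministic canary routing: hash the customer id into a stable bucket so the same customer always hits the same model version. The version names and the 5 percent canary share are illustrative.

```python
# Sketch: stable hash-based traffic split for a canary deployment.
import hashlib

def model_version(customer_id: str, canary_percent: int = 5) -> str:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100        # stable bucket in [0, 100)
    return "v2.0-canary" if bucket < canary_percent else "v1.0"

versions = [model_version(f"cust-{i}") for i in range(10_000)]
share = versions.count("v2.0-canary") / len(versions)
print(round(share, 3))  # close to 0.05
```

Because the routing is a pure function of the id, a customer never flips between versions mid-session, and the canary share can be dialled up gradually as confidence in v2.0 grows.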

Monitoring and Drift Detection

Production is where models break. A model trained on 2025 data is deployed in 2026. Attackers have evolved. New payment methods emerge. Customer behaviour shifts. The distribution of incoming data (features) diverges from the distribution the model trained on (feature drift). The relationships between features and outcomes change (concept drift). The model's performance degrades.

Monitoring detects this. You track metrics continuously: precision, recall, prediction distributions, feature distributions. When something changes significantly, alerts fire. A model that was 95 percent accurate has dropped to 91 percent. A feature that used to range from 0-100 is now ranging from 0-500. Something changed in the world. The model needs retraining.
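One widely used drift measure is the Population Stability Index (PSI), which compares a feature's current distribution against its training-time baseline. A sketch over pre-bucketed counts; the bucket edges and data are illustrative, and the common rule of thumb treats PSI above roughly 0.2 as significant drift.

```python
# Sketch: Population Stability Index over pre-bucketed feature counts.
import math

def psi(baseline_counts, current_counts):
    """PSI between two bucketed distributions; higher = more drift."""
    b_total, c_total = sum(baseline_counts), sum(current_counts)
    total = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_pct = max(b / b_total, 1e-6)   # floor to avoid log(0)
        c_pct = max(c / c_total, 1e-6)
        total += (c_pct - b_pct) * math.log(c_pct / b_pct)
    return total

# Transaction-amount buckets (small -> large): training era vs today
baseline = [500, 300, 150, 50]     # mostly small amounts
stable   = [510, 290, 148, 52]     # same shape: low PSI
shifted  = [200, 250, 300, 250]    # mass moved to large amounts: high PSI
print(round(psi(baseline, stable), 4), round(psi(baseline, shifted), 4))
# 0.0006 0.7099
```

The stable distribution barely registers; the shifted one blows past the alert threshold, which is exactly the signal that fires a retraining job.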

Retraining and the MLOps Loop

Retraining is not a one-time event. It is a continuous cycle. Weekly retraining is common. For fast-moving domains like fraud, daily or even hourly retraining happens. The system collects new data, features are computed, the model is retrained on the latest data, and if performance improves (on a held-out validation set), the new model is deployed.

Automation is critical. Manual retraining does not scale. Machine learning operations (MLOps) is the discipline of automating the entire pipeline: data collection, feature computation, model training, validation, deployment, monitoring, and retraining. Tools like Airflow, Kubeflow, and Databricks orchestrate the pipeline, ensuring it runs reliably, logs everything, and alerts on failure.


Data Drift, Concept Drift, and Model Decay

The reason production models fail is simple: the world changes, and models do not. Understanding drift is understanding why your model works today and breaks in six months.

Feature Drift: The Distribution Changed

Feature drift (also called data drift) is when the distribution of your input features changes. You trained your fraud model on 2025 transaction data. The average transaction was $200. The median was $120. In 2026, your average transaction is $400. The median is $280. The distribution shifted.

Why does this matter? Because your model learned what "normal" looks like based on the 2025 distribution. A $500 transaction was in the 99th percentile (very unusual). In 2026, it is in the 80th percentile (fairly common). The model scored transactions at $500 as high risk, based on 2025 patterns. In 2026, those transactions are legitimate but still scored as risky. False declines increase. Revenue decreases.

Feature drift can happen for legitimate reasons (seasonality, business growth, new customer segments) or adversarial reasons (attackers adapting to evade detection). Either way, the model must adapt.

Concept Drift: The Relationship Changed

Concept drift is when the relationship between features and outcomes changes. You trained a fraud model where the feature "late-night transaction" was a strong signal for fraud. In 2025, late-night transactions were 5 times more likely to be fraudulent than daytime transactions. In 2026, due to changes in global commerce and remote work, late-night transactions are equally likely to be fraudulent as daytime transactions. The feature that was predictive is no longer predictive.

Concept drift is faster and more dangerous than feature drift. Your monitoring system might not catch it immediately. The feature distribution looks similar (still seeing late-night transactions at the same frequency). But the predictive power degraded. The model keeps using the feature at its old weight, making worse decisions than it should.

Model Decay and Retraining Strategies

Model decay is the cumulative effect of drift. A model that was 95 percent accurate on day 1 is 93 percent accurate on day 30, 91 percent on day 90. Performance does not drop off a cliff. It degrades slowly. And slowly degrading accuracy often goes unnoticed until the business notices the effect: more fraud slipping through, more false declines, more customer complaints.

Retraining strategy is how you fight decay. Periodic retraining (every week, every day, every hour) is the baseline. Some organisations retrain weekly and look for improvement on a validation set before deploying. Others retrain daily. The most sophisticated retrain hourly or continuously, using streaming data. If you cannot detect model decay until you have lost millions to fraud, your retraining strategy is too slow.


Feature Stores: The Bridge Between Data and Models

Features are not computed just once. They are computed at training time, and then they must be computed again, in real time, during serving. A feature like hours-since-customer-first-transaction is easy to compute. But hundreds of features, computed in real time, at scale, across millions of transactions, becomes a coordination problem.

A feature store is a central repository for features. It stores feature definitions (how to compute a feature), feature data (precomputed features for customers or transactions), and serves features to models at prediction time. In training, the feature store provides historical feature values. In serving, it looks up current feature values and serves them to the model for scoring.

Feature stores solve consistency problems. You do not want training features computed differently than serving features. You want the same feature definition used everywhere. A feature store enforces that. Tools like Tecton, Feast, and Databricks provide feature store infrastructure.
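A toy feature store in a few lines: one feature definition serves both the training path and the serving path, which is exactly the consistency guarantee described above. The class and method names are invented for illustration; real systems (Tecton, Feast) add storage backends, point-in-time joins, and streaming materialisation.

```python
# Toy sketch of a feature store: one definition, two paths.

class FeatureStore:
    def __init__(self):
        self._definitions = {}   # name -> computation function
        self._online = {}        # (name, entity_id) -> precomputed value

    def register(self, name, fn):
        self._definitions[name] = fn

    def compute_training(self, name, historical_rows):
        """Batch-compute a feature over historical data for training."""
        fn = self._definitions[name]
        return [fn(row) for row in historical_rows]

    def materialise(self, name, entity_id, row):
        """Precompute the current value for low-latency serving."""
        self._online[(name, entity_id)] = self._definitions[name](row)

    def serve(self, name, entity_id):
        """Fast lookup at prediction time."""
        return self._online[(name, entity_id)]

store = FeatureStore()
# One definition, shared by the training and serving paths
store.register("txn_amount_ratio", lambda r: r["amount"] / r["avg_amount"])

training = store.compute_training(
    "txn_amount_ratio",
    [{"amount": 100, "avg_amount": 50}, {"amount": 30, "avg_amount": 60}],
)
store.materialise("txn_amount_ratio", "cust-42", {"amount": 200, "avg_amount": 50})
print(training, store.serve("txn_amount_ratio", "cust-42"))  # [2.0, 0.5] 4.0
```

Because both paths call the same registered function, training/serving skew on this feature is impossible by construction.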


A/B Testing Models in Production

You trained a new fraud model. It is 3 percent more accurate on your test set. Should you deploy it? Not necessarily. A/B testing answers the question: does this new model actually improve business metrics when deployed?

A/B testing runs both models in parallel. The old model scores 50 percent of traffic. The new model scores 50 percent of traffic. You measure the outcomes for each: fraud rate, false decline rate, revenue, customer churn. If the new model produces better outcomes on the metrics you care about, you deploy it fully. If not, you stick with the old model or investigate why the new model underperforms in production despite higher accuracy.
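A sketch of the comparison step: compute business outcomes per arm from (was_fraud, was_declined) records. The numbers are invented to show how a "better" model can still lose the A/B test.

```python
# Sketch: per-arm business metrics for an A/B test of two fraud models.

def arm_metrics(outcomes):
    """outcomes: list of (was_fraud, was_declined) per transaction."""
    n = len(outcomes)
    fraud_approved = sum(1 for f, d in outcomes if f and not d)
    false_declines = sum(1 for f, d in outcomes if not f and d)
    return {"fraud_rate": fraud_approved / n,
            "false_decline_rate": false_declines / n}

# Old model: lets more fraud through, declines fewer good customers
old = [(True, False)] * 30 + [(False, True)] * 50 + [(False, False)] * 9920
# New model: catches more fraud, but declines many more good customers
new = [(True, False)] * 10 + [(False, True)] * 120 + [(False, False)] * 9870

print(arm_metrics(old))
print(arm_metrics(new))
```

The new model cuts fraud losses but more than doubles false declines. Whether it wins depends on the relative cost of each error, which is a business question no offline accuracy number can answer.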

A/B testing catches problems that offline testing misses. Sometimes a model is more accurate on historical data but makes decisions that are regrettable in production. Sometimes accuracy improves but false declines increase more than fraud decreases, hurting revenue. A/B testing is the ground truth for model quality in production.


MLOps in Financial Services

Machine learning operations (MLOps) is the discipline that makes the entire pipeline reliable, auditable, and compliant. Financial institutions cannot tolerate model failures. A fraud detection model that makes arbitrary decisions is a regulatory problem. A model that cannot explain why it declined a customer is a legal problem. A model that crashes in production is a business problem.

MLOps enforces: versioning (every model is tagged with a version, lineage is tracked), reproducibility (the same input data and hyperparameters always produce the same model), auditing (every decision is logged, including which model version made it), monitoring (model performance is tracked continuously), and governance (only approved models are deployed, model changes are reviewed).

Financial institutions spend as much engineering effort on MLOps as on model development. The model code might be 5 percent of the effort. The infrastructure, monitoring, versioning, and governance are the other 95 percent.

If you deploy a machine learning model for a high-stakes financial decision today, could you explain to a regulator, six months from now, exactly which training data it used, which features it used, which hyperparameters were optimised, and why it made each decision?

Key Takeaways

Supervised Learning
Learning from labelled examples where we know the correct answer. The model ingests features and predicts labels (classification) or continuous values (regression).
Unsupervised Learning
Learning patterns from unlabelled data. Anomaly detection finds outliers. Clustering groups similar datapoints. No examples with known correct answers.
Gradient Boosted Trees
Ensemble of decision trees built iteratively, each correcting errors of previous ones. Fast, accurate, interpretable. Dominates financial services for tabular data.
Feature Engineering
Transforming raw data into model-ready features. The most important determinant of model performance. Good features capture domain knowledge about the problem.
Overfitting
Model memorises training data and fails on new data. Results in high training accuracy but poor production performance. Regularisation and validation detect overfitting.
Data Drift
Distribution of input features changes over time. A feature that ranged 0-100 now ranges 0-500. Model performance degrades because it is seeing data different from training data.
Concept Drift
Relationship between features and outcomes changes over time. A feature that was predictive becomes non-predictive. More dangerous than data drift and harder to detect.
Feature Store
Central repository for features. Stores feature definitions and precomputed values. Serves features to models at training time and production serving time.
MLOps
Machine Learning Operations. Discipline of automating and monitoring the entire ML pipeline: data, training, validation, deployment, monitoring, retraining.
Ensemble Method
Combines predictions from multiple models into a final prediction. More robust than any single model. Gradient boosting is an ensemble method.
Model Decay
Cumulative degradation of model performance over time due to data drift and concept drift. A model that worked well at deployment becomes worse every day.
Hyperparameter Tuning
Optimisation of model configuration knobs (learning rate, tree depth, regularisation) to minimise validation error without overfitting.
Next Module
AI in Fraud and Risk
Real-time fraud scoring architectures. Feature signals. Ensemble methods. The precision/recall trade-off. Why every 1% improvement saves millions.