Comparative Analysis of ML Algorithms on Imbalanced Financial Datasets
In financial data science, the biggest hurdle isn't building the model; it's the data itself. How do we accurately predict a rare event when our dataset is overwhelmingly biased toward the norm?
When developing models to predict loan defaults, data scientists routinely run into class imbalance. Exploratory Data Analysis (EDA) on financial datasets typically reveals a stark reality: far more individuals have good loans (Risk_Flag = 0) than bad loans (Risk_Flag = 1).
This imbalance creates a major problem. When the minority class makes up such a small proportion of the dataset, standard Machine Learning models struggle to learn a useful decision boundary. If, say, 90% of the loans are good, a model that simply guessed "No Default" every time would achieve 90% accuracy while catching no defaults at all, which makes it useless for risk management.
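The accuracy trap described above can be reproduced in a few lines. This is a minimal sketch on made-up labels (90 good loans, 10 defaults); the helper names are illustrative and not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model flagged as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical labels: 90 good loans (Risk_Flag = 0) and 10 defaults (Risk_Flag = 1).
risk_flag = [0] * 90 + [1] * 10

# A "model" that always predicts "No Default".
always_no_default = [0] * len(risk_flag)

baseline_accuracy = accuracy(risk_flag, always_no_default)
baseline_recall = recall(risk_flag, always_no_default)
print(baseline_accuracy, baseline_recall)
```

The baseline scores well on accuracy yet has zero recall on the default class, which is the class a lender actually cares about.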
Data Preparation and SMOTE
Before any modeling can begin, the dataset requires careful pre-processing. Categorical variables (like Profession, City, State, and Car Ownership) must be converted into binary indicator columns using one-hot encoding so that the ML model can process them numerically. Numerical features are then normalized so that they all share the same scale, which improves the model's stability during training.
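As an illustration of these two steps, here is a small sketch in plain Python. It assumes min-max scaling as the normalization method (the text does not say which one was used), and the sample values and the income column are hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample values, for illustration only.
car_ownership = ["yes", "no", "no", "yes"]
income = [30000, 50000, 70000, 110000]

categories, encoded = one_hot(car_ownership)
scaled_income = min_max_scale(income)

print(categories)     # ['no', 'yes']
print(encoded)        # [[0, 1], [1, 0], [1, 0], [0, 1]]
print(scaled_income)  # [0.0, 0.25, 0.5, 1.0]
```

In practice the category list and the min/max values would be learned on the training split and reused unchanged on the test split.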
To mitigate the severe class imbalance, researchers turn to the Synthetic Minority Oversampling Technique (SMOTE). Prior to fitting the model, SMOTE generates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors, and these new samples are added to the training dataset. While SMOTE addresses unbalanced classification to a certain extent, it does not completely eliminate the issue.
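The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the reference implementation: the function name, the value of k, and the sample points are assumptions, and production code would normally rely on a maintained library instead.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two already-scaled features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))  # 6
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.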
Because of this lingering imbalance, standard "Accuracy" is a deceptive metric. Models must instead be evaluated on Average Precision, AUC-PR (Area Under the Precision-Recall Curve) Score, and the Weighted F1-Score to gain a true understanding of their predictive power.
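To make the Average Precision metric concrete, the sketch below computes it from ranked scores. It assumes binary labels and no tied scores; the labels and scores are invented for illustration.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision_sum += true_pos / rank  # precision at this recall step
    return precision_sum / total_pos

# Hypothetical example: 2 defaults among 10 loans.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.8333
```

Unlike accuracy, this score only rewards ranking the rare positive cases near the top; a model with no skill lands close to the positive rate of the dataset (0.2 in this toy example).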
Evaluating the Algorithms: A Deep Dive
A rigorous comparative analysis of seven distinct ML architectures yielded surprising insights into how different algorithms handle imbalanced financial data:
1. Gradient Boosting & XGBoost
Gradient Boosting builds an ensemble step by step, with each new model correcting the errors of the previous ones, and it can optimize any differentiable loss function. XGBoost builds on this by adding regularization terms to improve generalization. Despite achieving high accuracies (87.58% and 87.74%), their precision-recall curves showed an inability to identify positive cases while avoiding false positives, yielding very poor AUC-PR scores of 0.1821 and 0.1700 respectively.
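The stage-wise idea behind gradient boosting can be shown with a toy sketch. It assumes squared-error loss (for which the negative gradient is simply the residual), one-dimensional inputs, and single-split stumps as weak learners; none of these choices, nor the data, come from the study.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs (needs 2+ distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit an additive model: each stump is trained on the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient equals the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical toy data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
train_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Swapping in a different differentiable loss only changes how the residual-like targets are computed in each round; the stage-wise loop stays the same.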
2. Long Short-Term Memory (LSTM)
LSTM is a recurrent neural network (RNN) architecture designed to mitigate the vanishing gradient problem. It uses an input gate, an output gate, and a forget gate to control the flow of information into and out of the cell state. While theoretically powerful, it only achieved an AUC-PR score of 0.3895 on this dataset, failing to identify a large share of the positive cases.
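The gating mechanism can be written out for a single LSTM unit. The sketch below uses the standard LSTM update equations with scalar input and state; the weight values are arbitrary placeholders, not trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a single-unit LSTM cell with scalar input and state."""
    def gate(name, activation):
        w, u, b = params[name]  # input weight, recurrent weight, bias
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the new candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Arbitrary placeholder weights, for illustration only.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

Because the cell state is updated additively (old state scaled by the forget gate plus gated new content), gradients can flow across many time steps without being repeatedly squashed.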
3. LightGBM
LightGBM grows its trees leaf-wise rather than level-wise for efficient boosting, and it offers advantages such as sparse optimization and early stopping. However, on this specific imbalanced dataset it performed weakly, with an AUC-PR score of just 0.2430.
4. Gaussian Naive Bayes
Naive Bayes classifiers are a family of probabilistic classifiers that assume conditional independence between features given the class; the Gaussian variant models each continuous feature with a per-class normal distribution. While highly scalable and requiring few parameters, it proved to be the worst-performing ML model in the study. It yielded an accuracy of 87.59% but a dismal AUC-PR score of 0.1582.
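A from-scratch sketch makes the independence assumption visible: the class score is a log prior plus a sum of per-feature Gaussian log-likelihoods. The function names, the variance floor, and the data are illustrative assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Hypothetical two-feature data: three good loans (0) and two defaults (1).
X = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (3.0, 0.5), (3.2, 0.4)]
y = [0, 0, 0, 1, 1]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, (1.1, 1.9)))   # 0
print(predict_gaussian_nb(nb_model, (3.1, 0.45)))  # 1
```

Note that the log prior directly penalizes the rare class, which is one way class imbalance can pull such a model toward the majority label.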
5. Decision Trees & Random Forest
The standard Decision Tree classifier actually outperformed the LSTM and all of the boosting models, with an AUC-PR of 0.4890. However, the undisputed winner was the Random Forest Classifier. By employing an ensemble learning strategy in which numerous decision trees each cast a vote, it achieved an AUC-PR score of 0.6160 and an accuracy of 89.11%.
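The ensemble idea can be sketched with bootstrap sampling and majority voting. This is a deliberately reduced illustration: each "tree" is a single-split stump on one randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every split. The data and names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """One-split tree on a single feature; falls back to the majority class."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        lvote, rvote = majority(left), majority(right)
        errors = sum(l != lvote for l in left) + sum(l != rvote for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, lvote, rvote)
    if best is None:  # only one distinct value in this bootstrap sample
        constant = majority(labels)
        return lambda row: constant
    _, threshold, lvote, rvote = best
    return lambda row: lvote if row[feature] <= threshold else rvote

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        feature = rng.randrange(n_features)         # random feature for this tree
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return trees

def predict_forest(trees, row):
    """Every tree votes; the most common label wins."""
    return majority([tree(row) for tree in trees])

# Hypothetical two-feature data: label 0 = good loan, label 1 = default.
rows = [(0.8, 1.1), (1.0, 0.9), (1.2, 1.0), (0.9, 1.2),
        (2.8, 3.1), (3.0, 2.9), (3.2, 3.0), (3.1, 2.8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Because each tree sees a different resample of the data, individual mistakes tend to be uncorrelated, and the vote averages them out.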
The findings emphasize a critical lesson for financial data scientists: while advanced machine learning models offer superior predictive capabilities, the imbalanced nature of real-world datasets heavily impacts their true performance. Future work must continue to innovate on handling class imbalance to unlock the full potential of these algorithms.