Improving massively imbalanced datasets in machine learning with synthetic data

Company

Gretel.ai

Date Published

March 26, 2022

Author

Alex Watson

Word count

1220

Language

English

Hacker News points

URL

gretel.ai/blog/improving-massively-imbalanced-datasets-in-machine-learning-with-synthetic-data

Summary

Imbalanced datasets are a persistent challenge in machine learning, especially in fraud detection and cybersecurity, where the minority class (for example, fraudulent transactions) has very few instances. This post uses a popular Kaggle credit card fraud dataset to demonstrate how synthetic data can improve model accuracy. A generative synthetic data model creates additional fraudulent records by learning from both the fraudulent records and their nearest neighbors, which are labeled non-fraudulent but are potentially suspicious. The method, inspired by the Synthetic Minority Oversampling Technique (SMOTE), generates new instances that help the classifier generalize better when detecting fraud. It relies on Gretel Synthetics, a tool that uses deep learning to generate synthetic data, with training tuned to produce new records without overfitting. Adding the synthetic records to the training set lowers the negative-to-positive ratio and can boost the model's ability to detect fraud by up to 14%. The result underscores the potential of synthetic data to overcome extreme class imbalance and improve how models generalize across datasets.
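
The neighbor-selection idea in the summary can be illustrated with a minimal sketch. It is not the author's code and does not call Gretel Synthetics; it only shows, in plain Python, how one might collect each fraud record together with its nearest non-fraud neighbors as a seed set for a generative model, and how the negative-to-positive ratio changes once synthetic fraud rows are added. The function names (`build_seed_set`, `negative_to_positive_ratio`), the Euclidean distance metric, the toy data, and the count of four synthetic rows are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_seed_set(features: List[Sequence[float]],
                   labels: List[int],
                   k: int = 1) -> List[int]:
    """Return sorted row indices of every fraud record (label 1) plus the
    k nearest non-fraud records (label 0) of each fraud record.

    Hypothetical helper: the distance metric and k are assumptions.
    """
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    seed = set(fraud_idx)
    for f in fraud_idx:
        # Rank all non-fraud rows by distance to this fraud row.
        nearest = sorted(
            normal_idx,
            key=lambda n: euclidean(features[f], features[n]),
        )[:k]
        seed.update(nearest)
    return sorted(seed)


def negative_to_positive_ratio(labels: List[int]) -> float:
    """Number of non-fraud records per fraud record."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive (fraud) records in labels")
    return negatives / positives


# Toy data (invented for illustration): one fraud row among five normal rows.
features = [
    [0.0, 0.0],  # row 0: normal
    [0.1, 0.0],  # row 1: normal
    [5.0, 5.0],  # row 2: fraud
    [5.1, 5.0],  # row 3: normal, but very close to the fraud row
    [9.0, 9.0],  # row 4: normal
    [0.2, 0.1],  # row 5: normal
]
labels = [0, 0, 1, 0, 0, 0]

# Rows a generative model would be trained on in this sketch.
seed_rows = build_seed_set(features, labels, k=1)

# Ratio before augmentation, and after appending four synthetic fraud rows
# (the count of four is arbitrary; a real run would use generated records).
ratio_before = negative_to_positive_ratio(labels)
ratio_after = negative_to_positive_ratio(labels + [1] * 4)
```

In a real pipeline the seed rows would be passed to the generative model, and the records it produces would be appended to the training set as fraud examples. The brute-force neighbor search here is quadratic and only suitable for small samples; a library nearest-neighbor index would be the usual choice at scale.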