Reducing AI bias with Synthetic data

Company

Gretel.ai

Date Published

Jan. 11, 2021

Author

Alex Watson

Word count

870

Language

English

Hacker News points

None

URL

gretel.ai/blog/reducing-ai-bias-with-synthetic-data

Summary

This post explores using synthetic data to balance a biased health dataset hosted on Kaggle and to improve overall model accuracy. The Heart Disease dataset published by the University of California, Irvine is imbalanced: male patient records make up 68% of the dataset and female patient records only 32%. To reduce this bias in the input data, synthetic female patient records were generated with Gretel.ai's open-source synthetic data library. The augmented dataset was then run through ML algorithms on Kaggle, and the results were compared against those from the original training set. For five of the six classification algorithms, accuracy increased when the model was trained on the augmented dataset, with KNN reaching 96.7% overall accuracy (up from 88.5%) and the Decision Tree classifier gaining 13%.
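
As a rough illustration of the balancing step, the sketch below computes how many synthetic minority-group records would have to be added for that group to reach a target share of the dataset. It is an assumption-laden sketch, not the post's code: it does not call Gretel.ai's library, the function name `synthetic_records_needed` is hypothetical, and the counts of 68 and 32 are simply the summary's 68%/32% split expressed per 100 records rather than the dataset's real row counts.

```python
import math


def synthetic_records_needed(majority: int, minority: int,
                             target_minority_share: float = 0.5) -> int:
    """Return how many minority records to add to reach the target share.

    Hypothetical helper for illustration only. It solves
        (minority + x) / (majority + minority + x) >= target_minority_share
    for the smallest non-negative integer x.
    """
    if not 0 < target_minority_share < 1:
        raise ValueError("target_minority_share must be between 0 and 1")
    total = majority + minority
    needed = (target_minority_share * total - minority) / (1 - target_minority_share)
    # Never return a negative count: an already balanced group needs nothing.
    return max(0, math.ceil(needed))


# Illustrative counts only: the 68% / 32% split expressed per 100 records.
extra = synthetic_records_needed(majority=68, minority=32)
print(extra)  # 36
```

With these illustrative numbers, adding 36 synthetic female records per 100 original records would bring the two groups to an even 68/68 split. How many records the post actually generated is not stated in the summary.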