Company
Date Published
Author
Dhruv Nair
Word count
1217
Language
English
Hacker News points
None

Summary

The Comet.ml team is participating in the Kaggle Home Credit Default Risk competition, a challenge focused on predicting loan default risk for applicants with limited credit histories using alternative data sources. This first post in a series walks through their exploratory data analysis (EDA), examining the distribution and correlation of features in the provided dataset, which contains a large number of numerical and categorical features. They use tools such as LightGBM and Random Forest to rank feature importance, and they note that the target variable distribution is imbalanced, which influences the choice of evaluation metric. Principal Component Analysis (PCA) is applied to reduce the dimensionality of highly correlated float-valued features, while categorical features are One Hot Encoded so their importance can be assessed. The team uses Comet.ml to log experiments, track datasets, and optimize hyperparameters, and their initial LightGBM model reaches an AUC score of 0.745. This post lays the groundwork for the more advanced feature engineering covered in future installments.
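
Below is a minimal sketch of the workflow described above, assuming the competition's application_train.csv file. The Comet project name, imputation strategy, variance threshold, and LightGBM parameters are illustrative placeholders, not the values used in the post.

```python
# Sketch of the EDA-to-baseline pipeline: one-hot encoding, PCA on float
# features, a LightGBM baseline, and Comet.ml logging. Assumptions are
# marked in comments; they are not taken from the original post.
from comet_ml import Experiment  # import before other ML libraries so Comet can hook into them
import pandas as pd
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Log the run to Comet.ml (reads COMET_API_KEY from the environment;
# the project name here is a placeholder).
experiment = Experiment(project_name="home-credit-default")

df = pd.read_csv("application_train.csv")
y = df["TARGET"]
X = df.drop(columns=["TARGET", "SK_ID_CURR"])

# One Hot Encode categorical features so their importance can be ranked,
# then sanitize column names so LightGBM accepts them.
X = pd.get_dummies(X)
X.columns = ["".join(ch if ch.isalnum() else "_" for ch in str(col)) for col in X.columns]

# Reduce highly correlated float-valued features with PCA, keeping enough
# components to explain 95% of the variance (an arbitrary threshold here).
float_cols = X.select_dtypes(include="float").columns
X[float_cols] = X[float_cols].fillna(X[float_cols].median())
pca = PCA(n_components=0.95)
pca_features = pd.DataFrame(pca.fit_transform(X[float_cols]), index=X.index).add_prefix("pca_")
X = pd.concat([X.drop(columns=float_cols), pca_features], axis=1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline LightGBM model; AUC is used because the imbalanced target makes
# plain accuracy a poor measure of performance.
params = {"objective": "binary", "metric": "auc", "is_unbalance": True}
model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=200)

val_auc = roc_auc_score(y_val, model.predict(X_val))
experiment.log_parameters(params)
experiment.log_metric("val_auc", val_auc)
```

In this sketch the experiment object captures the parameters and the validation AUC, which is the kind of tracking the post relies on when comparing baseline runs before moving on to feature engineering.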