Why do tree-based models still outperform deep learning on tabular data? (2022)

In this paper, we ask whether deep learning outperforms traditional tree-based models such as XGBoost and Random Forests on tabular data. Through extensive benchmarks across many datasets and hyperparameter combinations, we find that tree-based models remain state-of-the-art on medium-sized datasets (around 10K samples), even without accounting for their superior speed. We investigate the differing inductive biases of tree-based models and neural networks, which leads to a set of challenges for researchers looking to develop tabular-specific neural networks. To facilitate further research in this area, we provide a standard benchmark and raw data for baseline comparisons.

https://arxiv.org/abs/2207.08815