Machine learning models are often selected by their accuracy on a validation set, but this can be misleading: validation accuracy does not necessarily predict real-world performance when the data distribution shifts. The article argues that robustness tests, which evaluate whether a model's predictions remain consistent under various input variations, are a more reliable indicator of generalization to real-world data. Using the histopathology dataset Camelyon17-WILDS, where data from different hospitals introduces a domain generalization challenge, the article shows that models selected via robustness tests, such as ResNet-101, can outperform models chosen on validation accuracy alone. Lakera has built a tool, MLTest, to make robustness testing practical: users can assess model performance without collecting new data and gain a clearer picture of how a model will behave in real-world applications.
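To make the idea concrete, here is a minimal sketch of a prediction-consistency robustness test. This is not MLTest's actual API; the `predict` classifier, the perturbations chosen, and the `robustness_score` helper are all hypothetical, assumed for illustration. The test perturbs each input slightly and reports the fraction of inputs whose prediction never changes.

```python
import numpy as np

# Hypothetical stand-in for a trained classifier such as ResNet-101
# (illustration only: labels an image by its mean intensity).
def predict(image: np.ndarray) -> int:
    return int(image.mean() > 0.5)

def perturb(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a small random brightness shift plus Gaussian pixel noise."""
    shifted = image + rng.uniform(-0.05, 0.05)
    noisy = shifted + rng.normal(0.0, 0.02, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def robustness_score(model, images, n_variants: int = 10, seed: int = 0) -> float:
    """Fraction of images whose prediction stays unchanged under all perturbed variants."""
    rng = np.random.default_rng(seed)
    consistent = 0
    for img in images:
        base = model(img)
        if all(model(perturb(img, rng)) == base for _ in range(n_variants)):
            consistent += 1
    return consistent / len(images)

# Toy "dataset": two images far from the decision boundary.
images = [np.full((8, 8), 0.9), np.full((8, 8), 0.1)]
print(robustness_score(predict, images))  # 1.0 for these easy inputs
```

A model that keeps its predictions under such perturbations, even at slightly lower validation accuracy, is the kind of candidate the article suggests will generalize better when deployment data shifts, for example across hospitals.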