Whisper, an open-source speech-to-text system released by OpenAI, takes an approach that differs from current state-of-the-art systems. Its training set contains 680,000 hours of audio, far larger than the datasets used in typical previous training regimes. Whisper uses a simple autoregressive encoder-decoder architecture trained with a cross-entropy loss on text tokens, conditioned on the audio. This setup yields impressive performance under supervised learning, but it also raises questions about how well the model generalizes out of distribution. The authors suspect that a systemic weakness in common evaluation protocols, together with overfitting to spurious correlations in the training data, may contribute to this gap. Despite these challenges, Whisper achieves human-level performance on certain tasks, demonstrating the potential of simple models paired with large-scale datasets. However, the plateauing of internet-scale supervised learning for English ASR suggests that more innovative approaches, such as self-supervised learning, are needed to overcome the limitations of current state-of-the-art systems.
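As a rough illustration (not Whisper's actual implementation), the training objective described above can be sketched as token-level cross-entropy over per-step scores; in Whisper these scores would come from the decoder, which attends to the audio encoder's output, so the loss is conditioned on the audio. The function and toy numbers below are hypothetical:

```python
import math

def cross_entropy_loss(logits, target_ids):
    """Mean cross-entropy over a target token sequence.

    logits: one list of vocabulary scores per decoding step
            (in Whisper, produced by a decoder attending to the
            audio encoder's output).
    target_ids: the correct token id at each step.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # negative log-softmax probability of the target token
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)

# toy example: 3 decoding steps, vocabulary of 4 tokens
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 3.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 2.5]]
targets = [0, 1, 3]
loss = cross_entropy_loss(logits, targets)
```

During training the decoder is fed the ground-truth previous tokens (teacher forcing), so the loss decomposes into exactly this kind of per-step sum.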