OpenAI's o3-mini and 4o-mini models were compared on their ability to find real bugs in software. The benchmark dataset consisted of 210 programs spanning five programming languages, each seeded with a realistic bug. The results showed that o3-mini caught nearly twice as many bugs as 4o-mini, with consistently better performance across all languages. The gap between the models can be attributed to differences in planning and reasoning capabilities, model architecture, and training data. While 4o-mini still shows potential, particularly for surface-level issues and languages with heavy training coverage, o3-mini is the better choice for catching hard-to-spot bugs that require a deeper understanding of logic-heavy code.
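
Below is a minimal sketch of how a benchmark run like this might be wired up. The `seeded_bugs.json` file name, its record fields, and the keyword-match grading step are all assumptions standing in for the real dataset and evaluation; the Chat Completions call itself is standard OpenAI Python SDK usage.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical dataset: each entry holds a seeded-bug program and a short
# description of the bug, used here to judge whether the model found it.
with open("seeded_bugs.json") as f:  # assumed file name and schema
    cases = json.load(f)  # [{"code": ..., "bug_description": ..., "language": ...}, ...]

def model_finds_bug(model: str, case: dict) -> bool:
    """Ask the model to review the program, then check its answer against
    the known seeded bug. A crude keyword match stands in for real grading."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Review the following code and point out any bug:\n\n" + case["code"],
            }
        ],
    )
    answer = response.choices[0].message.content.lower()
    return case["bug_description"].lower() in answer  # placeholder check

for model in ("o3-mini", "gpt-4o-mini"):
    caught = sum(model_finds_bug(model, case) for case in cases)
    print(f"{model}: {caught}/{len(cases)} seeded bugs caught")
```

In practice, grading would be stricter than a keyword match (for example, a human or model-based judge comparing the answer to the seeded bug), but the overall loop of prompting each model on every seeded program and tallying catches is the shape of the comparison described above.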