We're building an AI code review tool that uses large language models (LLMs) to detect bugs and anti-patterns in pull requests. The quality of our reviews depends heavily on the underlying LLMs, so we continuously test new models to see how well they catch real-world bugs. Bug detection demands more than pattern matching: the model has to reason about logic, structure, and developer intent.

We recently tested OpenAI's 4o against o1 at finding difficult bugs in code. The difference wasn't massive, but it was consistent: 4o caught 20 bugs across 210 files, compared to o1's 15. Notably, 4o outperformed o1 on Python and Go programs, suggesting its architecture or training data also gives it an edge in logical inference. It also caught a bug that required understanding the intent behind a strategy, underscoring how important reasoning is for AI code review.

We're still early in the evolution of AI for software verification, but models like 4o are pushing the boundaries and showing real signs of improvement.
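
To make the shape of the evaluation concrete, here is a minimal Python sketch of the kind of harness a test like this implies. It is illustrative only: the `Case` structure, prompt wording, file paths, and keyword-matching heuristic are assumptions, not our actual pipeline. The only real pieces are the OpenAI chat completions API and the model names, which may differ depending on API access.

```python
# Illustrative sketch of a bug-detection benchmark loop (not our production pipeline).
# Each Case pairs a source file with a description of a known, seeded bug;
# the prompt and the "caught" heuristic below are placeholders.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Case:
    path: str        # source file under review
    language: str    # e.g. "python" or "go"
    known_bug: str   # short description of the seeded bug

PROMPT = (
    "You are reviewing a pull request. Identify any bugs in the following "
    "{language} file. Describe each bug in one sentence.\n\n{code}"
)

def model_catches_bug(model: str, case: Case) -> bool:
    with open(case.path) as f:
        code = f.read()
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o" or "o1"; identifiers may vary by account
        messages=[{
            "role": "user",
            "content": PROMPT.format(language=case.language, code=code),
        }],
    )
    review = (resp.choices[0].message.content or "").lower()
    # Naive check: count the bug as caught if every key term from the known
    # bug description appears in the review text.
    return all(term in review for term in case.known_bug.lower().split())

def score(model: str, cases: list[Case]) -> int:
    # Total number of seeded bugs the model flags across all files.
    return sum(model_catches_bug(model, c) for c in cases)
```

In practice, keyword matching is too blunt a grader, since free-form reviews rarely echo a bug description word for word; a real harness would need a stricter rubric or a human in the loop to decide whether a flagged issue really matches the seeded bug.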