Benchmarking AI Model Code Fix Generation for Mobile App Crashes

Post Details

Company

Luciq

Date Published

April 10, 2025

Author

Mostafa Megahid

Word Count

1,328

Language

English

Hacker News Points

-

Source URL

www.luciq.ai/blog/benchmarking-ai-code-fix-mobile-crashes

Summary

AI models are advancing rapidly and new versions are released frequently. Top language models generally perform similarly, but some excel at specific tasks. While developing SmartResolve, Luciq's AI-powered code fix feature, the team evaluated several large language models (LLMs) to identify those best suited to analyzing crash reports and generating code fixes on iOS and Android. The evaluation used standardized tests and real-world data to assess models including OpenAI's GPT-4o, Anthropic's Claude Sonnet, and Meta's LLaMA against five criteria: correctness, similarity, depth, relevance, and coherence. Most models performed better on iOS than on Android. GPT-4o, Claude 3.5 Haiku V1, and Claude 3.5 Sonnet V1 emerged as the top performers because of their consistency and structured outputs. LLaMA-3-70b and OpenAI o1 struggled, especially on Android, because of poor correctness and slow response times. The evaluation recommended a hybrid model strategy for SmartResolve: high-coherence models such as GPT-4o for structured responses, combined with stable models such as Claude 3.5 Haiku and Claude 3.5 Sonnet for balanced performance across platforms. The study also highlights the need to keep re-running the evaluation as new models enter the market, so that AI-powered mobile crash resolution stays on the best available models.
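
As a rough illustration of how per-criterion scores like those named in the summary could be rolled up into a per-platform ranking, here is a minimal Python sketch. It is not Luciq's actual scoring method: the unweighted average, the function names `aggregate` and `rank`, the model names `model_a` and `model_b`, and every score below are hypothetical placeholders, not results from the evaluation.

```python
from statistics import mean

# The five criteria named in the summary.
CRITERIA = ("correctness", "similarity", "depth", "relevance", "coherence")


def aggregate(scores):
    """Average each model's criterion scores into one number per platform.

    `scores` maps (model, platform) -> {criterion: score}.
    An unweighted mean is an assumption; the source does not say how
    the criteria were combined.
    """
    result = {}
    for (model, platform), by_criterion in scores.items():
        overall = mean(by_criterion[c] for c in CRITERIA)
        result.setdefault(model, {})[platform] = overall
    return result


def rank(aggregated, platform):
    """Return model names ordered from best to worst on one platform."""
    return sorted(aggregated, key=lambda m: aggregated[m][platform], reverse=True)


# Hypothetical placeholder scores, NOT figures from the evaluation.
example_scores = {
    ("model_a", "ios"): dict.fromkeys(CRITERIA, 4),
    ("model_a", "android"): dict.fromkeys(CRITERIA, 3),
    ("model_b", "ios"): dict.fromkeys(CRITERIA, 3),
    ("model_b", "android"): {**dict.fromkeys(CRITERIA, 3), "correctness": 5},
}

aggregated = aggregate(example_scores)
ios_ranking = rank(aggregated, "ios")
android_ranking = rank(aggregated, "android")
```

Ranking each platform separately, rather than averaging across both, is what makes a hybrid strategy visible: a model can lead on one platform and trail on the other.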