Generative AI and agent systems are transforming the insurance industry by automating underwriting, claims handling, and fraud detection while personalizing customer interactions. Companies such as Lemonade, GEICO, Allstate, and AXA have deployed AI agents to improve customer service, cut costs, and raise efficiency; Lemonade's AI-driven platform, for example, processes millions of claims and provides real-time fraud alerts. Real-world deployments have also exposed significant problems, including biased underwriting, inaccurate claim denials, and chatbot errors, which underscore the need for robust guardrails and human oversight. To test models against these challenges, a synthetic dataset was created that evaluates AI models in realistic insurance scenarios and captures the complexity of user interactions. The Agent Leaderboard v2 ranks AI models on metrics such as action completion and tool selection quality, showing how they perform in the insurance sector and where top models such as Qwen-235b and GPT-4.1 are strong or weak in complex insurance scenarios. Strategic recommendations include choosing models according to task complexity, implementing error handling, and balancing cost against latency to maximize the value of AI agents in insurance.
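
The selection advice above (match the model to task complexity, stay within cost and latency limits, and handle the failure case) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ModelProfile` fields, the `pick_model` helper, the placeholder model names, and all numbers are hypothetical and are not leaderboard results. Only the two metric names, action completion and tool selection quality, come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model summary; fields other than the two metrics are assumed."""
    name: str
    action_completion: float       # share of tasks fully completed, 0..1
    tool_selection_quality: float  # share of correct tool choices, 0..1
    cost_per_session: float        # assumed unit: USD per session
    avg_latency_s: float           # assumed unit: seconds per session


def pick_model(
    candidates: List[ModelProfile],
    complex_task: bool,
    max_cost: float,
    max_latency_s: float,
) -> Optional[ModelProfile]:
    """Return a model for the task, or None if no candidate fits the budget."""
    # Keep only models that satisfy both the cost and the latency limit.
    within_budget = [
        m for m in candidates
        if m.cost_per_session <= max_cost and m.avg_latency_s <= max_latency_s
    ]
    if not within_budget:
        # Error-handling path: the caller should escalate, e.g. to a human agent.
        return None
    if complex_task:
        # Complex tasks: prefer the highest action completion,
        # breaking ties on tool selection quality.
        return max(
            within_budget,
            key=lambda m: (m.action_completion, m.tool_selection_quality),
        )
    # Simple tasks: prefer the cheapest model, breaking ties on latency.
    return min(within_budget, key=lambda m: (m.cost_per_session, m.avg_latency_s))


# Placeholder profiles with made-up numbers, NOT real leaderboard data.
CANDIDATES = [
    ModelProfile("model_a", 0.70, 0.80, 0.05, 10.0),
    ModelProfile("model_b", 0.55, 0.85, 0.01, 4.0),
    ModelProfile("model_c", 0.75, 0.78, 0.30, 40.0),
]

choice = pick_model(CANDIDATES, complex_task=True, max_cost=0.10, max_latency_s=15.0)
print(choice.name if choice else "escalate to human review")
```

Returning `None` instead of raising keeps the escalation decision with the caller, which fits the emphasis on guardrails and human oversight; a real deployment would likely weigh more signals than the four fields assumed here.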