Post Details

Company

Stripe

Date Published

March 2, 2026

Author

Carol Liang and Kevin Ho

Word Count

1,627

Language

English

Hacker News Points

-

Source URL

stripe.com/blog/can-ai-agents-build-real-stripe-integrations

Summary

State-of-the-art large language models (LLMs) can now solve most narrowly scoped coding problems, but a gap remains between that ability and autonomously managing entire software engineering projects, particularly in real-world settings that require complex integrations such as those with the Stripe API. A research team developed the Stripe integration benchmark to assess how well LLMs handle backend, full-stack, and Stripe feature-specific tasks, revealing both the strengths and the limitations of these models. Surprisingly, models such as Claude Opus 4.5 and OpenAI’s GPT-5.2 showed proficiency in full-stack tasks and API feature sets, respectively, but still struggled with ambiguous situations and browser-based tasks. The benchmark pushes the models’ capabilities with complex environments that mimic realistic software development scenarios, and it serves as a testing ground for improving agentic tools through iterative evaluation. The project highlights the continued need for rigorous testing and refinement of LLMs to make autonomous software integrations fully accurate, and it invites collaboration and feedback from the broader software community.