Kwindla Hultman Kramer offers advice for developing voice AI agents, emphasizing that latency and instruction-following accuracy should drive architectural decisions. He suggests starting with a proven tech stack, deploying progressively to real users, and optimizing cost and performance along the way. Kramer recommends targeting a median voice-to-voice latency of 800 ms and starting with models such as GPT-4o or Gemini 2.5 Flash for effective function calling. The article discusses the need for detailed prompts and context engineering to maintain instruction-following accuracy across multi-turn conversations. Kramer also underscores the importance of robust tooling for monitoring and debugging, the use of async function calls for efficiency, and the need for a reliable framework with good network transport, echo cancellation, and context management. He advises against starting with experimental models and stresses adopting widely used tools before attempting optimization.
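
The 800 ms median target can be checked against logged turn timings. The sketch below is only an illustration and does not come from the article. It assumes voice-to-voice latency is measured from the moment the user stops speaking to the moment the agent's first audio plays. The function names and the sample timestamps are hypothetical.

```python
from statistics import median

# Target taken from the summary above; everything else here is illustrative.
TARGET_MEDIAN_MS = 800.0


def voice_to_voice_ms(user_stopped_at: float, agent_started_at: float) -> float:
    """Milliseconds between end of user speech and start of agent audio.

    Both arguments are timestamps in seconds (an assumed logging format).
    """
    return (agent_started_at - user_stopped_at) * 1000.0


def meets_target(latencies_ms: list[float], target_ms: float = TARGET_MEDIAN_MS) -> bool:
    """Return True when the median latency is at or below the target."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    return median(latencies_ms) <= target_ms


# Hypothetical (user_stopped, agent_started) timestamps for five turns.
turns = [(0.0, 0.65), (5.0, 5.9), (10.0, 10.72), (15.0, 16.4), (20.0, 20.78)]
samples = [voice_to_voice_ms(stop, start) for stop, start in turns]

if __name__ == "__main__":
    print(f"median latency: {median(samples):.0f} ms")
    print("within target" if meets_target(samples) else "over target")
```

Using the median rather than the mean keeps a single slow turn (the 1.4 s outlier in the sample data) from dominating the result. A production setup would likely track higher percentiles as well.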