The developers of a tool for benchmarking Large Language Models (LLMs) have released four new test environments that assess how effectively an LLM can use tools to accomplish tasks. The environments target common agentic workflows: planning and task decomposition, function calling, and overriding pre-trained biases when necessary. The results show that while some models handle certain tasks well, others struggle with even simple ones, underscoring the difficulty of building effective agents on top of LLMs. Seven models were evaluated, including GPT-4 and Claude 2.1, and the evaluation also found that service reliability is an important consideration when deploying LLM-based agents. The developers hope these results will make it easier for others to test different LLMs and prompting strategies and to identify what enables the best agentic behavior.
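To make the function-calling style of evaluation concrete, the sketch below shows one way such a test environment could be structured: a task prompt, a registry of allowed tools, and a check of whether the model picked the right tool and produced the right answer. This is a minimal illustration only; the `call_model` stub, the `multiply` tool, and the scoring fields are assumptions for demonstration and are not the benchmark's actual environments or API.

```python
# Sketch of a function-calling check: the harness defines a tool, asks a
# model to choose a call, and scores the invocation against expectations.
# `call_model`, the `multiply` tool, and the scoring keys are illustrative
# assumptions, not part of the released benchmark.

import json

# Tool registry the "agent" is allowed to use.
TOOLS = {
    "multiply": lambda a, b: a * b,
}


def call_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a JSON-encoded tool invocation.

    A real harness would query GPT-4, Claude 2.1, etc. and parse the
    model's chosen function call from its response.
    """
    # Hard-coded response so the sketch runs without any API keys.
    return json.dumps({"tool": "multiply", "args": {"a": 6, "b": 7}})


def run_task(prompt: str, expected_tool: str, expected_result):
    """Execute one tool-use task and report whether the model used the
    expected tool and whether executing its call yields the right answer."""
    invocation = json.loads(call_model(prompt))
    result = TOOLS[invocation["tool"]](**invocation["args"])
    return {
        "correct_tool": invocation["tool"] == expected_tool,
        "correct_result": result == expected_result,
    }


if __name__ == "__main__":
    score = run_task("What is 6 times 7?", expected_tool="multiply", expected_result=42)
    print(score)  # {'correct_tool': True, 'correct_result': True}
```

Running many such tasks per model, and logging failed or malformed calls alongside scores, is also how service reliability issues of the kind mentioned above would surface in practice.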