API-Bank: Benchmarking Language Models’ Tool Use

What's this blog post about?

Researchers have developed API-Bank, a benchmark for testing how well large language models (LLMs) use external tools such as APIs to accomplish tasks. It evaluates three abilities: deciding when to call an API, finding the right tool for the job, and employing multiple APIs to complete a task. GPT-4 outperforms GPT-3.5 Turbo on most of the tests, but both models struggle with tasks that require multiple rounds of interdependent API calls. The results show that external tools can make LLMs more efficient and useful, and they point to areas that still need improvement.

Company
Deepgram

Date published
Aug. 28, 2023

Author(s)
Brad Nikkel

Word count
2334

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.