Together Evaluations: Benchmark Models for Your Tasks
Blog post from Together AI
Assessing how well a large language model (LLM) performs on a specific task is essential, and Together Evaluations provides a structured way to benchmark models using LLMs as judges. Instead of relying on manual labeling or rigid metrics, you define a task-specific benchmark and let a leading open-source judge model grade or compare responses, which keeps evaluation fast and flexible. Three evaluation modes are supported: classify, score, and compare. Each mode is customizable through prompt templates, so the judging criteria can be tailored to the task at hand.

With these modes, developers can identify the best model or prompt for a task, monitor model quality, and manage data drift. The platform supports serverless inference and lets you upload pre-existing datasets for evaluation. Together also provides demonstrations, interactive resources, and a webinar to help teams streamline development of LLM-driven applications.
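To make the "compare" mode concrete, here is a minimal sketch of LLM-as-judge comparison built directly on Together's OpenAI-compatible chat completions endpoint. It is not the Evaluations product's own API: the prompt template wording, the choice of judge model, and the toy dataset are illustrative assumptions.

```python
import os
import requests

TOGETHER_API_URL = "https://api.together.xyz/v1/chat/completions"
JUDGE_MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # example judge; any strong open model works

# A "compare"-style prompt template: the judge sees the task and two candidate
# answers and must pick the better one. Wording is illustrative, not the
# product's built-in template.
COMPARE_TEMPLATE = """You are an impartial judge. Given the task and two candidate answers,
reply with exactly one letter: "A" if answer A is better, "B" if answer B is better.

Task:
{prompt}

Answer A:
{answer_a}

Answer B:
{answer_b}
"""


def judge_compare(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which candidate answer is better. Returns 'A' or 'B'."""
    payload = {
        "model": JUDGE_MODEL,
        "messages": [
            {
                "role": "user",
                "content": COMPARE_TEMPLATE.format(
                    prompt=prompt, answer_a=answer_a, answer_b=answer_b
                ),
            }
        ],
        "max_tokens": 4,
        "temperature": 0.0,  # deterministic verdicts
    }
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    resp = requests.post(TOGETHER_API_URL, json=payload, headers=headers, timeout=60)
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"].strip().upper()
    return "A" if verdict.startswith("A") else "B"


if __name__ == "__main__":
    wins = {"A": 0, "B": 0}
    dataset = [  # in practice, load your own evaluation dataset here
        {
            "prompt": "Summarize: The mitochondria is the powerhouse of the cell.",
            "answer_a": "Mitochondria produce the cell's energy.",
            "answer_b": "Cells have parts.",
        },
    ]
    for row in dataset:
        wins[judge_compare(**row)] += 1
    print(f"Model A wins: {wins['A']}, Model B wins: {wins['B']}")
```

The classify and score modes follow the same pattern: swap the template so the judge emits a class label or a numeric rating instead of an A/B verdict, then aggregate the results across your dataset.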