
Together Evaluations: Benchmark Models for Your Tasks

Blog post from Together AI

Post Details
Company: Together AI
Authors: Ivan Provilkov, Ruslan Khaidurov, Kirah Sapong, George Grigorev, Gleb Vazhenin, Yogish Baliga, Zain Hasan, Max Ryabinin
Word Count: 1,176
Language: English
Summary

In the rapidly evolving field of large language models (LLMs), assessing a model's performance on specific tasks is essential, and Together Evaluations provides a structured way to benchmark models using LLMs as judges. Users define task-specific benchmarks and employ leading open-source models to judge responses, avoiding both manual labeling and rigid, predefined metrics. Together Evaluations supports three evaluation modes (classify, score, and compare), each customizable via prompt templates, so users can tailor the judging criteria to their specific needs. This lets developers identify the best models or prompts, monitor model quality over time, and detect data drift, with support for serverless inference and the ability to upload pre-existing datasets for evaluation. By offering these tools alongside demonstrations, Together aims to streamline the development of LLM-driven applications and invites users to explore the platform through interactive resources and a webinar.
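To make the three modes concrete, here is a minimal sketch of the LLM-as-judge pattern the post describes: each mode fills a customizable prompt template and asks a judge model for a verdict. The template wording, the `evaluate` helper, and the `toy_judge` stub are illustrative assumptions for this sketch, not Together's actual API; in practice the judge would be a call to a hosted open-source model.

```python
from typing import Callable

# One prompt template per evaluation mode; a real setup would let the
# user edit these to encode task-specific judging criteria.
TEMPLATES = {
    "classify": "Task: {task}\nResponse: {response}\nAnswer 'pass' or 'fail'.",
    "score": "Task: {task}\nResponse: {response}\nRate 1-10.",
    "compare": "Task: {task}\nA: {a}\nB: {b}\nWhich is better, 'A' or 'B'?",
}

def evaluate(mode: str, judge: Callable[[str], str], **fields) -> str:
    """Fill the chosen mode's template and ask the judge for a verdict."""
    prompt = TEMPLATES[mode].format(**fields)
    return judge(prompt).strip().lower()

def toy_judge(prompt: str) -> str:
    # Deterministic stand-in for an LLM judge call: in "compare" mode it
    # naively prefers the longer response; real judges apply the rubric.
    if "Which is better" in prompt:
        a = prompt.split("A: ")[1].split("\nB: ")[0]
        b = prompt.split("B: ")[1].split("\nWhich")[0]
        return "A" if len(a) >= len(b) else "B"
    if "'pass' or 'fail'" in prompt:
        return "pass"
    return "7"

verdict = evaluate("compare", toy_judge,
                   task="summarize the report",
                   a="short", b="a much more detailed answer")
print(verdict)  # "b" under this toy length heuristic
```

The same `evaluate` entry point serves all three modes; only the template and the fields change, which mirrors how a single prompt-templated judge can cover classification, scoring, and pairwise comparison.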