Datadog's LLM Observability offers a comprehensive solution for evaluating the quality of Large Language Model (LLM) applications, closing the gap between operational metrics and qualitative assessments such as factual accuracy, safety, and tone. While many teams measure speed and cost, few assess response quality, leaving a significant observability shortfall. Datadog addresses this by tracing each request from prompt to response and providing built-in evaluations for common issues such as hallucinations and toxicity.

The platform also introduces custom LLM-as-a-judge evaluations, which let teams define domain-specific quality standards using supported LLM providers such as OpenAI and Anthropic. These evaluations run automatically, scoring responses in real time and feeding into existing dashboards so teams can track trends, set monitors, and debug failures. Teams can tailor evaluations to specific applications, such as financial chatbots or medical assistants, and iterate on improvements based on real-world data. By combining qualitative insights with operational data in a unified framework, Datadog's approach helps teams deploy reliable LLM applications faster.
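
To make the LLM-as-a-judge pattern concrete, here is a minimal sketch of the kind of domain-specific check a team might define: a second model grades a chatbot answer for factual accuracy and returns a numeric score. The judge prompt, model choice, and `judge_factual_accuracy` helper below are illustrative assumptions, not Datadog's implementation; in Datadog's managed flow the equivalent judging is configured in the platform and runs automatically against traced responses, with the resulting scores surfacing on dashboards and monitors.

```python
# A minimal LLM-as-a-judge sketch: a second model grades a response for
# factual accuracy on a 0.0-1.0 scale. The prompt wording, model name, and
# score parsing are illustrative assumptions, not Datadog's internals.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a financial chatbot's answer.
Question: {question}
Answer: {answer}
Rate the factual accuracy of the answer from 0.0 (entirely wrong)
to 1.0 (fully accurate). Reply with only the number."""


def judge_factual_accuracy(question: str, answer: str) -> float:
    """Ask a judge model to score a response; returns a score in [0.0, 1.0]."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep grading as deterministic as possible
    )
    raw = completion.choices[0].message.content.strip()
    try:
        return max(0.0, min(1.0, float(raw)))
    except ValueError:
        return 0.0  # treat unparseable judge output as a failed check


if __name__ == "__main__":
    score = judge_factual_accuracy(
        "Who sets the US federal funds rate target?",
        "The target is set by the Federal Reserve's Federal Open Market Committee.",
    )
    print(f"factual accuracy score: {score:.2f}")
```

A numeric score like this is the shape of signal the platform tracks over time: once attached to a traced request, it can drive trend charts, threshold-based monitors, and drill-down debugging alongside latency and cost metrics.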