Datadog's LLM Observability offers a comprehensive solution for evaluating the quality of Large Language Model (LLM) applications, closing the gap between operational metrics and qualitative assessments such as factual accuracy, safety, and tone. While many teams measure speed and cost, few assess response quality, leaving a significant observability shortfall. Datadog addresses this by tracing each request from prompt to response and providing built-in evaluations for common issues such as hallucinations and toxicity.

The platform also introduces custom LLM-as-a-judge evaluations, which let teams define domain-specific quality standards using supported LLM providers such as OpenAI and Anthropic. These evaluations run automatically, scoring responses in real time and feeding into existing dashboards so teams can track trends, set monitors, and debug failures. Teams can tailor evaluations to specific applications, such as financial chatbots or medical assistants, and iterate on improvements based on real-world data. By combining qualitative insights with operational data in a unified framework, Datadog's approach helps teams deploy reliable LLM applications faster.
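
To make the LLM-as-a-judge pattern concrete, here is a minimal sketch of the kind of domain-specific check a team might define: a second model grades a chatbot answer for factual accuracy and returns a numeric score. The judge prompt, model choice, and `judge_factual_accuracy` helper below are illustrative assumptions, not Datadog's implementation; in Datadog's managed flow the equivalent judging is configured in the platform and runs automatically against traced responses, with the resulting scores surfacing on dashboards and monitors.

```python
# A minimal LLM-as-a-judge sketch: a second model grades a response for
# factual accuracy on a 0.0-1.0 scale. The prompt wording, model name, and
# score parsing are illustrative assumptions, not Datadog's internals.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a financial chatbot's answer.
Question: {question}
Answer: {answer}
Rate the factual accuracy of the answer from 0.0 (entirely wrong)
to 1.0 (fully accurate). Reply with only the number."""


def judge_factual_accuracy(question: str, answer: str) -> float:
    """Ask a judge model to score a response; returns a score in [0.0, 1.0]."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep grading as deterministic as possible
    )
    raw = completion.choices[0].message.content.strip()
    try:
        return max(0.0, min(1.0, float(raw)))
    except ValueError:
        return 0.0  # treat unparseable judge output as a failed check


if __name__ == "__main__":
    score = judge_factual_accuracy(
        "Who sets the US federal funds rate target?",
        "The target is set by the Federal Reserve's Federal Open Market Committee.",
    )
    print(f"factual accuracy score: {score:.2f}")
```

A numeric score like this is the shape of signal the platform tracks over time: once attached to a traced request, it can drive trend charts, threshold-based monitors, and drill-down debugging alongside latency and cost metrics.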