Content Deep Dive

Introducing ARFBench: A time series question-answering benchmark based on real incidents

Blog post from Datadog

Post Details
Company: Datadog
Date Published:
Authors: Othmane Abou-Amal, Ben Cohen, Ameet Talwalkar, Stephan Xie
Word Count: 2,015
Company Posts That Month: 33
Language: English
Hacker News Points: -
Summary

The Anomaly Reasoning Framework Benchmark (ARFBench) is a new benchmark for time series question answering (TSQA), built from real internal incidents at Datadog using the company's own telemetry data. It evaluates how well AI models, including large language models (LLMs), vision-language models (VLMs), and time series foundation models (TSFMs), diagnose system anomalies from observability metrics. The benchmark is organized into three tiers of increasing difficulty that emphasize compositional reasoning and the integration of context across data modalities. The results show substantial room for improvement in current models. The post also introduces Toto-1.0-QA-Experimental, a hybrid TSFM-VLM model that achieves high accuracy and F1 scores while offering efficiency gains. Combining model predictions with human expertise sets a new superhuman frontier on the benchmark, which points to complementary strengths between models and experts. The authors position ARFBench as a significant step toward end-to-end agentic systems for incident response, and related resources are available on platforms such as Hugging Face and GitHub.

Trends Found in this Post
Trend                   Post Mentions  Total Month Mentions  Posts  Companies  MoM
LLM                     12             5,932                 1,046  223        -2%
AI Model Fine-tuning    5              420                   130    55         -54%
Observability           5              4,496                 812    176        +40%
Real-time               1              6,296                 1,346  246        -2%
Reinforcement learning  1              104                   49     23         -14%