Introducing ARFBench: A time series question-answering benchmark based on real incidents
Blog post from Datadog
The Anomaly Reasoning Framework Benchmark (ARFBench) is a newly introduced benchmark for time series question-answering (TSQA), derived from real internal incidents at Datadog and the telemetry data they produced. ARFBench evaluates how well AI models, including large language models (LLMs), vision-language models (VLMs), and time series foundation models (TSFMs), can diagnose system anomalies by analyzing observability metrics.

The results show substantial room for improvement in current models. A novel hybrid TSFM-VLM model, Toto-1.0-QA-Experimental, shows promise, achieving high accuracy and F1 scores while also delivering efficiency gains. The benchmark is organized into three tiers of increasing difficulty, emphasizing compositional reasoning and the integration of context across data modalities.

When model predictions are combined with human expertise, the pairing establishes a new superhuman frontier on the benchmark, highlighting the complementary strengths of models and experts. ARFBench is positioned as a significant step toward end-to-end agentic systems for incident response, with resources available on Hugging Face and GitHub for further exploration and development.
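The post reports accuracy and F1 as headline metrics but does not spell out how they are computed. As a minimal sketch, assuming the benchmark's answers are categorical labels (the function name, label strings, and macro-averaging choice below are illustrative, not taken from ARFBench):

```python
def evaluate_tsqa(predictions, labels):
    """Accuracy and macro-F1 over categorical QA answers.

    Macro-F1 computes per-class precision/recall and averages the
    resulting F1 scores, weighting every answer class equally.
    """
    assert len(predictions) == len(labels) and labels
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

    classes = sorted(set(labels) | set(predictions))
    f1_scores = []
    for c in classes:
        tp = sum(p == c and y == c for p, y in zip(predictions, labels))
        fp = sum(p == c and y != c for p, y in zip(predictions, labels))
        fn = sum(p != c and y == c for p, y in zip(predictions, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return accuracy, sum(f1_scores) / len(f1_scores)

# Hypothetical anomaly-diagnosis answers for four questions.
preds = ["spike", "drop", "spike", "flat"]
golds = ["spike", "drop", "drop", "flat"]
acc, macro_f1 = evaluate_tsqa(preds, golds)  # acc = 0.75
```

Macro averaging is one reasonable choice here because incident-diagnosis answer classes are typically imbalanced, and it prevents a frequent "no anomaly" class from dominating the score.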