Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
Blog post from HuggingFace
"Proof of Time" (PoT) is a benchmarking framework for evaluating scientific idea judgments by linking them to downstream signals that only become observable later, such as citation counts and peer-review awards. It addresses the limitations of traditional peer review, which can be slow and inconsistent, by freezing a snapshot of evidence at a time cutoff and asking models to predict future outcomes. PoT runs in an offline sandbox so that any improvement in model performance comes from better reasoning over the available evidence rather than from access to real-time information.

The framework covers several task families, including impact prediction, peer-review awards, research evolution, and technological frontier, each scored against a distinct verifiable signal. It evaluates models from major AI providers and compares solver configurations, such as zero-shot and agentic approaches, to study how test-time compute affects performance.

Because scoring relies only on post-cutoff outcomes, models cannot succeed through training-data recall alone, underscoring the value of objective, scalable evaluation methods for assessing "fuzzy" concepts like idea quality.
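The time-cutoff protocol described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not PoT's actual code: the `Paper` class, the solver signature, and the absolute-error scoring are all assumptions chosen to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    # Hypothetical record: a title plus cumulative citation counts over time.
    title: str
    citations_by_date: dict  # date -> cumulative citation count

def evidence_snapshot(paper: Paper, cutoff: date) -> dict:
    """Freeze only the signals observable on or before the cutoff."""
    return {d: c for d, c in paper.citations_by_date.items() if d <= cutoff}

def evaluate(papers: list, cutoff: date, solver) -> float:
    """Ask a solver to predict post-cutoff impact from pre-cutoff evidence only,
    then score against the outcome that actually materialized after the cutoff."""
    total_error = 0.0
    for p in papers:
        snapshot = evidence_snapshot(p, cutoff)
        predicted = solver(p.title, snapshot)
        # Ground truth: the latest citation count observed after the cutoff.
        post = {d: c for d, c in p.citations_by_date.items() if d > cutoff}
        actual = post[max(post)]
        total_error += abs(predicted - actual)  # simple absolute-error score
    return total_error / len(papers)

# A naive baseline solver: predict that the last pre-cutoff count will hold.
naive_solver = lambda title, snapshot: snapshot[max(snapshot)]
```

Because the solver never sees post-cutoff data, a better score must come from better use of the frozen evidence, which is the core guarantee the offline sandbox is meant to provide.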