Tool Calling Quality, pt 1: Aligning LLM Judges with Human Labels across 450 Tool Call Trajectories
Blog post from Credal
The project evaluated the quality of agent runs by combining human assessments with those of a council of LLMs, aiming to identify when agents used tools effectively and completed tasks without unnecessary iterations or errors. Initially, human evaluators and the council agreed 60% of the time. That was too low to replace human judgment, but the discrepancies turned out to be systematic rather than random: disagreements were often due to rubric ambiguities, severity-threshold mismatches, or known failure modes. Once these issues were identified, explicit rules were written to guide an AI adjudicator in resolving disagreements, which improved alignment between council and human evaluations, especially in areas like tool policy. The process showed that LLMs can provide valuable insights but need calibration against human judgments to be reliable. The calibrated dataset will be used to develop a multi-head encoder that evaluates and predicts tool-call quality, with the aim of improving agent performance monitoring.
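To make the agreement figure concrete, here is a minimal sketch of how agreement between human labels and a council's majority vote could be measured per rubric dimension. It is not Credal's implementation: the record layout, the `pass`/`fail` labels, the dimension names, and the helpers `council_vote` and `agreement_by_dimension` are all assumptions made for illustration, and the sample records are toy data.

```python
from collections import Counter, defaultdict


def council_vote(votes):
    """Return the majority label among council votes, or None on a tie or empty input."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no clear council verdict
    return counts[0][0]


def agreement_by_dimension(records):
    """Fraction of records, per rubric dimension, where the council matches the human label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["dimension"]] += 1
        if council_vote(rec["council"]) == rec["human"]:
            hits[rec["dimension"]] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Toy records (hypothetical): one human label and three council votes each.
records = [
    {"dimension": "tool_policy", "human": "fail", "council": ["pass", "pass", "fail"]},
    {"dimension": "tool_policy", "human": "pass", "council": ["pass", "pass", "pass"]},
    {"dimension": "efficiency", "human": "pass", "council": ["pass", "fail", "pass"]},
]

for dimension, rate in agreement_by_dimension(records).items():
    print(f"{dimension}: {rate:.0%} agreement")
```

Breaking agreement down by dimension, rather than reporting a single overall number, is what lets systematic gaps (for example, in tool policy) stand out from random noise.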