Designing Efficient Verifiers for Legal Agents
Blog post from LangChain
Harvey has released LAB, an open-source benchmark for evaluating agents on complex legal work. Verifying the correctness of legal agents is hard because the domain is complex and highly specific. LAB approaches verification the way a human reviewer would: LLM judges assess each task against a set of predefined criteria. The post explores two ways to make verification more efficient: reducing token usage by running verifiers in batch mode, and using cheaper models. In experiments across several legal practice areas, batch verification was significantly more cost-effective, although it tended to produce lower match rates than per-criterion verification. Models such as DeepSeek offer a cost-efficient alternative to frontier models such as Opus despite discrepancies in label agreement, and targeted prompt tuning reduced false-pass rates. The post argues that open models can make evaluations more affordable and scalable, challenges the reliance on closed frontier models, and calls for further research into fine-tuning verifiers for specific domains.
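To make the trade-off concrete, here is a minimal sketch of the two quantities discussed above: how closely a cheaper or batched verifier agrees with a reference verifier, and why batching saves input tokens. It is not taken from LAB. The names `compare_verdicts` and `prompt_tokens` are hypothetical, the false-pass rate is assumed to mean the share of criteria the reference verifier fails that the candidate verifier passes, and the token model assumes that per-criterion mode resends the full task output with every criterion while batch mode sends it once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agreement:
    match_rate: float       # share of criteria where both verifiers agree
    false_pass_rate: float  # share of reference fails the candidate passed


def compare_verdicts(reference: list[bool], candidate: list[bool]) -> Agreement:
    """Compare pass/fail labels per criterion (True means pass).

    Assumption: `reference` comes from the trusted setup (for example a
    frontier model judging one criterion per call) and `candidate` from the
    cheaper setup (batch mode or a cheaper model).
    """
    if len(reference) != len(candidate):
        raise ValueError("verdict lists must be the same length")
    if not reference:
        raise ValueError("need at least one criterion")

    pairs = list(zip(reference, candidate))
    matches = sum(ref == cand for ref, cand in pairs)

    # Candidate labels on the criteria that the reference verifier failed.
    on_ref_fails = [cand for ref, cand in pairs if not ref]
    false_pass = sum(on_ref_fails) / len(on_ref_fails) if on_ref_fails else 0.0

    return Agreement(matches / len(pairs), false_pass)


def prompt_tokens(task_tokens: int, criterion_tokens: list[int], batch: bool) -> int:
    """Rough input-token count for verifying one task (assumed cost model)."""
    if batch:
        # One call: the task output is sent once with all criteria.
        return task_tokens + sum(criterion_tokens)
    # One call per criterion: the task output is resent every time.
    return sum(task_tokens + c for c in criterion_tokens)


if __name__ == "__main__":
    # Illustrative labels only, not benchmark results.
    result = compare_verdicts(
        reference=[True, False, False, True],
        candidate=[True, True, False, True],
    )
    print(result)
    print(prompt_tokens(1000, [50, 50, 50], batch=True))
    print(prompt_tokens(1000, [50, 50, 50], batch=False))
```

Under this cost model, the saving from batching grows with the number of criteria and with the length of the agent output being judged, which is why a lower match rate can still be an acceptable price for some evaluations.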