Evaluating Context Compression for AI Agents
Blog post from Factory
Factory developed an evaluation framework to measure how well AI agents retain context during long-running tasks. Its results indicate that structured summarization preserves more useful information than the compression strategies from OpenAI and Anthropic, without sacrificing compression efficiency. The study tested three approaches (Factory's anchored iterative summarization, OpenAI's compact endpoint, and Anthropic's Claude SDK) across real-world tasks such as debugging and feature implementation. The framework uses probe-based evaluation to score functional quality on six dimensions: accuracy, context awareness, artifact trail, completeness, continuity, and instruction following. Factory's approach, which maintains a structured summary with sections dedicated to specific types of information, scored higher at preserving technical details and maintaining context. It was particularly strong on continuity and accuracy, both crucial for software development tasks, though all three methods struggled with artifact tracking. The findings suggest that agents should be optimized for the total tokens required to complete a task, rather than for the compression ratio alone.
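
To make the two measurements concrete, here is a minimal Python sketch. It is not Factory's implementation: the equal weighting of the six dimensions, the `Run` fields, and every number in the usage example are illustrative assumptions, not figures from the study.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# The six dimensions named in the post.
DIMENSIONS = (
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
)


def overall_score(probe_scores: dict[str, list[float]]) -> float:
    """Average the per-dimension means into one score.

    Equal weighting is an assumption made for this sketch; the post
    does not say how (or whether) dimensions are combined.
    """
    missing = [d for d in DIMENSIONS if d not in probe_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(mean(probe_scores[d]) for d in DIMENSIONS)


@dataclass
class Run:
    """Hypothetical token accounting for one compressed session."""

    original_tokens: int      # context size before compression
    compressed_tokens: int    # context size after compression
    continuation_tokens: int  # tokens spent finishing the task afterwards,
                              # including re-reading anything the summary lost

    @property
    def compression_ratio(self) -> float:
        # Fraction of the original context that was removed.
        return 1 - self.compressed_tokens / self.original_tokens

    @property
    def total_tokens(self) -> int:
        # What the task costs from the point of compression onward.
        return self.compressed_tokens + self.continuation_tokens


# Made-up numbers, chosen only to show how the two metrics can disagree.
aggressive = Run(original_tokens=100_000, compressed_tokens=1_000,
                 continuation_tokens=30_000)
structured = Run(original_tokens=100_000, compressed_tokens=2_000,
                 continuation_tokens=12_000)
```

In this invented example the aggressive run has the better compression ratio, yet the structured run finishes the task with fewer total tokens because it spends less effort recovering lost details. That is the trade-off the post's conclusion points at.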