Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation
Blog post from Together AI
Speculative decoding accelerates token generation by pairing a small draft model with a larger target model: the draft proposes the next few tokens, and the target verifies them. AutoJudge extends this method with an automated way to identify and accept "unimportant" mismatches, that is, positions where the draft token differs from the target's choice without affecting the correctness of the final output. It removes the need for human labeling by training a small classifier on existing embeddings to predict whether a given mismatch is important. Because more tokens are accepted per verification cycle, AutoJudge improves inference speed on tasks such as mathematical reasoning and programming, with minimal accuracy loss. It plugs into existing speculative decoding frameworks, and its throughput gains are largest in bandwidth-limited scenarios. The speedup depends on the task and on how often unimportant mismatches occur, so the classifier's acceptance threshold may need tuning to get the best results.
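To make the acceptance rule concrete, here is a minimal Python sketch of one verification cycle under greedy decoding. It is an illustration under stated assumptions, not the AutoJudge implementation: the function name `verify_with_judge`, the `importance_scores` list (standing in for the classifier's per-position output), and the default threshold of 0.5 are all hypothetical. The sketch also omits the extra "bonus" token that speculative decoding usually appends when every draft token is accepted.

```python
from typing import List, Sequence


def verify_with_judge(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
    importance_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Return the tokens kept from one speculative cycle (greedy setting).

    target_tokens[i] is the target model's own choice at position i, computed
    with the draft tokens before position i as context.
    importance_scores[i] is a hypothetical classifier's estimate that a
    mismatch at position i would change the correctness of the final output.
    """
    accepted: List[int] = []
    for draft, target, score in zip(draft_tokens, target_tokens, importance_scores):
        if draft == target:
            # Exact match: standard speculative decoding accepts this token.
            accepted.append(draft)
        elif score < threshold:
            # Mismatch judged unimportant: keep the draft token and continue.
            accepted.append(draft)
        else:
            # Mismatch judged important: take the target's token and stop,
            # since later draft tokens were built on a rejected prefix.
            accepted.append(target)
            break
    return accepted


draft = [5, 8, 3, 9]
target = [5, 7, 3, 2]
scores = [0.0, 0.1, 0.0, 0.9]  # made-up classifier outputs for illustration

# threshold=0.0 never accepts a mismatch, which recovers the standard rule.
print(verify_with_judge(draft, target, scores, threshold=0.0))  # [5, 7]
print(verify_with_judge(draft, target, scores, threshold=0.5))  # [5, 8, 3, 2]
```

In this toy example the standard rule stops at the first mismatch and keeps two tokens, while the judged rule keeps four. Raising the threshold accepts more mismatches per cycle at a higher risk of changing the output, which is the trade-off that threshold tuning would have to balance.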