LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary
Blog post from Portkey
The paper introduces LLMLingua, a method for compressing prompts to large language models (LLMs) in order to speed up inference and reduce the associated token costs. The approach works coarse-to-fine: a budget controller first allocates different compression ratios to the components of a prompt (instruction, demonstrations, question), a token-level iterative compression algorithm then prunes low-information tokens, and an instruction-tuning-based alignment step brings the distribution of the small compression model closer to that of the target LLM.

Evaluated on four datasets from diverse domains (GSM8K, BBH, ShareGPT, and Arxiv-March23), LLMLingua achieves state-of-the-art performance, enabling up to 20x compression with minimal degradation in downstream quality. Because compression is performed by a separate small model, the method preserves the semantic integrity of prompts, works with black-box LLMs accessible only via API, and requires no gradient access to the target model, making it suitable for a wide range of LLM applications.
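To make the workflow concrete, here is a minimal usage sketch based on the open-source llmlingua package that accompanies the paper (https://github.com/microsoft/LLMLingua). The parameter names follow the package's documented interface at the time of writing, but exact signatures and defaults may vary across versions, and the example prompt strings are illustrative placeholders.

```python
# Minimal sketch: compressing a few-shot prompt with the llmlingua package.
# Exact parameter names and defaults may differ between package versions.
from llmlingua import PromptCompressor

# LLMLingua scores tokens with a small language model, so the target LLM
# is only ever called with the final compressed prompt; no gradient access
# to the black-box model is required.
compressor = PromptCompressor()

# Hypothetical few-shot demonstrations; the budget controller compresses
# these more aggressively than the instruction and the question.
demonstrations = [
    "Q: Natalia sold clips to 48 of her friends in April ... A: 72",
]

result = compressor.compress_prompt(
    demonstrations,
    instruction="Answer the math question step by step.",
    question="Q: A robe takes 2 bolts of blue fiber ...",
    target_token=200,  # overall token budget for the compressed prompt
)

# The result contains the compressed prompt plus compression statistics
# (e.g. original vs. compressed token counts and the achieved ratio).
print(result["compressed_prompt"])
```

The compressed prompt can then be sent to any API-accessible LLM in place of the original, trading a small amount of context for a large reduction in token cost and latency.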