LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary
Blog post from Portkey
The paper introduces LLMLingua, a method for compressing prompts to large language models (LLMs) in order to speed up inference and reduce the associated token costs. The approach works coarse-to-fine: a budget controller first allocates different compression ratios to the components of a prompt (instruction, demonstrations, question), a token-level iterative compression algorithm then prunes low-information tokens, and an instruction-tuning-based alignment step brings the distribution of the small compression model closer to that of the target LLM.

Evaluated on four datasets from diverse domains (GSM8K, BBH, ShareGPT, and Arxiv-March23), LLMLingua achieves state-of-the-art performance, enabling up to 20x compression with minimal degradation in downstream quality. Because compression is performed by a separate small model, the method preserves the semantic integrity of prompts, works with black-box LLMs accessible only via API, and requires no gradient access to the target model, making it suitable for a wide range of LLM applications.
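To make the workflow concrete, here is a minimal usage sketch based on the open-source llmlingua package that accompanies the paper (https://github.com/microsoft/LLMLingua). The parameter names follow the package's documented interface at the time of writing, but exact signatures and defaults may vary across versions, and the example prompt strings are illustrative placeholders.

```python
# Minimal sketch: compressing a few-shot prompt with the llmlingua package.
# Exact parameter names and defaults may differ between package versions.
from llmlingua import PromptCompressor

# LLMLingua scores tokens with a small language model, so the target LLM
# is only ever called with the final compressed prompt; no gradient access
# to the black-box model is required.
compressor = PromptCompressor()

# Hypothetical few-shot demonstrations; the budget controller compresses
# these more aggressively than the instruction and the question.
demonstrations = [
    "Q: Natalia sold clips to 48 of her friends in April ... A: 72",
]

result = compressor.compress_prompt(
    demonstrations,
    instruction="Answer the math question step by step.",
    question="Q: A robe takes 2 bolts of blue fiber ...",
    target_token=200,  # overall token budget for the compressed prompt
)

# The result contains the compressed prompt plus compression statistics
# (e.g. original vs. compressed token counts and the achieved ratio).
print(result["compressed_prompt"])
```

The compressed prompt can then be sent to any API-accessible LLM in place of the original, trading a small amount of context for a large reduction in token cost and latency.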