
Key-Value Means: Transformers with Expandable Block-Compressed Memory

Blog post from Featherless

Post Details
Company
Date Published
Author
Featherless
Word Count: 2,418
Language: English
Hacker News Points: -
Summary

Key-Value Means (KVM) is an architecture that combines the fixed-cost inference of linear RNNs with the high-fidelity memory of full softmax attention. It keeps the familiar Transformer key-value cache but treats part of it as an expandable recurrent state, so the trade-off between memory and compute can be adjusted as context length grows. In this way KVM interpolates between a fixed-state linear RNN and full attention. State updates are managed with Block Sliding Window Attention (BSWA), so that tokens are represented in the state without redundancy. Just-in-time normalization and a winner-take-all merging strategy keep state rows distinct and usable. The state grows only when novel, non-redundant tokens are appended, which gives sublinear memory growth on long contexts. KVM performs well on long-context benchmarks, maintaining strong recall without expanding a full KV cache, and so offers a middle ground in memory management between traditional RNNs and Transformers.
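
The summary does not give KVM's exact update rule, so the following is only a toy sketch of the general idea: a state whose rows are means of merged key-value pairs, which grows when a token looks novel and otherwise merges the token into its single best-matching row. Everything specific here is an assumption made for illustration and not taken from the post: the class name `ToyKVMState`, the cosine-similarity novelty test, the `novelty_threshold` and `max_rows` parameters, and the reading of "just-in-time normalization" as storing running sums and dividing by the count only at read time. BSWA is omitted entirely.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


class ToyKVMState:
    """Hypothetical block-compressed memory; not the authors' implementation."""

    def __init__(self, novelty_threshold=0.9, max_rows=None):
        self.novelty_threshold = novelty_threshold  # assumed hyperparameter
        self.max_rows = max_rows                    # None means unbounded growth
        self.key_sums = []    # running sum of the keys merged into each row
        self.value_sums = []  # running sum of the values merged into each row
        self.counts = []      # number of tokens merged into each row

    def _means(self, sums):
        # Normalize only when the rows are needed (assumed "just-in-time" reading).
        return [[x / c for x in row] for row, c in zip(sums, self.counts)]

    def update(self, key, value):
        """Insert one (key, value) pair and return the index of the row it landed in."""
        best, best_sim = -1, -math.inf
        for i, mean_key in enumerate(self._means(self.key_sums)):
            sim = cosine(key, mean_key)
            if sim > best_sim:
                best, best_sim = i, sim

        can_grow = self.max_rows is None or len(self.counts) < self.max_rows
        if best < 0 or (best_sim < self.novelty_threshold and can_grow):
            # Novel token (or empty state): the state grows by one row.
            self.key_sums.append(list(key))
            self.value_sums.append(list(value))
            self.counts.append(1)
            return len(self.counts) - 1

        # Redundant token (or state is full): winner-take-all merge into one row.
        self.key_sums[best] = [s + x for s, x in zip(self.key_sums[best], key)]
        self.value_sums[best] = [s + x for s, x in zip(self.value_sums[best], value)]
        self.counts[best] += 1
        return best

    def read(self, query):
        """Softmax attention over the mean keys, returning a mix of mean values."""
        if not self.counts:
            raise ValueError("state is empty")
        keys = self._means(self.key_sums)
        values = self._means(self.value_sums)
        scale = math.sqrt(len(query))
        logits = [dot(query, k) / scale for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        dim = len(values[0])
        return [sum(w * v[j] for w, v in zip(weights, values)) / total
                for j in range(dim)]


state = ToyKVMState(novelty_threshold=0.9)
rows = [
    state.update([1.0, 0.0], [1.0, 0.0]),  # empty state: new row
    state.update([0.0, 1.0], [0.0, 1.0]),  # orthogonal key: new row
    state.update([1.0, 0.1], [3.0, 0.0]),  # near-duplicate key: merged
]
```

In this sketch the third token is merged into the first row instead of adding a third, so the number of rows tracks the number of distinct keys seen, not the number of tokens. Setting `max_rows` to a small constant gives a fixed-size state in the spirit of a linear RNN, while a threshold above 1 makes every token count as novel and reduces the state to an ordinary KV cache; those two extremes are the interpolation the summary describes.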