An experiment with attention.
Blog post from HuggingFace
This experiment tested whether a compressed context state could replace full attention in a language model, with a focus on preserving weak, parallel instructions over long sequences. Using a synthetic dataset, it compared a standard attention-based model against a model that carries a compressed memory state, across a range of context lengths. The attention-based model was both more accurate and faster, and the gap widened as context length increased. The compressed model was designed to retain early rules without explicitly classifying them, but it did not match attention, which indicates that a naïve compression approach is insufficient. The results underscore how robust full attention is on tasks that require retaining complex rules, and they point to the need for more refined strategies when designing efficient context mechanisms. Possible improvements include better preservation of weak constraints and an implementation optimized for parallel processing, rather than simply more compression.
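
To make the contrast concrete, here is a toy sketch of why an early rule can survive under full attention but fade in a fixed-size state. It is not the experiment's code: the exponential-moving-average compressor, the `decay` value, the two-dimensional vectors, and the names `full_attention`, `compressed_state`, and `rule_signal` are all assumptions chosen for illustration. Token 0 plays the early "rule", and every later token is filler.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values):
    """Softmax attention over every stored (key, value) pair."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def compressed_state(values, decay=0.9):
    """Fold the sequence into one fixed-size vector (assumed EMA compressor)."""
    state = [0.0] * len(values[0])
    for v in values:
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, v)]
    return state


def rule_signal(length):
    """Return how much of the early rule each mechanism still exposes."""
    # Token 0 is the rule (axis 0); all later tokens are filler (axis 1).
    keys = [[1.0, 0.0]] + [[0.0, 1.0] for _ in range(length - 1)]
    values = [[1.0, 0.0]] + [[0.0, 1.0] for _ in range(length - 1)]
    query = [8.0, 0.0]  # hypothetical query that asks for the rule
    attended = full_attention(query, keys, values)
    compressed = compressed_state(values)
    return attended[0], compressed[0]


if __name__ == "__main__":
    for length in (8, 64, 512):
        attn, comp = rule_signal(length)
        print(f"length={length:4d}  attention={attn:.3f}  compressed={comp:.3e}")
```

In this toy setup the rule's share of the compressed state shrinks geometrically with every filler token, while attention can still place most of its weight on the rule because the key for token 0 is kept verbatim. A real compressed-memory model would learn what to keep, so this only illustrates the failure mode of a naïve compressor, not the behavior of the model in the post.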