Company
Fireworks
Date Published
-
Author
-
Word count
963
Language
English
Hacker News points
None

Summary

Fireworks has introduced Quantization-Aware Fine-Tuning (QAT) for its DeepSeek R1 and V3 models, aiming to optimize these state-of-the-art open models for quality, latency, and cost through its FireOptimizer adaptation engine. Fine-tuning these models is challenging: accuracy can drop when training and serving numerics differ, the 671-billion-parameter scale demands substantial GPU memory, and DeepSeek V3's Mixture-of-Experts structure adds further complexity. QAT builds on LoRA and QLoRA techniques and preserves accuracy at reduced memory usage by simulating the inference setup through "fake quantization" of the merged weights and activations during training. The method is shown to provide an edge over naive FP8 LoRA tuning and can be used seamlessly on Fireworks for models such as Llama and DeepSeek V2, promising faster inference speeds without significant accuracy loss. The post also highlights training stability and closer alignment of evaluation metrics with inference numerics, showcasing the potential to extend QAT beyond DeepSeek to other bfloat16 models.
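
As a rough illustration of the "fake quantization" idea described above (a sketch, not Fireworks' actual implementation), the snippet below shows a LoRA-adapted linear layer whose forward pass merges the adapters into the frozen base weight, round-trips the merged weight and the activations through FP8 (e4m3), and uses a straight-through estimator so gradients still reach the LoRA parameters. The helper names (fake_quantize_fp8, QATLoRALinear) and the reliance on PyTorch >= 2.1 for the torch.float8_e4m3fn dtype are assumptions, not details from the article.

```python
# Minimal QAT-style LoRA sketch (assumed example, not Fireworks' code).
# Requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import torch


def fake_quantize_fp8(t: torch.Tensor) -> torch.Tensor:
    """Round-trip a tensor through FP8 (e4m3) with a per-tensor scale,
    so the forward pass sees numerics similar to FP8 serving."""
    with torch.no_grad():
        scale = t.abs().max().clamp(min=1e-12) / 448.0  # 448 = max normal e4m3 value
        t_q = (t / scale).to(torch.float8_e4m3fn)       # quantize
        return t_q.to(t.dtype) * scale                  # dequantize back to training dtype


class QATLoRALinear(torch.nn.Module):
    """Linear layer that merges LoRA adapters into the frozen base weight and
    fake-quantizes the merged weight and activations during the forward pass."""

    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # base weights stay frozen; only LoRA trains
            p.requires_grad_(False)
        self.lora_a = torch.nn.Parameter(
            torch.randn(rank, base.in_features, dtype=base.weight.dtype) * 0.01
        )
        self.lora_b = torch.nn.Parameter(
            torch.zeros(base.out_features, rank, dtype=base.weight.dtype)
        )
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        merged = self.base.weight + self.scaling * (self.lora_b @ self.lora_a)
        # Straight-through estimator: forward uses quantized values,
        # backward treats quantization as the identity.
        w_fq = merged + (fake_quantize_fp8(merged) - merged).detach()
        x_fq = x + (fake_quantize_fp8(x) - x).detach()
        return torch.nn.functional.linear(x_fq, w_fq, self.base.bias)


if __name__ == "__main__":
    layer = QATLoRALinear(torch.nn.Linear(64, 64, dtype=torch.bfloat16))
    out = layer(torch.randn(4, 64, dtype=torch.bfloat16))
    out.sum().backward()  # gradients reach only the LoRA parameters
    print(out.shape, layer.lora_a.grad is not None)
```

The key design choice illustrated here is that quantization is applied to the merged weight (base plus LoRA delta), so the numerics the model is trained against match what it will see after the adapters are merged and served in FP8.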