DeepL has significantly enhanced its Large Language Models (LLMs) by moving from BF16 to FP8 precision on NVIDIA H100 Tensor Core GPUs. Because FP8 uses fewer bits than BF16, it raises computational throughput and reduces memory demands; despite FP8's narrower dynamic range and lower precision, DeepL achieved this without compromising training quality. The transition accelerated training by 50% and enabled larger models that improve translation quality by a factor of 1.4 for European languages and 1.7 for complex pairs such as English and Japanese, all while keeping inference latency consistent. By using NVIDIA's Transformer Engine for mixed-precision training and TensorRT-LLM for inference, DeepL has effectively doubled the throughput of its LLMs, allowing them to handle more requests and deliver an optimal user experience. This marks a substantial leap in scaling DeepL's Language AI capabilities, with FP4 tensor operations a prospective next step.
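To make the training-side technique concrete, the sketch below shows the general FP8 mixed-precision pattern using NVIDIA Transformer Engine's PyTorch API: FP8 matrix multiplies inside an autocast region, with weights, gradients, and the loss kept in higher precision. This is a minimal illustration of the public Transformer Engine workflow, not DeepL's actual training code; the toy model, layer sizes, and recipe settings are assumptions, and it requires an FP8-capable GPU such as the H100.

```python
# Minimal FP8 mixed-precision training sketch with NVIDIA
# Transformer Engine (illustrative only, not DeepL's code).
# Requires an FP8-capable GPU, e.g. an H100.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID format uses E4M3 (more precision) for the forward pass
# and E5M2 (more range) for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
)

# Toy block built from Transformer Engine modules, which carry the
# FP8 scaling state that plain torch.nn layers lack. Sizes are
# arbitrary for illustration.
model = torch.nn.Sequential(
    te.Linear(1024, 4096, bias=True),
    torch.nn.GELU(),
    te.Linear(4096, 1024, bias=True),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# GEMMs inside te.* modules run in FP8 within this context; master
# weights, optimizer state, and the loss remain in higher precision,
# which is how training quality is preserved.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)

# Loss and backward pass are computed outside the FP8 region, per
# the library's recommended usage.
loss = torch.nn.functional.mse_loss(y, target)
loss.backward()
optimizer.step()
```

The HYBRID recipe reflects the usual FP8 trade-off: activations tolerate the reduced range of E4M3, while gradients, which vary over many orders of magnitude, get E5M2's wider exponent. The inference side, where TensorRT-LLM applies FP8 quantization to compiled engines, follows a separate build-and-deploy workflow not shown here.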