DeepSeek v3 and R1 represent a significant advance in large language model architecture, building on the standard Transformer block while improving efficiency through innovations such as FP8 mixed-precision pre-training and a more aggressive Mixture of Experts (MoE) design. This version increases the model's capacity to store knowledge by expanding the number of routed experts and using load-balanced routing to prevent problems such as routing collapse. The shift to FP8 precision in training boosts compute efficiency and reduces memory usage, but it also makes numerical stability harder to maintain; the DeepSeek team addresses this with fine-granularity quantization (scaling small tiles or blocks independently) and a mixed-precision scheme that keeps sensitive operations in higher precision. By adopting a more aggressive MoE strategy, DeepSeek v3 reaches strong benchmark results at a lower training cost, suggesting that even larger models could be built with more experts without sacrificing efficiency. These architectural changes, together with customized data formats and per-block scaling that tracks dynamic range, show how engineering ingenuity can substantially improve both model quality and efficiency.
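
To make the routing idea concrete, here is a minimal NumPy sketch of bias-adjusted top-k expert selection in the spirit of the auxiliary-loss-free load balancing described for DeepSeek v3. The function and variable names (`route_tokens`, `update_bias`, `affinity`, `bias`) and the sizes in the toy usage are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def route_tokens(affinity, bias, k=8):
    """Pick the top-k experts for each token.

    affinity : (num_tokens, num_experts) token-to-expert scores
    bias     : (num_experts,) per-expert bias used only for expert *selection*
    """
    biased = affinity + bias                      # under-loaded experts get a boost
    topk = np.argpartition(-biased, k, axis=1)[:, :k]
    # Gating weights use the unbiased affinities, renormalized over the chosen experts.
    gates = np.take_along_axis(affinity, topk, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)
    return topk, gates

def update_bias(bias, expert_load, target_load, step=1e-3):
    """Raise the bias of under-loaded experts and lower it for over-loaded ones,
    steering the router away from collapse without an auxiliary loss term."""
    return bias - step * np.sign(expert_load - target_load)

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
num_tokens, num_experts, k = 16, 64, 8
affinity = rng.random((num_tokens, num_experts))
bias = np.zeros(num_experts)

topk, gates = route_tokens(affinity, bias, k)
load = np.bincount(topk.ravel(), minlength=num_experts)        # tokens routed per expert
bias = update_bias(bias, load, target_load=num_tokens * k / num_experts)
```

The key design point is that the bias only influences which experts are chosen, not how their outputs are weighted, so load balancing does not distort the gating values themselves.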
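
Similarly, the sketch below illustrates why fine-granularity quantization helps with FP8's limited dynamic range: one scaling factor per tile keeps a single outlier from wasting the precision of the whole tensor. The tile size, the use of the e4m3 maximum as a normalization target, and the round-to-grid step standing in for a real hardware FP8 cast are all simplifications assumed for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest magnitude representable in the e4m3 FP8 format

def quantize_tilewise(x, tile=128):
    """Quantize a 2-D matrix with one scaling factor per (tile x tile) block.

    Scaling each block independently means an outlier only degrades the
    resolution of its own block instead of the whole tensor. The np.round
    call is just a stand-in for an actual hardware FP8 cast. Assumes the
    matrix dimensions are multiples of the tile size, for brevity.
    """
    rows, cols = x.shape
    q = np.empty_like(x)
    scales = np.empty((rows // tile, cols // tile))
    for i in range(0, rows, tile):
        for j in range(0, cols, tile):
            block = x[i:i + tile, j:j + tile]
            scale = max(np.abs(block).max() / FP8_E4M3_MAX, 1e-12)
            q_block = np.clip(np.round(block / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
            q[i:i + tile, j:j + tile] = q_block * scale         # dequantize for comparison
            scales[i // tile, j // tile] = scale
    return q, scales

x = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
x[0, 0] = 300.0                  # an outlier that would dominate a single global scale
xq, scales = quantize_tilewise(x)
# The coarse scale stays confined to the outlier's tile; the other tiles keep a fine grid.
print(scales)
```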