Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang
Blog post from HuggingFace
Novita AI has developed a suite of optimizations for GLM4-MoE models on the SGLang framework, achieving up to a 65% reduction in Time-to-First-Token (TTFT) and a 22% improvement in Time-Per-Output-Token (TPOT) on agentic coding workloads. The optimizations are:

- Shared Experts Fusion: unifies the shared and routed experts so both run in a single grouped computation, improving compute efficiency.
- QKNorm Fusion: merges the head-wise query/key normalization into a single kernel, reducing launch overhead.
- Async Transfer: advances the data-transfer step so it overlaps with computation, improving throughput.
- Suffix Decoding: exploits pattern repetition in generated output to draft tokens cheaply, further decreasing TPOT, particularly in agentic coding scenarios.

These enhancements have been validated on H200 clusters and are already running in production, delivering significant latency and throughput improvements in demanding environments.
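The core idea behind Shared Experts Fusion can be sketched in a toy form: instead of running the routed experts and the shared expert as two separate passes, the shared expert is appended as an always-selected expert so a single pass (on GPU, a single grouped kernel) covers everything. This is a minimal pure-Python illustration with scalar stand-ins for experts; the function names and shapes are hypothetical, not SGLang's actual API.

```python
# Toy sketch of Shared Experts Fusion (illustrative names, not SGLang's API).
# Each "expert" here is a scalar function; real MoE experts are MLPs
# executed as grouped GEMMs on the GPU.

def make_expert(w):
    return lambda x: w * x

routed_experts = [make_expert(w) for w in (1.0, 2.0, 3.0, 4.0)]
shared_expert = make_expert(0.5)

def moe_naive(x, top_k_ids):
    # Two separate passes: routed experts, then the shared expert.
    routed = sum(routed_experts[i](x) for i in top_k_ids)
    return routed + shared_expert(x)

def moe_fused(x, top_k_ids):
    # Fusion: treat the shared expert as an always-selected extra expert,
    # so one pass (one grouped kernel on GPU) covers all experts.
    all_experts = routed_experts + [shared_expert]
    fused_ids = list(top_k_ids) + [len(routed_experts)]
    return sum(all_experts[i](x) for i in fused_ids)

# Both formulations produce identical outputs.
assert moe_naive(2.0, [1, 3]) == moe_fused(2.0, [1, 3])
```

The payoff on real hardware is that the fused formulation avoids launching a separate kernel for the shared expert and keeps all expert work in one batched computation.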
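The Async Transfer idea is a classic pipelining pattern: start the next batch's data transfer while the current batch is still computing, so transfer time hides behind compute time. This is a hedged pure-Python sketch using a thread pool; `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a forward pass, not SGLang's implementation.

```python
# Toy sketch of Async Transfer (illustrative, not SGLang's implementation).
# The transfer for batch i+1 is submitted before batch i is computed,
# so the copy overlaps with compute instead of serializing with it.
from concurrent.futures import ThreadPoolExecutor

def transfer(batch):           # stand-in for a host-to-device copy
    return list(batch)

def compute(batch):            # stand-in for the forward pass
    return [x * x for x in batch]

def run_pipelined(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, batches[0])   # transfer advanced
        for i in range(len(batches)):
            ready = pending.result()
            if i + 1 < len(batches):                  # start next copy early
                pending = pool.submit(transfer, batches[i + 1])
            results.append(compute(ready))            # overlaps next transfer
    return results

assert run_pipelined([[1, 2], [3, 4]]) == [[1, 4], [9, 16]]
```

The pipelined schedule produces the same results as a sequential one; the benefit is purely in latency, since each transfer runs concurrently with the previous batch's compute.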
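Suffix Decoding exploits the fact that agentic coding output is highly repetitive: when the tail of the generated sequence has occurred before, the tokens that followed it last time are good speculative drafts, which the model then verifies in a single forward pass. The following is a minimal, assumption-laden sketch of the drafting step; a production implementation would build a suffix structure for fast lookup, and `max_suffix`/`num_draft` are hypothetical parameters.

```python
# Toy sketch of the Suffix Decoding drafting step (illustrative only).
# Find the longest suffix of `history` that occurred earlier, and draft
# the tokens that followed that earlier occurrence.

def draft_tokens(history, max_suffix=8, num_draft=4):
    # Prefer the longest matching suffix: longer matches give better drafts.
    for n in range(min(max_suffix, len(history) - 1), 0, -1):
        suffix = history[-n:]
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == suffix:
                follow = history[start + n:start + n + num_draft]
                if follow:
                    return follow
    return []

# "def foo ( ) :" appeared earlier, so its continuation is drafted.
history = ["def", "foo", "(", ")", ":", "return", "x", "\n",
           "def", "foo", "(", ")", ":"]
print(draft_tokens(history))  # → ['return', 'x', '\n', 'def']
```

When the drafts are accepted by the verification pass, several tokens are emitted per model step instead of one, which is where the TPOT reduction in repetitive coding traces comes from.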