
Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang

Blog post from HuggingFace

Post Details
Author: Novita AI
Word Count: 1,047
Summary

Novita AI has developed a suite of optimizations that improve the performance of GLM4-MoE models on the SGLang framework, achieving up to a 65% reduction in Time-to-First-Token (TTFT) and a 22% improvement in Time-Per-Output-Token (TPOT) on agentic coding tasks. The optimizations include Shared Experts Fusion, which unifies shared and routed experts for better compute efficiency, and QKNorm Fusion, which merges head-wise normalization computations into a single kernel to cut launch overhead. Async Transfer improves throughput by advancing the data-transfer step so it overlaps with computation, while Suffix Decoding exploits pattern repetition in generated output to further reduce TPOT, particularly in agentic coding scenarios. These enhancements have been validated on H200 clusters and are already running in production, delivering significant latency and throughput improvements in demanding environments.