Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang
Blog post from HuggingFace
Novita AI has developed a suite of optimizations for GLM4-MoE models on the SGLang framework, achieving up to a 65% reduction in Time-to-First-Token (TTFT) and a 22% improvement in Time-Per-Output-Token (TPOT) on agentic coding workloads. The optimizations are:

- Shared Experts Fusion: unifies the shared and routed experts so both run in a single grouped computation, improving compute efficiency.
- QKNorm Fusion: merges the head-wise query/key normalization into a single kernel, reducing launch overhead.
- Async Transfer: advances the data-transfer step so it overlaps with computation, improving throughput.
- Suffix Decoding: exploits pattern repetition in generated output to draft tokens cheaply, further decreasing TPOT, particularly in agentic coding scenarios.

These enhancements have been validated on H200 clusters and are already running in production, delivering significant latency and throughput improvements in demanding environments.
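The core idea behind Shared Experts Fusion can be sketched in a toy form: instead of running the routed experts and the shared expert as two separate passes, the shared expert is appended as an always-selected expert so a single pass (on GPU, a single grouped kernel) covers everything. This is a minimal pure-Python illustration with scalar stand-ins for experts; the function names and shapes are hypothetical, not SGLang's actual API.

```python
# Toy sketch of Shared Experts Fusion (illustrative names, not SGLang's API).
# Each "expert" here is a scalar function; real MoE experts are MLPs
# executed as grouped GEMMs on the GPU.

def make_expert(w):
    return lambda x: w * x

routed_experts = [make_expert(w) for w in (1.0, 2.0, 3.0, 4.0)]
shared_expert = make_expert(0.5)

def moe_naive(x, top_k_ids):
    # Two separate passes: routed experts, then the shared expert.
    routed = sum(routed_experts[i](x) for i in top_k_ids)
    return routed + shared_expert(x)

def moe_fused(x, top_k_ids):
    # Fusion: treat the shared expert as an always-selected extra expert,
    # so one pass (one grouped kernel on GPU) covers all experts.
    all_experts = routed_experts + [shared_expert]
    fused_ids = list(top_k_ids) + [len(routed_experts)]
    return sum(all_experts[i](x) for i in fused_ids)

# Both formulations produce identical outputs.
assert moe_naive(2.0, [1, 3]) == moe_fused(2.0, [1, 3])
```

The payoff on real hardware is that the fused formulation avoids launching a separate kernel for the shared expert and keeps all expert work in one batched computation.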
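The Async Transfer idea is a classic pipelining pattern: start the next batch's data transfer while the current batch is still computing, so transfer time hides behind compute time. This is a hedged pure-Python sketch using a thread pool; `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a forward pass, not SGLang's implementation.

```python
# Toy sketch of Async Transfer (illustrative, not SGLang's implementation).
# The transfer for batch i+1 is submitted before batch i is computed,
# so the copy overlaps with compute instead of serializing with it.
from concurrent.futures import ThreadPoolExecutor

def transfer(batch):           # stand-in for a host-to-device copy
    return list(batch)

def compute(batch):            # stand-in for the forward pass
    return [x * x for x in batch]

def run_pipelined(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, batches[0])   # transfer advanced
        for i in range(len(batches)):
            ready = pending.result()
            if i + 1 < len(batches):                  # start next copy early
                pending = pool.submit(transfer, batches[i + 1])
            results.append(compute(ready))            # overlaps next transfer
    return results

assert run_pipelined([[1, 2], [3, 4]]) == [[1, 4], [9, 16]]
```

The pipelined schedule produces the same results as a sequential one; the benefit is purely in latency, since each transfer runs concurrently with the previous batch's compute.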
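Suffix Decoding exploits the fact that agentic coding output is highly repetitive: when the tail of the generated sequence has occurred before, the tokens that followed it last time are good speculative drafts, which the model then verifies in a single forward pass. The following is a minimal, assumption-laden sketch of the drafting step; a production implementation would build a suffix structure for fast lookup, and `max_suffix`/`num_draft` are hypothetical parameters.

```python
# Toy sketch of the Suffix Decoding drafting step (illustrative only).
# Find the longest suffix of `history` that occurred earlier, and draft
# the tokens that followed that earlier occurrence.

def draft_tokens(history, max_suffix=8, num_draft=4):
    # Prefer the longest matching suffix: longer matches give better drafts.
    for n in range(min(max_suffix, len(history) - 1), 0, -1):
        suffix = history[-n:]
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == suffix:
                follow = history[start + n:start + n + num_draft]
                if follow:
                    return follow
    return []

# "def foo ( ) :" appeared earlier, so its continuation is drafted.
history = ["def", "foo", "(", ")", ":", "return", "x", "\n",
           "def", "foo", "(", ")", ":"]
print(draft_tokens(history))  # → ['return', 'x', '\n', 'def']
```

When the drafts are accepted by the verification pass, several tokens are emitted per model step instead of one, which is where the TPOT reduction in repetitive coding traces comes from.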