Company:
Date Published:
Author: Deep
Word count: 2022
Language: English
Hacker News points: None

Summary

GLM-4.6 is a high-capacity, reasoning-tuned model from Zhipu, built for applications such as coding copilots, long-context retrieval-augmented generation (RAG), and multi-tool agent loops; its context window grows to 200K tokens, up from its predecessor GLM-4.5. DeepInfra's deployment delivers a sub-second time to first token (TTFT) of 0.51 seconds and sustains a competitive 48 tokens per second at 100K input tokens, while charging the lowest output price among the providers compared: $1.90 per million tokens. Baseten posts the fastest TTFT and the highest throughput, but costs more per output token. The article positions DeepInfra as the best balance of speed, predictability, and cost for workloads that need strong reasoning and extensive context handling, arguing that consistent responsiveness and steady performance matter more than peak benchmark numbers.
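The trade-off the article describes can be made concrete with a back-of-the-envelope estimate: perceived latency is roughly TTFT plus generation time (output tokens divided by throughput), and output cost is tokens times the per-million price. The sketch below plugs in the DeepInfra figures cited above (0.51 s TTFT, 48 tokens/s, $1.90 per million output tokens); it is an illustrative calculation, not DeepInfra's billing or benchmarking logic, and the 1,000-token response size is an assumed example value.

```python
def response_time_and_cost(ttft_s: float, tokens_per_s: float,
                           output_tokens: int, usd_per_million_output: float):
    """Estimate end-to-end latency and output cost for one streamed completion.

    Latency model: time to first token + (output tokens / decode throughput).
    Cost model: output tokens * per-million-token price.
    """
    total_s = ttft_s + output_tokens / tokens_per_s
    cost_usd = output_tokens / 1_000_000 * usd_per_million_output
    return total_s, cost_usd

# DeepInfra figures cited in the article; 1,000 output tokens is an assumed workload.
latency, cost = response_time_and_cost(0.51, 48, 1_000, 1.90)
print(f"{latency:.2f} s, ${cost:.4f}")
```

Running the same function with another provider's TTFT, throughput, and price makes the speed-versus-cost comparison in the article directly reproducible for any response length.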