
GLM-4.6 API: Get fast first tokens at the best $/M from Deepinfra's API - Deep Infra

Blog post from Deepinfra

Post Details

Company: Deepinfra
Date Published:
Author: Deep
Word Count: 2,022
Language: English
Hacker News Points: -
Summary

GLM-4.6, a high-capacity reasoning-tuned model from Zhipu, targets applications such as coding copilots, long-context retrieval-augmented generation (RAG), and multi-tool agent loops, with its context window expanded to 200k tokens over its predecessor, GLM-4.5. DeepInfra's deployment of GLM-4.6 stands out for a sub-second time-to-first-token (TTFT) of 0.51 seconds and a competitive throughput of 48 tokens per second at 100k input tokens, at the lowest output price among the providers compared: $1.90 per million tokens. Baseten delivers the fastest TTFT and highest throughput but charges more per output token. The article positions DeepInfra as the best balance of speed, predictability, and cost for workloads that demand strong reasoning and extensive context handling, arguing that steady, responsive performance matters more than peak benchmark numbers.
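Since the comparison hinges on TTFT, a simple way to reproduce such a measurement is to time the arrival of the first streamed chunk. The sketch below is a minimal, generic helper; the commented usage assumes DeepInfra's OpenAI-compatible endpoint, and the model id and base URL shown there are assumptions, not taken from the article.

```python
import time


def measure_ttft(stream):
    """Consume an iterable of streamed text chunks and return
    (ttft_seconds, chunks): time from call to first chunk, plus all chunks."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            # First chunk arrived: record time-to-first-token.
            ttft = time.monotonic() - start
        chunks.append(chunk)
    return ttft, chunks


# Hypothetical usage against DeepInfra's OpenAI-compatible endpoint
# (base URL and model id are assumptions for illustration):
#
# from openai import OpenAI
# client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
#                 api_key="YOUR_KEY")
# resp = client.chat.completions.create(
#     model="zai-org/GLM-4.6",
#     messages=[{"role": "user", "content": "Summarize RAG in one line."}],
#     stream=True,
# )
# ttft, chunks = measure_ttft(
#     c.choices[0].delta.content or "" for c in resp
# )
```

Measuring against a streaming response (rather than a full completion) is what makes the 0.51 s figure meaningful: the user perceives the first token, not the final one.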