Company:
Date Published:
Author: Deep
Word count: 2022
Language: English
Hacker News points: None

Summary

GLM-4.6 is a high-capacity, reasoning-tuned model from Zhipu, built for applications such as coding copilots, long-context retrieval-augmented generation (RAG), and multi-tool agent loops; its context window grows to 200K tokens, up from its predecessor GLM-4.5. DeepInfra's deployment delivers a sub-second time to first token (TTFT) of 0.51 seconds and sustains a competitive 48 tokens per second at 100K input tokens, while charging the lowest output price among the providers compared: $1.90 per million tokens. Baseten posts the fastest TTFT and the highest throughput, but costs more per output token. The article positions DeepInfra as the best balance of speed, predictability, and cost for workloads that need strong reasoning and extensive context handling, arguing that consistent responsiveness and steady performance matter more than peak benchmark numbers.
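The trade-off the article describes can be made concrete with a back-of-the-envelope estimate: perceived latency is roughly TTFT plus generation time (output tokens divided by throughput), and output cost is tokens times the per-million price. The sketch below plugs in the DeepInfra figures cited above (0.51 s TTFT, 48 tokens/s, $1.90 per million output tokens); it is an illustrative calculation, not DeepInfra's billing or benchmarking logic, and the 1,000-token response size is an assumed example value.

```python
def response_time_and_cost(ttft_s: float, tokens_per_s: float,
                           output_tokens: int, usd_per_million_output: float):
    """Estimate end-to-end latency and output cost for one streamed completion.

    Latency model: time to first token + (output tokens / decode throughput).
    Cost model: output tokens * per-million-token price.
    """
    total_s = ttft_s + output_tokens / tokens_per_s
    cost_usd = output_tokens / 1_000_000 * usd_per_million_output
    return total_s, cost_usd

# DeepInfra figures cited in the article; 1,000 output tokens is an assumed workload.
latency, cost = response_time_and_cost(0.51, 48, 1_000, 1.90)
print(f"{latency:.2f} s, ${cost:.4f}")
```

Running the same function with another provider's TTFT, throughput, and price makes the speed-versus-cost comparison in the article directly reproducible for any response length.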