How we built the most performant DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B on DigitalOcean Serverless Inference
Blog post from DigitalOcean
DigitalOcean has announced the availability of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B on its Serverless Inference platform, emphasizing their superior output speed; DeepSeek V3.2 in particular delivers 230 tokens per second with a sub-1-second Time-to-First-Token (TTFT) at 10,000 input tokens. Fast inference matters because of the rise of real-time AI applications, where latency directly affects user engagement. Reaching this level of performance required optimizing every layer of the stack, from hardware to software, including leveraging NVIDIA's Blackwell Ultra GPUs and applying techniques such as model quantization and speculative decoding. These efforts yielded substantial gains, with DeepSeek V3.2 outperforming major competitors such as AWS and Google on key benchmarks. The optimized models have already improved customer applications such as Workato by significantly reducing latency and inference costs, and DigitalOcean plans to keep scaling its infrastructure to meet growing demand for high-performance AI inference.
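Since the headline numbers are TTFT and output tokens per second, here is a minimal sketch of how a client could measure both against a streaming, OpenAI-compatible chat-completions endpoint. The base URL, API key, and model identifier below are placeholders, not DigitalOcean's documented values, and the characters-per-token ratio is a rough assumption; consult the platform documentation for the real endpoint and model names.

```python
import time
from openai import OpenAI

# Placeholder endpoint and credentials (assumptions); substitute the values
# from the platform's documentation.
client = OpenAI(base_url="https://inference.example.com/v1", api_key="YOUR_API_KEY")

start = time.perf_counter()
first_token_time = None
text_parts = []

# Stream the completion so the gap before the first chunk can be measured as TTFT.
stream = client.chat.completions.create(
    model="deepseek-v3.2",  # placeholder model id (assumption)
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        text_parts.append(delta)

end = time.perf_counter()

if first_token_time is not None:
    ttft = first_token_time - start
    decode_seconds = max(end - first_token_time, 1e-9)
    # Rough proxy: ~4 characters per output token, to avoid pulling in a tokenizer.
    approx_tokens = len("".join(text_parts)) / 4
    print(f"TTFT: {ttft:.2f}s  |  ~{approx_tokens / decode_seconds:.0f} output tokens/s")
else:
    print("No tokens received")
```

Measuring TTFT from the first streamed chunk and throughput only over the decode phase mirrors how the figures quoted above (sub-1-second TTFT, 230 tokens per second) are typically reported for long prompts.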