The RTX 5090 Is Here: Serve 65,000+ Tokens Per Second on RunPod
Blog post from RunPod
RunPod now offers the NVIDIA RTX 5090, a GPU built for real-time language model inference. With 32 GB of VRAM and next-generation compute, it is well suited to small and mid-sized models serving high-concurrency workloads such as chatbots and inference APIs.

In internal benchmarks using vLLM, a high-throughput inference engine, a Qwen2-0.5B deployment sustained more than 65,000 tokens per second of aggregate throughput. VRAM usage showed the card's full memory capacity being put to work for large-batch and low-latency serving, and efficiency remained stable even at high concurrency levels, which is exactly what matters for scalable production endpoints.

The RTX 5090 also leaves room to grow: you can scale up to larger models or host several models on a single card. For high-volume inference, that flexibility can translate into a meaningfully lower cost per request, a benefit for both startups and large-scale deployments.

The RTX 5090 is available on RunPod today for on-demand and containerized workloads. You can deploy with a prebuilt template or bring your own custom setup.
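As a rough sketch of what such a deployment might look like, the commands below launch a vLLM OpenAI-compatible server and query it. The model name matches the benchmark, but the flag values are illustrative assumptions, not the benchmark's published configuration:

```shell
# Launch an OpenAI-compatible vLLM server. Flag values are illustrative:
# --gpu-memory-utilization lets vLLM claim most of the card's VRAM for
# model weights plus KV cache; --max-num-seqs raises the ceiling on
# concurrently batched sequences for high-concurrency serving.
vllm serve Qwen/Qwen2-0.5B \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256

# In another shell, send a standard OpenAI-style completion request:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2-0.5B", "prompt": "Hello, world", "max_tokens": 32}'
```

On RunPod, the same server can run inside a pod started from a prebuilt vLLM template, with the port exposed through the pod's HTTP proxy.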