
The RTX 5090 Is Here: Serve 65,000+ Tokens Per Second on RunPod

Blog post from RunPod

Post Details
Company: RunPod
Date Published: -
Author: Alyssa Mazzina
Word Count: 561
Language: English
Hacker News Points: -
Summary

RunPod has introduced access to the NVIDIA RTX 5090 GPU, which pairs high throughput with enough memory capacity for real-time inference on small and mid-sized language models. The card is well suited to high-concurrency deployments such as chatbots and inference APIs: in RunPod's internal benchmarks, models like Qwen2-0.5B sustained aggregate throughput of more than 65,000 tokens per second.

The benchmarks ran on vLLM, a high-performance inference engine that handles both large-batch and low-latency workloads; VRAM usage during the runs showed the card's memory being fully utilized. Efficiency remained stable even at high concurrency, making the RTX 5090 a strong fit for scalable production endpoints.

The GPU also leaves headroom to scale up to larger models or to host multiple models at once, offering a cost-effective option for high-volume inference that can lower cost per request for both startups and large-scale deployments. The RTX 5090 is now available on RunPod for on-demand and containerized workloads, so users can deploy models quickly from prebuilt templates or custom setups.
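As a rough sketch of what a containerized vLLM deployment of a small model like Qwen2-0.5B might look like (the image tag, port, and flags below follow vLLM's public Docker image and are illustrative assumptions, not RunPod's actual template):

```shell
# Illustrative sketch: serving Qwen2-0.5B with vLLM's OpenAI-compatible
# server in a container. Flags shown are vLLM's, but the exact RunPod
# prebuilt template may differ.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2-0.5B \
    --max-num-seqs 256   # permit many concurrent sequences per batch

# Once the server is up, it accepts standard OpenAI-style requests:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2-0.5B", "prompt": "Hello", "max_tokens": 32}'
```

Because vLLM exposes an OpenAI-compatible API, existing client code can usually point at the endpoint with only a base-URL change.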