Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale
Blog post from Together AI
Cursor is an AI-driven coding platform that uses real-time intelligence to speed up development: it predicts edits, assists with refactoring, and manages context as developers work. To keep the editor responsive, Cursor partnered with Together AI to build inference infrastructure on the NVIDIA Blackwell architecture, with a focus on low-latency inference, predictable latency, and stable operation under concurrent workloads.

Through the partnership, Cursor gained early access to NVIDIA Blackwell hardware, running on NVIDIA GB200 NVL72 and HGX B200 systems for enhanced performance. The engineering work included porting the inference stack to the ARM architecture and developing custom kernels for Blackwell's new Tensor Core instructions to exploit the hardware's parallelism efficiently. Quantization was applied to balance memory constraints against output quality, which is critical for preserving code accuracy.

In production, Cursor's deployment emphasizes throughput and utilization: as demand grows, higher-throughput endpoints improve per-GPU economics.
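The post does not describe Cursor's actual quantization scheme, but the memory/accuracy trade-off it mentions can be illustrated with a minimal sketch of symmetric per-tensor int8 weight quantization (the function names and scheme here are illustrative assumptions, not Cursor's implementation):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map float32 weights to [-127, 127] int8."""
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float32 weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Storage drops 4x (float32 -> int8); the rounding error per weight is
# bounded by half the quantization step, which is what bounds output drift.
max_err = float(np.max(np.abs(w - w_hat)))
```

Production stacks typically go further (per-channel scales, lower-bit formats, activation quantization), but the same tension applies: a coarser scale saves memory while the reconstruction error is what ultimately threatens code-generation accuracy.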