Serverless 2.0: Three Ways to Run Inference, One API
Blog post from Fireworks AI
Serverless 2.0 offers three serving paths (Standard, Priority, and Fast) behind a single API, with no reserved capacity required. Standard is the default, cost-efficient option; Priority provides stronger admission during network congestion; and Fast offers high throughput for speed-sensitive applications. Users manage reliability and throughput by choosing the path that fits each workload. The platform also resolves earlier ambiguity in its error codes by distinguishing rate-limit problems from temporary saturation, so retry logic and alerts can be configured more accurately. Serverless 2.0 is designed to accommodate evolving AI product demands: teams can stay on pay-per-token pricing while they learn their production requirements, without an immediate need for dedicated deployments. The system also introduces Background processing, which handles asynchronous tasks at a reduced cost.
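To make the point about error codes concrete, here is a minimal sketch of retry logic that treats the two failure classes differently. The post does not say which status codes Serverless 2.0 returns, so the values 429 (rate limit) and 503 (temporary saturation) are assumptions, as are the names decide_retry, RetryDecision, and the alert bucket labels; none of this is the Fireworks API.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# ASSUMPTION: the post does not name the status codes. These are placeholders.
RATE_LIMIT_STATUS = 429   # hypothetical: the caller exceeded its own rate limit
SATURATION_STATUS = 503   # hypothetical: the platform is temporarily saturated


@dataclass
class RetryDecision:
    retry: bool            # whether the client should try again
    delay_seconds: float   # how long to wait before the next attempt
    alert_bucket: str      # which alert counter this failure belongs to


def decide_retry(
    status: int,
    attempt: int,
    retry_after: Optional[float] = None,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    rng: Callable[[], float] = random.random,
) -> RetryDecision:
    """Classify a failed response and pick a retry delay (attempt is 0-based)."""
    if status == RATE_LIMIT_STATUS:
        bucket = "rate_limit"
    elif status == SATURATION_STATUS:
        bucket = "saturation"
    else:
        # Any other error is not retried in this sketch.
        return RetryDecision(False, 0.0, "other")

    if attempt >= max_attempts:
        # Out of attempts: give up, but keep the bucket so alerts stay accurate.
        return RetryDecision(False, 0.0, bucket)

    backoff = base_delay * (2 ** attempt)
    if bucket == "rate_limit":
        # Honor a server-provided wait if there is one, else plain exponential backoff.
        delay = retry_after if retry_after is not None else backoff
    else:
        # Saturation is transient: exponential backoff with full jitter so that
        # many clients do not retry at the same instant.
        delay = rng() * backoff
    return RetryDecision(True, delay, bucket)


if __name__ == "__main__":
    print(decide_retry(RATE_LIMIT_STATUS, attempt=0, retry_after=2.0))
    print(decide_retry(SATURATION_STATUS, attempt=2))
```

Keeping the two classes in separate alert buckets lets a team page on sustained rate limiting, which usually calls for a quota or path change, while tolerating short bursts of saturation that clear on retry.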