Observing vLLM with OpenTelemetry and Dash0

Blog post from Dash0

Post Details
Company: Dash0
Date Published:
Author: Julia Furst Morgado
Word Count: 2,995
Language: English
Summary

vLLM is an inference server with built-in OpenTelemetry instrumentation, but it needs specific configuration before it delivers useful observability in production. Standard Application Performance Monitoring (APM) can show that a request was slow; vLLM's inference-specific signals, such as KV cache utilization and queue depth, show why, separating causes like KV cache preemptions from decode bottlenecks. The post builds a trace and metrics pipeline with Docker Compose, consisting of a FastAPI RAG app, a vLLM server, and an OTel Collector, with Dash0 as the observability backend. Distributed traces break LLM latency into scheduling, prefill, and decode phases, each of which calls for a different tuning strategy, and this breakdown is what distinguishes LLM inference latency from that of a standard HTTP service. In Dash0, metrics for GPU cache usage and queue depth support proactive capacity planning and latency debugging.
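
To make the idea of inference-specific signals concrete, here is a minimal sketch of how a snapshot of Prometheus-format metrics could be turned into a rough guess at the latency cause. It is not taken from the post. The metric names (`vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`, `vllm:num_preemptions_total`) are assumptions based on names vLLM has exposed and may differ between versions, and the sample values, thresholds, and the `parse_metrics` and `classify` helpers are hypothetical.

```python
# Sketch only: metric names, sample values, and thresholds are assumptions.
# Check the names against the /metrics output of your own vLLM version.

SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:num_requests_running{model_name="demo"} 8.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
vllm:num_preemptions_total{model_name="demo"} 3.0
"""


def parse_metrics(text):
    """Parse simple 'name{labels} value' lines into a {name: value} dict.

    Comment lines (# HELP / # TYPE) are skipped. Optional timestamps and
    multiple label sets per metric are not handled in this sketch.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


def classify(metrics, cache_threshold=0.9, queue_threshold=1):
    """Return a list of likely capacity-related latency causes.

    The preemption metric is a cumulative counter, so a real check would
    compare two snapshots; a single non-zero reading is used here for brevity.
    """
    findings = []
    if metrics.get("vllm:num_preemptions_total", 0) > 0:
        findings.append("kv-cache preemptions")
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) >= cache_threshold:
        findings.append("kv-cache near capacity")
    if metrics.get("vllm:num_requests_waiting", 0) >= queue_threshold:
        findings.append("scheduling queue backlog")
    return findings or ["no capacity signal; inspect prefill/decode spans"]


if __name__ == "__main__":
    for finding in classify(parse_metrics(SAMPLE)):
        print(finding)
```

In a real deployment the same checks would more likely live in the observability backend as alert rules or dashboard queries; the script form only illustrates which signals point at which cause.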