Run a vLLM Server on HF Jobs in One Command
Blog post from HuggingFace
Quentin Gallouédec provides a guide on deploying a private, OpenAI-compatible large language model (LLM) endpoint on Hugging Face's infrastructure using a single command, eliminating the need for manual server provisioning and Kubernetes management, and offering a pay-per-second billing model. The setup involves using the official vllm/vllm-openai image, requesting a GPU, and exposing the model's port through Hugging Face's public jobs proxy for easy access from any location via an API token. It caters to various use cases such as tests, evaluations, and batch generation, and details how to scale the command for larger models, use curl or Python for queries, and secure access with an HF token. The post also explains additional functionalities like integrating with Gradio for a UI chat interface, SSH access for debugging, and utilizing the endpoint as a coding-agent backend with Pi. It provides a comparison between Hugging Face Jobs and Inference Endpoints, recommending the former for flexibility and experiments, and the latter for production-ready, long-term services with enhanced access control and operational features.
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Kubernetes | 1 | 1,993 | 294 | 100 | +1% |
| LLM | 1 | 5,172 | 1,006 | 220 | -43% |