Run a vLLM Server on HF Jobs in One Command

Post Details

Company

HuggingFace

Date Published

June 26, 2026

Author

Quentin Gallouédec

Word Count

1,611

Company Posts That Month

90

Language

-

Hacker News Points

-

Source URL

huggingface.co/blog/vllm-jobs

Summary

Quentin Gallouédec provides a guide on deploying a private, OpenAI-compatible large language model (LLM) endpoint on Hugging Face's infrastructure using a single command, eliminating the need for manual server provisioning and Kubernetes management, and offering a pay-per-second billing model. The setup involves using the official vllm/vllm-openai image, requesting a GPU, and exposing the model's port through Hugging Face's public jobs proxy for easy access from any location via an API token. It caters to various use cases such as tests, evaluations, and batch generation, and details how to scale the command for larger models, use curl or Python for queries, and secure access with an HF token. The post also explains additional functionalities like integrating with Gradio for a UI chat interface, SSH access for debugging, and utilizing the endpoint as a coding-agent backend with Pi. It provides a comparison between Hugging Face Jobs and Inference Endpoints, recommending the former for flexibility and experiments, and the latter for production-ready, long-term services with enhanced access control and operational features.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	1	1,993	294	100	+1%
LLM	1	5,172	1,006	220	-43%