Home / Companies / HuggingFace / Blog / Post Details
Content Deep Dive

Run a vLLM Server on HF Jobs in One Command

Blog post from HuggingFace

Post Details
Company
Date Published
Author
Quentin Gallouédec
Word Count
1,611
Company Posts That Month
90
Language
-
Hacker News Points
-
Summary

Quentin Gallouédec provides a guide on deploying a private, OpenAI-compatible large language model (LLM) endpoint on Hugging Face's infrastructure using a single command, eliminating the need for manual server provisioning and Kubernetes management, and offering a pay-per-second billing model. The setup involves using the official vllm/vllm-openai image, requesting a GPU, and exposing the model's port through Hugging Face's public jobs proxy for easy access from any location via an API token. It caters to various use cases such as tests, evaluations, and batch generation, and details how to scale the command for larger models, use curl or Python for queries, and secure access with an HF token. The post also explains additional functionalities like integrating with Gradio for a UI chat interface, SSH access for debugging, and utilizing the endpoint as a coding-agent backend with Pi. It provides a comparison between Hugging Face Jobs and Inference Endpoints, recommending the former for flexibility and experiments, and the latter for production-ready, long-term services with enhanced access control and operational features.

Trends Found in this Post
Trend Post Mentions Total Month Mentions Posts Companies MoM
Kubernetes 1 1,993 294 100 +1%
LLM 1 5,172 1,006 220 -43%