Waypoint-1: Real-time Interactive Video Diffusion from Overworld
Blog post from Hugging Face
Waypoint-1, developed by Overworld, is a real-time interactive video diffusion model built for immersive experiences: users steer the generated world with text, mouse, and keyboard input and see the result with very low latency. The model is a frame-causal rectified flow transformer trained on 10,000 hours of video game footage. It is a latent model, operating on compressed frames rather than raw pixels, which helps interactivity. Where other models are limited in control and suffer from latency, Waypoint-1 offers smooth camera movement and responsive input. Training combines diffusion forcing and self-forcing to improve frame generation accuracy and reduce error accumulation over long rollouts. Inference runs on WorldEngine, Overworld's inference library, which is optimized for low latency and high throughput and reaches up to 60 FPS through targeted optimizations such as AdaLN feature caching and a static rolling KV cache. Overworld encourages community engagement through events such as hackathons to explore further enhancements to WorldEngine.
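The post does not describe how WorldEngine implements its static rolling KV cache. As a rough illustration of the general idea, assuming the term means a fixed-size, preallocated buffer in which each new frame's keys and values overwrite the oldest entry, a minimal Python sketch might look like the following. The class name `RollingKVCache` and its methods are hypothetical and are not part of WorldEngine's API; a real implementation would hold GPU tensors rather than Python objects.

```python
class RollingKVCache:
    """Hypothetical fixed-capacity cache of per-frame (key, value) pairs.

    Illustrative only: this is not WorldEngine's actual data structure.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Storage is allocated once and never resized ("static").
        self._slots = [None] * capacity
        # Total number of frames appended so far.
        self._count = 0

    def append(self, key, value):
        # Once the buffer is full, the oldest slot is overwritten ("rolling").
        self._slots[self._count % self.capacity] = (key, value)
        self._count += 1

    def window(self):
        """Return the cached entries ordered from oldest to newest."""
        n = min(self._count, self.capacity)
        start = self._count - n
        return [self._slots[i % self.capacity] for i in range(start, self._count)]


# Usage: with room for 3 frames, only the 3 most recent of 5 remain.
cache = RollingKVCache(capacity=3)
for frame_index in range(5):
    cache.append(f"k{frame_index}", f"v{frame_index}")
print(cache.window())
```

Because the buffer never grows, memory use and the number of cached frames attended to stay constant however long the rollout runs, which is one plausible reason such a cache suits real-time generation.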