Mochi 1: New State of the Art in Open-Source Text-to-Video
Blog post from RunPod
Text-to-video generation has lagged in the open-source domain because video models are complex and costly to train, but the release of Mochi 1 by Genmo marks a significant advance in the field. Mochi 1 generates videos from text prompts at 30 frames per second, up to 5.4 seconds long at 480p resolution, with an emphasis on photorealism, motion quality, and prompt adherence.

While the model calls for substantial compute (four H100 GPUs for optimal performance), the workflow can be adapted to more modest hardware. It uses VAE tiling to stay within memory limits, though this can introduce some image-quality tradeoffs. You can experiment with the ComfyUI workflow on a single GPU; it runs particularly well on the H100 NVL and should improve further with the upcoming H200. There is plenty of room for creative experimentation, and there are several options for deploying Mochi 1, including on RunPod, with ongoing development aimed at improving accessibility and functionality.
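For readers who want to try the single-GPU, memory-constrained path outside of ComfyUI, here is a minimal sketch using the Hugging Face diffusers `MochiPipeline`. This is an illustrative alternative to the workflow described above, not the workflow itself; the `enable_vae_tiling()` call applies the same tiled-decoding tradeoff mentioned earlier, and exact API details may vary with your diffusers version.

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load the Mochi 1 preview weights in bfloat16 to roughly halve memory use.
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", torch_dtype=torch.bfloat16
)

# Trade speed for memory on a single GPU: offload idle submodules to CPU,
# and decode the VAE in tiles (the tiling tradeoff noted above, which can
# introduce minor quality artifacts at tile boundaries).
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A red panda climbing a snowy pine tree, cinematic lighting"

# 85 frames at 30 fps is roughly a 2.8-second clip; Mochi 1 supports
# clips up to about 5.4 seconds.
frames = pipe(prompt, num_frames=85).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```

On hardware with less VRAM than an H100, the CPU offload and VAE tiling flags above are what make the run feasible at all, at the cost of slower generation.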