
Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

Blog post from HuggingFace

Post Details
Author
Ting-Yun Chang, Miguel Martin, Jonathan Allen, Ke Ding, and Pooya Jannaty
Word Count
2,653
Summary

NVIDIA Cosmos Predict 2.5 is a world model that generates realistic video from text, image, or video prompts, and it can be fine-tuned for specific domains such as robot manipulation. Because fully fine-tuning a model of this size is resource-intensive, the post uses LoRA and DoRA, which inject small trainable adapter modules into a frozen base model. With these methods the model can be fine-tuned on a single GPU while retaining its general knowledge. The fine-tuned model can then generate synthetic robot trajectories, which are useful for training robot policies without the high cost of collecting real-world data.

The guide walks through parameter-efficient fine-tuning with the diffusers and accelerate libraries, covers both the LoRA and DoRA implementations, and evaluates the results on physical plausibility and instruction-following metrics. The authors report that fine-tuning for 100 epochs on 8 H100 GPUs significantly improves video generation quality in temporal stability, geometric consistency, and task completion, and that LoRA and DoRA offer different trade-offs depending on memory and stability requirements.
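
To make the efficiency argument in the summary concrete, the sketch below counts trainable parameters for a single linear layer under full fine-tuning, LoRA, and DoRA. It is an illustration under stated assumptions, not the post's actual configuration: the layer size (4096 x 4096) and the rank (16) are hypothetical, and the DoRA count assumes one magnitude value per output feature, which is one common convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one frozen linear layer (hypothetical sizes, for illustration)."""
    d_in: int
    d_out: int


def full_params(layer: LinearShape) -> int:
    # Full fine-tuning updates every entry of the d_out x d_in weight matrix.
    return layer.d_in * layer.d_out


def lora_params(layer: LinearShape, rank: int) -> int:
    # LoRA trains two low-rank factors: A (rank x d_in) and B (d_out x rank).
    # The base weight stays frozen.
    return rank * (layer.d_in + layer.d_out)


def dora_params(layer: LinearShape, rank: int) -> int:
    # DoRA trains the same low-rank factors plus a magnitude vector.
    # Assumption: one magnitude entry per output feature.
    return lora_params(layer, rank) + layer.d_out


if __name__ == "__main__":
    layer = LinearShape(d_in=4096, d_out=4096)  # hypothetical layer size
    rank = 16                                   # hypothetical adapter rank

    full = full_params(layer)
    lora = lora_params(layer, rank)
    dora = dora_params(layer, rank)

    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (r={rank}):   {lora:,} ({lora / full:.2%} of full)")
    print(f"DoRA (r={rank}):   {dora:,} ({dora / full:.2%} of full)")
```

Under these assumed sizes, both adapters train well under 1% of the layer's parameters, and DoRA adds only a small magnitude vector on top of LoRA. Real memory savings also depend on optimizer state, activations, and which layers receive adapters, none of which this count models.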