Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
Blog post from Hugging Face
NVIDIA Cosmos Predict 2.5 is a world model that generates realistic videos from text, image, or video prompts, and it can be fine-tuned for specific domains such as robot manipulation. Fully fine-tuning a model of this size is resource-intensive, so the guide uses LoRA and DoRA instead: both inject small, trainable adapter modules into a frozen base model. Because only the adapters are updated, fine-tuning becomes feasible on a single GPU, and the base model keeps its general knowledge.
The fine-tuned model can generate synthetic robot trajectories, which are useful for training robot policies without the high cost of collecting real-world data. The guide walks through the parameter-efficient fine-tuning process with the diffusers and accelerate libraries, covers both the LoRA and DoRA variants, and evaluates the results on physical plausibility and instruction-following metrics.
In the reported experiments, fine-tuning for 100 epochs on 8 H100 GPUs significantly improves video generation quality in terms of temporal stability, geometric consistency, and task completion. The choice between LoRA and DoRA depends on memory and stability requirements.
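To make the adapter idea concrete, here is a minimal sketch in plain Python of how LoRA and DoRA modify a single frozen weight matrix. It is not the code from the blog post and does not use diffusers or accelerate. The helper names, the toy matrices, and the 4096-wide layer in the usage note are assumptions chosen for illustration. LoRA is written as W + (alpha / r) * B A with only A and B trainable. DoRA is written in the column-wise form from the DoRA paper, where each column of the updated weight is normalized and rescaled by a trainable magnitude. Library implementations may lay out the norm differently.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def lora_weight(w, a, b, alpha):
    """Effective LoRA weight W + (alpha / r) * B @ A; W itself stays frozen."""
    r = len(a)  # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def dora_weight(w, a, b, alpha, magnitude):
    """Effective DoRA weight: the LoRA-updated matrix supplies the direction
    of each column, and a separate trainable vector supplies its magnitude."""
    v = lora_weight(w, a, b, alpha)
    norms = column_norms(v)
    return [[magnitude[j] * v_ij / norms[j] for j, v_ij in enumerate(row)]
            for row in v]


def trainable_params(d_out, d_in, r, dora=False):
    """Trainable parameters for one adapted layer (A is r x d_in, B is d_out x r).
    DoRA adds one magnitude entry per column of W."""
    n = r * (d_in + d_out)
    return n + d_in if dora else n


# Toy frozen weight (d_out = 2, d_in = 2) and a rank-1 adapter. These are
# illustrative values only.
W = [[3.0, 0.0],
     [4.0, 2.0]]
A = [[0.1, -0.2]]        # r x d_in, small initial values
B = [[0.0], [0.0]]       # d_out x r, zero-initialized so training starts at W
M = column_norms(W)      # DoRA magnitude initialized to the column norms of W

W_lora = lora_weight(W, A, B, alpha=2.0)
W_dora = dora_weight(W, A, B, alpha=2.0, magnitude=M)
```

With B initialized to zero, both W_lora and W_dora equal W, so fine-tuning starts from the base model's behavior. For a hypothetical 4096 x 4096 layer at rank 16, trainable_params(4096, 4096, 16) gives 131072 trainable values against 16777216 in the full matrix, and DoRA adds 4096 magnitude entries on top of that. This small trainable footprint is why adapter fine-tuning can fit on a single GPU.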