Holotron-12B: High-Throughput Computer-Use Agent
Blog post from Hugging Face
Holotron-12B is a multimodal computer-use model developed by H Company. It is post-trained from NVIDIA's Nemotron-Nano-2 VL model and designed for high-throughput serving in interactive environments. Its hybrid architecture combines State-Space Model (SSM) layers with attention to keep inference efficient and scalable. This particularly benefits agentic workloads with long interaction histories and many images.

Holotron-12B shows significant performance improvements over its predecessors, including Holo2-8B. It performs strongly on benchmarks such as WebVoyager, and on a single H100 GPU it delivers higher throughput and better VRAM utilization. It was trained with supervised fine-tuning on proprietary data, with notable gains in localization, grounding, and UI-level interaction.

The model is available on Hugging Face under the NVIDIA Open Model License. Looking ahead, the upcoming Nemotron 3 Omni aims to further improve reasoning and multimodal precision for large-scale autonomous applications.
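Why does a hybrid SSM/attention design help with long histories? An attention layer caches keys and values for every past token, so its memory grows linearly with context length. An SSM layer keeps a fixed-size recurrent state no matter how long the history is. The back-of-the-envelope sketch below compares per-sequence cache memory for a pure-attention stack and a hybrid stack. All layer counts, head sizes, and state sizes here are hypothetical and are not Holotron-12B's or Nemotron-Nano-2 VL's real configuration. The sketch also ignores model weights and activations, which a real serving budget must include.

```python
from dataclasses import dataclass

BYTES_PER_ELEM = 2  # assume bf16/fp16 cache entries
GIB = 1024 ** 3


@dataclass(frozen=True)
class LayerMix:
    """Hypothetical decoder layout; numbers are illustrative, not Holotron-12B's."""
    n_attn_layers: int
    n_ssm_layers: int
    n_kv_heads: int = 8
    head_dim: int = 128
    ssm_state_elems: int = 64 * 64 * 128  # hypothetical heads x head_dim x state size, per SSM layer


def cache_bytes(mix: LayerMix, seq_len: int) -> int:
    """Per-sequence inference memory: attention KV cache plus SSM recurrent state."""
    # Each attention layer stores K and V for every token, so this term grows with seq_len.
    kv = 2 * mix.n_attn_layers * seq_len * mix.n_kv_heads * mix.head_dim * BYTES_PER_ELEM
    # Each SSM layer keeps a fixed-size state regardless of history length.
    ssm = mix.n_ssm_layers * mix.ssm_state_elems * BYTES_PER_ELEM
    return kv + ssm


def max_concurrent(mix: LayerMix, seq_len: int, budget_bytes: int) -> int:
    """How many sequences of seq_len fit in a cache budget (weights/activations ignored)."""
    return budget_bytes // cache_bytes(mix, seq_len)


pure_attention = LayerMix(n_attn_layers=40, n_ssm_layers=0)
hybrid = LayerMix(n_attn_layers=4, n_ssm_layers=36)

for seq_len in (4_096, 32_768):
    for name, mix in (("pure attention", pure_attention), ("hybrid", hybrid)):
        print(f"{name:>14} @ {seq_len:>6} tokens: "
              f"{cache_bytes(mix, seq_len) / GIB:.2f} GiB/seq, "
              f"{max_concurrent(mix, seq_len, 40 * GIB)} seqs in 40 GiB")
```

Under these made-up numbers, the hybrid layout fits roughly nine times as many 32K-token sequences into the same cache budget. That is the kind of headroom that raises serving throughput when many agent sessions with long, image-heavy histories run concurrently. The actual ratio for Holotron-12B depends on its real layer layout.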