Company: Ollama
Date Published: -
Author: -
Word count: 1447
Language: -
Hacker News points: None

Summary

Ollama has introduced a new engine to support multimodal models, beginning with vision models such as Meta's Llama 4 and Google's Gemma 3, which improve general multimodal understanding and reasoning. The engine enables sophisticated interaction with visual data, such as interpreting images and answering questions about them, using models like Llama 4 Scout, a mixture-of-experts model with 109 billion parameters. Ollama emphasizes model modularity, so each model can operate independently, which improves reliability and eases integration for developers, and it focuses on accuracy and memory management through techniques such as image caching and optimized KV cache usage. Collaborations with hardware manufacturers and software partners aim to improve memory efficiency and support longer context sizes, making multimodal workloads more robust. This work lays the foundation for future modalities such as speech and video generation, with contributions from partners like GGML and major technology companies such as NVIDIA and Microsoft.
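
As a rough sketch of the image-plus-question interaction the summary describes, the snippet below uses the official ollama Python client. The model tag (llama4:scout), image path, and prompt are illustrative assumptions rather than details taken from the post, and the example presumes a local Ollama server with the model already pulled.

```python
import ollama

# Minimal sketch: ask a vision-capable model about a local image.
# Assumes the Ollama server is running locally and the model has been
# pulled beforehand (e.g. `ollama pull llama4:scout`); the image path
# is a placeholder.
response = ollama.chat(
    model="llama4:scout",
    messages=[
        {
            "role": "user",
            "content": "What is happening in this image?",
            # The Python client reads the file and attaches it to the
            # request; raw bytes or base64 strings are also accepted.
            "images": ["./photo.png"],
        }
    ],
)

print(response["message"]["content"])
```

The same request can also be issued against Ollama's local REST API by sending the image as a base64 string in the message's images field.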