Qwen3-VL, the latest and most capable vision-language model in the Qwen series, is now available through Ollama's cloud platform, with local availability planned. Its capabilities include:

- Visual agent functions for operating GUIs
- Visual coding: generating code from images or videos
- Advanced spatial perception with 2D and 3D grounding
- Long-context video understanding, with a native 256K context window expandable to 1M tokens
- Enhanced multimodal reasoning, particularly in STEM domains
- Upgraded visual recognition across diverse objects and languages
- Expanded OCR supporting 32 languages
- Improved text understanding that is on par with pure language models

Users can interact with the model through Ollama's CLI, API, and JavaScript/Python libraries, and Ollama also exposes OpenAI-compatible API endpoints for drop-in integration with existing tooling.
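As a rough sketch of how a multimodal request to Ollama's chat API is shaped, the snippet below builds a request body for the `/api/chat` endpoint, where images are attached as base64 strings on a message. The model tag `qwen3-vl` and the commented-out local endpoint are assumptions; check Ollama's model library for the exact tag available to you.

```python
import json

# Assumed model tag; verify the exact name with `ollama list` or Ollama's model library.
MODEL = "qwen3-vl"

def build_chat_payload(prompt: str, image_b64: str) -> dict:
    """Build a request body for Ollama's /api/chat endpoint.

    Ollama's chat API accepts images as a list of base64-encoded
    strings in the message's `images` field.
    """
    return {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]},
        ],
        "stream": False,
    }

payload = build_chat_payload("Describe this screenshot.", "<base64-image-data>")
print(json.dumps(payload, indent=2))

# To actually send it (requires a reachable Ollama server; host shown is the
# default local address, cloud usage will differ):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

The same conversation shape works through the official `ollama` Python library or, with minor renaming, through an OpenAI-compatible client pointed at Ollama's endpoint.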