Every modality will browse.
Blog post from Browserbase
Browser environments are becoming essential platforms for training and evaluating computer-use agents (CUAs) on multimodal tasks, because live web interactions are dynamic and unpredictable in ways that force agents to learn and adapt. In a browser, an agent can take in and produce vision, audio, and text, and it meets a wide range of states and scenarios that demand complex decision-making, which gives it a fuller picture of the real world. Unlike a static recording, a live browser offers a richer observation space: the agent experiences the consequences of its actions and can recover from mistakes as they happen, which makes it more reliable and efficient. The supporting infrastructure must run many browser sessions concurrently so that specialized agents can collaborate, each in an isolated workspace with its own credentials and environment. The approach improves multimodal learning as well as the system's reaction time and inference speed, paving the way for scalable, interactive systems in which agents observe, act, fail, and recover on a continuously evolving web.
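To make the observe, act, fail, and recover loop and the isolated concurrent sessions more concrete, here is a minimal sketch in Python. It is not Browserbase's API or any real browser driver: `Workspace`, `FakeBrowser`, `run_agent`, the credential strings, and the scripted failure are all hypothetical stand-ins chosen for illustration, and only the standard library is used.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical isolated workspace: one per agent, nothing shared."""
    agent: str
    credential: str  # placeholder value, not a real secret
    log: list = field(default_factory=list)


class FakeBrowser:
    """Stand-in for a live browser session that fails on scripted steps."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.step = 0

    async def observe(self):
        await asyncio.sleep(0)  # yield control, as real I/O would
        return f"state-{self.step}"

    async def act(self, action):
        await asyncio.sleep(0)
        self.step += 1
        if self.step in self.fail_on:
            raise RuntimeError(f"{action} failed at step {self.step}")
        return "ok"


async def run_agent(ws, browser, actions, max_retries=2):
    """Observe, act, and on failure re-observe and retry the same action."""
    for action in actions:
        for _ in range(max_retries + 1):
            state = await browser.observe()
            try:
                await browser.act(action)
                ws.log.append((state, action, "ok"))
                break
            except RuntimeError:
                ws.log.append((state, action, "retry"))
        else:
            # Every attempt failed: record it and stop this agent.
            ws.log.append((state, action, "gave up"))
            return False
    return True


async def main():
    # Two specialized agents, each with its own workspace and session.
    workspaces = [Workspace("vision", "cred-a"), Workspace("audio", "cred-b")]
    browsers = [FakeBrowser(fail_on=[2]), FakeBrowser(fail_on=[])]
    actions = ["open", "click", "submit"]
    results = await asyncio.gather(
        *(run_agent(ws, br, actions) for ws, br in zip(workspaces, browsers))
    )
    return workspaces, results


if __name__ == "__main__":
    workspaces, results = asyncio.run(main())
    for ws, ok in zip(workspaces, results):
        print(ws.agent, ok, len(ws.log))
```

In this sketch the first agent hits a scripted failure on its second step, observes the new state, and retries, while the second agent proceeds undisturbed in its own workspace. A real deployment would replace `FakeBrowser` with an actual browser session and would likely need a richer recovery policy than a fixed retry count.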