Key research and product announcements at the AI Native Conf
Blog post from Together AI
Together Research announced several advancements at the AI Native Conf:

- FlashAttention-4: a kernel co-design that significantly speeds up large language models on NVIDIA GPUs, delivering faster processing at lower cost.
- Megakernel: an implementation tailored for real-time voice agents that optimizes the entire model in a single kernel, dramatically improving performance.
- together.compile: automates kernel optimization, boosting production efficiency for video models.
- Reinforcement Learning API: gives teams control over RL training configurations while improving rollout efficiency.
- ThunderAgent: treats agentic workflows as cohesive scheduling units, yielding substantial throughput improvements.
- ATLAS-2: a speculative decoding method that continuously updates speculator models in real time, maintaining performance as traffic patterns shift.
- Cache-aware prefill-decode disaggregation (CPD): optimizes long-context inference, achieving higher throughput through effective cache management.

Together's approach emphasizes the synergistic relationship between research and production, aiming to expand AI infrastructure capabilities for demanding applications.
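To make the speculative decoding idea behind ATLAS-2 concrete, here is a minimal toy sketch of the general draft-then-verify loop: a cheap speculator proposes a block of tokens and the target model verifies them, accepting the longest correct prefix. The models here are hypothetical stand-ins (simple arithmetic functions), not Together's actual ATLAS-2 implementation, and the sketch omits the real-time speculator updating that ATLAS-2 adds.

```python
# Toy sketch of speculative decoding. Both "models" are illustrative
# placeholders, not real language models.

def target_next_token(context):
    # Deterministic toy target model: next token is sum of context mod 7.
    return sum(context) % 7

def speculator_next_token(context):
    # Toy speculator that agrees with the target most of the time.
    return sum(context) % 7 if len(context) % 3 else (sum(context) + 1) % 7

def speculative_decode(context, num_tokens, draft_len=4):
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1. The speculator cheaply drafts a block of tokens.
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            t = speculator_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The target verifies the draft, accepting the longest
        #    matching prefix; on a mismatch it emits the corrected
        #    token, so each round always makes progress.
        ctx = list(out)
        for t in draft:
            expected = target_next_token(ctx)
            if t == expected:
                out.append(t)
                ctx.append(t)
            else:
                out.append(expected)
                break
    return out[len(context) : len(context) + num_tokens]

print(speculative_decode([1, 2, 3], 8))
```

Because every draft token is checked against the target model, the output is identical to decoding with the target alone; the speedup in a real system comes from verifying the whole draft block in one parallel forward pass.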
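One ingredient of cache-aware scheduling can be sketched in a few lines: route each incoming request to the worker whose KV cache already holds the longest matching token prefix, so less prefill work is redone. This is a hypothetical illustration of the general idea, not Together's CPD implementation; the worker caches and routing policy here are invented for the example.

```python
# Toy sketch of cache-aware request routing: pick the prefill worker
# whose cached token prefix overlaps the request the most.

def common_prefix_len(a, b):
    # Length of the shared leading run of tokens between two sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    # Return (worker index, overlap length); ties go to the
    # lower-indexed worker.
    best, best_overlap = 0, -1
    for i, cache in enumerate(worker_caches):
        overlap = common_prefix_len(request_tokens, cache)
        if overlap > best_overlap:
            best, best_overlap = i, overlap
    return best, best_overlap

caches = [[1, 2, 3, 4], [1, 2, 9], []]
print(route([1, 2, 3, 7], caches))  # worker 0 shares the prefix [1, 2, 3]
```

A real disaggregated system layers much more on top (separate prefill and decode pools, cache transfer, eviction), but the routing decision above captures why cache awareness raises long-context throughput: tokens already cached never need to be prefetched or recomputed.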