Company: -
Date Published: -
Author: -
Word count: 672
Language: -
Hacker News points: None

Summary

Researchers from Stanford's Hazy Research lab have developed a method for offloading a substantial portion of large language model (LLM) workloads to consumer devices by having small on-device models collaborate with larger cloud-based models. The approach, detailed in a new paper with accompanying open-source code, aims to cut cloud costs with little or no loss in quality through two protocols. In Minion, the cloud model converses with a single local model until they converge on a solution, achieving a 30.4x reduction in cloud costs while retaining 87% of the cloud model's performance. In MinionS, the cloud model decomposes the task into smaller subtasks that the small on-device models solve in parallel, achieving a 5.7x cost reduction while retaining 97.9% of performance. The project, which pairs on-device models such as Llama 3.2 with cloud models such as GPT-4o, lets users explore both protocols through a demo app and provides instructions for setting up and running them programmatically in Python.
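To give a sense of what running the protocol programmatically looks like, here is a minimal Python sketch modeled on the project's open-source repository. The import paths, class names (OllamaClient, OpenAIClient, Minion), and the max_rounds parameter are assumptions drawn from the repo's published examples and may differ between versions; the task and context strings are placeholders.

```python
# Minimal sketch of the Minion protocol, assuming the interfaces published
# in the project's open-source repo. Names and parameters are illustrative
# and may not match every release.
from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minion import Minion

# Small on-device model, served locally (e.g. via Ollama).
local_client = OllamaClient(model_name="llama3.2")

# Large cloud model (expects an API key such as OPENAI_API_KEY in the env).
remote_client = OpenAIClient(model_name="gpt-4o")

# Minion: the cloud model converses with the single local model, which
# holds the long context on-device, until they converge on an answer.
minion = Minion(local_client, remote_client)

context = "..."  # placeholder: the long document the local model reads
task = "Summarize the key findings and list any open questions."

output = minion(task=task, context=[context], max_rounds=2)
print(output["final_answer"])  # assumed result key, per the repo's examples
```

The MinionS protocol is exposed through an analogous class that fans the task out into parallel subtasks for the local model before the cloud model aggregates the results; the trade-off is the one the paper quantifies, with Minion cutting costs more aggressively and MinionS recovering more of the cloud model's quality.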