Backbone-Optimizer Coupling Bias: The Hidden Co-Design Principle
Blog post from HuggingFace
Backbone-Optimizer Coupling Bias (BOCB) challenges the traditional view of neural network architectures and optimizers as separate entities, arguing that they are inherently interconnected within the learning process. This view is grounded in the Nested Learning framework, which treats both architectures and optimizers as nested associative memory systems that shape each other throughout training.

BOCB posits that the inductive bias of an architecture and the dynamical bias of its optimizer must be co-designed to ensure optimal learning dynamics, stability, and generalization. The pairing of Transformers with adaptive optimizers such as AdamW exemplifies this synergy: the adaptive, per-parameter updates compensate for the architectural heterogeneity of Transformers in a way that classical methods such as SGD(M) cannot.

This perspective advocates a paradigm shift toward an integrated co-design philosophy in which the architecture and optimizer are jointly optimized as a single coupled dynamical system, yielding more efficient and adaptable neural learning systems. The framework introduces principles for aligning the primal geometry of architectures with the dual dynamics of optimizers, emphasizing consistency across training phases so that the geometric integrity of learned models is maintained.
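The Transformer/AdamW pairing can be illustrated with a toy experiment (a minimal sketch in plain Python, not the post's own method): a quadratic loss whose two coordinates have very different curvature stands in for an architecturally heterogeneous backbone. SGD must choose one global learning rate small enough for the sharpest direction, which stalls the flat one, while AdamW's per-parameter second-moment estimates rescale each coordinate. The loss, the `grads_of` helper, and all hyperparameters here are illustrative assumptions.

```python
import math

def sgd_step(params, grads, lr):
    # Plain SGD: one global learning rate shared by every parameter.
    for i, g in enumerate(grads):
        params[i] -= lr * g

def adamw_step(params, grads, m, v, t, lr,
               b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    # AdamW: per-parameter step sizes from running gradient statistics,
    # with weight decay decoupled from the gradient (wd=0 in this toy run).
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        mhat = m[i] / (1 - b1 ** t)   # bias-corrected first moment
        vhat = v[i] / (1 - b2 ** t)   # bias-corrected second moment
        params[i] -= lr * (mhat / (math.sqrt(vhat) + eps) + wd * params[i])

# Toy loss 0.5 * (100 * x0**2 + x1**2): two parameter groups with very
# different curvature, a stand-in for a heterogeneous backbone.
def grads_of(p):
    return [100.0 * p[0], p[1]]

p_sgd = [1.0, 1.0]
p_adw = [1.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 51):
    # SGD's lr is capped by the sharpest direction (lr * 100 < 2 for
    # stability), so the flat direction x1 barely moves.
    sgd_step(p_sgd, grads_of(p_sgd), lr=0.015)
    # AdamW rescales each coordinate by sqrt(vhat), absorbing the
    # curvature mismatch under a single nominal lr.
    adamw_step(p_adw, grads_of(p_adw), m, v, t, lr=0.1)
```

After 50 steps the SGD run has collapsed the sharp coordinate but left the flat one far from zero, whereas the AdamW run makes progress on both; its second-moment buffer `v` ends up orders of magnitude larger for the sharp coordinate, which is exactly the per-parameter rescaling the post credits for the Transformer/AdamW fit.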