From RMSProp to AdamW: The Optimizer Evolution Story
The progression from RMSProp to AdamW shows how three specific training problems were solved, one after another.
Each optimizer built on the previous one while addressing a particular issue that emerged in large-scale neural network training.
Different parameters need different learning rates
Training deep networks with SGD runs into a fundamental issue: different parameter types have vastly different gradient magnitudes. Token embeddings might have gradients around 0.1 while deep attention weights have gradients around 0.001. With a single learning rate, you either overshoot the large gradients or barely update the small ones.
RMSProp solved this by scaling each parameter's learning rate based on its recent gradient history. Parameters with large gradients automatically get smaller effective learning rates, while parameters with small gradients get larger ones. This made it possible to train networks with heterogeneous parameter types without manually tuning learning rates for different layers.
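Here is a minimal sketch of the standard RMSProp update on scalar parameters. The gradient values (0.1 and 0.001) are just toy numbers echoing the example above, and the hyperparameter defaults are common choices, not settings from the post:

```python
import math

def rmsprop_step(param, grad, sq_avg, lr=1e-3, alpha=0.99, eps=1e-8):
    # Running average of squared gradients; recent history dominates.
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    # Divide by its root: large-gradient parameters get smaller effective steps.
    param = param - lr * grad / (math.sqrt(sq_avg) + eps)
    return param, sq_avg

# Toy scalars: an "embedding" with gradients around 0.1 and an
# "attention weight" with gradients around 0.001 (hypothetical values).
p_embed, s_embed = 0.0, 0.0
p_attn, s_attn = 0.0, 0.0
for _ in range(100):
    p_embed, s_embed = rmsprop_step(p_embed, 0.1, s_embed)
    p_attn, s_attn = rmsprop_step(p_attn, 0.001, s_attn)

print(p_embed, p_attn)  # both parameters move a comparable distance
```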
Mini-batch gradients are noisy
RMSProp handled the scaling problem, but training with small batches creates noisy gradient estimates. These estimates can vary significantly from the true gradient, causing optimization to bounce around instead of moving consistently toward the minimum.
Adam added momentum to smooth this out. RMSProp already tracked a running average of squared gradients for adaptive scaling; Adam layered an exponential moving average of the gradients themselves on top for directional consistency. Adam also includes bias correction to account for the fact that both moving averages start at zero and build up slowly.
This combination filters out much of the mini-batch noise while preserving the adaptive benefits of RMSProp.
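A scalar sketch of the standard Adam update shows how the two moving averages and the bias correction fit together (hyperparameter defaults are the usual ones; the constant gradient in the usage loop is a stand-in for noisy mini-batch gradients):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponential moving average of gradients (direction smoothing).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (scaling).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early estimates are
    # biased low; dividing by (1 - beta**t) compensates.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage: t counts steps starting from 1 so the bias correction is well-defined.
param, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 0.05  # stand-in for a noisy mini-batch gradient
    param, m, v = adam_step(param, grad, m, v, t)
```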
Weight decay was getting corrupted
Adam worked well for optimization, but it had an unintended interaction with L2 regularization. When weight decay is implemented as an L2 penalty on the loss, it contributes to the gradient and gets processed through Adam's adaptive machinery along with the actual gradients.
This means weight decay strength becomes parameter-dependent. Parameters with large gradient histories get less weight decay, while parameters with small gradient histories get more. The regularization behavior becomes unpredictable and tied to optimization dynamics.
AdamW fixed this by applying weight decay directly to parameters, separate from the gradient-based updates. This restored uniform regularization across all parameters regardless of their gradient statistics.
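A sketch of the decoupled update, reusing the Adam step above, makes the difference concrete. This follows the common ordering where decay is applied directly to the parameter before the gradient-based step; the `weight_decay` default is illustrative:

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Decoupled weight decay: shrink the parameter directly, so decay strength
    # is uniform and independent of the gradient statistics in m and v.
    param = param - lr * weight_decay * param
    # The rest is plain Adam on the raw gradient (no decay term mixed in).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# The coupled version would instead fold decay into the gradient:
#   grad = grad + weight_decay * param
# which then gets divided by sqrt(v_hat), weakening the effective decay
# for parameters with large gradient histories.
```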
How problems drive solutions
Each optimizer solved the most pressing remaining issue with neural network training at the time. RMSProp handled parameter scaling differences. Adam added noise robustness through momentum and bias-corrected moment estimates. AdamW cleaned up the regularization interaction.
Understanding these design motivations helps explain when each optimizer works best and why AdamW became the default choice for many applications. The progression also shows how optimization algorithms evolve by addressing specific, observable training problems rather than pursuing abstract mathematical improvements.
The full post builds up intuition about the desired optimizer properties using controlled experiments and visualizations: https://naskovai.github.io/posts/from-rmsprop-to-adamw-the-optimizer-evolution-story/