*=Equal Contributors
Preserving training dynamics across batch sizes is an important tool for practical machine learning, as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6× wall-clock time reduction under idealized hardware settings.
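For illustration, the following is a minimal Python sketch of a model EMA update together with the batch-size scaling described above: the learning rate is scaled linearly with the batch size, and the EMA momentum is exponentiated by the same scaling factor. The baseline hyperparameter values and the helper function name are hypothetical and chosen only for this example.

import copy
import torch

def ema_update(ema_model, target_model, momentum):
    # Move EMA parameters towards the target model's parameters.
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), target_model.parameters()):
            p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Hypothetical baseline hyperparameters (for illustration only).
base_batch_size = 256
base_lr = 0.1
base_momentum = 0.996

# Scale the batch size by a factor kappa.
kappa = 4
batch_size = kappa * base_batch_size

# Linear scaling rule for SGD: scale the learning rate with the batch size.
lr = kappa * base_lr

# EMA scaling rule: exponentiate the momentum by the same factor, so the EMA
# moves a comparable amount per unit of data at the larger batch size.
momentum = base_momentum ** kappa

# Example usage with a toy model.
target_model = torch.nn.Linear(8, 2)
ema_model = copy.deepcopy(target_model)
ema_update(ema_model, target_model, momentum)

Applying both rules together keeps the relative speed of the target model and its EMA unchanged as the batch size grows, which is the condition the paper identifies for matching training dynamics.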