But max(1, A/B)**0.5 is not used in the paper.
In the footnote on page 4 of the paper, there is the following:
"K. Jordan et al. 2024's original implementation scales the updates by sqrt(max(1, A/B)), which is equivalent to our proposal (up to a global scale) if all matrices have the same second dimension; Pethick et al. 2025 and You 2025 discussed a similar issue on update scaling factors concurrently to our work."
Therefore, when "use_adjusted_lr" is True, I think max(1, A/B)**0.5 should not be used.
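To make the footnote concrete, here is a small sketch of the two scaling rules (the function names are mine, and the 0.2 * sqrt(max(A, B)) factor is the paper's adjusted rule as I understand it): when every matrix has the same second dimension B, the two scales differ only by the constant 0.2 * sqrt(B), which is what "equivalent up to a global scale" means.

```python
import math

def jordan_scale(a: int, b: int) -> float:
    # K. Jordan et al. 2024: scale updates by sqrt(max(1, A/B))
    return math.sqrt(max(1.0, a / b))

def paper_adjusted_scale(a: int, b: int) -> float:
    # arXiv:2502.16982 (as I read it): scale updates by 0.2 * sqrt(max(A, B))
    return 0.2 * math.sqrt(max(a, b))

# Footnote on page 4: with a fixed second dimension B, the two rules
# agree up to a global constant of 0.2 * sqrt(B).
B = 1024
ratios = [paper_adjusted_scale(a, B) / jordan_scale(a, B)
          for a in (256, 1024, 4096)]
# each ratio equals 0.2 * sqrt(B) = 6.4 (up to float rounding)
```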
I checked the implementation at the link, and it is implemented as described in the paper: sqrt(max(1.0, B/A)) is not used as part of the learning rate.
Oh, sorry for the confusion. I misunderstood that part. You're right: sqrt(max(1.0, B/A)) should not be used when use_adjusted_lr is True. I'll work on that. Thanks for pointing this out :)
Describe the bug
In the Muon implementation in this repository, max(1, A/B)**0.5 is used as part of the learning rate whether "use_adjusted_lr" is True or False.
pytorch_optimizer/pytorch_optimizer/optimizer/muon.py
Lines 205 to 207 in d4e7564
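A minimal sketch of the behavior I am proposing (illustrative only, not the repository's actual code; update_scale is a hypothetical helper, and the 0.2 * sqrt(max(A, B)) factor is the paper's adjusted rule as I understand it): when use_adjusted_lr is True, apply the paper's adjusted factor alone, without also multiplying by sqrt(max(1, A/B)).

```python
import math

def update_scale(a: int, b: int, lr: float, use_adjusted_lr: bool) -> float:
    """Hypothetical sketch of the scaling the issue proposes for
    pytorch_optimizer/optimizer/muon.py (not the actual code)."""
    if use_adjusted_lr:
        # Paper's rule only; do NOT also multiply by sqrt(max(1, A/B)).
        return lr * 0.2 * math.sqrt(max(a, b))
    # Otherwise keep Jordan et al. 2024's original scaling.
    return lr * math.sqrt(max(1.0, a / b))
```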
The paper I referenced is:
https://arxiv.org/abs/2502.16982