
Muon's Adjusted LR is different from the description in the paper #371


Closed
hatonosuke opened this issue Apr 20, 2025 · 4 comments · Fixed by #373
Assignees
Labels
bug Something isn't working

Comments

@hatonosuke

Describe the bug

In this repository's Muon implementation, max(1, A/B)**0.5 is applied as part of the learning rate regardless of whether "use_adjusted_lr" is True or False.

lr: float = self.adjust_lr_for_muon(group['lr'], p.size()) if group['use_adjusted_lr'] else group['lr']
p.add_(g, alpha=-lr * (max(1.0, p.size(-2) / p.size(-1)) ** 0.5))
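Expanded as a standalone sketch, the branch above effectively computes the following. Note that `adjust_lr_for_muon` here is a placeholder whose formula is an assumption for illustration, not the repository's actual implementation:

```python
import math


def adjust_lr_for_muon(lr, rows, cols):
    # Placeholder for the Moonlight-style adjusted LR; the exact
    # formula used by the repository is an assumption here.
    return 0.2 * lr * math.sqrt(max(rows, cols))


def step_size_as_reported(lr, rows, cols, use_adjusted_lr):
    # Reported behavior: sqrt(max(1, A/B)) multiplies the learning
    # rate in BOTH branches, even when the Moonlight-style
    # adjustment has already been applied.
    base = adjust_lr_for_muon(lr, rows, cols) if use_adjusted_lr else lr
    return base * max(1.0, rows / cols) ** 0.5
```

With `use_adjusted_lr=True` this yields the adjusted LR times an extra sqrt(max(1, A/B)) factor, which is the double scaling the issue reports.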

But max(1, A/B)**0.5 is not used in the paper.
And in the footnote on page 4 of the paper, there is the following:

K. Jordan et al. 2024’s original implementation scales the updates by sqrt(max(1, A/B)), which is equivalent to our proposal (up to a global scale) if all matrices have the same second dimension; Pethick et al. 2025 and You 2025 discussed a similar issue on update scaling factors concurrently to our work.

Therefore, when "use_adjusted_lr" is True, I think max(1, A/B)**0.5 should not be used.

The paper I referenced is:

https://arxiv.org/abs/2502.16982

@hatonosuke hatonosuke added the bug Something isn't working label Apr 20, 2025
@kozistr
Owner

kozistr commented Apr 22, 2025

@hatonosuke Hi! Thanks for pointing this out.

Yes, the use_adjusted_lr variant is not from the original paper; it's from the Moonlight optimizer, which is a variant of the Muon optimizer.

You can also check the use_adjusted_lr parameter description in the Muon optimizer here.

@hatonosuke
Author

Hello. Thank you for your reply.

I checked the implementation in the link, and it is implemented as described in the paper.
sqrt(max(1.0, B/A)) is not used as part of the learning rate.

@kozistr
Owner

kozistr commented Apr 22, 2025

Hello. Thank you for your reply.

I checked the implementation in the link, and it is implemented as described in the paper. sqrt(max(1.0, B/A)) is not used as part of the learning rate.

Oh, sorry for the confusion. I misunderstood that part. You're right: sqrt(max(1.0, B/A)) should not be used when use_adjusted_lr is True. I'll work on that. Thanks for pointing this out :)
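The fix described above can be sketched as follows; the two scalings become mutually exclusive. As before, `adjust_lr_for_muon` is a hypothetical stand-in whose formula is an assumption, not the repository's actual code:

```python
import math


def adjust_lr_for_muon(lr, rows, cols):
    # Placeholder for the Moonlight-style adjusted LR; the exact
    # formula used by the repository is an assumption here.
    return 0.2 * lr * math.sqrt(max(rows, cols))


def step_size_fixed(lr, rows, cols, use_adjusted_lr):
    # Proposed fix: apply sqrt(max(1, A/B)) only when the
    # Moonlight-style adjustment is NOT in use, so the adjusted-LR
    # path is no longer double-scaled.
    if use_adjusted_lr:
        return adjust_lr_for_muon(lr, rows, cols)
    return lr * max(1.0, rows / cols) ** 0.5
```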

@hatonosuke
Author

Thank you for your fix.
