
Muon's Adjusted LR is different from the description in the paper #371


Closed
hatonosuke opened this issue Apr 20, 2025 · 4 comments · Fixed by #373
Assignees
Labels
bug Something isn't working

Comments

@hatonosuke

Describe the bug

In this repository's Muon implementation, max(1, A/B)**0.5 is applied as part of the learning rate regardless of whether "use_adjusted_lr" is True or False.

lr: float = self.adjust_lr_for_muon(group['lr'], p.size()) if group['use_adjusted_lr'] else group['lr']
p.add_(g, alpha=-lr * (max(1.0, p.size(-2) / p.size(-1)) ** 0.5))
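Expanded as a standalone sketch, the branch above effectively computes the following. Note that `adjust_lr_for_muon` here is a placeholder whose formula is an assumption for illustration, not the repository's actual implementation:

```python
import math


def adjust_lr_for_muon(lr, rows, cols):
    # Placeholder for the Moonlight-style adjusted LR; the exact
    # formula used by the repository is an assumption here.
    return 0.2 * lr * math.sqrt(max(rows, cols))


def step_size_as_reported(lr, rows, cols, use_adjusted_lr):
    # Reported behavior: sqrt(max(1, A/B)) multiplies the learning
    # rate in BOTH branches, even when the Moonlight-style
    # adjustment has already been applied.
    base = adjust_lr_for_muon(lr, rows, cols) if use_adjusted_lr else lr
    return base * max(1.0, rows / cols) ** 0.5
```

With `use_adjusted_lr=True` this yields the adjusted LR times an extra sqrt(max(1, A/B)) factor, which is the double scaling the issue reports.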

But max(1, A/B)**0.5 is not used in the paper.
And in the footnote on page 4 of the paper, there is the following:

K. Jordan et al. 2024’s original implementation scales the updates by sqrt(max(1, A/B)), which is equivalent to our proposal (up to a global scale) if all matrices have the same second dimension; Pethick et al. 2025 and You 2025 discussed a similar issue on update scaling factors concurrently to our work.

Therefore, when "use_adjusted_lr" is True, I think max(1, A/B)**0.5 should not be used.

The paper I referenced is:

https://arxiv.org/abs/2502.16982

@hatonosuke hatonosuke added the bug Something isn't working label Apr 20, 2025
@kozistr
Owner

kozistr commented Apr 22, 2025

@hatonosuke Hi! Thanks for pointing this out.

Yes, the use_adjusted_lr variant is not from the original paper; it's from the Moonlight optimizer, which is a variant of the Muon optimizer.

You can also check the use_adjusted_lr parameter description in the Muon optimizer here.

@hatonosuke
Author

Hello. Thank you for your reply.

I checked the implementation in the link, and it is implemented as described in the paper.
sqrt(max(1.0, B/A)) is not used as part of the learning rate.

@kozistr
Owner

kozistr commented Apr 22, 2025

Hello. Thank you for your reply.

I checked the implementation in the link, and it is implemented as described in the paper. sqrt(max(1.0, B/A)) is not used as part of the learning rate.

Oh, sorry for the confusion. I misunderstood that part. You're right: sqrt(max(1.0, B/A)) should not be used when use_adjusted_lr is True. I'll work on that. Thanks for pointing this out :)
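The fix described above can be sketched as follows; the two scalings become mutually exclusive. As before, `adjust_lr_for_muon` is a hypothetical stand-in whose formula is an assumption, not the repository's actual code:

```python
import math


def adjust_lr_for_muon(lr, rows, cols):
    # Placeholder for the Moonlight-style adjusted LR; the exact
    # formula used by the repository is an assumption here.
    return 0.2 * lr * math.sqrt(max(rows, cols))


def step_size_fixed(lr, rows, cols, use_adjusted_lr):
    # Proposed fix: apply sqrt(max(1, A/B)) only when the
    # Moonlight-style adjustment is NOT in use, so the adjusted-LR
    # path is no longer double-scaled.
    if use_adjusted_lr:
        return adjust_lr_for_muon(lr, rows, cols)
    return lr * max(1.0, rows / cols) ** 0.5
```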

@hatonosuke
Author

Thank you for your fix.
