Where is SignSGD performed? #1

manishadubey91 opened this issue Nov 3, 2018 · 5 comments


manishadubey91 commented Nov 3, 2018

I am unable to figure out where exactly the sign of the gradient is being taken into account (except in the toy example).


jxbz commented Nov 3, 2018

Hi @manishadubey91, sorry this is unclear. You have to pass in the optimiser as a command line argument. For example:

python train_resnet.py --optim signum --lr 0.0001 --wd 0.00001

This works because signum was implemented in the mxnet deep learning framework (see this page). I can also share Pytorch code for the optimiser if that would help.
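For reference, the Signum update rule (a momentum average of the gradients, followed by a sign step) can be sketched in a few lines of NumPy. This is an illustrative sketch based on the signSGD paper's description; the function name and hyperparameter names are assumptions, not the repo's or MXNet's actual interface:

```python
import numpy as np

def signum_step(w, grad, momentum, lr=1e-4, beta=0.9, wd=1e-5):
    """One Signum update: momentum on the raw gradient, then a signed step.

    Sketch only -- lr/beta/wd names are illustrative assumptions.
    """
    # Exponential moving average of the gradient.
    momentum = beta * momentum + (1.0 - beta) * grad
    # Step in the direction of the sign of the momentum, plus weight decay.
    w = w - lr * (np.sign(momentum) + wd * w)
    return w, momentum
```

Note that the update magnitude per coordinate is just the learning rate, since only the sign of the momentum is used.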

@jxbz jxbz closed this as completed Nov 3, 2018

amitport commented Oct 5, 2020

This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303

seems to be using 2 bits: (-1, 0, 1)


jxbz commented Oct 27, 2020

> This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303
>
> seems to be using 2 bits: (-1, 0, 1)

Hi @amitport, you're right and thanks for pointing this out. In this paper, we used an implementation of the sign function that quantised positive gradients to +1, negative gradients to -1, and zero gradients to 0. I think this was done at the time under the (naïve) assumption that a gradient component of exactly zero was unlikely to occur in practice. I'm planning to run some experiments to test if/how much this makes a difference to convergence, and will report back.
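Concretely, the ternary behaviour being discussed is just that of the standard sign function (e.g. np.sign), which maps positives to +1, negatives to -1, and an exact zero to 0, so the output alphabet has three symbols, not two:

```python
import numpy as np

# np.sign is ternary: an exact zero maps to 0, so outputs lie in {-1, 0, +1}.
g = np.array([3.2, -0.7, 0.0])
print(np.sign(g))  # [ 1. -1.  0.]
```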

@jxbz jxbz reopened this Oct 27, 2020
Owner

jxbz commented Jan 12, 2021

Hi @amitport, I tested the difference between the version that sends sign(0) --> 0 and the version that sends sign(0) --> ±1 at random. The tests and results are in this Jupyter notebook. At least for training Resnet-18 on CIFAR-10, there was little difference between the two implementations.
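The sign(0) --> ±1-at-random variant can be sketched as follows. This is a hedged illustration of the idea being tested, not the notebook's exact code:

```python
import numpy as np

def random_sign(x, rng=None):
    """Map positives to +1, negatives to -1, and exact zeros to ±1 at random.

    Sketch of the randomised variant; not the notebook's exact implementation.
    """
    rng = rng or np.random.default_rng()
    s = np.sign(x.astype(float))
    zeros = (s == 0)
    # Replace each zero with a fair coin flip over {-1, +1}.
    s[zeros] = rng.choice([-1.0, 1.0], size=int(zeros.sum()))
    return s
```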

That being said, in the distributed experiments in the ICLR 2019 paper, we used an implementation of the sign function that maps sign(0) --> +1 deterministically. So if this issue still bothers you (it bothers me) then it's safer to look at the experimental results in that paper. The compression in that paper is carried out in bit2byte.cpp which gets called by compressor.py.
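The deterministic sign(0) --> +1 convention is what makes true 1-bit packing possible, since only two symbols remain. Here is a rough Python sketch of the packing idea (an illustration of the concept only, not the actual bit2byte.cpp implementation):

```python
import numpy as np

def one_bit_sign(x):
    # Deterministic convention: zeros map to +1, leaving only two symbols.
    return np.where(x >= 0, 1.0, -1.0)

def pack_signs(x):
    # Encode +1 as bit 1 and -1 as bit 0; eight signs fit in one byte.
    bits = (one_bit_sign(x) > 0).astype(np.uint8)
    return np.packbits(bits)

def unpack_signs(packed, n):
    # Recover the first n signs from the packed byte array.
    bits = np.unpackbits(packed)[:n]
    return np.where(bits == 1, 1.0, -1.0)
```

With this scheme each gradient component costs exactly one bit on the wire, whereas a ternary code needs log2(3) ≈ 1.58 bits per component.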


amitport commented Jan 13, 2021

@jxbz thank you. I just wanted to make sure I understand what was used in the graphs, which I guess is the one-bit sign.

In any case, we can probably agree that the ternary sign {-1, 0, 1} is significantly better than the one-bit sign, so the distinction is meaningful. (And also that randomizing 0 is a big improvement over the simple one-bit sign.)
