Where is SignSGD performed? #1

manishadubey91 opened this issue Nov 3, 2018 · 5 comments


manishadubey91 commented Nov 3, 2018

I am unable to figure out where exactly the sign of the gradient is being taken into account (except in the toy example).


jxbz commented Nov 3, 2018

Hi @manishadubey91, sorry this is unclear. You have to pass in the optimiser as a command line argument. For example:

python train_resnet.py --optim signum --lr 0.0001 --wd 0.00001

This works because signum was implemented in the mxnet deep learning framework (see this page). I can also share Pytorch code for the optimiser if that would help.
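For reference, the Signum update rule (a momentum average of the gradients, followed by a sign step) can be sketched in a few lines of NumPy. This is an illustrative sketch based on the signSGD paper's description; the function name and hyperparameter names are assumptions, not the repo's or MXNet's actual interface:

```python
import numpy as np

def signum_step(w, grad, momentum, lr=1e-4, beta=0.9, wd=1e-5):
    """One Signum update: momentum on the raw gradient, then a signed step.

    Sketch only -- lr/beta/wd names are illustrative assumptions.
    """
    # Exponential moving average of the gradient.
    momentum = beta * momentum + (1.0 - beta) * grad
    # Step in the direction of the sign of the momentum, plus weight decay.
    w = w - lr * (np.sign(momentum) + wd * w)
    return w, momentum
```

Note that the update magnitude per coordinate is just the learning rate, since only the sign of the momentum is used.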

@jxbz jxbz closed this as completed Nov 3, 2018

amitport commented Oct 5, 2020

This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303

seems to be using 2 bits: (-1, 0, 1)


jxbz commented Oct 27, 2020

> This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303
>
> seems to be using 2 bits: (-1, 0, 1)

Hi @amitport, you're right and thanks for pointing this out. In this paper, we used an implementation of the sign function that quantised positive gradients to +1, negative gradients to -1, and zero gradients to 0. I think this was done at the time under the (naïve) assumption that a gradient component of exactly zero was unlikely to occur in practice. I'm planning to run some experiments to test if/how much this makes a difference to convergence, and will report back.
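Concretely, the ternary behaviour being discussed is just that of the standard sign function (e.g. np.sign), which maps positives to +1, negatives to -1, and an exact zero to 0, so the output alphabet has three symbols, not two:

```python
import numpy as np

# np.sign is ternary: an exact zero maps to 0, so outputs lie in {-1, 0, +1}.
g = np.array([3.2, -0.7, 0.0])
print(np.sign(g))  # [ 1. -1.  0.]
```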

@jxbz jxbz reopened this Oct 27, 2020
Owner

jxbz commented Jan 12, 2021

Hi @amitport, I tested the difference between the version that sends sign(0) --> 0 and the version that sends sign(0) --> ±1 at random. The tests and results are in this Jupyter notebook. At least for training Resnet-18 on CIFAR-10, there was little difference between the two implementations.
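The sign(0) --> ±1-at-random variant can be sketched as follows. This is a hedged illustration of the idea being tested, not the notebook's exact code:

```python
import numpy as np

def random_sign(x, rng=None):
    """Map positives to +1, negatives to -1, and exact zeros to ±1 at random.

    Sketch of the randomised variant; not the notebook's exact implementation.
    """
    rng = rng or np.random.default_rng()
    s = np.sign(x.astype(float))
    zeros = (s == 0)
    # Replace each zero with a fair coin flip over {-1, +1}.
    s[zeros] = rng.choice([-1.0, 1.0], size=int(zeros.sum()))
    return s
```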

That being said, in the distributed experiments in the ICLR 2019 paper, we used an implementation of the sign function that maps sign(0) --> +1 deterministically. So if this issue still bothers you (it bothers me) then it's safer to look at the experimental results in that paper. The compression in that paper is carried out in bit2byte.cpp which gets called by compressor.py.
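The deterministic sign(0) --> +1 convention is what makes true 1-bit packing possible, since only two symbols remain. Here is a rough Python sketch of the packing idea (an illustration of the concept only, not the actual bit2byte.cpp implementation):

```python
import numpy as np

def one_bit_sign(x):
    # Deterministic convention: zeros map to +1, leaving only two symbols.
    return np.where(x >= 0, 1.0, -1.0)

def pack_signs(x):
    # Encode +1 as bit 1 and -1 as bit 0; eight signs fit in one byte.
    bits = (one_bit_sign(x) > 0).astype(np.uint8)
    return np.packbits(bits)

def unpack_signs(packed, n):
    # Recover the first n signs from the packed byte array.
    bits = np.unpackbits(packed)[:n]
    return np.where(bits == 1, 1.0, -1.0)
```

With this scheme each gradient component costs exactly one bit on the wire, whereas a ternary code needs log2(3) ≈ 1.58 bits per component.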


amitport commented Jan 13, 2021

@jxbz thank you. I just wanted to make sure I understand what was used in the graphs, which I guess is the one-bit sign.

In any case, we can probably agree that the ternary sign {-1, 0, 1} is significantly better than the one-bit sign, so the distinction is meaningful. (And also that randomizing 0 is a big improvement over the simple one-bit sign.)
