
NaN loss after several training iterations #33

Closed
7crystle7 opened this issue Oct 26, 2018 · 8 comments
Labels
question Further information is requested

Comments

@7crystle7

❓ Questions and Help

Hi. I am training a maskrcnn-benchmark model with e2e_faster_rcnn_R_50_FPN_1x.yaml on the COCO dataset. However, after a few iterations (460), all the losses in the model become NaN. I have checked the default number of classes in the config file, and the value (81) seems correct for COCO. Can anyone give me a suggestion on how to solve this? Many thanks!

@fmassa
Contributor

fmassa commented Oct 26, 2018

Hi,

Did you change the hyperparameters for training, like the number of iterations or the batch size?
The loss becoming NaN usually happens when the learning rate is too high.

If you change the batch size, you need to adapt the learning rate / number of iterations accordingly.
For example, if you set the number of GPUs to 1, then you need to set IMS_PER_BATCH to 2, and the learning rate, number of iterations, etc. should follow the rules described here.
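For concreteness, here is a minimal sketch of what that adjustment could look like for a single-GPU run, assuming the usual SOLVER / TEST keys in the maskrcnn-benchmark configs and the factor-of-8 scaling described in the README; treat the exact values as an illustration, not an official recommendation:

```yaml
# Sketch of single-GPU overrides for e2e_faster_rcnn_R_50_FPN_1x.yaml.
# The defaults assume 8 GPUs x 2 images = 16 images per batch, so everything
# is scaled by 8 here; double-check the key names against your config version.
SOLVER:
  IMS_PER_BATCH: 2          # 1 GPU x 2 images (default: 16 across 8 GPUs)
  BASE_LR: 0.0025           # 0.02 / 8
  MAX_ITER: 720000          # 90000 * 8
  STEPS: (480000, 640000)   # (60000, 80000) * 8
TEST:
  IMS_PER_BATCH: 1          # test-time batch of 1 image per GPU
```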

@fmassa added the question (Further information is requested) and awaiting response labels on Oct 26, 2018
@7crystle7
Author

Hi, all the losses keep updating normally after I set a smaller learning rate for the model. It seems your suggestion works. Many thanks!

@fmassa
Contributor

fmassa commented Oct 26, 2018

Great, thanks!

I'll update the README with more details.

@guanbin1994

> Did you change the hyperparameters for training, like the number of iterations or the batch size?
> The loss becoming NaN usually happens when the learning rate is too high.
>
> If you change the batch size, you need to adapt the learning rate / number of iterations accordingly. For example, if you set the number of GPUs to 1, then you need to set IMS_PER_BATCH to 2, and the learning rate, number of iterations, etc. should follow the rules described here.

Hi, because the objects in my own dataset are small, I made two changes:
_C.MODEL.RPN.ANCHOR_SIZES = (32, 64, 128, 256, 512) ---> (16, 32, 64, 128, 256)
ANCHOR_STRIDE: (4, 8, 16, 32, 64) ---> (2, 4, 8, 16, 32)

After training for a few iterations, all the losses become NaN.
My learning rate is 0.01 and I have 4 GPUs.
Why?

@fmassa
Contributor

fmassa commented Nov 13, 2018

@guanbin1994 actually ANCHOR_STRIDE is not a free parameter of the model; it depends on the backbone. In your case, the strides of your feature maps should define ANCHOR_STRIDE.

Ideally we would like to remove ANCHOR_STRIDE altogether, and this is being tracked in #87.
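To make that concrete: with the R-50 / R-101 FPN backbones, the FPN levels have strides 4, 8, 16, 32 and 64, so ANCHOR_STRIDE has to stay in sync with them. A rough sketch of a small-object config that keeps the strides fixed (the smaller ANCHOR_SIZES values here are only illustrative):

```yaml
# Sketch: ANCHOR_STRIDE is dictated by the FPN feature-map strides and should
# not be halved; only the anchor sizes are a free choice for small objects.
MODEL:
  RPN:
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)     # fixed by the backbone / FPN levels
    ANCHOR_SIZES: (16, 32, 64, 128, 256)  # illustrative smaller anchors
```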

@guanbin1994

> @guanbin1994 actually ANCHOR_STRIDE is not a free parameter of the model; it depends on the backbone. In your case, the strides of your feature maps should define ANCHOR_STRIDE.
>
> Ideally we would like to remove ANCHOR_STRIDE altogether, and this is being tracked in #87.

I just use ResNet-101 and FPN, so what should I do?

@fmassa
Contributor

fmassa commented Nov 13, 2018

You should keep the same ANCHOR_STRIDE and instead increase INPUT.MIN_SIZE_* and INPUT.MAX_SIZE_*, but this will require more memory.
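A sketch of that suggestion, assuming the usual INPUT keys; the sizes below are placeholders rather than tested values (the FPN configs default to 800 / 1333, if I remember correctly), and larger inputs will increase memory usage:

```yaml
# Sketch: keep the default anchor strides and feed larger images instead,
# so small objects occupy more pixels. Placeholder sizes; tune to your GPU memory.
INPUT:
  MIN_SIZE_TRAIN: 1200
  MAX_SIZE_TRAIN: 2000
  MIN_SIZE_TEST: 1200
  MAX_SIZE_TEST: 2000
```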

@AzimAhmadzadeh

AzimAhmadzadeh commented Jul 1, 2019

I also had this problem and I solved it.
As everyone mentioned in the various issues raised in this repo, the problem is the learning rate.
In my case, the original setting in the config file was:
BASE_LR: 0.02 | STEPS: (60000, 80000) | MAX_ITER: 90000
which caused NaN losses after the 3rd iteration! Then I changed it to:
BASE_LR: 0.0025 | STEPS: (480000, 640000) | MAX_ITER: 720000
which comes from dividing the first by 8 and multiplying the other two by 8, as suggested in the README here.
The default setting is for 8 GPUs. I have only 2, so some changes were expected.

However, the above changes raised the estimated training time (i.e., eta) from 4 days to 41 days! So I avoided such a long training by only changing BASE_LR from 0.02 to 0.01. To evaluate whether this is enough, I will have to look at the loss plot and see where it plateaus.
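Purely as arithmetic, a strict application of the same linear scaling rule to a 2-GPU setup (2 GPUs x 2 images = 4 images per batch, i.e. a factor of 4 relative to the 16-image default rather than 8) would look roughly like this sketch:

```yaml
# Sketch: linear scaling for 2 GPUs (factor of 4 vs. the 8-GPU defaults).
SOLVER:
  IMS_PER_BATCH: 4
  BASE_LR: 0.005            # 0.02 / 4
  MAX_ITER: 360000          # 90000 * 4
  STEPS: (240000, 320000)   # (60000, 80000) * 4
```

Under that reading, MAX_ITER is 360000 instead of 720000, so the estimated training time would be roughly half of the 41-day figure while keeping the learning rate matched to the batch size.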
