
NaN loss after several training iterations #33

Closed
7crystle7 opened this issue Oct 26, 2018 · 8 comments
Labels
question Further information is requested

Comments

@7crystle7

❓ Questions and Help

Hi. I am training a maskrcnn-benchmark model with e2e_faster_rcnn_R_50_FPN_1x.yaml on the COCO dataset. However, after a few iterations (460), all the losses in the model become NaN. I have checked the default number of classes in the config file, and the value (81) seems correct for COCO. Can anyone give me a suggestion on how to solve this? Many thanks!

@fmassa
Contributor

fmassa commented Oct 26, 2018

Hi,

Did you change the hyperparameters for training, like the number of iterations or the batch size?
The loss becoming NaN usually happens when the learning rate is too high.

If you change the batch size, you need to adapt the learning rate / number of iterations accordingly.
For example, if you set the number of GPUs to 1, then you need to set IMS_PER_BATCH to 2, and the learning rate, number of iterations, etc. should follow the rules described here.
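For concreteness, here is a minimal sketch of what that adjustment could look like for a single-GPU run, assuming the usual SOLVER / TEST keys in the maskrcnn-benchmark configs and the factor-of-8 scaling described in the README; treat the exact values as an illustration, not an official recommendation:

```yaml
# Sketch of single-GPU overrides for e2e_faster_rcnn_R_50_FPN_1x.yaml.
# The defaults assume 8 GPUs x 2 images = 16 images per batch, so everything
# is scaled by 8 here; double-check the key names against your config version.
SOLVER:
  IMS_PER_BATCH: 2          # 1 GPU x 2 images (default: 16 across 8 GPUs)
  BASE_LR: 0.0025           # 0.02 / 8
  MAX_ITER: 720000          # 90000 * 8
  STEPS: (480000, 640000)   # (60000, 80000) * 8
TEST:
  IMS_PER_BATCH: 1          # test-time batch of 1 image per GPU
```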

@fmassa added the question (Further information is requested) and awaiting response labels on Oct 26, 2018
@7crystle7
Author

Hi, all the losses keep updating normally after I set a smaller learning rate for the model. It seems your suggestion works. Many thanks!

@fmassa
Contributor

fmassa commented Oct 26, 2018

Great, thanks!

I'll update the README with more details.

@guanbin1994

> Did you change the hyperparameters for training, like the number of iterations or the batch size?
> The loss becoming NaN usually happens when the learning rate is too high.
>
> If you change the batch size, you need to adapt the learning rate / number of iterations accordingly. For example, if you set the number of GPUs to 1, then you need to set IMS_PER_BATCH to 2, and the learning rate, number of iterations, etc. should follow the rules described here.

Hi, because the objects in my own dataset are small, I made two changes:
_C.MODEL.RPN.ANCHOR_SIZES = (32, 64, 128, 256, 512) ---> (16, 32, 64, 128, 256)
ANCHOR_STRIDE: (4, 8, 16, 32, 64) ---> (2, 4, 8, 16, 32)

After training for a few iterations, all the losses become NaN.
My learning rate is 0.01 and I have 4 GPUs.
Why?

@fmassa
Contributor

fmassa commented Nov 13, 2018

@guanbin1994 actually ANCHOR_STRIDE is not a free parameter of the model; it depends on the backbone. In your case, the strides of your feature maps should define ANCHOR_STRIDE.

Ideally we would like to remove ANCHOR_STRIDE altogether, and this is being tracked in #87.
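To make that concrete: with the R-50 / R-101 FPN backbones, the FPN levels have strides 4, 8, 16, 32 and 64, so ANCHOR_STRIDE has to stay in sync with them. A rough sketch of a small-object config that keeps the strides fixed (the smaller ANCHOR_SIZES values here are only illustrative):

```yaml
# Sketch: ANCHOR_STRIDE is dictated by the FPN feature-map strides and should
# not be halved; only the anchor sizes are a free choice for small objects.
MODEL:
  RPN:
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)     # fixed by the backbone / FPN levels
    ANCHOR_SIZES: (16, 32, 64, 128, 256)  # illustrative smaller anchors
```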

@guanbin1994

> @guanbin1994 actually ANCHOR_STRIDE is not a free parameter of the model; it depends on the backbone. In your case, the strides of your feature maps should define ANCHOR_STRIDE.
>
> Ideally we would like to remove ANCHOR_STRIDE altogether, and this is being tracked in #87.

I just use ResNet-101 and FPN, so what should I do?

@fmassa
Contributor

fmassa commented Nov 13, 2018

You should keep the same ANCHOR_STRIDE and instead increase INPUT.MIN_SIZE_* and INPUT.MAX_SIZE_*, but this will require more memory.
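A sketch of that suggestion, assuming the usual INPUT keys; the sizes below are placeholders rather than tested values (the FPN configs default to 800 / 1333, if I remember correctly), and larger inputs will increase memory usage:

```yaml
# Sketch: keep the default anchor strides and feed larger images instead,
# so small objects occupy more pixels. Placeholder sizes; tune to your GPU memory.
INPUT:
  MIN_SIZE_TRAIN: 1200
  MAX_SIZE_TRAIN: 2000
  MIN_SIZE_TEST: 1200
  MAX_SIZE_TEST: 2000
```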

@AzimAhmadzadeh

AzimAhmadzadeh commented Jul 1, 2019

I also had this problem and I solved it.
As everyone mentioned in the various issues raised in this repo, the problem is the learning rate.
In my case, the original setting in the config file was:
BASE_LR: 0.02 | STEPS: (60000, 80000) | MAX_ITER: 90000
which caused NaN losses after the 3rd iteration! Then I changed it to:
BASE_LR: 0.0025 | STEPS: (480000, 640000) | MAX_ITER: 720000
which comes from dividing the first by 8 and multiplying the other two by 8, as suggested in the README here.
The default setting is for 8 GPUs. I have only 2, so some changes were expected.

However, the above changes raised the estimated training time (i.e., eta) from 4 days to 41 days! So I avoided such a long training by only changing BASE_LR from 0.02 to 0.01. To evaluate whether this is enough, I will have to look at the loss plot and see where it plateaus.
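Purely as arithmetic, a strict application of the same linear scaling rule to a 2-GPU setup (2 GPUs x 2 images = 4 images per batch, i.e. a factor of 4 relative to the 16-image default rather than 8) would look roughly like this sketch:

```yaml
# Sketch: linear scaling for 2 GPUs (factor of 4 vs. the 8-GPU defaults).
SOLVER:
  IMS_PER_BATCH: 4
  BASE_LR: 0.005            # 0.02 / 4
  MAX_ITER: 360000          # 90000 * 4
  STEPS: (240000, 320000)   # (60000, 80000) * 4
```

Under that reading, MAX_ITER is 360000 instead of 720000, so the estimated training time would be roughly half of the 41-day figure while keeping the learning rate matched to the batch size.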
