Gradient checkpointing with DDP in a loop #10479
Replies: 6 comments 4 replies
-
Dear @shivammehta007, I also got this error. Has it been solved?
-
This does not appear to be a Lightning issue, but rather DistributedDataParallel from torch.distributed not supporting gradient checkpointing.
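For context, the failure usually comes from the extra forward/backward passes that activation checkpointing introduces, which confuse DDP's gradient-ready bookkeeping. Below is a minimal sketch (not from this thread) of the two knobs that commonly make the combination work: `use_reentrant=False` on the checkpoint call and `static_graph=True` on DDP. The names `ToyBlock` and `run` are made up, and an initialized process group (e.g. via `torchrun`) is assumed.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        # Non-reentrant checkpointing recomputes activations during backward
        # and tends to interact better with DDP's autograd hooks than the
        # default (reentrant) variant.
        return checkpoint(self.layer, x, use_reentrant=False)


def run(local_rank: int):
    # Assumes torch.distributed.init_process_group() has already been called.
    model = ToyBlock().to(local_rank)
    # static_graph=True tells DDP that the set of used parameters and the
    # autograd graph do not change across iterations, which lets it tolerate
    # the recomputation that checkpointing introduces.
    ddp_model = DDP(model, device_ids=[local_rank], static_graph=True)
    out = ddp_model(torch.randn(8, 32, device=local_rank))
    out.sum().backward()
```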
-
I have solved this problem now. The cause is that the model has parameters that were not used in producing the loss. Apply the following two settings and you will see the names of the unused parameters.
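The exact settings from this comment are not preserved in the thread. A common pair for surfacing unused-parameter names with Lightning and DDP is sketched below, assuming a recent Lightning release where `DDPStrategy` is importable from `lightning.pytorch.strategies`.

```python
import os

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

# 1) Ask torch.distributed to report which parameters did not receive
#    gradients; their names show up in the warning/error output.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# 2) Let DDP tolerate parameters that are unused in a given forward pass.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)
```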
-
Hi @kuixu, we can find the parameter names, but how do we go about fixing it? Do we need to remove those parameters, or something else? How does finding them resolve the issue?
-
Hi, I'm still having this issue. I'm working with Fabric and getting the error "Expected to mark a variable ready only once". What is the workaround for this?
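One possible workaround with Fabric is sketched below, assuming that Fabric's `DDPStrategy` forwards extra keyword arguments such as `static_graph` to `DistributedDataParallel`; the model and optimizer are placeholders. Switching the checkpoint calls in the model to `use_reentrant=False` is another option worth trying.

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import DDPStrategy

fabric = Fabric(
    accelerator="gpu",
    devices=2,
    # Extra kwargs on DDPStrategy are passed through to DistributedDataParallel,
    # so static_graph=True reaches the underlying DDP wrapper.
    strategy=DDPStrategy(static_graph=True),
)
fabric.launch()

# Placeholder model/optimizer; replace with the checkpointed model in question.
model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = fabric.setup(model, optimizer)
```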
-
Hey, any update on this? I'm really keen to get DDP and gradient checkpointing to work together. Otherwise, do we have to use FSDP?
-
Since my method is an autoregressive algorithm, it builds a huge gradient tape, so I am trying to do something like this.
It works fine on a single GPU, but with DDP it throws this error.
I am running it with
Any workaround for this?
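The original code snippet, error message, and launch command are not preserved above. As a rough illustration of the setup being described, here is a minimal, made-up example of checkpointing inside an autoregressive loop; under DDP, the repeated recomputation through the same parameters is what typically triggers "Expected to mark a variable ready only once" with the default (reentrant) checkpointing.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class AutoregressiveToy(nn.Module):
    # Illustrative module, not the model from this discussion.
    def __init__(self, dim: int = 64, n_steps: int = 100):
        super().__init__()
        self.cell = nn.Linear(dim, dim)
        self.n_steps = n_steps

    def forward(self, x):
        # Each step is checkpointed so only its inputs are stored, keeping the
        # "gradient tape" small; activations are recomputed during backward.
        for _ in range(self.n_steps):
            x = checkpoint(self.cell, x, use_reentrant=False)
        return x


model = AutoregressiveToy()
loss = model(torch.randn(8, 64)).sum()
loss.backward()
```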