[Bug]: Segmentation fault when continuing training on another machine due to non-portable serialization of learning_rate and clip_range #2115

Closed
5 tasks done
akanto opened this issue Apr 8, 2025 · 4 comments
Labels
bug Something isn't working

Comments

Contributor

akanto commented Apr 8, 2025

🐛 Bug

When training is started on one machine and continued on another with a different system setup (e.g., different OS, Python version, or architecture), model.load() results in a segmentation fault.

Root cause

The clip_range and lr_schedule are serialized with cloudpickle and stored in the data file inside the model.zip. These are closures or lambdas that are not portable between platforms or Python versions.

For example, after decoding the serialized object from the data file:

"clip_range": {
  ":type:": "<class 'function'>",
  ":serialized:": 
  "<!decoded> gAWV... progress_remainingK/opt/pytorch/lib/python3.12/site-packages/stable_baselines3/common/utils.py<lambda>!get_s....."
}

You can see that the full absolute path to the Python file is embedded, along with other objects that cannot be restored safely on another system.
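To illustrate why such a payload is tied to the machine that produced it, here is a minimal standard-library sketch. It assumes only that serializing a function by value means shipping its code object; `marshal` stands in for cloudpickle here, whose exact format is not reproduced, and `make_constant_schedule` is a hypothetical stand-in for the library helpers that return lambdas.

```python
import marshal
import pickle


def make_constant_schedule(value):
    # Hypothetical helper: returns a closure, like the lambdas used for schedules
    return lambda progress_remaining: value


schedule = make_constant_schedule(0.2)

# The standard pickle module cannot serialize a local lambda by reference
try:
    pickle.dumps(schedule)
    stdlib_pickle_ok = True
except (pickle.PicklingError, AttributeError, TypeError):
    stdlib_pickle_ok = False

# Serializing by value means shipping the code object itself
code = schedule.__code__
code_bytes = marshal.dumps(code)

print("stdlib pickle works:", stdlib_pickle_ok)
print("source path stored in code object:", code.co_filename)
print("path embedded in serialized bytes:", code.co_filename.encode() in code_bytes)
print("bytecode length (interpreter-specific):", len(code.co_code))
```

The code object records the absolute path of the defining file and raw bytecode, and bytecode is not guaranteed to be compatible across Python versions.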

Suggested improvements

  • Safe save: these values might not be stored in model.zip at all and instead be recreated on every load.
  • Use callable classes instead of lambdas: a class with a __call__ method could wrap these values. I tried this approach for learning_rate and it worked as expected; restoring the model on the other machine succeeded.
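A minimal sketch of the callable-class idea, under the assumption that a schedule only needs to map `progress_remaining` to a float. The class name `ConstantSchedule` and its attribute are illustrative here and do not claim to match any library implementation.

```python
import pickle


class ConstantSchedule:
    """Hypothetical callable schedule that always returns the same value."""

    def __init__(self, value: float):
        self.value = float(value)

    def __call__(self, progress_remaining: float) -> float:
        # progress_remaining goes from 1 (start) to 0 (end); unused here
        return self.value


schedule = ConstantSchedule(3e-4)

# Only a reference to the class and the instance attributes are pickled,
# not bytecode or file paths
restored = pickle.loads(pickle.dumps(schedule))
print(restored(0.5), vars(restored))
```

The trade-off is that the class must be importable under the same module path on the loading machine, which holds when it ships with the library itself.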

To Reproduce

1.) Train a PPO model on machine A (e.g., macOS or Ubuntu with Python 3.12):

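from stable_baselines3 import PPO

# `env` is an image-based Gymnasium environment created beforehand
model = PPO("CnnPolicy", env)
model.learn(total_timesteps=100_000)
model.save("model.zip")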

2.) Transfer the model to machine B (e.g., Linux with different Python version or architecture)
3.) Try to load and continue training

model = PPO.load("model.zip", env=env)
model.learn(total_timesteps=100_000, reset_num_timesteps=False)

4.) You’ll see a hard crash:

Fatal Python error: Segmentation fault

Relevant log output / Error message

Fatal Python error: Segmentation fault

Thread 0x00000003e16ab000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 363 in wait
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/queue.py", line 213 in get
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1041 in _bootstrap_inner
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1012 in _bootstrap

Current thread 0x00000001ec254c00 (most recent call first):
  File "/opt/pytorch/lib/python3.12/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 193 in train
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 337 in learn
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 311 in learn

As the first frame of the current thread shows, the restored lambda still refers to "/opt/pytorch/lib..", a path that exists only on the machine where the model was saved.
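To check which entries of a saved model are affected without unpickling anything, the archive can be inspected directly. This is a sketch that assumes the layout described above: a JSON file named `data` inside the zip, whose function-valued entries carry a ":type:" field. The helper name `find_serialized_functions` is hypothetical.

```python
import json
import zipfile


def find_serialized_functions(model_zip):
    """Return the names of entries in the archive's `data` file stored as functions."""
    with zipfile.ZipFile(model_zip) as archive:
        data = json.loads(archive.read("data").decode("utf-8"))
    return sorted(
        key
        for key, value in data.items()
        # Serialized objects are dicts that carry a ":type:" description
        if isinstance(value, dict) and "function" in value.get(":type:", "")
    )
```

Calling `find_serialized_functions("model.zip")` on the model from the reproduction steps would be expected to list entries such as `clip_range` and `lr_schedule`.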

System Info

  • Python: 3.13.2
  • Stable-Baselines3: 2.6.0
  • PyTorch: 2.6.0
  • GPU Enabled: False
  • Numpy: 2.2.0
  • Cloudpickle: 3.1.1
  • Gymnasium: 1.1.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
akanto added the bug label Apr 8, 2025
Member

araffin commented Apr 8, 2025

Hello,

yes, this is a known limitation.

Safe save: maybe these values should not be stored in the model.zip, but always recreating them during load

this should work in your case:
https://github.com/DLR-RM/rl-baselines3-zoo/blob/325ef5dafe46e483ce9d727d2851aff18b70db7c/rl_zoo3/enjoy.py#L184-L196
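
A minimal sketch of that workaround, assuming the `custom_objects` argument of `load()` is used to replace the saved schedules instead of deserializing them. The helper names and the default values (3e-4 and 0.2) are placeholders for illustration; to continue training they should match the original training setup.

```python
def build_custom_objects(learning_rate: float = 3e-4, clip_range: float = 0.2) -> dict:
    """Replacement values for entries that should not be unpickled from the archive."""
    return {
        "learning_rate": learning_rate,
        "lr_schedule": lambda _progress_remaining: learning_rate,
        "clip_range": lambda _progress_remaining: clip_range,
    }


def load_portable(path: str, env=None):
    # Imported lazily so the helper above can be used without SB3 installed
    from stable_baselines3 import PPO

    # Entries named in custom_objects are taken from the dict, not from the file
    return PPO.load(path, env=env, custom_objects=build_custom_objects())
```

With this, `model = load_portable("model.zip", env=env)` followed by `model.learn(..., reset_num_timesteps=False)` would recreate the schedules on the loading machine.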

Contributor Author

akanto commented Apr 8, 2025

Hi,

Thanks, the workaround works for me.

However, if any of the suggestions above seem reasonable (e.g., supporting callable classes or adding a safe_save option), I’d be happy to help by submitting a pull request.

Member

araffin commented Apr 14, 2025

What would the safe save option look like?
Same question for supporting callable classes. And would it be backward compatible? (important)
It would be best to share a POC or a minimal example.

akanto added a commit to akanto/stable-baselines3 that referenced this issue Apr 26, 2025
…non-portable schedules

Previously, using closures (e.g., lambdas) for learning_rate or clip_range caused
segmentation faults when loading models across different platforms (e.g., macOS to Linux), because cloudpickle could not safely serialize/deserialize them.

This commit rewrites:
- `constant_fn` as a `ConstantSchedule` class
- `get_schedule_fn` as a `ScheduleWrapper` class
- `get_linear_fn` as a `LinearSchedule` class

All schedules are now proper callable classes, making them portable and safely pickleable.
Old functions are kept (marked as deprecated) for backward compatibility when loading
existing models.
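
As an illustration of what a class-based linear schedule could look like, here is a hedged sketch. It is not the code from the commit; the constructor parameters `start`, `end`, and `end_fraction` are assumptions chosen for the example.

```python
class LinearSchedule:
    """Hypothetical callable replacement for a closure-based linear schedule."""

    def __init__(self, start: float, end: float, end_fraction: float):
        self.start = start
        self.end = end
        self.end_fraction = end_fraction

    def __call__(self, progress_remaining: float) -> float:
        # progress_remaining decreases from 1 (beginning) to 0 (end of training)
        elapsed = 1.0 - progress_remaining
        if elapsed > self.end_fraction:
            return self.end
        return self.start + elapsed * (self.end - self.start) / self.end_fraction


schedule = LinearSchedule(start=1.0, end=0.0, end_fraction=0.5)
for progress in (1.0, 0.75, 0.5, 0.25):
    print(progress, schedule(progress))
```

All state lives in three plain attributes, so a saved instance contains only numbers plus a reference to the class.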
akanto added a commit to akanto/stable-baselines3 that referenced this issue Apr 27, 2025
…non-portable schedules

Previously, using closures (e.g., lambdas) for learning_rate or clip_range caused
segmentation faults when loading models across different platforms (e.g., macOS to Linux), because cloudpickle could not safely serialize/deserialize them.

This commit rewrites:
- `constant_fn` as a `ConstantSchedule` class
- `get_schedule_fn` as a `FloatConverterSchedule` class
- `get_linear_fn` as a `LinearSchedule` class

All schedules are now proper callable classes, making them portable and safely pickleable.
Old functions are kept (marked as deprecated) for backward compatibility when loading
existing models.
Contributor Author

akanto commented Apr 27, 2025

I have created a PoC with callable classes: #2125

Besides the tests included in the pull request, I also manually tested portability. Before applying the changes, if I trained the model on Linux and then moved it to macOS, I encountered the following issue:

Training on mps
Loading model from models/mario_ppo_latest.zip
Wrapping the env in a VecTransposeImage.
Number of timesteps trained: 8192, reset_num_timesteps: False
Logging to logs/PPO_22
------------------------------
| time/              |       |
|    fps             | 216   |
|    iterations      | 1     |
|    time_elapsed    | 9     |
|    total_timesteps | 10240 |
------------------------------
Fatal Python error: Segmentation fault

Thread 0x00000003ca707000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 363 in wait
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/queue.py", line 213 in get
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1041 in _bootstrap_inner
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1012 in _bootstrap

Current thread 0x0000000201844c00 (most recent call first):
  File "/home/ubuntu/mario-brain/.venv/lib/python3.12/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/utils.py", line 101 in __call__
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 193 in train
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 337 in learn
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 311 in learn
  File "/Users/akanto/prj/akanto/mario-brain/mario_brain/train.py", line 69 in train
  File "/Users/akanto/prj/akanto/mario-brain/mario_brain/train.py", line 83 in <module>

After applying my changes, which replace lambdas with classes, I saved the model on Linux and continued training on macOS, and it worked as expected:

Training on mps
Loading model from models/mario_ppo_latest.zip
Wrapping the env in a VecTransposeImage.
Number of timesteps trained: 8192, reset_num_timesteps: False
Logging to logs/PPO_22
------------------------------
| time/              |       |
|    fps             | 224   |
|    iterations      | 1     |
|    time_elapsed    | 9     |
|    total_timesteps | 10240 |
------------------------------
-------------------------------------------
| time/                   |               |
|    fps                  | 138           |
|    iterations           | 2             |
|    time_elapsed         | 29            |
|    total_timesteps      | 12288         |
| train/                  |               |
|    approx_kl            | 4.0190353e-06 |
|    clip_fraction        | 0             |
|    clip_range           | 0.2           |
|    entropy_loss         | -1.95         |
|    explained_variance   | 0.00373       |
|    learning_rate        | 5.77e-06      |
|    loss                 | 629           |
|    n_updates            | 50            |
|    policy_gradient_loss | -1.8e-05      |
|    value_loss           | 1.26e+03      |
-------------------------------------------
