[Bug]: Segmentation fault when continuing training on another machine due to non-portable serialization of learning_rate and clip_range #2115

akanto · 2025-04-08T06:55:56Z

🐛 Bug

When training is started on one machine and continued on another with a different system setup (e.g., different OS, Python version, or architecture), loading the model with PPO.load() and continuing training results in a segmentation fault.

Root cause

The clip_range and lr_schedule are serialized with cloudpickle and stored in the data file inside the model.zip. These are closures or lambdas that are not portable between platforms or Python versions.
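To make the portability problem concrete, the following minimal sketch (standard library only) looks at what a closure carries. `constant_fn` here is an illustrative stand-in written for this example, not the library's code. A by-value serializer such as cloudpickle has to capture the function's code object, which includes the path of the defining file and raw bytecode whose format is tied to the interpreter version that compiled it.

```python
import pickle
import sys


def constant_fn(val):
    # Illustrative stand-in: returns a closure, like the schedules discussed here.
    return lambda progress_remaining: val


schedule = constant_fn(0.2)
code = schedule.__code__

# Pieces that a by-value serializer has to store for this function:
captured = {
    "defined_in": code.co_filename,         # path of the file that defined the lambda
    "bytecode_bytes": len(code.co_code),    # raw bytecode, specific to this interpreter
    "closure_vars": code.co_freevars,       # names of the closed-over variables
    "python": tuple(sys.version_info[:2]),  # version that compiled the bytecode
}
print(captured)

# The standard pickle module stores functions by name only, so it rejects the lambda.
try:
    pickle.dumps(schedule)
    stdlib_picklable = True
except (pickle.PicklingError, AttributeError):
    stdlib_picklable = False
print("picklable with stdlib pickle:", stdlib_picklable)
```

Everything listed in `captured` describes the saving machine, which is why restoring it under a different interpreter is fragile.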

For example, after decoding the serialized object from the data file:

"clip_range": {
  ":type:": "<class 'function'>",
  ":serialized:": 
  "<!decoded> gAWV... progress_remainingK/opt/pytorch/lib/python3.12/site-packages/stable_baselines3/common/utils.py<lambda>!get_s....."
}

You can see that the full absolute path to the Python file is embedded, along with other objects that cannot be restored safely on another system.
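A quick way to check whether a saved model carries such pickled entries is to open the archive and decode them. This is a minimal sketch using only the standard library. It assumes the archive member is named `data` and that entries follow the JSON layout shown above, with a base64 string under `:serialized:`. The helper name `serialized_entries` is hypothetical, and the in-memory archive and its payload are stand-ins so the example runs without a real model.

```python
import base64
import io
import json
import zipfile


def serialized_entries(zip_file):
    """Return {key: raw_bytes} for entries stored as base64-encoded pickles."""
    with zipfile.ZipFile(zip_file) as archive:
        data = json.loads(archive.read("data"))
    return {
        key: base64.b64decode(item[":serialized:"])
        for key, item in data.items()
        if isinstance(item, dict) and ":serialized:" in item
    }


# Build a tiny stand-in archive in memory (fake payload, not a real pickle).
fake_payload = base64.b64encode(b"...common/utils.py<lambda>...").decode()
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "data",
        json.dumps(
            {
                "gamma": 0.99,
                "clip_range": {
                    ":type:": "<class 'function'>",
                    ":serialized:": fake_payload,
                },
            }
        ),
    )
buffer.seek(0)

entries = serialized_entries(buffer)
# Flag entries whose payload mentions a Python source file.
print({key: b".py" in raw for key, raw in entries.items()})
```

With a real model, passing the path to the zip file instead of `buffer` should list which attributes were stored as pickled objects.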

Suggested improvements

Safe save: maybe these values should not be stored in the model.zip, but always recreated during load
Use callable classes instead of lambdas: a class with a __call__ method could be used to wrap these values. I have tried this approach for learning_rate and it worked as expected, and restoring on the other machine seemed to work
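A minimal sketch of the callable-class idea, assuming a constant value. The class name `ConstantValueSchedule` is hypothetical, and the standard `pickle` module stands in for cloudpickle here. The instance is stored as a reference to its class plus plain attribute data, so no bytecode or file path travels with it.

```python
import pickle


class ConstantValueSchedule:
    """Callable that returns the same value for any training progress."""

    def __init__(self, value: float) -> None:
        self.value = float(value)

    def __call__(self, progress_remaining: float) -> float:
        return self.value


learning_rate = ConstantValueSchedule(3e-4)

# Round trip through pickle: only the class reference and `value` are stored.
restored = pickle.loads(pickle.dumps(learning_rate))
print(restored(progress_remaining=0.5))
print(vars(restored))
```

This relies on the class living in a module that can be imported on the loading side. cloudpickle serializes classes defined in `__main__` by value, so a class defined in a training script would not get the same benefit.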

To Reproduce

1.) Train a PPO model on machine A (e.g., macOS or Ubuntu with Python 3.12):

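from stable_baselines3 import PPO

# `env` is an image-based environment created beforehand (not shown here)
model = PPO("CnnPolicy", env)
model.learn(total_timesteps=100_000)
model.save("model.zip")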

2.) Transfer the model to machine B (e.g., Linux with different Python version or architecture)
3.) Try to load and continue training

from stable_baselines3 import PPO

model = PPO.load("model.zip", env=env)
model.learn(total_timesteps=100_000, reset_num_timesteps=False)

4.) You’ll see a hard crash:

Fatal Python error: Segmentation fault

Relevant log output / Error message

Fatal Python error: Segmentation fault

Thread 0x00000003e16ab000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 363 in wait
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/queue.py", line 213 in get
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1041 in _bootstrap_inner
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1012 in _bootstrap

Current thread 0x00000001ec254c00 (most recent call first):
  File "/opt/pytorch/lib/python3.12/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 193 in train
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 337 in learn
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 311 in learn

As you can see, the crashing frame still references "/opt/pytorch/lib..", a path from the other machine where the object was saved.

System Info

Python: 3.13.2
Stable-Baselines3: 2.6.0
PyTorch: 2.6.0
GPU Enabled: False
Numpy: 2.2.0
Cloudpickle: 3.1.1
Gymnasium: 1.1.1

Checklist

My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
I have checked that there is no similar issue in the repo
I have read the documentation
I have provided a minimal and working example to reproduce the bug
I've used the markdown code blocks for both code and stack traces.

araffin · 2025-04-08T07:53:51Z

Hello,

yes, this is a known limitation.

Safe save: maybe these values should not be stored in the model.zip, but always recreated during load

this should work in your case:
https://github.com/DLR-RM/rl-baselines3-zoo/blob/325ef5dafe46e483ce9d727d2851aff18b70db7c/rl_zoo3/enjoy.py#L184-L196
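The linked script appears to work by passing `custom_objects` to `load()`, so that the pickled schedule entries are replaced instead of deserialized. Below is a minimal sketch of the same idea for continued training. The helper names are hypothetical, and the default values are placeholders that should match what the model was originally trained with.

```python
def portable_custom_objects(learning_rate: float = 3e-4, clip_range: float = 0.2) -> dict:
    """Build replacements for the entries that were stored as pickled functions.

    The default values are placeholders; use the ones the model was trained with.
    """
    return {
        "learning_rate": learning_rate,
        "lr_schedule": lambda _progress: learning_rate,
        "clip_range": lambda _progress: clip_range,
    }


def load_for_continued_training(path: str, env):
    # Imported lazily so the helper above can be used without the library installed.
    from stable_baselines3 import PPO

    return PPO.load(path, env=env, custom_objects=portable_custom_objects())


overrides = portable_custom_objects(learning_rate=1e-4, clip_range=0.1)
print(sorted(overrides))
```

Constant values are used here for simplicity; a model trained with a decaying schedule would need an equivalent schedule recreated on the loading side.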

akanto · 2025-04-08T16:46:49Z

Hi,

Thanks, the workaround works for me.

However, if any of the suggestions above seem reasonable (e.g., supporting callable classes or adding a safe_save option), I’d be happy to help by submitting a pull request.

araffin · 2025-04-14T15:20:35Z

What would the safe save option look like?
Same question for supporting callable classes. And would it be backward compatible? (important)
The best would be to share a POC or a minimal example.

…non-portable schedules

Previously, using closures (e.g., lambdas) for learning_rate or clip_range caused segmentation faults when loading models across different platforms (e.g., macOS to Linux), because cloudpickle could not safely serialize/deserialize them.

This commit rewrites:
- `constant_fn` as a `ConstantSchedule` class
- `get_schedule_fn` as a `ScheduleWrapper` class
- `get_linear_fn` as a `LinearSchedule` class

All schedules are now proper callable classes, making them portable and safely pickleable. Old functions are kept (marked as deprecated) for backward compatibility when loading existing models.

…non-portable schedules

Previously, using closures (e.g., lambdas) for learning_rate or clip_range caused segmentation faults when loading models across different platforms (e.g., macOS to Linux), because cloudpickle could not safely serialize/deserialize them.

This commit rewrites:
- `constant_fn` as a `ConstantSchedule` class
- `get_schedule_fn` as a `FloatConverterSchedule` class
- `get_linear_fn` as a `LinearSchedule` class

All schedules are now proper callable classes, making them portable and safely pickleable. Old functions are kept (marked as deprecated) for backward compatibility when loading existing models.
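To illustrate what such classes could look like, here is a minimal sketch. The class names come from the commit message; the bodies are assumptions modeled on the behavior of the functions they replace (a linear interpolation that reaches its end value after `end_fraction` of training, and a wrapper that casts a schedule's output to `float`), not the code merged in the pull request.

```python
from typing import Callable, Union


class LinearSchedule:
    """Interpolate linearly from `start` to `end`, reaching `end` after `end_fraction` of training."""

    def __init__(self, start: float, end: float, end_fraction: float) -> None:
        self.start = start
        self.end = end
        self.end_fraction = end_fraction

    def __call__(self, progress_remaining: float) -> float:
        # progress_remaining goes from 1 (start of training) to 0 (end of training).
        progress_done = 1.0 - progress_remaining
        if progress_done > self.end_fraction:
            return self.end
        return self.start + progress_done * (self.end - self.start) / self.end_fraction


class FloatConverterSchedule:
    """Wrap any schedule and cast its output to a Python float."""

    def __init__(self, value_schedule: Callable[[float], Union[int, float]]) -> None:
        self.value_schedule = value_schedule

    def __call__(self, progress_remaining: float) -> float:
        return float(self.value_schedule(progress_remaining))


clip_range = FloatConverterSchedule(LinearSchedule(start=1.0, end=0.0, end_fraction=0.5))
print(clip_range(1.0), clip_range(0.75), clip_range(0.25))  # 1.0 0.5 0.0
```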

akanto · 2025-04-27T14:20:08Z

I have created a PoC with callable classes: #2125

Besides the tests included in the pull request, I also manually tested portability. Before applying the changes, if I trained the model on Linux and then moved it to macOS, I encountered the following issue:

Training on mps
Loading model from models/mario_ppo_latest.zip
Wrapping the env in a VecTransposeImage.
Number of timesteps trained: 8192, reset_num_timesteps: False
Logging to logs/PPO_22
------------------------------
| time/              |       |
|    fps             | 216   |
|    iterations      | 1     |
|    time_elapsed    | 9     |
|    total_timesteps | 10240 |
------------------------------
Fatal Python error: Segmentation fault

Thread 0x00000003ca707000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 363 in wait
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/queue.py", line 213 in get
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1041 in _bootstrap_inner
  File "/opt/homebrew/Cellar/python@3.13/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1012 in _bootstrap

Current thread 0x0000000201844c00 (most recent call first):
  File "/home/ubuntu/mario-brain/.venv/lib/python3.12/site-packages/stable_baselines3/common/utils.py", line 98 in <lambda>
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/utils.py", line 101 in __call__
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 193 in train
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 337 in learn
  File "/Users/akanto/prj/akanto/mario-brain/.venv/lib/python3.13/site-packages/stable_baselines3/ppo/ppo.py", line 311 in learn
  File "/Users/akanto/prj/akanto/mario-brain/mario_brain/train.py", line 69 in train
  File "/Users/akanto/prj/akanto/mario-brain/mario_brain/train.py", line 83 in <module>

After applying my changes, which replace lambdas with classes, I saved the model on Linux and continued training on macOS, and it worked as expected:

Training on mps
Loading model from models/mario_ppo_latest.zip
Wrapping the env in a VecTransposeImage.
Number of timesteps trained: 8192, reset_num_timesteps: False
Logging to logs/PPO_22
------------------------------
| time/              |       |
|    fps             | 224   |
|    iterations      | 1     |
|    time_elapsed    | 9     |
|    total_timesteps | 10240 |
------------------------------
-------------------------------------------
| time/                   |               |
|    fps                  | 138           |
|    iterations           | 2             |
|    time_elapsed         | 29            |
|    total_timesteps      | 12288         |
| train/                  |               |
|    approx_kl            | 4.0190353e-06 |
|    clip_fraction        | 0             |
|    clip_range           | 0.2           |
|    entropy_loss         | -1.95         |
|    explained_variance   | 0.00373       |
|    learning_rate        | 5.77e-06      |
|    loss                 | 629           |
|    n_updates            | 50            |
|    policy_gradient_loss | -1.8e-05      |
|    value_loss           | 1.26e+03      |
-------------------------------------------

akanto added the bug label Apr 8, 2025

akanto mentioned this issue Apr 27, 2025

Use classes instead of lambdas for schedules #2125

Merged

16 tasks

araffin closed this as completed in f9c4ca5 May 14, 2025
