
[Question] Why does unscaling action behaves differently in training and eval #1592


Closed · Fixed by #1652
akane0314 opened this issue Jul 4, 2023 · 6 comments
Labels: bug · custom gym env · help wanted · more information needed · question

Comments

@akane0314

❓ Question

I'm using PPO with the squash_output=True option. It looks like statistics (e.g., average reward, episode length) differ between collecting rollouts and evaluating the policy. After digging in, I found this is caused by the different behavior of action unscaling in collect_rollouts and in the predict function, which are used during training and evaluation respectively. Why is unscale_action not applied in collect_rollouts()?

# In BasePolicy.predict() (used during evaluation):
if isinstance(self.action_space, spaces.Box):
    if self.squash_output:
        # Rescale to proper domain when using squashing
        actions = self.unscale_action(actions)
    else:
        # Actions could be on arbitrary scale, so clip the actions to avoid
        # out of bound error (e.g. if sampling from a Gaussian distribution)
        actions = np.clip(actions, self.action_space.low, self.action_space.high)

# In OnPolicyAlgorithm.collect_rollouts() (used during training):
# Clip the actions to avoid out of bound error
if isinstance(self.action_space, spaces.Box):
    clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
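
For reference, the unscale_action helper in BasePolicy is essentially an affine map from the policy's [-1, 1] range to the Box bounds. A minimal standalone sketch (not a verbatim copy of the SB3 helper):

import numpy as np

def unscale_action(scaled_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    # Map an action from [-1, 1] to [low, high]; this is what predict()
    # applies when squash_output=True, while collect_rollouts() only clips.
    return low + 0.5 * (scaled_action + 1.0) * (high - low)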


@akane0314 akane0314 added the question label Jul 4, 2023
@araffin araffin added the custom gym env label Jul 4, 2023
@araffin (Member) commented Jul 4, 2023

Hello,
I assume you are using a custom environment and the boundaries are not [-1, 1]? (The env checker should have warned you.)
But yes, it looks like a bug.

@araffin araffin added the more information needed label Jul 4, 2023
@akane0314 (Author)

Yes, I'm using a custom env and the boundaries are not in [-1, 1].

@araffin araffin added the bug and help wanted labels Jul 4, 2023
@MikhailGerasimov

There is a related bug (or possibly a feature?) in SB3. It attempts to unsquash actions even when I use a distribution that does not support squashing. For example:
model = PPO("MlpPolicy", env, policy_kwargs=dict(squash_output=True))
Although my action space is bounded between -1 and 1, due to the unsquashing process, I receive values in the env.step() function that go beyond the boundaries of [-1, 1].
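
To make the failure mode concrete: with squash_output=True but a plain (unsquashed) Gaussian, as PPO uses by default, predict() takes the rescaling branch instead of the clipping branch. For a Box bounded by [-1, 1] that rescale is the identity, so a raw sample outside [-1, 1] reaches env.step() untouched. A small numeric sketch, using only the affine formula quoted earlier:

import numpy as np

low, high = np.array([-1.0]), np.array([1.0])
raw_action = np.array([1.7])  # Gaussian sample, never passed through tanh

# squash_output=True branch of predict(): rescale instead of clip
unscaled = low + 0.5 * (raw_action + 1.0) * (high - low)
print(unscaled)  # [1.7] -- identity map for [-1, 1] bounds, still out of bounds

# squash_output=False branch would have clipped instead
print(np.clip(raw_action, low, high))  # [1.]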

@araffin (Member) commented Jul 5, 2023

Hello,
I would welcome a PR that solves both issues ;) (for the second one, we should not allow squash_output=True when not using gSDE).

@ReHoss (Contributor) commented Jul 12, 2023

Hi,

Correct me if I am wrong, but "squash_output: Whether to squash the output using a tanh function" in the documentation is misleading: the $\tanh$ non-linearity enabled by squash_output=True is only applied through create_mlp, which is only called by the ContinuousCritic and Actor objects in the source code, excluding for instance PPO. Instead, I observe a simpler rescaling from self.unscale_action in predict when using squash_output=True.
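
For illustration, a rough sketch (not SB3's exact create_mlp) of how such a squash_output flag typically works, i.e. just appending a Tanh module to the network that requests it, which is why policies that do not build their action net this way are never squashed:

import torch.nn as nn

def build_mlp(input_dim: int, output_dim: int, hidden_sizes, squash_output: bool = False) -> nn.Sequential:
    layers = []
    last_dim = input_dim
    for size in hidden_sizes:
        layers += [nn.Linear(last_dim, size), nn.ReLU()]
        last_dim = size
    layers.append(nn.Linear(last_dim, output_dim))
    if squash_output:
        # Bound the output to (-1, 1), as the DDPG/TD3 actors do
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)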

If I understand correctly, the first issue is the following (see the sketch after this list):

  • The forward method of the nn.Module does not unscale actions using the action space bounds.
  • The predict method of the BasePolicy does unscale actions using the action space bounds.
  • The rollout buffer therefore contains actions that are not unscaled.
  • During training, the algorithm learns from these non-unscaled actions in the rollout buffer and interacts with the environment using them.
  • During evaluation, the predict method makes the agent interact with the environment using unscaled actions.
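
A numeric sketch of that discrepancy, assuming a hypothetical Box with bounds [0, 10] and the affine formula quoted earlier (the values are only meant to show the two code paths diverge):

import numpy as np

low, high = np.array([0.0]), np.array([10.0])
scaled_action = np.array([0.5])  # the same network output in [-1, 1]

# Training path (collect_rollouts): only clipping, so the env receives 0.5
train_env_action = np.clip(scaled_action, low, high)

# Evaluation path (predict with squash_output=True): affine rescale, the env receives 7.5
eval_env_action = low + 0.5 * (scaled_action + 1.0) * (high - low)

print(train_env_action, eval_env_action)  # [0.5] [7.5]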

It looks like the squash_output argument is almost useless for on-policy algorithms and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Would replacing predict with the implicit low-level forward call in evaluate be a solution? I am not sure, since off-policy algorithms are truly squashed.

I don't think modifying collect_rollouts to call unscale_action is relevant, since in the on-policy case there is nothing to truly unsquash (no squashing non-linearity is applied to the policy). It would make more sense in the squashed-Gaussian case, as it is used for SAC. Otherwise, the risk would be to unscale a distribution with unbounded support...
A solution would be to unscale after the action clipping in the collect_rollouts method, but it is not very elegant and is inconsistent with the off-policy implementation: we would be using the clipping in collect_rollouts to simulate a distribution with bounded support, while activation functions are used in all the other cases...

After further reasoning, it seems the true bug is that on-policy methods do not really squash the outputs? Another solution would be to remove squash_output in the on-policy case, which would simplify the code somewhat.

Best,

PatrickHelm pushed a commit to PatrickHelm/stable-baselines3 that referenced this issue Aug 23, 2023
@araffin (Member) commented Aug 25, 2023

It looks like the squash_output argument is almost useless for on-policy algorithms and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Actually, the tanh comes from the distributions (squashed Gaussian and gSDE):

return th.tanh(self.gaussian_actions)

and
if squash_output:
    self.bijector = TanhBijector(epsilon)
else:
    self.bijector = None

the tanh in the create_mlp is for algorithms that don't rely on distributions (DDPG, TD3).
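
A tiny sketch of the squashing those distributions apply (assuming a standard Normal sample, purely for illustration):

import torch as th

gaussian_actions = th.distributions.Normal(0.0, 1.0).sample((5,))
squashed = th.tanh(gaussian_actions)  # always in (-1, 1), so no clipping is needed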

To sum up, everything internal uses scaled actions; only the interaction with the env, i.e. the predict() method, unscales the actions to the correct bounds when needed.
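
Given that convention, the fix that eventually landed (see the commits below) extends the unscaling to the training path as well. A hypothetical sketch of what this could look like inside collect_rollouts, before env.step() (names and placement are assumptions, not the actual patch; the surrounding imports are those of the original file):

clipped_actions = actions
if isinstance(self.action_space, spaces.Box):
    if self.policy.squash_output:
        # Unscale the action from [-1, 1] to the env bounds, mirroring predict()
        clipped_actions = self.policy.unscale_action(clipped_actions)
    else:
        # Otherwise clip, since the Gaussian has unbounded support
        clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
new_obs, rewards, dones, infos = env.step(clipped_actions)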

araffin added a commit that referenced this issue Sep 1, 2023
* prevents squash_output if not use_sde, see #1592

* update changelog

* add unscaling of actions taken during training

* add test regarding squashing and unsquashing

* avoids try-except block

* format Gymnasium code with black

* makes mypy pass

* makes pytype pass

* sort imports

* makes error message in assert statement clearer

Co-authored-by: Antonin RAFFIN <[email protected]>

* improves code commenting

* replaces full env with wrapper

* Cleanup code

* Reformat

---------

Co-authored-by: PatrickHelm <[email protected]>
Co-authored-by: Antonin RAFFIN <[email protected]>
Co-authored-by: Antonin Raffin <[email protected]>