
[Question] Why does unscaling action behaves differently in training and eval #1592


Closed · Fixed by #1652
akane0314 opened this issue Jul 4, 2023 · 6 comments
Labels: bug · custom gym env · help wanted · more information needed · question

Comments

@akane0314

❓ Question

I'm using PPO with the squash_output=True option. It looks like statistics (e.g., average reward, episode length) differ between collecting rollouts and evaluating the policy. After digging in, I found this is caused by the different behavior of action unscaling in collect_rollouts and in the predict function, which are used during training and evaluation respectively. Why is unscale_action not applied in collect_rollouts()?

# In BasePolicy.predict() (used during evaluation):
if isinstance(self.action_space, spaces.Box):
    if self.squash_output:
        # Rescale to proper domain when using squashing
        actions = self.unscale_action(actions)
    else:
        # Actions could be on arbitrary scale, so clip the actions to avoid
        # out of bound error (e.g. if sampling from a Gaussian distribution)
        actions = np.clip(actions, self.action_space.low, self.action_space.high)

# In OnPolicyAlgorithm.collect_rollouts() (used during training):
# Clip the actions to avoid out of bound error
if isinstance(self.action_space, spaces.Box):
    clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
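
For reference, the unscale_action helper in BasePolicy is essentially an affine map from the policy's [-1, 1] range to the Box bounds. A minimal standalone sketch (not a verbatim copy of the SB3 helper):

import numpy as np

def unscale_action(scaled_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    # Map an action from [-1, 1] to [low, high]; this is what predict()
    # applies when squash_output=True, while collect_rollouts() only clips.
    return low + 0.5 * (scaled_action + 1.0) * (high - low)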


@akane0314 akane0314 added the question label Jul 4, 2023
@araffin araffin added the custom gym env label Jul 4, 2023
@araffin (Member) commented Jul 4, 2023

Hello,
I assume you are using a custom environment and the boundaries are not [-1, 1]? (The env checker should have warned you.)
But yes, it looks like a bug.

@araffin araffin added the more information needed label Jul 4, 2023
@akane0314 (Author)

Yes, I'm using a custom env and the boundaries are not in [-1, 1].

@araffin araffin added the bug and help wanted labels Jul 4, 2023
@MikhailGerasimov

There is a related bug (or possibly a feature?) in SB3. It attempts to unsquash actions even when I use a distribution that does not support squashing. For example:
model = PPO("MlpPolicy", env, policy_kwargs=dict(squash_output=True))
Although my action space is bounded between -1 and 1, due to the unsquashing process, I receive values in the env.step() function that go beyond the boundaries of [-1, 1].
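
To make the failure mode concrete: with squash_output=True but a plain (unsquashed) Gaussian, as PPO uses by default, predict() takes the rescaling branch instead of the clipping branch. For a Box bounded by [-1, 1] that rescale is the identity, so a raw sample outside [-1, 1] reaches env.step() untouched. A small numeric sketch, using only the affine formula quoted earlier:

import numpy as np

low, high = np.array([-1.0]), np.array([1.0])
raw_action = np.array([1.7])  # Gaussian sample, never passed through tanh

# squash_output=True branch of predict(): rescale instead of clip
unscaled = low + 0.5 * (raw_action + 1.0) * (high - low)
print(unscaled)  # [1.7] -- identity map for [-1, 1] bounds, still out of bounds

# squash_output=False branch would have clipped instead
print(np.clip(raw_action, low, high))  # [1.]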

@araffin (Member) commented Jul 5, 2023

Hello,
I would welcome a PR that solves both issues ;) (for the second one, we should not allow squash_output=True when not using gSDE).

@ReHoss (Contributor) commented Jul 12, 2023

Hi,

Correct me if I am wrong, but "squash_output: Whether to squash the output using a tanh function" in the documentation is misleading: the $\tanh$ non-linearity enabled by squash_output=True is only applied through create_mlp, which is only called by the ContinuousCritic and Actor objects in the source code, excluding for instance PPO. Instead, I observe a simpler rescaling from self.unscale_action in predict when using squash_output=True.
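
For illustration, a rough sketch (not SB3's exact create_mlp) of how such a squash_output flag typically works, i.e. just appending a Tanh module to the network that requests it, which is why policies that do not build their action net this way are never squashed:

import torch.nn as nn

def build_mlp(input_dim: int, output_dim: int, hidden_sizes, squash_output: bool = False) -> nn.Sequential:
    layers = []
    last_dim = input_dim
    for size in hidden_sizes:
        layers += [nn.Linear(last_dim, size), nn.ReLU()]
        last_dim = size
    layers.append(nn.Linear(last_dim, output_dim))
    if squash_output:
        # Bound the output to (-1, 1), as the DDPG/TD3 actors do
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)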

If I understand correctly, the first issue is the following (see the sketch after this list):

  • The forward method of the nn.Module does not unscale actions using the action space bounds.
  • The predict method of the BasePolicy does unscale actions using the action space bounds.
  • The rollout buffer therefore contains actions that are not unscaled.
  • During training, the algorithm learns from these non-unscaled actions in the rollout buffer and interacts with the environment using them.
  • During evaluation, the predict method makes the agent interact with the environment using unscaled actions.
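
A numeric sketch of that discrepancy, assuming a hypothetical Box with bounds [0, 10] and the affine formula quoted earlier (the values are only meant to show the two code paths diverge):

import numpy as np

low, high = np.array([0.0]), np.array([10.0])
scaled_action = np.array([0.5])  # the same network output in [-1, 1]

# Training path (collect_rollouts): only clipping, so the env receives 0.5
train_env_action = np.clip(scaled_action, low, high)

# Evaluation path (predict with squash_output=True): affine rescale, the env receives 7.5
eval_env_action = low + 0.5 * (scaled_action + 1.0) * (high - low)

print(train_env_action, eval_env_action)  # [0.5] [7.5]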

It looks like the squash_output argument is almost useless for on-policy algorithms and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Would replacing predict with the implicit low-level forward call in evaluate be a solution? I am not sure, since off-policy algorithms are truly squashed.

I don't think modifying collect_rollouts to call unscale_action is relevant, since in the on-policy case there is nothing to truly unsquash (no squashing non-linearity is applied to the policy). It would make more sense in the squashed-Gaussian case, as it is used for SAC. Otherwise, the risk would be to unscale a distribution with unbounded support...
A solution would be to unscale after the action clipping in the collect_rollouts method, but it is not very elegant and is inconsistent with the off-policy implementation: we would be using the clipping in collect_rollouts to simulate a distribution with bounded support, while activation functions are used in all the other cases...

After further reasoning, it seems the true bug is that on-policy methods do not really squash the outputs? Another solution would be to remove squash_output in the on-policy case, which would simplify the code somewhat.

Best,

PatrickHelm pushed a commit to PatrickHelm/stable-baselines3 that referenced this issue Aug 23, 2023
@araffin (Member) commented Aug 25, 2023

It looks like the squash_output argument is almost useless for on-policy algorithms and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Actually, the tanh comes from the distributions (squashed Gaussian and gSDE):

return th.tanh(self.gaussian_actions)

and
if squash_output:
    self.bijector = TanhBijector(epsilon)
else:
    self.bijector = None

the tanh in the create_mlp is for algorithms that don't rely on distributions (DDPG, TD3).
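
A tiny sketch of the squashing those distributions apply (assuming a standard Normal sample, purely for illustration):

import torch as th

gaussian_actions = th.distributions.Normal(0.0, 1.0).sample((5,))
squashed = th.tanh(gaussian_actions)  # always in (-1, 1), so no clipping is needed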

To sum up, everything internal uses scaled actions; only the interaction with the env, i.e. the predict() method, unscales the actions to the correct bounds when needed.
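
Given that convention, the fix that eventually landed (see the commits below) extends the unscaling to the training path as well. A hypothetical sketch of what this could look like inside collect_rollouts, before env.step() (names and placement are assumptions, not the actual patch; the surrounding imports are those of the original file):

clipped_actions = actions
if isinstance(self.action_space, spaces.Box):
    if self.policy.squash_output:
        # Unscale the action from [-1, 1] to the env bounds, mirroring predict()
        clipped_actions = self.policy.unscale_action(clipped_actions)
    else:
        # Otherwise clip, since the Gaussian has unbounded support
        clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
new_obs, rewards, dones, infos = env.step(clipped_actions)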

araffin added a commit that referenced this issue Sep 1, 2023
* prevents squash_output if not use_sde, see #1592

* update changelog

* add unscaling of actions taken during training

* add test regarding squashing and unsquashing

* avoids try-except block

* format Gymnasium code with black

* makes mypy pass

* makes pytype pass

* sort imports

* makes error message in assert statement clearer

Co-authored-by: Antonin RAFFIN <[email protected]>

* improves code commenting

* replaces full env with wrapper

* Cleanup code

* Reformat

---------

Co-authored-by: PatrickHelm <[email protected]>
Co-authored-by: Antonin RAFFIN <[email protected]>
Co-authored-by: Antonin Raffin <[email protected]>