[Question] Why does unscaling actions behave differently in training and eval #1592
❓ Question

I'm using PPO with the `squash_output=True` option. It looks like statistics (e.g., average reward, episode length) differ between collecting rollouts and evaluating the policy. After digging in, I found this is caused by the different behavior of action unscaling in `collect_rollouts` and `predict`, which are used during training and evaluation, respectively. Why is `unscale_action` not applied in `collect_rollouts()`?

stable-baselines3/stable_baselines3/common/policies.py, lines 352 to 361 in d68ff2e
stable-baselines3/stable_baselines3/common/on_policy_algorithm.py, lines 174 to 177 in d68ff2e
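A hedged sketch of how the mismatch surfaces (the env and hyperparameters are illustrative; the original report used a custom env): the statistics logged during `learn()` come from `collect_rollouts`, while `evaluate_policy` goes through `predict`, which does unscale the squashed action.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# gSDE + squash_output=True: the policy output is squashed to [-1, 1] by a tanh.
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # action bounds are [-2, 2], so unscaling matters
    use_sde=True,
    policy_kwargs=dict(squash_output=True),
)

# Rollout statistics logged here come from collect_rollouts(), which
# (before the fix) stepped the env with the squashed action as-is.
model.learn(total_timesteps=10_000)

# evaluate_policy() calls predict(), which unscales the squashed action
# back to the env's [low, high] bounds, hence the differing statistics.
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
```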
Comments
Hello, […]
Yes, I'm using a custom env and the boundaries are not in [-1, 1].
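As an aside, a common way to avoid such bound mismatches altogether is to normalize the action space at the environment level, for instance with Gymnasium's `RescaleAction` wrapper (a minimal sketch; `Pendulum-v1` stands in for a custom env with bounds outside [-1, 1]):

```python
import gymnasium as gym
from gymnasium.wrappers import RescaleAction

env = gym.make("Pendulum-v1")  # action space is Box(-2.0, 2.0)
# Expose a [-1, 1] action space to the agent; the wrapper rescales
# each action back to the original bounds before the inner step().
env = RescaleAction(env, min_action=-1.0, max_action=1.0)
assert (env.action_space.low == -1.0).all() and (env.action_space.high == 1.0).all()
```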
There is a related bug (or possibly a feature?) in SB3. It attempts to unsquash actions even when I use a distribution that does not support squashing. For example: […]
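The example itself was truncated; below is a hedged reconstruction of the kind of configuration that triggers this, inferred from the merged fix that "prevents squash_output if not use_sde":

```python
from stable_baselines3 import PPO

# With use_sde=False (the default), PPO uses a plain diagonal Gaussian
# distribution, which never applies a tanh. Yet with squash_output=True,
# predict() still "unsquashes" (unscales) the raw Gaussian sample.
# On SB3 versions that include the fix, this raises an error instead.
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",
    policy_kwargs=dict(squash_output=True),
)
obs = model.get_env().reset()
action, _ = model.predict(obs)
```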
Hello, […]
Hi, correct me if I am wrong, but "squash_output: Whether to squash the output using a tanh function" in the documentation is misleading, as the […]

If I understand correctly, the first issue is the following: it looks like the […] Would replacing […]? I don't think modifying […]

Best, […]
Actually, the tanh comes from the distributions (squashed Gaussian and gSDE):

[…]

and stable-baselines3/stable_baselines3/common/distributions.py, lines 465 to 468 in f4ec0f6

the tanh in the […]. To sum up, everything which is internal uses the scaled action; only the interaction with the env, so the […]
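For reference, the (un)scaling in question is a plain affine map between the squashed range [-1, 1] and the env bounds [low, high]; here is a minimal standalone sketch of what SB3's `scale_action`/`unscale_action` helpers compute (the bounds are illustrative):

```python
import numpy as np

low, high = np.array([-2.0, 0.0]), np.array([2.0, 10.0])  # illustrative env bounds

def scale_action(action: np.ndarray) -> np.ndarray:
    """Map an action from [low, high] to [-1, 1]."""
    return 2.0 * (action - low) / (high - low) - 1.0

def unscale_action(scaled_action: np.ndarray) -> np.ndarray:
    """Map a squashed action from [-1, 1] back to [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)

# Round trip: unscale(scale(a)) recovers a.
a = np.array([1.5, 3.0])
assert np.allclose(unscale_action(scale_action(a)), a)
```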
Resolution (merged PR):

* prevents squash_output if not use_sde, see #1592
* update changelog
* add unscaling of actions taken during training
* add test regarding squashing and unsquashing
* avoids try-except block
* format Gymnasium code with black
* makes mypy pass
* makes pytype pass
* sort imports
* makes error message in assert statement clearer
* improves code commenting
* replaces full env with wrapper
* Cleanup code
* Reformat

Co-authored-by: PatrickHelm <[email protected]>
Co-authored-by: Antonin Raffin <[email protected]>
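The PR's actual test isn't reproduced here; below is a hedged sketch of how one could check the fix, using a small recording wrapper (the `ActionRecorder` name and all hyperparameters are illustrative, not the PR's test code). With the fix, `collect_rollouts` unscales actions before stepping the env, so the env should receive actions spanning its own bounds rather than raw tanh output confined to [-1, 1].

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class ActionRecorder(gym.ActionWrapper):
    """Records every action the wrapped env actually receives in step()."""

    def __init__(self, env: gym.Env):
        super().__init__(env)
        self.received: list[np.ndarray] = []

    def action(self, action):
        self.received.append(np.asarray(action))
        return action

env = ActionRecorder(gym.make("Pendulum-v1"))  # action bounds are [-2, 2]
model = PPO(
    "MlpPolicy",
    env,
    use_sde=True,
    policy_kwargs=dict(squash_output=True),
    n_steps=64,
    batch_size=64,
)
model.learn(total_timesteps=64)

received = np.concatenate(env.received)
# With the fix, training-time actions are unscaled to the env bounds,
# so their range can exceed [-1, 1] (up to [-2, 2] here).
print(received.min(), received.max())
```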