Closed
Description
Took me a really long time to debug this, so hopefully this helps others out.
Describe the bug
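The on-policy rollout collects last_obs, the current reward, and the current dones (the flags returned by the step that was just taken). See here
In stable-baselines, the rollout collects last_obs, the current reward, and the last dones (the flags returned by the previous step, i.e. whether last_obs starts a new episode). See here
Storing the current dones shifts the done flags by one step relative to the observations, which messes up the return and advantage calculations.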
I fixed this locally, and PPO improved dramatically on my custom environment (red is before fix, green is after fix).