[bug] on-policy rollout collects current "dones" instead of last "dones" #105

Took me a really long time to debug this, so hopefully this helps others out.

**Describe the bug**
In stable-baselines3, the on-policy rollout stores `last_obs`, the current reward, and the current `dones` (the flags returned by the `env.step()` call that was just made). See [here](https://github.com/DLR-RM/stable-baselines3/blob/3cf6e9714b816ab7f1352d6aa439059becff707b/stable_baselines3/common/on_policy_algorithm.py#L156)

In stable-baselines, the rollout stores `last_obs`, the current reward, and the last `dones` (the flags returned by the previous `env.step()` call). See [here](https://github.com/hill-a/stable-baselines/blob/c54dc5dc47be2be796f42bf76a3d2e74fa3db7d8/stable_baselines/ppo2/ppo2.py#L471)

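This messes up the return and advantage calculations. As far as I can tell, the stored done flags end up one step off from where the advantage computation expects them, so bootstrapping is cut off one step before the episode actually ends and then carries over across the real episode boundary.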
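
To make the off-by-one concrete, here is a minimal NumPy sketch, not the actual stable-baselines3 code. It assumes a baselines-style GAE recursion that masks the bootstrap at step `t` with `dones[t + 1]`; the function name `gae` and the toy rollout are made up for illustration. The episode ends after step 1, and `gamma = lam = 1` with zero value estimates, so each advantage should simply equal the remaining reward in its own episode.

```python
import numpy as np


def gae(rewards, values, dones, last_value, last_done, gamma=0.99, lam=0.95):
    """Baselines-style GAE (illustrative sketch).

    Assumes dones[t] says whether the observation at step t started a new
    episode, i.e. the "last dones" convention.
    """
    n = len(rewards)
    adv = np.zeros(n)
    last_gae = 0.0
    for t in reversed(range(n)):
        if t == n - 1:
            next_non_terminal = 1.0 - last_done
            next_value = last_value
        else:
            next_non_terminal = 1.0 - dones[t + 1]
            next_value = values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * lam * next_non_terminal * last_gae
        adv[t] = last_gae
    return adv


# Toy rollout (hypothetical data): 4 steps, env.step() returns done=True at step 1.
rewards = np.ones(4)
values = np.zeros(4)
current_dones = np.array([0.0, 1.0, 0.0, 0.0])  # flags as returned by each step

# "Last dones": the flag from the previous step, with 0 before the first step.
last_dones = np.concatenate(([0.0], current_dones[:-1]))

adv_last = gae(rewards, values, last_dones, 0.0, current_dones[-1], gamma=1.0, lam=1.0)
adv_current = gae(rewards, values, current_dones, 0.0, current_dones[-1], gamma=1.0, lam=1.0)

print(adv_last.tolist())     # [2.0, 1.0, 2.0, 1.0]  each episode counted separately
print(adv_current.tolist())  # [1.0, 3.0, 2.0, 1.0]  mask shifted by one step
```

With the current flags stored, step 0 is treated as terminal even though the episode continues, and step 1 accumulates reward from the next episode.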

I fixed this locally, and PPO improved dramatically on my custom environment (red is before fix, green is after fix).
![Imgur](https://i.imgur.com/wkVL3go.png)

