[Bug]: Possible inconsistencies with the PPO implementation #1986

Closed as not planned
@rajdeepsh

Description

🐛 Bug

I tested different implementations of the PPO algorithm and found some discrepancies among them. Each implementation was run on 56 Atari environments, with five trials per (implementation, environment) pair. The table below summarizes an environment-wise one-way ANOVA (one test per environment) on the effect of implementation source on mean reward. Out of the 56 environments tested, the implementations differed significantly in nine, as seen in the table for Stable Baselines3, CleanRL, and Baselines (not the 108 variant).

[Screenshot: table of per-environment one-way ANOVA results comparing Stable Baselines3, CleanRL, Baselines, and Baselines108]
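
For reference, the per-environment test is a standard one-way ANOVA. A minimal sketch is below, assuming `scipy.stats.f_oneway` as the test routine; the reward arrays are hypothetical placeholders for the five trials per implementation in one environment:

```python
from scipy.stats import f_oneway

# Hypothetical per-trial mean rewards for a single environment,
# one list per implementation (five trials each).
sb3 = [3100.0, 2950.0, 3050.0, 3200.0, 2980.0]
cleanrl = [3080.0, 3010.0, 2990.0, 3150.0, 3020.0]
baselines = [2700.0, 2650.0, 2800.0, 2720.0, 2690.0]

# One-way ANOVA: does implementation source affect mean reward?
stat, p_value = f_oneway(sb3, cleanrl, baselines)
print(f"F={stat:.2f}, p={p_value:.4f}")  # p < 0.05 flags the environment
```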

I believe there are inconsistencies among the implementations that cause the observed environment-dependent discrepancies. For example, I found an inconsistency (i.e., a bug) in Baselines' implementation where the frames per episode did not conform to the 108K limit of the v4 ALE specification (108,000 frames, i.e. 27,000 agent steps at a frame skip of 4), causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table above under Baselines108. The remaining inconsistencies are likely environment-related, so I am now investigating parts of Stable Baselines3's implementation that might affect only a subset of environments (similar to the frames-per-episode issue). Were there any specific choices in Stable Baselines3's implementation that might have contributed to the differences in performance? Any suggestions would be greatly appreciated :)
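
To illustrate the frame cap itself (this is not the exact patch behind Baselines108), one way to enforce the 108K-frame limit on a raw NoFrameskip-v4 environment is Gymnasium's `TimeLimit` wrapper; the environment id is just an example:

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# On a NoFrameskip env, one step corresponds to one emulator frame,
# so a 108_000-step limit enforces the 108K-frame episode cap.
env = gym.make("AtlantisNoFrameskip-v4")
env = TimeLimit(env, max_episode_steps=108_000)
```

With the frame skip of 4 applied by the Atari wrapper in the script below, this cap corresponds to 27,000 agent steps per episode.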

To Reproduce

Run Command:

python ppo_atari.py --gpu 0 --env Atlantis --trials 5

The hyperparameters follow those of the original PPO implementation (without LSTM).
ppo_atari.py:

import argparse
import json
import os
import pathlib
import time
import uuid

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.logger import configure
from stable_baselines3.common.torch_layers import NatureCNN
from stable_baselines3.common.utils import get_linear_fn
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack


def train_atari(args):
    # standard Atari preprocessing via SB3's AtariWrapper, with 8 parallel envs
    env = make_atari_env(
        f"{args.env}NoFrameskip-v4",
        n_envs=8,
        seed=args.seed,
        wrapper_kwargs={
            "noop_max": 30,
            "frame_skip": 4,
            "screen_size": 84,
            "terminal_on_life_loss": True,
            "clip_reward": True,
            "action_repeat_probability": 0.0,
        },
        vec_env_cls=DummyVecEnv,
    )
    env = VecFrameStack(env, n_stack=4)

    # PPO hyperparameters matching the original paper's Atari settings (no LSTM)
    model = PPO(
        "CnnPolicy",
        env,
        learning_rate=get_linear_fn(2.5e-4, 0, 1.0),
        n_steps=128,
        batch_size=256,
        n_epochs=4,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.1,
        clip_range_vf=0.1,
        normalize_advantage=True,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=float("inf") if args.noclip else 0.5,
        use_sde=False,
        target_kl=None,
        stats_window_size=100,
        policy_kwargs={
            "ortho_init": True,
            "features_extractor_class": NatureCNN,
            "share_features_extractor": True,
            "normalize_images": True,
        },
        seed=args.seed,
    )

    logger = configure(args.path, ["csv"])
    model.set_logger(logger)
    start_time = time.time()
    model.learn(total_timesteps=10000000, log_interval=1, progress_bar=True)
    train_end_time = time.time()
    mean_reward, _ = evaluate_policy(
        model,
        model.get_env(),
        n_eval_episodes=100,
        deterministic=False,
    )
    eval_end_time = time.time()
    args.training_time_h = ((train_end_time - start_time) / 60) / 60
    args.total_time_h = ((eval_end_time - start_time) / 60) / 60
    args.eval_mean_reward = mean_reward


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gpu",
        type=int,
        help="Specify GPU index",
        default=0,
    )
    parser.add_argument(
        "-e",
        "--env",
        type=str,
        help="Specify Atari environment w/o version",
        default="Pong",
    )
    parser.add_argument(
        "-t",
        "--trials",
        type=int,
        help="Specify number of trials",
        default=5,
    )
    parser.add_argument(
        "-nc",
        "--noclip",
        action="store_true",
        help="Only specify for no gradient clipping",
    )
    args = parser.parse_args()
    for _ in range(args.trials):
        args.id = uuid.uuid4().hex
        if args.noclip:
            args.path = os.path.join("trials", "ppo", f"{args.env}_NoClip", args.id)
        else:
            args.path = os.path.join("trials", "ppo", args.env, args.id)
        args.seed = int(time.time())

        # create dir
        pathlib.Path(args.path).mkdir(parents=True, exist_ok=True)

        # set gpu
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.gpu}"

        train_atari(args)

        # save trial info
        with open(os.path.join(args.path, "info.json"), "w") as f:
            json.dump(vars(args), f, indent=4)
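
Each trial writes its results to trials/ppo/<env>/<uuid>/info.json, so the per-environment rewards fed into the ANOVA above can be collected with a small script like the following (only the file layout comes from ppo_atari.py; the aggregation itself is illustrative):

```python
import glob
import json

import numpy as np

# Gather eval_mean_reward across all trials of one environment,
# following the directory layout written by ppo_atari.py above.
rewards = []
for path in glob.glob("trials/ppo/Atlantis/*/info.json"):
    with open(path) as f:
        rewards.append(json.load(f)["eval_mean_reward"])

print(f"n={len(rewards)}  mean={np.mean(rewards):.1f}  std={np.std(rewards):.1f}")
```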

Relevant log output / Error message
No response

System Info

  • OS: Linux-5.15.0-72-generic-x86_64-with-glibc2.31 #79~20.04.1-Ubuntu SMP Thu Apr 20 22:12:07 UTC 2023
  • Python: 3.11.9
  • Stable-Baselines3: 2.3.0
  • PyTorch: 2.3.1+cu121
  • GPU Enabled: True
  • Numpy: 1.26.4
  • Cloudpickle: 3.0.0
  • Gymnasium: 0.29.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.

Labels

question (Further information is requested)