[Bug]: Possible inconsistencies with the PPO implementation #1986

Closed as not planned
@rajdeepsh

Description

🐛 Bug

I tested different implementations of the PPO algorithm and found some discrepancies among them. Each implementation was run on 56 Atari environments, with five trials per (implementation, environment) pair. The table below summarizes an environment-wise one-way ANOVA (one test per environment) on the effect of implementation source on mean reward. Out of the 56 environments tested, the implementations differed significantly in nine, as seen in the table for Stable Baselines3, CleanRL, and Baselines (not the 108 variant).

[Screenshot: table of per-environment one-way ANOVA results comparing Stable Baselines3, CleanRL, Baselines, and Baselines108]
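
For reference, the per-environment test is a standard one-way ANOVA. A minimal sketch is below, assuming `scipy.stats.f_oneway` as the test routine; the reward arrays are hypothetical placeholders for the five trials per implementation in one environment:

```python
from scipy.stats import f_oneway

# Hypothetical per-trial mean rewards for a single environment,
# one list per implementation (five trials each).
sb3 = [3100.0, 2950.0, 3050.0, 3200.0, 2980.0]
cleanrl = [3080.0, 3010.0, 2990.0, 3150.0, 3020.0]
baselines = [2700.0, 2650.0, 2800.0, 2720.0, 2690.0]

# One-way ANOVA: does implementation source affect mean reward?
stat, p_value = f_oneway(sb3, cleanrl, baselines)
print(f"F={stat:.2f}, p={p_value:.4f}")  # p < 0.05 flags the environment
```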

I believe there are inconsistencies among the implementations that cause the observed environment-dependent discrepancies. For example, I found an inconsistency (i.e., a bug) in Baselines' implementation where the frames per episode did not conform to the 108K limit of the v4 ALE specification (108,000 frames, i.e. 27,000 agent steps at a frame skip of 4), causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table above under Baselines108. The remaining inconsistencies are likely environment-related, so I am now investigating parts of Stable Baselines3's implementation that might affect only a subset of environments (similar to the frames-per-episode issue). Were there any specific choices in Stable Baselines3's implementation that might have contributed to the differences in performance? Any suggestions would be greatly appreciated :)
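
To illustrate the frame cap itself (this is not the exact patch behind Baselines108), one way to enforce the 108K-frame limit on a raw NoFrameskip-v4 environment is Gymnasium's `TimeLimit` wrapper; the environment id is just an example:

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# On a NoFrameskip env, one step corresponds to one emulator frame,
# so a 108_000-step limit enforces the 108K-frame episode cap.
env = gym.make("AtlantisNoFrameskip-v4")
env = TimeLimit(env, max_episode_steps=108_000)
```

With the frame skip of 4 applied by the Atari wrapper in the script below, this cap corresponds to 27,000 agent steps per episode.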

To Reproduce

Run Command:

python ppo_atari.py --gpu 0 --env Atlantis --trials 5

The hyperparameters follow those of the original PPO implementation (without LSTM).
ppo_atari.py:

import argparse
import json
import os
import pathlib
import time
import uuid

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.logger import configure
from stable_baselines3.common.torch_layers import NatureCNN
from stable_baselines3.common.utils import get_linear_fn
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack


def train_atari(args):
    # standard Atari preprocessing via SB3's AtariWrapper, with 8 parallel envs
    env = make_atari_env(
        f"{args.env}NoFrameskip-v4",
        n_envs=8,
        seed=args.seed,
        wrapper_kwargs={
            "noop_max": 30,
            "frame_skip": 4,
            "screen_size": 84,
            "terminal_on_life_loss": True,
            "clip_reward": True,
            "action_repeat_probability": 0.0,
        },
        vec_env_cls=DummyVecEnv,
    )
    env = VecFrameStack(env, n_stack=4)

    # PPO hyperparameters matching the original paper's Atari settings (no LSTM)
    model = PPO(
        "CnnPolicy",
        env,
        learning_rate=get_linear_fn(2.5e-4, 0, 1.0),
        n_steps=128,
        batch_size=256,
        n_epochs=4,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.1,
        clip_range_vf=0.1,
        normalize_advantage=True,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=float("inf") if args.noclip else 0.5,
        use_sde=False,
        target_kl=None,
        stats_window_size=100,
        policy_kwargs={
            "ortho_init": True,
            "features_extractor_class": NatureCNN,
            "share_features_extractor": True,
            "normalize_images": True,
        },
        seed=args.seed,
    )

    logger = configure(args.path, ["csv"])
    model.set_logger(logger)
    start_time = time.time()
    model.learn(total_timesteps=10000000, log_interval=1, progress_bar=True)
    train_end_time = time.time()
    mean_reward, _ = evaluate_policy(
        model,
        model.get_env(),
        n_eval_episodes=100,
        deterministic=False,
    )
    eval_end_time = time.time()
    args.training_time_h = ((train_end_time - start_time) / 60) / 60
    args.total_time_h = ((eval_end_time - start_time) / 60) / 60
    args.eval_mean_reward = mean_reward


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gpu",
        type=int,
        help="Specify GPU index",
        default=0,
    )
    parser.add_argument(
        "-e",
        "--env",
        type=str,
        help="Specify Atari environment w/o version",
        default="Pong",
    )
    parser.add_argument(
        "-t",
        "--trials",
        type=int,
        help="Specify number of trials",
        default=5,
    )
    parser.add_argument(
        "-nc",
        "--noclip",
        action="store_true",
        help="Only specify for no gradient clipping",
    )
    args = parser.parse_args()
    for _ in range(args.trials):
        args.id = uuid.uuid4().hex
        if args.noclip:
            args.path = os.path.join("trials", "ppo", f"{args.env}_NoClip", args.id)
        else:
            args.path = os.path.join("trials", "ppo", args.env, args.id)
        args.seed = int(time.time())

        # create dir
        pathlib.Path(args.path).mkdir(parents=True, exist_ok=True)

        # set gpu
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.gpu}"

        train_atari(args)

        # save trial info
        with open(os.path.join(args.path, "info.json"), "w") as f:
            json.dump(vars(args), f, indent=4)
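
Each trial writes its results to trials/ppo/<env>/<uuid>/info.json, so the per-environment rewards fed into the ANOVA above can be collected with a small script like the following (only the file layout comes from ppo_atari.py; the aggregation itself is illustrative):

```python
import glob
import json

import numpy as np

# Gather eval_mean_reward across all trials of one environment,
# following the directory layout written by ppo_atari.py above.
rewards = []
for path in glob.glob("trials/ppo/Atlantis/*/info.json"):
    with open(path) as f:
        rewards.append(json.load(f)["eval_mean_reward"])

print(f"n={len(rewards)}  mean={np.mean(rewards):.1f}  std={np.std(rewards):.1f}")
```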

Relevant log output / Error message
No response

System Info

  • OS: Linux-5.15.0-72-generic-x86_64-with-glibc2.31 #79~20.04.1-Ubuntu SMP Thu Apr 20 22:12:07 UTC 2023
  • Python: 3.11.9
  • Stable-Baselines3: 2.3.0
  • PyTorch: 2.3.1+cu121
  • GPU Enabled: True
  • Numpy: 1.26.4
  • Cloudpickle: 3.0.0
  • Gymnasium: 0.29.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.

Labels

question (Further information is requested)