Error while using MaskablePPO in sb3_contrib #1596

Closed · 5 tasks done
koliber31 opened this issue Jul 6, 2023 · 2 comments
Labels
custom gym env (Issue related to Custom Gym Env)

Comments


koliber31 commented Jul 6, 2023

🐛 Bug

Hi
I switched from PPO to MaskablePPO and since then I've been running into an error. Interestingly, the error doesn't occur immediately; it shows up after 100k-300k timesteps. A very similar (possibly identical) problem was described in #1553, but the author didn't explain how to fix it. My env is big enough that providing a minimal example is hard, but I'll try my best.

Code example

Setting my action mask

from draw_board import board
from move_choice import move_choice
from Move import Move
import numpy as np
import gymnasium as gym

def get_action_mask(env: gym.Env):
    # 37 tiles x 18 candidate moves per tile = 666 actions
    mask = np.zeros((666,), dtype=np.float32)
    for i in range(len(board)):
        if board[i].player == 1:
            moves = move_choice(i)
            for j in range(18):
                # a move is legal only if it exists and targets an empty tile
                if isinstance(moves[j], Move) and moves[j].player == 0:
                    mask[i * 18 + j] = 1
    return mask
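
A quick sanity check worth adding (checked_action_mask is a hypothetical wrapper, not part of the code above): MaskablePPO cannot form a valid distribution when every action in a state is masked out, so failing fast on an all-zero mask helps localize that case. It can be passed to ActionMasker in place of get_action_mask:

def checked_action_mask(env: gym.Env):
    # same mask as above, but assert that at least one action stays legal
    mask = get_action_mask(env)
    assert mask.any(), "action mask is all zeros: no legal action in this state"
    return mask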

Teaching agent the game

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker
from get_action_mask import get_action_mask
import os
from hexenv import HexEnv
import time

models_dir = f"models/{int(time.time())}/"
logdir = f"logs/{int(time.time())}/"

os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

env = HexEnv()
env = ActionMasker(env, get_action_mask)
env.reset()

model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 20000
iters = 0
while True:
    iters += 1
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS * iters}")
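
After training, predictions should respect the mask as well. A short sketch, assuming the same env and get_action_mask as above (MaskablePPO.predict accepts an action_masks argument):

obs, _ = env.reset()
# pass the current mask so the policy never samples an illegal action
action, _states = model.predict(obs, action_masks=get_action_mask(env))
obs, reward, terminated, truncated, info = env.step(action)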

Init and step functions in env

import numpy as np
import gymnasium as gym
from gymnasium import spaces

# shared observation buffer, one entry per board tile
npObs = np.zeros((37,), dtype=np.float32)

class HexEnv(gym.Env):
    def __init__(self):
        super(HexEnv, self).__init__()
        self.action_space = spaces.Discrete(666)
        self.observation_space = spaces.MultiDiscrete(3 * np.ones(37))

    def step(self, action):
        # phase() and phase2() are helpers from the game code (not shown)
        reward = 0
        done = False

        if self.phase == 1:
            self.phase, run, illegalMove, captured = phase(action)
            if illegalMove:
                reward = -1
                done = False

        if self.phase == 2:
            self.phase = phase2()

        if self.phase == 3:
            # Check which player won:
            # if the agent won, reward = 1; if the opponent won, reward = -1;
            # done = True
            pass

        if self.phase == 4:
            # the move was illegal, so penalize and end the episode
            done = True
            for i in range(len(board)):
                npObs[i] = board[i].player
            info = {}
            return npObs, reward, done, False, info

        for i in range(len(board)):
            npObs[i] = board[i].player
        info = {}
        return npObs, reward, done, False, info

Reset function

def reset(self, seed=None, options=None):
    super().reset(seed=seed)

    self.phase = 1
    self.done = False
    for pawn in board:
        pawn.player = 0
        pawn.color = 'white'

    # mark_tile(index, player) comes from the game code (not shown)
    for pawn in board:
        pawn.mark_tile(11, 1)  # set first player's pawns
        pawn.mark_tile(47, 1)
        pawn.mark_tile(74, 1)

        pawn.mark_tile(14, 2)  # set second player's pawns
        pawn.mark_tile(41, 2)
        pawn.mark_tile(77, 2)

    for i in range(len(board)):
        npObs[i] = board[i].player
    info = {}
    return npObs, info

Relevant log output / Error message

Traceback (most recent call last):
  File "C:\Users\Igor\Desktop\Maskowanie\leanHex.py", line 27, in <module>
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=f"PPO")
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 547, in learn
    self.train()
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 412, in train
    values, log_prob, entropy = self.policy.evaluate_actions(
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 324, in evaluate_actions
    distribution.apply_masking(action_masks)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 159, in apply_masking
    self.distribution.apply_masking(masks)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 67, in apply_masking
    super().__init__(logits=logits)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributions\categorical.py", line 66, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributions\distribution.py", line 62, in __init__
    raise ValueError(
ValueError: Expected parameter probs (Tensor of shape (64, 666)) of distribution MaskableCategorical(probs: torch.Size([64, 666]), logits: torch.Size([64, 666])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[9.9025e-05, 9.8432e-05, 9.9046e-05,  ..., 9.8342e-05, 9.8511e-05,
         9.8658e-05],
        [4.0191e-04, 3.9996e-04, 4.0048e-04,  ..., 3.9975e-04, 3.9897e-04,
         4.0053e-04],
        [5.1898e-04, 5.2030e-04, 5.1716e-04,  ..., 5.1855e-04, 5.1745e-04,
         5.1734e-04],
        ...,
        [2.0156e-04, 2.0234e-04, 2.0060e-04,  ..., 2.0211e-04, 2.0169e-04,
         2.0162e-04],
        [9.9025e-05, 9.8432e-05, 9.9046e-05,  ..., 9.8342e-05, 9.8511e-05,
         9.8658e-05],
        [2.0408e-04, 2.0522e-04, 2.0378e-04,  ..., 2.0520e-04, 2.0453e-04,
         2.0475e-04]], grad_fn=<SoftmaxBackward0>)
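
For context: the Simplex() constraint requires each row of probs to be non-negative and to sum to 1, so NaNs or numerical drift in the policy output trip this check. A minimal sketch that triggers the same ValueError with plain PyTorch (nothing here is specific to sb3_contrib):

import torch
from torch.distributions import Categorical

# a probability vector that does not sum to 1 violates Simplex()
bad_probs = torch.tensor([0.5, 0.6])
try:
    Categorical(probs=bad_probs)
except ValueError as e:
    print(e)  # Expected parameter probs ... to satisfy the constraint Simplex() ...

# with validation disabled, the constructor accepts the same tensor
Categorical(probs=bad_probs, validate_args=False)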

System Info

  • OS: Windows-10-10.0.19044-SP0 10.0.19044
  • Python: 3.10.10
  • Stable-Baselines3: 2.0.0
  • PyTorch: 2.0.1+cpu
  • GPU Enabled: False
  • Numpy: 1.24.3
  • Cloudpickle: 2.2.1
  • Gymnasium: 0.28.1
  • OpenAI Gym: 0.21.0


koliber31 added the custom gym env label on Jul 6, 2023
koliber31 (Author) commented

I found the answer in Stable-Baselines-Team/stable-baselines3-contrib#81. You have to change super().__init__(logits=logits) to super().__init__(logits=logits, validate_args=False) in sb3_contrib/common/maskable/distributions.py.
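
For reference, a sketch of where the change lands in sb3_contrib/common/maskable/distributions.py (the surrounding lines are paraphrased from the library, so treat them as approximate):

class MaskableCategorical(Categorical):
    def apply_masking(self, masks):
        ...
        # re-initialize the distribution with the masked logits;
        # validate_args=False skips the Simplex() check seen in the traceback
        super().__init__(logits=logits, validate_args=False)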

LiuXiao617111 commented

@koliber31 Do you mean we have to change the library's source file? Is there any other way to set the parameter validate_args to False?
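
One possibility I'm unsure about: PyTorch has a global switch, torch.distributions.Distribution.set_default_validate_args, and since apply_masking does not pass validate_args explicitly it should pick up that default. Something like:

import torch.distributions as td

# disable distribution argument validation globally before training;
# note this only silences the check, it does not fix NaNs in the policy output
td.Distribution.set_default_validate_args(False)

model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)
model.learn(total_timesteps=20000)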
