Error while using MaskablePPO in sb3_contrib #1596

Closed · 5 tasks done
koliber31 opened this issue Jul 6, 2023 · 2 comments
Labels
custom gym env (Issue related to Custom Gym Env)

Comments


koliber31 commented Jul 6, 2023

🐛 Bug

Hi
I switched from PPO to MaskablePPO and since then I've been running into an error. Interestingly, the error doesn't occur immediately; it shows up after 100k-300k timesteps. A very similar (possibly identical) problem was described in #1553, but the author didn't explain how to fix it. My env is big enough that providing a minimal example is hard, but I'll try my best.

Code example

Setting my action mask

from draw_board import board
from move_choice import move_choice
from Move import Move
import numpy as np
import gymnasium as gym

def get_action_mask(env: gym.Env):
    # 37 tiles x 18 candidate moves per tile = 666 actions
    mask = np.zeros((666,), dtype=np.float32)
    for i in range(len(board)):
        if board[i].player == 1:
            moves = move_choice(i)
            for j in range(18):
                # a move is legal only if it exists and targets an empty tile
                if isinstance(moves[j], Move) and moves[j].player == 0:
                    mask[i * 18 + j] = 1
    return mask
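
A quick sanity check worth adding (checked_action_mask is a hypothetical wrapper, not part of the code above): MaskablePPO cannot form a valid distribution when every action in a state is masked out, so failing fast on an all-zero mask helps localize that case. It can be passed to ActionMasker in place of get_action_mask:

def checked_action_mask(env: gym.Env):
    # same mask as above, but assert that at least one action stays legal
    mask = get_action_mask(env)
    assert mask.any(), "action mask is all zeros: no legal action in this state"
    return mask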

Teaching agent the game

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker
from get_action_mask import get_action_mask
import os
from hexenv import HexEnv
import time

models_dir = f"models/{int(time.time())}/"
logdir = f"logs/{int(time.time())}/"

os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

env = HexEnv()
env = ActionMasker(env, get_action_mask)
env.reset()

model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 20000
iters = 0
while True:
    iters += 1
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS * iters}")
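
After training, predictions should respect the mask as well. A short sketch, assuming the same env and get_action_mask as above (MaskablePPO.predict accepts an action_masks argument):

obs, _ = env.reset()
# pass the current mask so the policy never samples an illegal action
action, _states = model.predict(obs, action_masks=get_action_mask(env))
obs, reward, terminated, truncated, info = env.step(action)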

Init and step functions in env

import numpy as np
import gymnasium as gym
from gymnasium import spaces

# shared observation buffer, one entry per board tile
npObs = np.zeros((37,), dtype=np.float32)

class HexEnv(gym.Env):
    def __init__(self):
        super(HexEnv, self).__init__()
        self.action_space = spaces.Discrete(666)
        self.observation_space = spaces.MultiDiscrete(3 * np.ones(37))

    def step(self, action):
        # phase() and phase2() are helpers from the game code (not shown)
        reward = 0
        done = False

        if self.phase == 1:
            self.phase, run, illegalMove, captured = phase(action)
            if illegalMove:
                reward = -1
                done = False

        if self.phase == 2:
            self.phase = phase2()

        if self.phase == 3:
            # Check which player won:
            # if the agent won, reward = 1; if the opponent won, reward = -1;
            # done = True
            pass

        if self.phase == 4:
            # the move was illegal, so penalize and end the episode
            done = True
            for i in range(len(board)):
                npObs[i] = board[i].player
            info = {}
            return npObs, reward, done, False, info

        for i in range(len(board)):
            npObs[i] = board[i].player
        info = {}
        return npObs, reward, done, False, info

Reset function

def reset(self, seed=None, options=None):
    super().reset(seed=seed)

    self.phase = 1
    self.done = False
    for pawn in board:
        pawn.player = 0
        pawn.color = 'white'

    # mark_tile(index, player) comes from the game code (not shown)
    for pawn in board:
        pawn.mark_tile(11, 1)  # set first player's pawns
        pawn.mark_tile(47, 1)
        pawn.mark_tile(74, 1)

        pawn.mark_tile(14, 2)  # set second player's pawns
        pawn.mark_tile(41, 2)
        pawn.mark_tile(77, 2)

    for i in range(len(board)):
        npObs[i] = board[i].player
    info = {}
    return npObs, info

Relevant log output / Error message

Traceback (most recent call last):
  File "C:\Users\Igor\Desktop\Maskowanie\leanHex.py", line 27, in <module>
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=f"PPO")
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 547, in learn
    self.train()
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 412, in train
    values, log_prob, entropy = self.policy.evaluate_actions(
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 324, in evaluate_actions
    distribution.apply_masking(action_masks)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 159, in apply_masking
    self.distribution.apply_masking(masks)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 67, in apply_masking
    super().__init__(logits=logits)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributions\categorical.py", line 66, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "C:\Users\Igor\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributions\distribution.py", line 62, in __init__
    raise ValueError(
ValueError: Expected parameter probs (Tensor of shape (64, 666)) of distribution MaskableCategorical(probs: torch.Size([64, 666]), logits: torch.Size([64, 666])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[9.9025e-05, 9.8432e-05, 9.9046e-05,  ..., 9.8342e-05, 9.8511e-05,
         9.8658e-05],
        [4.0191e-04, 3.9996e-04, 4.0048e-04,  ..., 3.9975e-04, 3.9897e-04,
         4.0053e-04],
        [5.1898e-04, 5.2030e-04, 5.1716e-04,  ..., 5.1855e-04, 5.1745e-04,
         5.1734e-04],
        ...,
        [2.0156e-04, 2.0234e-04, 2.0060e-04,  ..., 2.0211e-04, 2.0169e-04,
         2.0162e-04],
        [9.9025e-05, 9.8432e-05, 9.9046e-05,  ..., 9.8342e-05, 9.8511e-05,
         9.8658e-05],
        [2.0408e-04, 2.0522e-04, 2.0378e-04,  ..., 2.0520e-04, 2.0453e-04,
         2.0475e-04]], grad_fn=<SoftmaxBackward0>)
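
For context: the Simplex() constraint requires each row of probs to be non-negative and to sum to 1, so NaNs or numerical drift in the policy output trip this check. A minimal sketch that triggers the same ValueError with plain PyTorch (nothing here is specific to sb3_contrib):

import torch
from torch.distributions import Categorical

# a probability vector that does not sum to 1 violates Simplex()
bad_probs = torch.tensor([0.5, 0.6])
try:
    Categorical(probs=bad_probs)
except ValueError as e:
    print(e)  # Expected parameter probs ... to satisfy the constraint Simplex() ...

# with validation disabled, the constructor accepts the same tensor
Categorical(probs=bad_probs, validate_args=False)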

System Info

  • OS: Windows-10-10.0.19044-SP0 10.0.19044
  • Python: 3.10.10
  • Stable-Baselines3: 2.0.0
  • PyTorch: 2.0.1+cpu
  • GPU Enabled: False
  • Numpy: 1.24.3
  • Cloudpickle: 2.2.1
  • Gymnasium: 0.28.1
  • OpenAI Gym: 0.21.0


koliber31 added the custom gym env label on Jul 6, 2023
koliber31 (Author) commented

I found the answer in Stable-Baselines-Team/stable-baselines3-contrib#81. You have to change super().__init__(logits=logits) to super().__init__(logits=logits, validate_args=False) in sb3_contrib/common/maskable/distributions.py.
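
For reference, a sketch of where the change lands in sb3_contrib/common/maskable/distributions.py (the surrounding lines are paraphrased from the library, so treat them as approximate):

class MaskableCategorical(Categorical):
    def apply_masking(self, masks):
        ...
        # re-initialize the distribution with the masked logits;
        # validate_args=False skips the Simplex() check seen in the traceback
        super().__init__(logits=logits, validate_args=False)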

LiuXiao617111 commented

@koliber31 Do you mean we have to change the library's source file? Is there any other way to set the parameter validate_args to False?
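
One possibility I'm unsure about: PyTorch has a global switch, torch.distributions.Distribution.set_default_validate_args, and since apply_masking does not pass validate_args explicitly it should pick up that default. Something like:

import torch.distributions as td

# disable distribution argument validation globally before training;
# note this only silences the check, it does not fix NaNs in the policy output
td.Distribution.set_default_validate_args(False)

model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)
model.learn(total_timesteps=20000)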
