
I have seen issue #425, but I have new questions about the off-policy algorithm #988


Closed
1 of 2 tasks
Zero1366166516 opened this issue Jul 29, 2022 · 11 comments
Labels: question (Further information is requested), trading warning (Trading with RL is usually a bad idea)

@Zero1366166516 commented Jul 29, 2022

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.
Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

📚 Documentation

The problem with the off-policy network has been bothering me for several days.
Following the example, I created a class (CustomCNN) as the feature extractor and defined

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)

A CNN is used as the feature extractor, and the code is as follows:
The following is a class I modified from the example, because I want to build a feature extractor that extracts features from time series using a CNN network.

class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of units for the last layer.
    """

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-processing or wrapper
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, self.features_dim), nn.Tanh())

Now the problem is in the forward function. The first sample the program receives has shape [1, 1, 13], and the second has shape [1, 128, 13]. So I added a check: if the shape changes, redefine the nn.Sequential.

def forward(self, observations: th.Tensor) -> th.Tensor:
    n_flatten = np.array(observations).shape[1]
    features_dim = np.array(observations).shape[0]
    print(features_dim, n_flatten)
    if features_dim != 1:
        self.cnn = nn.Sequential(
            nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    return self.linear(self.cnn(observations))

The error traceback is:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1491, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 487, in
trained_sac = agent.train_model(
File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 409, in train_model
model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 292, in learn
return super(SAC, self).learn(
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 366, in learn
self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 206, in train
actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 180, in action_log_prob
mean_actions, log_std, kwargs = self.get_action_dist_params(obs)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 163, in get_action_dist_params
latent_pi = self.latent_pi(features)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)

I know there is a fully connected layer after the feature extractor, but why does [1, 1, 13] work while [1, 128, 13] raises an error?
I'm puzzled. Also, could you add a more detailed introduction to the documentation about the feature extractor and the fully connected layer?
I don't know if my analysis is correct. Please take a look. Thank you very much!
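To illustrate the mismatch the traceback points at (a minimal sketch added here for illustration, not code from the model, and only my reading of the error message): the extractor was registered with features_dim = 1, so the policy's first Linear layer was built to expect 1 input feature; a single observation still fits, but a replay batch of 128 samples with 128 features each does not.

import torch as th
import torch.nn as nn

# Hypothetical stand-in for latent_pi's first layer, built for features_dim = 1
linear = nn.Linear(1, 256)
linear(th.zeros(1, 1))      # works: one sample with one feature
linear(th.zeros(128, 128))  # RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)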
Here is some more code of the model:

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)

def get_model(
    self,
    model_name: str,
    policy: str = "MlpPolicy",
    # policy: str = "MultiInputPolicy",
    policy_kwargs: dict = policy_kwargs,
    model_kwargs: dict = None,
    verbose: int = 1
) -> Any:
    # print("set Debug!")
    if model_name not in MODELS:
        raise NotImplementedError("NotImplementedError")
    if model_kwargs is None:
        model_kwargs = MODEL_KWARGS[model_name]
    if "action_noise" in model_kwargs:
        n_actions = self.env.action_space.shape[-1]
        model_kwargs["action_noise"] = NOISE[model_kwargs["action_noise"]](
            mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
        )
    print(model_kwargs)
    model = MODELS[model_name](
        policy=policy,
        env=self.env,
        tensorboard_log="{}/{}".format(config.TENSORBOARD_LOG_DIR, model_name),
        verbose=verbose,
        policy_kwargs=policy_kwargs,
        **model_kwargs
    )
    return model

def train_model(
    self, model: Any, tb_log_name: str, total_timesteps: int = 5000
) -> Any:
    """train model"""
    model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
    return model
if name == "main":
from pull_data import Pull_data
from preprocessors import FeatureEngineer, split_data
from utils import config
import time
# pull data
#df = Pull_data(config.SSE_50[:2], save_data=False).pull_data()
df = Pull_data(config.SSE_50[:2]).pull_data()
df = FeatureEngineer().preprocess_data(df)
df = split_data(df, '2009-01-01', '2019-01-01')
print(df.head())
#
stock_dimension = len(df.tic.unique()) # 2
state_space = 1 + 2*stock_dimension +
len(config.TECHNICAL_INDICATORS_LIST)*stock_dimension # 23
print("stock_dimension: {}, state_space: {}".format(stock_dimension, state_space))
env_kwargs = {
#"stock_dim": stock_dimension,
"hmax": 100,
"initial_amount": 1e6,
"buy_cost_pct": 0.001,
"sell_cost_pct": 0.001,
#"reward_scaling": 1e-4,
#"state_space": state_space,
#"action_space": stock_dimension,
#"tech_indicator_list": config.TECHNICAL_INDICATORS_LIST
}
# test env
e_train_gym = StockLearningEnv(df=df, **env_kwargs)
## mulpt test
observation = e_train_gym.reset()
count = 0
for t in range(10):
action = e_train_gym.action_space.sample()
observation, reward, done, info = e_train_gym.step(action)
if done:
break
count+=1
time.sleep(0.2)
print("observation: ", observation)
print("action: ", action)
print("reward: {}, done: {},info: {}".format(reward, done, info))
# test model
env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))
##register_policy('CustomPolicy', CustomPolicy)
##register_policy('CustomActorCriticPolicy', CustomActorCriticPolicy)
agent = DRL_Agent(env= env_train)
SAC_PARAMS = {
"batch_size": 128,
"buffer_size": 1000000,
"learning_rate": 0.0001,
"learning_starts": 100,
"ent_coef": "auto_0.1"
}
model_sac = agent.get_model("sac", model_kwargs=SAC_PARAMS)
trained_sac = agent.train_model(
model=model_sac,
tb_log_name='sac',
total_timesteps= 50000
)

A clear and concise description of what should be improved in the documentation.

### Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
Zero1366166516 added the documentation (Improvements or additions to documentation) label Jul 29, 2022
@qgallouedec
Collaborator

As explained in the issue template and three times in #982, we can't help you if you don't provide us with well-formatted, minimal code to reproduce the error you encounter.
Are you having trouble understanding what this means?

@Zero1366166516
Author

Sorry, I only registered recently and didn't understand the rules of GitHub.

@Zero1366166516
Author

I'll try my best to provide a good minimal, reproducible example. Thank you for your selfless help!
Following the example, I created a class (CustomCNN) as the feature extractor and defined

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)

The following is a class I modified from the example, because I want to build a feature extractor that extracts features from time series using a CNN network.

def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
    super(CustomCNN, self).__init__(observation_space, features_dim)
    # We assume CxHxW images (channels first)
    # Re-ordering will be done by pre-preprocessing or wrapper
    n_input_channels = observation_space.shape[0]
    self.cnn = nn.Sequential(
        nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
        nn.ReLU(),
        nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
        nn.ReLU(),
        nn.Flatten(),
    )
    with th.no_grad():
        n_flatten = self.cnn(
            th.as_tensor(observation_space.sample()[None]).float()
        ).shape[1]
    self.linear = nn.Sequential(nn.Linear(n_flatten, self.features_dim), nn.Tanh())

Now the problem is in the forward function. The first sample the program receives has shape [1, 1, 13], and the second has shape [1, 128, 13]. So I added a check: if the shape changes, redefine the nn.Sequential.

def forward(self, observations: th.Tensor) -> th.Tensor:
    n_flatten = np.array(observations).shape[1]
    features_dim = np.array(observations).shape[0]
    print(features_dim, n_flatten)
    if features_dim != 1:
        self.cnn = nn.Sequential(
            nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    return self.linear(self.cnn(observations))

The following error occurred:

return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)

I know there is a fully connected layer after the feature extractor, but why does [1, 1, 13] work while [1, 128, 13] raises an error?
I'm puzzled. Also, could you add a more detailed introduction to the documentation about the feature extractor and the fully connected layer?
I don't know whether what I'm writing now meets the requirements; sorry for the trouble.

@Zero1366166516
Author

I want to use the MlpPolicy policy network with a CNN as the feature extractor. I don't know whether this is feasible, or whether I must customize the policy network to achieve this.

araffin added the question (Further information is requested) label and removed the documentation (Improvements or additions to documentation) label Jul 29, 2022
@qgallouedec (Collaborator) commented Jul 29, 2022

I want to use mlppolicy policy network and CNN network as the feature extractor.

I think you don't have a clear understanding of policies in SB3: the feature extractor is the first stage of any policy. You should read the documentation: SB3 Policy

I don't know whether what I'm writing now meets the requirements, which has caused you trouble. Sorry.

It doesn't. You need to provide code that I can just copy-paste and run to reproduce the error. WARNING: this code has to be MINIMAL: if a line can be removed without removing the error, your code is not minimal.
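To make that structure concrete, here is a minimal, self-contained sketch of the documented pattern (an illustration, using Pendulum-v1 as a stand-in environment rather than the trading env): a custom BaseFeaturesExtractor passed to the standard MlpPolicy via policy_kwargs, with its layers sized once from the observation space so they work for any batch size.

import gym
import torch as th
import torch.nn as nn
from stable_baselines3 import SAC
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class SimpleFlattenExtractor(BaseFeaturesExtractor):
    """Toy extractor: flatten the observation and map it to `features_dim` features."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_input = int(th.prod(th.as_tensor(observation_space.shape)))
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n_input, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.net(observations)


policy_kwargs = dict(
    features_extractor_class=SimpleFlattenExtractor,
    features_extractor_kwargs=dict(features_dim=64),
    net_arch=dict(qf=[256, 256], pi=[256, 256]),
)
model = SAC("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=1_000)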

@qgallouedec
Collaborator

I also understand that you want to implement a feature extractor with Conv1D. If that's the case, you should check the other issues that discuss this topic.

@Zero1366166516
Author

Thank you for your help. I'm writing a DRL algorithm for stock portfolio returns. The idea is to try MLP, CNN, and LSTM as feature extractors and compare which works best for financial time series.

I have found the cause of the problem with the customized CNN feature extractor above, and it has been solved. Thank you again for your help.

Excuse me, could you add an example of an LSTM feature extractor to the documentation?

@qgallouedec
Collaborator

Please provide the fix so that other people can benefit from it.

@qgallouedec
Collaborator

Excuse me, can you add another example of LSTM feature extractor to the document?

If you think that the documentation can be improved, for example by adding more examples, feel free to open a PR.

@Zero1366166516
Author

OK, I'll post the modified code. This is my rewrite based on the example program: a custom feature extractor using a CNN, used with the MlpPolicy policy network.

class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of units for the last layer.
    """

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
            #print("n_flatten", n_flatten)
            ##print("cnn", self.cnn)

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    def forward(self, observations: th.Tensor) -> th.Tensor:
        with th.no_grad():
            n_flatten = np.array(observations).shape[-1]
            features_dim = np.array(observations).shape[-2]
            #print(features_dim, n_flatten, np.array(observations).shape)
            i = 0
            j = 0
            if features_dim != 1:
                self.cnn = nn.Sequential(
                    nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Flatten(),
                )
                self.linear = nn.Sequential(nn.Linear(n_flatten, 1), nn.Tanh())
                i += 1
            else:
                j += 1
                self.cnn = nn.Sequential(
                    nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Flatten(),
                )
                self.linear = nn.Sequential(nn.Linear(n_flatten, 1), nn.Tanh())
        return self.linear(self.cnn(observations))

The following is the code in the main program:

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[128, 128], pi=[256, 256])
)

def get_model(
    self,
    model_name: str,
    policy: str = "MlpPolicy",
    # policy: str = "MultiInputPolicy",
    policy_kwargs: dict = policy_kwargs,
    model_kwargs: dict = None,
    verbose: int = 1
) -> Any:
    # print("set Debug!")
    if model_name not in MODELS:
        raise NotImplementedError("NotImplementedError")

    if model_kwargs is None:
        model_kwargs = MODEL_KWARGS[model_name]

    if "action_noise" in model_kwargs:
        n_actions = self.env.action_space.shape[-1]
        model_kwargs["action_noise"] = NOISE[model_kwargs["action_noise"]](
            mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
        )
    print(model_kwargs)
    model = MODELS[model_name](
        policy=policy,
        env=self.env,
        tensorboard_log="{}/{}".format(config.TENSORBOARD_LOG_DIR, model_name),
        verbose=verbose,
        policy_kwargs=policy_kwargs,
        **model_kwargs
    )
    return model
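For comparison, a possible alternative to rebuilding the layers inside forward (a sketch under the assumption of a channels-first 2-D observation of shape (channels, length); this is not the author's posted fix): size the Conv1d and Linear layers once from the observation space, so the same parameters handle a batch of 1 or of 128 and stay stable across training steps.

import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class Conv1dExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        # Channel count and sequence length both come from the observation space,
        # so nothing depends on the batch size and nothing is rebuilt in forward().
        n_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        with th.no_grad():
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # Works for any leading batch dimension: (B, channels, length) -> (B, features_dim)
        return self.linear(self.cnn(observations))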

araffin added the trading warning (Trading with RL is usually a bad idea) label Jul 30, 2022
@araffin (Member) commented Jul 30, 2022

Excuse me, can you add another example of LSTM feature extractor to the document?

As stated in the documentation, only RecurrentPPO (from SB3 contrib) has LSTM support.
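For reference, a minimal sketch of that option (sb3-contrib's RecurrentPPO with an LSTM policy; CartPole-v1 is just a stand-in environment):

from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=5_000)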

Closing as the original question was answered.


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I also must warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.

  • I was a quant trader at a major hedge fund for several years. I am now retired.
  • On average, traders lose money. On average, retail traders especially lose money. An excellent approximation of trading, and especially of retail trading, is 'gambling'.
  • There is a lot more bad advice on trading out there than good advice. It is extraordinarily difficult to demonstrate that any particular advice is some of the rare good advice.
  • As such, it's reasonable to treat all commentary on retail trading as an epsilon away from snake oil salesmanship. Sometimes that'll be wrong, but it's a strong rule of thumb.
  • I feel a sense of responsibility to the less world-wise members of this community - which includes plenty of highschoolers - and so I find myself unable to let a conversation about retail trading occur without interceding and warning that it's very likely snake oil.
  • I find repeatedly making these warnings and the subsequent fights to be exhausting.

araffin closed this as completed Jul 30, 2022