I have seen the issue #425 problem, but I have new questions about the off-policy algorithm. #988
Comments
As explained in the issue template and three times in #982, we can't help you if you don't provide us with well-formatted, minimal code to reproduce the error you encounter.
Sorry, I only registered recently and didn't understand the rules of GitHub.
I will try my best to provide good minimal, compliant code. Thank you for your selfless help!
The following is a class I modified from the example, because I want a feature extractor that extracts features from time series, using a CNN.
The problem is now in the forward function: on the first sampling the input has shape [1, 1, 13], and on the second it has shape [1, 128, 13]. So I added a check: if the shape changes, the nn.Sequential is redefined.
The following error occurred:
I know there is a fully connected layer after the feature extractor, but why is [1, 1, 13] fine while [1, 128, 13] raises an error?
I want to use the MlpPolicy policy network with a CNN as the feature extractor. I don't know whether this approach is feasible, or whether I must customize the policy network to achieve this.
I think you don't have a clear understanding of policies in SB3: the feature extractor is the first stage of any policy. You should read the documentation: SB3 Policy
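For reference, the pattern from the SB3 custom feature extractor documentation looks roughly like this (a minimal sketch with placeholder layer sizes, not the code from this issue):

import gym
import numpy as np
import torch as th
import torch.nn as nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomExtractor(BaseFeaturesExtractor):
    """Minimal sketch: flatten the observation and project it to features_dim."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        # features_dim is the width every downstream layer is built against
        super().__init__(observation_space, features_dim)
        n_input = int(np.prod(observation_space.shape))
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_input, features_dim),
            nn.ReLU(),
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # The output must always have shape (batch_size, features_dim)
        return self.net(observations)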
It doesn't. You need to provide code that I can just copy-paste and run to reproduce the error. WARNING: this code has to be MINIMAL: if one line can be removed without removing the error, your code is not minimal.
I also understand that you want to implement a feature extractor with Conv1D. If that's the case, you should check the other issues that discuss this topic.
Thank you for your help. I'm writing a DRL algorithm for stock portfolio returns. The idea is to use MLP, CNN, and LSTM feature extractors and compare which is best for financial time series. I have found the cause of the problem with the customized CNN feature extractor above, and it has been solved. Thank you again for your help. Could you also add an example of an LSTM feature extractor to the documentation?
Please provide the fix so that other people can benefit from it.
If you think the documentation can be improved, for example by adding more examples, feel free to open a PR.
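In the meantime, here is a minimal sketch of what such an LSTM feature extractor could look like, assuming a 2D (seq_len, n_features) box observation; this is not an official example, and all sizes are placeholders:

import gym
import torch as th
import torch.nn as nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class LSTMExtractor(BaseFeaturesExtractor):
    """Sketch: encode a (seq_len, n_features) observation window with an LSTM."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        _, n_features = observation_space.shape  # assumes a 2D observation
        self.lstm = nn.LSTM(
            input_size=n_features, hidden_size=features_dim, batch_first=True
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # observations: (batch_size, seq_len, n_features)
        _, (h_n, _) = self.lstm(observations)
        # h_n: (num_layers, batch_size, hidden_size) -> keep the last layer
        return h_n[-1]

Note that this only encodes the current observation window at each step; it is not a recurrent policy that carries hidden state across environment steps (for that, see RecurrentPPO in sb3-contrib).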
OK, I'll send the modified code. This is my rewrite based on the example program: I wrote a custom CNN feature extractor for use with the MlpPolicy policy network.
The following is the code in the main program:
As stated in the documentation, closing, as the original question was answered. The following is an automated answer: as you seem to be trying to apply RL to stock trading, I must also warn you about it. Here is a recommendation from a former professional trader: retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.
Important Note: We do not do technical support or consulting, and we don't answer personal questions by email.
Please post your question on the RL Discord, Reddit, or Stack Overflow in that case.
📚 Documentation
The problem with the off-policy network has been bothering me for several days.
I used the example to create a class (CustomCNN) as the feature extractor, and defined:

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)
A CNN is used as the feature extractor, and the code is as follows:
### The following is a class I modified from the example, because I want a feature extractor that extracts features from time series, using a CNN.
The problem is now in the forward function: on the first sampling the input has shape [1, 1, 13], and on the second it has shape [1, 128, 13]. So I added a check: if the shape changes, the nn.Sequential is redefined, as in the hypothetical reconstruction below.
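(The class itself did not survive the page load. Purely to make the failure mode concrete, here is a hypothetical reconstruction of the pattern described above; the layer sizes and names are guesses, not the author's actual code:)

import gym
import torch as th
import torch.nn as nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """Hypothetical reconstruction of the problematic pattern."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super().__init__(observation_space, features_dim)
        # Layers sized for the first batch seen, shape (1, 1, 13)
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # Anti-pattern: rebuilding the network when the input shape changes
        # discards the trained weights and changes the output width, so it
        # no longer matches the Linear layer the policy was built with.
        if observations.shape[1] != self.cnn[0].in_channels:
            self.cnn = nn.Sequential(
                nn.Conv1d(observations.shape[1], 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Flatten(),
            )
        return self.cnn(observations)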
#### The error traceback is:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1491, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 487, in
trained_sac = agent.train_model(
File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 409, in train_model
model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 292, in learn
return super(SAC, self).learn(
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 366, in learn
self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 206, in train
actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 180, in action_log_prob
mean_actions, log_std, kwargs = self.get_action_dist_params(obs)
File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 163, in get_action_dist_params
latent_pi = self.latent_pi(features)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)
I know there is a fully connected layer after the feature extractor, but why is [1, 1, 13] fine while [1, 128, 13] raises an error?
I'm puzzled. Also, could a more detailed introduction be added to the documentation about the feature extractor and the fully connected layers?
I don't know whether my analysis is correct. Please take a look. Thank you very much!
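For context: SAC builds the first Linear layer of latent_pi once, from the declared features_dim, so the feature extractor must always return (batch_size, features_dim). The [1, 128, 13] input is almost certainly a replay batch of 128 samples whose batch dimension ended up in the wrong axis; the fix is to keep the batch in dim 0 and never rebuild layers in forward. Here is a sketch of the usual approach, adapted from the dummy-forward-pass trick in the SB3 custom CNN example, assuming a flat 1D observation (all layer sizes are guesses):

import gym
import torch as th
import torch.nn as nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """Sketch: Conv1d extractor whose output width is fixed at features_dim."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super().__init__(observation_space, features_dim)
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Size the final Linear once, via a dummy forward pass,
        # instead of rebuilding layers inside forward()
        with th.no_grad():
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample.unsqueeze(1)).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # observations: (batch_size, obs_dim) -> (batch_size, 1, obs_dim);
        # dim 0 stays the batch dimension on every call, so batch sizes
        # of 1 (rollout) and 128 (replay) both work unchanged
        return self.linear(self.cnn(observations.unsqueeze(1)))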
I'll send some more of the model code:
policy_kwargs = dict(
features_extractor_class=CustomCNN,
net_arch=dict(qf=[256, 256], pi=[256, 256])
)
def get_model(
    self,
    model_name: str,
    policy: str = "MlpPolicy",
    # policy: str = "MultiInputPolicy",
    policy_kwargs: dict = policy_kwargs,
    model_kwargs: dict = None,
    verbose: int = 1
) -> Any:
    # print("set Debug!")
    if model_name not in MODELS:
        raise NotImplementedError("NotImplementedError")
    if model_kwargs is None:
        model_kwargs = MODEL_KWARGS[model_name]
    if "action_noise" in model_kwargs:
        n_actions = self.env.action_space.shape[-1]
        model_kwargs["action_noise"] = NOISE[model_kwargs["action_noise"]](
            mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
        )
    print(model_kwargs)
    model = MODELS[model_name](
        policy=policy,
        env=self.env,
        tensorboard_log="{}/{}".format(config.TENSORBOARD_LOG_DIR, model_name),
        verbose=verbose,
        policy_kwargs=policy_kwargs,
        **model_kwargs
    )
    return model

def train_model(
    self, model: Any, tb_log_name: str, total_timesteps: int = 5000
) -> Any:
    """train model"""
    model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
    return model
if __name__ == "__main__":
    from pull_data import Pull_data
    from preprocessors import FeatureEngineer, split_data
    from utils import config
    import time

    # pull data
    # df = Pull_data(config.SSE_50[:2], save_data=False).pull_data()
    df = Pull_data(config.SSE_50[:2]).pull_data()
    df = FeatureEngineer().preprocess_data(df)
    df = split_data(df, '2009-01-01', '2019-01-01')
    print(df.head())

    stock_dimension = len(df.tic.unique())  # 2
    state_space = 1 + 2 * stock_dimension + len(config.TECHNICAL_INDICATORS_LIST) * stock_dimension  # 23
    print("stock_dimension: {}, state_space: {}".format(stock_dimension, state_space))

    env_kwargs = {
        # "stock_dim": stock_dimension,
        "hmax": 100,
        "initial_amount": 1e6,
        "buy_cost_pct": 0.001,
        "sell_cost_pct": 0.001,
        # "reward_scaling": 1e-4,
        # "state_space": state_space,
        # "action_space": stock_dimension,
        # "tech_indicator_list": config.TECHNICAL_INDICATORS_LIST
    }

    # test env
    e_train_gym = StockLearningEnv(df=df, **env_kwargs)

    ## multi-step test
    observation = e_train_gym.reset()
    count = 0
    for t in range(10):
        action = e_train_gym.action_space.sample()
        observation, reward, done, info = e_train_gym.step(action)
        if done:
            break
        count += 1
        time.sleep(0.2)
    print("observation: ", observation)
    print("action: ", action)
    print("reward: {}, done: {}, info: {}".format(reward, done, info))

    # test model
    env_train, _ = e_train_gym.get_sb_env()
    print(type(env_train))
    ## register_policy('CustomPolicy', CustomPolicy)
    ## register_policy('CustomActorCriticPolicy', CustomActorCriticPolicy)
    agent = DRL_Agent(env=env_train)
    SAC_PARAMS = {
        "batch_size": 128,
        "buffer_size": 1000000,
        "learning_rate": 0.0001,
        "learning_starts": 100,
        "ent_coef": "auto_0.1"
    }
    model_sac = agent.get_model("sac", model_kwargs=SAC_PARAMS)
    trained_sac = agent.train_model(
        model=model_sac,
        tb_log_name='sac',
        total_timesteps=50000
    )