Too many errors when customizing policy, a full example for Off-Policy Algorithms should be added in user guide #425
Comments
Hello, I would also highly recommend reading more about DDPG, because the current architecture you wrote here cannot work with it (the actor requires a tanh at the output, and the critic is a Q-value function that needs the action as input): https://spinningup.openai.com/en/latest/algorithms/ddpg.html
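For illustration, here is a minimal PyTorch sketch (not from the thread) of the two architectural constraints mentioned above; the observation and action dimensions are placeholders:

```python
import torch as th
from torch import nn

obs_dim, action_dim = 8, 2  # hypothetical dimensions, for illustration only

# DDPG actor: must end with a tanh to squash actions into [-1, 1]
actor = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),
    nn.Tanh(),
)

# DDPG critic: a Q-value function that takes the action as input,
# i.e. it maps (observation, action) -> scalar Q-value
critic = nn.Sequential(
    nn.Linear(obs_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

obs = th.zeros(1, obs_dim)
action = actor(obs)
q_value = critic(th.cat([obs, action], dim=1))
```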
To answer your original request, giving an advanced customization example for any off-policy algorithm is not really possible, as each algorithm would need a different one. Also, customizing the network beyond what we present requires good knowledge of the algorithm, which is why I would favor reading the code in that case (we should probably add such a warning in the doc).
There is nothing special about the DDPG algorithm that I used. The beginning of the custom policy user guide does not say that the example shown on the page is only for on-policy algorithms. If what you said is true (giving an advanced customization example for any off-policy algorithm is not really possible), I suggest that the instructions for customizing policy networks for off-policy algorithms be removed from the guide page; otherwise, they will confuse users. Thank you for your advice and honest answer, best wishes.
A custom feature extractor works the same way for on- and off-policy algorithms, which is why there is no specific warning.
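For concreteness, a minimal sketch (not from this thread) of how a custom feature extractor is plugged in through policy_kwargs for an off-policy algorithm; the fully-connected actor/critic heads defined by net_arch are built on top of it. The environment id, class name, and layer sizes are placeholders:

```python
import gym
import torch as th
from torch import nn

from stable_baselines3 import TD3
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class SmallMLPExtractor(BaseFeaturesExtractor):
    """Hypothetical feature extractor for flat (Box) observations."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        self.net = nn.Sequential(
            nn.Linear(observation_space.shape[0], features_dim),
            nn.ReLU(),
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.net(observations)


policy_kwargs = dict(
    features_extractor_class=SmallMLPExtractor,
    features_extractor_kwargs=dict(features_dim=64),
    net_arch=[400, 300],  # fully-connected heads built on top of the extractor
)

model = TD3("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs, verbose=1)
```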
Behind the custom feature extractor, there should be a fully-connected network, in both on- and off-policy algorithms.
As I wrote before, for that you would need to take a look at the code.

```python
from typing import List, Optional, Tuple, Type

import gym
import torch as th
from torch import nn

from stable_baselines3 import TD3
from stable_baselines3.common.policies import BaseModel
from stable_baselines3.common.preprocessing import get_action_dim
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, create_mlp
from stable_baselines3.td3.policies import Actor, TD3Policy


class CustomActor(Actor):
    """
    Actor network (policy) for TD3.
    """

    def __init__(self, *args, **kwargs):
        super(CustomActor, self).__init__(*args, **kwargs)
        # Define custom network with Dropout
        # WARNING: it must end with a tanh activation to squash the output
        self.mu = nn.Sequential(...)


class CustomContinuousCritic(BaseModel):
    """
    Critic network(s) for DDPG/SAC/TD3.
    """

    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        net_arch: List[int],
        features_extractor: nn.Module,
        features_dim: int,
        activation_fn: Type[nn.Module] = nn.ReLU,
        normalize_images: bool = True,
        n_critics: int = 2,
        share_features_extractor: bool = True,
    ):
        super().__init__(
            observation_space,
            action_space,
            features_extractor=features_extractor,
            normalize_images=normalize_images,
        )

        action_dim = get_action_dim(self.action_space)
        self.share_features_extractor = share_features_extractor
        self.n_critics = n_critics
        self.q_networks = []
        for idx in range(n_critics):
            # q_net = create_mlp(features_dim + action_dim, 1, net_arch, activation_fn)
            # Define critic with Dropout here
            q_net = nn.Sequential(...)
            self.add_module(f"qf{idx}", q_net)
            self.q_networks.append(q_net)

    def forward(self, obs: th.Tensor, actions: th.Tensor) -> Tuple[th.Tensor, ...]:
        # Learn the features extractor using the policy loss only
        # when the features_extractor is shared with the actor
        with th.set_grad_enabled(not self.share_features_extractor):
            features = self.extract_features(obs)
        qvalue_input = th.cat([features, actions], dim=1)
        return tuple(q_net(qvalue_input) for q_net in self.q_networks)

    def q1_forward(self, obs: th.Tensor, actions: th.Tensor) -> th.Tensor:
        """
        Only predict the Q-value using the first network.
        This allows to reduce computation when all the estimates are not needed
        (e.g. when updating the policy in TD3).
        """
        with th.no_grad():
            features = self.extract_features(obs)
        return self.q_networks[0](th.cat([features, actions], dim=1))


class CustomTD3Policy(TD3Policy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs)

    def make_actor(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomActor:
        actor_kwargs = self._update_features_extractor(self.actor_kwargs, features_extractor)
        return CustomActor(**actor_kwargs).to(self.device)

    def make_critic(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomContinuousCritic:
        critic_kwargs = self._update_features_extractor(self.critic_kwargs, features_extractor)
        return CustomContinuousCritic(**critic_kwargs).to(self.device)


# To register the policy, so you can use a string to create the network
# TD3.policy_aliases["CustomTD3Policy"] = CustomTD3Policy
```

You can find a complete example for SAC here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/feat/densemlp/utils/networks.py (EDIT by @qgallouedec: You have to replace …)
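A possible way to use the custom policy above, once the `nn.Sequential(...)` placeholders are filled in with real layers (a sketch; the environment id is a placeholder):

```python
from stable_baselines3 import TD3

# The custom policy class can be passed directly instead of a registered string alias
model = TD3(CustomTD3Policy, "Pendulum-v1", verbose=1)
model.learn(total_timesteps=10_000)
```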
Thank you for explaining it in detail. I will try it this weekend.
Hello @araffin, is it possible to use Softmax instead of Tanh as the output layer of the policy in the TD3 (or SAC) algorithm?
Hello @araffin, if I add a feature extractor network, will the feature extractor layers also be trained (will the weights of the feature extractor network be updated every time the actor and critic networks are updated)?
Yes, it will. For A2C/PPO, there will be a shared features extractor that uses both the policy and critic loss; for SAC/TD3, if you use the latest SB3 version (and you should ;)), there will be independent copies of the features extractor, one for the actor and one for the critic. EDIT: if you want to freeze the features extractor, it is also possible, but I would recommend using a wrapper (gym wrapper or …)
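As an illustration of the wrapper approach mentioned in the comment above, a hypothetical sketch (not SB3 code) of a gym observation wrapper that applies a frozen, pre-trained network to observations, so the RL algorithm never updates its weights:

```python
import gym
import numpy as np
import torch as th


class FrozenFeaturesWrapper(gym.ObservationWrapper):
    """Hypothetical wrapper: observations are passed through a frozen,
    pre-trained feature network before reaching the RL algorithm."""

    def __init__(self, env: gym.Env, feature_net: th.nn.Module, features_dim: int):
        super().__init__(env)
        self.feature_net = feature_net.eval()
        for param in self.feature_net.parameters():
            param.requires_grad_(False)
        # The agent now observes the extracted features instead of raw observations
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(features_dim,), dtype=np.float32
        )

    def observation(self, obs):
        with th.no_grad():
            features = self.feature_net(th.as_tensor(obs, dtype=th.float32).unsqueeze(0))
        return features.squeeze(0).numpy()
```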
I am trying to migrate my paper's code to Stable Baselines3; the original code of my paper runs well. In Stable Baselines3, my custom environment has passed check_env.
In particular, I have found that most scholars and researchers are not aware of the importance of customizing neural networks for complex missions when using deep reinforcement learning.
I also think the Stable Baselines3 user guide does not clearly tell users how to customize policy networks for off-policy algorithms. A full custom-policy example for off-policy algorithms should be shown in the user guide or in the example code; otherwise, it will confuse users.
Based on the analysis above, and in the hope that a full example will be added to the user guide, I paste all my custom policy network code below to explain the problem more clearly.
### Describe the bug
I followed the Stable Baselines3 documentation to customize the policy network for the DDPG algorithm, but it always raises errors when defining the DDPG model. If I remove the action_noise parameter, another error appears.
The code is shown below (all of it follows the user guide):
### Code example
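(A minimal sketch, not the reporter's actual code, of a DDPG setup with action noise and policy_kwargs along the lines described above; the environment id and network sizes are placeholders.)

```python
import gym
import numpy as np

from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")  # placeholder environment

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# For off-policy algorithms, net_arch is a list of layer sizes
# (or a dict with "pi"/"qf" keys), not the on-policy pi/vf format
policy_kwargs = dict(net_arch=[400, 300])

model = DDPG(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    policy_kwargs=policy_kwargs,
    verbose=1,
)
model.learn(total_timesteps=10_000)
```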
I got the following error:
### System Info
### Checklist