[Question] Rationale for Absence of Explicit Gradient Clipping in SAC Implementation #2123
Comments
Hello,
The answer is pretty simple: we followed the original implementation. That said, I was also planning to give it a try (with https://github.com/araffin/sbx) to see whether it has any influence (no significant difference so far). There are other knobs that already act as stabilizers: smaller learning rate, bigger batch size, policy delay, smaller polyak coefficient, Adam beta parameters, ...
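One of the stabilizers mentioned above is the polyak (soft) target-network update that SAC performs every gradient step. A minimal pure-Python sketch of the idea (not the SB3 source; parameter values are illustrative, with `tau=0.005` matching SB3's default):

```python
# Polyak (soft) target update: target <- tau * online + (1 - tau) * target.
# A small tau makes the bootstrap targets move slowly, damping oscillations
# in the critic loss -- one reason explicit gradient clipping is rarely needed.
def polyak_update(online_params, target_params, tau=0.005):
    """Blend online parameters into the target parameters."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online_params, target_params)]

target = [0.0, 0.0]
online = [1.0, -1.0]
target = polyak_update(online, target)  # target moves only 0.5% toward online
```

In SB3 this corresponds to the `tau` argument of `SAC`; a smaller `tau` slows the targets further.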
Thank you for the clear explanation, @araffin! It makes sense that the implementation follows the original paper and that explicit clipping often isn't necessary for general performance with SAC.

Interestingly, in my own experiments using TQC from the sbx library, I've consistently observed very high actor and critic losses on the peg-unplug-side-v2 task. This observation is what led me to try adding gradient clipping (max_grad_norm) as a potential stabilization measure for what seems to be a challenging environment for the algorithm. Perhaps this suggests that, while not universally required, gradient clipping might still improve stability in certain particularly demanding environments, or perhaps more so for TQC than for standard SAC under such conditions.

I appreciate you sharing that you're also exploring this in sbx. I'd be interested to hear if you encounter similar environment-specific behavior in your tests.
Did it improve the performance in your case?
No... |
❓ Question
Hi Stable-Baselines3 team,
I've been studying the SAC implementation within the library and comparing it to other algorithms like PPO. I noticed that PPO includes an explicit max_grad_norm parameter for gradient clipping, which is often helpful for stabilizing training.
However, when looking through the SAC documentation and the source code (specifically the train method and optimizer steps for the actor and critics), I couldn't find an equivalent explicit mechanism or parameter for gradient clipping being applied by default during the main policy and value function updates.
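For context, what PPO's `max_grad_norm` does (via `torch.nn.utils.clip_grad_norm_`) is compute the global L2 norm over all parameter gradients and rescale them when it exceeds the threshold. A hedged pure-Python sketch of that computation, with gradients flattened to a plain list for illustration:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm.

    Mirrors the math of torch.nn.utils.clip_grad_norm_, but on a flat
    list of floats rather than parameter tensors.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient of norm 5.0 gets rescaled to unit norm:
grads, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

In a torch training loop this step would sit between `loss.backward()` and `optimizer.step()`, which is also where a user-added clip for SAC would go.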
My Question:
Could you please shed some light on the design decision behind not including explicit gradient clipping (like max_grad_norm) in the SAC actor and critic updates?
Is gradient clipping generally considered less critical for the stability or performance of SAC compared to algorithms like PPO?
Are there alternative mechanisms within the SAC algorithm (perhaps related to entropy maximization or the target network updates) that implicitly provide sufficient stabilization, making explicit clipping unnecessary?
Or is it simply an implementation choice left to the user to add if needed via custom policies or callbacks?
Understanding the reasoning here would help me better appreciate the nuances of the SAC algorithm and its implementation in Stable-Baselines3.
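On the entropy-maximization point above: SAC's automatic temperature tuning already pushes back against degenerate policies, which is one candidate implicit stabilizer. A hedged numeric sketch of the temperature update (a single gradient-descent step on the standard temperature loss from the SAC paper; the values below are illustrative, not SB3 internals):

```python
import math

def update_temperature(log_alpha, mean_log_prob, target_entropy, lr=3e-4):
    """One gradient step on L(log_alpha) = -log_alpha * (mean_log_prob + target_entropy).

    When the policy is more deterministic than the target entropy
    (mean_log_prob + target_entropy > 0), log_alpha increases, raising
    the entropy bonus and pushing the policy back toward exploration.
    """
    grad = -(mean_log_prob + target_entropy)  # d L / d log_alpha
    return log_alpha - lr * grad

log_alpha = 0.0
target_entropy = -6.0  # common heuristic: -action_dim, here for a 6-D action space
# Policy entropy (~ -8.0) is below the target (-6.0), so alpha should rise:
log_alpha = update_temperature(log_alpha, mean_log_prob=8.0, target_entropy=target_entropy)
alpha = math.exp(log_alpha)
```

In SB3 this mechanism corresponds to `ent_coef="auto"`, where `target_entropy` defaults to the negative action dimension.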