Executing RLBase.plan! after end of experiment #913
I keep writing answers to this but it's a little confusing. Imagine the stop_condition is after 1 step. The run loop would be
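roughly the following (a simplified sketch of the call order, not the verbatim run.jl source; hooks are omitted and the names are approximate):

```julia
# One iteration of the inner loop:
push!(policy, PreActStage(), env)
optimise!(policy, PreActStage())
action = RLBase.plan!(policy, env)          # plan the first (and only) action
act!(env, action)
push!(policy, PostActStage(), env)

if check_stop(stop_condition, policy, env)  # true, since we stop after 1 step
    # "let the policy see the last observation"
    push!(policy, PreActStage(), env)
    optimise!(policy, PreActStage())
    RLBase.plan!(policy, env)               # plan! runs again, after the experiment is over
end
```

So the action returned by that final plan! is never executed, but the call still happens.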
It indeed makes no sense, as we did not optimise even though we completed a step. But deleting the content of the if block would also prevent that optimisation from happening, if it was set up correctly.
In my opinion, this should be handled in the normal run loop, without any special treatment after the stop_condition is reached. It should happen right after the action is passed to the environment; push!(policy, PostActStage(), env) would be the perfect place for that, right? Especially for sequential multi-agent environments, this mismatch between environment steps and run-loop steps causes a big problem for the stored trajectories. I noticed this for PettingZoo, which I included in PR #782:
What I basically mean is that the default run cycle of an RL algorithm should be like (a minimal sketch follows at the end of this comment):

- signals from the environment / action of the agent
- execution of the planned action on the environment
- next state

For policies with trajectories / experience buffers: use …

For the sake of completeness (in src/ReinforcementLearningCore/src/core/run.jl): use push!(policy, PostActStage(), env) on line 104, delete everything that is connected to "let the policy see the last observation" (namely lines 110-113), and change the behaviour of …

I am currently working on 3 pull requests showing this issue in detail:
Especially in the last one, we can see that the rewards are not correctly assigned to the agents. Hence, we have to think about a generic way of collecting the signals during the run. Otherwise we have to look at how each policy takes care of it itself, with strange and hard-to-understand hacks like the one with the trajectory here. BUT in my opinion the way described above is the cleanest and easiest one.
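To make this concrete, here is the minimal sketch mentioned above of the run cycle I mean (hooks, episode resets and the exact optimise! placement are omitted; not meant as a drop-in replacement for _run):

```julia
while !check_stop(stop_condition, policy, env)
    push!(policy, PreActStage(), env)   # the policy observes the current state
    action = RLBase.plan!(policy, env)  # plan an action for that state
    act!(env, action)                   # execute the planned action on the environment
    push!(policy, PostActStage(), env)  # store reward / next state right here,
                                        # in sync with the environment step
end
# no extra "let the policy see the last observation" push!/plan! block afterwards
```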
@HenriDeh Sorry for the delayed response. I agree that the push! sequence doesn't work as well as it might. In my project that depends on RL.jl, I implemented a modified cache and push sequence for SART policies, here: https://github.com/jeremiahpslewis/AlgorithmicCompetition.jl/blob/main/src/AIAPC2020/rt_cache_sart_qlearning.jl
@HenriDeh @jeremiahpslewis I edited the comment above. Please feel free to give me feedback on my thoughts. Until this is resolved, policies may simply not work because of incorrectly gathered information from the environments.
I'm currently reworking Trajectories to correctly store transitions (see my mention above). It will have to be accompanied by a change to the current run loop. I will make sure to fix this issue in the process and tag you for a review. Although I don't understand what you mean by …
I agree that we still need to query the last action for algorithms such as SARSA, but can you elaborate on what you mean with multiple agents doing stuff? It sounds like something that is algorithm-specific and should be dealt with by overloading/specializing the push! method.
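Something along these lines, for example (purely illustrative: the policy name is made up, and whether Base.push! or an RLCore-owned push! is the right method to extend depends on the version):

```julia
using ReinforcementLearningBase, ReinforcementLearningCore  # assumed imports

# Hypothetical wrapper for a policy with algorithm-specific needs at episode end
struct MySarsaLikePolicy{P<:AbstractPolicy} <: AbstractPolicy
    inner::P
end

function Base.push!(p::MySarsaLikePolicy, ::PostEpisodeStage, env::AbstractEnv)
    # store the terminal observation (and any bootstrap action the algorithm
    # needs) here, instead of relying on the generic run loop to call plan!
    # on a finished environment
    push!(p.inner, PostEpisodeStage(), env)
end
```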
That is just a minor point, but there is a comment in the _run code which states "let the policy see the last observation" for this push! method. For sure this is algorithm-specific, but in my opinion this hook offers more possibilities than just seeing the last observation. That is what I meant by misleading.
As I described it? Or in another way? Because I think with the previous one it was pretty hard to understand how things are stored. In my opinion that can be done efficiently while keeping readability (and it eliminates a source of failures due to incorrectly stored experience).
Your suggestion involves storing … The cache would be cleared after plan!, because we need the cached state from the previous iteration.
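Roughly what I have in mind (names are made up, and a plain vector of named tuples stands in for the Trajectory to keep the sketch short):

```julia
import ReinforcementLearningBase as RLBase  # assumed import

# Hypothetical cache holding the state/action from the previous iteration
mutable struct SACache
    state::Any
    action::Any
end
SACache() = SACache(nothing, nothing)

function cached_plan!(cache::SACache, policy, transitions::Vector, env)
    s = RLBase.state(env)
    if cache.state !== nothing
        # complete the transition started in the previous iteration, pairing the
        # cached state/action with the reward and terminal signal observed now
        push!(transitions, (state = cache.state, action = cache.action,
                            reward = RLBase.reward(env),
                            terminal = RLBase.is_terminated(env)))
    end
    a = RLBase.plan!(policy, env)
    cache.state, cache.action = s, a  # cache cleared and refilled right after plan!
    return a
end
```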
But wouldn't this also store the states twice? Anyway, what you suggest is what I meant. Feel free to mention me in the PR, and if there are any design choices to discuss I can also provide help.
Why do we execute RLBase.plan! after the experiment is done? Also, shouldn't we test whether the environment is terminated?
For environments with the FULL_ACTION_SET ActionStyle, this yields an error, as the legal_action_mask is empty. Or should the policies be updated such that they do not return anything from plan! when the environment is terminated? BUT that behaviour would also be unexpected.
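For illustration, the termination test I mean would look roughly like this (simplified, not the actual multi_agent.jl code):

```julia
# Guard the trailing plan! call: a FULL_ACTION_SET environment has an empty
# legal_action_mask once it is terminated, so plan! errors on it.
if !is_terminated(env)
    RLBase.plan!(policy, env)  # only "let the policy see the last observation"
                               # while the environment can still provide legal actions
end
```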
src\ReinforcementLearningCore\src\policies\agent\multi_agent.jl:133:140
In my opinion, this should be completely omitted: only do …