Executing RLBase.plan! after end of experiment #913
I keep writing answers to this but it's a little confusing. Imagine the stop_condition is after 1 step. The run loop would be
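roughly the following (a simplified sketch of the call order, not the verbatim run.jl source; hooks are omitted and the names are approximate):

```julia
# One iteration of the inner loop:
push!(policy, PreActStage(), env)
optimise!(policy, PreActStage())
action = RLBase.plan!(policy, env)          # plan the first (and only) action
act!(env, action)
push!(policy, PostActStage(), env)

if check_stop(stop_condition, policy, env)  # true, since we stop after 1 step
    # "let the policy see the last observation"
    push!(policy, PreActStage(), env)
    optimise!(policy, PreActStage())
    RLBase.plan!(policy, env)               # plan! runs again, after the experiment is over
end
```

So the action returned by that final plan! is never executed, but the call still happens.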
It indeed makes no sense, as we did not optimise even though we completed a step. But deleting the content of the if block would also prevent that optimisation from happening, if it was set up correctly.
In my opinion, this should be handled in the normal run loop, without any special treatment after the stop_condition is reached. It should happen right after the action is passed to the environment; push!(policy, PostActStage(), env) would be the perfect place for that, right? Especially for sequential multi-agent environments, this mismatch between environment steps and run-loop steps causes a big problem for the stored trajectories. I noticed this for PettingZoo, which I included in PR #782:
What I basically mean is that the default run cycle of an RL algorithm should be like (a minimal sketch follows at the end of this comment):

- signals from the environment / action of the agent
- execution of the planned action on the environment
- next state

For policies with trajectories / experience buffers: use …

For the sake of completeness (in src/ReinforcementLearningCore/src/core/run.jl): use push!(policy, PostActStage(), env) on line 104, delete everything that is connected to "let the policy see the last observation" (namely lines 110-113), and change the behaviour of …

I am currently working on 3 pull requests showing this issue in detail:
Especially in the last one, we can see that the rewards are not correctly assigned to the agents. Hence, we have to think about a generic way of collecting the signals during the run. Otherwise we have to look at how each policy takes care of it itself, with strange and hard-to-understand hacks like the one with the trajectory here. BUT in my opinion the way described above is the cleanest and easiest one.
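To make this concrete, here is the minimal sketch mentioned above of the run cycle I mean (hooks, episode resets and the exact optimise! placement are omitted; not meant as a drop-in replacement for _run):

```julia
while !check_stop(stop_condition, policy, env)
    push!(policy, PreActStage(), env)   # the policy observes the current state
    action = RLBase.plan!(policy, env)  # plan an action for that state
    act!(env, action)                   # execute the planned action on the environment
    push!(policy, PostActStage(), env)  # store reward / next state right here,
                                        # in sync with the environment step
end
# no extra "let the policy see the last observation" push!/plan! block afterwards
```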
@HenriDeh Sorry for the delayed response. I agree that the push! sequence doesn't work as well as it might. In my project that depends on RL.jl, I implemented a modified cache and push sequence for SART policies, here: https://github.com/jeremiahpslewis/AlgorithmicCompetition.jl/blob/main/src/AIAPC2020/rt_cache_sart_qlearning.jl
@HenriDeh @jeremiahpslewis I edited the comment above. Please feel free to give me feedback on my thoughts. Until this is resolved, policies may simply not work because of incorrectly gathered information from the environments.
I'm currently reworking Trajectories to correctly store transitions (see my mention above). It will have to be accompanied by a change to the current run loop. I will make sure to fix this issue in the process and tag you for a review. Although I don't understand what you mean by …
I agree that we still need to query the last action for algorithms such as SARSA, but can you elaborate on what you mean with multiple agents doing stuff? It sounds like something that is algorithm-specific and should be dealt with by overloading/specializing the push! method.
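Something along these lines, for example (purely illustrative: the policy name is made up, and whether Base.push! or an RLCore-owned push! is the right method to extend depends on the version):

```julia
using ReinforcementLearningBase, ReinforcementLearningCore  # assumed imports

# Hypothetical wrapper for a policy with algorithm-specific needs at episode end
struct MySarsaLikePolicy{P<:AbstractPolicy} <: AbstractPolicy
    inner::P
end

function Base.push!(p::MySarsaLikePolicy, ::PostEpisodeStage, env::AbstractEnv)
    # store the terminal observation (and any bootstrap action the algorithm
    # needs) here, instead of relying on the generic run loop to call plan!
    # on a finished environment
    push!(p.inner, PostEpisodeStage(), env)
end
```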
That is just a minor point, but there is a comment in the _run code which states "let the policy see the last observation" for this push! method. For sure this is algorithm-specific, but in my opinion this hook offers more possibilities than just seeing the last observation. That is what I meant by misleading.
As I described it? Or in another way? Because I think with the previous one it was pretty hard to understand how things are stored. In my opinion that can be done efficiently while keeping readability (and it eliminates a source of failures due to incorrectly stored experience).
Your suggestion involves storing … The cache would be cleared after plan!, because we need the cached state from the previous iteration.
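Roughly what I have in mind (names are made up, and a plain vector of named tuples stands in for the Trajectory to keep the sketch short):

```julia
import ReinforcementLearningBase as RLBase  # assumed import

# Hypothetical cache holding the state/action from the previous iteration
mutable struct SACache
    state::Any
    action::Any
end
SACache() = SACache(nothing, nothing)

function cached_plan!(cache::SACache, policy, transitions::Vector, env)
    s = RLBase.state(env)
    if cache.state !== nothing
        # complete the transition started in the previous iteration, pairing the
        # cached state/action with the reward and terminal signal observed now
        push!(transitions, (state = cache.state, action = cache.action,
                            reward = RLBase.reward(env),
                            terminal = RLBase.is_terminated(env)))
    end
    a = RLBase.plan!(policy, env)
    cache.state, cache.action = s, a  # cache cleared and refilled right after plan!
    return a
end
```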
But wouldn't this also store the states twice? Anyway, what you suggest is what I meant. Feel free to mention me in the PR, and if there are any design choices to discuss I can also provide help.
Why do we execute RLBase.plan! after the experiment is done? Also, shouldn't we test whether the environment is terminated?
For environments with the FULL_ACTION_SET ActionStyle, this yields an error, as the legal_action_mask is empty. Or should the policies be updated such that they do not return anything from plan! when the environment is terminated? BUT that behaviour would also be unexpected.
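For illustration, the termination test I mean would look roughly like this (simplified, not the actual multi_agent.jl code):

```julia
# Guard the trailing plan! call: a FULL_ACTION_SET environment has an empty
# legal_action_mask once it is terminated, so plan! errors on it.
if !is_terminated(env)
    RLBase.plan!(policy, env)  # only "let the policy see the last observation"
                               # while the environment can still provide legal actions
end
```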
src\ReinforcementLearningCore\src\policies\agent\multi_agent.jl:133:140
In my opinion, this should be completely omitted: only do …