Description
Investigate the several identified streams of possible HL (hierarchical learning) implementations.
See variants of the base policy(s) = a:
policy((s_t, s_t+1, s_t+2)) = a
maps a sequence of states to a single action, breaking the Markov property. An implementation of DQN with recurrent neural networks (DRQN) does exactly this, but we can use a more straightforward approach and test it in a regular tabular Q-learning setting
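A minimal sketch of that tabular approach: key the Q-table on a tuple of the last few observations instead of a single state. The environment interface (`reset`, `step`, `n_actions`) is hypothetical, not from any specific library.

```python
import random
from collections import defaultdict

def history_q_learning(env, history_len=3, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where the 'state' is the tuple of the last
    `history_len` observations, i.e. policy((s_t-2, s_t-1, s_t)) = a.
    `env` is an assumed interface: reset() -> s, step(a) -> (s, r, done),
    and an integer attribute n_actions."""
    q = defaultdict(float)  # keys: (history_tuple, action)
    n_actions = env.n_actions
    for _ in range(episodes):
        s = env.reset()
        hist = (s,) * history_len  # pad the history with the initial state
        done = False
        while not done:
            # epsilon-greedy over the history-keyed Q-table
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: q[(hist, x)])
            s2, r, done = env.step(a)
            hist2 = hist[1:] + (s2,)  # slide the history window
            target = r if done else r + gamma * max(
                q[(hist2, x)] for x in range(n_actions))
            q[(hist, a)] += alpha * (target - q[(hist, a)])
            hist = hist2
    return q
```

The only change from vanilla tabular Q-learning is the key: a sliding window of states rather than the current state, which is what "breaking the Markov property" amounts to in tabular form.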
policy((s_t, s_t+1, s_t+2)) = (a_t, a_t+1, a_t+2)
maps a sequence of states to a sequence of actions
not sure how to implement this, or how similar it is considered to the mapping of policy(s) to a new policy described below
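One possible tabular reading of this variant, as a sketch only: key a Q-table jointly on a state sequence and an action sequence, so the greedy choice is over all action tuples. All names here are hypothetical, and the joint action space grows as n_actions**seq_len, which is why this is only plausible for small tabular problems.

```python
import itertools
import random
from collections import defaultdict

def make_seq_policy(n_actions, seq_len=3, epsilon=0.1):
    """Returns a Q-table keyed by (state_sequence, action_sequence)
    and an epsilon-greedy policy realizing
    policy((s_t, s_t+1, s_t+2)) = (a_t, a_t+1, a_t+2)."""
    q = defaultdict(float)
    # enumerate every possible action sequence: n_actions ** seq_len tuples
    action_seqs = list(itertools.product(range(n_actions), repeat=seq_len))

    def policy(state_seq):
        if random.random() < epsilon:
            return random.choice(action_seqs)
        return max(action_seqs, key=lambda seq: q[(state_seq, seq)])

    return q, policy
```

Seen this way, the variant is the combination of the history-keyed idea above (sequence of states as input) with macro-actions (sequence of actions as output), which may bear on the similarity question raised here.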
see how similar this is to current macro-action implementations
policy(s) = (a_t, a_t+1, a_t+2)
maps one state to several actions
this is actually what happens in the action-repeat scenario, as in Pong or the Drone simulator, except that instead of blindly repeating one action, the policy already maps the state to a sequence of actions that need not all be equal to one another
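A sketch of that generalization of action repeat, under the same hypothetical env interface as above: treat each fixed-length action tuple as one "macro-action" in tabular Q-learning, discounting rewards step by step inside the macro (SMDP-style).

```python
import itertools
import random
from collections import defaultdict

def macro_q_learning(env, macro_len=3, episodes=300,
                     alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where each 'action' is a macro, realizing
    policy(s) = (a_t, a_t+1, a_t+2). Plain action repeat is the special
    case where every macro is (a, a, a)."""
    macros = list(itertools.product(range(env.n_actions), repeat=macro_len))
    q = defaultdict(float)  # keys: (state, macro_tuple)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                m = random.choice(macros)
            else:
                m = max(macros, key=lambda mm: q[(s, mm)])
            # execute the whole macro, accumulating discounted reward
            ret, disc, s2 = 0.0, 1.0, s
            for a in m:
                s2, r, done = env.step(a)
                ret += disc * r
                disc *= gamma
                if done:
                    break
            target = ret if done else ret + disc * max(
                q[(s2, mm)] for mm in macros)
            q[(s, m)] += alpha * (target - q[(s, m)])
            s = s2
    return q
```

Restricting `macros` to the constant tuples recovers exactly the blind action-repeat behavior the text contrasts against.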
policy(s) = policy_t
policy((s_t, s_t+1, s_t+2)) = policy_t
this is the essential approach to hierarchical learning investigated in H-DQN
a new policy is triggered either by reaching a specific state (a termination/initiation state, like reaching the door out of one room and into another)
or by completing a specific sequence of states (for example, after getting a problem right 5 times in a row, or after solving problems A, B and C, moving to a different problem)
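The control flow described above can be sketched as a two-level loop: a meta-policy picks a sub-policy, the sub-policy acts until its termination test fires (a specific state, or a condition over the state history), then control returns to the meta level. All interfaces here are assumptions for illustration, not the actual H-DQN code.

```python
def run_hierarchy(env, meta_policy, options, max_steps=100):
    """Two-level control: meta_policy(s) names an option; each option is a
    (policy_fn, terminates_fn) pair, where terminates_fn(s) signals the
    initiation/termination state (e.g. the door between two rooms).
    Returns the trace of (option_name, action, next_state) steps."""
    s, done, steps = env.reset(), False, 0
    trace = []
    while not done and steps < max_steps:
        name = meta_policy(s)                 # meta level picks a sub-policy
        policy_fn, terminates_fn = options[name]
        while not done and steps < max_steps:
            a = policy_fn(s)                  # sub-policy acts step by step
            s, r, done = env.step(a)
            trace.append((name, a, s))
            steps += 1
            if terminates_fn(s):              # hand control back to the meta level
                break
    return trace
```

The "sequence of states" trigger (5 correct answers in a row, problems A, B and C solved) fits the same skeleton by letting `terminates_fn` check accumulated history instead of a single state.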