
Postgraduate thesis: Policy optimization for offline reinforcement learning

Title: Policy optimization for offline reinforcement learning
Authors: Liu, Yang (刘阳)
Advisor(s): Hofert, JM
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Offline reinforcement learning (RL) aims to learn, from previously collected data, a policy that can outperform the behavior policy that generated the data. This thesis proposes policy optimization approaches from both the model-free and the model-based perspectives. For model-free RL, this dissertation proposes to implicitly or explicitly unify Q-value maximization and behavior cloning to tackle the exploration-exploitation dilemma. A major problem in offline RL is distribution shift: the discrepancy between the target policy and the offline data causes overestimation of the Q-value. For implicit unification, we unify the action spaces with generative adversarial networks that try to make the actions of the target policy and the behavior policy indistinguishable. For explicit unification, we propose multiple importance sampling (MIS) to learn an advantage weight for each state-action pair, which is then used to suppress or make full use of that pair (a hedged code sketch of such a weighting follows this record). For model-based RL, this dissertation proposes policy optimization by looking ahead (POLA). Existing approaches first learn a value function from historical data and then update the policy parameters by maximizing the value function at a single time step, i.e., they try to find the optimal action at each step. We argue that this strategy is greedy and propose to optimize the policy by looking ahead to alleviate the greediness. Concretely, we look T time steps ahead and optimize the policy on both the current and the future states, where the future states are predicted by a transition model (a sketch of this objective also follows this record). A trajectory contains numerous actions before the agent reaches the terminal state, and performing the best action at each time step does not necessarily yield an optimal trajectory in the end; occasionally we need to allow sub-optimal or negative actions. In addition, hidden confounding factors may affect the decision-making process, so the policy should account for unobserved variables when making decisions. To that end, we incorporate the correlations among the dimensions of a state into the policy, providing it with more information about the environment. The state augmented with correlation information is then fed to a diffusion policy, which is well suited to generating diverse actions. Empirical results on the MuJoCo environments show the effectiveness of the proposed approach. Extensive experiments on the D4RL dataset show that our approaches exhibit superior performance. Our results on the Maze2D data indicate that MIS handles heterogeneous data better than single importance sampling. We also propose a topK loss over ensemble Q-values to alleviate training instability, and we find that the topK loss and MIS stabilize the reward curve effectively. For POLA, a properly chosen rollout length yields better performance than no lookahead, and the optimal rollout length differs across tasks. The correlation information also demonstrates its effectiveness in improving the convergence rate of the reward curve.
Degree: Master of Philosophy
Subject: Reinforcement learning - Mathematical models
Dept/Program: Statistics and Actuarial Science
Persistent Identifier: http://hdl.handle.net/10722/335944
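The advantage-weighted unification and the topK stabilization mentioned in the abstract can be pictured with a short, hedged sketch. The exponential weighting with a temperature beta, the weight clipping, and the "average the k smallest ensemble estimates" rule below are illustrative assumptions, not the exact forms used in the thesis.

```python
# Hedged sketch of advantage-weighted behavior cloning and a topK-style
# aggregation over an ensemble of Q-estimates. The weighting and aggregation
# rules here are assumptions for illustration; the thesis's exact forms may differ.
import torch

def advantage_weighted_bc_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Behavior-cloning loss in which each logged (state, action) pair is
    suppressed or emphasized according to its estimated advantage."""
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight)
    return -(weights.detach() * log_probs).mean()

def conservative_topk_q(q_ensemble, k=2):
    """Average the k smallest estimates across the ensemble
    (q_ensemble shape: [n_ensemble, batch]) to obtain a conservative
    Q-target that damps overestimation."""
    smallest_k, _ = torch.topk(q_ensemble, k, dim=0, largest=False)
    return smallest_k.mean(dim=0)
```

In this sketch, a larger beta flattens the weights toward plain behavior cloning, while a smaller beta concentrates the loss on high-advantage pairs, mirroring the abstract's goal of suppressing or making full use of each state-action pair.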
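The looking-ahead objective of POLA can likewise be sketched, assuming the policy, Q-function, and learned transition model are available as callables; the horizon T, the discounting, and all names below are illustrative rather than the thesis's actual implementation.

```python
# Minimal sketch of a looking-ahead policy objective in the spirit of POLA:
# the policy is optimized on the current states and on T future states
# predicted by a learned transition model, rather than greedily at one step.
# All callables and hyperparameters are illustrative assumptions.
import torch

def lookahead_policy_loss(states, policy, q_value, transition_model, T=3, gamma=0.99):
    loss = torch.zeros(())
    s, discount = states, 1.0
    for t in range(T + 1):
        a = policy(s)                                  # actions proposed by the current policy
        loss = loss - discount * q_value(s, a).mean()  # encourage high value at this (real or imagined) step
        if t < T:
            s = transition_model(s, a)                 # roll the dynamics model forward one step
            discount *= gamma
    return loss
```

Setting T = 0 recovers the single-step greedy update that the abstract argues against, which makes the effect of the rollout length straightforward to ablate.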

 

DC Field: Value
dc.contributor.advisor: Hofert, JM
dc.contributor.author: Liu, Yang
dc.contributor.author: 刘阳
dc.date.accessioned: 2023-12-29T04:05:03Z
dc.date.available: 2023-12-29T04:05:03Z
dc.date.issued: 2023
dc.identifier.citation: Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/335944
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Reinforcement learning - Mathematical models
dc.title: Policy optimization for offline reinforcement learning
dc.type: PG_Thesis
dc.description.thesisname: Master of Philosophy
dc.description.thesislevel: Master
dc.description.thesisdiscipline: Statistics and Actuarial Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044751041503414
