Multi-model based reinforcement learning and application

Date
2020
Publisher
University of Delaware
Abstract
Reinforcement learning (RL), a category of machine learning methods, learns sequential decision-making policies by interacting with an environment and maximizing a cumulative reward. Its biggest challenges are that 1) it relies heavily on real environment interactions, which is not feasible in domains where such interactions are expensive or inaccessible, and 2) it has low learning efficiency, a disadvantage that is particularly prominent in large-scale scenarios. Although deep neural networks combined with reinforcement learning have made unprecedented progress on more complex problems, such as game playing and robotic control, a high-fidelity environment or simulator is still necessary. To alleviate these two obstacles, in this thesis we use multiple parallel executions to collect more training data and improve a single agent's RL learning efficiency.

We first propose using a multi-agent system with dispersed exploration foci to explore a shared environment simultaneously and accelerate the RL of a single environment-interacting agent. A high-fidelity environment simulator is assumed to be available. The single agent's global RL policy is aggregated from distributed local policies.

We then relax the assumption of an available simulator and assume that no environment simulator exists. We propose first training a neural network to represent the environment in a supervised learning fashion, specifically by predicting the next state from a given state and action. The trained model can generate fictitious trajectories that allow RL to carry out more updates and reduce its dependence on real environment interactions. However, an imperfectly trained model may degrade the RL policy's performance: errors in the generated data can introduce significant bias or even hinder the policy's convergence. To overcome this model-imperfection problem, we use an ensemble method with multiple environment models, reducing the potential bias by requiring the policy to perform well in the majority of the models.

Finally, we apply the second approach to a real-world, full-scale financial trading problem. An additional auto-encoder embeds the high-dimensional financial data into a continuous, lower-dimensional latent space for more efficient learning. An LSTM-MDN model learns longer temporal dependencies to generate fictitious trajectories with a longer time horizon. Compared to industry benchmarks, our approach achieves a higher net profit on a daily basis.
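The sketch below illustrates the core idea described in the abstract: train an ensemble of neural-network environment models by supervised next-state prediction, then use them to generate fictitious transitions for additional RL updates. It is a minimal illustration, not the author's implementation; all names (DynamicsModel, train_ensemble, rollout_fictitious), dimensions, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch (assumed, not from the thesis): an ensemble of learned
# dynamics models used to generate fictitious rollouts for model-based RL.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state s' from (s, a); trained by supervised regression."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_ensemble(models, real_transitions, epochs=50, lr=1e-3):
    """Fit each ensemble member on (s, a, s') tuples collected from the real environment."""
    s, a, s_next = real_transitions
    for model in models:  # each member sees the same data here; bootstrapped subsets also work
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(s, a), s_next)
            loss.backward()
            opt.step()

def rollout_fictitious(models, policy, start_states, horizon=5):
    """Generate fictitious transitions with the learned models for extra RL updates.
    Sampling a random ensemble member at each step keeps the policy from
    exploiting any single model's errors (the 'majority of models' idea)."""
    transitions = []
    state = start_states
    with torch.no_grad():
        for _ in range(horizon):
            action = policy(state)
            model = models[torch.randint(len(models), (1,)).item()]
            next_state = model(state, action)
            transitions.append((state, action, next_state))
            state = next_state
    return transitions
```

Short horizons are typically used for such rollouts, since model error compounds over time; the ensemble and the limited horizon together bound the bias introduced by imperfect models.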
Keywords
Reinforcement learning, Machine learning, Decision-making policies, Environment interactions, Low learning efficiency, Global RL policy