Jump-Start Reinforcement Learning

Anonymous Authors

Abstract

Reinforcement learning (RL) provides a theoretical framework for continuously improving an agent's behavior via trial and error. However, efficiently learning policies from scratch can be very difficult, particularly for tasks with exploration challenges. In such settings, it might be desirable to initialize RL with an existing policy, offline data, or demonstrations. Unfortunately, naively performing such initialization in RL often works poorly, especially for value-based methods. In this paper, we present a meta-algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy, and that is compatible with any RL approach. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks: a guide-policy and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks. We show via experiments that JSRL significantly outperforms existing imitation and reinforcement learning algorithms, particularly in the small-data regime. In addition, we provide an upper bound on the sample complexity of JSRL and show that, with the help of a guide-policy, one can improve the sample complexity of exploration methods that do not rely on optimism from exponential in the horizon to polynomial.


Approach

We’re introducing a meta-algorithm called Jump-Start Reinforcement Learning (JSRL) that can use a pre-existing policy of any form to initialize any type of RL algorithm. JSRL uses two policies to learn tasks: a guide-policy and an exploration-policy. The exploration-policy is an RL policy that is trained online with new experience, while the guide-policy is a fixed, pre-existing policy of any form. In this work, we focus on scenarios where the guide-policy is learned from demonstrations, but many other kinds of guide-policies can be used: a scripted policy, a policy trained with RL, or even a live human demonstrator. The only requirements are that the guide-policy is reasonable (i.e., better than random exploration) and that it can select actions based on observations of the environment.
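
To make this setup concrete, here is a minimal Python sketch of the two-policy interface JSRL assumes. The class and method names (GuidePolicy, ExplorationPolicy, predict, select_action, train_step) are illustrative placeholders, not part of the paper or any particular library.

```python
# A minimal sketch of JSRL's two-policy setup; all names below are
# hypothetical placeholders for whatever models/agents are actually used.
from abc import ABC, abstractmethod


class Policy(ABC):
    """Anything that maps an environment observation to an action."""

    @abstractmethod
    def act(self, observation):
        ...


class GuidePolicy(Policy):
    """A fixed, pre-existing policy (e.g., learned from demonstrations).

    It is never updated during JSRL training; it only needs to be better
    than random exploration.
    """

    def __init__(self, pretrained_model):
        self.model = pretrained_model  # hypothetical pre-trained model

    def act(self, observation):
        return self.model.predict(observation)


class ExplorationPolicy(Policy):
    """The RL policy that is trained online with new experience."""

    def __init__(self, agent):
        self.agent = agent  # any online RL learner, e.g., IQL or QT-Opt

    def act(self, observation):
        return self.agent.select_action(observation)

    def update(self, transition):
        self.agent.train_step(transition)
```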

At the beginning of training, we roll out the guide-policy for a fixed number of steps so that the agent is closer to goal states. The exploration-policy then takes over and continues acting in the environment to reach these goals. As the performance of the exploration-policy improves, we gradually reduce the number of steps that the guide-policy takes, until the exploration-policy takes over completely. This process creates a curriculum of starting states for the exploration-policy such that in each curriculum stage, it only needs to learn to reach the initial states of prior curriculum stages.
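
The curriculum itself can be sketched in a few lines. The code below assumes the hypothetical GuidePolicy/ExplorationPolicy interface above and a Gym-style environment with reset()/step(); the linear schedule of guide steps and the improvement check are illustrative choices for a sketch, not prescribed details of the method.

```python
# A minimal sketch of the JSRL curriculum rollout, under the assumptions above.

def jsrl_episode(env, guide, explorer, guide_steps):
    """Roll out the guide-policy for `guide_steps` steps, then hand control
    to the exploration-policy for the remainder of the episode."""
    obs = env.reset()
    episode_return, done, t = 0.0, False, 0
    while not done:
        policy = guide if t < guide_steps else explorer
        action = policy.act(obs)
        next_obs, reward, done, _ = env.step(action)
        # Only the exploration-policy is trained online with new experience.
        explorer.update((obs, action, reward, next_obs, done))
        episode_return += reward
        obs, t = next_obs, t + 1
    return episode_return


def train_jsrl(env, guide, explorer, horizon, n_stages=10,
               episodes_per_stage=100, tolerance=0.05):
    """Gradually reduce the number of guide-policy steps as the
    exploration-policy improves, until it takes over completely."""
    best_return = float("-inf")
    for stage in range(n_stages, -1, -1):
        guide_steps = horizon * stage // n_stages  # horizon, ..., 0
        while True:
            returns = [jsrl_episode(env, guide, explorer, guide_steps)
                       for _ in range(episodes_per_stage)]
            avg_return = sum(returns) / len(returns)
            # Advance to the next curriculum stage (fewer guide steps) only
            # once performance has not degraded relative to earlier stages.
            if avg_return >= best_return - tolerance:
                best_return = max(best_return, avg_return)
                break
```

In each stage, the exploration-policy only has to learn to reach the set of states the guide-policy used to hand over from in the previous stage, which is what makes the curriculum progressively harder but always tractable.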


Results

Comparison to IL+RL Baselines: Since JSRL can use a prior policy to initialize RL, a natural comparison is to imitation and reinforcement learning (IL+RL) methods that train on offline datasets and then fine-tune online. We show how JSRL compares to competitive IL+RL methods on the D4RL benchmark tasks, which vary in complexity and offline dataset quality. Of the D4RL tasks, we focus on the difficult Ant Maze and Adroit dexterous manipulation environments.

For each experiment, we train on an offline dataset and then run online fine-tuning. We compare against algorithms designed specifically for this setting, including AWAC, IQL, CQL, and behavioral cloning (BC). While JSRL can be used in combination with any initial guide-policy or fine-tuning algorithm, we use a pre-trained IQL policy as the guide and IQL for fine-tuning. We find that JSRL performs well even with limited access to demonstrations.

Vision-Based Robotic Tasks: Utilizing offline data is especially challenging in complex tasks such as vision-based robotic manipulation. The high dimensionality of both the continuous-control action space and the pixel-based state space presents unique scaling challenges for IL+RL methods. To study how JSRL scales to such settings, we focus on two challenging simulated robotic manipulation tasks: indiscriminate grasping and instance grasping.

We compare our algorithm against methods such as QT-Opt and AW-Opt that are able to scale to complex vision-based robotics settings. Each method has access to the same offline dataset of successful demonstrations and is allowed to run online fine-tuning for up to 100,000 steps. In these experiments, we use BC as the guide-policy and combine JSRL with QT-Opt for fine-tuning. The combination of QT-Opt+JSRL significantly outperforms the other methods in both sample efficiency and final performance.