Abstract

This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is far more memory intensive, which forces relatively small batch sizes, a poor fit for algorithms like PPO. It remains attractive nonetheless: unlike the more common approach of distilling state-based policies into vision networks, end-to-end RL allows active vision behaviors to emerge. We identify that a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs, namely traditional data parallelism. We propose a new method that disaggregates the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, the simulator runs on three of them and PPO runs on the fourth. With the same number of GPUs, we double the number of environments compared to the standard data-parallel baseline. This allows us to train vision-based policies end-to-end from depth, which previously performed far worse under the baseline. We train and distill both depth- and state-based teachers into stereo RGB networks and show that depth distillation leads to better results, both in simulation and in reality. This improvement is likely due to the observability gap between state and vision policies, which does not exist when distilling depth policies into stereo RGB. We further show that the larger batch sizes enabled by disaggregated simulation also improve real-world performance. When deployed in the real world, our end-to-end policies improve upon the previous state-of-the-art vision-based results. To our knowledge, this is the first work to demonstrate end-to-end RL for dexterous grasping with multifingered hands.

Contributions

  • First sim-to-real transfer of dexterous grasping with multifingered hands trained using end-to-end RL.
  • Improved simulation infrastructure to scale up vision-based training.
  • State-of-the-art results for vision-based grasping.

Method

End-to-end RL

Vision-based policies are incredibly important for real-world manipulation. However, training them directly with RL has historically been challenging due to the high sample complexity of learning from images. This has led to the rise of two-stage methods, in which a state-based teacher policy is trained with RL and then distilled into a vision-based student policy. While this produces successful policies, they fundamentally do not learn vision-aware behaviors. For example, imagine a robot arm trying to grasp an object that the arm is currently occluding. The teacher policy, which has access to the ground-truth object position, will simply pick it up. The student policy, however, may struggle to recreate this behavior because it has never learned to move the arm out of the way in order to see the object. In such cases, the student tries to mimic state-based behaviors while only having access to visual information, which causes it to act sub-optimally with respect to its inputs. Therefore, end-to-end training, where the RL policy learns directly from images, leads to behaviors that are better suited to the policy's sensory modality. However, end-to-end RL from RGB is much slower than from depth, because accurately simulating light transport makes rendering far more time-consuming. Thus, a suitable middle ground that meets the requirements above on a reasonable hardware budget is to train a depth-based policy with RL and then distill it into a stereo RGB policy for real-world deployment, as sketched below.
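
As a concrete illustration, the sketch below shows what one distillation step from a depth teacher into a stereo RGB student could look like in PyTorch. The ConvEncoder architecture, 160×120 resolution, 22-dimensional action space, and the random tensors standing in for rendered observations are illustrative assumptions, not the actual networks or data pipeline; a real setup would also roll out the student (DAgger-style) rather than sample synthetic batches.

```python
# A minimal distillation sketch, assuming PyTorch. All shapes, names, and the
# random "observations" below are illustrative placeholders.
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Small CNN policy head; in_ch is 1 for depth, 6 for a stacked stereo RGB pair."""

    def __init__(self, in_ch: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, act_dim),
        )

    def forward(self, x):
        return self.net(x)


act_dim = 22                                       # hypothetical arm + hand action dim
teacher = ConvEncoder(in_ch=1, act_dim=act_dim)    # depth policy, trained with RL
student = ConvEncoder(in_ch=6, act_dim=act_dim)    # stereo RGB policy for deployment
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

for step in range(1000):
    # Placeholder batch: in a real pipeline these would be depth and stereo RGB
    # frames rendered from the same simulator states along the student's rollout.
    depth = torch.rand(64, 1, 120, 160)
    stereo_rgb = torch.rand(64, 6, 120, 160)

    with torch.no_grad():
        target_action = teacher(depth)             # supervision from the depth teacher

    loss = nn.functional.mse_loss(student(stereo_rgb), target_action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because teacher and student both observe renderings of the same scene, the observability gap described above largely disappears, which is the motivation for distilling from depth rather than from state.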

Disaggregated Simulation vs Data Parallelism

When training vision-based policies with RL, a sufficiently large batch size is important for obtaining a reliable learning signal from PPO. The standard method of scaling up is data parallelism. Our new method, which disaggregates simulation from RL, doubles the number of environments on the same hardware. The diagrams below illustrate the data flow across four GPUs for both data parallelism and disaggregated simulation with a horizon length of 3.
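
As a complement to the diagrams, here is a minimal sketch of how such a layout could be wired up with PyTorch multiprocessing on a four-GPU node: three processes pinned to GPUs 0-2 generate rollout chunks, while a single learner process on GPU 3 collects them and runs the optimizer. The random-tensor rollouts, queue-based transport, linear policy head, and squared-output loss are illustrative placeholders rather than the actual simulator or PPO implementation.

```python
# A minimal sketch of disaggregated simulation, assuming PyTorch multiprocessing on
# a 4-GPU node. Random tensors stand in for the simulator; the queue transport,
# linear policy head, and loss are placeholders, not the actual PPO code.
import torch
import torch.multiprocessing as mp

NUM_SIM_GPUS = 3     # GPUs 0-2 run only the simulator and renderer
LEARNER_GPU = 3      # GPU 3 holds the experience buffer and runs PPO updates
HORIZON = 3
ENVS_PER_GPU = 256   # kept small here; the table below reports 2800/GPU at 160x120


def simulator_worker(rank, queue):
    device = torch.device(f"cuda:{rank}")
    for _ in range(10):  # a few rollout chunks for the sketch
        # Placeholder rollout: depth observations and rewards for one horizon.
        obs = torch.rand(HORIZON, ENVS_PER_GPU, 1, 120, 160, device=device)
        rew = torch.rand(HORIZON, ENVS_PER_GPU, device=device)
        # Ship the chunk to the learner; .cpu() keeps the transport simple here.
        queue.put((obs.cpu(), rew.cpu()))
    queue.put(None)  # signal that this worker is finished


def learner(queue):
    device = torch.device(f"cuda:{LEARNER_GPU}")
    policy = torch.nn.Linear(120 * 160, 22).to(device)  # placeholder policy head
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    finished = 0
    while finished < NUM_SIM_GPUS:
        item = queue.get()
        if item is None:
            finished += 1
            continue
        obs, _rew = item
        # Every simulator chunk lands in buffers on the learner GPU, so one PPO
        # batch can cover all environments across the three simulator GPUs.
        flat = obs.to(device).flatten(0, 1).flatten(1)   # (HORIZON*ENVS, 120*160)
        loss = policy(flat).pow(2).mean()                # stand-in for the PPO loss
        opt.zero_grad()
        loss.backward()
        opt.step()


if __name__ == "__main__":
    mp.set_start_method("spawn")   # required for CUDA tensors in subprocesses
    queue = mp.Queue()
    workers = [mp.Process(target=simulator_worker, args=(rank, queue))
               for rank in range(NUM_SIM_GPUS)]
    for w in workers:
        w.start()
    learner(queue)
    for w in workers:
        w.join()
```

Moving the experience buffers and optimizer state off the simulator GPUs is what leaves room for roughly twice as many environments per node, as the scaling table below shows.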

Results

We evaluate scaling capacity and policy performance in both simulation and the real world.

Scaling capacity on a 4×GPU node

Maximum concurrent environments at different input resolutions.

Input resolution    Data Parallel               Disaggregated Simulation
160×120             1024 / GPU (4096 total)     2800 / GPU (8400 total)
320×240             256 / GPU (1024 total)      700 / GPU (2100 total)

Data Parallel: 4× sim+RL GPUs.

Disaggregated Simulation: 3× simulator GPUs + 1× learner GPU.

Simulation progress and success

Data Parallel (DP) vs. Disaggregated Simulation (Disagg) at 160×120 and 320×240 for end-to-end depth RL, averaged over 5 seeds.

Res.      Method   ADR Inc. ↑   % Full ADR ↑   SR ↑
160×120   DP       0.38         20%            0.37
160×120   Disagg   1.00         100%           0.42
320×240   DP       0.00         0%             0.00
320×240   Disagg   0.90         20%            0.35

Depth vs State Teachers Distillation

Simulation performance of stereo RGB policies distilled from depth vs. state teachers.

[Chart: depth vs. state teacher distillation results]

Real-world success rates

Depth teachers lead to better results; disaggregated simulation further improves performance.

Model                           Success Rate ↑
DextrAH-G (state teacher)       87%
DextrAH-RGB (state teacher)     77%
Ours (depth teacher)            87%
Ours (depth teacher, disagg)    93%