Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Professional Abstract
"The paper introduces HERO, a novel paradigm for visual loco-manipulation of arbitrary objects using humanoid robots. This approach addresses the limitations of existing methods that rely heavily on real-world imitation learning, which often struggle with generalization due to the challenges of acquiring extensive training datasets. HERO integrates advanced techniques from both classical robotics and machine learning to enhance the control performance of humanoid robots in diverse environments. The core innovation lies in the development of a residual-aware end-effector (EE) tracking policy that significantly improves the accuracy of robotic movements. This policy employs a combination of inverse kinematics to transform residual end-effector targets into reference trajectories, a learned neural forward model to achieve precise forward kinematics, and mechanisms for goal adjustment and replanning. These strategies collectively reduce the end-effector tracking error by a factor of 3.2, showcasing a substantial improvement in performance. The modular system designed for loco-manipulation leverages open-vocabulary large vision models, which contribute to robust visual generalization across various real-world settings, including offices and coffee shops. The robots demonstrate the ability to manipulate everyday objects, such as mugs, apples, and toys, on surfaces with varying heights from 43cm to 92cm. The authors conducted systematic modular and end-to-end tests both in simulation and real-world scenarios, validating the effectiveness of their proposed design. The advancements presented in this research hold significant potential for the future of humanoid robots, particularly in enhancing their interaction capabilities with everyday objects, thereby paving the way for more intuitive and versatile robotic systems."