Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

arXiv • February 18, 2026 • Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta

Professional Abstract

The paper introduces HERO, a paradigm for visual loco-manipulation of arbitrary objects with humanoid robots. Existing methods rely heavily on real-world imitation learning and often generalize poorly because large training datasets are hard to collect. HERO instead combines techniques from classical robotics and machine learning to improve the control performance of humanoid robots in diverse environments. Its core contribution is a residual-aware end-effector (EE) tracking policy built from four parts: inverse kinematics that converts residual end-effector targets into reference trajectories, a learned neural forward model for accurate forward kinematics, goal adjustment, and replanning. Together these reduce end-effector tracking error by a factor of 3.2. The modular loco-manipulation system uses open-vocabulary large vision models for visual generalization across real-world settings, including offices and coffee shops. The robots manipulate everyday objects, such as mugs, apples, and toys, on surfaces ranging in height from 43 cm to 92 cm. The authors ran systematic modular and end-to-end tests in simulation and in the real world to validate the design. They present the work as a step toward humanoid robots that interact more capably with everyday objects.
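
To make the factor-of-3.2 figure concrete, the arithmetic can be written out as follows. The symbols $e_{\text{base}}$ and $e_{\text{HERO}}$ are labels introduced here for the baseline and HERO tracking errors; they are not notation from the paper, and this summary does not give absolute error values.

```latex
\[
e_{\text{HERO}} = \frac{e_{\text{base}}}{3.2} \approx 0.31\, e_{\text{base}},
\qquad
1 - \frac{1}{3.2} = 0.6875 \approx 69\%\ \text{lower error}.
\]
```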

Technical Insights

1. Introduces HERO, a new paradigm for humanoid robot loco-manipulation that improves generalization and control performance.

2. Combines large vision models for open-vocabulary understanding with training in simulation for robust end-effector control.

3. Develops a residual-aware EE tracking policy that integrates classical robotics techniques with machine learning.

4. Uses inverse kinematics to convert residual targets into reference trajectories, improving tracking accuracy.

5. Incorporates a learned neural forward model for accurate forward kinematics.

6. Adds goal adjustment and replanning mechanisms to adapt to dynamic environments and tasks.

7. Achieves a 3.2x reduction in end-effector tracking error compared to previous methods.

8. Demonstrates manipulation of various objects in diverse real-world environments, on surfaces from 43 cm to 92 cm high.

9. Systematic testing in both simulation and real-world scenarios confirms the effectiveness and versatility of the HERO system.
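
The interplay of inverse kinematics, a forward model, and goal adjustment listed above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a planar two-link arm instead of a humanoid, the link lengths and joint offsets are made-up values, `robot_execute` is a hypothetical stand-in for the real robot, and `forward_model` is a hand-written imperfect estimate standing in for a learned neural network. Replanning is not covered.

```python
import math

L1, L2 = 0.30, 0.25  # hypothetical link lengths in metres


def fk(q1, q2):
    """Analytic forward kinematics of a planar 2-link arm."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def ik(x, y):
    """Closed-form inverse kinematics (one of the two elbow solutions)."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    c2 = max(-1.0, min(1.0, c2))  # clamp so unreachable goals do not raise
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return q1, q2


TRUE_BIAS = (0.05, -0.08)  # hypothetical joint tracking offsets in radians


def robot_execute(q1, q2):
    """Stand-in for the robot: reference joints are tracked with a constant offset."""
    return fk(q1 + TRUE_BIAS[0], q2 + TRUE_BIAS[1])


def forward_model(q1, q2):
    """Stand-in for a learned forward model: captures only 90% of the true offset."""
    return fk(q1 + 0.9 * TRUE_BIAS[0], q2 + 0.9 * TRUE_BIAS[1])


def adjust_goal(target, iterations=10):
    """Shift the commanded goal until the forward model predicts the target is reached."""
    goal = target
    for _ in range(iterations):
        pred = forward_model(*ik(*goal))
        # Move the goal by the predicted residual between target and outcome.
        goal = (goal[0] + target[0] - pred[0], goal[1] + target[1] - pred[1])
    return goal


def tracking_error(target, goal):
    """Distance between the target and where the simulated robot actually ends up."""
    reached = robot_execute(*ik(*goal))
    return math.hypot(reached[0] - target[0], reached[1] - target[1])


target = (0.35, 0.20)
naive_error = tracking_error(target, target)
adjusted_error = tracking_error(target, adjust_goal(target))
print(f"naive: {naive_error:.4f} m, adjusted: {adjusted_error:.4f} m")
```

In this toy setup the naive command inherits the full joint offset, while the adjusted goal leaves only the part the forward model fails to capture. The size of the improvement here depends entirely on the made-up offsets and says nothing about the 3.2x figure reported for HERO.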