Overview
To enhance autonomy and generalization in robotic manipulation, an imitation learning-based control framework is developed. Large-scale task data is generated in simulation, and sim-to-real transfer enables robust task execution in real-world environments. In addition, a Vision-Language-Action (VLA) model is employed to autonomously interpret visual and language inputs, allowing the robot to perform tasks based on human intent. This establishes a foundation for flexible manipulation and intelligent human–robot collaboration in complex environments.
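As a rough illustration of the control flow described above, the sketch below shows a policy that takes a camera image and a language instruction at each control step and emits a low-level action. This is a minimal sketch under assumed interfaces: the names `VLAPolicy`, `control_loop`, `camera`, and `robot` are hypothetical placeholders, not the framework's actual API.

```python
# Hypothetical sketch of the closed-loop control described above: a VLA policy
# maps a camera image plus a language instruction to a robot action.
# All names here are illustrative placeholders, not the project's actual API.
import numpy as np


class VLAPolicy:
    """Placeholder vision-language-action policy trained via imitation learning."""

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would encode the image and instruction and decode an
        # action (e.g., a 7-DoF end-effector command). Here we return zeros.
        return np.zeros(7)


def control_loop(policy: VLAPolicy, camera, robot, instruction: str, steps: int = 100):
    """Run closed-loop control: observe, infer an action from image + language, act."""
    for _ in range(steps):
        image = camera.get_rgb()                      # current visual observation
        action = policy.predict_action(image, instruction)
        robot.apply_action(action)                    # e.g., delta end-effector pose + gripper
```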
Key contributions
▸ Sim-to-real transfer for manipulation learning: Large-scale demonstration data generated in simulation is leveraged to learn diverse manipulation tasks
▸ Generalization improvement via domain randomization: Robust performance achieved in real-world environments under diverse conditions (a per-episode randomization sketch follows this list)
▸ Vision-language-action-based interaction: Integrated understanding of visual and language inputs enabling flexible task execution based on user intent
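The following sketch illustrates the kind of per-episode domain randomization referenced above: simulation parameters such as lighting, friction, and camera pose are resampled before each demonstration episode so the learned policy does not overfit to a single simulated appearance or dynamics. The parameter names and ranges are illustrative assumptions, not the project's actual configuration.

```python
# Illustrative sketch of per-episode domain randomization for sim-to-real transfer.
# Parameter names and ranges are assumptions for illustration only.
import random


def sample_randomized_sim_config():
    """Sample a new set of simulation parameters before each demonstration episode."""
    return {
        "light_intensity": random.uniform(0.5, 1.5),      # scene lighting scale
        "table_friction": random.uniform(0.4, 1.0),       # contact friction coefficient
        "object_mass_scale": random.uniform(0.8, 1.2),    # perturbation of object mass
        "camera_jitter_deg": random.uniform(-5.0, 5.0),   # camera extrinsics noise
        "texture_id": random.randrange(100),              # randomized visual texture
    }


# Example usage: apply a fresh randomization each episode before collecting data.
for episode in range(3):
    config = sample_randomized_sim_config()
    print(f"episode {episode}: {config}")
```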
Impact
▸ Generalizable autonomous manipulation capability: Enabling operation across diverse environments and previously unseen scenarios
▸ Robust autonomous execution in real-world environments: Reliable task performance under environmental variations, enabled by sim-to-real transfer
▸ Expansion of the human–robot collaboration paradigm: Enabling intuitive interaction through integrated visual and language understanding