VLAbot: A human Vision–Language–Action models interaction framework for robotic assembly

  • Xueting Wang
  • Xiwen Dengxiong
  • Shi Bai
  • Pai Zheng
  • Yunbo Zhang (Corresponding Author)

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

In this work, we propose an intelligent human–robot collaboration system designed to assist embodied intelligence in learning complex, long-horizon manufacturing assembly tasks. The system integrates multiple expert agents and augmented reality (AR) interaction interfaces, enabling robots to request planning and execution guidance from humans and efficiently complete intricate tasks. Specifically, the expert agents equipped with a high-level planner and a Vision-Language-Action (VLA) model actively interact with users through text, vision, and action modalities to acquire critical information, learn task-specific skills, and develop sub-task planning strategies. A distributed data and model architecture ensures real-time interactions between different models and facilitates seamless human–robot collaboration. We evaluate the system on two challenging long-horizon manufacturing assembly tasks (gear assembly and peg insertion) to demonstrate the effectiveness of the proposed approach. The system successfully learns both assembly tasks within five trials and enables the embodied intelligence to complete them with progressively reduced execution time.

Original language: English
Article number: 103268
Number of pages: 13
Journal: Robotics and Computer-Integrated Manufacturing
Volume: 100
Publication status: Published - Aug 2026

Keywords

  • Augmented reality
  • Embodied intelligence
  • Human–robot collaboration
  • Multimodal large-language model
  • Visual language action model

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Software
  • General Mathematics
  • Computer Science Applications
  • Industrial and Manufacturing Engineering
