Abstract
Human–robot collaboration (HRC) enhances efficiency by enabling robots to work alongside human operators on shared tasks. Accurately understanding human intentions is critical for achieving a high level of collaboration. Existing methods rely heavily on case-specific data and struggle with new tasks and unseen categories, while only limited data is typically available under real-world conditions. To bolster the proactive cognitive abilities of collaborative robots, this work introduces a Visual-Language-Temporal approach that conceptualizes intent recognition as a multimodal learning problem with HRC-oriented prompts. A large model with prior knowledge is fine-tuned to acquire industrial domain expertise and then enables efficient transfer through few-shot learning in data-scarce scenarios. Comparisons with state-of-the-art methods across various datasets demonstrate that the proposed approach sets new benchmarks. Ablation studies confirm the efficacy of the multimodal framework, and few-shot experiments further underscore its meta-perceptual potential. This work addresses the challenges of perceptual data and training costs, building a human–robot bridge (H2R Bridge) for semantic communication, and is expected to facilitate proactive HRC and the further integration of large models in industrial applications.
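To illustrate the general idea of prompt-based vision-language intent recognition described above, the following is a minimal sketch assuming a CLIP-style backbone (here, `openai/clip-vit-base-patch32` via Hugging Face `transformers`). The intent labels, prompt template, and temporal mean-pooling over sampled frames are illustrative assumptions for exposition only; they are not the paper's actual model, prompts, or fine-tuning procedure.

```python
# Sketch: zero-/few-shot intent scoring with a CLIP-style vision-language model.
# Assumptions (not from the paper): intent labels, prompt wording, and simple
# mean-pooling of per-frame similarities as the temporal aggregation step.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical HRC-oriented prompts: one natural-language description per intent class.
intent_labels = ["reach for a tool", "hand over a part", "hold a workpiece", "stand idle"]
prompts = [f"a photo of an operator about to {label}" for label in intent_labels]

# A short clip represented by a few sampled frames (placeholder images here).
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]

with torch.no_grad():
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (num_frames, num_prompts): image-text similarity.
    # Mean-pool over frames as a simple temporal aggregation, then softmax over intents.
    clip_logits = outputs.logits_per_image.mean(dim=0)
    probs = clip_logits.softmax(dim=-1)

predicted_intent = intent_labels[int(probs.argmax())]
print(predicted_intent, probs.tolist())
```

In a few-shot setting of the kind the abstract mentions, one would typically adapt only a small part of such a model (e.g., the prompts or a lightweight head) on a handful of labeled clips rather than retraining the backbone; the sketch above only shows the inference path.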
| Original language | English |
| --- | --- |
| Pages (from-to) | 524-535 |
| Number of pages | 12 |
| Journal | Journal of Manufacturing Systems |
| Volume | 80 |
| DOIs | |
| Publication status | Published - Jun 2025 |
Keywords
- Few-shot learning
- Human–robot collaboration
- Intent recognition
- Vision-language models
ASJC Scopus subject areas
- Control and Systems Engineering
- Software
- Hardware and Architecture
- Industrial and Manufacturing Engineering