Abstract
Studying the operational actions of experienced pilots and capturing expert knowledge in the cockpit are highly significant for the aviation training industry. Because textual operation manuals are limited in how well they describe pilot actions, multimodal information, which contains more of the relevant action knowledge, is a promising avenue for capturing aviation operational knowledge. However, multimodal information in the cockpit is often verbose and low in information density, encompassing sustained human–machine interaction, crew collaboration, and multitasking decision-making scenarios. This study aims to structurally represent, identify, and capture aviation knowledge related to pilot operations within multimodal information. First, we propose a structured and comprehensive representation of pilot-centered actions in the cockpit: a graph called the Pilot-Action Graph (PAG), together with a matrix for describing temporal information, the Pilot-Cockpit Action Sequence Matrix (PASM). Second, we propose MERGE (Multimodal Extraction and Reasoning for Graph Enhancement), a cross-modal recognition and reasoning learning framework based on a multimodal large language model. It consists of two key modules: the Multimodal Extraction Module (MEM), which extracts the PASM from multimodal data, and the Reasoning and Graph Enhancement Module (RGEM), which performs logical reasoning and enhances the structure of the PAG. Third, we conduct a comprehensive case study and validation using X-Plane simulator videos. The results indicate that MEM identifies key frames more accurately, reduces information loss, and achieves more effective event localization through submodule collaboration. For RGEM, we compare single-agent and multi-agent collaborative learning modes with feedback correction mechanisms; the multi-agent mode achieves superior graph structure parameters, better information fusion, and more complete PAGs. Additionally, reasoning tasks based on the PAG validate that the framework can effectively describe pilot actions and cockpit information. The proposed method identifies and captures pilot-centered operational aviation knowledge from multimodal information and organizes it into a logical PAG structure. The approach is suitable for multiple tasks and complex scenarios, providing an effective means of aviation knowledge representation and learning.
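To make the relationship between a time-indexed action matrix and an action graph more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the layout of the PASM or the node and edge types of the PAG, so the binary time-by-action matrix, the "next step" edge rule, the action names, and the helper functions `build_sequence_matrix` and `build_action_graph` are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical action vocabulary (not taken from the paper).
ACTIONS = ["set_flaps", "advance_throttle", "rotate", "gear_up"]

# Hypothetical observations: (time step, action) pairs in temporal order.
EVENTS = [(0, "set_flaps"), (1, "advance_throttle"), (2, "rotate"), (3, "gear_up")]


def build_sequence_matrix(events, actions, n_steps):
    """Assumed PASM-like layout: rows are time steps, columns are actions.

    A cell is 1 if the action was observed at that time step, else 0.
    """
    column = {name: j for j, name in enumerate(actions)}
    matrix = [[0] * len(actions) for _ in range(n_steps)]
    for step, name in events:
        matrix[step][column[name]] = 1
    return matrix


def build_action_graph(matrix, actions):
    """Assumed PAG-like structure: a directed edge links each action to the
    actions observed in the next non-empty time step."""
    graph = defaultdict(set)
    previous = []
    for row in matrix:
        current = [actions[j] for j, value in enumerate(row) if value]
        if not current:
            continue  # skip time steps with no observed action
        for source in previous:
            for target in current:
                graph[source].add(target)
        previous = current
    return {node: sorted(targets) for node, targets in graph.items()}


matrix = build_sequence_matrix(EVENTS, ACTIONS, n_steps=4)
graph = build_action_graph(matrix, ACTIONS)
```

Under these assumptions, `graph` maps each action to its temporal successors. A fuller representation would presumably also carry cockpit state, crew roles, and instrument readings, which this sketch leaves out.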
| Field | Value |
|---|---|
| Original language | English |
| Article number | 104628 |
| Journal | Advanced Engineering Informatics |
| Volume | 74 |
| DOIs | |
| Publication status | Published - Sept 2026 |
| Externally published | Yes |
Keywords
- Agent-based system
- Knowledge learning
- Multimodal large language models
- Multimodal learning
- Pilot-Action Graph
ASJC Scopus subject areas
- Information Systems
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'MERGE-PAG: Agent-based multimodal knowledge extraction and reasoning framework for pilot-action graph'. Together they form a unique fingerprint.