Context-aware vision-language model agent enriched with domain-specific ontology for construction site safety monitoring

Chak Fu Chan, Peter Kok Yiu Wong, Xiaowen Guo, Jack C.P. Cheng, Jolly Pui Ching Chan, Pak Him Leung, Xingyu Tao

Research output: Journal article publication > Journal article > Academic research > peer-review

Abstract

Traditional approaches to construction site safety monitoring rely heavily on manual on-site inspection, which is prone to overlooking incidents. Existing computer vision methods require time-consuming, case-by-case data labeling and lack high-level reasoning capability. This paper develops a human-like virtual assistant agent by integrating a multi-modal vision-language model into video analytics: (1) to generate image-text data efficiently for model development, a semi-automatic image-text labeling pipeline based on in-context learning is designed; (2) to adapt the virtual agent from a pre-trained model to a domain-tailored one, a two-stage curriculum learning paradigm is designed to make fine-tuning more effective on domain-specific tasks; (3) to inject construction-domain knowledge into the virtual agent more effectively, a hierarchical prompting framework driven by a construction safety ontology is developed to strengthen domain-tailored reasoning. The virtual agent has been deployed on a real construction site for real-time video analytics, with over 90% accuracy in identifying violations of work-at-height safety regulations.
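
The idea of ontology-driven hierarchical prompting in point (3) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Concept` class, the `build_prompt` function, the toy ontology entries, and the prompt wording are hypothetical and are not taken from the paper, whose actual ontology and prompt design are not described in this abstract.

```python
# Illustrative only: the ontology entries and prompt wording below are
# assumptions for this sketch, not the paper's ontology or prompts.
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a toy safety ontology (hypothetical structure)."""
    name: str
    description: str
    children: list["Concept"] = field(default_factory=list)


def build_prompt(root: Concept) -> str:
    """Walk the ontology top-down and emit one indented prompt line per node."""
    lines = ["Check the image against the following hierarchy:"]

    def visit(node: Concept, depth: int) -> None:
        # Indentation mirrors the node's depth in the ontology.
        lines.append(f"{'  ' * depth}- {node.name}: {node.description}")
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    lines.append("Answer 'violation' or 'compliant' and cite the item used.")
    return "\n".join(lines)


# Hypothetical two-level ontology for a work-at-height check.
TOY_ONTOLOGY = Concept(
    "Work at height",
    "a worker is on a ladder, scaffold, or elevated platform",
    [
        Concept("Fall protection", "a safety harness is worn and anchored"),
        Concept("Edge protection", "guardrails are present at open edges"),
    ],
)

prompt = build_prompt(TOY_ONTOLOGY)
```

The resulting `prompt` string would be sent to a vision-language model together with a video frame. Keeping the domain knowledge in a data structure rather than in hand-written prompt text means a new regulation can be added as a node without rewriting the prompt template.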

Original language: English
Article number: 106305
Journal: Automation in Construction
Volume: 177
DOIs
Publication status: Published - Sept 2025
Externally published: Yes

Keywords

  • Construction safety ontology
  • Construction site safety monitoring
  • Context-aware vision-language model
  • Domain-tailored prompt engineering
  • Virtual construction safety assistant

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Civil and Structural Engineering
  • Building and Construction
