Falls are the primary safety hazard in construction; traditional manual inspections are inefficient and error-prone, and existing computer vision methods generalize poorly to complex scenarios. This paper presents the Construction Safety Vision-Language Model (CS-VLM), a framework for fall hazard identification and automated captioning on construction sites, which integrates ModelScope Swift (MS-Swift) adapters and Low-Rank Adaptation (LoRA) for efficient fine-tuning of the Qwen2.5-7B-Instruct model. To support model training, a standardized image-text dataset for fall hazards is constructed using a Bidirectional Encoder Representations from Transformers (BERT)-based natural language conversion method. Experimental results demonstrate that CS-VLM achieves a Consensus-based Image Description Evaluation (CIDEr) score of 1.324, a Semantic Propositional Image Caption Evaluation (SPICE) score of 0.391, and a hazard identification F1-score of 90.2%, outperforming state-of-the-art methods in adaptability to complex scenarios while reducing computational cost. This research enables precise, standardized hazard description generation, facilitating proactive safety management and accident prevention in construction environments.
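The abstract attributes fine-tuning to MS-Swift with LoRA adapters. As a minimal sketch of an equivalent LoRA setup, the example below uses the Hugging Face PEFT API rather than the authors' MS-Swift pipeline; the rank, scaling factor, and target modules are illustrative assumptions, not the paper's reported hyperparameters:

```python
# Minimal LoRA fine-tuning setup for Qwen2.5-7B-Instruct.
# Sketch only: uses Hugging Face PEFT instead of the authors' MS-Swift
# pipeline; r, lora_alpha, and target_modules are assumed values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora_cfg = LoraConfig(
    r=8,            # assumed low-rank dimension
    lora_alpha=32,  # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train; base weights stay frozen
```

Because only the injected low-rank matrices are updated, this kind of setup trains well under 1% of the full parameter count, which is consistent with the abstract's claim of reduced computational cost.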
