CNC fault detection in industrial scenarios still faces several important challenges, including poor generalization under imbalanced samples, weak consistency between visual and textual semantics, and limited utilization of domain-specific multimodal knowledge. To address these issues, we propose CNC-VLM, an industrial large model optimized with Reinforcement Learning from Human Feedback (RLHF) that leverages vision-language multimodal knowledge for autonomous fault detection. CNC-VLM consists of a visual encoder, a projector, and a textual decoder. Specifically, the visual encoder extracts discriminative visual features from fault time-frequency images, which are then mapped into a modality-aligned latent space by the projection layer. The decoder incorporates a vision-language cross-attention mechanism that fuses visual semantics with textual prompts for precise reasoning and detection. We further design a direct preference optimization (DPO) algorithm for CNC fault detection based on the RLHF training paradigm, which combines model outputs with human feedback (expert knowledge) to improve detection of small-sample faults under class imbalance. Verification on engineering samples collected from CNC machines demonstrates that CNC-VLM significantly outperforms existing state-of-the-art models in accuracy and F1-score. Moreover, CNC-VLM automatically generates detection descriptions and maintenance suggestions, showing strong potential for practical deployment in intelligent manufacturing.
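The preference-based training described above builds on the standard DPO objective. As a rough illustration only, the sketch below implements the generic DPO loss in PyTorch; the pairing of expert-preferred ("chosen") versus dispreferred ("rejected") fault descriptions, the value of beta, and all function names are illustrative assumptions and not the paper's exact algorithm.

```python
# Minimal sketch of a DPO-style preference loss over paired responses.
# Assumes log-probabilities of "chosen" (expert-preferred) and "rejected"
# detection descriptions under the trainable policy and a frozen reference
# model; beta and all names are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: prefer chosen over rejected responses."""
    # Implicit rewards are the policy-to-reference log-ratios, scaled by beta
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: maximize margin of chosen over rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs
if __name__ == "__main__":
    pol_c = torch.tensor([-12.3, -10.1])
    pol_r = torch.tensor([-15.0, -14.2])
    ref_c = torch.tensor([-13.0, -11.0])
    ref_r = torch.tensor([-14.5, -13.8])
    print(dpo_loss(pol_c, pol_r, ref_c, ref_r).item())
```

In this generic formulation, the frozen reference model keeps the fine-tuned policy close to its initialization while expert preference pairs push it toward correct descriptions of under-represented fault classes; how the paper constructs those pairs from expert knowledge is specific to its method and not reproduced here.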
