Visual Language Models (VLMs) have demonstrated superior context understanding and generalization across a wide range of tasks compared to models tailored to specific tasks. However, because of their complexity and the limited information available about their training processes, estimating their performance on a specific task often requires exhaustive testing, which is costly and may not cover edge cases. To leverage the zero-shot capabilities of VLMs in safety-critical applications such as Driver Monitoring Systems, it is crucial to characterize their knowledge and abilities so that consistent performance can be ensured. This research proposes a methodology to explore and gain a deeper understanding of how these models function in driver gaze estimation. It involves a detailed task decomposition; identification of the data, knowledge, and abilities required (e.g., understanding gaze concepts); and exploration through targeted prompting strategies. Applying this methodology to several VLMs (Idefics2, Qwen2-VL, Moondream, GPT-4o) revealed significant limitations, including sensitivity to prompt phrasing, vocabulary mismatches, reliance on image-relative spatial frames, and difficulty inferring non-visible elements. These findings highlight specific areas for improvement and guided the development of more effective prompting and fine-tuning strategies, resulting in performance comparable to traditional CNN-based approaches. The methodology is also useful for initial model filtering, for selecting the best model among alternatives, and for understanding a model's limitations and expected behaviors, thereby increasing reliability.
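As an illustration of the kind of targeted, zero-shot prompting the abstract refers to, the following is a minimal sketch (not taken from the paper) that queries Qwen2-VL, one of the evaluated models, for the driver's gaze zone through the Hugging Face transformers API. The prompt wording, the gaze-zone vocabulary, and the image path are illustrative assumptions, not the authors' actual protocol.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load Qwen2-VL-Instruct (requires a recent transformers release and accelerate).
model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical in-cabin frame of the driver.
image = Image.open("driver_frame.jpg")

# Targeted prompt: fixes the spatial frame of reference and constrains the
# answer to a closed gaze-zone vocabulary, two of the failure modes the
# abstract mentions (image-relative frames and vocabulary mismatches).
prompt = (
    "You are analyzing an in-cabin image of a car driver. "
    "From the driver's point of view, where is the driver looking? "
    "Answer with exactly one of: road ahead, rear-view mirror, left mirror, "
    "right mirror, instrument cluster, center console, passenger."
)

messages = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens (skip the echoed prompt).
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Variations of the prompt (rephrasing, changing the viewpoint from "driver" to "camera", or leaving the answer open-ended) are the kind of probes that expose the sensitivity to phrasing and spatial framing reported in the evaluation.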
