This study introduces a methodology for assessing the agreement between AI and human ratings, focusing specifically on visual large language models (LLMs). The paper presents empirical findings on the alignment of ratings generated by GPT-4 Vision (GPT-4V) and Gemini Pro Vision with human subjective evaluations of environmental visuals. Using photographs of restaurant interior design and food, the study estimates the degree of agreement between model and human preferences. The intraclass correlation reveals that GPT-4V, unlike Gemini Pro Vision, achieves moderate agreement with participants’ general restaurant preferences. Similar results are observed for ratings of food photos. Additionally, there is good agreement in categorizing restaurants into low-cost, mid-range, and exclusive categories based on interior quality. Finally, differences in ratings are observed at the subsample level based on age, gender, and socioeconomic status across the human sample and the LLMs. The results of repeated-measures ANOVAs indicate varying degrees of alignment between humans and LLMs across different sociodemographic characteristics. Overall, GPT-4V currently demonstrates only a limited ability to provide meaningful ratings of visual stimuli compared with human ratings, although it performs better in this task than Gemini Pro Vision.
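The agreement analysis described above rests on the intraclass correlation coefficient (ICC). As a minimal sketch (not the authors' actual code), the snippet below shows how ICC could be computed between human and LLM ratings of the same photos in Python; the column names, rater labels, and example scores are illustrative assumptions, not data from the study.

```python
# Sketch: quantifying human-LLM rating agreement with an intraclass correlation.
# Assumes long-format data (one row per photo x rater); values are placeholders.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "photo": ["p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3", "p3"],
    "rater": ["human_mean", "gpt4v", "gemini"] * 3,   # hypothetical rater labels
    "score": [4.2, 4.0, 3.0, 2.8, 3.0, 4.0, 3.5, 3.0, 2.0],
})

# pingouin reports ICC(1), ICC(2), ICC(3) and their average-rater variants;
# a two-way, absolute-agreement ICC is a common choice for rater agreement.
icc = pg.intraclass_corr(data=ratings, targets="photo",
                         raters="rater", ratings="score")
print(icc[["Type", "Description", "ICC", "CI95%"]])
```

In such an analysis, ICC values around 0.5 to 0.75 are conventionally read as moderate agreement, which is the range the abstract attributes to GPT-4V for general restaurant preferences.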