Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition

IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Cognitive Computation · Pub Date: 2024-05-27 · DOI: 10.1007/s12559-024-10281-5
Mohammad Nadeem, Shahab Saquib Sohail, Laeeba Javed, Faisal Anwer, Abdul Khader Jilani Saudagar, Khan Muhammad
{"title":"Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition","authors":"Mohammad Nadeem, Shahab Saquib Sohail, Laeeba Javed, Faisal Anwer, Abdul Khader Jilani Saudagar, Khan Muhammad","doi":"10.1007/s12559-024-10281-5","DOIUrl":null,"url":null,"abstract":"<p>The significant advancements in the capabilities, reasoning, and efficiency of artificial intelligence (AI)-based tools and systems are evident. Some noteworthy examples of such tools include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT 3.5), generative pretrained transformer 4 (GPT-4), and Bard. LLMs are versatile and effective for various tasks such as composing poetry, writing codes, generating essays, and solving puzzles. Thus far, LLMs can only effectively process text-based input. However, recent advancements have enabled them to handle multimodal inputs, such as text, images, and audio, making them highly general-purpose tools. LLMs have achieved decent performance in pattern recognition tasks (such as classification), therefore, there is a curiosity about whether general-purpose LLMs can perform comparable or even superior to specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performances of fine-tuned DLMs with those of general-purpose LLMs for image-based emotion recognition. We trained DLMs, namely, a convolutional neural network (CNN) (two CNN models were used: <span>\\(CNN_1\\)</span> and <span>\\(CNN_2\\)</span>), ResNet50, and VGG-16 models, using an image dataset for emotion recognition, and then tested their performance on another dataset. Subsequently, we subjected the same testing dataset to two vision-enabled LLMs (LLaVa and GPT-4). The <span>\\(CNN_2\\)</span> was found to be the superior model with an accuracy of 62% while VGG16 produced the lowest accuracy with 31%. In the category of LLMs, GPT-4 performed the best, with an accuracy of 55.81%. LLava LLM had a higher accuracy than <span>\\(CNN_1\\)</span> and VGG16 models. The other performance metrics such as precision, recall, and F1-score followed similar trends. However, GPT-4 performed the best with small datasets. The poor results observed in LLMs can be attributed to their general-purpose nature, which, despite extensive pretraining, may not fully capture the features required for specific tasks like emotion recognition in images as effectively as models fine-tuned for those tasks. The LLMs did not surpass specialized models but achieved comparable performance, making them a viable option for specific tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small.</p>","PeriodicalId":51243,"journal":{"name":"Cognitive Computation","volume":"97 1","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cognitive Computation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s12559-024-10281-5","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Artificial intelligence (AI)-based tools and systems have advanced markedly in capability, reasoning, and efficiency. Noteworthy examples include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT-3.5), GPT-4, and Bard. LLMs are versatile and effective across tasks such as composing poetry, writing code, generating essays, and solving puzzles. Until recently, LLMs could effectively process only text-based input; recent advances, however, enable them to handle multimodal inputs such as text, images, and audio, making them highly general-purpose tools. Because LLMs have achieved decent performance on pattern recognition tasks such as classification, a natural question is whether general-purpose LLMs can perform comparably to, or even better than, specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performance of fine-tuned DLMs with that of general-purpose LLMs for image-based emotion recognition. We trained four DLMs, namely two convolutional neural networks (\(CNN_1\) and \(CNN_2\)), ResNet50, and VGG-16, on an image dataset for emotion recognition, and then tested their performance on another dataset. Subsequently, we presented the same test dataset to two vision-enabled LLMs, LLaVA and GPT-4. \(CNN_2\) was found to be the superior model, with an accuracy of 62%, while VGG-16 produced the lowest accuracy, 31%. Among the LLMs, GPT-4 performed best, with an accuracy of 55.81%, and LLaVA achieved higher accuracy than the \(CNN_1\) and VGG-16 models. Other performance metrics, such as precision, recall, and F1-score, followed similar trends; notably, GPT-4 performed best when the dataset was small. The weaker results of the LLMs can be attributed to their general-purpose nature: despite extensive pretraining, they may not capture the features required for a specific task, such as emotion recognition in images, as effectively as models fine-tuned for that task. The LLMs did not surpass the specialized models but achieved comparable performance, making them a viable option for such tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small.
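The evaluation protocol the abstract describes (train DLMs on one dataset, test on another, then score vision-enabled LLMs on the same test set with accuracy, precision, recall, and F1) can be made concrete with a minimal sketch. Nothing below comes from the paper itself: the checkpoint name, the 48x48 grayscale input size, the seven-label emotion set, the directory layout, and the use of "gpt-4o" as a stand-in for the GPT-4 vision model are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the comparison protocol:
# score a fine-tuned CNN and a vision-enabled LLM on the same emotion
# test set, then report accuracy / macro precision / recall / F1.
import base64
from pathlib import Path

import numpy as np
from openai import OpenAI
from PIL import Image
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tensorflow import keras

# Assumed FER-style label set; the paper's exact classes may differ.
LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def cnn_predict(model, image_path: Path) -> str:
    """Predict an emotion label with the fine-tuned CNN (48x48 grayscale assumed)."""
    img = Image.open(image_path).convert("L").resize((48, 48))
    x = np.asarray(img, dtype="float32")[None, ..., None] / 255.0
    return LABELS[int(model.predict(x, verbose=0).argmax())]


def llm_predict(client: OpenAI, image_path: Path) -> str:
    """Ask a vision-enabled LLM for a one-word emotion label."""
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for the GPT-4 vision model used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the facial emotion in this image. "
                         f"Answer with exactly one word from: {', '.join(LABELS)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()


def report(name: str, y_true: list[str], y_pred: list[str]) -> None:
    """Print the four metrics compared in the study (macro-averaged)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0)
    print(f"{name}: acc={accuracy_score(y_true, y_pred):.3f} "
          f"P={p:.3f} R={r:.3f} F1={f1:.3f}")


if __name__ == "__main__":
    cnn = keras.models.load_model("cnn_emotion.h5")  # hypothetical checkpoint
    client = OpenAI()                                # needs OPENAI_API_KEY set
    test_dir = Path("emotion_test")                  # assumed <label>/<image>.png layout
    y_true, y_cnn, y_llm = [], [], []
    for img in sorted(test_dir.glob("*/*.png")):
        y_true.append(img.parent.name)
        y_cnn.append(cnn_predict(cnn, img))
        y_llm.append(llm_predict(client, img))
    report("CNN", y_true, y_cnn)
    report("GPT-4 (vision)", y_true, y_llm)
```

Holding the test set and metrics fixed across both model families, as above, is what makes the fine-tuned-versus-general-purpose comparison meaningful; only the prediction function changes.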

Source Journal

Cognitive Computation (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; NEUROSCIENCES)
CiteScore: 9.30
Self-citation rate: 3.70%
Articles published per year: 116
Review time: >12 weeks
Journal description: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.
Latest Articles in this Journal
A Joint Network for Low-Light Image Enhancement Based on Retinex
Incorporating Template-Based Contrastive Learning into Cognitively Inspired, Low-Resource Relation Extraction
A Novel Cognitive Rough Approach for Severity Analysis of Autistic Children Using Spherical Fuzzy Bipolar Soft Sets
Cognitively Inspired Three-Way Decision Making and Bi-Level Evolutionary Optimization for Mobile Cybersecurity Threats Detection: A Case Study on Android Malware
Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data