Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing

IF 4.3 · CAS Tier 3 (Materials Science) · JCR Q1 (Engineering, Electrical & Electronic) · ACS Applied Electronic Materials · Pub Date: 2024-06-13 · DOI: 10.1016/j.bdr.2024.100477
Huimin Han, Bouba oumarou Aboubakar, Mughair Bhatti, Bandeh Ali Talpur, Yasser A. Ali, Muna Al-Razgan, Yazeed Yasid Ghadi
{"title":"优化图像字幕:遥感视觉转换器和 VGG 网络的有效性","authors":"Huimin Han ,&nbsp;Bouba oumarou Aboubakar ,&nbsp;Mughair Bhatti ,&nbsp;Bandeh Ali Talpur ,&nbsp;Yasser A. Ali ,&nbsp;Muna Al-Razgan ,&nbsp;Yazeed Yasid Ghadi","doi":"10.1016/j.bdr.2024.100477","DOIUrl":null,"url":null,"abstract":"<div><p>This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available, and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.</p></div>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing\",\"authors\":\"Huimin Han ,&nbsp;Bouba oumarou Aboubakar ,&nbsp;Mughair Bhatti ,&nbsp;Bandeh Ali Talpur ,&nbsp;Yasser A. Ali ,&nbsp;Muna Al-Razgan ,&nbsp;Yazeed Yasid Ghadi\",\"doi\":\"10.1016/j.bdr.2024.100477\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. 
By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available, and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.</p></div>\",\"PeriodicalId\":3,\"journal\":{\"name\":\"ACS Applied Electronic Materials\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACS Applied Electronic Materials\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2214579624000534\",\"RegionNum\":3,\"RegionCategory\":\"材料科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214579624000534","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract


This study presents a comprehensive evaluation of two prominent deep learning models, the Vision Transformer (ViT) and VGG16, in the domain of image captioning for remote sensing data. Leveraging the BLEU score, a widely accepted metric for assessing the quality of machine-generated text against a set of reference captions, the research dissects the capabilities and performance nuances of these models across four sample sizes: 25, 50, 75, and 100. Our findings reveal that the Vision Transformer generally outperforms VGG16 across all evaluated sample sizes, reaching its peak at 50 samples with a BLEU score of 0.5507. This result suggests that ViT benefits from its ability to capture global dependencies within the data, yielding a more nuanced understanding of the images. However, its performance declines slightly as the sample size grows beyond 50, pointing to potential challenges in scalability or overfitting to the training data. VGG16, by contrast, follows a different trajectory: it starts with a lower BLEU score at smaller sample sizes but improves consistently as the sample size increases, culminating in its highest BLEU score of 0.4783 at 100 samples. This pattern suggests that VGG16 may require a larger dataset to learn and generalize adequately, although its performance ceiling remains more modest than ViT's. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images; the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available and scalability is a priority. This study contributes to the ongoing discourse in the AI community on selecting and optimizing deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in the field.
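For reference, the corpus-level BLEU score used above combines modified n-gram precisions $p_n$ with a brevity penalty. The abstract does not state which BLEU variant the authors used, so the following is the standard formulation with $N = 4$ and uniform weights $w_n = 1/4$:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the total length of the candidate captions and $r$ is the effective reference length. Below is a minimal sketch of this kind of evaluation using NLTK's `corpus_bleu`; the caption strings are hypothetical placeholders, not the paper's data or model outputs:

```python
# Hedged sketch: corpus-level BLEU for generated captions vs. references.
# The captions are invented examples; the paper's remote-sensing dataset
# and model outputs are not reproduced here.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of tokenized reference captions per image...
references = [
    [["an", "airport", "with", "several", "planes", "on", "the", "runway"]],
    [["a", "river", "running", "through", "green", "farmland"]],
]
# ...and one tokenized model-generated caption per image.
hypotheses = [
    ["planes", "parked", "on", "an", "airport", "runway"],
    ["a", "river", "crossing", "green", "farmland"],
]

# Smoothing prevents a zero score when some higher-order n-gram has no
# match, which is common with short captions and small evaluation sets
# (25-100 samples, as in the study).
smooth = SmoothingFunction().method1
score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1- to 4-gram weights
    smoothing_function=smooth,
)
print(f"BLEU: {score:.4f}")
```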
