Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing

Huimin Han, Bouba Oumarou Aboubakar, Mughair Bhatti, Bandeh Ali Talpur, Yasser A. Ali, Muna Al-Razgan, Yazeed Yasid Ghadi

DOI: 10.1016/j.bdr.2024.100477
Published: 2024-06-13
URL: https://www.sciencedirect.com/science/article/pii/S2214579624000534
Abstract
This study presents a comprehensive evaluation of two prominent deep learning models, the Vision Transformer (ViT) and VGG16, for image captioning on remote sensing data. Leveraging the BLEU score, a widely accepted metric for assessing the quality of machine-generated text against a set of reference captions, the research examines the capabilities and performance nuances of these models across sample sizes of 25, 50, 75, and 100. Our findings reveal that the Vision Transformer generally outperforms VGG16 across all evaluated sample sizes, reaching its peak at 50 samples with a BLEU score of 0.5507. This result suggests that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, performance decreases slightly as the sample size grows beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, VGG16 follows a different trajectory: it starts with a lower BLEU score at smaller sample sizes but improves consistently as the sample size increases, culminating in its highest BLEU score of 0.4783 at 100 samples. This pattern suggests that VGG16 may require a larger dataset to learn and generalize adequately, although it reaches a more modest performance ceiling than ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images, while the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available and scalability is a priority. This study contributes to the ongoing discourse in the AI community on selecting and optimizing deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in the field.
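The abstract evaluates generated captions with the BLEU score, which compares n-gram overlap between a model's output and a set of reference captions. The sketch below shows one common way such a corpus-level BLEU score could be computed with NLTK; it is not the authors' evaluation code, and the tokenized captions are hypothetical placeholders.

```python
# Minimal BLEU-evaluation sketch, assuming NLTK is installed.
# Captions here are illustrative placeholders, not data from the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is a tokenized generated caption; each entry in `references`
# is a list of tokenized reference captions for the same image.
references = [
    [["a", "plane", "parked", "on", "the", "runway"]],
    [["several", "buildings", "surrounded", "by", "green", "trees"]],
]
hypotheses = [
    ["a", "plane", "on", "the", "runway"],
    ["buildings", "surrounded", "by", "trees"],
]

# Smoothing avoids zero scores when higher-order n-grams have no overlap,
# which is common for short remote-sensing captions.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"Corpus BLEU: {score:.4f}")
```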
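The comparison hinges on swapping the image encoder (ViT vs. VGG16) in front of a caption decoder. A rough sketch of how the two encoders could be used as interchangeable feature extractors is given below; the checkpoints, preprocessing, and feature shapes are assumptions for illustration and do not reproduce the paper's pipeline.

```python
# Sketch of two interchangeable image encoders for a captioning model,
# assuming torchvision >= 0.13 and Hugging Face transformers are installed.
import torch
from torchvision import models
from transformers import ViTImageProcessor, ViTModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT encoder: patch embeddings that carry global context via self-attention.
vit_name = "google/vit-base-patch16-224-in21k"   # assumed checkpoint, not from the paper
vit = ViTModel.from_pretrained(vit_name).to(device).eval()
vit_processor = ViTImageProcessor.from_pretrained(vit_name)

# VGG16 encoder: local convolutional feature maps from the final conv block.
vgg_weights = models.VGG16_Weights.IMAGENET1K_V1
vgg = models.vgg16(weights=vgg_weights).features.to(device).eval()
vgg_preprocess = vgg_weights.transforms()

@torch.no_grad()
def encode(image):
    """Return (vit_features, vgg_features) for one PIL image."""
    vit_pixels = vit_processor(images=image, return_tensors="pt").pixel_values.to(device)
    vit_feats = vit(pixel_values=vit_pixels).last_hidden_state      # (1, 197, 768)

    vgg_pixels = vgg_preprocess(image).unsqueeze(0).to(device)
    vgg_feats = vgg(vgg_pixels).flatten(2).transpose(1, 2)          # (1, 49, 512)
    return vit_feats, vgg_feats
```

Either feature sequence can be fed to the same attention-based caption decoder, so a comparison of this kind isolates the effect of the image encoder on the resulting BLEU scores.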