{"title":"Tiny TR-CAP: A novel small-scale benchmark dataset for general-purpose image captioning tasks","authors":"Abbas Memiş , Serdar Yıldız","doi":"10.1016/j.jestch.2025.102009","DOIUrl":null,"url":null,"abstract":"<div><div>In the last decade, the outstanding performance of deep learning has also led to a rapid and inevitable rise in automatic image captioning, as well as the need for large amounts of data. Although well-known, conventional and publicly available datasets have been proposed for the image captioning task, the lack of ground-truth caption data still remains a major challenge in the generation of accurate image captions. To address this issue, in this paper we introduced a novel image captioning benchmark dataset called Tiny TR-CAP, which consists of 1076 original images and 5380 handwritten captions (5 captions for each image with high diversity). The captions, which were translated into English using two web-based language translation APIs and a novel multilingual deep machine translation model, were tested against 11 state-of-the-art and prominent deep learning-based models, including CLIPCap, BLIP, BLIP2, FUSECAP, OFA, PromptCap, Kosmos2, MiniGPT4, LlaVA, BakLlaVA, and GIT. In the experimental studies, the accuracy statistics of the captions generated by the related models were reported in terms of the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and WMD captioning metrics, and their performance was evaluated comparatively. In the performance analysis, quite promising captioning performances were observed, and the best success rates were achieved with the OFA model with scores of 0.7097 BLEU-1, 0.5389 BLEU-2, 0.3940 BLEU-3, 0.2875 BLEU-4, 0.1797 METEOR, 0.4627 ROUGE-L, 0.2938 CIDEr, 0.0626 SPICE, and 0.4605 WMD. To support research studies in the field of image captioning, the image and caption sets of Tiny TR-CAP will also be publicly available on GitHub (<span><span>https://github.com/abbasmemis/tiny_TR-CAP</span><svg><path></path></svg></span>) for academic research purposes.</div></div>","PeriodicalId":48609,"journal":{"name":"Engineering Science and Technology-An International Journal-Jestech","volume":"64 ","pages":"Article 102009"},"PeriodicalIF":5.1000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Science and Technology-An International Journal-Jestech","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2215098625000643","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0
Abstract
Over the last decade, the outstanding performance of deep learning has driven a rapid and inevitable rise in automatic image captioning, together with a growing need for large amounts of data. Although well-known, conventional, and publicly available datasets have been proposed for the image captioning task, the lack of ground-truth caption data remains a major challenge in the generation of accurate image captions. To address this issue, in this paper we introduce a novel image captioning benchmark dataset called Tiny TR-CAP, which consists of 1076 original images and 5380 human-written captions (5 highly diverse captions per image). The captions, translated into English using two web-based language translation APIs and a novel multilingual deep machine translation model, were compared against the outputs of 11 state-of-the-art and prominent deep learning-based models, including CLIPCap, BLIP, BLIP2, FUSECAP, OFA, PromptCap, Kosmos-2, MiniGPT-4, LLaVA, BakLLaVA, and GIT. In the experimental studies, the accuracy of the captions generated by these models was reported in terms of the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and WMD captioning metrics, and their performance was evaluated comparatively. The performance analysis revealed quite promising captioning results, with the best success rates achieved by the OFA model: 0.7097 BLEU-1, 0.5389 BLEU-2, 0.3940 BLEU-3, 0.2875 BLEU-4, 0.1797 METEOR, 0.4627 ROUGE-L, 0.2938 CIDEr, 0.0626 SPICE, and 0.4605 WMD. To support research in the field of image captioning, the image and caption sets of Tiny TR-CAP will also be publicly available on GitHub (https://github.com/abbasmemis/tiny_TR-CAP) for academic research purposes.
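The metrics reported above (BLEU-1 through BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE) are the standard COCO-style caption evaluation scores. The paper does not state which evaluation code it used, so the following is only a minimal sketch, assuming the widely used pycocoevalcap toolkit; the image IDs and captions are hypothetical placeholders, not items from Tiny TR-CAP. WMD is not part of this toolkit and is typically computed separately from word embeddings (e.g., gensim's wmdistance on a KeyedVectors model).

# Minimal sketch of COCO-style caption scoring with pycocoevalcap
# (pip install pycocoevalcap). Inputs are dicts mapping an image ID to a
# list of reference captions (references) or a single-element list holding
# the generated caption (candidates). In practice captions are usually
# normalized first with pycocoevalcap's PTBTokenizer.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

def evaluate_captions(references, candidates):
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),   # requires a local Java runtime
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
        (Spice(), "SPICE"),     # also Java-based
    ]
    results = {}
    for scorer, name in scorers:
        # compute_score returns (corpus-level score, per-image scores);
        # Bleu returns one corpus-level score per n-gram order.
        score, _ = scorer.compute_score(references, candidates)
        if isinstance(name, list):
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results

# Hypothetical usage with a single image and two reference captions:
refs = {"img_0001": ["a cat sits on a red chair", "a small cat resting on a chair"]}
cand = {"img_0001": ["a cat sitting on a chair"]}
print(evaluate_captions(refs, cand))

Feeding each model's generated captions and the five translated ground-truth captions per image through a routine like this would reproduce the kind of per-model score table summarized in the abstract.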
Journal overview:
Engineering Science and Technology, an International Journal (JESTECH) (formerly Technology) is a peer-reviewed quarterly engineering journal that publishes both theoretical and experimental high-quality papers of permanent interest, not previously published in journals, in the field of engineering and applied science, and aims to promote the theory and practice of technology and engineering. In addition to peer-reviewed original research papers, the Editorial Board welcomes original research reports, state-of-the-art reviews, and communications in the broadly defined field of engineering science and technology.
The scope of JESTECH spans a wide spectrum of subjects, including:
-Electrical/Electronics and Computer Engineering (Biomedical Engineering and Instrumentation; Coding, Cryptography, and Information Protection; Communications, Networks, Mobile Computing and Distributed Systems; Compilers and Operating Systems; Computer Architecture, Parallel Processing, and Dependability; Computer Vision and Robotics; Control Theory; Electromagnetic Waves, Microwave Techniques and Antennas; Embedded Systems; Integrated Circuits, VLSI Design, Testing, and CAD; Microelectromechanical Systems; Microelectronics, and Electronic Devices and Circuits; Power, Energy and Energy Conversion Systems; Signal, Image, and Speech Processing)
-Mechanical and Civil Engineering (Automotive Technologies; Biomechanics; Construction Materials; Design and Manufacturing; Dynamics and Control; Energy Generation, Utilization, Conversion, and Storage; Fluid Mechanics and Hydraulics; Heat and Mass Transfer; Micro-Nano Sciences; Renewable and Sustainable Energy Technologies; Robotics and Mechatronics; Solid Mechanics and Structure; Thermal Sciences)
-Metallurgical and Materials Engineering (Advanced Materials Science; Biomaterials; Ceramic and Inorganic Materials; Electronic-Magnetic Materials; Energy and Environment; Materials Characterization; Metallurgy; Polymers and Nanocomposites)