A transformer-based Urdu image caption generation

Zone 3 (Computer Science) · JCR Q1 (Computer Science) · Journal of Ambient Intelligence and Humanized Computing · Pub Date: 2024-07-02 · DOI: 10.1007/s12652-024-04824-9
Muhammad Hadi, Iqra Safder, Hajra Waheed, Farooq Zaman, Naif Radi Aljohani, Raheel Nawaz, Saeed Ul Hassan, Raheem Sarwar

Abstract

Image caption generation bridges Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of the two fields and presents unique challenges, particularly for low-resource languages such as Urdu. Research on basic Urdu language understanding remains limited, which calls for further exploration in this domain. In this study, we propose three Seq2Seq-based architectures tailored for Urdu image caption generation. Our approach leverages transformer models to generate captions in Urdu, a significantly more challenging task than generating captions in English. To train and evaluate our models, we created an Urdu-translated subset of the Flickr8k dataset, consisting of images of dogs in action paired with Urdu captions. The three deep learning architectures are: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention and Word2Vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu captions are syntactically, contextually, and semantically correct. Our study highlights the inherent challenges of retraining models on low-resource languages, as well as the potential of pre-trained models for developing NLP and CV applications in low-resource language settings.
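
The abstract reports BLEU-1 as one of its two evaluation metrics. As a rough illustration of what that metric measures, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision multiplied by a brevity penalty). It is not the authors' evaluation code: the function name `bleu1`, the whitespace tokenization, and the two Urdu captions are assumptions made up for this example and do not come from the paper's dataset.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1 for one tokenized candidate and a list of tokenized references."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # Clip each candidate token count by its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: compare against the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * precision


# Hypothetical captions (illustrative only), split on whitespace.
references = ["ایک کتا گھاس پر دوڑ رہا ہے".split()]  # "a dog is running on the grass"
candidate = "کتا گھاس پر دوڑ رہا ہے".split()         # same caption without the leading "a"

score = bleu1(candidate, references)
print(round(score, 4))
```

Here every candidate token appears in the reference, so the unigram precision is 1 and the score is determined entirely by the brevity penalty for the one missing word. The function returns a value between 0 and 1; figures such as the 86 quoted in the abstract are conventionally that value scaled to 0-100, though the abstract does not state the scale explicitly.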

Source journal
Journal of Ambient Intelligence and Humanized Computing
Categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems
CiteScore: 9.60
Self-citation rate: 0.00%
Articles published: 854
About the journal: The purpose of JAIHC is to provide a high-profile, leading-edge forum for academics, industrial professionals, educators and policy makers in the field to contribute and disseminate the most innovative research and developments in all aspects of ambient intelligence and humanized computing, such as intelligent/smart objects, environments/spaces, and systems. The journal discusses various technical, safety, personal, social, physical, political, artistic and economic issues. The research topics covered by the journal include (but are not limited to): Pervasive/Ubiquitous Computing and Applications; Cognitive wireless sensor network; Embedded Systems and Software; Mobile Computing and Wireless Communications; Next Generation Multimedia Systems; Security, Privacy and Trust; Service and Semantic Computing; Advanced Networking Architectures; Dependable, Reliable and Autonomic Computing; Embedded Smart Agents; Context awareness, social sensing and inference; Multi modal interaction design; Ergonomics and product prototyping; Intelligent and self-organizing transportation networks & services; Healthcare Systems; Virtual Humans & Virtual Worlds; Wearables sensors and actuators.