A transformer-based Urdu image caption generation

3区计算机科学 Q1 Computer Science Journal of Ambient Intelligence and Humanized Computing Pub Date : 2024-07-02 DOI:10.1007/s12652-024-04824-9

Muhammad Hadi, Iqra Safder, Hajra Waheed, Farooq Zaman, Naif Radi Aljohani, Raheel Nawaz, Saeed Ul Hassan, Raheem Sarwar

{"title":"A transformer-based Urdu image caption generation","authors":"Muhammad Hadi, Iqra Safder, Hajra Waheed, Farooq Zaman, Naif Radi Aljohani, Raheel Nawaz, Saeed Ul Hassan, Raheem Sarwar","doi":"10.1007/s12652-024-04824-9","DOIUrl":null,"url":null,"abstract":"<p>Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. Limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach involves leveraging transformer models to generate captions in Urdu, a significantly more challenging task than English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. Our designed models encompassed a deep learning-based approach, utilizing three different architectures: Convolutional Neural Network (CNN) + Long Short-term Memory (LSTM) with Soft attention employing word2Vec embeddings, CNN+Transformer, and Vit+Roberta models. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges associated with retraining models on low-resource languages. Our findings highlight the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.</p>","PeriodicalId":14959,"journal":{"name":"Journal of Ambient Intelligence and Humanized Computing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Ambient Intelligence and Humanized Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s12652-024-04824-9","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}

引用次数: 0

Abstract

Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. Limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach involves leveraging transformer models to generate captions in Urdu, a significantly more challenging task than English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. Our designed models encompassed a deep learning-based approach, utilizing three different architectures: Convolutional Neural Network (CNN) + Long Short-term Memory (LSTM) with Soft attention employing word2Vec embeddings, CNN+Transformer, and Vit+Roberta models. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges associated with retraining models on low-resource languages. Our findings highlight the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于变换器的乌尔都语图像标题生成器

图像标题生成已成为自然语言处理（NLP）和计算机视觉（CV）之间的重要桥梁。它处于这两个领域的交叉点，并提出了独特的挑战，尤其是在处理乌尔都语等低资源语言时。有关乌尔都语基本理解的研究有限，因此有必要在这一领域进行进一步探索。在本研究中，我们提出了三种基于 Seq2Seq 的架构，专门用于乌尔都语图像标题的生成。我们的方法涉及利用转换器模型生成乌尔都语标题，这是一项比英语更具挑战性的任务。为了便于训练和评估我们的模型，我们创建了一个经过乌尔都语翻译的 flickr8k 数据集子集，其中包含了以狗的行动为主题的图片，并附有相应的乌尔都语标题。我们设计的模型采用了基于深度学习的方法，利用了三种不同的架构：卷积神经网络（CNN）+长短期记忆（LSTM）与采用 word2Vec 嵌入的软关注、CNN+变换器和 Vit+Roberta 模型。实验结果表明，我们提出的模型优于现有的最先进方法，达到了 86 BLEU-1 和 90 BERT-F1 分数。生成的乌尔都语图像标题在语法、上下文和语义方面都表现出了正确性。我们的研究凸显了在低资源语言上重新训练模型所面临的固有挑战。我们的研究结果凸显了预训练模型在促进低资源语言环境下的 NLP 和 CV 应用开发方面的潜力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Ambient Intelligence and Humanized Computing COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCEC-COMPUTER SCIENCE, INFORMATION SYSTEMS

CiteScore

9.60

自引率

0.00%

发文量

854

期刊介绍： The purpose of JAIHC is to provide a high profile, leading edge forum for academics, industrial professionals, educators and policy makers involved in the field to contribute, to disseminate the most innovative researches and developments of all aspects of ambient intelligence and humanized computing, such as intelligent/smart objects, environments/spaces, and systems. The journal discusses various technical, safety, personal, social, physical, political, artistic and economic issues. The research topics covered by the journal are (but not limited to): Pervasive/Ubiquitous Computing and Applications Cognitive wireless sensor network Embedded Systems and Software Mobile Computing and Wireless Communications Next Generation Multimedia Systems Security, Privacy and Trust Service and Semantic Computing Advanced Networking Architectures Dependable, Reliable and Autonomic Computing Embedded Smart Agents Context awareness, social sensing and inference Multi modal interaction design Ergonomics and product prototyping Intelligent and self-organizing transportation networks & services Healthcare Systems Virtual Humans & Virtual Worlds Wearables sensors and actuators