Image Caption Generation Related to Object Detection and Colour Recognition Using Transformer-Decoder
Z. U. Kamangar, G. Shaikh, Saif Hassan, Nimra Mughal, U. A. Kamangar
2023 4th International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), 2023-03-17
DOI: 10.1109/iCoMET57998.2023.10099161
Citations: 1
Abstract
Dependence on digital images is increasing in many fields, e.g., education, business, medicine, and defense, as these fields shift toward the online paradigm. Consequently, there is a dire need for computers and similar machines to interpret the information in such images and help users understand their meaning. This has been achieved with the help of automatic image captioning using different prediction models, such as machine learning and deep learning models. However, the problem with traditional models, especially machine learning models, is that they may not generate a caption that accurately represents the image. Although deep learning methods are better at generating image captions, this remains an open research area that requires substantial work. Therefore, the model proposed in this research uses transformers with attention layers to encode and decode image tokens. Finally, it generates the image caption by identifying the objects in the image along with their colours. The Flickr8k and Conceptual Captions datasets, which contain images paired with captions, are used to train the model. Flickr8k contains 8,092 images, each with five captions, and Conceptual Captions contains more than 3 million images, each with one caption. The contribution of this work is that it can be utilized by companies that require automatic interpretation of diverse images and the naming of images to describe a scenario or provide descriptions related to the images. In the future, accuracy can be increased by enlarging the number of images and captions or by incorporating different deep-learning techniques.
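The abstract describes the architecture only at a high level (a transformer with attention layers that encodes image tokens and decodes a caption). Purely as an illustration of that general approach, below is a minimal PyTorch sketch of a transformer encoder-decoder captioner operating on pre-extracted image features. All class names, hyperparameters, and the assumption of pre-extracted CNN/ViT features are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    """Illustrative sketch: a Transformer encoder over image features and a
    Transformer decoder that generates caption tokens. Hyperparameters and
    the feature extractor are assumptions, not the paper's implementation."""

    def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                 nhead=8, num_layers=3, max_len=40):
        super().__init__()
        # Project pre-extracted image region/patch features to the model width.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        # Token and position embeddings for the caption sequence.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (batch, num_regions, feat_dim); captions: (batch, seq_len)
        memory_in = self.feat_proj(image_feats)
        positions = torch.arange(captions.size(1), device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(positions)
        # Causal mask so each caption token attends only to earlier tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        hidden = self.transformer(src=memory_in, tgt=tgt, tgt_mask=tgt_mask)
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits

# Toy usage with random tensors standing in for CNN/ViT image features.
model = CaptionTransformer(vocab_size=10000)
feats = torch.randn(2, 49, 2048)          # e.g. a 7x7 grid of visual features
caps = torch.randint(0, 10000, (2, 20))   # tokenized captions (teacher forcing)
logits = model(feats, caps)
print(logits.shape)  # torch.Size([2, 20, 10000])
```

In such a setup the logits are typically trained with cross-entropy against the next caption token, and at inference the caption is generated token by token (greedy or beam search); how the paper handles object and colour identification specifically is not detailed in the abstract.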