Attending to Transforms: A Survey on Transformer-based Image Captioning

2023 2nd International Conference on Paradigm Shifts in Communications Embedded Systems, Machine Learning and Signal Processing (PCEMS)
Pub Date: 2023-04-05
DOI: 10.1109/PCEMS58491.2023.10136098

Kshitij Ambilduke, Thanmay Jayakumar, Luqman Farooqui, Himanshu Padole, Anamika Singh

Citations: 0

Abstract

Image captioning is a challenging task that lies at the intersection of Computer Vision and Natural Language Processing. There exists a legion of works that generate meaningful and realistic descriptions of images. Recently, with the advent of attention mechanisms and transformers, there has been a drastic shift in modelling both language and vision tasks. However, very few extensive studies review these approaches in terms of their progression, advantages and disadvantages. This paper presents a detailed summary of transformer-based models employed for image captioning. In addition, we provide an overview of the various pre-training tasks, datasets and metrics used for image captioning. Finally, the performance of all the reviewed approaches is compared on the COCO Captions dataset.


