Softmax Pooling for Super Visual Semantic Embedding*

Zhixian Zeng, Jianjun Cao, Nianfeng Weng, Guoquan Jiang, Yizhuo Rao, Yuxin Xu
DOI: 10.1109/iemcon53756.2021.9623131
Published in: 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2021-10-27
Citations: 4

Abstract

Visual-semantic embedding maps images and text into a common embedding space and performs cross-modal semantic alignment learning; image-text matching is its main research focus. Existing research has confirmed that even a simple pooling strategy can achieve good performance in visual-semantic embedding. However, existing visual-semantic pooling strategies (aggregators) generally suffer from problems such as adding extra training parameters, increasing training time, and ignoring intra-modal semantic correlations. In this paper, we propose a Super Visual Semantic Embedding (SVSE) model based on Softmax Pooling (SoftPool). To the best of our knowledge, this is the first time the softmax pooling strategy has been introduced into visual-semantic embedding. SoftPool is simple to implement and introduces no additional training parameters; it adaptively computes weights over feature values and preserves more intra-modal correlation information among features. We further combine an enhanced semantic representation module with our softmax pooling strategy to model intra-modal semantic associations, which improves the performance of visual-semantic embedding on image-text matching. These properties also give the proposed method higher engineering application value than competing methods. Experiments are conducted on two widely used cross-modal image-text datasets, MS-COCO and Flickr-30K. Compared with the best existing pooling strategy, our softmax pooling strategy not only trains faster but also improves R@1 (image retrieval) by 0.48% on MS-COCO (5K) and 1.95% on Flickr-30K. Moreover, compared with the best visual-semantic embedding model, our SVSE improves R@1 (image retrieval) by 2.83% on MS-COCO (5K) and 4.89% on Flickr-30K (1K), respectively.
Our code is available at https://github.com/zengzhixian/SoftPool_SVSE.git.
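To make the pooling idea concrete: the abstract describes a parameter-free aggregator in which each feature value is weighted by its own softmax score, interpolating between mean pooling (every value contributes) and max pooling (large activations dominate). The sketch below is a minimal, hypothetical illustration of that general softmax-weighted pooling computation in plain Python; the function name, the example values, and the per-dimension scalar formulation are our own assumptions, not the authors' implementation (see their repository for the real code).

```python
import math

def softpool(values):
    """Softmax-weighted pooling over one feature dimension (illustrative sketch).

    Each value v_i gets weight exp(v_i) / sum_j exp(v_j), so larger
    activations contribute more (as in max pooling) while smaller ones
    are not discarded (as in mean pooling). No trainable parameters
    are introduced.
    """
    m = max(values)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(v - m) for v in values]
    z = sum(weights)
    return sum((w / z) * v for w, v in zip(weights, values))

# Hypothetical example: four image-region activations for one embedding
# dimension. The softmax-weighted result lies between the mean and the max.
regions = [0.1, 0.3, 2.5, 0.0]
pooled = softpool(regions)
```

Because the softmax weights increase with the values they weight, the pooled result is always at least the plain mean and at most the maximum, which is the adaptive behavior the abstract attributes to SoftPool.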