Softmax Pooling for Super Visual Semantic Embedding*

Zhixian Zeng, Jianjun Cao, Nianfeng Weng, Guoquan Jiang, Yizhuo Rao, Yuxin Xu
DOI: 10.1109/iemcon53756.2021.9623131
Published in: 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2021-10-27
Citations: 4

Abstract

Visual-semantic embedding maps images and text into a common embedding space and performs cross-modal semantic alignment learning; image-text matching is its main research focus. Existing research has confirmed that even a simple pooling strategy can achieve good performance in visual-semantic embedding. However, existing visual-semantic pooling strategies (aggregators) generally suffer from problems such as adding extra training parameters, increasing training time, and ignoring intra-modal semantic correlations. In this paper, we propose a Super Visual Semantic Embedding (SVSE) model based on Softmax Pooling (SoftPool). To the best of our knowledge, this is the first time the softmax pooling strategy has been introduced into visual-semantic embedding. SoftPool is simple to implement and introduces no additional training parameters; it adaptively computes weights over feature values and preserves more intra-modal correlation information among features. We further combine an enhanced semantic representation module with our softmax pooling strategy to model intra-modal semantic associations, which improves the performance of visual-semantic embedding on image-text matching. These properties also give the proposed method higher engineering application value than competing methods. Experiments are conducted on two widely used cross-modal image-text datasets, MS-COCO and Flickr-30K. Compared with the best existing pooling strategy, our softmax pooling strategy not only trains faster but also improves R@1 (image retrieval) by 0.48% on MS-COCO (5K) and 1.95% on Flickr-30K. Moreover, compared with the best visual-semantic embedding model, our SVSE improves R@1 (image retrieval) by 2.83% on MS-COCO (5K) and 4.89% on Flickr-30K (1K), respectively.
Our code is available at https://github.com/zengzhixian/SoftPool_SVSE.git.
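To make the pooling idea concrete: the abstract describes a parameter-free aggregator in which each feature value is weighted by its own softmax score, interpolating between mean pooling (every value contributes) and max pooling (large activations dominate). The sketch below is a minimal, hypothetical illustration of that general softmax-weighted pooling computation in plain Python; the function name, the example values, and the per-dimension scalar formulation are our own assumptions, not the authors' implementation (see their repository for the real code).

```python
import math

def softpool(values):
    """Softmax-weighted pooling over one feature dimension (illustrative sketch).

    Each value v_i gets weight exp(v_i) / sum_j exp(v_j), so larger
    activations contribute more (as in max pooling) while smaller ones
    are not discarded (as in mean pooling). No trainable parameters
    are introduced.
    """
    m = max(values)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(v - m) for v in values]
    z = sum(weights)
    return sum((w / z) * v for w, v in zip(weights, values))

# Hypothetical example: four image-region activations for one embedding
# dimension. The softmax-weighted result lies between the mean and the max.
regions = [0.1, 0.3, 2.5, 0.0]
pooled = softpool(regions)
```

Because the softmax weights increase with the values they weight, the pooled result is always at least the plain mean and at most the maximum, which is the adaptive behavior the abstract attributes to SoftPool.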