SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval

IF 3.4 2区 工程技术 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Displays Pub Date : 2025-04-01 Epub Date: 2025-02-07 DOI:10.1016/j.displa.2025.102987
Yunhao Li , Sijing Wu , Yucheng Zhu , Wei Sun , Zhichao Zhang , Song Song , Guangtao Zhai
{"title":"SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval","authors":"Yunhao Li ,&nbsp;Sijing Wu ,&nbsp;Yucheng Zhu ,&nbsp;Wei Sun ,&nbsp;Zhichao Zhang ,&nbsp;Song Song ,&nbsp;Guangtao Zhai","doi":"10.1016/j.displa.2025.102987","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, text to 3d human motion retrieval has been a hot topic in computer vision. However, current existing methods utilize contrastive learning and motion reconstruction as the main proxy task. Although these methods achieve great performance, such simple strategies may cause the network to lose temporal motion information and distort the text feature, which may injury motion retrieval results. Meanwhile, current motion retrieval methods ignore the post processing for predicted similarity matrices. Considering these two problems, in this work, we present <strong>SAMR</strong>, an encoder–decoder based transformer framework with symmetric masked multi-modal information modeling. Concretely, we remove the KL divergence loss and reconstruct the motion and text inputs jointly. To enhance the robustness of our retrieval model, we also propose a mask modeling strategy. Our SAMR performs joint masking on both image and text inputs, during training, for each modality, we simultaneously reconstruct the original input modality and masked modality to stabilize the training. After training, we also utilize the dual softmax optimization method to improve the final performance. We conduct extensive experiments on both text-to-motion dataset and speech-to-motion dataset. The experimental results demonstrate that SAMR achieves the state-of-the-art performance in various cross-modal motion retrieval tasks including speech to motion and text to motion, showing great potential to serve as a general foundation motion retrieval framework.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"87 ","pages":"Article 102987"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938225000241","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/7 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, text to 3d human motion retrieval has been a hot topic in computer vision. However, current existing methods utilize contrastive learning and motion reconstruction as the main proxy task. Although these methods achieve great performance, such simple strategies may cause the network to lose temporal motion information and distort the text feature, which may injury motion retrieval results. Meanwhile, current motion retrieval methods ignore the post processing for predicted similarity matrices. Considering these two problems, in this work, we present SAMR, an encoder–decoder based transformer framework with symmetric masked multi-modal information modeling. Concretely, we remove the KL divergence loss and reconstruct the motion and text inputs jointly. To enhance the robustness of our retrieval model, we also propose a mask modeling strategy. Our SAMR performs joint masking on both image and text inputs, during training, for each modality, we simultaneously reconstruct the original input modality and masked modality to stabilize the training. After training, we also utilize the dual softmax optimization method to improve the final performance. We conduct extensive experiments on both text-to-motion dataset and speech-to-motion dataset. The experimental results demonstrate that SAMR achieves the state-of-the-art performance in various cross-modal motion retrieval tasks including speech to motion and text to motion, showing great potential to serve as a general foundation motion retrieval framework.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SAMR:用于一般多模态三维运动检索的对称屏蔽多模态建模
近年来,文本到三维人体运动的检索一直是计算机视觉领域的研究热点。然而,现有的方法以对比学习和运动重建作为主要的代理任务。虽然这些方法取得了很好的性能,但这些简单的策略可能会导致网络丢失时间运动信息和扭曲文本特征,从而影响运动检索结果。同时,目前的运动检索方法忽略了对预测的相似矩阵的后处理。考虑到这两个问题,在这项工作中,我们提出了SAMR,一个基于编码器-解码器的变压器框架,具有对称掩蔽多模态信息建模。具体来说,我们去除KL散度损失,并共同重建运动和文本输入。为了增强检索模型的鲁棒性,我们还提出了一种掩模建模策略。我们的SAMR对图像和文本输入进行联合掩蔽,在训练过程中,我们对每个模态同时重建原始输入模态和掩蔽模态以稳定训练。在训练后,我们还利用双softmax优化方法来提高最终的性能。我们在文本到运动数据集和语音到运动数据集上进行了广泛的实验。实验结果表明,SAMR在多种跨模态运动检索任务(包括语音到运动和文本到运动)中达到了最先进的性能,显示出作为通用基础运动检索框架的巨大潜力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Displays
Displays 工程技术-工程:电子与电气
CiteScore
4.60
自引率
25.60%
发文量
138
审稿时长
92 days
期刊介绍: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface. Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals intended for display technologies and human factor engineers new to the field will also occasionally featured.
期刊最新文献
Towards high-dimensional IMU-based human activity recognition: data construction via 3D body modeling and classification with a multi-channel attention fusion network Single blind image deblurring: advances and prospects All Inkjet-Printed red Micro Quantum-Dots Light-Emitting diodes (QLEDs): Fabrication and performance Comparative analysis of virtual keyboard typing methods in VR: controller, poke, and pinch techniques Circadian and color analysis of programming themes: effects of theme design, ambient lighting, and viewing distance on developer visual ergonomics
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1