SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval
Yunhao Li, Sijing Wu, Yucheng Zhu, Wei Sun, Zhichao Zhang, Song Song, Guangtao Zhai
Displays, Volume 87, Article 102987, 2025. DOI: 10.1016/j.displa.2025.102987
Abstract
Recently, text-to-3D human motion retrieval has become a hot topic in computer vision. However, existing methods rely on contrastive learning and motion reconstruction as their main proxy tasks. Although these methods achieve strong performance, such simple strategies may cause the network to lose temporal motion information and distort the text features, which can harm motion retrieval results. Meanwhile, current motion retrieval methods ignore post-processing of the predicted similarity matrices. Considering these two problems, in this work we present SAMR, an encoder–decoder transformer framework with symmetric masked multi-modal information modeling. Concretely, we remove the KL divergence loss and reconstruct the motion and text inputs jointly. To enhance the robustness of our retrieval model, we also propose a mask modeling strategy: SAMR performs joint masking on both motion and text inputs, and during training we simultaneously reconstruct the original input and the masked input for each modality to stabilize training. After training, we further apply the dual softmax optimization method to improve the final performance. We conduct extensive experiments on both a text-to-motion dataset and a speech-to-motion dataset. The experimental results demonstrate that SAMR achieves state-of-the-art performance on various cross-modal motion retrieval tasks, including speech-to-motion and text-to-motion, showing great potential to serve as a general foundation motion retrieval framework.
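The dual softmax step mentioned in the abstract is a common inference-time re-ranking trick for cross-modal retrieval: each similarity score is re-weighted by a softmax prior computed over the opposite retrieval direction, which sharpens mutually consistent matches. A minimal PyTorch sketch is given below; the function name, matrix shapes, and temperature are illustrative assumptions, since the abstract does not state SAMR's exact formulation.

```python
# Hedged sketch of dual-softmax post-processing for a cross-modal similarity matrix.
# The temperature and normalization are illustrative, not SAMR's reported settings.
import torch
import torch.nn.functional as F

def dual_softmax_rerank(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """Re-weight a (num_texts, num_motions) similarity matrix.

    Each entry is multiplied by a softmax prior taken along the text axis,
    so a motion boosts only the text queries it matches most strongly.
    """
    prior = F.softmax(sim * temperature, dim=0)  # prior over texts for every motion
    return sim * prior

if __name__ == "__main__":
    sim = torch.randn(4, 6)            # e.g., cosine similarities from the two encoders
    reranked = dual_softmax_rerank(sim)
    top1 = reranked.argmax(dim=1)      # best motion index per text query after re-ranking
    print(top1)
```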
Journal introduction:
Displays is the international journal covering the research and development of display technology, the effective presentation and perception of information, and applications and systems including the display-human interface.
Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.