MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions

Impact factor: 10.8 | Region 1 (Engineering & Technology) | JCR Q1: ENGINEERING, ELECTRICAL & ELECTRONIC | IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2024-07-04 | DOI: 10.1109/TCSVT.2024.3421565

Sheng Yan; Mengyuan Liu; Yong Wang; Yang Liu; Hong Liu

Published in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11535-11550. Publisher page: https://ieeexplore.ieee.org/document/10584551/

Citations: 0

Abstract

MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions

In this paper, we address the unexplored problem of temporal sentence localization in human motions (TSLM), which aims to locate the target moment in a 3D human motion that semantically corresponds to a text query. Because 3D human motions are captured with specialized motion capture devices, they consist of only a few joints and lack complex scene information such as objects and lighting. As a result, motion data has low contextual richness and is semantically ambiguous between frames, which limits current video localization frameworks extended to TSLM to only rough predictions. To refine these predictions, we devise two novel label-prior-assisted training schemes: one embeds prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. On our constructed TSLM benchmark, our model, termed MLP, achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior work. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.
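The recall figures at IoU@0.7 quoted in the abstract can be read with the help of a small sketch of temporal IoU and a thresholded recall. This is an illustrative sketch only, not the authors' evaluation code: the function names `temporal_iou` and `recall_at_iou` are hypothetical, spans are assumed to be `(start, end)` pairs in a common unit (seconds or frames), and a single top-1 prediction per query is assumed.

```python
from typing import Sequence, Tuple

# Assumption: a moment is a (start, end) pair with start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1D temporal spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Span], gts: Sequence[Span],
                  threshold: float = 0.7) -> float:
    """Percentage of queries whose predicted span reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions and ground-truth moments for two queries.
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(0.0, 9.0), (5.0, 15.0)]
    print(recall_at_iou(predictions, ground_truth, threshold=0.7))
```

Under these assumptions, a higher threshold such as 0.7 only counts predictions whose boundaries closely match the annotated moment, which is why the abstract singles out performance at high IoU.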

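Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months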

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.

Latest articles from this journal

Class Knowledge-Guided Lightweight Network for Salient Object Detection of Strip Steel Surface Defect

M4FT: Mamba, Migratory, Mobile, and Multiple Fish Tracking

Collaborative Model and Data Adaptation at Test Time

Monotonic Rank Knowledge Distillation via Kendall Correlation

WmLSTM: A Plug-and-Play Window-Level mLSTM-Based Temporal Encoder for Robust Visual Tracking