Gloss-driven Conditional Diffusion Models for Sign Language Production

IF 5.2 3区 计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-05-03 DOI:10.1145/3663572
Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong
{"title":"Gloss-driven Conditional Diffusion Models for Sign Language Production","authors":"Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong","doi":"10.1145/3663572","DOIUrl":null,"url":null,"abstract":"<p>Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages, and cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. In the process of sign pose generation, the textual semantic priors carried in the encoded gloss features are integrated into the embedded Gaussian noise via cross-attention. Subsequently, the model converts the fused features into sign language pose sequences through T-round denoising steps. During the training process, the model uses the ground-truth labels of sign poses as the starting point, generates Gaussian noise through T rounds of noise, and then performs T rounds of denoising to approximate the real sign language gestures. The entire process is constrained by the MAE loss function to ensure that the generated sign language gestures are as close as possible to the real labels. In the inference phase, the model directly randomly samples a set of Gaussian noise, generates multiple sign language gesture sequence hypotheses under the guidance of the gloss sequence, and outputs a high-confidence sign language gesture video by averaging multiple hypotheses. Experimental results on the Phoenix2014T dataset show that the proposed GCDM method achieves competitiveness in both quantitative performance and qualitative visualization.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"8 1","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3663572","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages, and cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. In the process of sign pose generation, the textual semantic priors carried in the encoded gloss features are integrated into the embedded Gaussian noise via cross-attention. Subsequently, the model converts the fused features into sign language pose sequences through T-round denoising steps. During the training process, the model uses the ground-truth labels of sign poses as the starting point, generates Gaussian noise through T rounds of noise, and then performs T rounds of denoising to approximate the real sign language gestures. The entire process is constrained by the MAE loss function to ensure that the generated sign language gestures are as close as possible to the real labels. In the inference phase, the model directly randomly samples a set of Gaussian noise, generates multiple sign language gesture sequence hypotheses under the guidance of the gloss sequence, and outputs a high-confidence sign language gesture video by averaging multiple hypotheses. Experimental results on the Phoenix2014T dataset show that the proposed GCDM method achieves competitiveness in both quantitative performance and qualitative visualization.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
用于手语生成的词汇驱动条件扩散模型
手语制作(SLP)旨在将文本或音频句子转换成与其语义相对应的手语视频,由于手语的多样性和复杂性以及跨模态语义映射问题,这项工作极具挑战性。在这项工作中,我们提出了用于 SLP 的光泽驱动条件扩散模型(GCDM)。GCDM 的核心是一个扩散模型架构,其中符号光泽序列由基于变换器的编码器编码,并作为语义先验条件输入扩散模型。在符号姿态生成的过程中,编码光泽特征所携带的文本语义先验条件通过交叉注意整合到嵌入式高斯噪声中。随后,模型通过 T 轮去噪步骤将融合后的特征转换为手语姿势序列。在训练过程中,模型以手语姿势的地面实况标签为起点,通过 T 轮噪声生成高斯噪声,然后执行 T 轮去噪,以逼近真实的手语姿势。整个过程受 MAE 损失函数的限制,以确保生成的手势尽可能接近真实标签。在推理阶段,模型直接随机采样一组高斯噪声,在光泽序列的指导下生成多个手语手势序列假设,并通过平均多个假设输出高置信度的手语手势视频。在 Phoenix2014T 数据集上的实验结果表明,所提出的 GCDM 方法在定量性能和定性可视化方面都具有竞争力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
8.50
自引率
5.90%
发文量
285
审稿时长
7.5 months
期刊介绍: The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print form and digital form. The Journal is published quarterly; with roughly 7 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure a timely publication. The transactions consists primarily of research papers. This is an archival journal and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.
期刊最新文献
TA-Detector: A GNN-based Anomaly Detector via Trust Relationship KF-VTON: Keypoints-Driven Flow Based Virtual Try-On Network Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning Multimodal Fusion for Talking Face Generation Utilizing Speech-related Facial Action Units Compressed Point Cloud Quality Index by Combining Global Appearance and Local Details
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1