Semantic-driven diffusion for sign language production with gloss-pose latent spaces alignment

IF 3.5 · Region 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Computer Vision and Image Understanding · Pub Date: 2024-09-01 · Epub Date: 2024-06-07 · DOI: 10.1016/j.cviu.2024.104050

Sheng Chen, Qingshan Wang, Qi Wang

Computer Vision and Image Understanding, volume 246, Article 104050. https://www.sciencedirect.com/science/article/pii/S1077314224001310

Citations: 0

Abstract

Sign Language Production (SLP) aims to translate spoken language into visual sign language sequences. The most challenging step in SLP is transforming a sequence of sign glosses into the corresponding sign poses (G2P). Existing G2P approaches mainly focus on mapping sign language glosses to frame-level sign pose representations, while neglecting that a gloss is only a weak annotation of the sign pose sequence. To address this problem, this paper proposes a semantic-driven diffusion model with gloss-pose latent spaces alignment (SDD-GPLA) for G2P. G2P is divided into two phases. In the first phase, we design the gloss-pose latent spaces alignment (GPLA) to model sign pose latent representations with gloss dependency. In the second phase, we propose semantic-driven diffusion (SDD) with supervised pose reconstruction guidance as a mapping between gloss and sign pose latent features. In addition, we propose the sign pose decoder ($\text{Decoder}^{p}$) to progressively generate high-resolution sign poses from latent sign pose features and to guide the SDD training process. We evaluated SDD-GPLA on a self-collected Daily Chinese Sign Language (DCSL) dataset and on the public RWTH-Phoenix-Weather-2014T dataset. Compared with state-of-the-art G2P methods, we obtain improvements in WER of at least 22.9% and 2.3% on the two datasets, respectively.
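
The abstract reports results as WER (word error rate) but does not describe how the score is computed in the paper's pipeline. As a general illustration only, WER is the token-level edit distance between a reference sequence and a predicted sequence, divided by the reference length. The sketch below is a minimal Python version; the function name `wer` and the example gloss sequences are hypothetical and are not taken from the paper or its datasets.

```python
def wer(reference, hypothesis):
    """Word error rate: token-level Levenshtein distance / reference length.

    `reference` and `hypothesis` are lists of tokens (for example, glosses).
    This is a generic sketch, not the paper's evaluation code.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences, invented for illustration.
ref = ["TOMORROW", "RAIN", "NORTH", "STRONG"]
hyp = ["TOMORROW", "SNOW", "NORTH"]
score = wer(ref, hyp)  # one substitution + one deletion over 4 reference tokens
```

Lower WER is better, so an "improvement" in WER means a reduction in this ratio. Note that the score can exceed 1.0 when the hypothesis contains many inserted tokens.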




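Source journal

Computer Vision and Image Understanding (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days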

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems