Attentional bias for hands: Cascade dual-decoder transformer for sign language production

Impact factor: 1.5 | Zone 4, Computer Science | JCR Q4, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | IET Computer Vision | Pub Date: 2024-03-08 | DOI: 10.1049/cvi2.12273
Xiaohan Ma, Rize Jin, Jianming Wang, Tae-Sun Chung
IET Computer Vision, Volume 18, Issue 5, pages 696-708. Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1049/cvi2.12273
Citations: 0

Abstract

Sign Language Production (SLP) refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning by means of multiple asynchronous articulators, including manual and non-manual information channels. Recent deep learning-based SLP models directly generate the full-articulatory sign sequence from the text input in an end-to-end manner. However, these models largely down-weight the importance of subtle differences in the manual articulation due to the effect of regression to the mean. To explore these neglected aspects, an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP is proposed to learn, successively, two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, utilising an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more efficacious guidance, a novel spatio-temporal loss that penalises shape dissimilarity and temporal distortions of produced sequences is introduced. Experimental studies are performed on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that the authors' model demonstrates competitive performance compared to state-of-the-art models and, in some cases, achieves considerable improvements over them.
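
The abstract names a spatio-temporal loss that penalises shape dissimilarity and temporal distortions but does not give its form. As a rough illustration only, the sketch below uses a dynamic time warping (DTW) alignment as a stand-in: the alignment cost serves as the shape term, and the deviation of the alignment path from the diagonal serves as the temporal term. The function names, the `alpha` weighting, and the normalisation are all assumptions for this example and are not taken from the paper. This hard DTW is also not differentiable, so it illustrates the two penalties rather than a trainable objective.

```python
from math import inf


def frame_distance(a, b):
    """Squared Euclidean distance between two pose frames (flat coordinate lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_alignment(pred, target):
    """Classic DTW: return (total alignment cost, optimal path of index pairs)."""
    n, m = len(pred), len(target)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(pred[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor cell.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def shape_and_temporal_terms(pred, target):
    """Hypothetical decomposition: (shape dissimilarity, temporal distortion)."""
    total, path = dtw_alignment(pred, target)
    shape = total / len(path)  # mean per-step alignment cost
    longest = max(len(pred), len(target))
    # Mean squared offset of the path from the diagonal, scaled by sequence length.
    temporal = sum((i - j) ** 2 for i, j in path) / (len(path) * longest ** 2)
    return shape, temporal


def spatio_temporal_loss(pred, target, alpha=0.5):
    """Weighted sum of the two terms; alpha is an assumed trade-off parameter."""
    shape, temporal = shape_and_temporal_terms(pred, target)
    return alpha * shape + (1 - alpha) * temporal


# Toy 1-D "pose" sequences: same shape, but the prediction moves one frame early.
target = [[0], [0], [1], [2]]
pred = [[0], [1], [2], [2]]
shape, temporal = shape_and_temporal_terms(pred, target)
print(shape, temporal)
```

In this toy case the predicted frames match the target frames after warping, so the shape term vanishes while the temporal term stays positive. A plain frame-by-frame error would not separate the two effects in this way.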

Source journal: IET Computer Vision (Engineering and Technology - Engineering: Electrical and Electronic)
CiteScore: 3.30
Self-citation rate: 11.80%
Articles published: 76
Review time: 3.4 months
Journal description: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, but not forgetting those works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: Biologically and perceptually motivated approaches to low level vision (feature detection, etc.); Perceptual grouping and organisation; Representation, analysis and matching of 2D and 3D shape; Shape-from-X; Object recognition; Image understanding; Learning with visual inputs; Motion analysis and object tracking; Multiview scene analysis; Cognitive approaches in low, mid and high level vision; Control in visual systems; Colour, reflectance and light; Statistical and probabilistic models; Face and gesture; Surveillance; Biometrics and security; Robotics; Vehicle guidance; Automatic model acquisition; Medical image analysis and understanding; Aerial scene analysis and remote sensing; Deep learning models in computer vision.

Both methodological and applications orientated papers are welcome. Manuscripts submitted are expected to include a detailed and analytical review of the literature and state-of-the-art exposition of the original proposed research and its methodology, its thorough experimental evaluation, and last but not least, comparative evaluation against relevant and state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.

Special Issues, Current Call for Papers:
Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf
Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf