Highly Fluent Sign Language Synthesis Based on Variable Motion Frame Interpolation

Ni Zeng, Yiqiang Chen, Yang Gu, Dongdong Liu, Yunbing Xing

2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1772-1777. Published 2020-10-11. DOI: 10.1109/SMC42975.2020.9283193. Citations: 2.

Abstract
Sign Language Synthesis (SLS) is a domain-specific problem in which videos of individual sign language words are stitched together to generate a whole sentence, facilitating communication between hearing-impaired people and the hearing population. This paper presents a Variable Motion Frame Interpolation (VMFI) method for highly fluent SLS from scattered videos. Existing SLS approaches rely mainly on mechanical virtual-human technology and lack flexibility and natural effect. Moreover, representative frame-interpolation methods usually assume that the moving object travels at a constant speed, an assumption ill-suited to predicting the complex hand motion across frames of scattered sign language videos. To address these issues, the proposed VMFI models acceleration to predict more accurate interpolated frames with an end-to-end convolutional neural network. The VMFI framework consists of a variable optical flow estimation network and a high-quality frame synthesis network, which approximate and fuse the intermediate optical flow to generate the interpolated frames used for synthesis. Experimental results on a realistic, self-collected Chinese sign language dataset demonstrate that VMFI outperforms two representative methods in PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity), and MA (Motion Activity), and achieves a higher MOS (Mean Opinion Score).
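The key departure from constant-velocity interpolation is the acceleration term. As a rough illustration only (not the authors' implementation, in which the flows come from the variable optical flow estimation network and the frame is produced by the synthesis network), the NumPy sketch below applies a constant-acceleration motion model: given the flows between a previous, current, and next frame, it predicts the flow to an intermediate time t and warps the current frame. The function names and the simple negated-flow backward warp are assumptions made for the sketch.

```python
import numpy as np

def predict_intermediate_flow(flow_prev, flow_next, t):
    """Predict the flow from frame 0 to time t in (0, 1) under a
    constant-acceleration motion model (illustrative sketch).

    flow_prev: flow from frame -1 to frame 0, shape (H, W, 2)
    flow_next: flow from frame 0 to frame 1, shape (H, W, 2)
    """
    velocity = 0.5 * (flow_prev + flow_next)   # central-difference velocity at frame 0
    acceleration = flow_next - flow_prev        # change in per-frame displacement
    # Quadratic motion: f_{0->t} = v0 * t + 0.5 * a * t^2
    # (at t = 1 this recovers flow_next exactly)
    return velocity * t + 0.5 * acceleration * t * t

def backward_warp(frame, flow):
    """Approximate the frame at time t by bilinearly sampling frame 0
    against the negated forward flow (a common simplification)."""
    h, w = frame.shape[:2]
    grid_y, grid_x = np.mgrid[0:h, 0:w].astype(np.float32)
    src_x = np.clip(grid_x - flow[..., 0], 0, w - 1)
    src_y = np.clip(grid_y - flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (src_x - x0)[..., None], (src_y - y0)[..., None]
    return ((1 - wy) * ((1 - wx) * frame[y0, x0] + wx * frame[y0, x1])
            + wy * ((1 - wx) * frame[y1, x0] + wx * frame[y1, x1]))

# Usage sketch: interpolate the midpoint between frame0 and frame1,
# assuming some flow estimator (hypothetical here) has produced the flows.
# f_0_t  = predict_intermediate_flow(flow_m1_to_0, flow_0_to_1, t=0.5)
# frame_t = backward_warp(frame0, f_0_t)
```

Under a constant-speed assumption the acceleration term vanishes and the prediction degenerates to linear scaling of a single flow; the acceleration term is what lets the model track hands that speed up or slow down between stitched sign-word clips. The paper's full method additionally fuses the approximated intermediate flows and synthesizes the output frame with a CNN rather than raw bilinear warping.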