Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang

arXiv:2408.10488 (arXiv - CS - Neural and Evolutionary Computing), 2024-08-20
Sign Language Translation (SLT) is a core task in AI-assisted accessibility for people with disabilities. Traditional SLT based on visible-light video is easily affected by factors such as lighting, rapid hand movements, and privacy breaches. This paper instead proposes using high-definition Event streams for SLT, which effectively mitigates these issues: Event streams have a high dynamic range and dense temporal signals, so they withstand low illumination and motion blur well, and their spatial sparsity effectively protects the privacy of the person being recorded.
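To make these properties concrete, here is a minimal, illustrative sketch (not the actual Event-CSL format or preprocessing): each raw event is a sparse (x, y, timestamp, polarity) tuple, and a short time window of events can be accumulated into a frame-like tensor for a downstream network. The sensor resolution, window length, and random events below are assumptions for demonstration only.

```python
import numpy as np

# Illustrative only: assumed event layout (x, y, t, p); Event-CSL's real format may differ.
H, W = 720, 1280                                   # assumed high-definition sensor resolution
rng = np.random.default_rng(0)
n_events = 5000
events = np.stack([
    rng.integers(0, W, n_events),                  # x coordinate
    rng.integers(0, H, n_events),                  # y coordinate
    np.sort(rng.integers(0, 33_000, n_events)),    # timestamp in microseconds (~one 30 FPS frame)
    rng.choice([-1, 1], n_events),                 # polarity: brightness increase / decrease
], axis=1)

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate one window of events into a 2-channel count frame (positive / negative polarity)."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    x, y, _, p = events.T
    np.add.at(frame[0], (y[p > 0], x[p > 0]), 1.0)
    np.add.at(frame[1], (y[p < 0], x[p < 0]), 1.0)
    return frame

frame = events_to_frame(events, H, W)
# Spatial sparsity: only a small fraction of pixels receive any event in this window.
print("active pixels:", np.count_nonzero(frame.sum(axis=0)), "of", H * W)
```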
More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and a text vocabulary of 2,544 Chinese words. The samples were collected in a variety of indoor and outdoor scenes, covering multiple viewing angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT methods on this dataset to enable fair comparison for future work.

Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model's ability to integrate the temporal information of CNN features, yielding improved sign language translation results. Both the benchmark dataset and the source code will be released at https://github.com/Event-AHU/OpenESL.
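The abstract only outlines the baseline at a high level: a CNN extracts per-frame spatial features from the event representation, and a Mamba state-space layer integrates them over time before a text prediction head. The sketch below shows one way such a pipeline could be wired together, assuming the `mamba_ssm` package (which typically requires a CUDA GPU), a ResNet-18 backbone, and a hypothetical `EventSLTBaseline` module; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm
from mamba_ssm import Mamba  # assumes the mamba_ssm package is installed (usually needs CUDA)

class EventSLTBaseline(nn.Module):
    """Hypothetical sketch: CNN spatial features + Mamba temporal mixing + token classifier."""

    def __init__(self, vocab_size: int = 2544, d_model: int = 512):
        super().__init__()
        backbone = tvm.resnet18(weights=None)
        backbone.fc = nn.Identity()                 # reuse ResNet-18 as a 512-d feature extractor
        self.cnn = backbone
        self.temporal = Mamba(d_model=d_model)      # selective state-space layer over the time axis
        self.head = nn.Linear(d_model, vocab_size)  # toy decoder head over the Chinese text vocabulary

    def forward(self, event_frames: torch.Tensor) -> torch.Tensor:
        # event_frames: (batch, time, 3, H, W); event stacks rendered as 3-channel
        # images here only so the ImageNet-style backbone can consume them.
        b, t = event_frames.shape[:2]
        feats = self.cnn(event_frames.flatten(0, 1))   # (b*t, 512) per-frame spatial features
        feats = feats.view(b, t, -1)                   # (b, t, 512) feature sequence
        feats = self.temporal(feats)                   # Mamba integrates temporal information
        return self.head(feats)                        # (b, t, vocab_size) token logits

model = EventSLTBaseline()
logits = model(torch.randn(2, 16, 3, 128, 128))        # two clips of 16 event frames each
print(logits.shape)                                    # torch.Size([2, 16, 2544])
```

The linear head stands in for whatever text decoder the released code actually uses; the point of the sketch is only the CNN-then-Mamba ordering, where spatial features are computed per frame and temporal context is added by the state-space layer.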