{"title":"基于低复杂度长短期记忆的语音活动检测","authors":"Ruiting Yang, Jie Liu, Xiang Deng, Zhuochao Zheng","doi":"10.1109/MMSP48831.2020.9287142","DOIUrl":null,"url":null,"abstract":"Voice Activity Detection (VAD) plays an important role in audio processing, but it is also a common challenge when a voice signal is corrupted with strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features are used. The low complex structure allows it can be easily implemented in speech processing algorithms and applications. With carefully pre-processing and labeling the collected training data in the classes of speech or non-speech and training on the LSTM net, experiments show the proposed VAD is able to distinguish speech from different types of noisy background effectively. Its robustness against changes including varying frame length, moving speech sources and speaking in different languages, are further investigated.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Low Complexity Long Short-Term Memory Based Voice Activity Detection\",\"authors\":\"Ruiting Yang, Jie Liu, Xiang Deng, Zhuochao Zheng\",\"doi\":\"10.1109/MMSP48831.2020.9287142\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Voice Activity Detection (VAD) plays an important role in audio processing, but it is also a common challenge when a voice signal is corrupted with strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features are used. The low complex structure allows it can be easily implemented in speech processing algorithms and applications. With carefully pre-processing and labeling the collected training data in the classes of speech or non-speech and training on the LSTM net, experiments show the proposed VAD is able to distinguish speech from different types of noisy background effectively. Its robustness against changes including varying frame length, moving speech sources and speaking in different languages, are further investigated.\",\"PeriodicalId\":188283,\"journal\":{\"name\":\"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MMSP48831.2020.9287142\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MMSP48831.2020.9287142","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Low Complexity Long Short-Term Memory Based Voice Activity Detection
Voice Activity Detection (VAD) plays an important role in audio processing, but it is also a common challenge when a voice signal is corrupted with strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features are used. The low complex structure allows it can be easily implemented in speech processing algorithms and applications. With carefully pre-processing and labeling the collected training data in the classes of speech or non-speech and training on the LSTM net, experiments show the proposed VAD is able to distinguish speech from different types of noisy background effectively. Its robustness against changes including varying frame length, moving speech sources and speaking in different languages, are further investigated.