Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Hoirin Kim
Linear-scale filterbank for deep neural network-based voice activity detection
2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), published 2017-11-01
DOI: 10.1109/ICSDA.2017.8384446 (https://doi.org/10.1109/ICSDA.2017.8384446)
Citations: 9
Abstract
Voice activity detection (VAD) is an important preprocessing module in many speech applications. Choosing appropriate features and model structures is a significant challenge and an active area of current VAD research. Mel-scale features such as Mel-frequency cepstral coefficients (MFCCs) and log Mel-filterbank (LMFB) energies have been widely used in VAD as well as in speech recognition. Mel-scale feature extraction is one of the most popular methods because it mimics how the human ear processes sound. However, for certain types of sound whose important characteristics lie mainly in the high-frequency range, a linear frequency scale may provide more information than the Mel scale. Therefore, in this paper, we propose a deep neural network (DNN)-based VAD system that uses linear-scale features. This study shows that linear-scale features, especially log linear-filterbank (LLFB) energies, can be used in a DNN-based VAD system and outperform LMFB for certain types of noise. Moreover, a combination of LMFB and LLFB can integrate the advantages of both features.
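To make the contrast between the two feature types concrete, the following is a minimal sketch of how log filterbank energies might be computed on either a Mel or a linear frequency scale. It is not the authors' implementation: the abstract does not specify the number of filters, FFT size, window, or Mel formula, so the triangular filter shape, the Hamming window, the widely used 2595·log10(1 + f/700) Mel mapping, and all function names here are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f):
    # Common Mel mapping; assumed here, not confirmed by the paper.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(n_filters, n_fft, sample_rate, scale="linear"):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters.

    scale="mel" spaces the filter edges evenly on the Mel scale (LMFB-style);
    scale="linear" spaces them evenly in Hz (LLFB-style).
    """
    nyquist = sample_rate / 2.0
    if scale == "mel":
        edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_filters + 2))
    else:
        edges_hz = np.linspace(0.0, nyquist, n_filters + 2)

    freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)  # FFT bin frequencies
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # Triangle: the lower of the two slopes, floored at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_filterbank_energies(frame, bank, n_fft):
    """Log filterbank energies of one time-domain frame."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    return np.log(bank @ power + 1e-10)  # small floor avoids log(0)
```

With these hypothetical helpers, `triangular_filterbank(40, 512, 16000, "mel")` and `triangular_filterbank(40, 512, 16000, "linear")` give two banks of the same size, and concatenating the two outputs of `log_filterbank_energies` is one simple way to form a combined LMFB and LLFB feature vector. The linear bank places more of its filters in the upper half of the spectrum, which is the property the abstract points to for sounds with informative high-frequency content.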