Investigating the Important Temporal Modulations for Deep-Learning-Based Speech Activity Detection
Tyler Vuong, Nikhil Madaan, Rohan Panda, R. Stern
2022 IEEE Spoken Language Technology Workshop (SLT), published 2023-01-09
DOI: 10.1109/SLT54892.2023.10022462 (https://doi.org/10.1109/SLT54892.2023.10022462)
Citations: 1
Abstract
We describe a learnable modulation spectrogram feature for speech activity detection (SAD). Modulation features capture the temporal dynamics of each frequency subband. We compute learnable modulation spectrogram features by first calculating the log-mel spectrogram. Next, we filter each frequency subband with a bandpass filter that contains a learnable center frequency. The resulting SAD system was evaluated on the Fearless Steps Phase-04 SAD challenge. Experimental results showed that temporal modulations around the 4–6 Hz range are crucial for deep-learning-based SAD. These experimental results align with previous studies that found slow temporal modulation to be most important for speech-processing tasks and speech intelligibility. Additionally, we found that the learnable modulation spectrogram feature outperforms both the standard log-mel and fixed modulation spectrogram features on the Fearless Steps Phase-04 SAD test set.
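The filtering step described in the abstract (log-mel spectrogram first, then a bandpass filter along time in each frequency subband) can be illustrated with a minimal sketch. The abstract does not specify the filter shape, bandwidth, or frame rate, so everything below is an assumption made for illustration: a Gaussian-windowed cosine kernel, a 100 frames-per-second log-mel frame rate, a 2 Hz bandwidth, and the hypothetical helper names `bandpass_kernel`, `filter_subband`, and `modulation_spectrogram`. The center frequency is passed as a plain float here; in a trained system it would be a parameter updated by gradient descent.

```python
import math


def bandpass_kernel(center_hz, frame_rate_hz=100.0, bandwidth_hz=2.0, half_len=50):
    """Unit-norm Gaussian-windowed cosine FIR kernel centered at center_hz.

    The filter shape, bandwidth, and frame rate are illustrative assumptions,
    not values taken from the paper.
    """
    # Time-domain standard deviation (in frames) of a Gaussian whose
    # frequency-domain standard deviation is bandwidth_hz.
    sigma = frame_rate_hz / (2.0 * math.pi * bandwidth_hz)
    taps = []
    for n in range(-half_len, half_len + 1):
        window = math.exp(-0.5 * (n / sigma) ** 2)
        carrier = math.cos(2.0 * math.pi * center_hz * n / frame_rate_hz)
        taps.append(window * carrier)
    norm = math.sqrt(sum(t * t for t in taps))
    return [t / norm for t in taps]


def filter_subband(track, kernel):
    """Filter one subband's time trajectory; zero-padded, same output length."""
    half = len(kernel) // 2
    out = []
    for t in range(len(track)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(track):
                acc += w * track[idx]
        out.append(acc)
    return out


def modulation_spectrogram(log_mel, center_hz_per_band, frame_rate_hz=100.0):
    """Apply one bandpass filter per subband.

    log_mel: list of subbands, each a list of log-mel values over time.
    center_hz_per_band: one center frequency (Hz) per subband; this is the
    quantity that would be learned during training.
    """
    if len(log_mel) != len(center_hz_per_band):
        raise ValueError("need one center frequency per subband")
    return [
        filter_subband(track, bandpass_kernel(fc, frame_rate_hz))
        for track, fc in zip(log_mel, center_hz_per_band)
    ]


if __name__ == "__main__":
    # Synthetic stand-in for a log-mel subband: a pure 5 Hz modulation.
    track = [math.sin(2.0 * math.pi * 5.0 * n / 100.0) for n in range(300)]
    filtered = modulation_spectrogram([track, track], [5.0, 20.0])
    for fc, band in zip([5.0, 20.0], filtered):
        print(fc, sum(v * v for v in band))
```

In this toy setup, a filter centered near the 5 Hz modulation retains far more energy than one centered at 20 Hz. That selectivity is what makes the center frequency a meaningful quantity to learn.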