Deep Neural Networks for Joint Voice Activity Detection and Speaker Localization
Paolo Vecchiotti, E. Principi, S. Squartini, F. Piazza
DOI: 10.23919/EUSIPCO.2018.8553461
Published in: 2018 26th European Signal Processing Conference (EUSIPCO), 2018-09-01
Citations: 8
Abstract
Detecting the presence of speakers and suitably localizing them in indoor environments are two important tasks in the speech processing community. Several algorithms have been proposed for Voice Activity Detection (VAD) and Speaker LOCalization (SLOC), but their accomplishment by means of a joint integrated model has not received much attention. In particular, to the best of the authors' knowledge, no studies have focused on the cooperative exploitation of VAD and SLOC information by means of machine learning. The authors therefore propose a data-driven approach for joint speech detection and speaker localization, relying on a Convolutional Neural Network (CNN) that simultaneously processes LogMel and GCC-PHAT Patterns features. The proposed algorithm is compared with a two-stage model composed of the cascade of a neural-network (NN) based VAD and an NN-based SLOC, discussed in the authors' previous contributions. Computer simulations on the DIRHA dataset, which addresses a multi-room acoustic environment, show that the proposed method achieves a remarkable 33% relative reduction of the speech activity detection error compared to the original NN-based VAD. Moreover, the overall localization accuracy is also improved by employing the joint model as the speech detector with the standard neural SLOC system in cascade.
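The joint CNN described in the abstract consumes GCC-PHAT Patterns alongside LogMel spectrograms. As an illustration of the underlying GCC-PHAT feature (a generic NumPy sketch of the well-known cross-correlation technique, not the authors' exact feature-extraction pipeline; function name and frame length are illustrative):

```python
import numpy as np

def gcc_phat(sig, ref, n_fft=None):
    """Generalized Cross-Correlation with PHAT weighting.

    Returns the cross-correlation between two microphone signals,
    whitened by the PHAT weight so that only phase information
    (i.e. the time delay) drives the peak. Lag 0 is at the center
    of the returned array.
    """
    if n_fft is None:
        n_fft = len(sig)
    SIG = np.fft.rfft(sig, n=n_fft)
    REF = np.fft.rfft(ref, n=n_fft)
    cross = SIG * np.conj(REF)          # cross-power spectrum
    cross /= np.abs(cross) + 1e-12      # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n_fft)   # back to the lag domain
    return np.fft.fftshift(cc)          # place lag 0 at index n_fft // 2

# Example: a signal circularly delayed by 5 samples yields a
# correlation peak at lag +5, the time difference of arrival (TDOA).
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.roll(ref, 5)                   # sig[n] = ref[n - 5]
cc = gcc_phat(sig, ref)
lag = np.argmax(cc) - len(cc) // 2      # → 5
```

In a localization front-end, such correlation vectors (one per microphone pair, typically restricted to a small window of lags around zero) form the "GCC-PHAT Patterns" that a network can map to source position.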