Prem Seetharaman, G. Mysore, P. Smaragdis, Bryan Pardo
{"title":"Blind Estimation of the Speech Transmission Index for Speech Quality Prediction","authors":"Prem Seetharaman, G. Mysore, P. Smaragdis, Bryan Pardo","doi":"10.1109/ICASSP.2018.8461827","DOIUrl":null,"url":null,"abstract":"The speech transmission index (STI) of a listening position within a given room indicates the quality and intelligibility of speech uttered in that room. The measure is very reliable for predicting speech intelligibility in many room conditions but requires an STI measurement of the impulse response for the room. We present a method for blindly estimating the STI without measuring or modeling the impulse response of the room using deep convolutional neural networks. Our model is trained entirely using simulated room impulse responses combined with clean speech examples from the DAPS dataset [1] and works directly on PCM audio. Our experiments show that our method predicts true STI with a high degree of accuracy – an average error of under 4%. It can also distinguish between different STI conditions to a level of granularity that is comparable to humans.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"32 1","pages":"591-595"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.2018.8461827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23
Abstract
The speech transmission index (STI) of a listening position within a given room indicates the quality and intelligibility of speech uttered in that room. The measure is very reliable for predicting speech intelligibility in many room conditions but requires an STI measurement of the impulse response for the room. We present a method for blindly estimating the STI without measuring or modeling the impulse response of the room using deep convolutional neural networks. Our model is trained entirely using simulated room impulse responses combined with clean speech examples from the DAPS dataset [1] and works directly on PCM audio. Our experiments show that our method predicts true STI with a high degree of accuracy – an average error of under 4%. It can also distinguish between different STI conditions to a level of granularity that is comparable to humans.