Multilingual non-intrusive binaural intelligibility prediction based on phone classification

IF 3.1 · CAS Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Computer Speech and Language · Pub Date: 2024-07-03 · DOI: 10.1016/j.csl.2024.101684
Jana Roßbach , Kirsten C. Wagener , Bernd T. Meyer

Abstract

Speech intelligibility (SI) prediction models are a valuable tool for the development of speech processing algorithms for hearing aids and consumer electronics. For use in realistic environments, it is desirable that the SI model be non-intrusive (i.e., it does not require separate input of original and degraded speech, transcripts, or a priori knowledge about the signals) and perform binaural processing of the audio signals. Most existing SI models do not fulfill all of these criteria. In this study, we propose an SI model based on phone probabilities obtained from a deep neural network. The model comprises a binaural enhancement stage for prediction of the speech recognition threshold (SRT) in realistic acoustic scenes. In the first part of the study, SRT predictions in different spatial configurations are compared to results from normal-hearing listeners. On average, our approach produces lower errors and higher correlations than three intrusive baseline models. In the second part, we explore whether measures relevant to spatial hearing, i.e., the intelligibility level difference (ILD) and the binaural ILD (BILD), can be predicted with our modeling approach. We also investigate whether a language mismatch between training and testing the model plays a role when predicting ILD and BILD. This point is especially important for low-resource languages, for which thousands of hours of training material are not available. Binaural benefits are predicted by our model with an error of 1.5 dB. This is slightly higher than the error of a competitive baseline, MBSTOI (1.1 dB), but our model does not require separate input of original and degraded speech. We also find that good binaural predictions can be obtained with models that are not specifically trained on the target language.
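The non-intrusive idea behind such a model can be illustrated with a toy sketch: phone posteriors produced by a neural acoustic model become less peaked as the signal degrades, so their mean per-frame entropy can serve as a simple degradation index, and the binaural benefit (BILD) is just the SRT improvement of binaural over a reference condition. The entropy proxy and the function names below are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def mean_phone_entropy(posteriors):
    """Mean per-frame entropy (bits) of phone posteriors, shape (T, K).

    Each row is a probability distribution over K phone classes.
    Peaked posteriors (low entropy) suggest confidently classified,
    presumably intelligible speech; flat posteriors (high entropy)
    suggest degraded speech. Illustrative proxy only.
    """
    p = np.clip(posteriors, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log2(p), axis=1)))

def bild(srt_reference_db, srt_binaural_db):
    """Binaural intelligibility level difference (BILD) in dB:
    SRT improvement of binaural listening over a reference condition.
    Lower SRT is better, so the benefit is reference minus binaural."""
    return srt_reference_db - srt_binaural_db

# Peaked (clean-like) vs. flat (degraded-like) posteriors over 4 classes
clean = np.array([[0.97, 0.01, 0.01, 0.01]] * 100)
noisy = np.array([[0.25, 0.25, 0.25, 0.25]] * 100)
assert mean_phone_entropy(clean) < mean_phone_entropy(noisy)

# Example: SRT of -2 dB (reference) vs. -6 dB (binaural) -> 4 dB benefit
print(bild(-2.0, -6.0))  # 4.0
```

In a full model along these lines, the degradation index would be mapped to an SRT via a calibration on listening-test data; the sketch only shows why phone posteriors carry intelligibility information without access to the clean reference signal.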

Source journal: Computer Speech and Language
Category: Engineering & Technology – Computer Science: Artificial Intelligence
CiteScore: 11.30
Self-citation rate: 4.70%
Articles per year: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.