基于深度神经网络的语音通信非侵入性语音质量评估。

IF 10.2 1区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE IEEE transactions on neural networks and learning systems Pub Date : 2023-10-12 DOI:10.1109/TNNLS.2023.3321076
Miao Liu, Jing Wang, Fei Wang, Fei Xiang, Jingdong Chen
{"title":"基于深度神经网络的语音通信非侵入性语音质量评估。","authors":"Miao Liu, Jing Wang, Fei Wang, Fei Xiang, Jingdong Chen","doi":"10.1109/TNNLS.2023.3321076","DOIUrl":null,"url":null,"abstract":"<p><p>Traditionally, speech quality evaluation relies on subjective assessments or intrusive methods that require reference signals or additional equipment. However, over recent years, non-intrusive speech quality assessment has emerged as a promising alternative, capturing much attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale intrusive simulated data to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method, which generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is proven to be useful for pretraining the deep learning models. Second, it proposes to apply an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, an autoencoder-based deep learning scheme is proposed following the principle of representation learning and adversarial training (AT) methods, which is able to transfer the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the help of discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, an end-to-end speech quality evaluation neural network is developed, which takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model using only the magnitude spectral features. A large number of experiments are carried out with three datasets: one simulated with labels obtained using POLQA and two recorded with labels obtained using subjective listening tests. The results show that the presented phase-aware method improves the performance of the baseline model and the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms the state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.</p>","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":10.2000,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication.\",\"authors\":\"Miao Liu, Jing Wang, Fei Wang, Fei Xiang, Jingdong Chen\",\"doi\":\"10.1109/TNNLS.2023.3321076\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Traditionally, speech quality evaluation relies on subjective assessments or intrusive methods that require reference signals or additional equipment. However, over recent years, non-intrusive speech quality assessment has emerged as a promising alternative, capturing much attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale intrusive simulated data to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method, which generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is proven to be useful for pretraining the deep learning models. Second, it proposes to apply an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, an autoencoder-based deep learning scheme is proposed following the principle of representation learning and adversarial training (AT) methods, which is able to transfer the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the help of discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, an end-to-end speech quality evaluation neural network is developed, which takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model using only the magnitude spectral features. A large number of experiments are carried out with three datasets: one simulated with labels obtained using POLQA and two recorded with labels obtained using subjective listening tests. The results show that the presented phase-aware method improves the performance of the baseline model and the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms the state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.</p>\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":10.2000,\"publicationDate\":\"2023-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/TNNLS.2023.3321076\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/TNNLS.2023.3321076","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

传统上,语音质量评估依赖于主观评估或需要参考信号或附加设备的侵入性方法。然而,近年来,非侵入式语音质量评估已成为一种很有前途的替代方法,引起了研究人员和行业专业人士的极大关注。本文提出了一种基于深度学习的方法,该方法利用大规模侵入性模拟数据来提高非侵入性方法的准确性和通用性。本文的主要贡献如下。首先,提出了一种数据模拟方法,该方法生成退化的语音信号,并用感知客观听力质量评估(POLQA)标记其语音质量。生成的数据被证明对深度学习模型的预训练是有用的。其次,提出应用对抗性说话人分类器来减少说话人相关信息对语音质量评估的影响。第三,根据表示学习和对抗性训练(AT)方法的原理,提出了一种基于自动编码器的深度学习方案,该方案能够转移从POLQA标记的大量模拟语音数据中学习到的知识。借助于从自动编码器中提取的判别表示,预测模型可以在通过主观听力测试标记的相对少量的语音数据上得到很好的训练。第四,开发了一种以幅度谱和相位谱特征为输入的端到端语音质量评估神经网络。这种相位感知模型比仅使用幅度谱特征的模型更准确。使用三个数据集进行了大量实验:一个用POLQA获得的标签模拟,两个用主观听力测试获得的标签记录。结果表明,所提出的相位感知方法提高了基线模型的性能,并且所提出的具有从对抗性自动编码器(AAE)中提取的潜在表示的模型优于最先进的客观质量评估方法,在北京理工学院(BIT)数据集和腾讯语料库上,均方根误差(RMSE)分别降低了10.5%和12.2%。代码和补充材料可在https://github.com/liushenme/AAE-SQA.
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication.

Traditionally, speech quality evaluation relies on subjective assessments or intrusive methods that require reference signals or additional equipment. However, over recent years, non-intrusive speech quality assessment has emerged as a promising alternative, capturing much attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale intrusive simulated data to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method, which generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is proven to be useful for pretraining the deep learning models. Second, it proposes to apply an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, an autoencoder-based deep learning scheme is proposed following the principle of representation learning and adversarial training (AT) methods, which is able to transfer the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the help of discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, an end-to-end speech quality evaluation neural network is developed, which takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model using only the magnitude spectral features. A large number of experiments are carried out with three datasets: one simulated with labels obtained using POLQA and two recorded with labels obtained using subjective listening tests. The results show that the presented phase-aware method improves the performance of the baseline model and the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms the state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
IEEE transactions on neural networks and learning systems
IEEE transactions on neural networks and learning systems COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
CiteScore
23.80
自引率
9.60%
发文量
2102
审稿时长
3-8 weeks
期刊介绍: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.
期刊最新文献
Adaptive Graph Convolutional Network for Unsupervised Generalizable Tabular Representation Learning Gently Sloped and Extended Classification Margin for Overconfidence Relaxation of Out-of-Distribution Samples NNG-Mix: Improving Semi-Supervised Anomaly Detection With Pseudo-Anomaly Generation Alleviate the Impact of Heterogeneity in Network Alignment From Community View Hierarchical Contrastive Learning for Semantic Segmentation
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1