Leveraging the Objective Intelligibility and Noise Estimation to Improve Conformer-Based MetricGAN
Authors: Chia Dai, Wan-Ling Zeng, Jia-Xuan Zeng, J. Hung
Published in: 2023 9th International Conference on Applied System Innovation (ICASI), 2023-04-21
DOI: 10.1109/ICASI57738.2023.10179495 (https://doi.org/10.1109/ICASI57738.2023.10179495)
Abstract
Conformer-based MetricGAN (CMGAN) is a deep neural network (DNN)-based speech enhancement (SE) method that uses time-frequency (TF) domain features to train a conformer-based generative network, and it has demonstrated excellent SE performance on various perceptual evaluation metrics. In this study, we propose to revise CMGAN along three directions. First, we incorporate phone-fortified perceptual loss (PFPL) into its loss function. PFPL is calculated from latent representations of speech produced by the wav2vec module, so including it in the loss function lets perceptual and linguistic speech information guide CMGAN model training. Second, we revise the discriminator output by adding STOI values. The original discriminator is trained to estimate the PESQ score of the enhanced speech, taking both the clean and enhanced spectra as inputs along with the associated PESQ label; in other words, it takes only the PESQ score into account. By also considering STOI, we expect to improve the discriminator. Finally, we add noise label estimation to the CMGAN framework. The original CMGAN only calculates the disparity between the model's estimate and the clean target. We additionally include a noise estimation loss, which measures the discrepancy between the predicted noise and the noise label. The Voicebank-Demand dataset is used for the evaluation experiments. According to the experimental results, the revised CMGAN outperforms the original, achieving higher scores on objective perceptual metrics including PESQ and STOI. These results confirm the effectiveness of the presented revisions to CMGAN.
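The second and third revisions can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say how the noise label is formed, how the terms are weighted, or how the metric scores are scaled, so the following are all assumptions made for illustration: the noise label is taken as the noisy signal minus the clean signal, the distance is a mean absolute error, `noise_weight` is a hypothetical weighting factor, and PESQ is rescaled from an assumed range of 1.0 to 4.5. PFPL is left out because it requires a pretrained wav2vec model.

```python
from typing import List, Sequence, Tuple


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def noise_label(noisy: Sequence[float], clean: Sequence[float]) -> List[float]:
    """Assumed noise label: whatever remains after removing the clean signal."""
    return [n - c for n, c in zip(noisy, clean)]


def combined_loss(
    enhanced: Sequence[float],
    predicted_noise: Sequence[float],
    noisy: Sequence[float],
    clean: Sequence[float],
    noise_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Clean-target loss plus a weighted noise estimation loss."""
    clean_term = mean_abs_error(enhanced, clean)
    noise_term = mean_abs_error(predicted_noise, noise_label(noisy, clean))
    return clean_term + noise_weight * noise_term


def discriminator_target(
    pesq: float,
    stoi: float,
    pesq_min: float = 1.0,  # assumed lower bound of the PESQ scale
    pesq_max: float = 4.5,  # assumed upper bound of the PESQ scale
) -> Tuple[float, float]:
    """Two-valued target: rescaled PESQ alongside STOI, instead of PESQ alone."""
    return ((pesq - pesq_min) / (pesq_max - pesq_min), stoi)


# Toy signals: the noise prediction is exact, so only the clean term contributes.
clean = [1.0, 2.0, 3.0]
noisy = [1.5, 2.5, 2.0]
enhanced = [1.0, 2.0, 3.5]
predicted_noise = [0.5, 0.5, -1.0]
loss = combined_loss(enhanced, predicted_noise, noisy, clean)
target = discriminator_target(pesq=4.5, stoi=0.9)
```

In this sketch the noise prediction is treated as a separate model output. If it were instead obtained by subtracting the enhanced signal from the noisy one, the noise term would reduce to the clean-target term and add no new information.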