Improving Mispronunciation Detection Using Speech Reconstruction

IF 4.1 2区 计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-29 DOI:10.1109/TASLP.2024.3434497
Anurag Das;Ricardo Gutierrez-Osuna
{"title":"Improving Mispronunciation Detection Using Speech Reconstruction","authors":"Anurag Das;Ricardo Gutierrez-Osuna","doi":"10.1109/TASLP.2024.3434497","DOIUrl":null,"url":null,"abstract":"Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text- to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate using phonetic information and we wanted to examine whether a boost in MDD performance can be by two tasks. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed it to the MDD system along with the text input. The MDD system then predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss as well as an MDD loss. Comparing the proposed systems against an identical system but without speech reconstruction and another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the and speaker verification simultaneously can lead to improve proposed system achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones have mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4420-4433"},"PeriodicalIF":4.1000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10613461/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text- to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate using phonetic information and we wanted to examine whether a boost in MDD performance can be by two tasks. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed it to the MDD system along with the text input. The MDD system then predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss as well as an MDD loss. Comparing the proposed systems against an identical system but without speech reconstruction and another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the and speaker verification simultaneously can lead to improve proposed system achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones have mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用语音重建改进错误发音检测
同时训练相关的机器学习任务可以提高这两项任务的性能。文本到语音(TTS)和错误发音检测与诊断(MDD)都是利用语音信息进行操作的,我们希望研究 MDD 性能的提升是否可以通过两个任务来实现。我们提出了一种从 MDD 系统产生的语音中重建语音并计算语音重建损失的网络。我们假设,如果重建的语音听起来更接近原始语音,那么 MDD 系统产生的电话将更接近地面实况。为了验证这一点,我们首先从预先训练好的模型中提取 wav2vec 特征,并将其与文本输入一起输入 MDD 系统。然后,MDD 系统预测目标注释电话,再根据预测电话合成语音。因此,该系统是通过计算语音重建损失和 MDD 损失来进行训练的。我们将提出的系统与不带语音重构的相同系统和另一个最先进的基线系统进行了比较,发现提出的系统获得了更高的错误发音检测和诊断(MDD)分数。在一组训练过程中未见的句子中,同时进行语音重建和说话人验证可使拟议系统获得更高的 MDD 分数,这表明从预测电话重建语音信号有助于系统泛化到新的测试句子。我们还测试了当输入的电话发音错误时,系统能否生成重音语音。我们的感知实验结果表明,与没有任何发音错误的电话相比,由含有错误发音的电话生成的语音听起来重音更重,可懂度更低,这表明系统能够识别电话中的差异,并生成所需的语音信号。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
IEEE/ACM Transactions on Audio, Speech, and Language Processing
IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
11.30
自引率
11.10%
发文量
217
期刊介绍: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.
期刊最新文献
CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1