Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual
arXiv - EE - Signal Processing, published 2024-09-14. DOI: arxiv-2409.09357 (https://doi.org/arxiv-2409.09357)

Abstract

Speech restoration aims to restore full-band speech with high quality and intelligibility under a diverse set of distortions. MaskSR is a recently proposed generative model for this task. Like other models of its kind, MaskSR attains high quality but, as we show, its intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low-level spectral details of the target speech. We show that, with the same model capacity and inference time as MaskSR, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves a word error rate competitive with other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.
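
For readers unfamiliar with the intelligibility metric mentioned above, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and a recognizer's transcript, divided by the number of reference words. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the abstract does not state which speech recognizer or text normalization the authors use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: lowercasing and whitespace splitting are assumed
    normalization choices, not taken from the paper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one reference word is dropped by the recognizer.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In a restoration setting, the hypothesis would typically be the transcript an automatic speech recognizer produces from the restored audio, so a lower value suggests more intelligible output. Note that the ratio can exceed 1 when the hypothesis contains many inserted words.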