Speech Inpainting Based on Multi-Layer Long Short-Term Memory Networks

Future Internet, published 2024-02-17. DOI: 10.3390/fi16020063
Haohan Shi, Xiyu Shi, Safak Dogan
{"title":"Speech Inpainting Based on Multi-Layer Long Short-Term Memory Networks","authors":"Haohan Shi, Xiyu Shi, Safak Dogan","doi":"10.3390/fi16020063","DOIUrl":null,"url":null,"abstract":"Audio inpainting plays an important role in addressing incomplete, damaged, or missing audio signals, contributing to improved quality of service and overall user experience in multimedia communications over the Internet and mobile networks. This paper presents an innovative solution for speech inpainting using Long Short-Term Memory (LSTM) networks, i.e., a restoring task where the missing parts of speech signals are recovered from the previous information in the time domain. The lost or corrupted speech signals are also referred to as gaps. We regard the speech inpainting task as a time-series prediction problem in this research work. To address this problem, we designed multi-layer LSTM networks and trained them on different speech datasets. Our study aims to investigate the inpainting performance of the proposed models on different datasets and with varying LSTM layers and explore the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The inpainted speech quality is evaluated through the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram. Our proposed multi-layer LSTM models are able to restore up to 1 s of gaps with high perceptual audio quality using the features captured from the time domain only. Specifically, for gap lengths under 500 ms, the MOS can reach up to 3~4, and for gap lengths ranging between 500 ms and 1 s, the MOS can reach up to 2~3. In the time domain, the proposed models can proficiently restore the envelope and trend of lost speech signals. In the frequency domain, the proposed models can restore spectrogram blocks with higher similarity to the original signals at frequencies less than 2.0 kHz and comparatively lower similarity at frequencies in the range of 2.0 kHz~8.0 kHz.","PeriodicalId":509567,"journal":{"name":"Future Internet","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Internet","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/fi16020063","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Audio inpainting plays an important role in addressing incomplete, damaged, or missing audio signals, contributing to improved quality of service and overall user experience in multimedia communications over the Internet and mobile networks. This paper presents an innovative solution for speech inpainting using Long Short-Term Memory (LSTM) networks, i.e., a restoration task in which the missing parts of speech signals, also referred to as gaps, are recovered from the preceding information in the time domain. In this work, we treat speech inpainting as a time-series prediction problem. To address it, we designed multi-layer LSTM networks and trained them on different speech datasets. Our study investigates the inpainting performance of the proposed models across datasets and with varying numbers of LSTM layers, and explores the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The quality of the inpainted speech is evaluated through the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram. The proposed multi-layer LSTM models can restore gaps of up to 1 s with high perceptual audio quality using features captured from the time domain only. Specifically, for gap lengths under 500 ms the MOS reaches 3 to 4, and for gap lengths between 500 ms and 1 s the MOS reaches 2 to 3. In the time domain, the proposed models faithfully restore the envelope and trend of the lost speech signals. In the frequency domain, the restored spectrogram blocks show higher similarity to the original signals at frequencies below 2.0 kHz and comparatively lower similarity between 2.0 kHz and 8.0 kHz.
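The abstract frames inpainting as time-series prediction: a multi-layer LSTM consumes the time-domain samples preceding a gap and predicts the missing samples. It does not give architecture details, so the following PyTorch sketch is purely illustrative: the frame size, hidden size, layer count, and the autoregressive frame-by-frame decoding in SpeechInpaintingLSTM.inpaint are all assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class SpeechInpaintingLSTM(nn.Module):
    """Illustrative multi-layer LSTM that predicts gap samples from
    the preceding time-domain context (hyperparameters are assumed)."""

    def __init__(self, frame_size: int = 160, hidden_size: int = 256,
                 num_layers: int = 3):
        super().__init__()
        # Multi-layer LSTM over frames of raw time-domain samples.
        self.lstm = nn.LSTM(input_size=frame_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        # Map each hidden state to the next predicted frame of samples.
        self.head = nn.Linear(hidden_size, frame_size)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, frame_size), samples in [-1, 1]
        out, state = self.lstm(frames)
        return self.head(out), state

    @torch.no_grad()
    def inpaint(self, context: torch.Tensor, gap_frames: int) -> torch.Tensor:
        """Autoregressively predict `gap_frames` frames after `context`."""
        pred, state = self.forward(context)
        frame = pred[:, -1:, :]            # last prediction seeds the gap
        filled = [frame]
        for _ in range(gap_frames - 1):
            out, state = self.lstm(frame, state)  # feed prediction back in
            frame = self.head(out)
            filled.append(frame)
        return torch.cat(filled, dim=1)    # (batch, gap_frames, frame_size)


# Example: fill a 500 ms gap at 16 kHz with 10 ms frames (160 samples each),
# i.e. 50 frames, given 1 s of preceding context (random stand-in data here).
model = SpeechInpaintingLSTM()
context = torch.randn(1, 100, 160).clamp(-1, 1)
gap = model.inpaint(context, gap_frames=50)
print(gap.shape)  # torch.Size([1, 50, 160])

A real implementation would train such a network with a regression loss on next-frame prediction and then evaluate the filled gaps with MOS listening tests and spectrogram comparisons, as the abstract describes.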