Improved Alias-and-Separate Speech Coding Framework With Minimal Algorithmic Delay

IF 13.7 · Zone 1, Engineering & Technology · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · IEEE Journal of Selected Topics in Signal Processing · Pub Date: 2024-11-18 · DOI: 10.1109/JSTSP.2024.3501681

Eunkyun Lee;Seungkwon Beack;Jong Won Shin

IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1414-1426. https://ieeexplore.ieee.org/document/10756718/

Citations: 0

Abstract

The Alias-and-Separate (AaS) speech coding framework has shown that wideband (WB) speech can be encoded with a narrowband (NB) speech codec and reconstructed using speech separation. WB speech is first decimated, which incurs aliasing, and then coded, transmitted, and decoded with an NB codec. The decoded signal is then separated by a speech separation module into a lower band and a spectrally flipped high band, which are expanded, lowpass/highpass filtered, and added together to reconstruct the WB speech. The original AaS system, however, incurs algorithmic delay originating from the overlap-add operation over consecutive segments. This delay can be reduced by omitting the overlap-add procedure, but the quality of the reconstructed speech then degrades because of artifacts at the segment boundaries. In this work, we propose an improved AaS framework with minimal algorithmic delay. The decoded signal is first expanded by inserting zeros between samples before being processed by the source separation module. Because the expanded signal can be viewed as a sum of frequency-shifted versions of the original signal, the decoded-and-expanded signal is separated into these frequency-shifted signals, which are then multiplied by complex exponentials and summed to reconstruct the original signal. With a carefully designed transposed convolution operation in the separation module, the proposed system requires minimal algorithmic delay while preventing discontinuities at the segment boundaries. Additionally, we propose employing a generative vocoder to further improve the perceived quality, together with a modified multi-resolution short-time Fourier transform (MR-STFT) loss. Experimental results on WB speech coding with an NB codec demonstrated that the proposed system outperformed the original AaS system and an existing WB speech codec in a subjective listening test. An experiment on fullband speech coding with a WB codec also showed that the proposed method can be applied when the decimation factor is not 2.
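The zero-insertion identity that the abstract relies on can be checked numerically. The sketch below is only an illustration under stated assumptions, not the authors' system: the function names are hypothetical, the codec is omitted, and the frequency-shifted components are computed directly from the original signal (an oracle), whereas in the paper a separation network estimates them from the decoded-and-expanded signal. For a decimation factor M, decimating and then inserting zeros gives e[n] = (1/M) Σ_k x[n]·exp(j2πkn/M), so undoing each shift with the conjugate exponential and summing recovers x[n]. The 1/M scaling is a choice made for this sketch.

```python
import numpy as np


def decimate_with_aliasing(x, M):
    """Keep every M-th sample with no anti-alias filter (aliasing is intended)."""
    return x[::M]


def expand_with_zeros(y, M):
    """Insert M-1 zeros between consecutive samples."""
    e = np.zeros(len(y) * M, dtype=y.dtype)
    e[::M] = y
    return e


def frequency_shifted_components(x, M):
    """Oracle components x[n] * exp(j*2*pi*k*n/M) / M for k = 0..M-1.

    Assumption: a stand-in for the output of the separation module.
    """
    n = np.arange(len(x))
    return [x * np.exp(2j * np.pi * k * n / M) / M for k in range(M)]


def reconstruct(components, M):
    """Multiply each component by the conjugate exponential and sum."""
    n = np.arange(len(components[0]))
    total = sum(c * np.exp(-2j * np.pi * k * n / M)
                for k, c in enumerate(components))
    return total.real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(48)  # length divisible by both 2 and 3
    for M in (2, 3):
        expanded = expand_with_zeros(decimate_with_aliasing(x, M), M)
        comps = frequency_shifted_components(x, M)
        # The expanded signal equals the sum of the shifted components.
        print(M, np.allclose(sum(comps), expanded))
        # Undoing the shifts and summing recovers the original signal.
        print(M, np.allclose(reconstruct(comps, M), x))
```

The same code runs for M = 3, which mirrors the abstract's remark that the approach is not tied to a decimation factor of 2.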

Source journal

IEEE Journal of Selected Topics in Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 19.00

Self-citation rate: 1.30%

Articles published: 135

Review time: 3 months

About the journal: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.
