IBACodec: End-to-end speech codec with intra-inter broad attention

Information Processing & Management · Impact Factor 6.9 · JCR Q1, Computer Science, Information Systems · Publication date: 2025-05-01 (Epub 2024-12-03) · DOI: 10.1016/j.ipm.2024.103979
Xiaonan Yang, Jinjie Zhou, Deshan Yang, Yunwei Wan, Limin Pan, Senlin Luo
Volume 62, Issue 3, Article 103979. Full text: https://www.sciencedirect.com/science/article/pii/S0306457324003388
Citations: 0

Abstract

Speech compression attempts to yield a compact bitstream that can represent a speech signal with minimal distortion by eliminating redundant information, which is increasingly challenging as the bitrate decreases. However, existing neural speech codecs do not fully exploit the information from previous speech sequences, and learning encoded features blindly leads to the ineffective removal of redundant information, resulting in suboptimal reconstruction quality. In this work, we propose an end-to-end speech codec with intra-inter broad attention, named IBACodec, that efficiently compresses speech across different types of datasets, including LibriTTS, LJSpeech, and more. By designing an intra-inter broad transformer that integrates multi-head attention networks and LSTM, our model captures broad attention with direct context awareness between the intra- and inter-frames of speech. Furthermore, we present a dual-branch conformer for channel-wise modeling to effectively eliminate redundant information. In subjective evaluations using speech at a 24 kHz sampling rate, IBACodec at 6.3 kbps is comparable to SoundStream at 9 kbps and better than Opus at 9 kbps, with about 30 % fewer bits. Objective experimental results show that IBACodec outperforms state-of-the-art codecs across a wide range of bitrates, with an average ViSQOL, LLR, and CEP improvement of up to 4.97 %, 38.94 %, and 25.39 %, respectively.
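The abstract's central architectural idea, pairing multi-head self-attention (capturing dependencies within a span of frames) with an LSTM (carrying context across frames), can be sketched as follows. This is a minimal illustrative block under stated assumptions, not the paper's implementation: the `IntraInterBroadBlock` name, the residual fusion scheme, and all layer sizes are assumptions introduced here.

```python
import torch
import torch.nn as nn

class IntraInterBroadBlock(nn.Module):
    """Hypothetical sketch of intra-inter broad attention:
    multi-head self-attention models relations among frames in the
    current window (intra), while an LSTM threads recurrent context
    across the frame sequence (inter). Wiring and sizes are assumed."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)   # intra-frame attention output
        h, _ = self.lstm(x)         # inter-frame recurrent context
        return self.norm(x + a + h) # residual fusion of both paths

x = torch.randn(2, 100, 64)         # 2 utterances, 100 frames, 64-dim features
y = IntraInterBroadBlock()(x)       # shape-preserving: (2, 100, 64)
```

The additive residual fusion here is just one plausible way to merge the two paths; the paper may combine them differently (e.g., by concatenation or gating).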
Source Journal

Information Processing & Management (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 17.00
Self-citation rate: 11.60%
Articles per year: 276
Review time: 39 days
Journal Description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.
Latest Articles in This Journal
- Fuzzy neighborhood rough set-based attribute reduction over temporal information systems with application to clinical efficacy evaluation
- Empowering open-domain LLMs for legal document correction via legal knowledge integration and decoding constraints
- CTJANet: A class-task joint-aware network for enhanced few-shot image classification
- ALC-DRKG: an active learning-based framework for dynamic knowledge graph construction for drug repositioning
- Measuring stance dynamics in political debate using temporal graph neural networks