SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

IEEE Journal of Selected Topics in Signal Processing (IF 13.7; Zone 1, Engineering & Technology; JCR Q1, Engineering, Electrical & Electronic). Pub Date: 2024-11-26. DOI: 10.1109/JSTSP.2024.3506286

Haohe Liu;Xuenan Xu;Yi Yuan;Mengyue Wu;Wenwu Wang;Mark D. Plumbley

IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1448-1461. https://ieeexplore.ieee.org/document/10768970/


Abstract

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates.
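As a rough illustration of two ideas in the abstract, the sketch below shows (a) how continuous encoder features can be turned into discrete tokens by assigning each vector to its nearest centroid, which is the lookup step that follows k-means training, and (b) how a token rate relates to a bitrate. It is not the authors' implementation. The function names, the toy codebook, and the codebook size of 4096 are hypothetical, and pairing 0.31 kbps with 25 tokens per second and 1.40 kbps with 100 tokens per second is an assumption; the abstract only gives the two ranges.

```python
import math


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest centroid.

    This is only the assignment step; training the centroids with
    k-means (and extracting features with AudioMAE) is out of scope.
    """
    tokens = []
    for vec in features:
        # Squared Euclidean distance from this vector to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def bitrate_kbps(tokens_per_second, codebook_size):
    """Bitrate when every token is an index into one codebook."""
    return tokens_per_second * math.log2(codebook_size) / 1000.0


def implied_bits_per_token(kbps, tokens_per_second):
    """Average bits per token implied by a bitrate and a token rate."""
    return kbps * 1000.0 / tokens_per_second


# Hypothetical toy codebook: centroid k is the 4-dim vector [k, k, k, k].
codebook = [[float(k)] * 4 for k in range(8)]
# Three frames lying close to centroids 3, 0 and 5.
features = [[c + 0.1 for c in codebook[i]] for i in (3, 0, 5)]
tokens = quantize(features, codebook)
print(tokens)

# Hypothetical codebook size of 4096 entries (12 bits per token).
print(bitrate_kbps(25, 4096))

# Assumed pairing of the abstract's extreme bitrates and token rates.
print(round(implied_bits_per_token(0.31, 25), 2))
print(round(implied_bits_per_token(1.40, 100), 2))
```

Under the assumed pairing, the stated figures work out to roughly 12 to 14 bits per token on average, which is the order of magnitude expected from codebooks with a few thousand to a few tens of thousands of entries.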


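Source journal

IEEE Journal of Selected Topics in Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 19.00
Self-citation rate: 1.30%
Articles published: 135
Review time: 3 months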

Journal description: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.
