SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

IF 8.7 · Zone 1, Engineering & Technology · Q1, Engineering, Electrical & Electronic · IEEE Journal of Selected Topics in Signal Processing · Pub Date: 2024-11-26 · DOI: 10.1109/JSTSP.2024.3506286
Haohe Liu;Xuenan Xu;Yi Yuan;Mengyue Wu;Wenwu Wang;Mark D. Plumbley
IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1448-1461. https://ieeexplore.ieee.org/document/10768970/

Abstract

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech, and they lack the semantic clues required for efficient language modelling. To address these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bitrates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates.
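The relationship between the token rates and bitrates quoted above can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' implementation: it assumes the 0.31 kbps and 1.40 kbps bounds correspond to the 25 and 100 tokens-per-second variants respectively (the abstract does not state this pairing), and the function names, the toy centroids, and the 2**14 codebook size are hypothetical. It also shows, in miniature, what discretizing a feature vector against k-means centroids means: each vector is replaced by the index of its nearest centroid.

```python
import math


def implied_bits_per_token(kbps: float, tokens_per_second: float) -> float:
    """Average number of bits each token carries at a given bitrate and token rate."""
    return kbps * 1000.0 / tokens_per_second


def bitrate_kbps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a token stream whose tokens index a codebook of `codebook_size` entries."""
    return tokens_per_second * math.log2(codebook_size) / 1000.0


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance).

    This is the assignment step of k-means used as a quantizer: the index is the token.
    """
    def sq_dist(k):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[k]))
    return min(range(len(centroids)), key=sq_dist)


if __name__ == "__main__":
    # Assumed pairing of the abstract's bounds with the lowest and highest token rates.
    print(round(implied_bits_per_token(0.31, 25), 1))   # average bits per token, low end
    print(round(implied_bits_per_token(1.40, 100), 1))  # average bits per token, high end

    # Hypothetical single codebook of 2**14 entries at 100 tokens per second.
    print(bitrate_kbps(100, 2 ** 14))

    # Toy 2-D centroids (hypothetical); real features would be high-dimensional.
    toy_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    print(nearest_centroid([0.9, 1.2], toy_centroids))
```

Under the assumed pairing, the figures in the abstract work out to an average of roughly 12 to 14 bits per token. How those bits are divided between the semantic and acoustic encoders is not stated in the abstract.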
Source journal
IEEE Journal of Selected Topics in Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 19.00
Self-citation rate: 1.30%
Articles published: 135
Review time: 3 months
About the journal: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.