The whole is greater than the sum of its parts: improving music source separation by bridging networks

IF 1.7 3区 计算机科学 Q2 ACOUSTICS Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-07-19 DOI:10.1186/s13636-024-00354-6
Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji
{"title":"The whole is greater than the sum of its parts: improving music source separation by bridging networks","authors":"Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji","doi":"10.1186/s13636-024-00354-6","DOIUrl":null,"url":null,"abstract":"This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed that the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00354-6","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed that the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
整体大于部分之和:通过网络桥接改善音乐源分离效果
本文提出了一种交叉方案(X-scheme),可在几乎不增加计算成本的情况下提高基于深度神经网络(DNN)的音乐源分离(MSS)性能。它由三个部分组成:(i) 多域损耗 (MDL),(ii) 桥接操作(耦合单个乐器网络)和 (iii) 组合损耗 (CL)。MDL 能够利用音频信号的频域和时域表示。我们修改了目标网络,即基于 DNN 的原始 MSS 的网络结构,为每个输出仪器添加了桥接路径,以共享它们的信息。然后,将 MDL 应用于输出源的组合以及每个独立源;因此,我们称之为 CL。MDL 和 CL 可以轻松应用于许多基于 DNN 的分离方法,因为它们只是损失函数,只在训练过程中使用,并不影响推理步骤。桥接操作不会增加网络中可学习参数的数量。实验结果表明,使用我们的 X 架构扩展的开放式混音网络(UMX)、密集连接的扩张型 DenseNet(D3Net)和卷积时域音频分离网络(Conv-TasNet)(分别称为 X-UMX、X-D3Net 和 X-Conv-TasNet)与它们的原始版本进行比较,结果是正确的。我们还验证了 X 架构在大规模数据机制中的有效性,显示了它在数据规模方面的通用性。X-UMX Large(X-UMXL)是在大规模内部数据上训练出来的,并在我们的实验中得到了应用,其最新版本可在 https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX 网站上查阅。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Eurasip Journal on Audio Speech and Music Processing
Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
4.10
自引率
4.20%
发文量
0
审稿时长
12 months
期刊介绍: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.
期刊最新文献
Compression of room impulse responses for compact storage and fast low-latency convolution Guest editorial: AI for computational audition—sound and music processing Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach Physics-informed neural network for volumetric sound field reconstruction of speech signals Optimal sensor placement for the spatial reconstruction of sound fields
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1