
Eurasip Journal on Audio Speech and Music Processing: Latest Publications

Compression of room impulse responses for compact storage and fast low-latency convolution
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-09-13 | DOI: 10.1186/s13636-024-00363-5
Martin Jälmby, Filip Elvander, Toon van Waterschoot
Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a very compelling alternative to state-of-the-art Opus compression, as it performs as well as or better than Opus on all but one of the considered measures, with the added benefit of being amenable to fast time-domain convolution.
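The core idea, storing an RIR as a low-rank factorization of its matricized form and convolving directly with the stored factors, can be sketched as follows. This is a minimal illustration of generic low-rank convolution; the reshape dimensions, rank, and toy RIR are assumptions, not the authors' algorithm or values.

import numpy as np
from scipy.signal import fftconvolve

# Sketch: approximate an RIR by reshaping it into a matrix and truncating its SVD,
# then convolve using only the stored factors (no decompression of the full RIR).
rng = np.random.default_rng(0)
rir = rng.standard_normal(4096) * np.exp(-np.arange(4096) / 800)  # toy decaying RIR

H = rir.reshape(64, 64)                      # matricize: 64 frames of 64 samples
U, s, Vt = np.linalg.svd(H, full_matrices=False)
r = 8                                        # retained rank (illustrative choice)
Ur, sVt = U[:, :r], s[:r, None] * Vt[:r]     # the only factors kept in storage

def lowrank_convolve(x, Ur, sVt):
    # Convolve x with the rank-r RIR: one short convolution per rank component,
    # then scatter each result to the frame offsets with the U weights.
    frame = sVt.shape[1]
    y = np.zeros(len(x) + Ur.shape[0] * frame - 1)
    for k in range(Ur.shape[1]):
        xk = fftconvolve(x, sVt[k])
        for i, w in enumerate(Ur[:, k]):
            y[i * frame : i * frame + len(xk)] += w * xk
    return y

y = lowrank_convolve(rng.standard_normal(8000), Ur, sVt)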
Citations: 0
Guest editorial: AI for computational audition—sound and music processing
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-09-11 | DOI: 10.1186/s13636-024-00353-7
Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu
Nowadays, the application of artificial intelligence (AI) algorithms and techniques is ubiquitous and transversal. Fields that take advantage of AI advances include sound and music processing. The advances in interdisciplinary research potentially yield new insights that may further advance the AI methods in this field. This special issue aims to report recent progress and spur new research lines in AI-driven sound and music processing, especially within interdisciplinary research scenarios.
Citations: 0
Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-09-10 | DOI: 10.1186/s13636-024-00362-6
Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari
A kernel interpolation method for the acoustic transfer function (ATF) between regions that is constrained by the physics of sound while being adaptive to the data is proposed. Most ATF interpolation methods aim to model the ATF for a fixed source, using techniques that fit the estimate to the measurements without taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that treats the directed and residual fields as two separate kernel functions. The directed-field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference, both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.
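Stripped of the directed/residual decomposition, kernel interpolation under a Gaussian prior reduces to kernel ridge regression over concatenated source-receiver coordinates. The sketch below shows that baseline pattern; the isotropic Gaussian kernel and all names are illustrative stand-ins for the paper's physics-constrained kernel.

import numpy as np

# Sketch: kernel interpolation of an ATF with a Gaussian prior (kernel ridge
# regression). The generic Gaussian kernel is a stand-in, not the paper's kernel.
def gaussian_kernel(X1, X2, length=0.3):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def interpolate_atf(X_obs, h_obs, X_query, noise=1e-3):
    # Posterior mean of the ATF at query positions, given noisy observations.
    K = gaussian_kernel(X_obs, X_obs)
    k_q = gaussian_kernel(X_query, X_obs)
    alpha = np.linalg.solve(K + noise * np.eye(len(X_obs)), h_obs)
    return k_q @ alpha

# Toy usage: 6-D points (3-D source position concatenated with 3-D receiver position).
rng = np.random.default_rng(1)
X_obs = rng.uniform(0, 1, (50, 6))
h_obs = np.sin(X_obs.sum(1) * 4.0)           # stand-in for measured ATF values
X_query = rng.uniform(0, 1, (5, 6))
print(interpolate_atf(X_obs, h_obs, X_query))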
Citations: 0
Physics-informed neural network for volumetric sound field reconstruction of speech signals
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-09-09 | DOI: 10.1186/s13636-024-00366-2
Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande
Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.
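The physics term in a PINN of this kind penalizes the residual of the wave equation at sampled space-time collocation points. A minimal autograd sketch of that loss follows; the network architecture, sampling, and weighting are assumptions rather than the authors' implementation, and the full training loss would add a data-fit term on the microphone observations.

import torch

# Sketch: wave-equation residual loss for a PINN modeling pressure p(x, y, z, t).
net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
c = 343.0  # speed of sound (m/s)

def wave_residual_loss(coords):
    # coords: (N, 4) collocation points (x, y, z, t)
    coords = coords.requires_grad_(True)
    p = net(coords)
    grads = torch.autograd.grad(p.sum(), coords, create_graph=True)[0]
    lap = 0.0
    for i in range(3):  # accumulate second spatial derivatives into the Laplacian
        g2 = torch.autograd.grad(grads[:, i].sum(), coords, create_graph=True)[0]
        lap = lap + g2[:, i]
    p_tt = torch.autograd.grad(grads[:, 3].sum(), coords, create_graph=True)[0][:, 3]
    return torch.mean((p_tt - c ** 2 * lap) ** 2)  # mean squared PDE residual

physics_loss = wave_residual_loss(torch.rand(256, 4))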
Citations: 0
Optimal sensor placement for the spatial reconstruction of sound fields
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-08-17 | DOI: 10.1186/s13636-024-00364-4
Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande
The estimation of sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics, and virtual reality. Sound fields can be estimated from a number of measurements distributed over space, yet this remains a challenging problem due to the large experimental effort required. In this work, we investigate sensor distributions that are optimal for estimating sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction of approximately 50% in the number of measurements when the sensor positions are optimized for reconstructing the sound field, compared with random distributions. The results indicate that optimizing the sensor positions is also valuable when the vector of parameters is sparse, especially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.
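Minimizing a Cramér-Rao-type bound over sensor positions is commonly approached greedily: from a grid of candidate positions, repeatedly pick the one that most reduces the bound. A rough sketch under a linear-Gaussian observation model follows; the candidate set, prior, and greedy strategy are illustrative assumptions, not necessarily the paper's procedure.

import numpy as np

# Sketch: greedy sensor selection for a linear observation model h = A theta + noise,
# minimizing the trace of the Bayesian posterior covariance (a CRB-style criterion).
rng = np.random.default_rng(2)
n_candidates, n_params, noise_var = 200, 20, 1e-2
A = rng.standard_normal((n_candidates, n_params))  # rows: candidate sensor positions

def posterior_trace(rows):
    # trace of (prior^-1 + (1/sigma^2) A_S^T A_S)^-1 for the selected rows S,
    # with an identity prior for simplicity.
    As = A[rows]
    info = np.eye(n_params) + As.T @ As / noise_var
    return np.trace(np.linalg.inv(info))

selected = []
for _ in range(10):                                # place 10 sensors one by one
    best = min((i for i in range(n_candidates) if i not in selected),
               key=lambda i: posterior_trace(selected + [i]))
    selected.append(best)
print("chosen candidate indices:", selected)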
Citations: 0
Recognition of target domain Japanese speech using language model replacement
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-07-20 | DOI: 10.1186/s13636-024-00360-8
Daiki Mori, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka
End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for specialized target domains which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the "implicit" language information contained within the ASR model: a source-domain language model, trained on a transcription of the ASR's training data, is subtracted in the logarithmic domain. We then integrate a target-domain language model through addition in the logarithmic domain. This subtraction and addition, which together replace the language model, are based on Bayes' theorem. In our experiments, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domains, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
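At decoding time, language model replacement of this kind amounts to rescoring each hypothesis in the log domain: subtract the source-domain LM score and add the target-domain LM score, each with its own weight. A schematic sketch, with the scoring functions and weights as placeholders rather than the paper's system:

# Sketch: log-domain language model replacement during hypothesis rescoring.
# score_e2e, score_src_lm, and score_tgt_lm are assumed scorers returning
# log-probabilities; lambda_src and lambda_tgt are tunable interpolation weights.
def rescore(hypothesis, score_e2e, score_src_lm, score_tgt_lm,
            lambda_src=0.3, lambda_tgt=0.3):
    return (score_e2e(hypothesis)
            - lambda_src * score_src_lm(hypothesis)   # remove implicit source LM
            + lambda_tgt * score_tgt_lm(hypothesis))  # inject target-domain LM

def best_hypothesis(nbest, **scorers):
    # Pick the highest-scoring hypothesis from an n-best list after rescoring.
    return max(nbest, key=lambda h: rescore(h, **scorers))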
Citations: 0
The whole is greater than the sum of its parts: improving music source separation by bridging networks
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-07-19 | DOI: 10.1186/s13636-024-00354-6
Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) at almost no increase in calculation cost. It consists of three components: (i) a multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) a combination loss (CL). MDL takes advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as to each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme (respectively called X-UMX, X-D3Net, and X-Conv-TasNet) by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .
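A multi-domain loss of this style combines a time-domain error with a spectral error on the same estimate, and a combination loss additionally applies the criterion to sums of estimated sources. A rough sketch follows; the loss weights, STFT settings, and the pairwise combinations are illustrative assumptions, not the paper's exact configuration.

import torch

# Sketch: multi-domain loss (time-domain + magnitude-STFT) and a pairwise
# combination loss over the estimated sources.
def multi_domain_loss(est, ref, n_fft=2048, w_time=1.0, w_freq=1.0):
    t = torch.mean(torch.abs(est - ref))
    E = torch.stft(est, n_fft, return_complex=True).abs()
    R = torch.stft(ref, n_fft, return_complex=True).abs()
    return w_time * t + w_freq * torch.mean(torch.abs(E - R))

def combination_loss(ests, refs):
    # Apply the multi-domain loss to each source and to every pair-sum of sources.
    loss = sum(multi_domain_loss(e, r) for e, r in zip(ests, refs))
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            loss = loss + multi_domain_loss(ests[i] + ests[j], refs[i] + refs[j])
    return loss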
Citations: 0
Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-07-17 | DOI: 10.1186/s13636-024-00361-7
Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji
The disparities in phonetics and corpora across the three major dialects of Tibetan make it difficult for a single-task model built for one dialect to accommodate the other dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model's performance on Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.
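Meta-learning over heterogeneous source tasks is often implemented with a MAML-style inner/outer loop over task episodes. The sketch below shows that generic first-order pattern; the model interface, task sampler, and step sizes are assumptions, and the paper's specific scheme may differ.

import copy
import torch

# Sketch: first-order MAML-style meta-learning over diverse source tasks
# (e.g., dialect ID and speaker ID episodes). model exposes an assumed
# .loss(inputs, targets) method; sample_task() yields (support, query) batches.
def meta_step(model, sample_task, inner_lr=1e-2, outer_lr=1e-3, n_tasks=4):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for _ in range(n_tasks):
        task_model = copy.deepcopy(model)
        opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        support, query = sample_task()
        loss = task_model.loss(*support)        # inner adaptation on support set
        opt.zero_grad(); loss.backward(); opt.step()
        opt.zero_grad()
        q_loss = task_model.loss(*query)        # evaluate the adapted model
        q_loss.backward()
        for g, p in zip(meta_grads, task_model.parameters()):
            if p.grad is not None:
                g += p.grad / n_tasks           # first-order meta-gradient
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g                   # update the shared initialization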
Citations: 0
A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-07-17 | DOI: 10.1186/s13636-024-00358-2
Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet
This paper introduces a simplified and controllable model of mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic, perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed that generate sounds corresponding to impacts on thin structures and to the perturbation of an object's vibration when it collides with another object. The sound examples presented in the paper, which are available for listening on the accompanying site, illustrate that simple control of the input parameters allows the generation of sounds that evoke their sources coherently, and that the addition of random processes yields a significant improvement in the realism of the generated sounds.
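The building block described here is a bank of resonant (modal) filters whose outputs are fed back through a coupling matrix, so energy leaks between modes. A minimal sketch of that structure follows; the filter form, coupling values, and saturating nonlinearity are illustrative assumptions, not the paper's model.

import numpy as np

# Sketch: a bank of two-pole resonators coupled through a matrix C, so each
# mode's input includes a nonlinear function of the other modes' outputs.
fs = 44100
freqs = np.array([220.0, 415.0, 630.0])          # modal frequencies (Hz)
decays = np.array([3.0, 4.0, 6.0])               # decay rates (1/s)
r = np.exp(-decays / fs)                         # pole radii per mode
a1 = -2 * r * np.cos(2 * np.pi * freqs / fs)     # two-pole feedback coefficients
a2 = r ** 2
C = 0.002 * (np.ones((3, 3)) - np.eye(3))        # small symmetric mode coupling

x = np.zeros(fs); x[0] = 1.0                     # impulse excitation, 1 s output
y = np.zeros((3, fs))
y1 = np.zeros(3); y2 = np.zeros(3)               # one- and two-sample delays
for n in range(fs):
    fb = C @ np.tanh(y1)                         # coupled, saturating feedback
    out = x[n] + fb - a1 * y1 - a2 * y2          # update all three resonators
    y[:, n] = out
    y2, y1 = y1, out
mix = y.sum(axis=0)                              # summed output of the bank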
Citations: 0
Adaptive multi-task learning for speech to text translation
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2024-07-13 | DOI: 10.1186/s13636-024-00359-1
Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu
End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task, particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances the various tasks is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across modalities, we propose to apply optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and to learn shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.
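Adaptive weighting from proportional losses can be realized by normalizing each task's current loss by a reference value, such as its first-seen loss, so lagging tasks receive proportionally more weight. A schematic sketch of that heuristic follows; the normalization rule is a common pattern assumed here, not necessarily the paper's exact formula.

# Sketch: adaptive multi-task weights from proportional losses. Each weight is
# the task's loss relative to its first-seen value, renormalized to sum to 1.
class AdaptiveWeights:
    def __init__(self, n_tasks):
        self.init_losses = [None] * n_tasks

    def __call__(self, losses):
        ratios = []
        for i, l in enumerate(losses):
            if self.init_losses[i] is None:
                self.init_losses[i] = float(l)    # remember first-seen loss
            ratios.append(float(l) / self.init_losses[i])
        total = sum(ratios)
        return [r / total for r in ratios]

weights = AdaptiveWeights(2)
st_loss, mt_loss = 2.0, 1.0                        # speech-translation and MT losses
w = weights([st_loss, mt_loss])                    # -> [0.5, 0.5] on the first call
combined = w[0] * st_loss + w[1] * mt_loss         # weighted total training loss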
Citations: 0