Pub Date : 2024-09-13 DOI: 10.1186/s13636-024-00363-5
Martin Jälmby, Filip Elvander, Toon van Waterschoot
Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a very compelling alternative to state-of-the-art Opus compression, as it performs as well as or better than Opus on all but one of the considered measures, with the added benefit of being amenable to fast time-domain convolution.
{"title":"Compression of room impulse responses for compact storage and fast low-latency convolution","authors":"Martin Jälmby, Filip Elvander, Toon van Waterschoot","doi":"10.1186/s13636-024-00363-5","DOIUrl":"https://doi.org/10.1186/s13636-024-00363-5","url":null,"abstract":"Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a very compelling option to the state-of-the-art Opus compression, as it performs as well or better than on all but one considered measure, with the added benefit of being amenable to fast time-domain convolution.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"16 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-11 DOI: 10.1186/s13636-024-00353-7
Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu
Nowadays, the application of artificial intelligence (AI) algorithms and techniques is ubiquitous and transversal. Fields that take advantage of AI advances include sound and music processing. The advances in interdisciplinary research potentially yield new insights that may further advance the AI methods in this field. This special issue aims to report recent progress and spur new research lines in AI-driven sound and music processing, especially within interdisciplinary research scenarios.
{"title":"Guest editorial: AI for computational audition—sound and music processing","authors":"Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu","doi":"10.1186/s13636-024-00353-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00353-7","url":null,"abstract":"Nowadays, the application of artificial intelligence (AI) algorithms and techniques is ubiquitous and transversal. Fields that take advantage of AI advances include sound and music processing. The advances in interdisciplinary research potentially yield new insights that may further advance the AI methods in this field. This special issue aims to report recent progress and spur new research lines in AI-driven sound and music processing, especially within interdisciplinary research scenarios.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"79 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-10 DOI: 10.1186/s13636-024-00362-6
Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari
A kernel interpolation method for the acoustic transfer function (ATF) between regions, constrained by the physics of sound while being adaptive to the data, is proposed. Most ATF interpolation methods aim to model the ATF for a fixed source by using techniques that fit the estimation to the measurements while not taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that represents the directed and residual fields with two separate kernel functions. The directed field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference, both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.
{"title":"Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach","authors":"Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari","doi":"10.1186/s13636-024-00362-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00362-6","url":null,"abstract":"A kernel interpolation method for the acoustic transfer function (ATF) between regions constrained by the physics of sound while being adaptive to the data is proposed. Most ATF interpolation methods aim to model the ATF for fixed source by using techniques that fit the estimation to the measurements while not taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that considers both directed and residual fields as two separate kernel functions. The directed field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"60 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-09 DOI: 10.1186/s13636-024-00366-2
Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande
Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.
{"title":"Physics-informed neural network for volumetric sound field reconstruction of speech signals","authors":"Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande","doi":"10.1186/s13636-024-00366-2","DOIUrl":"https://doi.org/10.1186/s13636-024-00366-2","url":null,"abstract":"Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"10 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-17 DOI: 10.1186/s13636-024-00364-4
Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande
The estimation of sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics, and virtual reality. Sound fields can be estimated from a number of measurements distributed over space, yet this remains a challenging problem due to the large experimental effort required. In this work we investigate sensor distributions that are optimal for estimating sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction in the number of measurements of approximately 50% when the sensor positions are optimized for reconstructing the sound field, compared with random distributions. The results indicate that optimizing the sensor positions is also valuable when the vector of parameters is sparse, especially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.
{"title":"Optimal sensor placement for the spatial reconstruction of sound fields","authors":"Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande","doi":"10.1186/s13636-024-00364-4","DOIUrl":"https://doi.org/10.1186/s13636-024-00364-4","url":null,"abstract":"The estimation sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics and virtual reality. Sound fields can be estimated from a number of measurements distributed over space yet this remains a challenging problem due to the large experimental effort required. In this work we investigate sensor distributions that are optimal to estimate sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction in the number of measurements of approximately 50% when the sensor positions are optimized for reconstructing the sound field when compared with random distributions. The results indicate that optimizing the sensors positions is also valuable when the vector of parameters is sparse, specially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"425 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for a specialized target domain which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model, subtracting, in the logarithmic domain, a source-domain language model trained on a transcription of the ASR’s training data. We then integrate a target-domain language model through addition in the logarithmic domain. This subtraction and addition to replace the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domains, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
{"title":"Recognition of target domain Japanese speech using language model replacement","authors":"Daiki Mori, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka","doi":"10.1186/s13636-024-00360-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00360-8","url":null,"abstract":"End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for specialized target domain which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model by subtracting the source-domain language model trained with a transcription of the ASR’s training data in a logarithmic domain. We then integrate a target domain language model through addition in the logarithmic domain. This subtraction and addition to replace of the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then we evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domain, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"27 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141744424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-19 DOI: 10.1186/s13636-024-00354-6
Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables taking advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net, and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .
{"title":"The whole is greater than the sum of its parts: improving music source separation by bridging networks","authors":"Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji","doi":"10.1186/s13636-024-00354-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00354-6","url":null,"abstract":"This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed that the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141744425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The disparities in phonetics and corpora across the three major dialects of Tibetan make it difficult for a single-task model trained on one dialect to accommodate the other dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.
{"title":"Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition","authors":"Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji","doi":"10.1186/s13636-024-00361-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00361-7","url":null,"abstract":"The disparities in phonetics and corpuses across the three major dialects of Tibetan exacerbate the difficulty of a single task model for one dialect to accommodate other different dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"97 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-17 DOI: 10.1186/s13636-024-00358-2
Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet
This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic, perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of an object when it collides with another object. The sound examples presented in the paper, available for listening on the accompanying site, illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent, and that the addition of random processes yields a significant improvement in the realism of the generated sounds.
{"title":"A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes","authors":"Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet","doi":"10.1186/s13636-024-00358-2","DOIUrl":"https://doi.org/10.1186/s13636-024-00358-2","url":null,"abstract":"This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of objects when it collides with an other object. The sound examples presented in the paper and available for listening on the accompanying site illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent and that the addition of random processes yields a significant improvement to the realism of the generated sounds.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-13 DOI: 10.1186/s13636-024-00359-1
Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu
End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task, particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, as it allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances the various tasks is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts the multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across modalities, we propose to apply optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and to learn shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.
{"title":"Adaptive multi-task learning for speech to text translation","authors":"Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu","doi":"10.1186/s13636-024-00359-1","DOIUrl":"https://doi.org/10.1186/s13636-024-00359-1","url":null,"abstract":"End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances various tasks is challenging and computationally expensive. We proposed an adaptive multi-task learning method to dynamically adjust multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across different modalities, we proposed to apply optimal transport in the input of end-to-end model to find the alignment between speech and text sequences and learn the shared representations between them. Experimental results show that our method effectively improved the performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"56 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141611132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}