Pub Date : 2024-09-13 DOI: 10.1186/s13636-024-00363-5
Martin Jälmby, Filip Elvander, Toon van Waterschoot
Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a very compelling alternative to state-of-the-art Opus compression, as it performs as well as or better than Opus on all but one of the considered measures, with the added benefit of being amenable to fast time-domain convolution.
{"title":"Compression of room impulse responses for compact storage and fast low-latency convolution","authors":"Martin Jälmby, Filip Elvander, Toon van Waterschoot","doi":"10.1186/s13636-024-00363-5","DOIUrl":"https://doi.org/10.1186/s13636-024-00363-5","url":null,"abstract":"Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a very compelling option to the state-of-the-art Opus compression, as it performs as well or better than on all but one considered measure, with the added benefit of being amenable to fast time-domain convolution.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"16 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-11 DOI: 10.1186/s13636-024-00353-7
Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu
Nowadays, the application of artificial intelligence (AI) algorithms and techniques is ubiquitous and transversal. Fields that take advantage of AI advances include sound and music processing. The advances in interdisciplinary research potentially yield new insights that may further advance the AI methods in this field. This special issue aims to report recent progress and spur new research lines in AI-driven sound and music processing, especially within interdisciplinary research scenarios.
{"title":"Guest editorial: AI for computational audition—sound and music processing","authors":"Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu","doi":"10.1186/s13636-024-00353-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00353-7","url":null,"abstract":"Nowadays, the application of artificial intelligence (AI) algorithms and techniques is ubiquitous and transversal. Fields that take advantage of AI advances include sound and music processing. The advances in interdisciplinary research potentially yield new insights that may further advance the AI methods in this field. This special issue aims to report recent progress and spur new research lines in AI-driven sound and music processing, especially within interdisciplinary research scenarios.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"79 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-10 DOI: 10.1186/s13636-024-00362-6
Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari
A kernel interpolation method for the acoustic transfer function (ATF) between regions, constrained by the physics of sound while being adaptive to the data, is proposed. Most ATF interpolation methods aim to model the ATF for a fixed source by using techniques that fit the estimation to the measurements while not taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that represents the directed and residual fields with two separate kernel functions. The directed field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference, both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.
{"title":"Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach","authors":"Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari","doi":"10.1186/s13636-024-00362-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00362-6","url":null,"abstract":"A kernel interpolation method for the acoustic transfer function (ATF) between regions constrained by the physics of sound while being adaptive to the data is proposed. Most ATF interpolation methods aim to model the ATF for fixed source by using techniques that fit the estimation to the measurements while not taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that considers both directed and residual fields as two separate kernel functions. The directed field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"60 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-09 DOI: 10.1186/s13636-024-00366-2
Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande
Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.
{"title":"Physics-informed neural network for volumetric sound field reconstruction of speech signals","authors":"Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande","doi":"10.1186/s13636-024-00366-2","DOIUrl":"https://doi.org/10.1186/s13636-024-00366-2","url":null,"abstract":"Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"10 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-17 DOI: 10.1186/s13636-024-00364-4
Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande
The estimation of sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics, and virtual reality. Sound fields can be estimated from a number of measurements distributed over space, yet this remains a challenging problem due to the large experimental effort required. In this work we investigate sensor distributions that are optimal for estimating sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction in the number of measurements of approximately 50% when the sensor positions are optimized for reconstructing the sound field, compared with random distributions. The results indicate that optimizing the sensor positions is also valuable when the vector of parameters is sparse, especially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.
{"title":"Optimal sensor placement for the spatial reconstruction of sound fields","authors":"Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande","doi":"10.1186/s13636-024-00364-4","DOIUrl":"https://doi.org/10.1186/s13636-024-00364-4","url":null,"abstract":"The estimation sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics and virtual reality. Sound fields can be estimated from a number of measurements distributed over space yet this remains a challenging problem due to the large experimental effort required. In this work we investigate sensor distributions that are optimal to estimate sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction in the number of measurements of approximately 50% when the sensor positions are optimized for reconstructing the sound field when compared with random distributions. The results indicate that optimizing the sensors positions is also valuable when the vector of parameters is sparse, specially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"425 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for a specialized target domain which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model, subtracting, in the logarithmic domain, a source-domain language model trained on a transcription of the ASR’s training data. We then integrate a target-domain language model through addition in the logarithmic domain. This subtraction and addition to replace the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domains, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
{"title":"Recognition of target domain Japanese speech using language model replacement","authors":"Daiki Mori, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka","doi":"10.1186/s13636-024-00360-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00360-8","url":null,"abstract":"End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for specialized target domain which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model by subtracting the source-domain language model trained with a transcription of the ASR’s training data in a logarithmic domain. We then integrate a target domain language model through addition in the logarithmic domain. This subtraction and addition to replace of the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then we evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domain, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"27 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141744424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-19 DOI: 10.1186/s13636-024-00354-6
Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables taking advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net, and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .
{"title":"The whole is greater than the sum of its parts: improving music source separation by bridging networks","authors":"Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji","doi":"10.1186/s13636-024-00354-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00354-6","url":null,"abstract":"This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed that the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX .","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141744425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The disparities in phonetics and corpora across the three major dialects of Tibetan make it difficult for a single-task model trained on one dialect to accommodate the other dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.
{"title":"Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition","authors":"Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji","doi":"10.1186/s13636-024-00361-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00361-7","url":null,"abstract":"The disparities in phonetics and corpuses across the three major dialects of Tibetan exacerbate the difficulty of a single task model for one dialect to accommodate other different dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"97 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-17 DOI: 10.1186/s13636-024-00358-2
Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet
This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic, perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of an object when it collides with another object. The sound examples presented in the paper, available for listening on the accompanying site, illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent, and that the addition of random processes yields a significant improvement in the realism of the generated sounds.
{"title":"A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes","authors":"Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet","doi":"10.1186/s13636-024-00358-2","DOIUrl":"https://doi.org/10.1186/s13636-024-00358-2","url":null,"abstract":"This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of objects when it collides with an other object. The sound examples presented in the paper and available for listening on the accompanying site illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent and that the addition of random processes yields a significant improvement to the realism of the generated sounds.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-13 DOI: 10.1186/s13636-024-00359-1
Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu
End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task, particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, as it allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances the various tasks is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts the multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across modalities, we propose to apply optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and to learn shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.
{"title":"Adaptive multi-task learning for speech to text translation","authors":"Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu","doi":"10.1186/s13636-024-00359-1","DOIUrl":"https://doi.org/10.1186/s13636-024-00359-1","url":null,"abstract":"End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances various tasks is challenging and computationally expensive. We proposed an adaptive multi-task learning method to dynamically adjust multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across different modalities, we proposed to apply optimal transport in the input of end-to-end model to find the alignment between speech and text sequences and learn the shared representations between them. Experimental results show that our method effectively improved the performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"56 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141611132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}