
Eurasip Journal on Audio Speech and Music Processing: Latest Articles

Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-07-17, DOI: 10.1186/s13636-024-00361-7
Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji
Differences in phonetics and corpora across the three major Tibetan dialects make it difficult for a single-task model trained on one dialect to accommodate the others. To address this issue, this paper proposes task-diverse meta-learning. The model can acquire more comprehensive and robust features, which facilitates its adaptation to the variations among dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, with the aim of improving the model’s ability to discriminate differences among dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.
Citations: 0
A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-07-17, DOI: 10.1186/s13636-024-00358-2
Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet
This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic, perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of an object when it collides with another object. The sound examples presented in the paper and available for listening on the accompanying site illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent and that the addition of random processes yields a significant improvement to the realism of the generated sounds.
Citations: 0
Adaptive multi-task learning for speech to text translation
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-07-13, DOI: 10.1186/s13636-024-00359-1
Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu
End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task, particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances various tasks is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across modalities, we propose to apply optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and learn the shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.
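As a rough illustration of loss-proportional weighting, the sketch below gives each task a weight equal to its share of the current total loss. This is an assumed, simplified rule for illustration only: the abstract does not state the paper's actual weighting formula, and the function names and the task keys `st` (speech translation) and `mt` (machine translation) are hypothetical.

```python
def proportional_weights(losses):
    """Return one weight per task, proportional to that task's current loss.

    Assumption: weight_i = loss_i / sum(losses). Falls back to uniform
    weights when the total loss is not positive.
    """
    total = sum(losses.values())
    if total <= 0:
        return {name: 1.0 / len(losses) for name in losses}
    return {name: value / total for name, value in losses.items()}


def combined_loss(losses):
    """Weighted sum of the task losses using the proportional weights."""
    weights = proportional_weights(losses)
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical per-step losses for the two tasks.
step_losses = {"st": 3.0, "mt": 1.0}
weights = proportional_weights(step_losses)
total_loss = combined_loss(step_losses)
```

Under this rule the task with the larger current loss receives the larger weight, so the balance shifts during training without a separate search over fixed weights.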
Citations: 0
GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-06-26, DOI: 10.1186/s13636-024-00356-4
Mengzhen Ma, Ying Hu, Liang He, Hao Huang
The polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on the convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The single-scale convolution kernels in a CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. As a result, the extracted features lack fine-grained information that is helpful for localizing sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed to capture detailed information. In addition, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task 3 in the DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.
Citations: 0
Fake speech detection using VGGish with attention block
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-06-26, DOI: 10.1186/s13636-024-00348-4
Tahira Kanwal, Rabbia Mahum, Abdul Malik AlSalman, Mohamed Sharaf, Haseeb Hassan
While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. The widespread use of deepfakes to spread false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated through AI-based techniques. Several techniques for fake audio detection using machine learning algorithms already exist. However, they lack generalization and may not identify all types of AI-synthesized audio, such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, VGGish, combined with an attention block, the Convolutional Block Attention Module (CBAM), is introduced for spoofing detection. The proposed model converts input audio into mel-spectrograms, extracts the most representative features with the attention block, and classifies each input as Fake or Real. Its simple layered architecture makes the model well suited to audio spoofing detection. It captures complex relationships in audio signals through the spatial and channel features of the attention module. To evaluate the effectiveness of the model, in-depth testing was conducted on the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07% for Logical Access (LA) attacks.
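The equal error rate (EER) reported above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from detector scores. It assumes that a higher score means "more likely real", uses a plain threshold sweep rather than the official ASVspoof evaluation tooling, and the function name and sample scores are made up for illustration.

```python
def equal_error_rate(real_scores, fake_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted as real when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        # False rejection: real audio scored below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance: fake audio scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Made-up scores: one real and one fake trial fall on the wrong side.
real = [0.9, 0.8, 0.7, 0.4]
fake = [0.1, 0.2, 0.3, 0.6]
eer = equal_error_rate(real, fake)
```

With few trials the estimate is coarse, since FAR and FRR can only change in steps of one trial; evaluation sets with many trials give a finer-grained value.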
Citations: 0
Automatic dysarthria detection and severity level assessment using CWT-layered CNN model
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-06-25, DOI: 10.1186/s13636-024-00357-3
Shaik Sajiha, Kodali Radha, Dhulipalla Venkata Rao, Nammi Sneha, Suryanarayana Gunnam, Durga Prasad Bavirisetti
Dysarthria is a speech disorder that affects the ability to communicate due to articulation difficulties. This research proposes a novel method for automatic dysarthria detection (ADD) and automatic dysarthria severity level assessment (ADSLA) using a variable continuous wavelet transform (CWT) layered convolutional neural network (CNN) model. To determine its efficiency, the proposed model is assessed using two distinct corpora, TORGO and UA-Speech, comprising speech signals from both dysarthria patients and healthy subjects. The study explores the effectiveness of CWT-layered CNN models that employ different wavelets such as Amor, Morse, and Bump. The study aims to analyze the models’ performance without the need for feature extraction, which could provide deeper insights into the effectiveness of the models in processing complex data. Also, raw waveform modeling preserves the original signal’s integrity and nuance, making it ideal for applications like speech recognition, signal processing, and image processing. Extensive analysis and experimentation have revealed that the Amor wavelet surpasses the Morse and Bump wavelets in accurately representing signal characteristics. The Amor wavelet outperforms the others in terms of signal reconstruction fidelity, noise suppression capabilities, and feature extraction accuracy. The proposed CWT-layered CNN model emphasizes the importance of selecting the appropriate wavelet for signal-processing tasks. The Amor wavelet is a reliable and precise choice for applications. The UA-Speech dataset is crucial for more accurate dysarthria classification. Advanced deep learning techniques can simplify early intervention measures and expedite the diagnosis process.
Citations: 0
MIRACLE—a microphone array impulse response dataset for acoustic learning
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-06-18, DOI: 10.1186/s13636-024-00352-8
Adam Kujawski, Art J. R. Pelling, Ennes Sarradj
This work introduces a large dataset comprising impulse responses of spatially distributed sources within a plane parallel to a planar microphone array. The dataset, named MIRACLE, encompasses 856,128 single-channel impulse responses and includes four different measurement scenarios. Three measurement scenarios were conducted under anechoic conditions. The fourth scenario includes an additional specular reflection from a reflective panel. The source positions were obtained by uniformly discretizing a rectangular source plane parallel to the microphone array for each scenario. The dataset contains three scenarios with a spatial resolution of $23\,\textrm{mm}$ at two different source-plane-to-array distances, as well as a scenario with a resolution of $5\,\textrm{mm}$ for the shorter distance. In contrast to existing room impulse response datasets, the accuracy of the provided source location labels is assessed and additional metadata, such as the directivity of the loudspeaker used for excitation, is provided. The MIRACLE dataset can be used as a benchmark for data-driven modelling and interpolation methods as well as for various acoustic machine learning tasks, such as source separation, localization, and characterization. Two timely applications of the dataset are presented in this work: the generation of microphone array data for data-driven source localization and characterization tasks and data-driven model order reduction.
Citations: 0
Estimating the first and second derivatives of discrete audio data
IF 2.4, Zone 3 Computer Science, Q2 ACOUSTICS, Pub Date: 2024-06-18, DOI: 10.1186/s13636-024-00355-5
Marcin Lewandowski
A new method is presented for estimating the first and second derivatives of discrete audio signals, intended to achieve higher computational precision in analyzing the performance and characteristics of digital audio systems. The method could find numerous applications in modeling nonlinear audio circuit systems, e.g., for audio synthesis and creating audio effects, music recognition and classification, time-frequency analysis based on nonstationary audio signal decomposition, audio steganalysis and digital audio authentication, or audio feature extraction methods. The proposed algorithm employs the ordinary 7-point stencil central-difference formulas with improvements that minimize the round-off and truncation errors. This is achieved by treating the step size of numerical differentiation as a regularization parameter, which acts as a decision threshold in all calculations. This approach requires shifting discrete audio data by fractions of the initial sample rate, which was achieved with fractional delay FIR filters designed with modified 11-term cosine-sum windows for interpolation and shifting of audio signals. The maximum relative errors in estimating the first and second derivatives of discrete audio signals are on the order of $10^{-13}$ and $10^{-10}$, respectively, over the entire audio band, which is close to double-precision floating-point accuracy for the first derivative and better than single-precision floating-point accuracy for the second derivative. Numerical testing showed that the performance of the proposed method is not influenced by the type of signal being differentiated (either stationary or nonstationary), and that it provides better results than other known differentiation methods in the audio band up to 21 kHz.
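For orientation, the ordinary 7-point central-difference stencils that the method starts from can be sketched as follows. This shows only the textbook formulas applied at the native sample spacing; it does not include the paper's step-size regularization or fractional-delay FIR shifting, so it will not reach the error levels quoted above. The function name and the 1 kHz test tone are illustrative assumptions.

```python
import math

# Standard 7-point central-difference weights for sample offsets -3..3.
D1 = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)
D2 = (1 / 90, -3 / 20, 3 / 2, -49 / 18, 3 / 2, -3 / 20, 1 / 90)


def derivatives(x, n, fs):
    """Estimate the first and second time derivatives of x at index n.

    x is a sequence of samples taken at sample rate fs (Hz); three
    samples are required on each side of n.
    """
    if n < 3 or n > len(x) - 4:
        raise ValueError("need three samples on each side of n")
    h = 1.0 / fs  # step size equals the sampling period
    window = x[n - 3:n + 4]
    d1 = sum(c * v for c, v in zip(D1, window)) / h
    d2 = sum(c * v for c, v in zip(D2, window)) / (h * h)
    return d1, d2


# Illustrative check on a 1 kHz sine sampled at 48 kHz.
fs = 48000.0
w = 2 * math.pi * 1000.0
x = [math.sin(w * k / fs) for k in range(64)]
d1, d2 = derivatives(x, 20, fs)
```

Both stencils have truncation error of order $h^6$, so the error grows quickly with frequency at a fixed sample rate; this is the kind of limitation that a smaller effective step size is meant to address.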
Citations: 0
Music time signature detection using ResNet18
IF 2.4 Zone 3 Computer Science Q2 ACOUSTICS Pub Date: 2024-06-13 DOI: 10.1186/s13636-024-00346-6
Jeremiah Abimbola, Daniel Kostrzewa, Pawel Kasprowski
Time signature detection is a fundamental task in music information retrieval, aiding in music organization. In recent years, the demand for robust and efficient methods in music analysis has grown, underscoring the significance of advancements in time signature detection. In this study, we explored the effectiveness of residual networks for time signature detection. We compared the performance of the residual network (ResNet18) with existing models such as the audio similarity matrix (ASM) and beat similarity matrix (BSM). We also compared it with traditional algorithms, such as support vector machine (SVM), random forest, K-nearest neighbor (KNN), and naive Bayes, and with deep learning models, such as convolutional neural network (CNN) and convolutional recurrent neural network (CRNN). The evaluation is conducted using Mel-frequency cepstral coefficients (MFCCs) as feature representations on the Meter2800 dataset. Our results indicate that ResNet18 outperforms all other models, showing the potential of deep learning models for accurate time signature detection.
Citations: 0
Exploration of Whisper fine-tuning strategies for low-resource ASR
IF 2.4 Zone 3 Computer Science Q2 ACOUSTICS Pub Date: 2024-06-01 DOI: 10.1186/s13636-024-00349-3
Yunpeng Liu, Xukui Yang, Dan Qu
Limited data availability remains a significant challenge for Whisper’s low-resource speech recognition performance, which falls short of practical application requirements. While previous studies have successfully reduced the recognition error rates of target-language speech through fine-tuning, a comprehensive exploration and analysis of Whisper’s fine-tuning capabilities and of the advantages and disadvantages of various fine-tuning strategies is still lacking. This paper aims to fill this gap by conducting a comprehensive experimental exploration of Whisper’s low-resource speech recognition performance using five fine-tuning strategies with limited supervised data from seven low-resource languages. The results and analysis demonstrate that all fine-tuning strategies explored in this paper significantly enhance Whisper’s performance. However, the strategies vary in their suitability and practical effectiveness, highlighting the need for careful selection based on the specific use case and the resources available.
Citations: 0