Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition
Pub Date: 2024-07-17, DOI: 10.1186/s13636-024-00361-7
Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji
The disparities in phonetics and corpora across the three major Tibetan dialects make it difficult for a single-task model trained on one dialect to accommodate the other dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model acquires more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, aiming to strengthen the model’s ability to discriminate the variations and differences among dialects and thereby enhance its performance on Tibetan multi-dialect speech recognition. The experimental results show that task-diverse meta-learning improves Tibetan multi-dialect speech recognition, demonstrating its effectiveness and applicability and contributing to the advancement of speech recognition techniques in multi-dialect environments.
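The abstract does not specify the meta-learning algorithm or model, so the sketch below only illustrates the general idea with a first-order (Reptile-style) meta-update over episodes drawn from the two source tasks, dialect ID and speaker ID; the `AcousticEncoder`, `sample_episode`, and all hyperparameters are hypothetical placeholders rather than the authors' code.

```python
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEncoder(nn.Module):
    # Tiny stand-in for a shared speech encoder over log-mel frames (hypothetical).
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        # One classification head per source task: dialect ID and speaker ID.
        self.heads = nn.ModuleDict({
            "dialect_id": nn.Linear(hidden, 3),    # three major Tibetan dialects
            "speaker_id": nn.Linear(hidden, 100),  # hypothetical speaker inventory
        })

    def forward(self, x, task):
        h, _ = self.rnn(x)                 # (batch, frames, hidden)
        return self.heads[task](h[:, -1])

def sample_episode(task, batch=8, frames=200, n_mels=80):
    # Placeholder sampler; real code would draw labelled Tibetan speech segments.
    n_cls = 3 if task == "dialect_id" else 100
    return torch.randn(batch, frames, n_mels), torch.randint(0, n_cls, (batch,))

def meta_train(model, meta_steps=100, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    tasks = ["dialect_id", "speaker_id"]   # the two source tasks
    for _ in range(meta_steps):
        task = random.choice(tasks)
        fast = copy.deepcopy(model)        # adapt a copy on the sampled task
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            x, y = sample_episode(task)
            loss = F.cross_entropy(fast(x, task), y)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():              # Reptile meta-update: move toward adapted weights
            for p, q in zip(model.parameters(), fast.parameters()):
                p.add_(meta_lr * (q - p))
    return model

# The meta-trained encoder would afterwards be fine-tuned on multi-dialect ASR.
encoder = meta_train(AcousticEncoder())
```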
{"title":"Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition","authors":"Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji","doi":"10.1186/s13636-024-00361-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00361-7","url":null,"abstract":"The disparities in phonetics and corpuses across the three major dialects of Tibetan exacerbate the difficulty of a single task model for one dialect to accommodate other different dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"97 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes
Pub Date: 2024-07-17, DOI: 10.1186/s13636-024-00358-2
Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet
This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to reproduce realistic, perceptually salient effects in musical instruments in an efficient manner. The transfer of energy between the filters is controlled through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented; in particular, examples are proposed for sounds corresponding to impacts on thin structures and to the perturbation of a vibrating object when it collides with another object. The sound examples presented in the paper, available for listening on the accompanying site, illustrate that simple control of the input parameters allows the generation of sounds with coherent evocations and that the addition of random processes yields a significant improvement in the realism of the generated sounds.
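As a rough illustration of the coupling-matrix idea (not the authors' filter design), the following NumPy sketch drives a small bank of second-order resonators with an impulse and feeds a fraction of each mode's previous output into the other modes' inputs, so energy appears in modes that were never excited directly; all frequencies, decay rates, and coupling values are arbitrary examples, and large coupling values can destabilize the loop.

```python
import numpy as np

fs = 44100
n = int(fs * 1.0)                                   # one second of output

freqs = np.array([220.0, 570.0, 1130.0, 1870.0])    # modal frequencies in Hz (arbitrary)
decays = np.array([3.0, 4.0, 6.0, 8.0])             # decay rates in 1/s (arbitrary)
m = len(freqs)

# Each mode i is a two-pole resonator: y_i[t] = a1_i*y_i[t-1] + a2_i*y_i[t-2] + drive_i[t]
r = np.exp(-decays / fs)
a1 = 2.0 * r * np.cos(2.0 * np.pi * freqs / fs)
a2 = -r ** 2

# Coupling matrix: small off-diagonal terms transfer energy between modes.
C = 0.02 * (np.ones((m, m)) - np.eye(m))

excite = np.zeros((n, m))
excite[2, 0] = 1.0                                  # impulse into the first mode only

y = np.zeros((n, m))
for t in range(2, n):
    drive = excite[t] + C @ y[t - 1]                # external excitation + coupled input
    y[t] = a1 * y[t - 1] + a2 * y[t - 2] + drive

out = y.sum(axis=1)
out /= np.max(np.abs(out)) + 1e-12                  # normalize before listening / saving
```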
{"title":"A simplified and controllable model of mode coupling for addressing nonlinear phenomena in sound synthesis processes","authors":"Samuel Poirot, Stefan Bilbao, Richard Kronland-Martinet","doi":"10.1186/s13636-024-00358-2","DOIUrl":"https://doi.org/10.1186/s13636-024-00358-2","url":null,"abstract":"This paper introduces a simplified and controllable model for mode coupling in the context of modal synthesis. The model employs efficient coupled filters for sound synthesis purposes, intended to emulate the generation of sounds radiated by sources under strongly nonlinear conditions. Such filters generate tonal components in an interdependent way and are intended to emulate realistic perceptually salient effects in musical instruments in an efficient manner. The control of energy transfer between the filters is realized through a coupling matrix. The generation of prototypical sounds corresponding to nonlinear sources with the filter bank is presented. In particular, examples are proposed to generate sounds corresponding to impacts on thin structures and to the perturbation of the vibration of objects when it collides with an other object. The sound examples presented in the paper and available for listening on the accompanying site illustrate that a simple control of the input parameters allows the generation of sounds whose evocation is coherent and that the addition of random processes yields a significant improvement to the realism of the generated sounds.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive multi-task learning for speech to text translation
Pub Date: 2024-07-13, DOI: 10.1186/s13636-024-00359-1
Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu
End-to-end speech to text translation aims to directly translate speech in one language into text in another, a challenging cross-modal task, particularly in scenarios with limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, allowing models to leverage extensive machine translation data to learn the mapping between source and target languages and thereby improve speech translation performance. However, in multi-task learning, finding a set of weights that balances the various tasks is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts the multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across modalities impede speech translation models from harnessing textual data effectively. To bridge the gap between modalities, we propose applying optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and to learn shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.
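The abstract only states that the task weights follow the "proportional losses incurred during training", so the following sketch encodes one plausible reading: each task is weighted by its current loss relative to its initial loss and the weights are renormalized at every step; the class name and the two-task setup (ST and MT) are illustrative assumptions, not the authors' exact rule.

```python
from typing import Dict

class ProportionalTaskWeighter:
    def __init__(self, task_names):
        self.initial: Dict[str, float] = {}
        self.tasks = list(task_names)

    def weights(self, losses: Dict[str, float]) -> Dict[str, float]:
        # Record the first observed loss per task as a reference level.
        for t in self.tasks:
            self.initial.setdefault(t, losses[t])
        # A task whose loss has decreased less (higher ratio) receives more weight.
        ratios = {t: losses[t] / max(self.initial[t], 1e-8) for t in self.tasks}
        total = sum(ratios.values())
        return {t: len(self.tasks) * r / total for t, r in ratios.items()}

# Usage inside a training loop (losses are detached scalar values):
weighter = ProportionalTaskWeighter(["st", "mt"])
losses = {"st": 4.2, "mt": 2.9}
w = weighter.weights(losses)
combined = sum(w[t] * losses[t] for t in w)   # weighted multi-task objective value
print(w, combined)
```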
{"title":"Adaptive multi-task learning for speech to text translation","authors":"Xin Feng, Yue Zhao, Wei Zong, Xiaona Xu","doi":"10.1186/s13636-024-00359-1","DOIUrl":"https://doi.org/10.1186/s13636-024-00359-1","url":null,"abstract":"End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances various tasks is challenging and computationally expensive. We proposed an adaptive multi-task learning method to dynamically adjust multi-task weights based on the proportional losses incurred during training, enabling adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge the gap across different modalities, we proposed to apply optimal transport in the input of end-to-end model to find the alignment between speech and text sequences and learn the shared representations between them. Experimental results show that our method effectively improved the performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"56 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141611132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration
Pub Date: 2024-06-26, DOI: 10.1186/s13636-024-00356-4
Mengzhen Ma, Ying Hu, Liang He, Hao Huang
The polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural networks (CRNN) suffer from insufficient feature extraction: convolutions with single-scale kernels fail to adequately extract the multi-scale features of sound events, which have diverse time-frequency characteristics, so the extracted features lack the fine-grained information needed to localize sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), in which a global-local feature (GLF) extractor extracts multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module, while a local feature extraction (LFE) unit captures detailed information. In addition, we design a feature recalibration (FR) module to emphasize crucial features along multiple dimensions. On the open datasets of Task 3 of the DCASE 2021 and 2022 Challenges, we compared the proposed GLFER-Net with six and four SSLD methods, respectively. The results show that GLFER-Net achieves competitive performance, and a series of ablation experiments and visualization analyses verify the effectiveness of the designed modules.
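The published ODConv, MSFE, LFE, and FR modules are not reproduced here; the PyTorch sketch below only illustrates the two underlying ideas, parallel convolutions at several kernel sizes (multi-scale extraction) followed by a squeeze-and-excitation-style channel recalibration, with module names, channel counts, and input shape chosen purely for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # Concatenate features extracted at several receptive-field sizes.
        return torch.cat([b(x) for b in self.branches], dim=1)

class ChannelRecalibration(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x).view(x.size(0), x.size(1), 1, 1)
        return x * w                       # emphasize informative channels

feats = MultiScaleBlock(in_ch=7, out_ch=96)      # e.g., stacked spectrogram/intensity channels
recal = ChannelRecalibration(96)
x = torch.randn(2, 7, 256, 64)                   # (batch, channels, frames, frequency bins)
y = recal(feats(x))
print(y.shape)                                   # torch.Size([2, 96, 256, 64])
```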
{"title":"GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration","authors":"Mengzhen Ma, Ying Hu, Liang He, Hao Huang","doi":"10.1186/s13636-024-00356-4","DOIUrl":"https://doi.org/10.1186/s13636-024-00356-4","url":null,"abstract":"Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"94 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fake speech detection using VGGish with attention block
Pub Date: 2024-06-26, DOI: 10.1186/s13636-024-00348-4
Tahira Kanwal, Rabbia Mahum, Abdul Malik AlSalman, Mohamed Sharaf, Haseeb Hassan
While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. The widespread use of deepfakes to spread false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated through numerous AI-based techniques. Several techniques for fake audio detection already exist using machine learning algorithms; however, they lack generalization and may not identify all types of AI-synthesized audio, such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, VGGish, combined with an attention block, the Convolutional Block Attention Module (CBAM), is introduced for spoofing detection. The proposed model converts input audio into mel-spectrograms, extracts the most representative features through the attention block, and classifies each recording as fake or real. Despite its simple layered architecture, the model captures complex relationships in audio signals through the spatial and channel attention of the CBAM module, making it well suited to audio spoofing detection. To evaluate its effectiveness, we conducted in-depth testing on the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07% for Logical Access (LA) attacks.
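The CBAM block itself is well documented (channel attention followed by spatial attention), and a minimal PyTorch version is sketched below; wiring it to a small stand-in backbone and a two-class head is an illustrative assumption rather than the authors' exact VGGish configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

# Toy usage: a small conv stack standing in for VGGish, then CBAM, then a binary head.
backbone = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
attn = CBAM(64)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))
mel = torch.randn(4, 1, 96, 64)        # (batch, 1, time frames, mel bands)
logits = head(attn(backbone(mel)))     # fake vs. real logits
```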
{"title":"Fake speech detection using VGGish with attention block","authors":"Tahira Kanwal, Rabbia Mahum, Abdul Malik AlSalman, Mohamed Sharaf, Haseeb Hassan","doi":"10.1186/s13636-024-00348-4","DOIUrl":"https://doi.org/10.1186/s13636-024-00348-4","url":null,"abstract":"While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. As a result, the ubiquitous usage of deepfakes for increasing false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated through numerous AI-based techniques. Several techniques for fake audio detection already exist using machine learning algorithms. However, they lack generalization and may not identify all types of AI-synthesized audios such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, i.e., VGGish, along with an attention block, namely Convolutional Block Attention Module (CBAM) for spoofing detection, is introduced. Our suggested model successfully classifies input audio into two classes: Fake and Real, converting them into mel-spectrograms, and extracting their most representative features due to the attention block. Our model is a significant technique to utilize for audio spoofing detection due to a simple layered architecture. It captures complex relationships in audio signals due to both spatial and channel features present in an attention module. To evaluate the effectiveness of our model, we have conducted in-depth testing using the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07 % for Logical Access (LA) attacks.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"169 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic dysarthria detection and severity level assessment using CWT-layered CNN model
Pub Date: 2024-06-25, DOI: 10.1186/s13636-024-00357-3
Shaik Sajiha, Kodali Radha, Dhulipalla Venkata Rao, Nammi Sneha, Suryanarayana Gunnam, Durga Prasad Bavirisetti
Dysarthria is a speech disorder that affects the ability to communicate due to articulation difficulties. This research proposes a novel method for automatic dysarthria detection (ADD) and automatic dysarthria severity level assessment (ADSLA) using a variable continuous wavelet transform (CWT) layered convolutional neural network (CNN) model. To determine its efficiency, the proposed model is assessed on two distinct corpora, TORGO and UA-Speech, comprising speech signals from both dysarthric and healthy speakers. The study explores the effectiveness of CWT-layered CNN models that employ different wavelets, such as Amor, Morse, and Bump, and analyzes the models’ performance without the need for hand-crafted feature extraction, which provides deeper insight into how the models process complex data. Raw waveform modeling also preserves the original signal’s integrity and nuance, making it well suited to applications such as speech recognition, signal processing, and image processing. Extensive analysis and experimentation revealed that the Amor wavelet surpasses the Morse and Bump wavelets in accurately representing signal characteristics, outperforming them in signal reconstruction fidelity, noise suppression, and feature extraction accuracy. The proposed CWT-layered CNN model underscores the importance of selecting an appropriate wavelet for signal-processing tasks, with the Amor wavelet being a reliable and precise choice. The UA-Speech dataset proves crucial for more accurate dysarthria classification, and advanced deep learning techniques can simplify early intervention measures and expedite the diagnosis process.
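A minimal sketch of the CWT-plus-CNN idea is given below; PyWavelets does not provide MATLAB's Amor, Morse, or Bump wavelets, so a complex Morlet wavelet is used as a stand-in, and the network layout, scale range, and class count are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def scalogram(signal, fs=16000, n_scales=64, wavelet="cmor1.5-1.0"):
    # CWT magnitude of the raw waveform; shape (n_scales, n_samples).
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=1 / fs)
    return np.abs(coeffs).astype(np.float32)

class ScalogramCNN(nn.Module):
    def __init__(self, n_classes=2):       # 2 for detection; more classes for severity levels
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Toy run on one second of random audio standing in for a TORGO/UA-Speech utterance.
audio = np.random.randn(16000)
sgram = torch.from_numpy(scalogram(audio)).unsqueeze(0).unsqueeze(0)  # (1, 1, scales, time)
model = ScalogramCNN()
print(model(sgram).shape)     # torch.Size([1, 2])
```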
{"title":"Automatic dysarthria detection and severity level assessment using CWT-layered CNN model","authors":"Shaik Sajiha, Kodali Radha, Dhulipalla Venkata Rao, Nammi Sneha, Suryanarayana Gunnam, Durga Prasad Bavirisetti","doi":"10.1186/s13636-024-00357-3","DOIUrl":"https://doi.org/10.1186/s13636-024-00357-3","url":null,"abstract":"Dysarthria is a speech disorder that affects the ability to communicate due to articulation difficulties. This research proposes a novel method for automatic dysarthria detection (ADD) and automatic dysarthria severity level assessment (ADSLA) by using a variable continuous wavelet transform (CWT) layered convolutional neural network (CNN) model. To determine their efficiency, the proposed model is assessed using two distinct corpora, TORGO and UA-Speech, comprising both dysarthria patients and healthy subject speech signals. The research study explores the effectiveness of CWT-layered CNN models that employ different wavelets such as Amor, Morse, and Bump. The study aims to analyze the models’ performance without the need for feature extraction, which could provide deeper insights into the effectiveness of the models in processing complex data. Also, raw waveform modeling preserves the original signal’s integrity and nuance, making it ideal for applications like speech recognition, signal processing, and image processing. Extensive analysis and experimentation have revealed that the Amor wavelet surpasses the Morse and Bump wavelets in accurately representing signal characteristics. The Amor wavelet outperforms the others in terms of signal reconstruction fidelity, noise suppression capabilities, and feature extraction accuracy. The proposed CWT-layered CNN model emphasizes the importance of selecting the appropriate wavelet for signal-processing tasks. The Amor wavelet is a reliable and precise choice for applications. The UA-Speech dataset is crucial for more accurate dysarthria classification. Advanced deep learning techniques can simplify early intervention measures and expedite the diagnosis process.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MIRACLE—a microphone array impulse response dataset for acoustic learning
Pub Date: 2024-06-18, DOI: 10.1186/s13636-024-00352-8
Adam Kujawski, Art J. R. Pelling, Ennes Sarradj
This work introduces a large dataset comprising impulse responses of spatially distributed sources within a plane parallel to a planar microphone array. The dataset, named MIRACLE, encompasses 856,128 single-channel impulse responses and includes four different measurement scenarios. Three measurement scenarios were conducted under anechoic conditions; the fourth includes an additional specular reflection from a reflective panel. The source positions were obtained by uniformly discretizing a rectangular source plane parallel to the microphone array for each scenario. The dataset contains three scenarios with a spatial resolution of $$23\,\textrm{mm}$$ at two different source-plane-to-array distances, as well as a scenario with a resolution of $$5\,\textrm{mm}$$ for the shorter distance. In contrast to existing room impulse response datasets, the accuracy of the provided source location labels is assessed, and additional metadata, such as the directivity of the loudspeaker used for excitation, is provided. The MIRACLE dataset can be used as a benchmark for data-driven modelling and interpolation methods as well as for various acoustic machine learning tasks, such as source separation, localization, and characterization. Two timely applications of the dataset are presented in this work: the generation of microphone array data for data-driven source localization and characterization tasks, and data-driven model order reduction.
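The abstract does not describe how the dataset is packaged on disk, so the sketch below assumes the impulse responses have already been loaded into a NumPy array of shape (sources, microphones, taps) and only illustrates the first application mentioned above: synthesizing microphone-array signals by convolving a dry source signal with the measured impulse responses of a chosen source position. All array sizes are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
fs = 32000
n_src, n_mic, n_taps = 16, 64, 2048                        # hypothetical sizes, not the dataset layout
irs = 0.01 * rng.standard_normal((n_src, n_mic, n_taps))   # placeholder for loaded impulse responses

def simulate_array(source_signal, source_index, irs):
    # Convolve one dry signal with every microphone's IR for the chosen source position.
    h = irs[source_index]                                  # (n_mic, n_taps)
    return np.stack([fftconvolve(source_signal, h[m]) for m in range(h.shape[0])])

dry = rng.standard_normal(fs)                              # 1 s of white noise as the dry source
array_signals = simulate_array(dry, source_index=7, irs=irs)
print(array_signals.shape)                                 # (64, fs + n_taps - 1)
```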
{"title":"MIRACLE—a microphone array impulse response dataset for acoustic learning","authors":"Adam Kujawski, Art J. R. Pelling, Ennes Sarradj","doi":"10.1186/s13636-024-00352-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00352-8","url":null,"abstract":"This work introduces a large dataset comprising impulse responses of spatially distributed sources within a plane parallel to a planar microphone array. The dataset, named MIRACLE, encompasses 856,128 single-channel impulse responses and includes four different measurement scenarios. Three measurement scenarios were conducted under anechoic conditions. The fourth scenario includes an additional specular reflection from a reflective panel. The source positions were obtained by uniformly discretizing a rectangular source plane parallel to the microphone for each scenario. The dataset contains three scenarios with a spatial resolution of $$23,textrm{mm}$$ at two different source-plane-to-array distances, as well as a scenario with a resolution of $$5,textrm{mm}$$ for the shorter distance. In contrast to existing room impulse response datasets, the accuracy of the provided source location labels is assessed and additional metadata, such as the directivity of the loudspeaker used for excitation, is provided. The MIRACLE dataset can be used as a benchmark for data-driven modelling and interpolation methods as well as for various acoustic machine learning tasks, such as source separation, localization, and characterization. Two timely applications of the dataset are presented in this work: the generation of microphone array data for data-driven source localization and characterization tasks and data-driven model order reduction.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"197 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating the first and second derivatives of discrete audio data
Pub Date: 2024-06-18, DOI: 10.1186/s13636-024-00355-5
Marcin Lewandowski
A new method for estimating the first and second derivatives of discrete audio signals, intended to achieve higher computational precision in analyzing the performance and characteristics of digital audio systems, is presented. The method could find numerous applications in modeling nonlinear audio circuit systems, e.g., for audio synthesis and creating audio effects, music recognition and classification, time-frequency analysis based on nonstationary audio signal decomposition, audio steganalysis and digital audio authentication, or audio feature extraction. The proposed algorithm employs the ordinary 7-point-stencil central-difference formulas with improvements that minimize the round-off and truncation errors. This is achieved by treating the step size of numerical differentiation as a regularization parameter, which acts as a decision threshold in all calculations. This approach requires shifting the discrete audio data by fractions of the initial sample rate, which is accomplished with fractional-delay FIR filters designed using modified 11-term cosine-sum windows for interpolation and shifting of audio signals. The maximum relative errors in estimating the first and second derivatives of discrete audio signals are on the order of $$10^{-13}$$ and $$10^{-10}$$, respectively, over the entire audio band, which is close to double-precision floating-point accuracy for the first derivative and better than single-precision floating-point accuracy for the second derivative estimation. Numerical testing showed that this performance is not influenced by the type of signal being differentiated (stationary or nonstationary) and that the method provides better results than other known differentiation methods in the audio band up to 21 kHz.
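For reference, the sketch below applies the standard sixth-order, 7-point central-difference formulas that the method builds on to a sampled sine and checks them against the analytic derivatives; the paper's refinements, the regularized step size and the fractional-delay FIR shifting, are not reproduced here.

```python
import numpy as np

def diff7_first(x, h):
    # f'(t) ~ (-f[-3] + 9 f[-2] - 45 f[-1] + 45 f[+1] - 9 f[+2] + f[+3]) / (60 h)
    c = np.array([-1, 9, -45, 0, 45, -9, 1]) / (60 * h)
    return np.convolve(x, c[::-1], mode="valid")       # three samples lost at each edge

def diff7_second(x, h):
    # f''(t) ~ (2 f[-3] - 27 f[-2] + 270 f[-1] - 490 f[0] + 270 f[+1] - 27 f[+2] + 2 f[+3]) / (180 h^2)
    c = np.array([2, -27, 270, -490, 270, -27, 2]) / (180 * h * h)
    return np.convolve(x, c[::-1], mode="valid")

fs = 48000
t = np.arange(0, 0.01, 1 / fs)
f0 = 1000.0
x = np.sin(2 * np.pi * f0 * t)

d1 = diff7_first(x, 1 / fs)
d2 = diff7_second(x, 1 / fs)
ref1 = 2 * np.pi * f0 * np.cos(2 * np.pi * f0 * t)[3:-3]
ref2 = -(2 * np.pi * f0) ** 2 * np.sin(2 * np.pi * f0 * t)[3:-3]
print(np.max(np.abs(d1 - ref1)) / np.max(np.abs(ref1)))   # small relative error
print(np.max(np.abs(d2 - ref2)) / np.max(np.abs(ref2)))
```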
{"title":"Estimating the first and second derivatives of discrete audio data","authors":"Marcin Lewandowski","doi":"10.1186/s13636-024-00355-5","DOIUrl":"https://doi.org/10.1186/s13636-024-00355-5","url":null,"abstract":"A new method for estimating the first and second derivatives of discrete audio signals intended to achieve higher computational precision in analyzing the performance and characteristics of digital audio systems is presented. The method could find numerous applications in modeling nonlinear audio circuit systems, e.g., for audio synthesis and creating audio effects, music recognition and classification, time-frequency analysis based on nonstationary audio signal decomposition, audio steganalysis and digital audio authentication or audio feature extraction methods. The proposed algorithm employs the ordinary 7 point-stencil central-difference formulas with improvements that minimize the round-off and truncation errors. This is achieved by treating the step size of numerical differentiation as a regularization parameter, which acts as a decision threshold in all calculations. This approach requires shifting discrete audio data by fractions of the initial sample rate, which was obtained by fractional delay FIR filters designed with modified 11-term cosine-sum windows for interpolation and shifting of audio signals. The maximum relative error in estimating first and second derivatives of discrete audio signals are respectively in order of $$10^{-13}$$ and $$10^{-10}$$ over the entire audio band, which is close to double-precision floating-point accuracy for the first and better than single-precision floating-point accuracy for the second derivative estimation. Numerical testing showed that this performance of the proposed method is not influenced by the type of signal being differentiated (either stationary or nonstationary), and provides better results than other known differentiation methods, in the audio band up to 21 kHz.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"135 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music time signature detection using ResNet18
Pub Date: 2024-06-13, DOI: 10.1186/s13636-024-00346-6
Jeremiah Abimbola, Daniel Kostrzewa, Pawel Kasprowski
Time signature detection is a fundamental task in music information retrieval, aiding in music organization. In recent years, the demand for robust and efficient methods in music analysis has grown, underscoring the significance of advances in time signature detection. In this study, we explored the effectiveness of residual networks for time signature detection and compared the performance of the residual network (ResNet18) with existing models such as the audio similarity matrix (ASM) and the beat similarity matrix (BSM). We also compared it with traditional algorithms, such as the support vector machine (SVM), random forest, K-nearest neighbors (KNN), and naive Bayes, and with deep learning models, such as the convolutional neural network (CNN) and the convolutional recurrent neural network (CRNN). The evaluation is conducted using Mel-frequency cepstral coefficients (MFCCs) as feature representations on the Meter2800 dataset. Our results indicate that ResNet18 outperforms all other models, demonstrating the potential of deep learning models for accurate time signature detection.
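A minimal version of the ResNet18 setup can be sketched as follows: MFCCs are computed with librosa and torchvision's ResNet18 is adapted to a single-channel input and a small label set; the MFCC settings, clip length, and four-class output are assumptions for illustration rather than the paper's exact configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18

def mfcc_image(path, sr=22050, n_mfcc=40):
    # Load a clip, compute MFCCs, and normalize per clip.
    y, sr = librosa.load(path, sr=sr, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)
    return torch.from_numpy(mfcc.astype(np.float32))[None, None]   # (1, 1, n_mfcc, frames)

def build_model(n_classes=4):          # e.g., 3/4, 4/4, 5/4, 7/4 (hypothetical label set)
    model = resnet18(weights=None)
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model.fc = nn.Linear(model.fc.in_features, n_classes)
    return model

model = build_model()
x = torch.randn(1, 1, 40, 1292)        # stand-in for the MFCC matrix of a 30-s clip
print(model(x).shape)                  # torch.Size([1, 4])
```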
{"title":"Music time signature detection using ResNet18","authors":"Jeremiah Abimbola, Daniel Kostrzewa, Pawel Kasprowski","doi":"10.1186/s13636-024-00346-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00346-6","url":null,"abstract":"Time signature detection is a fundamental task in music information retrieval, aiding in music organization. In recent years, the demand for robust and efficient methods in music analysis has amplified, underscoring the significance of advancements in time signature detection. In this study, we explored the effectiveness of residual networks for time signature detection. Additionally, we compared the performance of the residual network (ResNet18) to already existing models such as audio similarity matrix (ASM) and beat similarity matrix (BSM). We also juxtaposed with traditional algorithms such as support vector machine (SVM), random forest, K-nearest neighbor (KNN), naive Bayes, and that of deep learning models, such as convolutional neural network (CNN) and convolutional recurrent neural network (CRNN). The evaluation is conducted using Mel-frequency cepstral coefficients (MFCCs) as feature representations on the Meter2800 dataset. Our results indicate that ResNet18 outperforms all other models thereby showing the potential of deep learning models for accurate time signature detection.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"61 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploration of Whisper fine-tuning strategies for low-resource ASR
Pub Date: 2024-06-01, DOI: 10.1186/s13636-024-00349-3
Yunpeng Liu, Xukui Yang, Dan Qu
Limited data availability remains a significant challenge for Whisper’s low-resource speech recognition, whose performance still falls short of practical application requirements. While previous studies have successfully reduced recognition error rates on target-language speech through fine-tuning, a comprehensive exploration and analysis of Whisper’s fine-tuning capabilities and of the advantages and disadvantages of various fine-tuning strategies are still lacking. This paper fills this gap by conducting a comprehensive experimental exploration of Whisper’s low-resource speech recognition performance using five fine-tuning strategies with limited supervised data from seven low-resource languages. The results and analysis demonstrate that all fine-tuning strategies explored in this paper significantly enhance Whisper’s performance. However, the strategies vary in their suitability and practical effectiveness, highlighting the need for careful selection based on the specific use case and the resources available.
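The abstract does not list the five strategies, so the sketch below shows just one common fine-tuning strategy as an example, freezing Whisper's encoder and updating only the decoder with Hugging Face Transformers; the model size, dummy batch contents, and hyperparameters are illustrative assumptions.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_name = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Example strategy: keep the encoder fixed, adapt only the decoder.
for p in model.model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

# One illustrative training step on a dummy batch; a real run would iterate over a
# low-resource dataset prepared with `processor` (log-mel features + token labels).
input_features = torch.randn(2, 80, 3000)            # (batch, mel bins, frames) for 30-s windows
labels = torch.randint(0, model.config.vocab_size, (2, 32))
out = model(input_features=input_features, labels=labels)
out.loss.backward()
optimizer.step()
```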
{"title":"Exploration of Whisper fine-tuning strategies for low-resource ASR","authors":"Yunpeng Liu, Xukui Yang, Dan Qu","doi":"10.1186/s13636-024-00349-3","DOIUrl":"https://doi.org/10.1186/s13636-024-00349-3","url":null,"abstract":"Limited data availability remains a significant challenge for Whisper’s low-resource speech recognition performance, falling short of practical application requirements. While previous studies have successfully reduced the recognition error rates of target language speech through fine-tuning, a comprehensive exploration and analysis of Whisper’s fine-tuning capabilities and the advantages and disadvantages of various fine-tuning strategies are still lacking. This paper aims to fill this gap by conducting comprehensive experimental exploration for Whisper’s low-resource speech recognition performance using five fine-tuning strategies with limited supervised data from seven low-resource languages. The results and analysis demonstrate that all fine-tuning strategies explored in this paper significantly enhance Whisper’s performance. However, different strategies vary in their suitability and practical effectiveness, highlighting the need for careful selection based on specific use cases and resources available.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141190231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}