Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge
Pub Date: 2024-02-29, DOI: 10.1186/s13636-024-00334-w
Javier Tejedor, Doroteo T. Toledano
The vast amount of information stored in audio repositories calls for efficient, automatic methods to search audio content. In that direction, search on speech (SoS) has received much attention in recent decades. To motivate the development of automatic systems, the ALBAYZIN evaluations have included a search on speech challenge since 2012. This challenge releases several databases covering different acoustic domains (e.g., spontaneous speech from TV shows, conference talks, and parliament sessions), with the aim of building automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show Whisper's capability to retrieve terms with specific properties. Although the results obtained for some databases are far from perfect (e.g., the broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far, making it a strong baseline for the upcoming challenge and encouraging participants to improve on it.
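As an illustration of the general transcribe-then-search approach behind such a baseline, the sketch below transcribes an audio file with the open-source openai-whisper package and looks the query terms up in the timestamped segments. The model size, language setting, file path and term list are illustrative assumptions, not the ALBAYZIN challenge configuration or data.

```python
# Minimal transcribe-then-search sketch (assumptions: openai-whisper package,
# Spanish audio, placeholder file path and term list).
import whisper

def detect_terms(audio_path, terms, model_name="medium"):
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="es")
    hits = []
    for seg in result["segments"]:          # each segment carries start/end times and text
        text = seg["text"].lower()
        for term in terms:
            if term.lower() in text:
                # Report the segment boundaries as an approximate detection window.
                hits.append({"term": term, "start": seg["start"], "end": seg["end"]})
    return hits

if __name__ == "__main__":
    print(detect_terms("tv_show_episode.wav", ["parlamento", "inteligencia artificial"]))
```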
{"title":"Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge","authors":"Javier Tejedor, Doroteo T. Toledano","doi":"10.1186/s13636-024-00334-w","DOIUrl":"https://doi.org/10.1186/s13636-024-00334-w","url":null,"abstract":"The vast amount of information stored in audio repositories makes necessary the development of efficient and automatic methods to search on audio content. In that direction, search on speech (SoS) has received much attention in the last decades. To motivate the development of automatic systems, ALBAYZIN evaluations include a search on speech challenge since 2012. This challenge releases several databases that cover different acoustic domains (i.e., spontaneous speech from TV shows, conference talks, parliament sessions, to name a few) aiming to build automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on some term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show the Whisper capability at retrieving terms that convey specific properties. Although the results obtained for some databases are far from being perfect (e.g., for broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far so that it presents a strong baseline system for the upcoming challenge, encouraging participants to improve it.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140010189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning
Pub Date: 2024-02-26, DOI: 10.1186/s13636-024-00336-8
Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu
Analyzing songs is a problem being investigated to aid various operations on music access platforms. Foremost among these problems is identifying the person who sings a song. In this study, a singer identification application built on Turkish singers and working for the Turkish language is proposed to address this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Problem-specific issues, such as capturing the differences between singers' voices and reducing the effect of year and album differences on the result, are thereby addressed. Tests and systematic evaluations show that the singer of a song can be identified with a certain level of success and that the approach remains stable under changes in singing style and song structure. The results were analyzed on a database of 9 singers and 180 songs. An accuracy of 89.4% was obtained using PCA-based reduction of the feature vector, normalization of the data, and the Extra Trees classifier. Precision, recall and F-score values were 89.9%, 89.4% and 89.5%, respectively.
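A minimal sketch of the described pipeline (mel-spectrogram and octave-based spectral contrast combined into a hybrid vector, followed by normalization, PCA and an Extra Trees classifier) is given below using librosa and scikit-learn; the number of mel bands, the time-summarization of the features and the PCA dimensionality are assumptions, not the paper's exact settings.

```python
# Sketch of a hybrid mel-spectrogram + spectral-contrast feature pipeline with
# normalization, PCA and an Extra Trees classifier. Feature sizes and PCA
# dimensionality are illustrative assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

def hybrid_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # octave-based bands
    # Summarize each feature over time and concatenate into one hybrid vector.
    return np.concatenate([mel.mean(axis=1), mel.std(axis=1),
                           contrast.mean(axis=1), contrast.std(axis=1)])

def train_singer_model(song_paths, singer_labels):
    X = np.stack([hybrid_features(p) for p in song_paths])
    clf = make_pipeline(StandardScaler(),
                        PCA(n_components=0.95),          # keep 95% of the variance
                        ExtraTreesClassifier(n_estimators=300, random_state=0))
    clf.fit(X, singer_labels)
    return clf
```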
{"title":"Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning","authors":"Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu","doi":"10.1186/s13636-024-00336-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00336-8","url":null,"abstract":"Analyzing songs is a problem that is being investigated to aid various operations on music access platforms. At the beginning of these problems is the identification of the person who sings the song. In this study, a singer identification application, which consists of Turkish singers and works for the Turkish language, is proposed in order to find a solution to this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Thus, problem-specific situations such as determining the differences in the voices of the singers and reducing the effects of the year and album differences on the result are discussed. As a result of the tests and systematic evaluations, it has been shown that a certain level of success has been achieved in the determination of the singer who sings the song, and that the song is in a stable structure against the changes in the singing style and song structure. The results were analyzed in a database of 9 singers and 180 songs. An accuracy value of 89.4% was obtained using the reduction of the feature vector by PCA, the normalization of the data, and the Extra Trees classifier. Precision, recall and f-score values were 89.9%, 89.4% and 89.5%, respectively.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"2 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139967872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sound field reconstruction using neural processes with dynamic kernels
Pub Date: 2024-02-20, DOI: 10.1186/s13636-024-00333-x
Zining Liang, Wen Zhang, Thushara D. Abhayapala
Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstruction accuracy, providing a promising alternative for sound field reconstruction.
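For context, the fixed-kernel GP interpolation that the proposed NP-based method replaces can be sketched with scikit-learn as follows; the RBF kernel, its length scale and the synthetic observation points are illustrative assumptions.

```python
# Fixed-kernel baseline: Gaussian-process interpolation of a sound field
# magnitude from sparse observations. Microphone positions and pressures
# below are synthetic placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
mic_xy = rng.uniform(-1.0, 1.0, size=(20, 2))                    # sparse observation points (m)
pressure = np.sin(3 * mic_xy[:, 0]) * np.cos(3 * mic_xy[:, 1])   # toy field magnitude

kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-3)   # pre-defined kernel
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(mic_xy, pressure)

# Predict the field magnitude on a dense grid, with an uncertainty estimate.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50),
                            np.linspace(-1, 1, 50)), axis=-1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)
```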
{"title":"Sound field reconstruction using neural processes with dynamic kernels","authors":"Zining Liang, Wen Zhang, Thushara D. Abhayapala","doi":"10.1186/s13636-024-00333-x","DOIUrl":"https://doi.org/10.1186/s13636-024-00333-x","url":null,"abstract":"Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstructing accuracy, providing a promising alternative for sound field reconstruction.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"32 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139928614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic classification of the physical surface in sound uroflowmetry using machine learning methods
Pub Date: 2024-02-16, DOI: 10.1186/s13636-024-00332-y
Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo
This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.
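A hedged sketch of the classifier comparison on labelled 1-s clips is shown below; the MFCC summary features, sampling rate and cross-validation setup are assumptions standing in for the paper's actual feature extraction and evaluation protocol.

```python
# Sketch of the SVM / random forest / k-NN comparison on 1-s clips labelled
# water, ceramic or silence. Feature choice and data loading are assumptions.
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def clip_features(y, sr):
    # MFCC summary statistics as a stand-in for the paper's features.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def evaluate(clips, labels, sr=16000):
    X = np.stack([clip_features(y, sr) for y in clips])
    y = np.asarray(labels)
    for name, clf in [("SVM", SVC()),
                      ("Random forest", RandomForestClassifier(n_estimators=200)),
                      ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
        scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} accuracy")
```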
{"title":"Automatic classification of the physical surface in sound uroflowmetry using machine learning methods","authors":"Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo","doi":"10.1186/s13636-024-00332-y","DOIUrl":"https://doi.org/10.1186/s13636-024-00332-y","url":null,"abstract":"This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
Pub Date: 2024-02-12, DOI: 10.1186/s13636-024-00329-7
Huda Barakat, Oytun Turk, Cenk Demiroglu
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section 8, we pinpoint research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this active research area and to offer guidance to interested researchers and future endeavors in this field.
{"title":"Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources","authors":"Huda Barakat, Oytun Turk, Cenk Demiroglu","doi":"10.1186/s13636-024-00329-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00329-7","url":null,"abstract":"Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In the Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"17 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vulnerability issues in Automatic Speaker Verification (ASV) systems
Pub Date: 2024-02-10, DOI: 10.1186/s13636-024-00328-8
Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido
Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that investigating the attacker's perspective can lead the way in preventing known and unknown threats, several countermeasures (CMs) have been proposed during the ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns organized at INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVspoof 5 challenge with the objectives of collecting massive spoofing/deepfake attack data (phase 1) and designing a spoofing-aware ASV system that uses a single classifier for both ASV and CM, yielding integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey of the diverse strategies and vulnerabilities that can be exploited to successfully attack an ASV system, such as target selection, the unavailability of global countermeasures that would reduce the attacker's chance of exploiting weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers other possible attacks, such as hardware attacks on ASV systems. Finally, we discuss several technological challenges from the attacker's perspective, which can be exploited to develop better defence mechanisms for the security of ASV systems.
{"title":"Vulnerability issues in Automatic Speaker Verification (ASV) systems","authors":"Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido","doi":"10.1186/s13636-024-00328-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00328-8","url":null,"abstract":"Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that the investigation of attacker’s perspectives is capable of leading the way to prevent known and unknown threats to ASV systems, several countermeasures (CMs) have been proposed during ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns that were organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVSpoof 5 challenge with the objective of collecting the massive spoofing/deepfake attack data (i.e., phase 1), and the design of a spoofing-aware ASV system using a single classifier for both ASV and CM, to design integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey on a diversity of possible strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, unavailability of global countermeasures to reduce the attacker’s chance to explore the weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks, such as hardware attacks on ASV systems. Finally, we also discuss the several technological challenges from the attacker’s perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"34 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind extraction of guitar effects through blind system inversion and neural guitar effect modeling
Pub Date: 2024-02-07, DOI: 10.1186/s13636-024-00330-0
Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann
Audio effects are a ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim of copying a target guitar sound, has been investigated previously: artificial neural networks first determine the effect class of a reference signal and subsequently its parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, soft-clipping and slapback delay effects. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference from the reference guitar effect.
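As a rough illustration of the neural-effect-modeling half of the approach, the sketch below fits a small dilated 1-D convolutional network to map a dry guitar signal to its effected counterpart; the architecture, loss and training step are assumptions, not the authors' model.

```python
# Illustrative neural guitar-effect model: a tiny dilated Conv1d network
# trained on (dry, wet) signal pairs. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEffectModel(nn.Module):
    def __init__(self, channels=16, kernel_size=9, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for i in range(layers):
            # Dilated convolution with padding chosen to keep the signal length.
            blocks += [nn.Conv1d(in_ch, channels, kernel_size,
                                 padding=(kernel_size // 2) * 2 ** i,
                                 dilation=2 ** i),
                       nn.Tanh()]
            in_ch = channels
        blocks.append(nn.Conv1d(channels, 1, 1))
        self.net = nn.Sequential(*blocks)

    def forward(self, dry):           # dry: (batch, 1, samples)
        return self.net(dry)

# One gradient step on a placeholder (dry, wet) audio pair.
model = TinyEffectModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
dry, wet = torch.randn(1, 1, 16384), torch.randn(1, 1, 16384)
loss = F.mse_loss(model(dry), wet)
loss.backward()
opt.step()
```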
{"title":"Blind extraction of guitar effects through blind system inversion and neural guitar effect modeling","authors":"Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann","doi":"10.1186/s13636-024-00330-0","DOIUrl":"https://doi.org/10.1186/s13636-024-00330-0","url":null,"abstract":"Audio effects are an ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of all audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim to copy a target guitar sound, has been previously investigated, where artificial neural networks first determine the effect class of a reference signal and subsequently the parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, softclipping and slapback delay effect. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference to the reference guitar effect.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"136 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement
Pub Date: 2024-02-03, DOI: 10.1186/s13636-024-00331-z
Sivaramakrishna Yecchuri, Sunny Dayal Vanambathina
Recent advancements in deep learning-based speech enhancement models have extensively used attention mechanisms, demonstrating their effectiveness in achieving state-of-the-art performance. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, it is composed of several adaptive time-frequency attention modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate hierarchical contextual information. Additionally, the sub-convolutional encoder-decoder model uses different kernel sizes to extract multi-scale local and contextual features from the noisy speech. The experimental results show that the proposed model outperforms several state-of-the-art methods.
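To illustrate the multi-kernel-size idea behind the sub-convolutional encoder, the sketch below builds a block with parallel convolutions of different kernel sizes whose outputs are concatenated; the channel counts, kernel sizes and activation are assumptions rather than the TANSCUNet configuration.

```python
# Multi-scale "sub-convolutional" block: parallel 2-D convolutions with
# different kernel sizes, concatenated along the channel axis. Sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class SubConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert out_ch % len(kernel_sizes) == 0
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.act = nn.PReLU()

    def forward(self, x):             # x: (batch, channels, time, freq)
        # Each branch sees a different receptive field; concatenate the results.
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

# Example: a spectrogram-like input with 1 channel, 100 frames, 161 bins.
block = SubConvBlock(1, 48)
y = block(torch.randn(2, 1, 100, 161))    # -> (2, 48, 100, 161)
```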
{"title":"Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement","authors":"Sivaramakrishna Yecchuri, Sunny Dayal Vanambathina","doi":"10.1186/s13636-024-00331-z","DOIUrl":"https://doi.org/10.1186/s13636-024-00331-z","url":null,"abstract":"Recent advancements in deep learning-based speech enhancement models have extensively used attention mechanisms to achieve state-of-the-art methods by demonstrating their effectiveness. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, it is composed of several adaptive time―frequency attention modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate hierarchical contextual information. Additionally, a sub-convolutional encoder-decoder model used different kernel sizes to extract multi-scale local and contextual features from the noisy speech. The experimental results show that the proposed model outperforms several state-of-the-art methods.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139662818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustical feature analysis and optimization for aesthetic recognition of Chinese traditional music
Pub Date: 2024-02-02, DOI: 10.1186/s13636-023-00326-2
Lingyun Xie, Yuehong Wang, Yan Gao
Chinese traditional music, a vital expression of Chinese cultural heritage, possesses both a profound emotional resonance and artistic allure. This study sets out to refine and analyze the acoustical features essential for the aesthetic recognition of Chinese traditional music, utilizing a dataset spanning five aesthetic genres. Through recursive feature elimination, we distilled an initial set of 447 low-level physical features to a more manageable 44, establishing their feature-importance coefficients. This reduction allowed us to quantify the influence of higher-level musical components on aesthetic recognition, after establishing a correlation between these components and their physical counterparts. We conducted a comprehensive examination of the impact of various musical elements on aesthetic genres. Our findings indicate that the selected 44-dimensional feature set could enhance aesthetic recognition. Among the high-level musical factors, timbre emerges as the most influential, followed by rhythm, pitch, and tonality. Timbre proved pivotal in distinguishing between the JiYang and BeiShang genres, while rhythm and tonality were key in differentiating LingDong from JiYang, as well as LingDong from BeiShang.
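The recursive-feature-elimination step described above can be sketched with scikit-learn as follows; the random forest estimator, elimination step size and placeholder data are assumptions, with only the 447-to-44 reduction taken from the text.

```python
# Sketch of recursive feature elimination: reduce a 447-dimensional low-level
# feature matrix to 44 features and read out their importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X = np.random.rand(500, 447)           # placeholder: 500 excerpts x 447 features
y = np.random.randint(0, 5, size=500)  # placeholder: five aesthetic genres

selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=44, step=10)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)                 # indices of the 44 retained features
importances = selector.estimator_.feature_importances_   # importance coefficients of those features
```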
{"title":"Acoustical feature analysis and optimization for aesthetic recognition of Chinese traditional music","authors":"Lingyun Xie, Yuehong Wang, Yan Gao","doi":"10.1186/s13636-023-00326-2","DOIUrl":"https://doi.org/10.1186/s13636-023-00326-2","url":null,"abstract":"Chinese traditional music, a vital expression of Chinese cultural heritage, possesses both a profound emotional resonance and artistic allure. This study sets forth to refine and analyze the acoustical features essential for the aesthetic recognition of Chinese traditional music, utilizing a dataset spanning five aesthetic genres. Through recursive feature elimination, we distilled an initial set of 447 low-level physical features to a more manageable 44, establishing their feature-importance coefficients. This reduction allowed us to estimate the quantified influence of higher-level musical components on aesthetic recognition, following the establishment of a correlation between these components and their physical counterparts. We conducted a comprehensive examination of the impact of various musical elements on aesthetic genres. Our findings indicate that the selected 44-dimensional feature set could enhance aesthetic recognition. Among the high-level musical factors, timbre emerges as the most influential, followed by rhythm, pitch, and tonality. Timbre proved pivotal in distinguishing between the JiYang and BeiShang genres, while rhythm and tonality were key in differentiating LingDong from JiYang, as well as LingDong from BeiShang.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"76 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139662758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gated recurrent unit predictor model-based adaptive differential pulse code modulation speech decoder
Pub Date: 2024-01-20, DOI: 10.1186/s13636-023-00325-3
Gebremichael Kibret Sheferaw, Waweru Mwangi, Michael Kimwele, Adane Mamuye
Speech coding is a method for reducing the amount of data needed to represent speech signals by exploiting their statistical properties. Recently, neural network prediction models have gained attention in speech coding for reconstructing nonlinear and nonstationary speech signals. This study proposes a novel approach to improve speech coding performance by using a gated recurrent unit (GRU)-based adaptive differential pulse code modulation (ADPCM) system. The GRU predictor model is trained on a dataset of actual speech samples from the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus and the corresponding ADPCM fixed-predictor output speech samples. Our contribution lies in an algorithm for training the GRU predictive model that improves its performance in speech coding prediction and in a new offline-trained predictive model for the speech decoder. The results indicate that the proposed system significantly improves the accuracy of speech prediction, demonstrating its potential for speech prediction applications. Overall, this work presents a unique application of the GRU predictive model with ADPCM decoding in speech signal compression, providing a promising approach for future research in this field.
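A minimal sketch of a GRU next-sample predictor of the kind used in place of the fixed ADPCM predictor is given below in PyTorch; the hidden size, context length and mean-squared-error objective are assumptions, not the paper's exact training algorithm.

```python
# GRU-based next-sample predictor sketch: given previously reconstructed
# samples, predict the next sample (the residual would then be coded by ADPCM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, past, state=None):
        # past: (batch, time, 1) previously reconstructed samples.
        out, state = self.gru(past, state)
        return self.proj(out[:, -1]), state   # predicted next sample, recurrent state

model = GRUPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
past = torch.randn(8, 128, 1)                 # placeholder reconstructed context
target = torch.randn(8, 1)                    # placeholder true next samples
pred, _ = model(past)
loss = F.mse_loss(pred, target)
loss.backward()
opt.step()
```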
{"title":"Gated recurrent unit predictor model-based adaptive differential pulse code modulation speech decoder","authors":"Gebremichael Kibret Sheferaw, Waweru Mwangi, Michael Kimwele, Adane Mamuye","doi":"10.1186/s13636-023-00325-3","DOIUrl":"https://doi.org/10.1186/s13636-023-00325-3","url":null,"abstract":"Speech coding is a method to reduce the amount of data needs to represent speech signals by exploiting the statistical properties of the speech signal. Recently, in the speech coding process, a neural network prediction model has gained attention as the reconstruction process of a nonlinear and nonstationary speech signal. This study proposes a novel approach to improve speech coding performance by using a gated recurrent unit (GRU)-based adaptive differential pulse code modulation (ADPCM) system. This GRU predictor model is trained using a data set of speech samples from the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus actual sample and the ADPCM fixed-predictor output speech sample. Our contribution lies in the development of an algorithm for training the GRU predictive model that can improve its performance in speech coding prediction and a new offline trained predictive model for speech decoder. The results indicate that the proposed system significantly improves the accuracy of speech prediction, demonstrating its potential for speech prediction applications. Overall, this work presents a unique application of the GRU predictive model with ADPCM decoding in speech signal compression, providing a promising approach for future research in this field.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"85 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}