Pub Date : 2024-04-01DOI: 10.1186/s13636-024-00335-9
Rabbia Mahum, Aun Irtaza, Ali Javed, Haitham A. Mahmoud, Haseeb Hassan
Spoofed speeches are becoming a big threat to society due to advancements in artificial intelligence techniques. Therefore, there must be an automated spoofing detector that can be integrated into automatic speaker verification (ASV) systems. In this study, we recommend a novel and robust model, named DeepDet, based on deep-layered architecture, to categorize speech into two classes: spoofed and bonafide. DeepDet is an improved model based on Yet Another Mobile Network (YAMNet) employing a customized MobileNet combined with a bottleneck attention module (BAM). First, we convert audio into mel-spectrograms that consist of time–frequency representations on mel-scale. Second, we trained our deep layered model using the extracted mel-spectrograms on a Logical Access (LA) set, including synthesized speeches and voice conversions of the ASVspoof-2019 dataset. In the end, we classified the audios, utilizing our trained binary classifier. More precisely, we utilized the power of layered architecture and guided attention that can discern the spoofed speech from bonafide samples. Our proposed improved model employs depth-wise linearly separate convolutions, which makes our model lighter weight than existing techniques. Furthermore, we implemented extensive experiments to assess the performance of the suggested model using the ASVspoof 2019 corpus. We attained an equal error rate (EER) of 0.042% on Logical Access (LA), whereas 0.43% on Physical Access (PA) attacks. Therefore, the performance of the proposed model is significant on the ASVspoof 2019 dataset and indicates the effectiveness of the DeepDet over existing spoofing detectors. Additionally, our proposed model is robust enough that can identify the unseen spoofed audios and classifies the several attacks accurately.
{"title":"DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection","authors":"Rabbia Mahum, Aun Irtaza, Ali Javed, Haitham A. Mahmoud, Haseeb Hassan","doi":"10.1186/s13636-024-00335-9","DOIUrl":"https://doi.org/10.1186/s13636-024-00335-9","url":null,"abstract":"Spoofed speeches are becoming a big threat to society due to advancements in artificial intelligence techniques. Therefore, there must be an automated spoofing detector that can be integrated into automatic speaker verification (ASV) systems. In this study, we recommend a novel and robust model, named DeepDet, based on deep-layered architecture, to categorize speech into two classes: spoofed and bonafide. DeepDet is an improved model based on Yet Another Mobile Network (YAMNet) employing a customized MobileNet combined with a bottleneck attention module (BAM). First, we convert audio into mel-spectrograms that consist of time–frequency representations on mel-scale. Second, we trained our deep layered model using the extracted mel-spectrograms on a Logical Access (LA) set, including synthesized speeches and voice conversions of the ASVspoof-2019 dataset. In the end, we classified the audios, utilizing our trained binary classifier. More precisely, we utilized the power of layered architecture and guided attention that can discern the spoofed speech from bonafide samples. Our proposed improved model employs depth-wise linearly separate convolutions, which makes our model lighter weight than existing techniques. Furthermore, we implemented extensive experiments to assess the performance of the suggested model using the ASVspoof 2019 corpus. We attained an equal error rate (EER) of 0.042% on Logical Access (LA), whereas 0.43% on Physical Access (PA) attacks. Therefore, the performance of the proposed model is significant on the ASVspoof 2019 dataset and indicates the effectiveness of the DeepDet over existing spoofing detectors. Additionally, our proposed model is robust enough that can identify the unseen spoofed audios and classifies the several attacks accurately.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"1 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-28DOI: 10.1186/s13636-024-00337-7
Luca Comanducci, Fabio Antonacci, Augusto Sarti
Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems, due to physical space constraints. In this article, we propose a technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., where the spacing between loudspeakers is not constant, based on deep learning. The input are the driving signals obtained through a plane wave decomposition-based technique. While the considered driving signals are able to correctly reproduce the soundfield with a regular array, they show degraded performances when using irregular setups. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate the errors in the reproduction of the desired soundfield. Since no ground truth driving signals are available for the compensated ones, we train the model by calculating the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, pressure-matching approach, and linear optimizers for driving signal compensation.
{"title":"Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks","authors":"Luca Comanducci, Fabio Antonacci, Augusto Sarti","doi":"10.1186/s13636-024-00337-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00337-7","url":null,"abstract":"Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems, due to physical space constraints. In this article, we propose a technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., where the spacing between loudspeakers is not constant, based on deep learning. The input are the driving signals obtained through a plane wave decomposition-based technique. While the considered driving signals are able to correctly reproduce the soundfield with a regular array, they show degraded performances when using irregular setups. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate the errors in the reproduction of the desired soundfield. Since no ground truth driving signals are available for the compensated ones, we train the model by calculating the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, pressure-matching approach, and linear optimizers for driving signal compensation.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"61 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140313198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-27DOI: 10.1186/s13636-024-00338-6
Shivam Saini, Isaac Engel, Jürgen Peissig
Audio augmented reality (AAR), a prominent topic in the field of audio, requires understanding the listening environment of the user for rendering an authentic virtual auditory object. Reverberation time ( $$RT_{60}$$ ) is a predominant metric for the characterization of room acoustics and numerous approaches have been proposed to estimate it blindly from a reverberant speech signal. However, a single $$RT_{60}$$ value may not be sufficient to correctly describe and render the acoustics of a room. This contribution presents a method for the estimation of multiple room acoustic parameters required to render close-to-accurate room acoustics in an unknown environment. It is shown how these parameters can be estimated blindly using an audio transformer that can be deployed on a mobile device. Furthermore, the paper also discusses the use of the estimated room acoustic parameters to find a similar room from a dataset of real BRIRs that can be further used for rendering the virtual audio source. Additionally, a novel binaural room impulse response (BRIR) augmentation technique to overcome the limitation of inadequate data is proposed. Finally, the proposed method is validated perceptually by means of a listening test.
{"title":"An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment","authors":"Shivam Saini, Isaac Engel, Jürgen Peissig","doi":"10.1186/s13636-024-00338-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00338-6","url":null,"abstract":"Audio augmented reality (AAR), a prominent topic in the field of audio, requires understanding the listening environment of the user for rendering an authentic virtual auditory object. Reverberation time ( $$RT_{60}$$ ) is a predominant metric for the characterization of room acoustics and numerous approaches have been proposed to estimate it blindly from a reverberant speech signal. However, a single $$RT_{60}$$ value may not be sufficient to correctly describe and render the acoustics of a room. This contribution presents a method for the estimation of multiple room acoustic parameters required to render close-to-accurate room acoustics in an unknown environment. It is shown how these parameters can be estimated blindly using an audio transformer that can be deployed on a mobile device. Furthermore, the paper also discusses the use of the estimated room acoustic parameters to find a similar room from a dataset of real BRIRs that can be further used for rendering the virtual audio source. Additionally, a novel binaural room impulse response (BRIR) augmentation technique to overcome the limitation of inadequate data is proposed. Finally, the proposed method is validated perceptually by means of a listening test.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"117 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140313191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-29DOI: 10.1186/s13636-024-00334-w
Javier Tejedor, Doroteo T. Toledano
The vast amount of information stored in audio repositories makes necessary the development of efficient and automatic methods to search on audio content. In that direction, search on speech (SoS) has received much attention in the last decades. To motivate the development of automatic systems, ALBAYZIN evaluations include a search on speech challenge since 2012. This challenge releases several databases that cover different acoustic domains (i.e., spontaneous speech from TV shows, conference talks, parliament sessions, to name a few) aiming to build automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on some term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show the Whisper capability at retrieving terms that convey specific properties. Although the results obtained for some databases are far from being perfect (e.g., for broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far so that it presents a strong baseline system for the upcoming challenge, encouraging participants to improve it.
{"title":"Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge","authors":"Javier Tejedor, Doroteo T. Toledano","doi":"10.1186/s13636-024-00334-w","DOIUrl":"https://doi.org/10.1186/s13636-024-00334-w","url":null,"abstract":"The vast amount of information stored in audio repositories makes necessary the development of efficient and automatic methods to search on audio content. In that direction, search on speech (SoS) has received much attention in the last decades. To motivate the development of automatic systems, ALBAYZIN evaluations include a search on speech challenge since 2012. This challenge releases several databases that cover different acoustic domains (i.e., spontaneous speech from TV shows, conference talks, parliament sessions, to name a few) aiming to build automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on some term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show the Whisper capability at retrieving terms that convey specific properties. Although the results obtained for some databases are far from being perfect (e.g., for broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far so that it presents a strong baseline system for the upcoming challenge, encouraging participants to improve it.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140010189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-26DOI: 10.1186/s13636-024-00336-8
Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu
Analyzing songs is a problem that is being investigated to aid various operations on music access platforms. At the beginning of these problems is the identification of the person who sings the song. In this study, a singer identification application, which consists of Turkish singers and works for the Turkish language, is proposed in order to find a solution to this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Thus, problem-specific situations such as determining the differences in the voices of the singers and reducing the effects of the year and album differences on the result are discussed. As a result of the tests and systematic evaluations, it has been shown that a certain level of success has been achieved in the determination of the singer who sings the song, and that the song is in a stable structure against the changes in the singing style and song structure. The results were analyzed in a database of 9 singers and 180 songs. An accuracy value of 89.4% was obtained using the reduction of the feature vector by PCA, the normalization of the data, and the Extra Trees classifier. Precision, recall and f-score values were 89.9%, 89.4% and 89.5%, respectively.
分析歌曲是一个正在研究的问题,以帮助音乐访问平台上的各种操作。在这些问题中,首先是歌曲演唱者的识别问题。在本研究中,为了找到解决这一问题的方法,我们提出了一个由土耳其歌手组成并适用于土耳其语的歌手识别应用程序。我们从歌曲中提取了旋律谱图和基于倍频程的频谱对比值,并将这些值组合成一个混合特征向量。因此,讨论了特定问题的具体情况,如确定歌手声音的差异以及减少年份和专辑差异对结果的影响。测试和系统评估的结果表明,在确定演唱歌曲的歌手方面取得了一定的成功,而且歌曲结构稳定,不受演唱风格和歌曲结构变化的影响。我们在一个包含 9 位歌手和 180 首歌曲的数据库中对结果进行了分析。通过 PCA 缩减特征向量、对数据进行归一化处理以及使用 Extra Trees 分类器,准确率达到 89.4%。精确度、召回率和 f 值分别为 89.9%、89.4% 和 89.5%。
{"title":"Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning","authors":"Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu","doi":"10.1186/s13636-024-00336-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00336-8","url":null,"abstract":"Analyzing songs is a problem that is being investigated to aid various operations on music access platforms. At the beginning of these problems is the identification of the person who sings the song. In this study, a singer identification application, which consists of Turkish singers and works for the Turkish language, is proposed in order to find a solution to this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Thus, problem-specific situations such as determining the differences in the voices of the singers and reducing the effects of the year and album differences on the result are discussed. As a result of the tests and systematic evaluations, it has been shown that a certain level of success has been achieved in the determination of the singer who sings the song, and that the song is in a stable structure against the changes in the singing style and song structure. The results were analyzed in a database of 9 singers and 180 songs. An accuracy value of 89.4% was obtained using the reduction of the feature vector by PCA, the normalization of the data, and the Extra Trees classifier. Precision, recall and f-score values were 89.9%, 89.4% and 89.5%, respectively.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"2 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139967872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-20DOI: 10.1186/s13636-024-00333-x
Zining Liang, Wen Zhang, Thushara D. Abhayapala
Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstructing accuracy, providing a promising alternative for sound field reconstruction.
{"title":"Sound field reconstruction using neural processes with dynamic kernels","authors":"Zining Liang, Wen Zhang, Thushara D. Abhayapala","doi":"10.1186/s13636-024-00333-x","DOIUrl":"https://doi.org/10.1186/s13636-024-00333-x","url":null,"abstract":"Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstructing accuracy, providing a promising alternative for sound field reconstruction.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"32 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139928614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-16DOI: 10.1186/s13636-024-00332-y
Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo
This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.
{"title":"Automatic classification of the physical surface in sound uroflowmetry using machine learning methods","authors":"Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo","doi":"10.1186/s13636-024-00332-y","DOIUrl":"https://doi.org/10.1186/s13636-024-00332-y","url":null,"abstract":"This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-12DOI: 10.1186/s13636-024-00329-7
Huda Barakat, Oytun Turk, Cenk Demiroglu
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In the Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.
{"title":"Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources","authors":"Huda Barakat, Oytun Turk, Cenk Demiroglu","doi":"10.1186/s13636-024-00329-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00329-7","url":null,"abstract":"Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In the Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"17 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-10DOI: 10.1186/s13636-024-00328-8
Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido
Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that the investigation of attacker’s perspectives is capable of leading the way to prevent known and unknown threats to ASV systems, several countermeasures (CMs) have been proposed during ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns that were organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVSpoof 5 challenge with the objective of collecting the massive spoofing/deepfake attack data (i.e., phase 1), and the design of a spoofing-aware ASV system using a single classifier for both ASV and CM, to design integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey on a diversity of possible strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, unavailability of global countermeasures to reduce the attacker’s chance to explore the weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks, such as hardware attacks on ASV systems. Finally, we also discuss the several technological challenges from the attacker’s perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.
说话者声称的身份可通过自动说话者验证(ASV)系统(也称为语音生物识别系统)进行验证。在 INTERSPEECH 会议期间组织的 2015、2017、2019 和 2021 年 ASVspoof 挑战活动中,提出了若干应对措施 (CM)。此外,最近还倡议组织 ASVSpoof 5 挑战赛,目的是收集大量欺骗/深度伪造攻击数据(即第 1 阶段),并设计一个欺骗感知 ASV 系统,使用一个分类器同时处理 ASV 和 CM,以设计 CM-ASV 集成解决方案(第 2 阶段)。为此,本文对成功攻击 ASV 系统的各种可能策略和漏洞进行了调查,如目标选择、无法使用全局反制措施来减少攻击者发现弱点的机会、基于机器学习的最先进对抗攻击以及深度伪造生成。本文还讨论了攻击的可能性,如对 ASV 系统的硬件攻击。最后,我们还从攻击者的角度讨论了几项技术挑战,这些挑战可以用来为 ASV 系统的安全提出更好的防御机制。
{"title":"Vulnerability issues in Automatic Speaker Verification (ASV) systems","authors":"Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido","doi":"10.1186/s13636-024-00328-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00328-8","url":null,"abstract":"Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that the investigation of attacker’s perspectives is capable of leading the way to prevent known and unknown threats to ASV systems, several countermeasures (CMs) have been proposed during ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns that were organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVSpoof 5 challenge with the objective of collecting the massive spoofing/deepfake attack data (i.e., phase 1), and the design of a spoofing-aware ASV system using a single classifier for both ASV and CM, to design integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey on a diversity of possible strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, unavailability of global countermeasures to reduce the attacker’s chance to explore the weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks, such as hardware attacks on ASV systems. Finally, we also discuss the several technological challenges from the attacker’s perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"34 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-07DOI: 10.1186/s13636-024-00330-0
Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann
Audio effects are an ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of all audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim to copy a target guitar sound, has been previously investigated, where artificial neural networks first determine the effect class of a reference signal and subsequently the parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, softclipping and slapback delay effect. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference to the reference guitar effect.
{"title":"Blind extraction of guitar effects through blind system inversion and neural guitar effect modeling","authors":"Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann","doi":"10.1186/s13636-024-00330-0","DOIUrl":"https://doi.org/10.1186/s13636-024-00330-0","url":null,"abstract":"Audio effects are an ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of all audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim to copy a target guitar sound, has been previously investigated, where artificial neural networks first determine the effect class of a reference signal and subsequently the parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, softclipping and slapback delay effect. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference to the reference guitar effect.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"136 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}