Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge
Pub Date: 2024-02-29, DOI: 10.1186/s13636-024-00334-w
Javier Tejedor, Doroteo T. Toledano
The vast amount of information stored in audio repositories calls for efficient, automatic methods to search audio content. In that direction, search on speech (SoS) has received much attention in recent decades. To motivate the development of automatic systems, the ALBAYZIN evaluations have included a search on speech challenge since 2012. This challenge releases several databases covering different acoustic domains (e.g., spontaneous speech from TV shows, conference talks, and parliament sessions), with the aim of building automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show Whisper's capability to retrieve terms with specific properties. Although the results obtained for some databases are far from perfect (e.g., the broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far, making it a strong baseline for the upcoming challenge and encouraging participants to improve on it.
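As an illustration of the general transcribe-then-search approach behind such a baseline, the sketch below transcribes an audio file with the open-source openai-whisper package and looks the query terms up in the timestamped segments. The model size, language setting, file path and term list are illustrative assumptions, not the ALBAYZIN challenge configuration or data.

```python
# Minimal transcribe-then-search sketch (assumptions: openai-whisper package,
# Spanish audio, placeholder file path and term list).
import whisper

def detect_terms(audio_path, terms, model_name="medium"):
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="es")
    hits = []
    for seg in result["segments"]:          # each segment carries start/end times and text
        text = seg["text"].lower()
        for term in terms:
            if term.lower() in text:
                # Report the segment boundaries as an approximate detection window.
                hits.append({"term": term, "start": seg["start"], "end": seg["end"]})
    return hits

if __name__ == "__main__":
    print(detect_terms("tv_show_episode.wav", ["parlamento", "inteligencia artificial"]))
```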
{"title":"Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge","authors":"Javier Tejedor, Doroteo T. Toledano","doi":"10.1186/s13636-024-00334-w","DOIUrl":"https://doi.org/10.1186/s13636-024-00334-w","url":null,"abstract":"The vast amount of information stored in audio repositories makes necessary the development of efficient and automatic methods to search on audio content. In that direction, search on speech (SoS) has received much attention in the last decades. To motivate the development of automatic systems, ALBAYZIN evaluations include a search on speech challenge since 2012. This challenge releases several databases that cover different acoustic domains (i.e., spontaneous speech from TV shows, conference talks, parliament sessions, to name a few) aiming to build automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task in the search on speech challenge held in 2022 within the ALBAYZIN evaluations. This baseline system will be released with this publication and will be given to participants in the upcoming SoS ALBAYZIN evaluation in 2024. Additionally, several analyses based on some term properties (i.e., in-language and foreign terms, and single-word and multi-word terms) are carried out to show the Whisper capability at retrieving terms that convey specific properties. Although the results obtained for some databases are far from being perfect (e.g., for broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far so that it presents a strong baseline system for the upcoming challenge, encouraging participants to improve it.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140010189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning
Pub Date: 2024-02-26, DOI: 10.1186/s13636-024-00336-8
Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu
Analyzing songs is a problem being investigated to aid various operations on music access platforms. Foremost among these problems is identifying the person who sings a song. In this study, a singer identification application built on Turkish singers and working for the Turkish language is proposed to address this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Problem-specific issues, such as capturing the differences between singers' voices and reducing the effect of year and album differences on the result, are thereby addressed. Tests and systematic evaluations show that the singer of a song can be identified with a certain level of success and that the approach remains stable under changes in singing style and song structure. The results were analyzed on a database of 9 singers and 180 songs. An accuracy of 89.4% was obtained using PCA-based reduction of the feature vector, normalization of the data, and the Extra Trees classifier. Precision, recall and F-score values were 89.9%, 89.4% and 89.5%, respectively.
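A minimal sketch of the described pipeline (mel-spectrogram and octave-based spectral contrast combined into a hybrid vector, followed by normalization, PCA and an Extra Trees classifier) is given below using librosa and scikit-learn; the number of mel bands, the time-summarization of the features and the PCA dimensionality are assumptions, not the paper's exact settings.

```python
# Sketch of a hybrid mel-spectrogram + spectral-contrast feature pipeline with
# normalization, PCA and an Extra Trees classifier. Feature sizes and PCA
# dimensionality are illustrative assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

def hybrid_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # octave-based bands
    # Summarize each feature over time and concatenate into one hybrid vector.
    return np.concatenate([mel.mean(axis=1), mel.std(axis=1),
                           contrast.mean(axis=1), contrast.std(axis=1)])

def train_singer_model(song_paths, singer_labels):
    X = np.stack([hybrid_features(p) for p in song_paths])
    clf = make_pipeline(StandardScaler(),
                        PCA(n_components=0.95),          # keep 95% of the variance
                        ExtraTreesClassifier(n_estimators=300, random_state=0))
    clf.fit(X, singer_labels)
    return clf
```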
{"title":"Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning","authors":"Serhat Hizlisoy, Recep Sinan Arslan, Emel Çolakoğlu","doi":"10.1186/s13636-024-00336-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00336-8","url":null,"abstract":"Analyzing songs is a problem that is being investigated to aid various operations on music access platforms. At the beginning of these problems is the identification of the person who sings the song. In this study, a singer identification application, which consists of Turkish singers and works for the Turkish language, is proposed in order to find a solution to this problem. Mel-spectrogram and octave-based spectral contrast values are extracted from the songs, and these values are combined into a hybrid feature vector. Thus, problem-specific situations such as determining the differences in the voices of the singers and reducing the effects of the year and album differences on the result are discussed. As a result of the tests and systematic evaluations, it has been shown that a certain level of success has been achieved in the determination of the singer who sings the song, and that the song is in a stable structure against the changes in the singing style and song structure. The results were analyzed in a database of 9 singers and 180 songs. An accuracy value of 89.4% was obtained using the reduction of the feature vector by PCA, the normalization of the data, and the Extra Trees classifier. Precision, recall and f-score values were 89.9%, 89.4% and 89.5%, respectively.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"2 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139967872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sound field reconstruction using neural processes with dynamic kernels
Pub Date: 2024-02-20, DOI: 10.1186/s13636-024-00333-x
Zining Liang, Wen Zhang, Thushara D. Abhayapala
Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstruction accuracy, providing a promising alternative for sound field reconstruction.
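For context, the fixed-kernel GP interpolation that the proposed NP-based method replaces can be sketched with scikit-learn as follows; the RBF kernel, its length scale and the synthetic observation points are illustrative assumptions.

```python
# Fixed-kernel baseline: Gaussian-process interpolation of a sound field
# magnitude from sparse observations. Microphone positions and pressures
# below are synthetic placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
mic_xy = rng.uniform(-1.0, 1.0, size=(20, 2))                    # sparse observation points (m)
pressure = np.sin(3 * mic_xy[:, 0]) * np.cos(3 * mic_xy[:, 1])   # toy field magnitude

kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-3)   # pre-defined kernel
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(mic_xy, pressure)

# Predict the field magnitude on a dense grid, with an uncertainty estimate.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50),
                            np.linspace(-1, 1, 50)), axis=-1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)
```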
{"title":"Sound field reconstruction using neural processes with dynamic kernels","authors":"Zining Liang, Wen Zhang, Thushara D. Abhayapala","doi":"10.1186/s13636-024-00333-x","DOIUrl":"https://doi.org/10.1186/s13636-024-00333-x","url":null,"abstract":"Accurately representing the sound field with high spatial resolution is crucial for immersive and interactive sound field reproduction technology. In recent studies, there has been a notable emphasis on efficiently estimating sound fields from a limited number of discrete observations. In particular, kernel-based methods using Gaussian processes (GPs) with a covariance function to model spatial correlations have been proposed. However, the current methods rely on pre-defined kernels for modeling, requiring the manual identification of optimal kernels and their parameters for different sound fields. In this work, we propose a novel approach that parameterizes GPs using a deep neural network based on neural processes (NPs) to reconstruct the magnitude of the sound field. This method has the advantage of dynamically learning kernels from data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that our proposed approach outperforms current methods in reconstructing accuracy, providing a promising alternative for sound field reconstruction.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"32 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139928614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic classification of the physical surface in sound uroflowmetry using machine learning methods
Pub Date: 2024-02-16, DOI: 10.1186/s13636-024-00332-y
Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo
This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.
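A hedged sketch of the classifier comparison on labelled 1-s clips is shown below; the MFCC summary features, sampling rate and cross-validation setup are assumptions standing in for the paper's actual feature extraction and evaluation protocol.

```python
# Sketch of the SVM / random forest / k-NN comparison on 1-s clips labelled
# water, ceramic or silence. Feature choice and data loading are assumptions.
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def clip_features(y, sr):
    # MFCC summary statistics as a stand-in for the paper's features.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def evaluate(clips, labels, sr=16000):
    X = np.stack([clip_features(y, sr) for y in clips])
    y = np.asarray(labels)
    for name, clf in [("SVM", SVC()),
                      ("Random forest", RandomForestClassifier(n_estimators=200)),
                      ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
        scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} accuracy")
```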
{"title":"Automatic classification of the physical surface in sound uroflowmetry using machine learning methods","authors":"Marcos Lazaro Alvarez, Laura Arjona, Miguel E. Iglesias Martínez, Alfonso Bahillo","doi":"10.1186/s13636-024-00332-y","DOIUrl":"https://doi.org/10.1186/s13636-024-00332-y","url":null,"abstract":"This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"35 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
Pub Date: 2024-02-12, DOI: 10.1186/s13636-024-00329-7
Huda Barakat, Oytun Turk, Cenk Demiroglu
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section 8, we pinpoint research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this active research area and to offer guidance to interested researchers and future endeavors in this field.
{"title":"Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources","authors":"Huda Barakat, Oytun Turk, Cenk Demiroglu","doi":"10.1186/s13636-024-00329-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00329-7","url":null,"abstract":"Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In the Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"17 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vulnerability issues in Automatic Speaker Verification (ASV) systems
Pub Date: 2024-02-10, DOI: 10.1186/s13636-024-00328-8
Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido
Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that investigating the attacker's perspective can lead the way in preventing known and unknown threats, several countermeasures (CMs) have been proposed during the ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns organized at INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVspoof 5 challenge with the objectives of collecting massive spoofing/deepfake attack data (phase 1) and designing a spoofing-aware ASV system that uses a single classifier for both ASV and CM, yielding integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey of the diverse strategies and vulnerabilities that can be exploited to successfully attack an ASV system, such as target selection, the unavailability of global countermeasures that would reduce the attacker's chance of exploiting weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers other possible attacks, such as hardware attacks on ASV systems. Finally, we discuss several technological challenges from the attacker's perspective, which can be exploited to develop better defence mechanisms for the security of ASV systems.
{"title":"Vulnerability issues in Automatic Speaker Verification (ASV) systems","authors":"Priyanka Gupta, Hemant A. Patil, Rodrigo Capobianco Guido","doi":"10.1186/s13636-024-00328-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00328-8","url":null,"abstract":"Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that the investigation of attacker’s perspectives is capable of leading the way to prevent known and unknown threats to ASV systems, several countermeasures (CMs) have been proposed during ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns that were organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVSpoof 5 challenge with the objective of collecting the massive spoofing/deepfake attack data (i.e., phase 1), and the design of a spoofing-aware ASV system using a single classifier for both ASV and CM, to design integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey on a diversity of possible strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, unavailability of global countermeasures to reduce the attacker’s chance to explore the weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks, such as hardware attacks on ASV systems. Finally, we also discuss the several technological challenges from the attacker’s perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"34 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind extraction of guitar effects through blind system inversion and neural guitar effect modeling
Pub Date: 2024-02-07, DOI: 10.1186/s13636-024-00330-0
Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann
Audio effects are a ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim of copying a target guitar sound, has been investigated previously: artificial neural networks first determine the effect class of a reference signal and subsequently its parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, soft-clipping and slapback delay effects. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference from the reference guitar effect.
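As a rough illustration of the neural-effect-modeling half of the approach, the sketch below fits a small dilated 1-D convolutional network to map a dry guitar signal to its effected counterpart; the architecture, loss and training step are assumptions, not the authors' model.

```python
# Illustrative neural guitar-effect model: a tiny dilated Conv1d network
# trained on (dry, wet) signal pairs. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEffectModel(nn.Module):
    def __init__(self, channels=16, kernel_size=9, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for i in range(layers):
            # Dilated convolution with padding chosen to keep the signal length.
            blocks += [nn.Conv1d(in_ch, channels, kernel_size,
                                 padding=(kernel_size // 2) * 2 ** i,
                                 dilation=2 ** i),
                       nn.Tanh()]
            in_ch = channels
        blocks.append(nn.Conv1d(channels, 1, 1))
        self.net = nn.Sequential(*blocks)

    def forward(self, dry):           # dry: (batch, 1, samples)
        return self.net(dry)

# One gradient step on a placeholder (dry, wet) audio pair.
model = TinyEffectModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
dry, wet = torch.randn(1, 1, 16384), torch.randn(1, 1, 16384)
loss = F.mse_loss(model(dry), wet)
loss.backward()
opt.step()
```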
{"title":"Blind extraction of guitar effects through blind system inversion and neural guitar effect modeling","authors":"Reemt Hinrichs, Kevin Gerkens, Alexander Lange, Jörn Ostermann","doi":"10.1186/s13636-024-00330-0","DOIUrl":"https://doi.org/10.1186/s13636-024-00330-0","url":null,"abstract":"Audio effects are an ubiquitous tool in music production due to the interesting ways in which they can shape the sound of music. Guitar effects, the subset of all audio effects focusing on guitar signals, are commonly used in popular music to shape the guitar sound to fit specific genres or to create more variety within musical compositions. Automatic extraction of guitar effects and their parameter settings, with the aim to copy a target guitar sound, has been previously investigated, where artificial neural networks first determine the effect class of a reference signal and subsequently the parameter settings. These approaches require a corresponding guitar effect implementation to be available. In general, for very close sound matching, additional research regarding effect implementations is necessary. In this work, we present a different approach to circumvent these issues. We propose blind extraction of guitar effects through a combination of blind system inversion and neural guitar effect modeling. That way, an immediately usable, blind copy of the target guitar effect is obtained. The proposed method is tested with the phaser, softclipping and slapback delay effect. Listening tests with eight subjects indicate excellent quality of the blind copies, i.e., little to no difference to the reference guitar effect.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"136 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement
Pub Date: 2024-02-03, DOI: 10.1186/s13636-024-00331-z
Sivaramakrishna Yecchuri, Sunny Dayal Vanambathina
Recent advancements in deep learning-based speech enhancement models have extensively used attention mechanisms, demonstrating their effectiveness in achieving state-of-the-art performance. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, it is composed of several adaptive time-frequency attention modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate hierarchical contextual information. Additionally, the sub-convolutional encoder-decoder model uses different kernel sizes to extract multi-scale local and contextual features from the noisy speech. The experimental results show that the proposed model outperforms several state-of-the-art methods.
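To illustrate the multi-kernel-size idea behind the sub-convolutional encoder, the sketch below builds a block with parallel convolutions of different kernel sizes whose outputs are concatenated; the channel counts, kernel sizes and activation are assumptions rather than the TANSCUNet configuration.

```python
# Multi-scale "sub-convolutional" block: parallel 2-D convolutions with
# different kernel sizes, concatenated along the channel axis. Sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class SubConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert out_ch % len(kernel_sizes) == 0
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.act = nn.PReLU()

    def forward(self, x):             # x: (batch, channels, time, freq)
        # Each branch sees a different receptive field; concatenate the results.
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

# Example: a spectrogram-like input with 1 channel, 100 frames, 161 bins.
block = SubConvBlock(1, 48)
y = block(torch.randn(2, 1, 100, 161))    # -> (2, 48, 100, 161)
```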
{"title":"Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement","authors":"Sivaramakrishna Yecchuri, Sunny Dayal Vanambathina","doi":"10.1186/s13636-024-00331-z","DOIUrl":"https://doi.org/10.1186/s13636-024-00331-z","url":null,"abstract":"Recent advancements in deep learning-based speech enhancement models have extensively used attention mechanisms to achieve state-of-the-art methods by demonstrating their effectiveness. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, it is composed of several adaptive time―frequency attention modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate hierarchical contextual information. Additionally, a sub-convolutional encoder-decoder model used different kernel sizes to extract multi-scale local and contextual features from the noisy speech. The experimental results show that the proposed model outperforms several state-of-the-art methods.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"21 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139662818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustical feature analysis and optimization for aesthetic recognition of Chinese traditional music
Pub Date: 2024-02-02, DOI: 10.1186/s13636-023-00326-2
Lingyun Xie, Yuehong Wang, Yan Gao
Chinese traditional music, a vital expression of Chinese cultural heritage, possesses both a profound emotional resonance and artistic allure. This study sets out to refine and analyze the acoustical features essential for the aesthetic recognition of Chinese traditional music, utilizing a dataset spanning five aesthetic genres. Through recursive feature elimination, we distilled an initial set of 447 low-level physical features to a more manageable 44, establishing their feature-importance coefficients. This reduction allowed us to quantify the influence of higher-level musical components on aesthetic recognition, after establishing a correlation between these components and their physical counterparts. We conducted a comprehensive examination of the impact of various musical elements on aesthetic genres. Our findings indicate that the selected 44-dimensional feature set could enhance aesthetic recognition. Among the high-level musical factors, timbre emerges as the most influential, followed by rhythm, pitch, and tonality. Timbre proved pivotal in distinguishing between the JiYang and BeiShang genres, while rhythm and tonality were key in differentiating LingDong from JiYang, as well as LingDong from BeiShang.
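The recursive-feature-elimination step described above can be sketched with scikit-learn as follows; the random forest estimator, elimination step size and placeholder data are assumptions, with only the 447-to-44 reduction taken from the text.

```python
# Sketch of recursive feature elimination: reduce a 447-dimensional low-level
# feature matrix to 44 features and read out their importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X = np.random.rand(500, 447)           # placeholder: 500 excerpts x 447 features
y = np.random.randint(0, 5, size=500)  # placeholder: five aesthetic genres

selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=44, step=10)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)                 # indices of the 44 retained features
importances = selector.estimator_.feature_importances_   # importance coefficients of those features
```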
{"title":"Acoustical feature analysis and optimization for aesthetic recognition of Chinese traditional music","authors":"Lingyun Xie, Yuehong Wang, Yan Gao","doi":"10.1186/s13636-023-00326-2","DOIUrl":"https://doi.org/10.1186/s13636-023-00326-2","url":null,"abstract":"Chinese traditional music, a vital expression of Chinese cultural heritage, possesses both a profound emotional resonance and artistic allure. This study sets forth to refine and analyze the acoustical features essential for the aesthetic recognition of Chinese traditional music, utilizing a dataset spanning five aesthetic genres. Through recursive feature elimination, we distilled an initial set of 447 low-level physical features to a more manageable 44, establishing their feature-importance coefficients. This reduction allowed us to estimate the quantified influence of higher-level musical components on aesthetic recognition, following the establishment of a correlation between these components and their physical counterparts. We conducted a comprehensive examination of the impact of various musical elements on aesthetic genres. Our findings indicate that the selected 44-dimensional feature set could enhance aesthetic recognition. Among the high-level musical factors, timbre emerges as the most influential, followed by rhythm, pitch, and tonality. Timbre proved pivotal in distinguishing between the JiYang and BeiShang genres, while rhythm and tonality were key in differentiating LingDong from JiYang, as well as LingDong from BeiShang.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"76 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139662758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gated recurrent unit predictor model-based adaptive differential pulse code modulation speech decoder
Pub Date: 2024-01-20, DOI: 10.1186/s13636-023-00325-3
Gebremichael Kibret Sheferaw, Waweru Mwangi, Michael Kimwele, Adane Mamuye
Speech coding is a method for reducing the amount of data needed to represent speech signals by exploiting their statistical properties. Recently, neural network prediction models have gained attention in speech coding for reconstructing nonlinear and nonstationary speech signals. This study proposes a novel approach to improve speech coding performance by using a gated recurrent unit (GRU)-based adaptive differential pulse code modulation (ADPCM) system. The GRU predictor model is trained on a dataset of actual speech samples from the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus and the corresponding ADPCM fixed-predictor output speech samples. Our contribution lies in an algorithm for training the GRU predictive model that improves its performance in speech coding prediction and in a new offline-trained predictive model for the speech decoder. The results indicate that the proposed system significantly improves the accuracy of speech prediction, demonstrating its potential for speech prediction applications. Overall, this work presents a unique application of the GRU predictive model with ADPCM decoding in speech signal compression, providing a promising approach for future research in this field.
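A minimal sketch of a GRU next-sample predictor of the kind used in place of the fixed ADPCM predictor is given below in PyTorch; the hidden size, context length and mean-squared-error objective are assumptions, not the paper's exact training algorithm.

```python
# GRU-based next-sample predictor sketch: given previously reconstructed
# samples, predict the next sample (the residual would then be coded by ADPCM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, past, state=None):
        # past: (batch, time, 1) previously reconstructed samples.
        out, state = self.gru(past, state)
        return self.proj(out[:, -1]), state   # predicted next sample, recurrent state

model = GRUPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
past = torch.randn(8, 128, 1)                 # placeholder reconstructed context
target = torch.randn(8, 1)                    # placeholder true next samples
pred, _ = model(past)
loss = F.mse_loss(pred, target)
loss.backward()
opt.step()
```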
{"title":"Gated recurrent unit predictor model-based adaptive differential pulse code modulation speech decoder","authors":"Gebremichael Kibret Sheferaw, Waweru Mwangi, Michael Kimwele, Adane Mamuye","doi":"10.1186/s13636-023-00325-3","DOIUrl":"https://doi.org/10.1186/s13636-023-00325-3","url":null,"abstract":"Speech coding is a method to reduce the amount of data needs to represent speech signals by exploiting the statistical properties of the speech signal. Recently, in the speech coding process, a neural network prediction model has gained attention as the reconstruction process of a nonlinear and nonstationary speech signal. This study proposes a novel approach to improve speech coding performance by using a gated recurrent unit (GRU)-based adaptive differential pulse code modulation (ADPCM) system. This GRU predictor model is trained using a data set of speech samples from the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus actual sample and the ADPCM fixed-predictor output speech sample. Our contribution lies in the development of an algorithm for training the GRU predictive model that can improve its performance in speech coding prediction and a new offline trained predictive model for speech decoder. The results indicate that the proposed system significantly improves the accuracy of speech prediction, demonstrating its potential for speech prediction applications. Overall, this work presents a unique application of the GRU predictive model with ADPCM decoding in speech signal compression, providing a promising approach for future research in this field.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"85 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}