June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung
Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored models pretrained on speech, which, as a human-originated sound, intuitively bears a closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and that data augmentation is essential to bridge this gap. However, SpecAugment, the most widely used augmentation technique for audio and speech, requires a 2-dimensional spectrogram input and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also suitable for respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, with a substantial improvement in the accuracy of minority disease classes of up to 7.14%.
{"title":"RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification","authors":"June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung","doi":"arxiv-2405.02996","DOIUrl":"https://doi.org/arxiv-2405.02996","url":null,"abstract":"Recent advancements in AI have democratized its deployment as a healthcare\u0000assistant. While pretrained models from large-scale visual and audio datasets\u0000have demonstrably generalized to this task, surprisingly, no studies have\u0000explored pretrained speech models, which, as human-originated sounds,\u0000intuitively would share closer resemblance to lung sounds. This paper explores\u0000the efficacy of pretrained speech models for respiratory sound classification.\u0000We find that there is a characterization gap between speech and lung sound\u0000samples, and to bridge this gap, data augmentation is essential. However, the\u0000most widely used augmentation technique for audio and speech, SpecAugment,\u0000requires 2-dimensional spectrogram format and cannot be applied to models\u0000pretrained on speech waveforms. To address this, we propose RepAugment, an\u0000input-agnostic representation-level augmentation technique that outperforms\u0000SpecAugment, but is also suitable for respiratory sound classification with\u0000waveform pretrained models. Experimental results show that our approach\u0000outperforms the SpecAugment, demonstrating a substantial improvement in the\u0000accuracy of minority disease classes, reaching up to 7.14%.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor
In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), owing to its satisfactory localization performance in moderately reverberant and noisy scenarios. Many works have analyzed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm that allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.
{"title":"Steered Response Power for Sound Source Localization: A Tutorial Review","authors":"Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor","doi":"arxiv-2405.02991","DOIUrl":"https://doi.org/arxiv-2405.02991","url":null,"abstract":"In the last three decades, the Steered Response Power (SRP) method has been\u0000widely used for the task of Sound Source Localization (SSL), due to its\u0000satisfactory localization performance on moderately reverberant and noisy\u0000scenarios. Many works have analyzed and extended the original SRP method to\u0000reduce its computational cost, to allow it to locate multiple sources, or to\u0000improve its performance in adverse environments. In this work, we review over\u0000200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT\u0000method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized\u0000version of the SRP algorithm which allows the reviewed extensions to be\u0000implemented. We provide a Python implementation of the algorithm which includes\u0000selected extensions from the literature.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara
This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we adopt a volunteer-based crowdsourcing approach and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations, and we developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic-speaking countries, and we have annotated 1166 recitations from the dataset in six categories. We achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between annotators, and an agreement of 0.89 between the labels assigned by the algorithm and the expert judgments.
{"title":"Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers","authors":"Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara","doi":"arxiv-2405.02675","DOIUrl":"https://doi.org/arxiv-2405.02675","url":null,"abstract":"This paper addresses the challenge of learning to recite the Quran for\u0000non-Arabic speakers. We explore the possibility of crowdsourcing a carefully\u0000annotated Quranic dataset, on top of which AI models can be built to simplify\u0000the learning process. In particular, we use the volunteer-based crowdsourcing\u0000genre and implement a crowdsourcing API to gather audio assets. We integrated\u0000the API into an existing mobile application called NamazApp to collect audio\u0000recitations. We developed a crowdsourcing platform called Quran Voice for\u0000annotating the gathered audio assets. As a result, we have collected around\u00007000 Quranic recitations from a pool of 1287 participants across more than 11\u0000non-Arabic countries, and we have annotated 1166 recitations from the dataset\u0000in six categories. We have achieved a crowd accuracy of 0.77, an inter-rater\u0000agreement of 0.63 between the annotators, and 0.89 between the labels assigned\u0000by the algorithm and the expert judgments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan
This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, presenting advancements in efficiency and explainability. By benchmarking experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace Mel-Frequency Cepstral Coefficients (MFCCs) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer using the PhysioNet Heart Sound Database, illustrating its ability to handle and capture intricate long waveform patterns. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
Toward end-to-end interpretable convolutional neural networks for waveform signals. arXiv:2405.01815, 2024-05-03.
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is important to design techniques that also work well on data they were not trained on. In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary for training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake-detection or speaker-verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widely used in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
{"title":"Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models","authors":"Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva","doi":"arxiv-2405.02179","DOIUrl":"https://doi.org/arxiv-2405.02179","url":null,"abstract":"Generalization is a main issue for current audio deepfake detectors, which\u0000struggle to provide reliable results on out-of-distribution data. Given the\u0000speed at which more and more accurate synthesis methods are developed, it is\u0000very important to design techniques that work well also on data they were not\u0000trained for. In this paper we study the potential of large-scale pre-trained\u0000models for audio deepfake detection, with special focus on generalization\u0000ability. To this end, the detection problem is reformulated in a speaker\u0000verification framework and fake audios are exposed by the mismatch between the\u0000voice sample under test and the voice of the claimed identity. With this\u0000paradigm, no fake speech sample is necessary in training, cutting off any link\u0000with the generation method at the root, and ensuring full generalization\u0000ability. Features are extracted by general-purpose large pre-trained models,\u0000with no need for training or fine-tuning on specific fake detection or speaker\u0000verification datasets. At detection time only a limited set of voice fragments\u0000of the identity under test is required. Experiments on several datasets\u0000widespread in the community show that detectors based on pre-trained models\u0000achieve excellent performance and show strong generalization ability, rivaling\u0000supervised methods on in-distribution data and largely overcoming them on\u0000out-of-distribution data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still room to improve the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage the obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning to further optimize GMP-ATL. Experiments on IEMOCAP show that GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods while also yielding results comparable to multimodal SER approaches.
{"title":"GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT","authors":"Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao","doi":"arxiv-2405.02151","DOIUrl":"https://doi.org/arxiv-2405.02151","url":null,"abstract":"The continuous evolution of pre-trained speech models has greatly advanced\u0000Speech Emotion Recognition (SER). However, there is still potential for\u0000enhancement in the performance of these methods. In this paper, we present\u0000GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning),\u0000a novel HuBERT-based adaptive transfer learning framework for SER.\u0000Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing\u0000multi-task learning and multi-scale k-means clustering to acquire frame-level\u0000gender-augmented multi-scale pseudo-labels. Then, to fully leverage both\u0000obtained frame-level and utterance-level emotion labels, we incorporate model\u0000retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on\u0000IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a\u0000WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER\u0000methods, while also yielding comparable results with multimodal SER approaches.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Denise Moussa, Germans Hirsch, Christian Riess
Audio recordings may provide important evidence in criminal investigations. One such case is the forensic association of a recorded audio signal with the recording location. For example, a voice message may be the only investigative cue to narrow down the candidate sites for a crime. To date, several works have provided tools for closed-set recording environment classification under relatively clean recording conditions. However, in forensic investigations the candidate locations are case-specific, so closed-set tools are not applicable without retraining on a sufficient number of training samples for each case and its respective candidate set. In addition, a forensic tool has to deal with audio material from uncontrolled sources with variable properties and quality. In this work, we therefore attempt a major step towards practical forensic application scenarios. We propose a representation learning framework called EnvId, short for environment identification. EnvId avoids case-specific retraining; instead, it is the first tool for robust few-shot classification of unseen environment locations. We demonstrate that EnvId can handle forensically challenging material, providing good-quality predictions even under unseen signal degradations, environment characteristics, or recording-position mismatches. Our code and datasets will be made publicly available upon acceptance.
Can We Identify Unknown Audio Recording Environments in Forensic Scenarios? arXiv:2405.02119, 2024-05-03.
Lea Schaab, Anna Kruspe
Sentiment or mood can be expressed on various levels in music. In automatic analysis, the actual audio data is usually analyzed, but the lyrics can also play a crucial role in the perception of mood. We first evaluate various models for sentiment analysis based on lyrics and audio separately. The corresponding approaches already show satisfactory results, but they also exhibit weaknesses, whose causes we examine in more detail. Furthermore, different approaches to combining the audio and lyrics results are proposed and evaluated; considering both modalities generally leads to improved performance. We investigate misclassifications and (sometimes intentional) contradictions between audio and lyrics sentiment more closely and identify possible causes. Finally, we address fundamental problems in this research area, such as high subjectivity, lack of data, and inconsistency in emotion taxonomies.
Joint sentiment analysis of lyrics and audio in music. arXiv:2405.01988, 2024-05-03.
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research conducts an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, we aim to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve state-of-the-art (SOTA) performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis provides an empirical foundation for future research on LLM-based ASR systems and offers insights into optimizing performance with Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
{"title":"Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets","authors":"Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie","doi":"arxiv-2405.02132","DOIUrl":"https://doi.org/arxiv-2405.02132","url":null,"abstract":"Large Language Models (LLMs) have demonstrated unparalleled effectiveness in\u0000various NLP tasks, and integrating LLMs with automatic speech recognition (ASR)\u0000is becoming a mainstream paradigm. Building upon this momentum, our research\u0000delves into an in-depth examination of this paradigm on a large open-source\u0000Chinese dataset. Specifically, our research aims to evaluate the impact of\u0000various configurations of speech encoders, LLMs, and projector modules in the\u0000context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we\u0000introduce a three-stage training approach, expressly developed to enhance the\u0000model's ability to align auditory and textual information. The implementation\u0000of this approach, alongside the strategic integration of ASR components,\u0000enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and\u0000Test_Meeting test sets. Our analysis presents an empirical foundation for\u0000future research in LLM-based ASR systems and offers insights into optimizing\u0000performance using Chinese datasets. We will publicly release all scripts used\u0000for data preparation, training, inference, and scoring, as well as pre-trained\u0000models and training logs to promote reproducible research.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature low computational complexity and a processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While all algorithms perform similarly in diffuse noise, the binaural deep learning approach performs best in the presence of spatial interferers. A post-hoc analysis attributes this to improvements at low SNRs and to precise spatial filtering.
{"title":"Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios","authors":"Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer","doi":"arxiv-2405.01967","DOIUrl":"https://doi.org/arxiv-2405.01967","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their\u0000intelligibility for users of hearing aids. Deep models suited for real-world\u0000application should feature a low computational complexity and low processing\u0000delay of only a few milliseconds. In this paper, we explore deep speech\u0000enhancement that matches these requirements and contrast monaural and binaural\u0000processing algorithms in two complex acoustic scenes. Both algorithms are\u0000evaluated with objective metrics and in experiments with hearing-impaired\u0000listeners performing a speech-in-noise test. Results are compared to two\u0000traditional enhancement strategies, i.e., adaptive differential microphone\u0000processing and binaural beamforming. While in diffuse noise, all algorithms\u0000perform similarly, the binaural deep learning approach performs best in the\u0000presence of spatial interferers. Through a post-analysis, this can be\u0000attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}