
Latest publications in arXiv - EE - Audio and Speech Processing

Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data
Pub Date : 2024-09-17 DOI: arxiv-2409.10969
Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen
While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines at a comparable data scale. Moreover, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
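The data construction approach is described only at a high level: words from different languages are split and concatenated so that code-switched training material can be built without real CS recordings. As a rough illustration (not the paper's actual procedure), the sketch below interleaves word spans from two monolingual word lists; the span lengths and switch probability are invented for the example.

```python
import random

def build_code_switched_text(lang_a_words, lang_b_words, switch_prob=0.3, seed=0):
    """Interleave short word spans from two monolingual word lists into one
    code-switched utterance. Span length and switch_prob are illustrative
    assumptions, not values taken from the paper."""
    rng = random.Random(seed)
    out, use_a = [], True
    i = j = 0
    while i < len(lang_a_words) and j < len(lang_b_words):
        span = rng.randint(1, 3)                 # take a short span from the current language
        if use_a:
            out.extend(lang_a_words[i:i + span])
            i += span
        else:
            out.extend(lang_b_words[j:j + span])
            j += span
        if rng.random() < switch_prob:           # occasionally switch language
            use_a = not use_a
    return " ".join(out)

print(build_code_switched_text(
    "the weather is very nice today".split(),
    list("今天天气真的很好")))                    # character-level "words" for Chinese
```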
Citations: 0
3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy
Pub Date : 2024-09-17 DOI: arxiv-2409.10848
Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi
Audio-driven 3D facial animation has made impressive progress in both research and application development. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on a 3D facial template with a diffusion policy, instead of generating the face for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which keeps the continuous and natural flow of human emotions. Experiments show that our approach is effective in synthesizing variable and dynamic facial motion.
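The abstract frames animation as predicting a 3D vertex trajectory with a diffusion policy conditioned on audio and vertex-state observations, rather than generating a face frame by frame. The network, noise schedule, and conditioning used in 3DFacePolicy are not given here; the sketch below shows only a generic DDPM-style sampling loop over a vertex-offset trajectory, with a random stand-in for the learned noise predictor.

```python
import numpy as np

T_STEPS, FRAMES, N_VERTS = 50, 30, 468          # illustrative sizes, not the paper's
betas = np.linspace(1e-4, 0.02, T_STEPS)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(traj, t, audio_feat, vertex_state):
    """Stand-in for the learned conditional noise predictor."""
    return 0.01 * np.random.default_rng(t).standard_normal(traj.shape)

def sample_trajectory(audio_feat, vertex_state, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((FRAMES, N_VERTS, 3))   # start from Gaussian noise
    for t in reversed(range(T_STEPS)):              # reverse diffusion over the whole trajectory
        eps = eps_model(x, t, audio_feat, vertex_state)
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x                                        # per-frame vertex offsets

print(sample_trajectory(audio_feat=None, vertex_state=None).shape)  # (30, 468, 3)
```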
Citations: 0
Spontaneous Informal Speech Dataset for Punctuation Restoration
Pub Date : 2024-09-17 DOI: arxiv-2409.11241
Xing Yi Liu, Homayoon Beigi
Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.
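The filtering pipeline is said to examine both speech audio and transcription text; the concrete criteria are not listed in the abstract. A minimal sketch of that kind of two-sided check is shown below, with the duration, clipping, and punctuation thresholds chosen purely for illustration.

```python
import re
import numpy as np

def keep_example(audio, sr, transcript,
                 min_dur=1.0, max_dur=30.0, max_clip_ratio=0.01):
    """Return True if an (audio, transcript) pair passes simple quality checks.
    All thresholds are illustrative assumptions, not SponSpeech's actual rules."""
    dur = len(audio) / sr
    if not (min_dur <= dur <= max_dur):                  # reject too-short/too-long clips
        return False
    if np.mean(np.abs(audio) > 0.99) > max_clip_ratio:   # reject heavily clipped audio
        return False
    if not transcript.strip():                           # reject empty transcripts
        return False
    if not re.search(r"[.,!?;:]", transcript):           # punctuation is needed for restoration
        return False
    return True

sr = 16000
audio = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr * 3) / sr)
print(keep_example(audio, sr, "well, it kind of just... happened, you know?"))
```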
Citations: 0
Room impulse response prototyping using receiver distance estimations for high quality room equalisation algorithms
Pub Date : 2024-09-16 DOI: arxiv-2409.10131
James Brooks-Park, Martin Bo Møller, Jan Østergaard, Søren Bech, Steven van de Par
Room equalisation aims to increase the quality of loudspeaker reproduction in reverberant environments, compensating for colouration caused by imperfect room reflections and frequency-dependent loudspeaker directivity. A common technique in the field of room equalisation is to invert a prototype Room Impulse Response (RIR). Rather than inverting a single RIR at the listening position, a prototype response is composed of several responses distributed around the listening area. This paper proposes a method of impulse response prototyping, using estimated receiver positions, to form a weighted-average prototype response. A method of receiver distance estimation is described, supporting the implementation of the prototype RIR. The proposed prototyping method is compared to other methods by measuring their post-equalisation spectral deviation at several positions in a simulated room.
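The prototype response is a weighted average of several RIRs measured around the listening area, with the weights informed by estimated receiver positions. The abstract does not state the weighting rule; the sketch below uses simple inverse-distance weights as an assumption to show the shape of the computation.

```python
import numpy as np

def prototype_rir(rirs, distances, eps=1e-6):
    """Weighted-average prototype RIR.
    rirs: (M, L) array of M measured impulse responses.
    distances: (M,) estimated receiver distances from the listening position.
    Inverse-distance weighting is an illustrative assumption."""
    rirs = np.asarray(rirs, dtype=float)
    w = 1.0 / (np.asarray(distances, dtype=float) + eps)
    w /= w.sum()                      # normalise weights to sum to one
    return w @ rirs                   # (L,) prototype impulse response

rng = np.random.default_rng(0)
rirs = rng.standard_normal((4, 2048)) * np.exp(-np.arange(2048) / 300)  # toy decaying RIRs
proto = prototype_rir(rirs, distances=[0.3, 0.6, 0.9, 1.2])
print(proto.shape)
```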
Citations: 0
Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement
Pub Date : 2024-09-16 DOI: arxiv-2409.10376
Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.
Citations: 0
RF-GML: Reference-Free Generative Machine Listener
Pub Date : 2024-09-16 DOI: arxiv-2409.10210
Arijit Biswas, Guanxin Jiang
This paper introduces a novel reference-free (RF) audio quality metric called the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono, stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer learning from a state-of-the-art full-reference (FR) Generative Machine Listener (GML) with minimal architectural modifications. The term "generative" refers to the model's ability to generate an arbitrary number of simulated listening scores. Unlike existing RF models, RF-GML accurately predicts subjective quality scores across diverse content types and codecs. Extensive evaluations demonstrate its superiority in rating unencoded audio and distinguishing different levels of coding artifacts. RF-GML's performance and versatility make it a valuable tool for coded audio quality assessment and monitoring in various applications, all without the need for a reference signal.
Citations: 0
Investigating Training Objectives for Generative Speech Enhancement
Pub Date : 2024-09-16 DOI: arxiv-2409.10753
Julius Richter, Danilo de Oliveira, Timo Gerkmann
Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims at explaining the differences between these frameworks by focusing our investigation on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored for the Schrödinger bridge framework, demonstrating enhanced performance and improved perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this area.
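For context, score-based generative models of the kind compared here are usually trained with a denoising score-matching objective of the following general form; the paper's exact formulation, and its proposed perceptual loss for the Schrödinger bridge, may differ from this generic statement.

```latex
\mathcal{L}_{\mathrm{DSM}}(\theta) =
\mathbb{E}_{t,\, x_0,\, x_t \mid x_0}\!
\left[ \lambda(t)\,
\bigl\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \bigr\|_2^2 \right]
```

Here $x_0$ is the clean speech, $x_t$ its diffused version at time $t$, $s_\theta$ the learned score network, and $\lambda(t)$ a time-dependent weighting.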
Citations: 0
oboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models
Pub Date : 2024-09-16 DOI: arxiv-2409.10240
Muhammad Sudipto Siam Dip, Md Anik Hasan, Sapnil Sarker Bipro, Md Abdur Raiyan, Mohammod Abdul Motin
In this study, we address the challenge of speaker recognition using a novel data augmentation technique of adding noise to enrollment files. This technique efficiently aligns the sources of test and enrollment files, enhancing comparability. Various pre-trained models were employed, with the resnet model achieving the highest DCF of 0.84 and an EER of 13.44. The augmentation technique notably improved these results to 0.75 DCF and 12.79 EER for the resnet model. Comparative analysis revealed the superiority of resnet over models such as ECPA, Mel-spectrogram, Payonnet, and Titanet large. These results, along with different augmentation schemes, contribute to the success of RoboVox far-field speaker recognition in this paper.
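The augmentation adds noise to enrollment recordings so their acoustic conditions better match far-field test audio. Neither the noise type nor its level is specified in the abstract; the sketch below mixes white Gaussian noise at a target SNR, with both choices labeled as assumptions.

```python
import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    """Mix white Gaussian noise into a speech signal at the requested SNR.
    Noise type and SNR value are illustrative assumptions, not the paper's setup."""
    if rng is None:
        rng = np.random.default_rng(0)
    speech = np.asarray(speech, dtype=float)
    noise = rng.standard_normal(speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))  # scale noise to hit the SNR
    return speech + scale * noise

sr = 16000
clean_enrollment = 0.1 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr)   # 1 s toy signal
noisy_enrollment = add_noise_at_snr(clean_enrollment, snr_db=10)
print(noisy_enrollment.shape)
```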
Citations: 0
Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages
Pub Date : 2024-09-16 DOI: arxiv-2409.10429
Ming-Hao Hsu, Kuan Po Huang, Hung-yi Lee
This paper presents Meta-Whisper, a novel approach to improve automatic speech recognition (ASR) for low-resource languages using the Whisper model. By leveraging Meta In-Context Learning (Meta-ICL) and a k-Nearest Neighbors (KNN) algorithm for sample selection, Meta-Whisper enhances Whisper's ability to recognize speech in unfamiliar languages without extensive fine-tuning. Experiments on the ML-SUPERB dataset show that Meta-Whisper significantly reduces the Character Error Rate (CER) for low-resource languages compared to the original Whisper model. This method offers a promising solution for developing more adaptable multilingual ASR systems, particularly for languages with limited resources.
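KNN-based sample selection typically retrieves, for each test utterance, the k most similar labeled examples under some embedding distance to serve as in-context examples for the model. The embedding source and similarity measure below are assumptions for illustration; the abstract does not detail Meta-Whisper's exact retrieval setup.

```python
import numpy as np

def knn_select(test_emb, pool_embs, k=4):
    """Return indices of the k pool utterances closest to the test utterance
    by cosine similarity. Where the embeddings come from (e.g. which speech
    encoder) is an assumption left open here."""
    pool = np.asarray(pool_embs, dtype=float)
    q = np.asarray(test_emb, dtype=float)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    sims = pool @ q                          # cosine similarity to every candidate
    return np.argsort(-sims)[:k]             # most similar first

rng = np.random.default_rng(0)
candidate_embs = rng.standard_normal((100, 256))   # embeddings of labeled examples
query_emb = rng.standard_normal(256)
print(knn_select(query_emb, candidate_embs, k=4))  # indices of in-context examples
```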
Citations: 0
An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems
Pub Date : 2024-09-16 DOI: arxiv-2409.10515
Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister
Dialog systems, such as voice assistants, are expected to engage with users in complex, evolving conversations. Unfortunately, traditional automatic speech recognition (ASR) systems deployed in such applications are usually trained to recognize each turn independently and lack the ability to adapt to the conversational context or incorporate user feedback. In this work, we introduce a general framework for ASR in dialog systems that can go beyond learning from single-turn utterances and learn over time how to adapt to both explicit supervision and implicit user feedback present in multi-turn conversations. We accomplish that by leveraging advances in student-teacher learning and context-aware dialog processing, and designing contrastive self-supervision approaches with Ohm, a new online hard-negative mining approach. We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data.
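The contrastive self-supervision with Ohm is described only by name; the abstract gives no formulation. As a generic illustration of online hard-negative mining (not Ohm itself), the sketch below computes an in-batch InfoNCE-style loss in which each anchor is contrasted only against its most similar, i.e. hardest, negatives.

```python
import numpy as np

def info_nce_hard_negatives(anchors, positives, temperature=0.1, n_hard=4):
    """In-batch InfoNCE loss restricted to the n_hard most similar negatives
    per anchor. The mining rule and hyperparameters are illustrative only."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = (a @ p.T) / temperature                   # (B, B) similarity matrix
    losses = []
    for i in range(sims.shape[0]):
        pos = sims[i, i]                             # matching pair
        negs = np.delete(sims[i], i)                 # all in-batch negatives
        hard = np.sort(negs)[-n_hard:]               # keep the hardest (most similar) ones
        logits = np.concatenate(([pos], hard))
        losses.append(-pos + np.log(np.sum(np.exp(logits))))  # -log softmax of the positive
    return float(np.mean(losses))

rng = np.random.default_rng(0)
anchor_embs, positive_embs = rng.standard_normal((8, 128)), rng.standard_normal((8, 128))
print(info_nce_hard_negatives(anchor_embs, positive_embs))
```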
Citations: 0