While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to monolingual scenarios, with limited exploration of multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model that integrates multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines at a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability, with speaker consistency and similarity comparable to any given speaker, but also improves the performance of LLMs on multilingual speech generation and recognition tasks.
{"title":"Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data","authors":"Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen","doi":"arxiv-2409.10969","DOIUrl":"https://doi.org/arxiv-2409.10969","url":null,"abstract":"While large language models (LLMs) have been explored in the speech domain\u0000for both generation and recognition tasks, their applications are predominantly\u0000confined to the monolingual scenario, with limited exploration in multilingual\u0000and code-switched (CS) contexts. Additionally, speech generation and\u0000recognition tasks are often handled separately, such as VALL-E and Qwen-Audio.\u0000In this paper, we propose a MutltiLingual MultiTask (MLMT) model, integrating\u0000multilingual speech generation and recognition tasks within the single LLM.\u0000Furthermore, we develop an effective data construction approach that splits and\u0000concatenates words from different languages to equip LLMs with CS synthesis\u0000ability without relying on CS data. The experimental results demonstrate that\u0000our model outperforms other baselines with a comparable data scale.\u0000Furthermore, our data construction approach not only equips LLMs with CS speech\u0000synthesis capability with comparable speaker consistency and similarity to any\u0000given speaker, but also improves the performance of LLMs in multilingual speech\u0000generation and recognition tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-driven 3D facial animation has made impressive progress in both research and application development. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between generated animations and real human faces. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on a 3D facial template with a diffusion policy, instead of generating a face for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which preserves the continuous and natural flow of human emotions. The experiments show that our approach is effective in synthesizing variable and dynamic facial motion.
{"title":"3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy","authors":"Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi","doi":"arxiv-2409.10848","DOIUrl":"https://doi.org/arxiv-2409.10848","url":null,"abstract":"Audio-driven 3D facial animation has made immersive progress both in research\u0000and application developments. The newest approaches focus on Transformer-based\u0000methods and diffusion-based methods, however, there is still gap in the\u0000vividness and emotional expression between the generated animation and real\u0000human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion\u0000policy model for 3D facial animation prediction. This method generates variable\u0000and realistic human facial movements by predicting the 3D vertex trajectory on\u0000the 3D facial template with diffusion policy instead of facial generation for\u0000every frame. It takes audio and vertex states as observations to predict the\u0000vertex trajectory and imitate real human facial expressions, which keeps the\u0000continuous and natural flow of human emotions. The experiments show that our\u0000approach is effective in variable and dynamic facial motion synthesizing.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. Real-world ASR systems and post-processing pipelines, on the other hand, are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.
{"title":"Spontaneous Informal Speech Dataset for Punctuation Restoration","authors":"Xing Yi Liu, Homayoon Beigi","doi":"arxiv-2409.11241","DOIUrl":"https://doi.org/arxiv-2409.11241","url":null,"abstract":"Presently, punctuation restoration models are evaluated almost solely on\u0000well-structured, scripted corpora. On the other hand, real-world ASR systems\u0000and post-processing pipelines typically apply towards spontaneous speech with\u0000significant irregularities, stutters, and deviations from perfect grammar. To\u0000address this discrepancy, we introduce SponSpeech, a punctuation restoration\u0000dataset derived from informal speech sources, which includes punctuation and\u0000casing information. In addition to publicly releasing the dataset, we\u0000contribute a filtering pipeline that can be used to generate more data. Our\u0000filtering pipeline examines the quality of both speech audio and transcription\u0000text. We also carefully construct a ``challenging\" test set, aimed at\u0000evaluating models' ability to leverage audio information to predict otherwise\u0000grammatically ambiguous punctuation. SponSpeech is available at\u0000https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset\u0000building and model runs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
James Brooks-Park, Martin Bo Møller, Jan Østergaard, Søren Bech, Steven van de Par
Room equalisation aims to increase the quality of loudspeaker reproduction in reverberant environments, compensating for colouration caused by imperfect room reflections and frequency-dependent loudspeaker directivity. A common technique in the field of room equalisation is to invert a prototype Room Impulse Response (RIR). Rather than inverting a single RIR at the listening position, a prototype response is composed of several responses distributed around the listening area. This paper proposes a method of impulse response prototyping that uses estimated receiver positions to form a weighted-average prototype response. A method of receiver distance estimation is described, supporting the implementation of the prototype RIR. The proposed prototyping method is compared to other methods by measuring their post-equalisation spectral deviation at several positions in a simulated room.
{"title":"Room impulse response prototyping using receiver distance estimations for high quality room equalisation algorithms","authors":"James Brooks-Park, Martin Bo Møller, Jan Østergaard, Søren Bech, Steven van de Par","doi":"arxiv-2409.10131","DOIUrl":"https://doi.org/arxiv-2409.10131","url":null,"abstract":"Room equalisation aims to increase the quality of loudspeaker reproduction in\u0000reverberant environments, compensating for colouration caused by imperfect room\u0000reflections and frequency dependant loudspeaker directivity. A common technique\u0000in the field of room equalisation, is to invert a prototype Room Impulse\u0000Response (RIR). Rather than inverting a single RIR at the listening position, a\u0000prototype response is composed of several responses distributed around the\u0000listening area. This paper proposes a method of impulse response prototyping,\u0000using estimated receiver positions, to form a weighted average prototype\u0000response. A method of receiver distance estimation is described, supporting the\u0000implementation of the prototype RIR. The proposed prototyping method is\u0000compared to other methods by measuring their post equalisation spectral\u0000deviation at several positions in a simulated room.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNNs or LSTMs, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the advanced McNet model by introducing an improved version of Mamba, a state-space model, and propose MCMamba. MCMamba has been completely re-engineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.
{"title":"Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement","authors":"Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao","doi":"arxiv-2409.10376","DOIUrl":"https://doi.org/arxiv-2409.10376","url":null,"abstract":"In multichannel speech enhancement, effectively capturing spatial and\u0000spectral information across different microphones is crucial for noise\u0000reduction. Traditional methods, such as CNN or LSTM, attempt to model the\u0000temporal dynamics of full-band and sub-band spectral and spatial features.\u0000However, these approaches face limitations in fully modeling complex temporal\u0000dependencies, especially in dynamic acoustic environments. To overcome these\u0000challenges, we modify the current advanced model McNet by introducing an\u0000improved version of Mamba, a state-space model, and further propose MCMamba.\u0000MCMamba has been completely reengineered to integrate full-band and narrow-band\u0000spatial information with sub-band and full-band spectral features, providing a\u0000more comprehensive approach to modeling spatial and spectral information. Our\u0000experimental results demonstrate that MCMamba significantly improves the\u0000modeling of spatial and spectral features in multichannel speech enhancement,\u0000outperforming McNet and achieving state-of-the-art performance on the CHiME-3\u0000dataset. Additionally, we find that Mamba performs exceptionally well in\u0000modeling spectral information.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims to explain the differences between these frameworks by focusing our investigation on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored for the Schrödinger bridge framework, demonstrating enhanced performance and improved perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this field.
{"title":"Investigating Training Objectives for Generative Speech Enhancement","authors":"Julius Richter, Danilo de Oliveira, Timo Gerkmann","doi":"arxiv-2409.10753","DOIUrl":"https://doi.org/arxiv-2409.10753","url":null,"abstract":"Generative speech enhancement has recently shown promising advancements in\u0000improving speech quality in noisy environments. Multiple diffusion-based\u0000frameworks exist, each employing distinct training objectives and learning\u0000techniques. This paper aims at explaining the differences between these\u0000frameworks by focusing our investigation on score-based generative models and\u0000Schr\"odinger bridge. We conduct a series of comprehensive experiments to\u0000compare their performance and highlight differing training behaviors.\u0000Furthermore, we propose a novel perceptual loss function tailored for the\u0000Schr\"odinger bridge framework, demonstrating enhanced performance and improved\u0000perceptual quality of the enhanced speech signals. All experimental code and\u0000pre-trained models are publicly available to facilitate further research and\u0000development in this.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muhammad Sudipto Siam Dip, Md Anik Hasan, Sapnil Sarker Bipro, Md Abdur Raiyan, Mohammod Abdul Motin
In this study, we address the challenge of speaker recognition using a novel data augmentation technique: adding noise to enrollment files. This technique efficiently aligns the sources of the test and enrollment files, enhancing comparability. Various pre-trained models were employed, with the ResNet model achieving the highest DCF of 0.84 and an EER of 13.44%. The augmentation technique notably improved these results to a DCF of 0.75 and an EER of 12.79% for the ResNet model. Comparative analysis revealed the superiority of ResNet over models such as ECAPA, Mel-spectrogram, Payonnet, and TitaNet Large. These results, along with the different augmentation schemes, contribute to the success of RoboVox far-field speaker recognition in this paper.
{"title":"oboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models","authors":"Muhammad Sudipto Siam Dip, Md Anik Hasan, Sapnil Sarker Bipro, Md Abdur Raiyan, Mohammod Abdul Motin","doi":"arxiv-2409.10240","DOIUrl":"https://doi.org/arxiv-2409.10240","url":null,"abstract":"In this study, we address the challenge of speaker recognition using a novel\u0000data augmentation technique of adding noise to enrollment files. This technique\u0000efficiently aligns the sources of test and enrollment files, enhancing\u0000comparability. Various pre-trained models were employed, with the resnet model\u0000achieving the highest DCF of 0.84 and an EER of 13.44. The augmentation\u0000technique notably improved these results to 0.75 DCF and 12.79 EER for the\u0000resnet model. Comparative analysis revealed the superiority of resnet over\u0000models such as ECPA, Mel-spectrogram, Payonnet, and Titanet large. Results,\u0000along with different augmentation schemes, contribute to the success of RoboVox\u0000far-field speaker recognition in this paper","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a novel reference-free (RF) audio quality metric called the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono, stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer learning from a state-of-the-art full-reference (FR) Generative Machine Listener (GML) with minimal architectural modifications. The term "generative" refers to the model's ability to generate an arbitrary number of simulated listening scores. Unlike existing RF models, RF-GML accurately predicts subjective quality scores across diverse content types and codecs. Extensive evaluations demonstrate its superiority in rating unencoded audio and distinguishing different levels of coding artifacts. RF-GML's performance and versatility make it a valuable tool for coded audio quality assessment and monitoring in various applications, all without the need for a reference signal.
{"title":"RF-GML: Reference-Free Generative Machine Listener","authors":"Arijit Biswas, Guanxin Jiang","doi":"arxiv-2409.10210","DOIUrl":"https://doi.org/arxiv-2409.10210","url":null,"abstract":"This paper introduces a novel reference-free (RF) audio quality metric called\u0000the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono,\u0000stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer\u0000learning from a state-of-the-art full-reference (FR) Generative Machine\u0000Listener (GML) with minimal architectural modifications. The term \"generative\"\u0000refers to the model's ability to generate an arbitrary number of simulated\u0000listening scores. Unlike existing RF models, RF-GML accurately predicts\u0000subjective quality scores across diverse content types and codecs. Extensive\u0000evaluations demonstrate its superiority in rating unencoded audio and\u0000distinguishing different levels of coding artifacts. RF-GML's performance and\u0000versatility make it a valuable tool for coded audio quality assessment and\u0000monitoring in various applications, all without the need for a reference\u0000signal.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Meta-Whisper, a novel approach to improve automatic speech recognition (ASR) for low-resource languages using the Whisper model. By leveraging Meta In-Context Learning (Meta-ICL) and a k-Nearest Neighbors (KNN) algorithm for sample selection, Meta-Whisper enhances Whisper's ability to recognize speech in unfamiliar languages without extensive fine-tuning. Experiments on the ML-SUPERB dataset show that Meta-Whisper significantly reduces the Character Error Rate (CER) for low-resource languages compared to the original Whisper model. This method offers a promising solution for developing more adaptable multilingual ASR systems, particularly for languages with limited resources.
{"title":"Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages","authors":"Ming-Hao Hsu, Kuan Po Huang, Hung-yi Lee","doi":"arxiv-2409.10429","DOIUrl":"https://doi.org/arxiv-2409.10429","url":null,"abstract":"This paper presents Meta-Whisper, a novel approach to improve automatic\u0000speech recognition (ASR) for low-resource languages using the Whisper model. By\u0000leveraging Meta In-Context Learning (Meta-ICL) and a k-Nearest Neighbors (KNN)\u0000algorithm for sample selection, Meta-Whisper enhances Whisper's ability to\u0000recognize speech in unfamiliar languages without extensive fine-tuning.\u0000Experiments on the ML-SUPERB dataset show that Meta-Whisper significantly\u0000reduces the Character Error Rate (CER) for low-resource languages compared to\u0000the original Whisper model. This method offers a promising solution for\u0000developing more adaptable multilingual ASR systems, particularly for languages\u0000with limited resources.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister
Dialog systems, such as voice assistants, are expected to engage with users in complex, evolving conversations. Unfortunately, traditional automatic speech recognition (ASR) systems deployed in such applications are usually trained to recognize each turn independently and lack the ability to adapt to the conversational context or incorporate user feedback. In this work, we introduce a general framework for ASR in dialog systems that can go beyond learning from single-turn utterances and learn over time how to adapt to both explicit supervision and implicit user feedback present in multi-turn conversations. We accomplish that by leveraging advances in student-teacher learning and context-aware dialog processing, and designing contrastive self-supervision approaches with Ohm, a new online hard-negative mining approach. We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data.
{"title":"An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems","authors":"Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister","doi":"arxiv-2409.10515","DOIUrl":"https://doi.org/arxiv-2409.10515","url":null,"abstract":"Dialog systems, such as voice assistants, are expected to engage with users\u0000in complex, evolving conversations. Unfortunately, traditional automatic speech\u0000recognition (ASR) systems deployed in such applications are usually trained to\u0000recognize each turn independently and lack the ability to adapt to the\u0000conversational context or incorporate user feedback. In this work, we introduce\u0000a general framework for ASR in dialog systems that can go beyond learning from\u0000single-turn utterances and learn over time how to adapt to both explicit\u0000supervision and implicit user feedback present in multi-turn conversations. We\u0000accomplish that by leveraging advances in student-teacher learning and\u0000context-aware dialog processing, and designing contrastive self-supervision\u0000approaches with Ohm, a new online hard-negative mining approach. We show that\u0000leveraging our new framework compared to traditional training leads to relative\u0000WER reductions of close to 10% in real-world dialog systems, and up to 26% on\u0000public synthetic data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}