
Latest Interspeech Publications

Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10376
Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, A. H. Poorjam, Deepak Mittal, M. Singh
In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions. The choice of these kernels allows the interpretation of the filterbanks as smooth band-pass filters. The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism. The relevance weighting emphasizes the key regions of the time-frequency decomposition that are important for the downstream task. The subsequent layers of the model consist of a recurrent architecture and the models are trained for a COVID-19 detection task. In our experiments on the Coswara data set, we show that the proposed model achieves significant performance improvements over the baseline system as well as other representation learning approaches. Further, the approach proposed is shown to be uniformly applicable for speech and breathing signals and for transfer learning from a larger data set. Copyright © 2022 ISCA.
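To make the filterbank parameterization concrete, below is a minimal PyTorch sketch of 1-D convolution kernels defined as cosine-modulated Gaussians with learnable center frequencies and bandwidths. It is not the authors' code; the module name, filter count, kernel length, and initialization values are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of 1-D convolution kernels
# parameterized as cosine-modulated Gaussians, as described in the abstract.
import math
import torch
import torch.nn as nn


class CosGaussFilterbank(nn.Module):
    def __init__(self, num_filters=80, kernel_size=401):
        super().__init__()
        # Learnable center frequencies (normalized, cycles/sample) and bandwidths.
        self.mu = nn.Parameter(torch.linspace(0.02, 0.45, num_filters))
        self.log_sigma = nn.Parameter(torch.full((num_filters,), -3.0))
        t = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", t)                           # kernel time axis

    def kernels(self):
        # h_k(t) = cos(2*pi*mu_k*t) * exp(-t^2 / (2*sigma_k^2))
        sigma = torch.exp(self.log_sigma).unsqueeze(1) * self.t.numel()
        carrier = torch.cos(2 * math.pi * self.mu.unsqueeze(1) * self.t)
        envelope = torch.exp(-self.t ** 2 / (2 * sigma ** 2))
        return (carrier * envelope).unsqueeze(1)                # (filters, 1, kernel)

    def forward(self, wav):                                     # wav: (batch, 1, samples)
        return torch.nn.functional.conv1d(wav, self.kernels(), padding="same")


x = torch.randn(2, 1, 16000)                                    # two 1-second waveforms
print(CosGaussFilterbank()(x).shape)                            # torch.Size([2, 80, 16000])
```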
Interspeech, pp. 2863-2867.
Citations: 1
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-327
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP by training on large corpora of written text, spoken language containing serious ASR errors that change its meaning is difficult for them to understand. We propose a method for pre-training Japanese LMs that are robust against ASR errors without using ASR. In the proposed method, sentences containing pseudo-ASR errors are generated from written texts using a pseudo-error dictionary constructed with neural grapheme-to-phoneme and phoneme-to-grapheme models. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed the LM pre-trained with standard masked language modeling by 3.17 points on ROUGE-L when fine-tuning with dialogues that include ASR errors.
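As an illustration of the pseudo-error idea (not the paper's implementation), the toy sketch below substitutes phonemes after grapheme-to-phoneme conversion and maps the result back to graphemes; the tiny romanized lexicons and confusion table are placeholders standing in for the neural G2P/P2G models and the learned pseudo-error dictionary.

```python
# Toy pseudo-ASR-error generation: G2P -> perturb phonemes -> P2G.
import random

G2P = {"kyou": "ky o:", "wa": "w a", "ame": "a m e"}                  # placeholder G2P lexicon
P2G = {"ky o:": "kyou", "w a": "wa", "a m e": "ame", "a m i": "ami"}  # placeholder P2G lexicon
CONFUSIONS = {"e": ["i", "e"], "o:": ["o", "o:"]}                     # phoneme confusion candidates


def pseudo_error_sentence(words, error_rate=0.3, seed=0):
    rng = random.Random(seed)
    out = []
    for w in words:
        phones = G2P.get(w, w).split()
        # Randomly substitute phonemes to mimic ASR confusions.
        phones = [rng.choice(CONFUSIONS.get(p, [p])) if rng.random() < error_rate else p
                  for p in phones]
        out.append(P2G.get(" ".join(phones), w))                      # fall back to the clean word
    return out


print(pseudo_error_sentence(["kyou", "wa", "ame"], error_rate=0.8))
```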
Interspeech, pp. 2688-2692.
Citations: 0
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11288
Andreas Liesenfeld, Mark Dingemanse
Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, and (iii) most languages appear to make available a broad distinction between a minimal nasal format ('mm') and a fuller 'yeah'-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.
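The sketch below illustrates one way such a pipeline could look (an assumption on my part, not the authors' code, and assuming the librosa and umap-learn packages): candidate tokens are embedded with mean MFCC vectors and projected with UMAP; the random arrays stand in for real audio snippets of response tokens.

```python
# Rough sketch: fixed-length token embeddings + UMAP projection to inspect
# structure and variation among candidate response tokens.
import numpy as np
import librosa
import umap


def embed(snippet, sr=16000):
    # Mean MFCC vector as a crude fixed-length representation of one token.
    mfcc = librosa.feature.mfcc(y=snippet, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)


# Stand-in data: 200 random 0.3-second "tokens" (replace with real clips).
rng = np.random.default_rng(0)
tokens = [rng.standard_normal(4800).astype(np.float32) for _ in range(200)]

X = np.stack([embed(t) for t in tokens])
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
print(coords.shape)  # (200, 2): a 2-D map of response-token variation
```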
Interspeech, pp. 1126-1130.
Citations: 3
FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-577
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two training stages, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which achieves higher speech naturalness and timbre similarity. To our knowledge, FlowCPCVC is the first one-stage training system for the any-to-any task, taking advantage of VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard the timbre while keeping the linguistic information. Our method directly incorporates the vocoder into training, thus avoiding the loss of spectral information that occurs in two-stage training. Although trained on the any-to-any task, the method also yields robust results when applied to any-to-many conversion. Experiments show that FlowCPCVC achieves clear improvements over VQMIVC, the current state-of-the-art any-to-any voice conversion system. Our demo is available online.
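As a heavily simplified sketch of the timbre/content decoupling idea (not FlowCPCVC itself; the GRU modules below are stand-ins for the CPC extractor, flow module, and vocoder, and all dimensions are illustrative), the toy PyTorch model combines an utterance-level speaker embedding with frame-level content features to produce a converted spectrogram.

```python
# Toy voice-conversion skeleton: content features from the source utterance,
# timbre embedding from the target speaker, decoder consumes both.
import torch
import torch.nn as nn


class ToyVC(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, spk_dim=128):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)      # stand-in content extractor
        self.spk_enc = nn.Sequential(nn.Linear(n_mels, spk_dim), nn.ReLU())   # stand-in speaker encoder
        self.decoder = nn.GRU(content_dim + spk_dim, n_mels, batch_first=True)

    def forward(self, src_mel, tgt_mel):
        content, _ = self.content_enc(src_mel)                 # (B, T, content_dim), linguistic info
        timbre = self.spk_enc(tgt_mel.mean(dim=1))             # (B, spk_dim), utterance-level timbre
        timbre = timbre.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, timbre], dim=-1))
        return out                                             # converted mel, to be fed to a vocoder


mel_src, mel_tgt = torch.randn(2, 120, 80), torch.randn(2, 90, 80)
print(ToyVC()(mel_src, mel_tgt).shape)                         # torch.Size([2, 120, 80])
```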
Interspeech, pp. 2558-2562.
Citations: 3
Reducing Domain mismatch in Self-supervised speech pre-training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-736
M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano
Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning, as not all samples have relevant information from which to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples by performing a fine-grained data selection. ATM performs masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. We conduct fine-tuning experiments on two well-benchmarked corpora: LibriSpeech (matching the pre-training data) and AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.
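The following sketch shows one plausible reading of the scorer-guided masking step (my assumption, not the released ATM code): given per-frame confidence scores from an external scorer, the mask is placed on the most confident frames rather than on randomly chosen ones.

```python
# Select the highest-confidence frames for masking in MSM-style pre-training.
import torch


def confident_frame_mask(confidence, mask_ratio=0.15):
    """confidence: (batch, frames) scores from an external ASR model/scorer."""
    n_mask = max(1, int(confidence.size(1) * mask_ratio))
    idx = confidence.topk(n_mask, dim=1).indices           # highest-confidence frames
    mask = torch.zeros_like(confidence, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask                                            # True where frames get masked


conf = torch.rand(4, 200)                                  # dummy confidences for 4 utterances
print(confident_frame_mask(conf).sum(dim=1))               # ~30 masked frames per utterance
```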
Interspeech, pp. 3028-3032.
Citations: 0
Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-100
T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano
Some speakers have penetrating voices that pop out and can be heard clearly, even in loud noise or from a long distance. This study investigated the voice quality of penetrating voices using factor analysis. Eleven participants scored how well the voices of 124 speakers popped out from babble noise. Taking the score as an index of penetration, ten high-scored and ten low-scored speakers were selected for a rating experiment with a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers using 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted using the data of 13 scales (i.e., excluding the penetrating scale from the 14 scales). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) correlated highly with the ratings of penetration. These results suggest that penetrating voices have multi-dimensional voice quality and that the characteristics of penetrating voices are related to the powerful and metallic aspects of voice.
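For readers unfamiliar with the analysis, the snippet below runs a three-factor analysis with varimax rotation over a synthetic 13-scale rating matrix using scikit-learn; the data, dimensions, and rotation choice are stand-in assumptions, not the study's ratings or exact procedure.

```python
# Factor analysis over semantic-differential ratings:
# 40 raters x 20 speakers rated on 13 bipolar 7-point scales (synthetic here).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(40 * 20, 13)).astype(float)   # stand-in rating matrix

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
fa.fit(ratings)
loadings = fa.components_.T            # (13 scales, 3 factors)
print(loadings.shape)                  # inspect which scales load on which factor
```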
Interspeech, pp. 3063-3067.
Citations: 0
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11275
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng
Interspeech, pp. 5533-5537.
Citations: 4
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Pub Date : 2022-09-18 DOI: 10.21437/Interspeech.2022-11412
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong
This paper describes a spatial-aware speaker diarization system for multi-channel multi-party meetings. The diarization system obtains speaker direction information from a microphone array. The speaker spatial embedding is generated from the x-vector and an s-vector derived from superdirective beamforming (SDB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named discriminative multi-stream neural network (DMSNet), which consists of an attention superdirective beamforming (ASDB) block and a Conformer encoder. The proposed ASDB is a self-adapted channel-wise block that extracts the latent spatial features of array audio by modeling interdependencies between channels. We explore DMSNet to address the overlapped speech problem on multi-channel audio and achieve 93.53% accuracy on the evaluation set. By applying the DMSNet-based overlapped speech detection (OSD) module, the diarization error rate (DER) of the cluster-based diarization system decreases significantly from 13.45% to 7.64%.
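For context on the SDB front-end, here is a compact NumPy sketch of superdirective beamforming weights under the usual diffuse-noise coherence assumption; the array geometry, frequency, and diagonal loading are illustrative choices, and this is not the paper's DMSNet/ASDB code.

```python
# Superdirective (MVDR with diffuse-noise coherence) beamformer weights.
import numpy as np


def sdb_weights(mic_pos, look_dir, freq, c=343.0, diag_load=1e-3):
    """mic_pos: (M, 3) mic coordinates; look_dir: unit vector toward the target."""
    # Steering vector for a far-field plane wave from look_dir.
    delays = mic_pos @ look_dir / c
    d = np.exp(-2j * np.pi * freq * delays)
    # Diffuse (spherically isotropic) noise coherence matrix: sinc of mic spacing.
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    gamma = np.sinc(2 * freq * dist / c) + diag_load * np.eye(len(mic_pos))
    # w = Gamma^-1 d / (d^H Gamma^-1 d)
    gd = np.linalg.solve(gamma, d)
    return gd / (d.conj() @ gd)


mics = np.array([[0, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])  # 4-mic linear array
w = sdb_weights(mics, np.array([1.0, 0.0, 0.0]), freq=1000.0)
print(np.abs(w))   # per-channel weight magnitudes for a 1 kHz band, end-fire look direction
```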
Interspeech, pp. 1491-1495.
Citations: 6
Online Learning of Open-set Speaker Identification by Active User-registration
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-25
Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee
Registering each user's identity for voice assistants is burdensome and complex in multi-user environments such as a household scenario. This is particularly true when registration needs to happen on-the-fly with relatively minimal effort. Most prior works on speaker identification (SID) do not seamlessly allow the addition of new speakers, as they do not support online updates. To deal with this limitation, we introduce a novel online learning approach to open-set SID that can actively register unknown users in the household setting. Based on MPART (Message Passing Adaptive Resonance Theory), our method performs online active semi-supervised learning for open-set SID by using speaker embedding vectors to infer new speakers and request the user's identity. Our method progressively improves overall SID performance without forgetting, making it attractive for many interactive real-world applications. We evaluate our model in the online learning setting of an open-set SID task where new speakers are added on-the-fly, demonstrating its superior performance.
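A bare-bones sketch of the open-set registration loop follows (a simplification of mine, not MPART): each incoming speaker embedding is compared with registered centroids by cosine similarity and is either assigned to a known speaker or actively registered as a new one; the threshold and embedding size are arbitrary placeholders.

```python
# Threshold-based open-set online speaker registration.
import numpy as np


class OnlineRegistry:
    def __init__(self, threshold=0.6):
        self.centroids, self.counts, self.threshold = [], [], threshold

    def observe(self, emb):
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = np.array([c @ emb for c in self.centroids])   # cosine similarity (unit vectors)
            best = int(sims.argmax())
            if sims[best] >= self.threshold:
                # Known speaker: update the running centroid and renormalize.
                self.counts[best] += 1
                self.centroids[best] += (emb - self.centroids[best]) / self.counts[best]
                self.centroids[best] /= np.linalg.norm(self.centroids[best])
                return best
        # Unknown speaker: actively register a new identity.
        self.centroids.append(emb)
        self.counts.append(1)
        return len(self.centroids) - 1


reg = OnlineRegistry()
rng = np.random.default_rng(0)
for _ in range(5):
    print(reg.observe(rng.standard_normal(192)))   # speaker index assigned to each observation
```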
Interspeech, pp. 5065-5069.
Citations: 1
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10466
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu
Speaker diarization in real-world acoustic environments is a challenging task of increasing interest to both academia and industry. Although it has been widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Audio and video are both released. In particular, MSDWild has a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.
Interspeech, pp. 1476-1480.
Citations: 6