
Interspeech: Latest Publications

Streaming model for Acoustic to Articulatory Inversion with transformer networks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10159
Sathvik Udupa, Aravind Illa, P. Ghosh
Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-the-art performance. However, in transformer networks the attention is applied over the whole utterance, so the full utterance must be available before inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation can instead be performed on non-overlapping chunks rather than the full utterance. However, the resulting mismatch between the attention receptive fields seen during training and evaluation can degrade AAI performance. To address this, in this work we experiment with different attention masks and use context from previous predictions during training. Experimental results reveal that random-start mask attention, combined with context from the transformer decoder's previous predictions, outperforms the baseline.
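The chunk-wise attention masking discussed above can be pictured with a small sketch. The following PyTorch snippet is illustrative only (not the authors' implementation): it builds a boolean mask in which each frame may attend to the frames of its own chunk plus a limited left context; the chunk size and context length are assumed values.

```python
# Illustrative only: a block-wise attention mask for chunk-based streaming
# inference. Chunk size and left-context length are assumed values, not the
# settings used in the paper.
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, left_context: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True means frame i may attend to frame j."""
    idx = torch.arange(seq_len)
    chunk_id = idx // chunk_size
    same_chunk = chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)   # within the current chunk
    offset = idx.unsqueeze(1) - idx.unsqueeze(0)                  # i - j
    past_context = (offset > 0) & (offset <= left_context)        # limited left context
    return same_chunk | past_context

if __name__ == "__main__":
    allowed = chunk_attention_mask(seq_len=8, chunk_size=4, left_context=2)
    print(allowed.int())
    # torch.nn.MultiheadAttention expects a mask of *disallowed* positions,
    # so pass attn_mask=~allowed when using it.
```

Randomizing where the chunk boundaries fall during training exposes the model to varied receptive fields, which is the intuition behind the random-start mask mentioned in the abstract.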
{"title":"Streaming model for Acoustic to Articulatory Inversion with transformer networks","authors":"Sathvik Udupa, Aravind Illa, P. Ghosh","doi":"10.21437/interspeech.2022-10159","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10159","url":null,"abstract":"Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chucks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experiments results revealed that using the random start mask attention with the context from previous predictions of transformer decoder performs better than the baseline results.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"625-629"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44495671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11371
Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad
This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi, together with the spontaneity of speech, natural background noise and transcriptions of variable accuracy due to crowdsourcing, make it a unique corpus for ASR on spontaneous telephonic speech. Around 1108 hours of real-world spontaneous speech recordings have been released as part of the challenge, comprising 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data. The efficacy of both training and test sets is validated on different ASR systems, in both the traditional time-delay neural network-hidden Markov model (TDNN-HMM) framework and a fully neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on the eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively, while in the E2E setup the WER and CER on the eval set for a conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
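For readers unfamiliar with the metrics, WER and CER are edit distances normalised by the reference length, computed over words and characters respectively. Below is a minimal dynamic-programming scoring sketch; it is a generic implementation, not the challenge's official scorer.

```python
# Generic WER/CER computation by Levenshtein distance; not the official
# Gram Vaani challenge scoring script.
def error_rate(reference, hypothesis):
    """Edit distance between two token sequences divided by reference length."""
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / max(n, 1)

if __name__ == "__main__":
    ref, hyp = "the weather is very nice".split(), "the weather is nice".split()
    print(f"WER = {error_rate(ref, hyp):.2f}")                       # 1 deletion / 5 words = 0.20
    print(f"CER = {error_rate(list('speech'), list('spech')):.2f}")  # 1 deletion / 6 chars = 0.17
```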
{"title":"Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi","authors":"Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad","doi":"10.21437/interspeech.2022-11371","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11371","url":null,"abstract":"This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises the spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani . The regional variations of Hindi together with spontaneity of speech, natural background and transcriptions with variable accuracy due to crowdsourcing make it a unique corpus for ASR on spontaneous telephonic speech. Around, 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as a part of the challenge. The efficacy of both training and test sets are validated on different ASR systems in both traditional time-delay neural network-hidden Markov model (TDNN-HMM) frameworks and fully-neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on eval set for a TDNN model trained on 100 hours of labelled data are 29 . 7% and 15 . 1% , respectively. While, in E2E setup, WER and CER on eval set for a conformer model trained on 100 hours of data are 32 . 9% and 19 . 0% , respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3548-3552"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43519978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10483
Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, S. Siniscalchi, Shinji Watanabe, O. Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan
In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large-vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario. Moreover, we make a deep analysis of the corpus and conduct a comprehensive ablation study of all audio and video data in audio-only, video-only and audio-visual systems. Error analysis shows that the video modality supplements acoustic information degraded by noise, reducing deletion errors, and provides discriminative information in overlapping speech, reducing substitution errors. Finally, we also design a set of experiments covering the frontend, data augmentation and end-to-end models to indicate directions for potential future work. The corpus and the code are released to promote research not only in the speech area but also in the computer vision area and cross-disciplinary research.
{"title":"Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis","authors":"Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, S. Siniscalchi, Shinji Watanabe, O. Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan","doi":"10.21437/interspeech.2022-10483","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10483","url":null,"abstract":"In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141h audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To our best knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large vocabulary continuous Chinese lip-reading dataset in the adverse home-tv scenario. Moreover, we make a deep analysis of the corpus and conduct a comprehensive ablation study of all audio and video data in the audio-only/video-only/audio-visual systems. Error analysis shows video modality supplement acoustic information degraded by noise to reduce deletion errors and provide discriminative information in overlapping speech to reduce substitution errors. Finally, we also design a set of experiments such as frontend, data augmentation and end-to-end models for providing the direction of potential future work. The corpus 1 and the code 2 are released to promote the research not only in speech area but also for the computer vision area and cross-disciplinary research.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1766-1770"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43761010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10868
Beiming Cao, Kristin J. Teplansky, Nordine Sebkhi, Arpan Bhavsar, O. Inan, Robin A. Samlan, T. Mau, Jun Wang
Silent speech recognition (SSR) predicts textual information from silent articulation and is the algorithmic core of silent speech interfaces (SSIs). SSIs have the potential to recover the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Due to the logistic difficulties of articulatory data collection, current SSR studies suffer from limited amounts of data. Data augmentation aims to increase the amount of training data by introducing variations into the existing dataset, but it has rarely been investigated in SSR for laryngectomees. In this study, we investigated the effectiveness of multiple data augmentation approaches for SSR, including consecutive and intermittent time masking, articulatory dimension masking, sinusoidal noise injection and random scaling. Different experimental setups were used, including speaker-dependent, speaker-independent and speaker-adaptive. The SSR models were end-to-end speech recognition models trained with connectionist temporal classification (CTC). Electromagnetic articulography (EMA) datasets collected from multiple healthy speakers and laryngectomees were used. The experimental results demonstrate that the explored data augmentation approaches performed differently but generally improved SSR performance. In particular, consecutive time masking brought significant improvements in SSR for both healthy speakers and laryngectomees.
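As a hedged illustration of one of the augmentations listed above, the sketch below applies consecutive time masking to an articulatory feature sequence by zeroing a random contiguous span of frames. The array shapes and maximum mask width are assumptions for the example, not the paper's settings.

```python
# Consecutive time masking for an articulatory feature sequence of shape
# (frames, channels). The maximum mask width is an assumed hyper-parameter.
import numpy as np

def time_mask(features, max_mask_frames=20, rng=None):
    rng = rng or np.random.default_rng()
    out = features.copy()
    n_frames = out.shape[0]
    width = int(rng.integers(1, max_mask_frames + 1))        # mask length
    start = int(rng.integers(0, max(n_frames - width, 1)))   # mask position
    out[start:start + width, :] = 0.0                        # zero a contiguous block of frames
    return out

if __name__ == "__main__":
    ema = np.random.randn(200, 12).astype(np.float32)        # e.g., 200 frames, 12 EMA channels
    augmented = time_mask(ema, max_mask_frames=30)
    masked = int((augmented == 0).all(axis=1).sum())
    print(f"{masked} of {ema.shape[0]} frames masked")
```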
{"title":"Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees","authors":"Beiming Cao, Kristin J. Teplansky, Nordine Sebkhi, Arpan Bhavsar, O. Inan, Robin A. Samlan, T. Mau, Jun Wang","doi":"10.21437/interspeech.2022-10868","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10868","url":null,"abstract":"Silent speech recognition (SSR) predicts textual information from silent articulation, which is an algorithm design in silent speech interfaces (SSIs). SSIs have the potential of recov-ering the speech ability of individuals who lost their voice but can still articulate (e.g., laryngectomees). Due to the lo-gistic difficulties in articulatory data collection, current SSR studies suffer limited amount of dataset. Data augmentation aims to increase the training data amount by introducing variations into the existing dataset, but has rarely been investigated in SSR for laryngectomees. In this study, we investigated the effectiveness of multiple data augmentation approaches for SSR including consecutive and intermittent time masking, articulatory dimension masking, sinusoidal noise injection and randomly scaling. Different experimental setups including speaker-dependent, speaker-independent, and speaker-adaptive were used. The SSR models were end-to-end speech recognition models trained with connectionist temporal classification (CTC). Electromagnetic articulography (EMA) datasets collected from multiple healthy speakers and laryngectomees were used. The experimental results have demonstrated that the data augmentation approaches explored performed differently, but generally improved SSR performance. Especially, the consecutive time masking has brought significant improvement on SSR for both healthy speakers and laryngectomees.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3653-3657"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43442432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-658
Jingjing Dong, Jiayi Fu, P. Zhou, Hao Li, Xiaorui Wang
Spoken language understanding (SLU) is conventionally based on a pipeline architecture that suffers from error propagation. To mitigate this problem, end-to-end (E2E) models have been proposed to directly map speech input to the desired semantic outputs. Meanwhile, others try to leverage linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which utilizes cross-modal contrastive learning to learn better multi-modal representations. In particular, a two-stream multi-modal framework is designed, and a contrastive learning task is performed across speech and text representations. Moreover, CMCL employs a multi-modal shared classification task combined with a contrastive learning task to guide the learned representation, improving performance on the intent classification task. We also investigate the efficacy of employing cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and Smartlights datasets, respectively, outperforming state-of-the-art comparative methods. Moreover, performance decreases by only 0.32% and 2.8%, respectively, when training on 10% and 1% of the FSC dataset, indicating its advantage in few-shot scenarios.
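A cross-modal contrastive objective of this general kind can be written as an InfoNCE-style loss that pulls paired speech and text embeddings together and pushes mismatched pairs apart. The sketch below is a generic version of such a loss, not the CMCL implementation; the temperature and embedding sizes are assumptions.

```python
# Generic InfoNCE-style cross-modal contrastive loss between paired speech
# and text embeddings. Temperature and dimensions are assumptions; this is
# not the CMCL code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """speech_emb, text_emb: (batch, dim) embeddings of paired utterances."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # Symmetric loss: speech-to-text and text-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    s = torch.randn(8, 256)   # speech-stream embeddings
    t = torch.randn(8, 256)   # text-stream embeddings
    print(cross_modal_contrastive_loss(s, t).item())
```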
{"title":"Improving Spoken Language Understanding with Cross-Modal Contrastive Learning","authors":"Jingjing Dong, Jiayi Fu, P. Zhou, Hao Li, Xiaorui Wang","doi":"10.21437/interspeech.2022-658","DOIUrl":"https://doi.org/10.21437/interspeech.2022-658","url":null,"abstract":"Spoken language understanding(SLU) is conventionally based on pipeline architecture with error propagation issues. To mitigate this problem, end-to-end(E2E) models are proposed to directly map speech input to desired semantic outputs. Mean-while, others try to leverage linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which utilizes cross-modal contrastive learning to learn better multi-modal representation. In particular, a two-stream multi-modal framework is designed, and a contrastive learning task is performed across speech and text representations. More-over, CMCL employs a multi-modal shared classification task combined with a contrastive learning task to guide the learned representation to improve the performance on the intent classification task. We also investigate the efficacy of employing cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on FSC and Smartlights datasets, respectively, outperforming state-of-the-art comparative methods. Also, performances only decrease by 0.32% and 2.8%, respectively, when trained on 10% and 1% of the FSC dataset, indicating its advantage under few-shot scenarios.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2693-2697"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43733271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10077
Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li
Recent studies indicate the effectiveness of deep learning (DL) based methods for acoustic echo cancellation (AEC) in background noise and nonlinear distortion scenarios. However, content and speaker variations degrade the performance of such DL-based AEC models. In this study, we propose an AEC model that takes phonetic and speaker identity features as auxiliary inputs, and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identity features extracted by a contrastive predictive coding network, a self-supervised pre-training model, and the complex spectrum generated by the short-time Fourier transform are treated as the spectrum pattern inputs for DPCTNet. In addition, DPCTNet applies an encoder-decoder architecture improved by inserting a dual-path transformer to effectively model the extracted inputs within a single frame and the dependence between consecutive frames. Comparative experimental results showed that AEC performance can be improved by explicitly considering phonetic and speaker identity features.
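A rough picture of the input preparation described above: a complex STFT spectrum is computed from the microphone signal and combined frame-by-frame with auxiliary phonetic/speaker embeddings. The sketch below assumes fusion by concatenation along the feature axis and illustrative shapes; it is not the DPCTNet definition.

```python
# Illustrative input preparation: complex STFT spectrum plus frame-aligned
# auxiliary embeddings, concatenated along the feature axis. Shapes and the
# concatenation choice are assumptions, not the paper's architecture.
import torch

def build_inputs(waveform: torch.Tensor, aux_emb: torch.Tensor,
                 n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """waveform: (batch, samples); aux_emb: (batch, frames, aux_dim)."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    # Real/imaginary parts as features: (batch, frames, 2 * freq_bins)
    spec = torch.view_as_real(spec).permute(0, 2, 1, 3).flatten(2)
    frames = min(spec.shape[1], aux_emb.shape[1])
    return torch.cat([spec[:, :frames], aux_emb[:, :frames]], dim=-1)

if __name__ == "__main__":
    wav = torch.randn(2, 16000)
    aux = torch.randn(2, 63, 256)       # hypothetical CPC-style frame embeddings
    print(build_inputs(wav, aux).shape)  # -> torch.Size([2, 63, 770])
```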
{"title":"Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation","authors":"Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li","doi":"10.21437/interspeech.2022-10077","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10077","url":null,"abstract":"Recent studies indicate the effectiveness of deep learning (DL) based methods for acoustic echo cancellation (AEC) in background noise and nonlinear distortion scenarios. However, content and speaker variations degrade the performance of such DL-based AEC models. In this study, we propose a AEC model that takes phonetic and speaker identities features as auxiliary inputs, and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identities features extracted by the contrastive predictive coding network that is a self-supervised pre-training model, and the complex spectrum generated by short time Fourier transform are treated as the spectrum pattern inputs for DPCTNet. In addition, the DPCTNet applies an encoder-decoder architecture improved by inserting a dual-path transformer to effectively model the extracted inputs in a single frame and the dependence between consecutive frames. Com-parative experimental results showed that the performance of AEC can be improved by explicitly considering phonetic and speaker identities features.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2513-2517"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43855916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10513
Sudarsana Reddy Kadiri, F. Javanmardi, P. Alku
Prior studies on the automatic classification of voice quality have mainly used support vector machine (SVM) classifiers with the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and using SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifiers. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to examine whether deep learning-based CNN classifiers with spectrogram and mel-spectrogram features can improve classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal yielded better classification of the voice qualities than the speech signal, and that the CNN classifier outperformed the SVM classifiers by large margins. The best mean classification accuracy was achieved with the mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).
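A minimal sketch of this kind of pipeline is given below, assuming torchaudio for the mel-spectrogram front-end and a small illustrative CNN; the layer sizes and mel settings are assumptions, not the paper's configuration.

```python
# Illustrative mel-spectrogram front-end feeding a small CNN classifier for
# three voice-quality classes. All hyper-parameters are assumed values.
import torch
import torch.nn as nn
import torchaudio

class VoiceQualityCNN(nn.Module):
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) from either the speech or the NSA channel
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        return self.net(feats)

if __name__ == "__main__":
    model = VoiceQualityCNN()
    x = torch.randn(4, 16000)          # four one-second 16 kHz signals
    print(model(x).shape)              # -> torch.Size([4, 3])
```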
{"title":"Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals","authors":"Sudarsana Reddy Kadiri, F. Javanmardi, P. Alku","doi":"10.21437/interspeech.2022-10513","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10513","url":null,"abstract":"Prior studies in the automatic classification of voice quality have mainly studied support vector machine (SVM) classifiers using the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and using SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifier. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to compare whether deep learning -based CNN classifiers with spectrogram and mel-spectrogram features are able to improve the classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal showed better classification of the voice qualities compared to the speech signal, and that the CNN classifier outperformed the SVM classifiers with large margins. The best mean classification accuracy was achieved with mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5253-5257"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43858994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-888
C. Ahn, Chamara Kasun, S. Sivadas, Jagath Rajapakse
{"title":"Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition","authors":"C. Ahn, Chamara Kasun, S. Sivadas, Jagath Rajapakse","doi":"10.21437/interspeech.2022-888","DOIUrl":"https://doi.org/10.21437/interspeech.2022-888","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"744-748"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46927325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11388
K. Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, S. Sagayama, H. Yamasue
Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head-nodding, a kind of visual backchannel, is used to co-construct conversation and is crucial to smooth social interaction. In the present study, we quantitatively analyze how head-nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was observed less frequently in ASD participants, especially around speakers' turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using 16-second sliding segments revealed that the synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance, with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding with respect to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphics (CG) agents.
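The classification step can be pictured with a generic scikit-learn SVM pipeline over per-participant features; the feature columns and data in the sketch below are hypothetical stand-ins, not the study's measures.

```python
# Generic SVM classification sketch (scikit-learn); the feature columns and
# labels are hypothetical stand-ins, not the study's data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))     # e.g., nod rate near turn transitions, nod-intensity synchrony
y = np.repeat([0, 1], 20)        # toy labels: 0 = ASD, 1 = TD

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"mean F1 over folds: {scores.mean():.2f}")
```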
{"title":"Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism","authors":"K. Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, S. Sagayama, H. Yamasue","doi":"10.21437/interspeech.2022-11388","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11388","url":null,"abstract":"Autism spectral disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head-nodding, a kind of visual backchannels, is used to co-construct the conversation and is crucial to smooth social interaction. In the present study, we quantitively analyze how head-nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was less frequently observed in ASD participants, especially around speakers’ turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using 16 sec of long-time sliding segments revealed that synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding according to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphic (CG) agents.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1136-1140"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42124598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11077
Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai
This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. Neural FCA, however, requires the number of sound sources to be fixed and known in advance. This requirement complicates its use in a DSR front-end for multi-speaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA that handles a dynamically changing number of sound sources by taking the temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results on the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rate.
{"title":"Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition","authors":"Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai","doi":"10.21437/interspeech.2022-11077","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11077","url":null,"abstract":"This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. The neural FCA, however, requires that the number of sound sources is fixed and known in advance. This requirement com-plicates its utilization for a front-end system of DSR for multispeaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA to handle a dynamically changing number of sound sources by taking temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results with the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rates.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3824-3828"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41762710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4