
Latest Interspeech Publications

An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10309
Qi Chen, Binghuai Lin, Yanlu Xie
Mispronunciation Detection and Diagnosis (MD&D) technology is used to detect mispronunciations and provide feedback. Most MD&D systems are based on phoneme recognition. However, few studies have made use of the predefined reference text that is provided to second language (L2) learners while they practice pronunciation. In this paper, we propose a novel alignment method based on linguistic knowledge of articulatory manner and place to align the phone sequence of the reference text with the L2 learner's speech. After obtaining the alignment results, we concatenate the corresponding phoneme embedding with the acoustic features of each speech frame as input. This method makes reasonable use of the reference text as extra input. Experimental results show that with this method the model can implicitly learn valid information from the reference text. Meanwhile, it avoids introducing misleading information from the reference text, which would otherwise cause false acceptances (FA). In addition, the method incorporates articulatory features, which helps the model recognize phonemes. We evaluate the method on the L2-ARCTIC dataset, and our approach improves the F1-score over the state-of-the-art system by 4.9% relative.
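The input construction described above is simple to sketch: once each speech frame has been aligned to a phone of the reference text, the corresponding phoneme embedding is concatenated to that frame's acoustic features. The following is a minimal PyTorch sketch under assumed dimensions; the phone inventory size, embedding size, feature size, and the alignment itself are hypothetical placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 44            # assumed phone inventory size
EMB_DIM, FEAT_DIM = 64, 80   # assumed embedding / acoustic feature sizes

phone_embedding = nn.Embedding(NUM_PHONEMES, EMB_DIM)

def build_inputs(acoustic_feats, frame_to_phone):
    """Concatenate each frame's acoustic features with the embedding of the
    reference phone it was aligned to.

    acoustic_feats: (T, FEAT_DIM) float tensor, e.g. log-Mel filterbanks
    frame_to_phone: (T,) long tensor of phone indices from the alignment
    returns:        (T, FEAT_DIM + EMB_DIM) tensor fed to the MD&D model
    """
    phone_embs = phone_embedding(frame_to_phone)           # (T, EMB_DIM)
    return torch.cat([acoustic_feats, phone_embs], dim=-1)

# toy usage with random data standing in for real features and alignments
T = 200
feats = torch.randn(T, FEAT_DIM)
alignment = torch.randint(0, NUM_PHONEMES, (T,))
x = build_inputs(feats, alignment)
print(x.shape)  # torch.Size([200, 144])
```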
{"title":"An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English","authors":"Qi Chen, Binghuai Lin, Yanlu Xie","doi":"10.21437/interspeech.2022-10309","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10309","url":null,"abstract":"Mispronunciation Detection and Diagnosis (MD&D) technology is used for detecting mispronunciations and providing feedback. Most MD&D systems are based on phoneme recognition. However, few studies have made use of the predefined reference text which has been provided to second language (L2) learners while practicing pronunciation. In this paper, we propose a novel alignment method based on linguistic knowledge of articulatory manner and places to align the phone sequences of the reference text with L2 learners speech. After getting the alignment results, we concatenate the corresponding phoneme embedding and the acoustic features of each speech frame as input. This method makes reasonable use of the reference text information as extra input. Experimental results show that the model can implicitly learn valid information in the reference text by this method. Meanwhile, it avoids introducing misleading information in the reference text, which will cause false acceptance (FA). Besides, the method incorporates articulatory features, which helps the model recognize phonemes. We evaluate the method on the L2-ARCTIC dataset and it turns out that our approach improves the F1-score over the state-of-the-art system by 4.9% relative.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4342-4346"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46451269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Investigation on the Band Importance of Phase-aware Speech Enhancement
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-284
Z. Zhang, D. Williamson, Yi Shen
Many existing phase-aware speech enhancement algorithms consider the phase at all spectral frequencies to be equally important to perceptual quality and intelligibility. Although improvements are observed according to both objective and subjective measures, as compared to phase-insensitive approaches, it is not clear whether phase information is equally important across the frequency spectrum. In this paper, we investigate the importance of estimating phase across spectral regions by conducting a pairwise listening study to determine whether phase enhancement can be limited to certain frequency bands. Our experimental results suggest that estimating phase in the lower-frequency bands matters most for speech quality in normal-hearing (NH) listeners. We further propose a hybrid deep-learning framework that adopts two sub-networks to handle phase differently across the spectrum. The proposed hybrid-net significantly improves compatibility with low-resource platforms while achieving performance superior to the original phase-aware speech enhancement approaches.
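One way to picture the band-split idea is to enhance magnitude over the full band but estimate phase only for the lower-frequency bins, reusing the noisy phase above a cut-off. The sketch below is a hypothetical minimal version of such a hybrid, not the authors' architecture; the sub-networks, cut-off bin, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

N_BINS, LOW_BINS, HID = 257, 64, 256   # assumed STFT size and low-band cut-off

class ToyHybridEnhancer(nn.Module):
    """Magnitude is enhanced over the full band; phase is only estimated for
    the lowest LOW_BINS bins, and the noisy phase is kept elsewhere."""
    def __init__(self):
        super().__init__()
        self.mag_net = nn.Sequential(nn.Linear(N_BINS, HID), nn.ReLU(),
                                     nn.Linear(HID, N_BINS), nn.Sigmoid())
        self.phase_net = nn.Sequential(nn.Linear(2 * LOW_BINS, HID), nn.ReLU(),
                                       nn.Linear(HID, LOW_BINS))

    def forward(self, mag, phase):
        # mag, phase: (batch, frames, N_BINS)
        mag_hat = mag * self.mag_net(mag)                     # masked magnitude
        low_in = torch.cat([mag[..., :LOW_BINS], phase[..., :LOW_BINS]], dim=-1)
        low_phase = self.phase_net(low_in)                    # enhanced low-band phase
        phase_hat = torch.cat([low_phase, phase[..., LOW_BINS:]], dim=-1)
        return mag_hat, phase_hat

model = ToyHybridEnhancer()
mag, phase = torch.rand(2, 100, N_BINS), torch.rand(2, 100, N_BINS)
mag_hat, phase_hat = model(mag, phase)
print(mag_hat.shape, phase_hat.shape)  # both torch.Size([2, 100, 257])
```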
{"title":"Investigation on the Band Importance of Phase-aware Speech Enhancement","authors":"Z. Zhang, D. Williamson, Yi Shen","doi":"10.21437/interspeech.2022-284","DOIUrl":"https://doi.org/10.21437/interspeech.2022-284","url":null,"abstract":"Many existing phase-aware speech enhancement algorithms consider the phase at all spectral frequencies to be equally important to perceptual quality and intelligibility. Although im-provements are observed according to both objective and subjective measures, as compared to phase-insensitive approaches, it is not clear whether phase information is equally important across the frequency spectrum. In this paper, we investigate the importance of estimating phase across spectral regions, by conducting a pairwise listening study to determine if phase enhancement can be limited to certain frequency bands. Our experimental results suggest that estimating phase at lower-frequency bands is mostly important for speech quality in normal-hearing (NH) listeners. We further propose a hybrid deep-learning framework that adopts two sub-networks for handling phase differently across the spectrum. The proposed hybrid-net significantly improves the model compatibility with low-resource platforms while achieving superior performance to the original phase-aware speech enhancement approaches.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4651-4655"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47581177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10821
J. Barker, M. Akeroyd, T. Cox, J. Culling, J. Firth, S. Graetzer, Holly Griffiths, Lara Harris, G. Naylor, Zuzanna Podwinska, Eszter Porter, R. V. Muñoz
{"title":"The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction","authors":"J. Barker, M. Akeroyd, T. Cox, J. Culling, J. Firth, S. Graetzer, Holly Griffiths, Lara Harris, G. Naylor, Zuzanna Podwinska, Eszter Porter, R. V. Muñoz","doi":"10.21437/interspeech.2022-10821","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10821","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3508-3512"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46317919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Audio-Visual Scene Classification Based on Multi-modal Graph Fusion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-741
Hancheng Lei, Ning-qiang Chen
The Audio-Visual Scene Classification (AVSC) task aims to classify scenes through joint analysis of the audio and video modalities. Most existing AVSC models are based on feature-level or decision-level fusion. The possible problems are: i) because the distributions of corresponding features differ substantially across modalities, directly concatenating them in feature-level fusion may not yield good performance; ii) decision-level fusion cannot take full advantage of the common and complementary properties of the features and the corresponding similarities across modalities. To solve these problems, a Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for the AVSC task. First, a Deep Neural Network (DNN) is trained to extract essential features from each modality. Then, a Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed from the features of each modality. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced to realize feature-level and similarity-level fusion, respectively, ensuring classification accuracy. Experimental results on the TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN, achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.
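The similarity-graph step can be sketched generically: build a sample-to-sample cosine-similarity adjacency from the per-modality embeddings and propagate features with a plain GCN layer. Everything below (dimensions, clamping of negative similarities, a single vanilla GCN layer) is an assumed simplification for illustration, not the DM-GCN/AT-GCN design from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_similarity_graph(audio_emb, video_emb):
    """Sample-to-sample cross-similarity adjacency between two modalities.
    audio_emb, video_emb: (N, D) embeddings for the same N samples."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    return a @ v.t()                                  # (N, N) cosine similarities

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: symmetric normalisation, then a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        adj = adj.clamp(min=0) + torch.eye(adj.size(0))   # non-negative edges + self-loops
        deg_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
        adj_norm = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        return F.relu(self.lin(adj_norm @ x))

N, D = 16, 128
audio, video = torch.randn(N, D), torch.randn(N, D)
adj = cross_similarity_graph(audio, video)
fused = SimpleGCNLayer(2 * D, D)(torch.cat([audio, video], dim=-1), adj)
print(fused.shape)   # torch.Size([16, 128])
```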
{"title":"Audio-Visual Scene Classification Based on Multi-modal Graph Fusion","authors":"Hancheng Lei, Ning-qiang Chen","doi":"10.21437/interspeech.2022-741","DOIUrl":"https://doi.org/10.21437/interspeech.2022-741","url":null,"abstract":"Audio-Visual Scene Classification (AVSC) task tries to achieve scene classification through joint analysis of the audio and video modalities. Most of the existing AVSC models are based on feature-level or decision-level fusion. The possible problems are: i) Due to the distribution difference of the corresponding features in different modalities is large, the direct concatenation of them in the feature-level fusion may not result in good performance. ii) The decision-level fusion cannot take full advantage of the common as well as complementary properties between the features and corresponding similarities of different modalities. To solve these problems, Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for AVSC task. First, the Deep Neural Network (DNN) is trained to extract essential feature from each modality. Then, the Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed based on each modality features. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced respectively to realize both feature-level and similarity-level fusion to ensure the classification accuracy. Experimental results on TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4157-4161"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46404103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRU
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-136
R. Rajan, Ananya Ayasi
A distinguishing feature of the music repertoire of the Syrian tradition is its system of classifying melodies into eight tunes, called the 'oktoēchos'. It inspired many traditions, such as Greek and Indian liturgical music. In the oktoēchos tradition, liturgical hymns are sung in eight modes or eight colours (known regionally as eight 'niram'). In this paper, automatic oktoēchos genre classification is addressed using musical texture features (MTF), i-vectors and Mel-spectrograms through stacked bidirectional and unidirectional long short-term memory (SBU-LSTM) and GRU (SB-GRU) architectures. The performance of the proposed approaches is evaluated on a newly created corpus of liturgical music in Malayalam. The SBU-LSTM and SB-GRU frameworks report average classification accuracies of 88.19% and 87.50%, respectively, a significant margin over other frameworks. The experiments demonstrate the potential of stacked architectures to learn temporal information from MTF for the proposed task.
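A stacked bidirectional-then-unidirectional recurrent classifier is straightforward to sketch. The layer sizes, mean pooling, and feature dimension below are assumptions for illustration, not the exact configuration used in the paper; only the eight output classes (one per mode) come from the abstract.

```python
import torch
import torch.nn as nn

class SBULSTMClassifier(nn.Module):
    """A bidirectional LSTM stacked under a unidirectional LSTM, followed by
    mean pooling over time and a linear classifier over the eight modes."""
    def __init__(self, feat_dim=40, hidden=128, n_classes=8):
        super().__init__()
        self.bi = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.uni = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, frames, feat_dim), e.g. MTF features
        h, _ = self.bi(x)              # (batch, frames, 2*hidden)
        h, _ = self.uni(h)             # (batch, frames, hidden)
        return self.cls(h.mean(dim=1)) # (batch, n_classes) logits

model = SBULSTMClassifier()
logits = model(torch.randn(4, 300, 40))
print(logits.shape)   # torch.Size([4, 8])
```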
{"title":"Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRU","authors":"R. Rajan, Ananya Ayasi","doi":"10.21437/interspeech.2022-136","DOIUrl":"https://doi.org/10.21437/interspeech.2022-136","url":null,"abstract":"A distinguishing feature of the music repertoire of the Syrian tradition is the system of classifying melodies into eight tunes, called ’oktoe¯chos’. It inspired many traditions, such as Greek and Indian liturgical music. In oktoe¯chos tradition, liturgical hymns are sung in eight modes or eight colours (known as eight ’niram’, regionally). In this paper, the automatic oktoe¯chos genre classification is addressed using musical texture features (MTF), i-vectors and Mel-spectrograms through stacked bidirectional and unidirectional long-short term memory (SBU-LSTM) and GRU (SB-GRU) architectures. The performance of the proposed approaches is evaluated using a newly created corpus of liturgical music in Malayalam. SBU-LSTM and SB-GRU frameworks report average classification accuracy of 88.19% and 87.50%, with a significant margin over other frameworks. The experiments demonstrate the potential of stacked architectures in learning temporal information from MTF for the proposed task.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2403-2407"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46409817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10600
Shichao Hu, Bin Zhang, Jinhong Lu, Yiliang Jiang, Wucheng Wang, Lingchen Kong, Weifeng Zhao, Tao Jiang
Cover song identification (CSI) has been a challenging task and an important topic in the music information retrieval (MIR) community. In recent years, CSI problems have been extensively studied with deep learning methods. In this paper, we propose a novel framework for CSI based on a joint representation learning method inspired by multi-task learning. Specifically, we propose a joint learning strategy that combines classification and metric learning to optimize a cover song model based on WideResNet, called LyraC-Net. The classification objective learns separable embeddings for different classes, while metric learning optimizes embedding similarity by decreasing the intra-class distance and increasing the inter-class separability. This joint optimization strategy is expected to learn a more robust cover song representation than methods with a single training objective. For the metric learning, a prototypical network is introduced, together with a triplet loss, to stabilize and accelerate the training process. Furthermore, we introduce SpecAugment, a popular augmentation method in speech recognition, to further improve performance. Experimental results show that our proposed method achieves promising results and outperforms other recent CSI methods in the evaluations.
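The joint objective is easy to illustrate: a cross-entropy classification term plus a triplet metric-learning term on the embeddings. The sketch below shows only that generic combination with an assumed weighting and margin; it is not the exact LyraC-Net recipe, and the embedding/class dimensions are placeholders.

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Joint objective: softmax classification plus a triplet metric-learning
    term on the embeddings produced by the backbone."""
    def __init__(self, emb_dim=256, n_classes=1000, margin=0.3, alpha=1.0):
        super().__init__()
        self.classifier = nn.Linear(emb_dim, n_classes)
        self.ce = nn.CrossEntropyLoss()
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.alpha = alpha   # assumed weight balancing the two terms

    def forward(self, anchor, positive, negative, labels):
        # anchor/positive/negative: (B, emb_dim) embeddings; labels: (B,) class ids
        cls_loss = self.ce(self.classifier(anchor), labels)
        met_loss = self.triplet(anchor, positive, negative)
        return cls_loss + self.alpha * met_loss

loss_fn = JointLoss()
B = 8
rand_emb = lambda: torch.randn(B, 256)   # stand-ins for backbone embeddings
loss = loss_fn(rand_emb(), rand_emb(), rand_emb(), torch.randint(0, 1000, (B,)))
loss.backward()
print(loss.item())
```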
{"title":"WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification","authors":"Shichao Hu, Bin Zhang, Jinhong Lu, Yiliang Jiang, Wucheng Wang, Lingchen Kong, Weifeng Zhao, Tao Jiang","doi":"10.21437/interspeech.2022-10600","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10600","url":null,"abstract":"Cover song identification (CSI) has been a challenging task and an import topic in music information retrieval (MIR) commu-nity. In recent years, CSI problems have been extensively stud-ied based on deep learning methods. In this paper, we propose a novel framework for CSI based on a joint representation learning method inspired by multi-task learning. In specific, we propose a joint learning strategy which combines classification and metric learning for optimizing the cover song model based on WideResNet, called LyraC-Net. Classification objective learns separable embeddings from different classes, while metric learning optimizes embedding similarity by decreasing the inter-class distance and increasing the intra-classs separabil-ity. This joint optimization strategy is expected to learn a more robust cover song representation than methods with single training objectives. For the metric learning, prototypical network is introduced to stabilize and accelerate the training process, to-gether with triplet loss. Furthermore, we introduce SpecAugment, a popular augmentation method in speech recognition, to further improve the performance. Experiment results show that our proposed method achieves promising results and outperforms other recent CSI methods in the evaluations.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4187-4191"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46421152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-468
Feifei Xiong, Weiguang Chen, P. Wang, Xiaofei Li, Jinwei Feng
This paper presents an improved subband neural network applied to joint speech denoising and dereverberation for online single-channel scenarios. Preserving the advantages of the subband model (SubNet), which processes each frequency band independently and requires only a small amount of resources for good generalization, the proposed framework, named STSubNet, exploits sufficient spectro-temporal receptive fields (STRFs) from the speech spectrum via a two-dimensional convolution network cooperating with a bidirectional long short-term memory network across frequency bands, to further improve the network's discrimination between the desired speech component and undesired interference, including noise and reverberation. The importance of this STRF extractor is analyzed by evaluating the contribution of each module to STSubNet's performance for simultaneous denoising and dereverberation. Experimental results show that STSubNet outperforms other subband variants and achieves performance competitive with state-of-the-art models on two public benchmark test sets.
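The combination described above, a 2-D convolution over the time-frequency plane followed by a bidirectional LSTM that sweeps across frequency, can be sketched minimally as below. This is a hypothetical toy version (channel counts, single layers, a sigmoid magnitude mask), not the actual STSubNet configuration.

```python
import torch
import torch.nn as nn

class ToySpectroTemporalNet(nn.Module):
    """2-D convolution over the time-frequency plane, then a bidirectional LSTM
    run along the frequency axis of every frame, then a per-bin mask."""
    def __init__(self, channels=16, hidden=64):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.freq_lstm = nn.LSTM(channels, hidden, batch_first=True,
                                 bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, spec):                             # spec: (B, T, F) magnitude
        b, t, f = spec.shape
        h = torch.relu(self.conv(spec.unsqueeze(1)))     # (B, C, T, F)
        h = h.permute(0, 2, 3, 1).reshape(b * t, f, -1)  # one sequence per frame
        h, _ = self.freq_lstm(h)                         # scan across frequency bands
        mask = torch.sigmoid(self.out(h)).reshape(b, t, f)
        return spec * mask                               # enhanced magnitude

net = ToySpectroTemporalNet()
enhanced = net(torch.rand(2, 100, 257))
print(enhanced.shape)   # torch.Size([2, 100, 257])
```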
{"title":"Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation","authors":"Feifei Xiong, Weiguang Chen, P. Wang, Xiaofei Li, Jinwei Feng","doi":"10.21437/interspeech.2022-468","DOIUrl":"https://doi.org/10.21437/interspeech.2022-468","url":null,"abstract":"This paper presents an improved subband neural network applied to joint speech denoising and dereverberation for online single-channel scenarios. Preserving the advantages of subband model (SubNet) that processes each frequency band in-dependently and requires small amount of resources for good generalization, the proposed framework named STSubNet ex-ploits sufficient spectro-temporal receptive fields (STRFs) from speech spectrum via a two-dimensional convolution network cooperating with a bi-directional long short-term memory network across frequency bands, to further improve the neural network discrimination between desired speech component and undesired interference including noise and reverberation. The importance of this STRF extractor is analyzed by evaluating the contribution of individual module to the STSubNet performance for simultaneously denoising and dereverberation. Experimental results show that STSubNet outperforms other subband variants and achieves competitive performance compared to state-of-the-art models on two publicly benchmark test sets.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"931-935"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46609959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Linguistically Informed Post-processing for ASR Error correction in Sanskrit
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11189
Rishabh Kumar, D. Adiga, R. Ranjan, A. Krishna, Ganesh Ramakrishnan, Pawan Goyal, P. Jyothi
We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search-space enrichment with linguistic information. More specifically, to address the challenges posed by the high rate of out-of-vocabulary entries in the language, we first use a subword-based language model and acoustic model to generate a search space. The resulting search space is converted into a word-based search space and further enriched with morphological and lexical information from a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our approach currently reports the state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER over the previous state of the art.
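Conceptually, the final rescoring step combines the existing lattice scores with a reward for morphologically valid words. The helper below is purely hypothetical: the score combination, weights, and bonus are illustrative assumptions, not the paper's formulation.

```python
def rescore_lattice_arc(acoustic_score, lm_score, morph_valid,
                        lm_weight=0.6, morph_bonus=1.5):
    """Hypothetical log-domain rescoring of one word-lattice transition:
    combine acoustic and LM scores and reward words accepted by the
    morphological parser. All weights are illustrative."""
    score = acoustic_score + lm_weight * lm_score
    if morph_valid:
        score += morph_bonus
    return score

# toy usage: the morphologically valid candidate wins despite a lower LM score
print(rescore_lattice_arc(-12.0, -3.5, morph_valid=True))    # -12.6
print(rescore_lattice_arc(-12.0, -3.0, morph_valid=False))   # -13.8
```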
{"title":"Linguistically Informed Post-processing for ASR Error correction in Sanskrit","authors":"Rishabh Kumar, D. Adiga, R. Ranjan, A. Krishna, Ganesh Ramakrishnan, Pawan Goyal, P. Jyothi","doi":"10.21437/interspeech.2022-11189","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11189","url":null,"abstract":"We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search space enrichment with linguistic information. More specifically, to address the challenges due to the high degree of out-of-vocabulary entries present in the language, we first use a subword-based language model and acoustic model to generate a search space. The search space, so obtained, is converted into a word-based search space and is further enriched with morphological and lexical information based on a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our proposed approach currently reports the state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER than the previous state-of-the-art.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2293-2297"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46613564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Bring dialogue-context into RNN-T for streaming ASR
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-697
Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma
Recently, conversational end-to-end (E2E) automatic speech recognition (ASR) models, which directly integrate dialogue-context such as historical utterances into E2E models, have shown superior performance to single-utterance E2E models. However, few works investigate how to inject dialogue-context into the recurrent neural network transducer (RNN-T) model. In this work, we bring dialogue-context into a streaming RNN-T model and explore various structures of the contextual RNN-T model as well as training strategies that better utilize the dialogue-context. First, we propose a deep fusion architecture that efficiently integrates the dialogue-context within the encoder and predictor of the RNN-T. Second, we propose joint training of contextual and non-contextual models as regularization, and propose context perturbation to alleviate the context mismatch between training and inference. Moreover, we adopt a context-aware language model (CLM) for contextual RNN-T decoding to take full advantage of the dialogue-context for conversational ASR. We conduct experiments on the Switchboard-2000h task and observe performance gains from the proposed techniques. Compared with the non-contextual RNN-T, our contextual RNN-T model yields 4.8% / 6.0% relative improvement on the Switchboard and CallHome Hub5'00 test sets. By additionally integrating a CLM, the gain is further increased to 10.6% / 7.8%.
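One simple way to picture fusing dialogue-context into the prediction network is to project a context vector and concatenate it to every label embedding before the recurrent layer. The sketch below is a hypothetical minimal version of that idea, not the paper's exact encoder/predictor integration; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextFusedPredictor(nn.Module):
    """RNN-T prediction network with a dialogue-context vector fused in:
    the projected context embedding is concatenated to every label embedding
    before the LSTM."""
    def __init__(self, vocab=1000, emb=256, ctx_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.ctx_proj = nn.Linear(ctx_dim, emb)
        self.rnn = nn.LSTM(2 * emb, hidden, batch_first=True)

    def forward(self, labels, ctx):
        # labels: (B, U) previous tokens; ctx: (B, ctx_dim) dialogue-context vector
        u = labels.size(1)
        ctx = self.ctx_proj(ctx).unsqueeze(1).expand(-1, u, -1)   # (B, U, emb)
        x = torch.cat([self.embed(labels), ctx], dim=-1)
        out, _ = self.rnn(x)
        return out                                                # (B, U, hidden)

pred = ContextFusedPredictor()
out = pred(torch.randint(0, 1000, (2, 10)), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 10, 512])
```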
{"title":"Bring dialogue-context into RNN-T for streaming ASR","authors":"Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma","doi":"10.21437/interspeech.2022-697","DOIUrl":"https://doi.org/10.21437/interspeech.2022-697","url":null,"abstract":"Recently the conversational end-to-end (E2E) automatic speech recognition (ASR) models, which directly integrate dialogue-context such as historical utterances into E2E models, have shown superior performance than single-utterance E2E models. However, few works investigate how to inject the dialogue-context into the recurrent neural network transducer (RNN-T) model. In this work, we bring dialogue-context into a streaming RNN-T model and explore various structures of contextual RNN-T model as well as training strategies to better utilize the dialogue-context. Firstly, we propose a deep fusion architecture which efficiently integrates the dialogue-context within the encoder and predictor of RNN-T. Secondly, we propose contextual & non-contextual model joint training as regularization, and propose context perturbation to relieve the context mismatch between training and inference. Moreover, we adopt a context-aware language model (CLM) for contextual RNN-T decoding to take full advantage of the dialogue-context for conversational ASR. We conduct experiments on the Switchboard-2000h task and observe performance gains from the proposed techniques. Compared with non-contextual RNN-T, our contextual RNN-T model yields 4.8% / 6.0% relative improvement on Switchboard and Callhome Hub5’00 testsets. By additionally integrating a CLM, the gain is further increased to 10.6% / 7.8%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2048-2052"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46695012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11152
Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Shogo Seki
Neural vocoders have recently become popular in text-to-speech synthesis and voice conversion, increasing the demand for efficient neural vocoders. One successful approach is HiFi-GAN, which achieves high-fidelity audio synthesis with a relatively small model. This characteristic is obtained using a generator incorporating multi-receptive field fusion (MRF) with multiple branches of residual blocks, allowing the description capacity to be expanded with few-channel convolutions. However, MRF requires the model size to grow with the number of branches. As an alternative, we propose a network called MISRNet, which incorporates a novel module called the multi-input single shared residual block (MISR). MISR enlarges the description capacity by enriching the input variation using lightweight convolutions with a kernel size of 1 and, in turn, reduces the residual blocks from multiple to a single shared one. Because the model size of the input convolutions is significantly smaller than that of the residual blocks, MISR reduces the model size compared with MRF. Furthermore, we introduce an implementation technique for MISR in which the processing speed is accelerated by tensor reshaping. We experimentally applied our ideas to lightweight variants of HiFi-GAN and iSTFTNet, making the models more lightweight with comparable speech quality and without compromising speed.
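A rough sketch of the MISR idea: several kernel-size-1 input convolutions enrich the input, a single residual block is shared across all of them, and the variants are folded into the batch dimension so the shared block runs in one pass (the reshaping trick). This is a hypothetical illustration under assumed channel counts and aggregation, not the MISRNet implementation.

```python
import torch
import torch.nn as nn

class ToyMISRBlock(nn.Module):
    """Multi-input single shared residual block: kernel-size-1 input convs
    enrich the input, and one shared residual block processes all variants
    in a single pass via tensor reshaping."""
    def __init__(self, channels=64, n_inputs=3, kernel=3):
        super().__init__()
        self.inputs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=1) for _ in range(n_inputs)])
        self.shared = nn.Sequential(
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel, padding=kernel // 2))

    def forward(self, x):                                   # x: (B, C, T)
        b, c, t = x.shape
        variants = torch.stack([conv(x) for conv in self.inputs], dim=1)  # (B, N, C, T)
        n = variants.size(1)
        flat = variants.reshape(b * n, c, t)                # fold variants into batch dim
        out = self.shared(flat) + flat                      # one shared residual block
        return out.reshape(b, n, c, t).sum(dim=1)           # aggregate the variants

block = ToyMISRBlock()
y = block(torch.randn(2, 64, 100))
print(y.shape)   # torch.Size([2, 64, 100])
```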
{"title":"MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks","authors":"Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Shogo Seki","doi":"10.21437/interspeech.2022-11152","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11152","url":null,"abstract":"Neural vocoders have recently become popular in text-to-speech synthesis and voice conversion, increasing the demand for efficient neural vocoders. One successful approach is HiFi-GAN, which archives high-fidelity audio synthesis using a relatively small model. This characteristic is obtained using a generator incorporating multi-receptive field fusion (MRF) with multiple branches of residual blocks, allowing the expansion of the description capacity with few-channel convolutions. How-ever, MRF requires the model size to increase with the number of branches. Alternatively, we propose a network called MISRNet , which incorporates a novel module called multi-input single shared residual block (MISR) . MISR enlarges the description capacity by enriching the input variation using lightweight convolutions with a kernel size of 1 and, alternatively, reduces the variation of residual blocks from multiple to single. Because the model size of the input convolutions is significantly smaller than that of the residual blocks, MISR reduces the model size compared with that of MRF. Furthermore, we introduce an implementation technique for MISR, where we accelerate the processing speed by adopting tensor reshaping. We experimentally applied our ideas to lightweight variants of HiFi-GAN and iSTFTNet, making the models more lightweight with comparable speech quality and without compromising speed. 1","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1631-1635"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41526941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3