Few-shot learning for E2E speech recognition: architectural variants for support set generation
Dhanya Eledath, Narasimha Rao Thurlapati, V. Pavithra, Tirthankar Banerjee, V. Ramasubramanian
2022 30th European Signal Processing Conference (EUSIPCO), published 2022-08-29
DOI: 10.23919/eusipco55093.2022.9909613
Citations: 1
Abstract
In this paper, we propose two architectural variants of our recent adaptation of the ‘few-shot learning’ (FSL) framework ‘Matching Networks’ (MN) to end-to-end (E2E) continuous speech recognition (CSR), in a formulation termed ‘MN-CTC’ which involves CTC-loss based end-to-end episodic training of MN and an associated CTC-based decoding of continuous speech. An important component of MN theory is the labelled support set used during training and inference. The architectural variants proposed and studied here for E2E CSR, namely the ‘Uncoupled MN-CTC’ and the ‘Coupled MN-CTC’, address the problem of generating supervised support sets from continuous speech. While the ‘Uncoupled MN-CTC’ generates the support sets ‘outside’ the MN architecture, the ‘Coupled MN-CTC’ variant is a derivative framework which generates the support set ‘within’ the MN architecture through a multi-task formulation that couples the support-set generation loss with the main MN-CTC loss, jointly optimizing the support sets and the embedding functions of MN. On the TIMIT and Librispeech datasets, we establish the ‘few-shot’ effectiveness of the proposed variants in terms of phone error rate (PER) and letter error rate (LER) performance, and also demonstrate the cross-domain applicability of the MN-CTC formulation: a Librispeech-trained ‘Coupled MN-CTC’ variant performing inference on the low-resource TIMIT target corpus achieves an 8% (absolute) LER advantage over a single-domain (TIMIT-only) scenario.
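The core Matching Networks step the abstract builds on — classifying a query embedding by attention over a labelled support set — can be sketched as follows. This is a minimal illustration of the generic MN attention mechanism, not the authors' MN-CTC implementation: the embedding vectors, the `mn_attention` helper, and the toy data are all hypothetical, and the paper's actual embedding functions, CTC-based training, and support-set generation are not reproduced here.

```python
import numpy as np

def mn_attention(query_emb, support_embs, support_labels, num_classes):
    """Matching Networks attention (sketch): classify a query embedding by
    cosine-similarity attention over a labelled support set."""
    # Cosine similarities between the query and each support embedding
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q
    # Softmax attention weights over the support set
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # Attention-weighted sum of one-hot support labels -> class posterior
    onehot = np.eye(num_classes)[support_labels]
    return a @ onehot

# Toy example (hypothetical data): 4 support embeddings for 2 classes,
# and a query lying close to a class-0 support vector.
rng = np.random.default_rng(0)
support = rng.normal(size=(4, 8))
labels = np.array([0, 0, 1, 1])
query = support[0] + 0.01 * rng.normal(size=8)
posterior = mn_attention(query, support, labels, num_classes=2)
```

In the ‘Coupled MN-CTC’ variant described above, a posterior of this kind would be produced per frame and fed to a CTC loss, with the support embeddings themselves optimized jointly via the coupled multi-task objective.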