Comparison and analysis of new curriculum criteria for end-to-end ASR

IF 2.4 | Zone 3, Computer Science | Q2 ACOUSTICS | Speech Communication | Pub Date: 2024-07-31 | DOI: 10.1016/j.specom.2024.103113

Georgios Karakasidis, Mikko Kurimo, Peter Bell, Tamás Grósz

Speech Communication, volume 163, Article 103113. Journal article, not open access. Source: https://www.sciencedirect.com/science/article/pii/S0167639324000840

Citations: 0

Abstract

Traditionally, teaching a human and training a Machine Learning (ML) model have been quite different tasks, but organized and structured learning can enable faster and better understanding of the underlying concepts. For example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards more complex structures such as words and sentences. Motivated by this observation, researchers have started to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty, resembles the notion of a curriculum in education, the methodology became known as Curriculum Learning (CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition (ASR) systems, focusing specifically on so-called end-to-end models. These models consist of a single, large-scale neural network that performs the recognition task, in contrast to the traditional approach of having several specialized components that handle different subtasks (e.g., acoustic and language modeling). We demonstrate that end-to-end models can achieve better performance if they are provided with an organized training set whose examples exhibit an increasing level of difficulty. To impose structure on the training set and to define the notion of an easy example, we explored multiple solutions that either use external, static scoring methods or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions, which control how much data is presented to the network during each training epoch. Our proposed curriculum learning strategies were tested on speech recognition with two data sets: one containing spontaneous Finnish speech, where volunteers were asked to speak about a given topic, and one containing planned English speech. Empirical results showed that a good curriculum strategy can yield performance improvements and speed up convergence. After a given number of epochs, our best strategy achieved a 5.6% and 3.4% decrease in test set word error rate for the Finnish and English data sets, respectively.
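The interplay between a difficulty score and a pacing function described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear pacing shape, the `start_fraction` parameter, and the example scores are all assumptions made for illustration. The scores could come from a static external measure or from model feedback such as a per-utterance loss; the sketch only assumes that a lower score means an easier example.

```python
from typing import Callable, List, Sequence


def linear_pacing(epoch: int, total_epochs: int, start_fraction: float = 0.2) -> float:
    """Hypothetical pacing function: fraction of the sorted training set
    that is visible at `epoch` (0-indexed), growing linearly to 1.0."""
    if total_epochs <= 1:
        return 1.0
    progress = epoch / (total_epochs - 1)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def curriculum_subset(
    utterances: Sequence[str],
    scores: Sequence[float],
    epoch: int,
    total_epochs: int,
    pacing: Callable[[int, int], float] = linear_pacing,
) -> List[str]:
    """Return the easiest utterances allowed by the pacing function.

    `scores` holds one difficulty value per utterance (lower = easier).
    This is an assumed convention, not one taken from the paper.
    """
    # Sort indices from easiest to hardest.
    order = sorted(range(len(utterances)), key=lambda i: scores[i])
    fraction = pacing(epoch, total_epochs)
    # Always expose at least one example, never more than the full set.
    n_visible = max(1, min(len(order), int(round(fraction * len(order)))))
    return [utterances[i] for i in order[:n_visible]]


if __name__ == "__main__":
    # Made-up utterance ids and difficulty scores, for illustration only.
    utts = ["a", "b", "c", "d", "e"]
    difficulty = [3.0, 1.0, 5.0, 2.0, 4.0]
    for epoch in range(5):
        print(epoch, curriculum_subset(utts, difficulty, epoch, total_epochs=5))
```

With this design, the scoring method and the pacing function are independent choices: swapping in a different `pacing` callable changes how fast the training set grows, while the ordering is determined only by the scores.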

Source journal

Speech Communication (Engineering and Technology; Computer Science: Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Articles published

Review time

19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.