Luca Della Libera;Pooneh Mousavi;Salah Zaiem;Cem Subakan;Mirco Ravanelli
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4931-4944, published 2024-10-29
DOI: 10.1109/TASLP.2024.3487410
URL: https://ieeexplore.ieee.org/document/10737390/
Journal impact factor: 4.1 (JCR Q1, Acoustics)
Abstract
Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.
About the journal:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech, and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering, and document indexing and retrieval, as well as general language modeling.