Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
arXiv:2409.07165 (arXiv - EE - Audio and Speech Processing, 11 September 2024)
Abstract
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time alternative to self-attention for non-streaming speech recognition that, for the first time, matches or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
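To make the linear-time idea concrete, below is a minimal PyTorch sketch of a SummaryMixing block, following the description in the original SummaryMixing paper: each frame is passed through a local transformation f(x_t), a summary transformation s(x_t) is averaged over time into a single summary vector, and a combiner merges the two. The module and parameter names (local, summary, combine, d_hidden) are illustrative, and the chunk-wise causal summary in streaming mode is an assumption about how a streaming extension could behave, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
from typing import Optional


class SummaryMixing(nn.Module):
    """Minimal sketch of a SummaryMixing block with an offline mode and an
    assumed chunk-wise streaming mode. Not the authors' implementation."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Per-frame "local" transformation f(x_t).
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        # Per-frame "summary" transformation s(x_t), later averaged over time.
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        # Combiner c([f(x_t); summary]) projecting back to d_model.
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor, chunk_size: Optional[int] = None):
        # x: (batch, time, d_model)
        f = self.local(x)        # (B, T, H)
        s = self.summary(x)      # (B, T, H)

        if chunk_size is None:
            # Offline mode: one global mean over time, hence O(T) cost.
            bar_s = s.mean(dim=1, keepdim=True).expand_as(s)
        else:
            # Assumed streaming mode: every frame in chunk k sees the
            # running mean over all frames up to the end of chunk k.
            B, T, H = s.shape
            csum = s.cumsum(dim=1)  # prefix sums over time
            t = torch.arange(T, device=x.device)
            # Index of the last frame of the chunk each frame belongs to.
            end = torch.clamp((t // chunk_size + 1) * chunk_size, max=T) - 1
            bar_s = csum[:, end, :] / (end + 1).view(1, T, 1)

        return self.combine(torch.cat([f, bar_s], dim=-1))


# Usage: the same layer runs in offline and (assumed) streaming mode.
layer = SummaryMixing(d_model=256, d_hidden=256)
x = torch.randn(2, 100, 256)
y_offline = layer(x)                 # full-utterance summary
y_stream = layer(x, chunk_size=16)   # causal, chunk-wise summary
```

Because every frame mixes with a single averaged summary vector instead of attending to every other frame, the cost grows linearly with the utterance length rather than quadratically, which is the property the paper exploits for both training and streaming decoding.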