Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

arXiv:2409.07165 · arXiv - EE - Audio and Speech Processing · 2024-09-11
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on resource-constrained devices. SummaryMixing is a promising linear-time-complexity alternative to self-attention for non-streaming speech recognition that, for the first time, matches or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode, and shows that this new linear-time-complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
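
The abstract only names SummaryMixing, so a minimal PyTorch sketch of the mechanism may help: each frame is passed through a local transform, all frames are averaged into a single summary vector, and the two are combined, which is linear in utterance length. This follows the original SummaryMixing formulation (Parcollet et al.); the causal running mean in the streaming branch is one natural way to make the summary causal, not necessarily this paper's exact streaming definition, and all class, layer, and parameter names below are illustrative.

import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    # Illustrative sketch, not the authors' implementation.
    def __init__(self, d_model: int):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.combine = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor, streaming: bool = False) -> torch.Tensor:
        # x: (batch, time, d_model)
        f = self.local(x)    # per-frame "local" transform
        s = self.summary(x)  # per-frame contribution to the summary
        if streaming:
            # Causal running mean: frame t only averages frames 0..t,
            # so no future context is needed and the cost stays linear.
            counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
            mean = s.cumsum(dim=1) / counts
        else:
            # Offline mode: one global mean, broadcast to every frame.
            mean = s.mean(dim=1, keepdim=True).expand_as(s)
        # Mix each frame's local features with the (causal or global) summary.
        return self.combine(torch.cat([f, mean], dim=-1))

# Example usage with hypothetical sizes:
# y = SummaryMixing(256)(torch.randn(2, 100, 256), streaming=True)

Unlike self-attention, which compares every pair of frames, this layer touches each frame a constant number of times, which is the source of the linear time and memory behaviour claimed in the abstract.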