{"title":"Hypformer:一个快速假设驱动的语音识别框架","authors":"Xuyi Zhuang;Yukun Qian;Mingjiang Wang","doi":"10.1109/LSP.2024.3516700","DOIUrl":null,"url":null,"abstract":"Recently, the performance of non-autoregressive ASR models has made significant progress but still lags behind hybrid CTC/attention systems. This paper introduces Hypformer, a fast hypothesis-driven rescoring speech recognition framework. Multiple hypothetical prefixes are realized by fast prefix generation algorithm. With two different rescoring methods, nar-ar rescoring and nar\n<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>\n rescoring, Hypformer can flexibly switch between autoregressive and non-autoregressive decoding modes to perform rescoring of hypothesis prefixes. Experiments on the standard Mandarin datasets AISHELL-1 and AISHELL-2 demonstrate that Hypformer outperforms the state-of-the-art Hybrid CTC/Attention systems in ASR performance while achieving a speedup of over six times. Experiments on the Mandarin sub-dialect dataset KeSpeech indicate that Hypformer achieves more accurate recognition by leveraging richer contextual information.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"471-475"},"PeriodicalIF":3.2000,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hypformer: A Fast Hypothesis-Driven Rescoring Speech Recognition Framework\",\"authors\":\"Xuyi Zhuang;Yukun Qian;Mingjiang Wang\",\"doi\":\"10.1109/LSP.2024.3516700\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, the performance of non-autoregressive ASR models has made significant progress but still lags behind hybrid CTC/attention systems. This paper introduces Hypformer, a fast hypothesis-driven rescoring speech recognition framework. Multiple hypothetical prefixes are realized by fast prefix generation algorithm. With two different rescoring methods, nar-ar rescoring and nar\\n<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>\\n rescoring, Hypformer can flexibly switch between autoregressive and non-autoregressive decoding modes to perform rescoring of hypothesis prefixes. Experiments on the standard Mandarin datasets AISHELL-1 and AISHELL-2 demonstrate that Hypformer outperforms the state-of-the-art Hybrid CTC/Attention systems in ASR performance while achieving a speedup of over six times. 
Experiments on the Mandarin sub-dialect dataset KeSpeech indicate that Hypformer achieves more accurate recognition by leveraging richer contextual information.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"471-475\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-12-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10795661/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10795661/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Hypformer: A Fast Hypothesis-Driven Rescoring Speech Recognition Framework
Recently, non-autoregressive ASR models have made significant progress, but their performance still lags behind hybrid CTC/attention systems. This paper introduces Hypformer, a fast hypothesis-driven rescoring speech recognition framework. Multiple hypothesis prefixes are generated by a fast prefix generation algorithm. With two rescoring methods, nar-ar rescoring and nar$^{2}$ rescoring, Hypformer can flexibly switch between autoregressive and non-autoregressive decoding modes to rescore the hypothesis prefixes. Experiments on the standard Mandarin datasets AISHELL-1 and AISHELL-2 demonstrate that Hypformer outperforms state-of-the-art hybrid CTC/attention systems in ASR performance while achieving a speedup of more than six times. Experiments on the Mandarin sub-dialect dataset KeSpeech indicate that Hypformer achieves more accurate recognition by leveraging richer contextual information.
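The abstract does not give implementation details, but the general idea of rescoring hypothesis prefixes with a second-pass decoder can be illustrated with a short, hedged sketch. The PyTorch function below shows generic hybrid-style rescoring: each prefix arrives with a first-pass score (e.g. from CTC), is scored again by an attention decoder under teacher forcing, and the best interpolated score wins. The `decoder` interface, `sos_id`, the prefix-list format, and the interpolation weight are all assumptions for illustration, not Hypformer's actual nar-ar or nar$^{2}$ procedures.

```python
import torch

def rescore_prefixes(prefixes, decoder, encoder_out, sos_id=0, ctc_weight=0.5):
    """Pick the best hypothesis prefix by interpolating its first-pass score
    (e.g. a CTC prefix score) with a teacher-forced decoder log-likelihood.

    `prefixes` is a list of (token_id_list, first_pass_score) pairs, and
    `decoder(ys_in, encoder_out)` is assumed to return (1, L, vocab) logits.
    """
    best_tokens, best_score = None, float("-inf")
    for tokens, first_pass_score in prefixes:
        # Teacher forcing: feed <sos> plus all but the last prefix token, so the
        # decoder scores every position of the prefix in one forward pass instead
        # of running a step-by-step autoregressive search.
        ys_in = torch.tensor([sos_id] + tokens[:-1]).unsqueeze(0)   # (1, L)
        ys_out = torch.tensor(tokens)                               # (L,)
        logits = decoder(ys_in, encoder_out)                        # (1, L, vocab)
        log_probs = torch.log_softmax(logits, dim=-1)
        # Sum the log-probabilities of the actual prefix tokens.
        att_score = log_probs[0, torch.arange(len(tokens)), ys_out].sum().item()
        score = ctc_weight * first_pass_score + (1.0 - ctc_weight) * att_score
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score
```

In this teacher-forced form the rescoring pass scores all prefix positions in parallel, i.e. non-autoregressively; the abstract indicates that Hypformer can also perform the rescoring autoregressively (nar-ar) or fully non-autoregressively (nar$^{2}$), but those exact procedures are described in the paper, not here.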
Journal Introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, and also at several workshops organized by the Signal Processing Society.