What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures

Computer Speech and Language; IF 3.4; Zone 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence); Pub Date: 2025-03-01; Epub Date: 2024-10-22; DOI: 10.1016/j.csl.2024.101738

Julian Linke, Bernhard C. Geiger, Gernot Kubin, Barbara Schuppler

Volume 90, Article 101738. https://www.sciencedirect.com/science/article/pii/S0885230824001219


Abstract

Highly performing speech recognition is important for more fluent human–machine interaction (e.g., dialogue systems). Modern ASR architectures achieve human-level recognition performance on read speech but still perform sub-par on conversational speech, which arguably is or, at least, will be instrumental for human–machine interaction. Understanding the factors behind this shortcoming of modern ASR systems may suggest directions for improving them. In this work, we compare the performances of HMM- vs. transformer-based ASR architectures on a corpus of Austrian German conversational speech. Specifically, we investigate how strongly utterance length, prosody, pronunciation, and utterance complexity as measured by perplexity affect different ASR architectures. Among other findings, we observe that single-word utterances – which are characteristic of conversational speech and constitute roughly 30% of the corpus – are recognized more accurately if their F0 contour is flat; for longer utterances, the effects of the F0 contour tend to be weaker. We further find that zero-shot systems require longer utterance lengths and are less robust to pronunciation variation, which indicates that pronunciation lexicons and fine-tuning on the respective corpus are essential ingredients for the successful recognition of conversational speech.
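The abstract uses perplexity as its measure of utterance complexity. As a rough illustration of what that quantity captures, the sketch below computes perplexity from per-token language-model probabilities. It is not the authors' pipeline: the function name `perplexity` and the probability values are hypothetical, and the abstract does not say which language model the study used.

```python
import math

def perplexity(token_probs):
    """Perplexity of one utterance, given a language-model probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token (cross-entropy in nats).
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating the cross-entropy gives the perplexity.
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities (made-up values, not from the paper).
predictable = [0.5, 0.5, 0.5, 0.5]
surprising = [0.5, 0.01, 0.5, 0.01]

print(perplexity(predictable))
print(perplexity(surprising))
```

An utterance whose words the language model finds unlikely gets a higher perplexity, which is the sense in which the measure stands in for complexity.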


Source journal

Computer Speech and Language (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.