The Role of Rhythm and Vowel Space in Speech Recognition

Speech Prosody 2022. Pub Date: 2022-05-23. DOI: 10.21437/speechprosody.2022-87

Li-Fang Lai, J. G. Hell, John M. Lipski

Citations: 1

Abstract

This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analysis were conducted. First, we computed the word error rates between the ground truth and the transcripts generated by DARLA. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and the polygon area. We noticed that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between the polygon area, speech intelligibility, and error rates was found. These results, albeit suggestive, point out some promising directions for improving acoustic modeling in speech recognition.
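The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline, and nothing here reflects how DARLA computes its output. All function names are hypothetical. The sketch assumes that word error rate is the word-level edit distance divided by the reference length, that %V is the percentage of utterance duration that is vocalic, that ΔV is the (population) standard deviation of vocalic interval durations, that the convex hull and polygon areas are taken over points in the F1-F2 plane, and that formant dispersion is the mean distance of vowel points from their centroid. The paper may define any of these differently.

```python
from math import hypot
from statistics import pstdev


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rhythm_metrics(intervals):
    """intervals: (kind, duration) pairs, kind "V" (vocalic) or "C" (consonantal).

    Returns (%V, delta-V). Population standard deviation is an assumption.
    """
    vowels = [d for kind, d in intervals if kind == "V"]
    total = sum(d for _, d in intervals)
    return 100.0 * sum(vowels) / total, pstdev(vowels)


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point repeats as the first of the other half

    return half_hull(pts) + half_hull(reversed(pts))


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as ordered vertices."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def formant_dispersion(points):
    """Mean Euclidean distance of vowel points from their centroid (assumed definition)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(hypot(x - cx, y - cy) for x, y in points) / len(points)
```

In use, the convex hull area for a speaker would be `polygon_area(convex_hull(tokens))` over that speaker's (F1, F2) vowel tokens, whereas the polygon area would typically be `polygon_area` applied directly to an ordered list of vowel category means. Keeping the hull and the area as separate steps makes that difference explicit.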
