The Role of Rhythm and Vowel Space in Speech Recognition

Li-Fang Lai, J. G. Hell, John M. Lipski
Speech Prosody 2022, published 2022-05-23
DOI: 10.21437/speechprosody.2022-87 (https://doi.org/10.21437/speechprosody.2022-87)
Citations: 1

Abstract

This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analyses were conducted. First, we computed word error rates by comparing the transcripts generated by DARLA against the ground truth. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. Finally, we sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and polygon area. We found that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that a more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between polygon area, speech intelligibility, and error rates was found. These results, albeit only suggestive, point to some promising directions for improving acoustic modeling in speech recognition.
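
The abstract names its metrics without defining them, so a minimal sketch of three of them may help. It assumes the conventional definitions: word error rate as word-level edit distance divided by the number of reference words, %V as the percentage of utterance duration that is vocalic, ΔV as the standard deviation of vocalic interval durations (taken here as the population standard deviation, which is an assumption), and the convex hull metric as the area of the convex hull of a speaker's (F1, F2) vowel tokens. The function names and all sample values are hypothetical. This is not DARLA's code or the authors' pipeline, and formant dispersion and polygon area are left out because the abstract does not say how they were computed.

```python
from statistics import pstdev


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def rhythm_metrics(intervals):
    """intervals: (label, duration_s) pairs, label 'V' (vocalic) or 'C'.

    Returns (%V, delta-V). Population standard deviation is an assumption.
    """
    vowel_durations = [d for label, d in intervals if label == "V"]
    total = sum(d for _, d in intervals)
    percent_v = 100.0 * sum(vowel_durations) / total
    return percent_v, pstdev(vowel_durations)


def convex_hull_area(points):
    """Area of the convex hull of (F1, F2) points (monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point is the first point of the other half

    hull = half_hull(pts) + half_hull(reversed(pts))
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1])
    )
    return abs(twice_area) / 2.0


if __name__ == "__main__":
    # Made-up inputs, for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(rhythm_metrics([("C", 0.1), ("V", 0.2), ("C", 0.1), ("V", 0.1)]))
    print(convex_hull_area([(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]))
```

Under these definitions, a speaker whose vowel tokens spread over a wider region of the F1/F2 plane gets a larger hull area, which is the sense in which the abstract relates an expanded vowel space to lower error rates.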