The Role of Rhythm and Vowel Space in Speech Recognition

Li-Fang Lai, J. G. Hell, John M. Lipski
Speech Prosody 2022, published 2022-05-23. DOI: 10.21437/speechprosody.2022-87
Citations: 1

Abstract

This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analyses were conducted. First, we computed word error rates (WER) by comparing the transcripts generated by DARLA against ground-truth transcriptions. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether error rates are influenced by rhythm. Neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull area, formant dispersion, and polygon area. We observed that as convex hull area and formant dispersion increase, error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that a more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection was found between polygon area, speech intelligibility, and error rates. These results, albeit suggestive, point to some promising directions for improving acoustic modeling in speech recognition.
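The abstract names, but does not define, the metrics behind its three analyses. The sketch below shows one common way to compute each of them; it is not the authors' implementation, and every function name in it is hypothetical. Assumptions: WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of reference words; %V is the percentage of total duration occupied by vocalic intervals and ΔV is the standard deviation of vocalic interval durations (population SD here; some tools use the sample SD); vowel tokens are (F2, F1) pairs with no normalization or Bark conversion; formant dispersion is taken as the mean Euclidean distance of vowel-category means from their centroid; and the polygon joins the means of hypothetical corner-vowel keys "i", "ae", "a", "u" in that order.

```python
import math
import statistics


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion of a reference word
                          curr[j - 1] + 1,         # insertion of a hypothesis word
                          prev[j - 1] + (r != h))  # substitution (cost 1) or match (cost 0)
        prev = curr
    return prev[-1] / len(ref)


def percent_v(vocalic: list[float], consonantal: list[float]) -> float:
    """%V: share of total utterance duration occupied by vocalic intervals, in percent."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta_v(vocalic: list[float]) -> float:
    """ΔV: standard deviation of vocalic interval durations (population SD, an assumption)."""
    return statistics.pstdev(vocalic)


def shoelace_area(points: list[tuple[float, float]]) -> float:
    """Area of a simple polygon whose vertices are listed in perimeter order."""
    n = len(points)
    s = 0.0
    for k in range(n):
        x1, y1 = points[k]
        x2, y2 = points[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def convex_hull(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other, so drop it.
    return lower[:-1] + upper[:-1]


def convex_hull_area(tokens: list[tuple[float, float]]) -> float:
    """Area of the convex hull enclosing every vowel token in (F2, F1) space."""
    return shoelace_area(convex_hull(tokens))


def formant_dispersion(vowel_means: dict[str, tuple[float, float]]) -> float:
    """Mean Euclidean distance of each vowel-category mean from their shared centroid."""
    pts = list(vowel_means.values())
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return sum(math.hypot(x - cx, y - cy) for x, y in pts) / len(pts)


def polygon_area(vowel_means: dict[str, tuple[float, float]],
                 corner_order: tuple[str, ...] = ("i", "ae", "a", "u")) -> float:
    """Area of the polygon joining corner-vowel means; the keys are hypothetical labels."""
    return shoelace_area([vowel_means[v] for v in corner_order])
```

Both area metrics share the shoelace formula, but they respond to data differently: the convex hull uses every token, so a single outlying measurement can enlarge it, whereas the polygon area depends only on the chosen corner-vowel means.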