Addressing the semi-open set dialect recognition problem under resource-efficient considerations

Speech Communication (IF 2.4; CAS Zone 3, Computer Science; JCR Q2, Acoustics) | Pub Date: 2023-07-01 | DOI: 10.1016/j.specom.2023.102957
Spandan Dey, Goutam Saha
Citations: 1

Abstract

Addressing the semi-open set dialect recognition problem under resource-efficient considerations

This work presents a resource-efficient solution for spoken dialect recognition under semi-open set evaluation, where a closed-set model is exposed to inputs from unknown classes. Our experiments primarily use Task 2 of the OLR 2020 challenge, in which three Chinese dialects, Hokkien, Sichuanese, and Shanghainese, are to be recognized. The evaluation set includes utterances from other, unknown classes alongside the three target dialects. We find that neither the top-performing submissions nor the baseline system proposed solutions that explicitly address the semi-open set scenario. This work pays special attention to the semi-open set nature of the problem and analyzes how unknown utterances can degrade overall performance if they are not treated separately. We train our main dialect classifier with the ECAPA-TDNN architecture on 40-dimensional MFCC features extracted from the training data of the three dialects. We propose a confidence-assessment algorithm and combine the TDNN results from the end-to-end and embedding-extractor approaches. We then frame the semi-open set scenario as a constrained optimization problem. Solving it, we prove that the performance degradation caused by unknown utterances is minimized when the corresponding softmax prediction is equally confused among the target outputs, that is, uniform over the target dialects. Based on this criterion, we develop several feedback modules for our system. These modules follow novelty-detection principles and flag unknown-class utterances as anomalies. The prediction score of each flagged utterance is then penalized by flattening. The proposed system achieves a Cavg (×100) score of 8.50 and an EER (%) of 9.77. Averaged over both metrics, our system's score outperforms that of the winning submission. Owing to the proposed semi-open set adaptations, our system achieves this performance with much less training data and fewer computational resources than the top-performing submissions. Additionally, to verify the broader applicability of the proposed semi-open set solution, we experiment with two other dialect recognition tasks that cover English and Arabic and use larger databases.
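
The flattening step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flatten_scores`, the blending weight `alpha`, and the externally supplied `flagged` decision are all assumptions made for illustration, and the abstract does not say how the novelty-detection modules produce their flags. The sketch only shows the general idea of pushing a flagged utterance's softmax scores toward the uniform ("equally confused") distribution over the target dialects.

```python
from typing import List, Sequence

# Target classes named in the abstract (OLR 2020, Task 2).
DIALECTS = ("Hokkien", "Sichuanese", "Shanghainese")


def flatten_scores(posterior: Sequence[float], flagged: bool,
                   alpha: float = 1.0) -> List[float]:
    """Blend a softmax posterior toward the uniform distribution.

    `flagged` stands in for the decision of a novelty detector
    (hypothetical here). `alpha` is an illustrative blending weight:
    1.0 yields a fully uniform output, 0.0 leaves the scores unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not flagged:
        # Utterances judged to be in-set keep their original scores.
        return list(posterior)
    uniform = 1.0 / len(posterior)
    # A convex combination of two distributions still sums to one.
    return [(1.0 - alpha) * p + alpha * uniform for p in posterior]


# Illustrative, made-up scores for three utterances.
known = flatten_scores([0.90, 0.06, 0.04], flagged=False)
unknown = flatten_scores([0.70, 0.20, 0.10], flagged=True)
partial = flatten_scores([0.70, 0.20, 0.10], flagged=True, alpha=0.5)

for name, scores in (("known", known), ("unknown", unknown), ("partial", partial)):
    print(name, dict(zip(DIALECTS, (round(s, 3) for s in scores))))
```

With `alpha=1.0` a flagged utterance no longer favors any target dialect, which matches the criterion stated in the abstract; intermediate values of `alpha` are a design choice of this sketch, not something the abstract describes.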

Source journal
Speech Communication (Engineering and Technology; Computer Science, Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Articles published: 94
Review time: 19.2 weeks
Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.