Addressing the semi-open set dialect recognition problem under resource-efficient considerations

IF: 3.0 | Region 3 (Computer Science) | JCR Q2 (Acoustics) | Speech Communication | Pub Date: 2023-07-01 | DOI: 10.1016/j.specom.2023.102957

Spandan Dey, Goutam Saha

Speech Communication, Volume 152, Article 102957. Source: https://www.sciencedirect.com/science/article/pii/S0167639323000912

Citations: 1

Abstract



This work presents a resource-efficient solution for the spoken dialect recognition task under semi-open set evaluation scenarios, where a closed-set model is exposed to unknown-class inputs. We have primarily explored Task 2 of the OLR 2020 challenge for our experiments. In this task, three Chinese dialects (Hokkien, Sichuanese, and Shanghainese) are to be recognized. For evaluation, utterances from other unknown classes are included along with the three target dialects. We find that the top-performing submissions and the baseline system did not propose solutions that explicitly address the semi-open set scenario. This work pays special attention to the semi-open set nature of the problem and analyzes how the unknown utterances can degrade the overall performance if not treated separately. We train our main dialect classifier with the ECAPA-TDNN architecture and 40-dimensional MFCC features from the training data of the three dialects. We propose a confidence-assessment algorithm and combine the TDNN performance from both end-to-end and embedding-extractor approaches. We then frame the semi-open set scenario as a constrained optimization problem. By solving it, we prove that the performance degradation caused by the unknown utterances is minimized if the corresponding softmax prediction is equally confused among the target outputs. Based on this criterion, we develop different feedback modules in our system. These modules work on novelty detection principles and flag unknown-class utterances as anomalies. The prediction score of the corresponding utterance is then penalized by flattening. The proposed system achieves a $C_{avg}\ (\times 100)$ score of 8.50 and an EER $(\%)$ of 9.77. Averaging both metrics, the score for our system outperforms the winning submission. Due to the proposed semi-open set adaptations, our system achieves this performance using much less training data and computation resources than the top-performing submissions. Additionally, to verify the broader applicability of the proposed semi-open set solution, we experiment with two other dialect recognition tasks covering English and Arabic languages and larger database sizes.
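
The flattening step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feedback modules or the exact flattening rule, so the names `novelty_score`, `threshold`, and `alpha` below are hypothetical, and the novelty score is assumed to come from some external anomaly detector. The sketch only shows the stated criterion, namely that a flagged utterance receives a softmax prediction that is equally confused among the three target dialects.

```python
import math

DIALECTS = ["Hokkien", "Sichuanese", "Shanghainese"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flatten_scores(probs, alpha=1.0):
    """Blend probs toward the uniform distribution.

    alpha is a hypothetical blending weight: alpha=1.0 yields a fully
    flat (uniform) prediction, alpha=0.0 leaves probs unchanged.
    """
    n = len(probs)
    return [(1.0 - alpha) * p + alpha / n for p in probs]

def score_utterance(logits, novelty_score, threshold=0.5, alpha=1.0):
    """Return class posteriors, flattened if the utterance is flagged.

    novelty_score and threshold are assumptions standing in for the
    output of a novelty detector; they are not taken from the paper.
    """
    probs = softmax(logits)
    if novelty_score > threshold:
        return flatten_scores(probs, alpha)
    return probs

# Same classifier output, scored as a known and as a flagged utterance.
known = score_utterance([4.0, 0.5, 0.2], novelty_score=0.1)
unknown = score_utterance([4.0, 0.5, 0.2], novelty_score=0.9)
```

With `alpha=1.0`, a flagged utterance gets a posterior of one third for each dialect regardless of the classifier's logits, while an unflagged utterance keeps its original softmax output. Intermediate `alpha` values would give a softer penalty; whether the paper uses a hard or soft variant is not stated in the abstract.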


Source journal

Speech Communication (Engineering and Technology; Computer Science: Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Articles published

Review time

19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.