Sufficiency Quantification for Seamless Text-Independent Speaker Enrollment

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2018-07-13. DOI: 10.1109/ICASSP.2018.8461954

Gokcen Cilingir, Jonathan Huang, Mandar Joshi, Narayan Biswal

Abstract

Text-independent speaker recognition (TI-SR) requires a lengthy enrollment process in which the user must set aside dedicated time to create a reliable model of their voice. Seamless enrollment is a highly attractive alternative: enrollment happens in the background and asks for no dedicated time from the user. One of the key problems in a fully automated seamless enrollment process is determining whether a given utterance collection is sufficient for TI-SR. No known metric exists in the literature to quantify sufficiency. This paper introduces a novel metric called the phoneme-richness score. The quality of a sufficiency metric can be assessed via its correlation with TI-SR performance. Our assessment shows that the phoneme-richness score achieves a correlation of −0.96 with TI-SR performance (measured in equal error rate), which is highly significant, whereas a naive sufficiency metric such as speech duration achieves a correlation of only −0.68.
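The assessment described in the abstract, correlating a sufficiency metric with equal error rate (EER), can be sketched as follows. This is a minimal illustration assuming the Pearson correlation coefficient is meant, which the abstract does not state. The function name `pearson` and all data values are hypothetical placeholders, not results from the paper, and the abstract does not say how the phoneme-richness score itself is computed, so it appears here only as an input list.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, made-up values for five enrollment collections.
# These are NOT numbers from the paper.
sufficiency_scores = [0.2, 0.4, 0.5, 0.7, 0.9]  # one score per collection
eer_percent = [12.0, 9.5, 8.0, 5.5, 4.0]        # EER of the resulting model

r = pearson(sufficiency_scores, eer_percent)
print(f"correlation between sufficiency score and EER: {r:.2f}")
```

Because a lower EER means better recognition, a useful sufficiency metric should produce a correlation close to −1: the more sufficient the enrollment data, the lower the error rate.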
