Towards Unsupervised Learning of Speech Features in the Wild

M. Rivière, Emmanuel Dupoux
DOI: 10.1109/SLT48900.2021.9383461
Published in: 2021 IEEE Spoken Language Technology Workshop (SLT)
Publication date: 2021-01-19
Citations: 20

Abstract

Recent work on unsupervised contrastive learning of speech representations has shown promising results, but so far it has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) the presence of non-speech data, (ii) noisy or low-quality speech data, and (iii) imbalance in the speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already incur a performance cost of up to 30% relative on the ABX score. We show that the first two problems can be alleviated by data filtering: voice activity detection selects speech segments, while the perplexity of a model trained on clean data helps to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low-resource setting.
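The abstract's first filtering step relies on voice activity detection to keep only the speech portions of raw audio. The paper does not specify the VAD system in the abstract, so the sketch below is only a minimal illustration of the idea using a short-time energy threshold; the frame length, hop size, and threshold values are illustrative assumptions, not the authors' settings.

```python
def energy_vad(samples, frame_len=400, hop=160, threshold=1e-3):
    """Toy energy-based VAD: return (start, end) sample ranges whose
    short-time energy exceeds `threshold`. A real pipeline would use a
    trained VAD model instead of a fixed energy threshold."""
    # Mark each analysis frame as speech / non-speech.
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        frames.append((start, energy > threshold))
    # Merge runs of consecutive speech frames into contiguous segments.
    segments, seg_start = [], None
    for start, is_speech in frames:
        if is_speech and seg_start is None:
            seg_start = start
        elif not is_speech and seg_start is not None:
            segments.append((seg_start, start))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments
```

In practice, off-the-shelf VAD tools (e.g. WebRTC VAD or a neural VAD) are far more robust to noise than a raw energy threshold, which is exactly the regime targeted by data "in the wild".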