Speaker independent diarization for child language environment analysis using deep neural networks

M. Najafian, J. Hansen
2016 IEEE Spoken Language Technology Workshop (SLT), December 2016
DOI: 10.1109/SLT.2016.7846253
Citations: 20
Abstract
Large-scale monitoring of the child language environment, through measuring the amount of speech directed at a child by other children and adults during vocal communication, is an important task. Using audio extracted from a recording unit worn by a child in a childcare center, our proposed diarization system can determine, at each point in time, the content of the child's language environment by categorizing the audio into one of four major categories: (1) speech initiated by the child wearing the recording unit, speech originating from other (2) children or (3) adults and directed at the primary child, and (4) non-speech content. In this study, we exploit complex Hidden Markov Models (HMMs) with multiple states to model the temporal dependencies between different sources of acoustic variability, and we estimate the HMM state output probabilities using deep neural networks as a discriminative modeling approach. The proposed system is robust against common diarization errors caused by rapid turn-taking, between-class similarities, and background noise, without the need for prior clustering techniques. The experimental results confirm that this approach outperforms state-of-the-art Gaussian mixture model based diarization without the need for bottom-up clustering, yielding a 22.24% relative error reduction.
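The hybrid DNN-HMM decoding the abstract describes can be illustrated with a minimal sketch: per-frame DNN posteriors over the four classes are converted to scaled likelihoods (posterior divided by class prior, standard in hybrid decoding) and a Viterbi pass over an HMM with "sticky" self-transitions smooths rapid label switching. All numbers below (posteriors, priors, transition probabilities) are made-up illustrations, not values from the paper.

```python
import numpy as np

# The four target classes from the paper's diarization task.
CLASSES = ["primary_child", "other_child", "adult", "non_speech"]

def viterbi_decode(log_emissions, log_trans, log_init):
    """Most likely state sequence.
    log_emissions: (T, S) frame scores; log_trans: (S, S); log_init: (S,)."""
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # scores[i, j]: prev i -> cur j
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_emissions[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):                     # trace back best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy stand-in for DNN outputs: random per-frame posteriors over 4 classes.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(4), size=20)        # shape (T=20, S=4)
priors = np.full(4, 0.25)                              # assumed uniform priors
log_emissions = np.log(posteriors) - np.log(priors)    # scaled likelihoods

# Sticky transition matrix: high self-loop probability discourages the
# spurious rapid class switches that frame-level classification produces.
trans = np.full((4, 4), 0.02) + np.eye(4) * 0.94
trans /= trans.sum(axis=1, keepdims=True)
labels = [CLASSES[i] for i in
          viterbi_decode(log_emissions, np.log(trans), np.log(np.full(4, 0.25)))]
```

The self-loop weight trades off responsiveness against smoothing: larger values suppress short spurious segments but can miss genuinely rapid turn-taking, which is why the paper's multi-state HMM topology matters.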