Speaker independent diarization for child language environment analysis using deep neural networks

M. Najafian, J. Hansen
2016 IEEE Spoken Language Technology Workshop (SLT), December 2016
DOI: 10.1109/SLT.2016.7846253
Citations: 20
Abstract
Large-scale monitoring of the child language environment, through measuring the amount of speech directed at a child by other children and adults during vocal communication, is an important task. Using audio extracted from a recording unit worn by a child in a childcare center, our proposed diarization system can determine, at each point in time, the content of the child's language environment by categorizing the audio into one of four major categories: (1) speech initiated by the child wearing the recording unit, speech originating from other (2) children or (3) adults and directed at the primary child, and (4) non-speech content. In this study, we exploit complex Hidden Markov Models (HMMs) with multiple states to model the temporal dependencies between different sources of acoustic variability, and we estimate the HMM state output probabilities using deep neural networks as a discriminative modeling approach. The proposed system is robust against common diarization errors caused by rapid turn-taking, between-class similarities, and background noise, without the need for prior clustering techniques. The experimental results confirm that this approach outperforms state-of-the-art Gaussian mixture model based diarization without the need for bottom-up clustering, yielding a 22.24% relative error reduction.
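The hybrid DNN-HMM decoding the abstract describes can be illustrated with a minimal sketch: per-frame DNN posteriors over the four classes are converted to scaled likelihoods (posterior divided by class prior, standard in hybrid decoding) and a Viterbi pass over an HMM with "sticky" self-transitions smooths rapid label switching. All numbers below (posteriors, priors, transition probabilities) are made-up illustrations, not values from the paper.

```python
import numpy as np

# The four target classes from the paper's diarization task.
CLASSES = ["primary_child", "other_child", "adult", "non_speech"]

def viterbi_decode(log_emissions, log_trans, log_init):
    """Most likely state sequence.
    log_emissions: (T, S) frame scores; log_trans: (S, S); log_init: (S,)."""
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # scores[i, j]: prev i -> cur j
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_emissions[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):                     # trace back best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy stand-in for DNN outputs: random per-frame posteriors over 4 classes.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(4), size=20)        # shape (T=20, S=4)
priors = np.full(4, 0.25)                              # assumed uniform priors
log_emissions = np.log(posteriors) - np.log(priors)    # scaled likelihoods

# Sticky transition matrix: high self-loop probability discourages the
# spurious rapid class switches that frame-level classification produces.
trans = np.full((4, 4), 0.02) + np.eye(4) * 0.94
trans /= trans.sum(axis=1, keepdims=True)
labels = [CLASSES[i] for i in
          viterbi_decode(log_emissions, np.log(trans), np.log(np.full(4, 0.25)))]
```

The self-loop weight trades off responsiveness against smoothing: larger values suppress short spurious segments but can miss genuinely rapid turn-taking, which is why the paper's multi-state HMM topology matters.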