Data Augmentation and D-vector Representation Methods for Speaker Change Detection
Jisu Park, Shin Cha, Seongbae Eun, J. Park, Young-Sun Yun
Proceedings of the International Conference on Research in Adaptive and Convergent Systems, 2020-10-13
DOI: 10.1145/3400286.3418270 (https://doi.org/10.1145/3400286.3418270)
Citations: 0
Abstract
Speaker Change Detection (SCD) is the process of detecting speaker changes during a conversation. A conversation can be divided into homogeneous segments using a typical SCD system or a speaker diarization system, in which the segments are partitioned according to speaker identity. Although d-vectors are used to identify or verify speakers with a deep neural network model, they are often considered insufficient for training a model that detects speaker changes from acoustic information alone. Few dedicated datasets exist for training such systems, so progress in SCD research has been slow and performance remains poor. We therefore present a data augmentation method based on the TIMIT dataset that is suited to SCD systems, and we propose several methods for representing d-vectors for SCD systems, together with their preliminary results. In the proposed data augmentation method, the speaker boundary information is transformed into a probability according to its offset within a given frame, and these probabilities are collected over the segment. To model speaker boundaries, we concatenate two randomly chosen speech sentences originally intended for speech recognition systems. The preliminary experimental results, specifically the recall percentage, show the potential of the proposed approaches. In future work, we will add linguistic information to the proposed classification system, or improve the system to use a hybrid of d-vectors and frame vectors, or convolutional networks.
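The augmentation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact mapping from boundary offset to probability, so everything below is an assumption made for illustration: the triangular mapping (highest when the boundary sits at the frame centre), the frame length and hop size, the use of the maximum to collect frame probabilities into a segment target, and the function names `concat_with_change_labels` and `segment_targets`. Random noise stands in for two TIMIT sentences.

```python
import random


def concat_with_change_labels(utt_a, utt_b, frame_len=400, hop=160):
    """Concatenate two utterances and give each frame a change probability."""
    signal = list(utt_a) + list(utt_b)
    boundary = len(utt_a)  # sample index where the second speaker starts
    half = frame_len / 2
    labels = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        offset = boundary - start  # boundary position relative to frame start
        if 0 <= offset <= frame_len:
            # Assumed mapping: 1.0 at the frame centre, falling linearly
            # to 0.0 at either frame edge.
            labels.append(1.0 - abs(offset - half) / half)
        else:
            labels.append(0.0)  # frame does not contain the boundary
    return signal, labels


def segment_targets(labels, frames_per_segment=10):
    """Collect frame probabilities into one target per segment (assumed: max)."""
    return [
        max(labels[i:i + frames_per_segment])
        for i in range(0, len(labels), frames_per_segment)
    ]


# Stand-in data: two "sentences" of random noise from different speakers.
rng = random.Random(0)
speaker_a = [rng.uniform(-1.0, 1.0) for _ in range(1600)]
speaker_b = [rng.uniform(-1.0, 1.0) for _ in range(1600)]

signal, labels = concat_with_change_labels(speaker_a, speaker_b)
targets = segment_targets(labels)
```

With soft targets of this kind, frames that only graze the boundary contribute a weaker training signal than frames centred on it. Whether the authors use this particular shape or aggregation is not stated in the abstract.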