Conversational Short-Phrase Speaker Diarization via Self-Adjusting Speech Segmentation and Embedding Extraction
Authors: Haitian Lu; Gaofeng Cheng; Yonghong Yan
Journal: IEEE Signal Processing Letters (Impact Factor 3.2; JCR Q2, Engineering, Electrical & Electronic)
Publication date: 2024-09-03
DOI: 10.1109/LSP.2024.3453772
URL: https://ieeexplore.ieee.org/document/10663942/
Conversational short-phrase speaker diarization focuses on diarizing phrases that are short in duration. Conventional speaker diarization systems, however, pay too little attention to such phrases. This letter proposes a novel speaker diarization system to address this issue. First, we employ an RNN-T model for joint speech recognition and speaker change detection. The speech recognition results can be used directly in downstream tasks, while the speaker change points guide the subsequent steps. Second, we introduce self-adjusting speech segmentation, which dynamically adjusts segment lengths based on the temporal distribution of speaker change points. Third, we introduce self-adjusting embedding extraction, which employs speaker encoders trained under different speech duration conditions by projecting them into the same embedding space. Our method achieves a major reduction in Diarization Error Rate (DER) and Conversational Diarization Error Rate (CDER) on the MagicData-RAMC and Mixer 6 datasets.
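The abstract does not state the exact rule used to adjust segment lengths, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that change points are given in seconds, that no segment may cross a change point, and that a turn longer than a hypothetical `max_len` is split into equal pieces. The function name `self_adjusting_segments` and the `max_len` parameter are illustrative assumptions.

```python
import math
from typing import List, Tuple


def self_adjusting_segments(
    duration: float,
    change_points: List[float],
    max_len: float = 1.5,  # hypothetical upper bound on segment length, in seconds
) -> List[Tuple[float, float]]:
    """Cut [0, duration] into segments that never cross a speaker change point.

    Assumption (not taken from the letter): each span between consecutive
    change points is split into the fewest equal pieces whose length does not
    exceed max_len. Short spans are kept whole, so a short phrase gets its
    own short segment instead of being merged into a fixed-length window.
    """
    inner = sorted(p for p in change_points if 0.0 < p < duration)
    bounds = [0.0] + inner + [duration]
    segments: List[Tuple[float, float]] = []
    for start, end in zip(bounds, bounds[1:]):
        turn = end - start
        if turn <= 0:
            continue  # skip duplicated change points
        pieces = max(1, math.ceil(turn / max_len))
        step = turn / pieces
        for i in range(pieces):
            seg_start = start + i * step
            # Use the exact boundary for the last piece to avoid float drift.
            seg_end = end if i == pieces - 1 else start + (i + 1) * step
            segments.append((seg_start, seg_end))
    return segments


if __name__ == "__main__":
    # A 0.4 s short phrase followed by two longer turns.
    for seg in self_adjusting_segments(10.0, [0.4, 4.0]):
        print(f"{seg[0]:.2f} -> {seg[1]:.2f}")
```

In this sketch, dense change points yield short segments and sparse change points yield segments near `max_len`. A fixed-length sliding window would instead mix a 0.4 s phrase with speech from the neighbouring speaker.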
About the journal:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.