Graph attention-based deep embedded clustering for speaker diarization

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2023-10-05 DOI:10.1016/j.specom.2023.102991
Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang
{"title":"Graph attention-based deep embedded clustering for speaker diarization","authors":"Yi Wei,&nbsp;Haiyan Guo,&nbsp;Zirui Ge,&nbsp;Zhen Yang","doi":"10.1016/j.specom.2023.102991","DOIUrl":null,"url":null,"abstract":"<div><p>Deep speaker embedding extraction models have recently served as the cornerstone for modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships, and moreover, are not tailored specifically for the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using the graph attention-based deep embedded clustering (GADEC) to address the aforementioned issues. First, considering the temporal nature of speech signals, when segmenting the speech signal into small segments, the speech in the current segment and its neighboring segments may likely belong to the same speaker. This suggests that embeddings extracted from neighboring segments could help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and leverage the local structural information among their embeddings, we construct a graph for the pre-extracted speaker embeddings in a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module to integrate information from neighboring nodes (i.e., neighboring segments) in the graph and learn latent speaker embeddings. Moreover, we further jointly optimize both the latent speaker embeddings and the clustering results within a unified framework, leading to more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that our proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems concerning diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 102991"},"PeriodicalIF":2.4000,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001255","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Deep speaker embedding extraction models have recently served as the cornerstone for modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships, and moreover, are not tailored specifically for the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using the graph attention-based deep embedded clustering (GADEC) to address the aforementioned issues. First, considering the temporal nature of speech signals, when segmenting the speech signal into small segments, the speech in the current segment and its neighboring segments may likely belong to the same speaker. This suggests that embeddings extracted from neighboring segments could help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and leverage the local structural information among their embeddings, we construct a graph for the pre-extracted speaker embeddings in a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module to integrate information from neighboring nodes (i.e., neighboring segments) in the graph and learn latent speaker embeddings. Moreover, we further jointly optimize both the latent speaker embeddings and the clustering results within a unified framework, leading to more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that our proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems concerning diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于图关注的说话人深度嵌入聚类
深度说话人嵌入提取模型最近成为模块化说话人二元化系统的基石。然而,在当前的模块化系统中,提取的说话人嵌入(即说话人特征)不能有效地利用它们的内在关系,而且,也不是专门为聚类任务定制的。在本文中,受深度嵌入聚类(DEC)的启发,我们提出了一种基于图注意力的深度嵌入聚类的说话人二元化方法来解决上述问题。首先,考虑到语音信号的时间性质,当将语音信号分割成小片段时,当前片段及其相邻片段中的语音可能属于同一说话者。这表明,从相邻片段中提取的嵌入可以帮助为当前片段生成信息量更大的说话者表示。为了更好地描述片段之间的复杂关系,并利用其嵌入之间的局部结构信息,我们为连续音频信号中预先提取的扬声器嵌入构建了一个图。在此基础上,我们引入了一个图注意力编码器(GAE)模块来整合图中相邻节点(即相邻片段)的信息,并学习潜在的说话人嵌入。此外,我们在一个统一的框架内进一步联合优化潜在说话人嵌入和聚类结果,从而为聚类任务提供更具鉴别性的说话人嵌入。实验结果表明,在NIST SRE 2000 CALLHOME、AMI和VoxConverse数据集上,我们提出的基于GADEC的说话人二元化系统在二元化错误率(DER)方面显著优于基线系统和其他几个最近的说话人二次化系统。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
A new universal camouflage attack algorithm for intelligent speech system Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection A robust temporal map of speech monitoring from planning to articulation The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1