{"title":"端到端视听神经扬声器日记","authors":"Maokui He, Jun Du, Chin-Hui Lee","doi":"10.21437/interspeech.2022-10106","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our audio-visual model takes audio features (e.g., FBANKs), multi-speaker lip regions of in-terest (ROIs), and multi-speaker i-vector embbedings as multi-modal inputs. And a set of binary classification output layers produces activities of each speaker. With the finely designed end-to-end structure, the proposed method can explicitly handle the overlapping speech and distinguish between speech and non-speech accurately with multi-modal information. I-vectors are the key point to solve the alignment problem caused by visual modality error (e.g., occlusions, off-screen speakers or unreliable detection). Besides, our audio-visual model is robust to the absence of visual modality, where the diarization performance degrades significantly using the visual-only model. Evaluated on the datasets of the first multi-model information based speech processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on development/eval set with reference voice activity detection (VAD) information, while audio-only and video-only system yielded DERs of 27.9%/29.0% and 14.6%/13.1% respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"End-to-End Audio-Visual Neural Speaker Diarization\",\"authors\":\"Maokui He, Jun Du, Chin-Hui Lee\",\"doi\":\"10.21437/interspeech.2022-10106\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our audio-visual model takes audio features (e.g., FBANKs), multi-speaker lip regions of in-terest (ROIs), and multi-speaker i-vector embbedings as multi-modal inputs. And a set of binary classification output layers produces activities of each speaker. With the finely designed end-to-end structure, the proposed method can explicitly handle the overlapping speech and distinguish between speech and non-speech accurately with multi-modal information. I-vectors are the key point to solve the alignment problem caused by visual modality error (e.g., occlusions, off-screen speakers or unreliable detection). Besides, our audio-visual model is robust to the absence of visual modality, where the diarization performance degrades significantly using the visual-only model. 
Evaluated on the datasets of the first multi-model information based speech processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on development/eval set with reference voice activity detection (VAD) information, while audio-only and video-only system yielded DERs of 27.9%/29.0% and 14.6%/13.1% respectively.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-10106\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-10106","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
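The abstract specifies only the model's input/output contract: frame-level audio features plus per-speaker lip-ROI embeddings and per-speaker i-vectors go in, and one binary activity stream per speaker comes out. The following is a minimal PyTorch sketch of that contract; every layer choice, dimension, and the simple concatenation-based fusion are illustrative assumptions, not the authors' published architecture.

```python
# Minimal sketch of an audio-visual diarization model with per-speaker binary
# activity outputs. Layer types/sizes and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class AudioVisualDiarizer(nn.Module):
    def __init__(self, fbank_dim=40, lip_dim=256, ivector_dim=100, hidden=256):
        super().__init__()
        # Per-modality encoders (hypothetical sizes).
        self.audio_enc = nn.GRU(fbank_dim, hidden, batch_first=True,
                                bidirectional=True)      # -> (B, T, 2*hidden)
        self.lip_enc = nn.Linear(lip_dim, hidden)
        self.ivec_proj = nn.Linear(ivector_dim, hidden)
        # Fuse concatenated audio / lip / i-vector streams over time.
        self.fusion = nn.GRU(4 * hidden, hidden, batch_first=True)
        # One binary (speech/non-speech) output per speaker, as in the abstract.
        self.head = nn.Linear(hidden, 1)

    def forward(self, fbank, lips, ivecs):
        # fbank: (B, T, fbank_dim)       frame-level audio features
        # lips:  (B, S, T, lip_dim)      per-speaker lip-ROI embeddings
        # ivecs: (B, S, ivector_dim)     per-speaker identity embeddings
        B, S, T, _ = lips.shape
        a, _ = self.audio_enc(fbank)                 # (B, T, 2*hidden)
        a = a.unsqueeze(1).expand(-1, S, -1, -1)     # share audio across speakers
        v = self.lip_enc(lips)                       # (B, S, T, hidden)
        i = self.ivec_proj(ivecs)                    # (B, S, hidden)
        i = i.unsqueeze(2).expand(-1, -1, T, -1)     # repeat identity over time
        x = torch.cat([a, v, i], dim=-1)             # (B, S, T, 4*hidden)
        x, _ = self.fusion(x.reshape(B * S, T, -1))
        return torch.sigmoid(self.head(x)).reshape(B, S, T)

model = AudioVisualDiarizer()
p = model(torch.randn(2, 300, 40),      # audio features
          torch.randn(2, 4, 300, 256),  # 4 speakers' lip embeddings
          torch.randn(2, 4, 100))       # 4 speakers' i-vectors
print(p.shape)  # torch.Size([2, 4, 300]): per-speaker, per-frame activity
```

Because each speaker gets an independent sigmoid output rather than a shared softmax, several outputs can be active at the same frame, which is what lets this family of models represent overlapping speech explicitly.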
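For context on the reported numbers, DER is conventionally defined as the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker:

$$\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ confusion}}}{T_{\mathrm{total\ speech}}}$$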