推进演讲者嵌入式学习:用于研究和制作的 Wespeaker 工具包

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2024-07-01 DOI:10.1016/j.specom.2024.103104
Shuai Wang , Zhengyang Chen , Bing Han , Hongji Wang , Chengdong Liang , Binbin Zhang , Xu Xiang , Wen Ding , Johan Rohdin , Anna Silnova , Yanmin Qian , Haizhou Li
{"title":"推进演讲者嵌入式学习:用于研究和制作的 Wespeaker 工具包","authors":"Shuai Wang ,&nbsp;Zhengyang Chen ,&nbsp;Bing Han ,&nbsp;Hongji Wang ,&nbsp;Chengdong Liang ,&nbsp;Binbin Zhang ,&nbsp;Xu Xiang ,&nbsp;Wen Ding ,&nbsp;Johan Rohdin ,&nbsp;Anna Silnova ,&nbsp;Yanmin Qian ,&nbsp;Haizhou Li","doi":"10.1016/j.specom.2024.103104","DOIUrl":null,"url":null,"abstract":"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103104"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Advancing speaker embedding learning: Wespeaker toolkit for research and production\",\"authors\":\"Shuai Wang ,&nbsp;Zhengyang Chen ,&nbsp;Bing Han ,&nbsp;Hongji Wang ,&nbsp;Chengdong Liang ,&nbsp;Binbin Zhang ,&nbsp;Xu Xiang ,&nbsp;Wen Ding ,&nbsp;Johan Rohdin ,&nbsp;Anna Silnova ,&nbsp;Yanmin Qian ,&nbsp;Haizhou Li\",\"doi\":\"10.1016/j.specom.2024.103104\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"162 \",\"pages\":\"Article 103104\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000761\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000761","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

扬声器建模在各种任务中起着至关重要的作用,而固定维度的向量表示(即扬声器嵌入)是最主要的建模方法。这些嵌入通常是在扬声器验证的框架内进行评估的,但它们的用途也扩展到了扬声器日记化、语音合成、语音转换和目标扬声器提取等广泛的相关任务中。本文介绍的 Wespeaker 是一款用户友好型工具包,既可用于研究,也可用于生产,专门用于学习扬声器嵌入。Wespeaker 提供可扩展的数据管理、最先进的扬声器嵌入模型和自监督学习训练方案,具有利用大规模无标注真实世界数据的潜力。该工具包采用了结构化配方,这些配方已成功应用于各种扬声器验证挑战的成功系统中,确保了极具竞争力的结果。对于面向生产的开发,Wespeaker集成了CPU和GPU兼容的部署和运行代码,支持Windows、Linux、Mac等主流平台和horizon X3'PI等设备上芯片。Wespeaker 还通过提供各种预训练模型提供现成的高质量扬声器嵌入,这些模型可以毫不费力地应用于需要扬声器建模的不同任务。该工具包可通过 https://github.com/wenet-e2e/wespeaker 公开获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Advancing speaker embedding learning: Wespeaker toolkit for research and production

Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection A robust temporal map of speech monitoring from planning to articulation The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones Evaluating the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar listeners
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1