推进演讲者嵌入式学习：用于研究和制作的 Wespeaker 工具包

IF 3 3区计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2024-07-01 Epub Date: 2024-07-14 DOI:10.1016/j.specom.2024.103104

Shuai Wang , Zhengyang Chen , Bing Han , Hongji Wang , Chengdong Liang , Binbin Zhang , Xu Xiang , Wen Ding , Johan Rohdin , Anna Silnova , Yanmin Qian , Haizhou Li

{"title":"推进演讲者嵌入式学习：用于研究和制作的 Wespeaker 工具包","authors":"Shuai Wang , Zhengyang Chen , Bing Han , Hongji Wang , Chengdong Liang , Binbin Zhang , Xu Xiang , Wen Ding , Johan Rohdin , Anna Silnova , Yanmin Qian , Haizhou Li","doi":"10.1016/j.specom.2024.103104","DOIUrl":null,"url":null,"abstract":"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103104"},"PeriodicalIF":3.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Advancing speaker embedding learning: Wespeaker toolkit for research and production\",\"authors\":\"Shuai Wang , Zhengyang Chen , Bing Han , Hongji Wang , Chengdong Liang , Binbin Zhang , Xu Xiang , Wen Ding , Johan Rohdin , Anna Silnova , Yanmin Qian , Haizhou Li\",\"doi\":\"10.1016/j.specom.2024.103104\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"162 \",\"pages\":\"Article 103104\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000761\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/7/14 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000761","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/7/14 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

摘要

扬声器建模在各种任务中起着至关重要的作用，而固定维度的向量表示（即扬声器嵌入）是最主要的建模方法。这些嵌入通常是在扬声器验证的框架内进行评估的，但它们的用途也扩展到了扬声器日记化、语音合成、语音转换和目标扬声器提取等广泛的相关任务中。本文介绍的 Wespeaker 是一款用户友好型工具包，既可用于研究，也可用于生产，专门用于学习扬声器嵌入。Wespeaker 提供可扩展的数据管理、最先进的扬声器嵌入模型和自监督学习训练方案，具有利用大规模无标注真实世界数据的潜力。该工具包采用了结构化配方，这些配方已成功应用于各种扬声器验证挑战的成功系统中，确保了极具竞争力的结果。对于面向生产的开发，Wespeaker集成了CPU和GPU兼容的部署和运行代码，支持Windows、Linux、Mac等主流平台和horizon X3'PI等设备上芯片。Wespeaker 还通过提供各种预训练模型提供现成的高质量扬声器嵌入，这些模型可以毫不费力地应用于需要扬声器建模的不同任务。该工具包可通过 https://github.com/wenet-e2e/wespeaker 公开获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Advancing speaker embedding learning: Wespeaker toolkit for research and production

Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Speech Communication 工程技术-计算机：跨学科应用

CiteScore

6.80

自引率

6.20%

发文量

审稿时长

19.2 weeks

期刊介绍： Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.