Lightweight target speaker separation network based on joint training

IF 1.9 3区计算机科学 Q2 ACOUSTICS Eurasip Journal on Audio Speech and Music Processing Pub Date : 2023-12-06 DOI:10.1186/s13636-023-00317-3

Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi, Fang Liu

{"title":"Lightweight target speaker separation network based on joint training","authors":"Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi, Fang Liu","doi":"10.1186/s13636-023-00317-3","DOIUrl":null,"url":null,"abstract":"Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model’s performance. A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system’s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"10 1","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-023-00317-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model’s performance. A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system’s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于联合训练的轻量级目标扬声器分离网络

目标说话人分离旨在从混合语音中分离出目标说话人的语音成分，并去除噪声等无关成分。近年来，基于深度学习的语音分离方法取得了重大突破，并逐渐成为主流。然而，由于模型规模较大，这些现有方法普遍面临系统延迟和性能上限等问题。为解决这些问题，本文提出了网络结构和训练方法的改进措施，以提高模型的性能。本文提出了一种基于长短期记忆（LSTM）的轻量级目标扬声器分离网络，它能在保持分离性能的同时减小模型体积和计算延迟。在此基础上，提出了一种基于联合训练的目标扬声器分离方法，以实现目标扬声器分离系统的整体训练和优化。为进一步提高系统性能，还提出了基于扬声器注册和扬声器分离的联合损失函数，用于网络的联合训练。实验结果表明，本文提出的轻量级目标扬声器分离网络在轻量级的同时具有更好的性能，利用我们提出的损失函数对目标扬声器分离网络进行联合训练可以进一步提高原始模型的分离性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

4.10

自引率

4.20%

发文量

审稿时长

12 months

期刊介绍： The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.

期刊最新文献

Diffraction perception in L-shaped rooms using virtual reality. Singing to speech conversion with generative flow. Robust and early howling detection based on a sparsity measure. Compression of room impulse responses for compact storage and fast low-latency convolution Guest editorial: AI for computational audition—sound and music processing