IEEE/ACM Transactions on Audio, Speech, and Language Processing最新文献_第9页

On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement 论用于单声道语音增强的复值变分 U 网络的泛化能力

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-15 DOI: 10.1109/TASLP.2024.3444492

Eike J. Nustede;Jörn Anemüller

The ability to generalize well to different environments is of importance for audio de-noising systems in real-world scenarios. Especially single-channel signals require efficient noise filtering without impacting speech intelligibility negatively. Our previous work has shown that a probabilistic latent space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only, as well as complex-valued U-Network models, on two different datasets, and in a train-test mismatch scenario. Adaptability of models is evaluated by introducing a curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while increases in network size are negligible. Improvements of up to 0.97 dB SI-SDR in matched, and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability-score aligns well with observed performance behaviour, further validating the probabilistic latent space model.

对于真实世界中的音频去噪系统来说，能够很好地适应不同环境是非常重要的。尤其是单通道信号，需要在不对语音清晰度产生负面影响的情况下进行高效的噪声过滤。我们之前的工作表明，概率潜空间模型与 U-Network 架构相结合，在一定程度上提高了性能和泛化能力。在此，我们将在两个不同的数据集上，并在训练-测试不匹配的情况下，进一步评估纯幅度模型和复值 U-Network 模型。通过引入与 "曲线下面积 "指标类似的基于曲线的评分，对模型的适应性进行了评估。所提出的概率潜空间模型在大多数情况下都优于其消融变体，也优于著名的比较方法，而网络规模的增加可以忽略不计。在匹配条件下，SI-SDR 可提高 0.97 dB，在不匹配条件下，SI-SDR 可提高 2.72 dB，SI-SDR 总分最高分别为 20.21 dB 和 18.71 dB。建议的稳定性分数与观察到的性能表现非常吻合，进一步验证了概率潜空间模型。

{"title":"On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement","authors":"Eike J. Nustede;Jörn Anemüller","doi":"10.1109/TASLP.2024.3444492","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444492","url":null,"abstract":"The ability to generalize well to different environments is of importance for audio de-noising systems in real-world scenarios. Especially single-channel signals require efficient noise filtering without impacting speech intelligibility negatively. Our previous work has shown that a probabilistic latent space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only, as well as complex-valued U-Network models, on two different datasets, and in a train-test mismatch scenario. Adaptability of models is evaluated by introducing a curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while increases in network size are negligible. Improvements of up to 0.97 dB SI-SDR in matched, and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability-score aligns well with observed performance behaviour, further validating the probabilistic latent space model.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3838-3849"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637717","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142084492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings 利用频率动态卷积和 BEATs 音频嵌入增强基于共形器的声音事件检测

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-15 DOI: 10.1109/TASLP.2024.3444490

Sara Barahona;Diego de Benito-Gorrón;Doroteo T. Toledano;Daniel Ramos

Over the last few years, most of the tasks employing Deep Learning techniques for audio processing have achieved state-of-the-art results employing Conformer-based systems. However, when it comes to sound event detection (SED), it was scarcely used after it won the DCASE Challenge 2020 Task 4. In previous research, we found that Conformer-based systems achieved a higher performance in terms of sound events classification compared to other architectures frequently employed, such as Convolutional Recurrent Neural Networks (CRNNs). Given that the second scenario proposed for the Polyphonic Sound Detection Score (PSDS2) is focused on avoiding confusion between classes, in this paper we propose to optimize a Conformer-based system to maximize the performance on this scenario. For this purpose, we performed a hyperparameter tuning and incorporated recently proposed Frequency Dynamic Convolutions (FDY) to enhance its classification properties. Additionally, we employed our previously proposed multi-resolution approach not only to enhance the performance but also to gain a deeper understanding of the Conformer architecture for SED, analyzing its advantages and disadvantages, and finding possible solutions to them. Additionally, we explored the integration of embeddings from the pre-trained model BEATs, an iterative framework to learn Bidirectional Encoder representation from Audio Transformers. By concatenating these embeddings into the input of the Conformer blocks, results were further improved, achieving a PSDS2 value of 0.813 and considerably outperforming SED systems based on CRNNs.

在过去几年中，大多数采用深度学习技术进行音频处理的任务都利用基于 Conformer 的系统取得了最先进的成果。然而，当涉及声音事件检测（SED）时，它在赢得 DCASE 挑战赛 2020 任务 4 之后就很少被使用了。在之前的研究中，我们发现基于 Conformer 的系统在声音事件分类方面比其他经常使用的架构（如卷积递归神经网络（CRNN））性能更高。鉴于为复调声音检测评分（PSDS2）提出的第二种方案侧重于避免类别之间的混淆，我们在本文中建议优化基于 Conformer 的系统，以最大限度地提高该方案的性能。为此，我们对超参数进行了调整，并采用了最近提出的频率动态卷积（FDY）来增强其分类性能。此外，我们还采用了之前提出的多分辨率方法，不仅提高了性能，还加深了对 SED 的 Conformer 架构的理解，分析了其优缺点，并找到了可能的解决方案。此外，我们还探索了如何整合来自预训练模型 BEATs 的嵌入，这是一个从音频变换器学习双向编码器表示的迭代框架。通过将这些嵌入信息串联到 Conformer 模块的输入中，结果得到了进一步改善，PSDS2 值达到了 0.813，大大优于基于 CRNN 的 SED 系统。

{"title":"Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings","authors":"Sara Barahona;Diego de Benito-Gorrón;Doroteo T. Toledano;Daniel Ramos","doi":"10.1109/TASLP.2024.3444490","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444490","url":null,"abstract":"Over the last few years, most of the tasks employing Deep Learning techniques for audio processing have achieved state-of-the-art results employing Conformer-based systems. However, when it comes to sound event detection (SED), it was scarcely used after it won the DCASE Challenge 2020 Task 4. In previous research, we found that Conformer-based systems achieved a higher performance in terms of sound events classification compared to other architectures frequently employed, such as Convolutional Recurrent Neural Networks (CRNNs). Given that the second scenario proposed for the Polyphonic Sound Detection Score (PSDS2) is focused on avoiding confusion between classes, in this paper we propose to optimize a Conformer-based system to maximize the performance on this scenario. For this purpose, we performed a hyperparameter tuning and incorporated recently proposed Frequency Dynamic Convolutions (FDY) to enhance its classification properties. Additionally, we employed our previously proposed multi-resolution approach not only to enhance the performance but also to gain a deeper understanding of the Conformer architecture for SED, analyzing its advantages and disadvantages, and finding possible solutions to them. Additionally, we explored the integration of embeddings from the pre-trained model BEATs, an iterative framework to learn Bidirectional Encoder representation from Audio Transformers. By concatenating these embeddings into the input of the Conformer blocks, results were further improved, achieving a PSDS2 value of 0.813 and considerably outperforming SED systems based on CRNNs.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3896-3907"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637738","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142143639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

BirdVoxDetect: Large-Scale Detection and Classification of Flight Calls for Bird Migration Monitoring BirdVoxDetect：用于鸟类迁徙监测的大规模飞行鸣叫检测和分类

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-15 DOI: 10.1109/TASLP.2024.3444486

Vincent Lostanlen;Aurora Cramer;Justin Salamon;Andrew Farnsworth;Benjamin M. Van Doren;Steve Kelling;Juan Pablo Bello

Sound event classification has the potential to advance our understanding of bird migration. Although it is long known that migratory species have a vocal signature of their own, previous work on automatic flight call classification has been limited in robustness and scope: e.g., covering few recording sites, short acquisition segments, and simplified biological taxonomies. In this paper, we present BirdVoxDetect (BVD), the first full-fledged solution to bird migration monitoring from acoustic sensor network data. As an open-source software, BVD integrates an original pipeline of three machine learning modules. The first module is a random forest classifier of sensor faults, trained with human-in-the-loop active learning. The second module is a deep convolutional neural network for sound event detection with per-channel energy normalization (PCEN). The third module is a multitask convolutional neural network which predicts the family, genus, and species of flight calls from passerines (Passeriformes) of North America. We evaluate BVD on a new dataset (296 hours from nine locations, the largest to date for this task) and discuss the main sources of estimation error in a real-world deployment: mechanical sensor failures, sensitivity to background noise, misdetection, and taxonomic confusion. Then, we deploy BVD to an unprecedented scale: 6672 hours of audio (approximately one terabyte), corresponding to a full season of bird migration. Running BVD in parallel over the full-season dataset yields 1.6 billion FFT's, 480 million neural network predictions, and over six petabytes of throughput. With this method, our main finding is that deep learning and bioacoustic sensor networks are ready to complement radar observations and crowdsourced surveys for bird migration monitoring, thus benefiting conservation ecology and land-use planning at large.

声音事件分类有可能促进我们对鸟类迁徙的了解。虽然人们早就知道迁徙物种有自己的声音特征，但以前的自动飞行鸣叫分类工作在稳健性和范围上都很有限：例如，覆盖的记录点少、采集片段短、生物分类法简化。在本文中，我们介绍了 BirdVoxDetect（BVD），这是首个利用声学传感器网络数据监测鸟类迁徙的成熟解决方案。作为一款开源软件，BVD 集成了由三个机器学习模块组成的原创管道。第一个模块是传感器故障的随机森林分类器，由人工在环主动学习训练而成。第二个模块是用于声音事件检测的深度卷积神经网络，采用每通道能量归一化（PCEN）技术。第三个模块是一个多任务卷积神经网络，用于预测北美雀形目（Passeriformes）飞行鸣叫的科、属和种。我们在一个新的数据集上对 BVD 进行了评估（来自 9 个地点的 296 个小时，是迄今为止该任务中最大的数据集），并讨论了实际部署中估计误差的主要来源：机械传感器故障、对背景噪声的敏感性、错误检测和分类混淆。然后，我们以前所未有的规模部署了 BVD：6672 小时的音频（约 1 TB），相当于一整个鸟类迁徙季节。在整个季节的数据集上并行运行 BVD 可产生 16 亿次 FFT、4.8 亿次神经网络预测和超过 6 PB 的吞吐量。通过这种方法，我们的主要发现是，深度学习和生物声学传感器网络可以补充雷达观测和众包调查对鸟类迁徙的监测，从而有利于保护生态学和土地利用规划。

{"title":"BirdVoxDetect: Large-Scale Detection and Classification of Flight Calls for Bird Migration Monitoring","authors":"Vincent Lostanlen;Aurora Cramer;Justin Salamon;Andrew Farnsworth;Benjamin M. Van Doren;Steve Kelling;Juan Pablo Bello","doi":"10.1109/TASLP.2024.3444486","DOIUrl":"10.1109/TASLP.2024.3444486","url":null,"abstract":"Sound event classification has the potential to advance our understanding of bird migration. Although it is long known that migratory species have a vocal signature of their own, previous work on automatic flight call classification has been limited in robustness and scope: e.g., covering few recording sites, short acquisition segments, and simplified biological taxonomies. In this paper, we present BirdVoxDetect (BVD), the first full-fledged solution to bird migration monitoring from acoustic sensor network data. As an open-source software, BVD integrates an original pipeline of three machine learning modules. The first module is a random forest classifier of sensor faults, trained with human-in-the-loop active learning. The second module is a deep convolutional neural network for sound event detection with per-channel energy normalization (PCEN). The third module is a multitask convolutional neural network which predicts the family, genus, and species of flight calls from passerines \u0000<italic>(Passeriformes)</i>\u0000 of North America. We evaluate BVD on a new dataset (296 hours from nine locations, the largest to date for this task) and discuss the main sources of estimation error in a real-world deployment: mechanical sensor failures, sensitivity to background noise, misdetection, and taxonomic confusion. Then, we deploy BVD to an unprecedented scale: 6672 hours of audio (approximately one terabyte), corresponding to a full season of bird migration. Running BVD in parallel over the full-season dataset yields 1.6 billion FFT's, 480 million neural network predictions, and over six petabytes of throughput. With this method, our main finding is that deep learning and bioacoustic sensor networks are ready to complement radar observations and crowdsourced surveys for bird migration monitoring, thus benefiting conservation ecology and land-use planning at large.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4134-4145"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation 用于多对多多语言语音到语音翻译的无文本单元到单元训练

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-15 DOI: 10.1109/TASLP.2024.3444470

Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro

This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during the training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that the UTUT can also perform language translations for novel language pairs that are not present during training as pairs, which has not well been explored in the previous literature.

本文提出了一种用于多对多多语种语音到语音翻译的无文本训练方法，这种方法也有利于将预先训练好的知识转移到基于文本的系统、文本到语音合成和文本到语音翻译中。为此，我们用语音单元来表示多语言语音，这些语音单元是由自监督语音模型得出的语音特征的离散表示。通过将语音单元视为伪文本，我们可以将注意力集中在语音的语言内容上，这样就可以很容易地将语音和文本模式的语音信息联系起来。通过将学习问题的输入和输出都设置为语音单元，我们提出了在多对多口语翻译环境中训练编码器-解码器模型的方法，即单元对单元翻译（UTUT）。具体来说，编码器以源语言标记为条件，正确理解输入的口语，而解码器则以目标语言标记为条件，生成目标语言的翻译语音。因此，在训练过程中，模型可以建立有关语言理解方式以及如何将它们与不同语言联系起来的知识。由于语音单元可以很容易地通过量化和音素化分别从音频和文本中关联起来，因此即使是以无文本方式训练的模型，也可以很容易地转移到与文本相关的任务中。我们证明，所提出的 UTUT 模型不仅可以有效地用于语音到语音翻译（S2ST），还可以用于多语言文本到语音合成（T2S）和文本到语音翻译（T2ST），只需对文本输入进行最小限度的微调。通过开展涵盖各种语言的综合实验，我们验证了所提方法在各种多语言任务中的有效性。此外，得益于多对多的语言训练，我们证明了UTUT 还能对训练过程中不存在的新语言对进行语言翻译，而这在之前的文献中还没有得到很好的探讨。

{"title":"Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation","authors":"Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro","doi":"10.1109/TASLP.2024.3444470","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444470","url":null,"abstract":"This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during the training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that the UTUT can also perform language translations for novel language pairs that are not present during training as pairs, which has not well been explored in the previous literature.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3934-3946"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142159099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation 基于上下文信息开发的粗到细目标扬声器提取

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-08 DOI: 10.1109/TASLP.2024.3440638

Xue Yang;Changchun Bao;Xianhong Chen

To address the cocktail party problem, the target speaker extraction (TSE) has received increasing attention recently. Typically, the TSE is explored in two scenarios. The first scenario is a specific one, where the target speaker is present and the signal received by the microphone contains at least two speakers. The second scenario is a universal one, where the target speaker may be present or absent and the received signal may contain one or multiple speakers. Numerous TSE studies utilize the target speaker's embedding to guide the extraction. However, solely utilizing this embedding may not fully leverage the contextual information within the enrollment. To address this limitation, a novel approach that directly exploits the contextual information in the time-frequency (T-F) domain was proposed. This paper improves this approach by integrating our previously proposed coarse-to-fine framework. For the specific scenario, an interaction block is employed to facilitate direct interaction between the T-F representations of the enrollment and received signal. This direct interaction leads to the consistent representation of the enrollment that serves as guidance for the coarse extraction. Afterwards, the T-F representation of the coarsely extracted signal is utilized to guide the refining extraction. The residual representation obtained during the refining extraction increases the extraction precision. Besides, this paper explores an undisturbed universal scenario where the noise and reverberation are not considered. A two-level decision-making scheme is devised to generalize our proposed method for this undisturbed universal scenario. The proposed method achieves high performance and is proven effective for both scenarios.

为解决鸡尾酒会问题，目标发言人提取（TSE）近来受到越来越多的关注。通常，TSE 在两种情况下进行探索。第一种情况是特定情况，即目标说话者在场，麦克风接收到的信号至少包含两个说话者。第二种情况是普遍情况，目标扬声器可能存在，也可能不存在，接收到的信号可能包含一个或多个扬声器。许多 TSE 研究利用目标扬声器的嵌入来指导提取。然而，仅仅利用这种嵌入可能无法充分利用注册中的上下文信息。为了解决这一局限性，有人提出了一种直接利用时频（T-F）域上下文信息的新方法。本文通过整合我们之前提出的 "从粗到细 "框架，对这一方法进行了改进。针对特定场景，采用了一个交互块，以促进报名和接收信号的 T-F 表示之间的直接交互。这种直接互动会产生一致的报名表示，为粗提取提供指导。之后，粗提取信号的 T-F 表示将用于指导精提取。精提取过程中获得的残差表示提高了提取精度。此外，本文还探讨了不考虑噪声和混响的无干扰通用场景。本文设计了一种两级决策方案，将我们提出的方法推广到这种无干扰通用场景中。所提出的方法实现了高性能，并被证明对这两种场景都有效。

{"title":"Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation","authors":"Xue Yang;Changchun Bao;Xianhong Chen","doi":"10.1109/TASLP.2024.3440638","DOIUrl":"10.1109/TASLP.2024.3440638","url":null,"abstract":"To address the cocktail party problem, the target speaker extraction (TSE) has received increasing attention recently. Typically, the TSE is explored in two scenarios. The first scenario is a specific one, where the target speaker is present and the signal received by the microphone contains at least two speakers. The second scenario is a universal one, where the target speaker may be present or absent and the received signal may contain one or multiple speakers. Numerous TSE studies utilize the target speaker's embedding to guide the extraction. However, solely utilizing this embedding may not fully leverage the contextual information within the enrollment. To address this limitation, a novel approach that directly exploits the contextual information in the time-frequency (T-F) domain was proposed. This paper improves this approach by integrating our previously proposed coarse-to-fine framework. For the specific scenario, an interaction block is employed to facilitate direct interaction between the T-F representations of the enrollment and received signal. This direct interaction leads to the consistent representation of the enrollment that serves as guidance for the coarse extraction. Afterwards, the T-F representation of the coarsely extracted signal is utilized to guide the refining extraction. The residual representation obtained during the refining extraction increases the extraction precision. Besides, this paper explores an undisturbed universal scenario where the noise and reverberation are not considered. A two-level decision-making scheme is devised to generalize our proposed method for this undisturbed universal scenario. The proposed method achieves high performance and is proven effective for both scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3795-3810"},"PeriodicalIF":4.1,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Theoretical Analysis of Maclaurin Expansion Based Linear Differential Microphone Arrays and Improved Solutions 基于麦克劳林扩展的线性差分麦克风阵列的理论分析和改进方案

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-07 DOI: 10.1109/TASLP.2024.3439994

Jinfu Wang;Feiran Yang;Xiaoqing Hu;Jun Yang

Linear differential microphone arrays (LDMAs) are becoming popular due to their potentially high directional gain and frequency-invariant beampattern. By increasing the number of microphones, the Maclaurin expansion-based LDMAs address the inherently poor robustness problem of the conventional LDMA at low frequencies. However, this method encounters severe beampattern distortion and the deep nulls problem in the white noise gain (WNG) and the directivity factor (DF) at high frequencies as the number of microphones increases. In this paper, we reveal that the severe beampattern distortion is attributed to the deviation term of the synthesized beampattern while the deep nulls problem in the WNG and the DF is attributed to the violation of the distortionless constraint in the desired direction. We then propose two new design methods to avoid the degraded performance of LDMAs. Compared to the Maclaurin series expansion-based method, the first method additionally imposes the distortionless constraint in the desired direction, and the deep nulls problem in the WNG and the DF can be avoided. The second method explicitly requires the response of the higher order spatial directivity pattern in the deviation term to be zero, and thus the beampattern distortion can be avoided. By choosing the frequency-wise parameter that determines the number of the considered higher order spatial directivity patterns, the second method enables a good trade-off between the WNG and the beampattern distortion. Simulations exemplify the superiority of the proposed method against existing methods in terms of the robustness and the beampattern distortion.

线性差分传声器阵列（LDMA）因其潜在的高指向性增益和频率不变的振型而越来越受欢迎。通过增加传声器数量，基于麦克劳林扩展的 LDMA 解决了传统 LDMA 在低频下固有的鲁棒性差的问题。然而，随着麦克风数量的增加，这种方法会遇到严重的贝型失真以及高频白噪声增益（WNG）和指向性因子（DF）的深空问题。本文揭示了严重的贝型失真归因于合成贝型的偏差项，而白噪增益（WNG）和指向性因子（DF）中的深空问题则归因于在所需方向上违反了无失真约束。我们随后提出了两种新的设计方法，以避免 LDMA 性能下降。与基于 Maclaurin 数列展开的方法相比，第一种方法在所需方向上额外施加了无失真约束，从而避免了 WNG 和 DF 中的深空问题。第二种方法明确要求偏差项中高阶空间指向性模式的响应为零，因此可以避免振型失真。通过选择决定所考虑的高阶空间指向性模式数量的频率参数，第二种方法可以在 WNG 和贝叶斯失真之间实现良好的权衡。仿真结果表明，与现有方法相比，所提出的方法在鲁棒性和振铃失真方面更具优势。

{"title":"Theoretical Analysis of Maclaurin Expansion Based Linear Differential Microphone Arrays and Improved Solutions","authors":"Jinfu Wang;Feiran Yang;Xiaoqing Hu;Jun Yang","doi":"10.1109/TASLP.2024.3439994","DOIUrl":"10.1109/TASLP.2024.3439994","url":null,"abstract":"Linear differential microphone arrays (LDMAs) are becoming popular due to their potentially high directional gain and frequency-invariant beampattern. By increasing the number of microphones, the Maclaurin expansion-based LDMAs address the inherently poor robustness problem of the conventional LDMA at low frequencies. However, this method encounters severe beampattern distortion and the deep nulls problem in the white noise gain (WNG) and the directivity factor (DF) at high frequencies as the number of microphones increases. In this paper, we reveal that the severe beampattern distortion is attributed to the deviation term of the synthesized beampattern while the deep nulls problem in the WNG and the DF is attributed to the violation of the distortionless constraint in the desired direction. We then propose two new design methods to avoid the degraded performance of LDMAs. Compared to the Maclaurin series expansion-based method, the first method additionally imposes the distortionless constraint in the desired direction, and the deep nulls problem in the WNG and the DF can be avoided. The second method explicitly requires the response of the higher order spatial directivity pattern in the deviation term to be zero, and thus the beampattern distortion can be avoided. By choosing the frequency-wise parameter that determines the number of the considered higher order spatial directivity patterns, the second method enables a good trade-off between the WNG and the beampattern distortion. Simulations exemplify the superiority of the proposed method against existing methods in terms of the robustness and the beampattern distortion.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3811-3825"},"PeriodicalIF":4.1,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors 使用非自回归吸引子的端到端神经扬声器标示法

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-07 DOI: 10.1109/TASLP.2024.3439993

Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk

Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.

尽管最近在说话人日记化方面取得了许多进展，但如何使日记化在现实生活中既稳健又有效，仍然是一个挑战，也是一个活跃的研究领域。基于聚类的成熟方法显示出良好的性能和质量。然而，这类系统由多个独立的、单独优化的模块组成，可能会导致性能不理想。端到端神经扬声器日记化（EEND）系统被认为是追求高性能日记化的下一块基石。然而，这种方法也有其局限性，比如在处理长录音和录音中扬声器数量较多（超过四个）或未知扬声器数量的情况时。基于编码器-解码器吸引子的 EEND（EEND-EDA）的出现，使我们能够利用基于 LSTM 的 EDA 模块，灵活处理包含大量发言人的录音。与 EEND-EDA 基线相比，本文作者最近提出的具有非自回归吸引子的 EEND（EEND-NAA）估计是一种有竞争力的替代方案。NAA 后端将 k-means 聚类作为吸引子估计的一部分，并采用基于变换解码器的吸引子细化模块。不过，在我们之前的 EEND-NAA 工作中，我们假设了已知的扬声器数量，而且实验评估仅限于 2 个扬声器的录音。在本文中，我们将详细介绍我们最近的 EEND-NAA 方法，并提出对 EEND-NAA 架构的进一步改进，引入 NAA 后端的三种新变体，它们可以处理包含不同和未知发言人数的语音录音。实验包括使用 Switchboard 和 NIST SRE 数据集生成的模拟混合物，以及来自 CALLHOME 和 DIHARD II 数据集的真实录音。在实验评估中，与基线 EEND-EDA 相比，建议的系统在模拟场景中实现了高达 51% 的相对改进，在真实录音中实现了高达 15% 的相对改进。

{"title":"End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors","authors":"Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk","doi":"10.1109/TASLP.2024.3439993","DOIUrl":"10.1109/TASLP.2024.3439993","url":null,"abstract":"Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3960-3973"},"PeriodicalIF":4.1,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization 通过自适应神经网络量化实现轻量级扬声器验证

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-05 DOI: 10.1109/TASLP.2024.3437237

Bei Liu;Haoyu Wang;Yanmin Qian

Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. This approach assigns varying bit widths to different network layers. When bit combinations are determined, MSFT progressively quantizes and fine-tunes the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of

$sim$

8. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.

现代说话人验证（SV）系统通常需要昂贵的存储和计算资源，因此阻碍了它们在移动设备上的部署。在本文中，我们探讨了用于轻量级说话人验证的自适应神经网络量化方法。首先，我们提出了一种新颖的自适应统一精度量化方法，该方法能够在 K 均值聚类的基础上动态生成为每个网络层定制的量化中心点。通过将其应用于预训练 SV 系统，我们获得了一系列不同位宽的量化变体。为了增强低位量化模型，我们进一步引入了混合精度量化算法和多级微调（MSFT）策略。这种方法为不同的网络层分配不同的位宽。当位组合确定后，MSFT 按特定顺序逐步量化和微调网络。最后，我们设计了两种不同的二进制量化方案，以减轻 1 位量化模型的性能下降：静态量化器和自适应量化器。在 VoxCeleb 上进行的实验表明，在 ResNets 和 DF-ResNets 上实现了无损的 4 位统一精度量化，压缩率达到了 8 美元。此外，与统一精度方法相比，混合精度量化不仅能在模型大小相似的情况下获得额外的性能改进，还能灵活地生成任何理想模型大小的位组合。此外，我们建议的 1 位量化方案显著提高了二值化模型的性能。最后，与现有的轻量级 SV 系统进行全面比较后发现，在各种模型大小范围内，我们提出的模型都远远优于之前的所有方法。

{"title":"Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization","authors":"Bei Liu;Haoyu Wang;Yanmin Qian","doi":"10.1109/TASLP.2024.3437237","DOIUrl":"10.1109/TASLP.2024.3437237","url":null,"abstract":"Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. This approach assigns varying bit widths to different network layers. When bit combinations are determined, MSFT progressively quantizes and fine-tunes the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of \u0000<inline-formula><tex-math>$sim$</tex-math></inline-formula>\u00008. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3771-3784"},"PeriodicalIF":4.1,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks SpeechPrompt：提示语音语言模型完成语音处理任务

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-02 DOI: 10.1109/TASLP.2024.3436618

Kai-Wei Chang;Haibin Wu;Yu-Kai Wang;Yuan-Kuei Wu;Hua Shen;Wei-Cheng Tseng;Iu-Thing Kang;Shang-Wen Li;Hung-Yi Lee

Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.

提示已成为利用预训练语言模型（LM）的一种实用方法。这种方法有几个优点。它能让 LM 以最少的训练和参数更新适应新任务，从而实现存储和计算的高效率。此外，提示只修改 LM 的输入，并利用语言模型的生成能力，以统一的方式处理各种下游任务。这大大减少了设计特定任务模型的人力需求。随着语言模型所服务任务数量的增加，这些优势会变得更加明显。在提示功能优势的推动下，我们首次探索了提示语音 LM 在语音处理领域的潜力。最近，人们对将语音转换为离散单元进行语言建模的兴趣日益浓厚。我们的开创性研究表明，在我们的统一提示框架内，这些量化语音单元具有很强的通用性。它们不仅可以作为类标签，还包含丰富的语音信息，可以重新合成为语音信号，用于语音生成任务。具体来说，我们将语音处理任务重新表述为语音单元生成任务。因此，我们可以在一个统一的提示框架内无缝整合语音分类、序列生成和语音生成等任务。实验结果表明，在可训练参数数量相近的情况下，与基于自监督学习模型的强微调方法相比，提示方法可以获得具有竞争力的性能。提示法还在少镜头设置中显示出良好的效果。此外，随着先进的语音 LM 进入舞台，所提出的提示框架将大有可为。

{"title":"SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks","authors":"Kai-Wei Chang;Haibin Wu;Yu-Kai Wang;Yuan-Kuei Wu;Hua Shen;Wei-Cheng Tseng;Iu-Thing Kang;Shang-Wen Li;Hung-Yi Lee","doi":"10.1109/TASLP.2024.3436618","DOIUrl":"10.1109/TASLP.2024.3436618","url":null,"abstract":"Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3730-3744"},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Artist Similarity Based on Heterogeneous Graph Neural Networks 基于异构图神经网络的艺术家相似性

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-08-02 DOI: 10.1109/TASLP.2024.3437170

Angelo Cesar Mendes da Silva;Diego Furtado Silva;Ricardo Marcondes Marcacini

Music streaming platforms rely on recommending similar artists to maintain user engagement, with artists benefiting from these suggestions to boost their popularity. Another important feature is music information retrieval, allowing users to explore new content. In both scenarios, performance depends on how to compute the similarity between musical content. This is a challenging process since musical data is inherently multimodal, containing textual and audio data. We propose a novel graph-based artist representation that integrates audio, lyrics features, and artist relations. Thus, a multimodal representation on a heterogeneous graph is proposed, along with a network regularization process followed by a GNN model to aggregate multimodal information into a more robust unified representation. The proposed method explores this final multimodal representation for the task of artist similarity as a link prediction problem. Our method introduces a new importance matrix to emphasize related artists in this multimodal space. We compare our approach with other strong baselines based on combining input features, importance matrix construction, and GNN models. Experimental results highlight the superiority of multimodal representation through the transfer learning process and the value of the importance matrix in enhancing GNN models for artist similarity.

音乐流媒体平台依靠推荐相似的艺人来维持用户参与度，艺人则从这些推荐中获益，从而提高自己的人气。另一个重要功能是音乐信息检索，允许用户探索新内容。在这两种情况下，性能都取决于如何计算音乐内容之间的相似性。这是一个具有挑战性的过程，因为音乐数据本身就是多模态的，包含文本和音频数据。我们提出了一种新颖的基于图的艺术家表示法，它整合了音频、歌词特征和艺术家关系。因此，我们提出了一种异构图上的多模态表示法，以及一种网络正则化过程，然后使用 GNN 模型将多模态信息聚合到一个更强大的统一表示法中。所提出的方法将这种最终的多模态表示法用于艺术家相似性任务的链接预测问题。我们的方法引入了新的重要性矩阵，以强调多模态空间中的相关艺术家。我们将我们的方法与其他基于输入特征组合、重要性矩阵构建和 GNN 模型的强大基线进行了比较。实验结果凸显了通过迁移学习过程进行多模态表示的优越性，以及重要性矩阵在增强艺术家相似性 GNN 模型方面的价值。

{"title":"Artist Similarity Based on Heterogeneous Graph Neural Networks","authors":"Angelo Cesar Mendes da Silva;Diego Furtado Silva;Ricardo Marcondes Marcacini","doi":"10.1109/TASLP.2024.3437170","DOIUrl":"10.1109/TASLP.2024.3437170","url":null,"abstract":"Music streaming platforms rely on recommending similar artists to maintain user engagement, with artists benefiting from these suggestions to boost their popularity. Another important feature is music information retrieval, allowing users to explore new content. In both scenarios, performance depends on how to compute the similarity between musical content. This is a challenging process since musical data is inherently multimodal, containing textual and audio data. We propose a novel graph-based artist representation that integrates audio, lyrics features, and artist relations. Thus, a multimodal representation on a heterogeneous graph is proposed, along with a network regularization process followed by a GNN model to aggregate multimodal information into a more robust unified representation. The proposed method explores this final multimodal representation for the task of artist similarity as a link prediction problem. Our method introduces a new importance matrix to emphasize related artists in this multimodal space. We compare our approach with other strong baselines based on combining input features, importance matrix construction, and GNN models. Experimental results highlight the superiority of multimodal representation through the transfer learning process and the value of the importance matrix in enhancing GNN models for artist similarity.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3717-3729"},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141880423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0