
Latest articles from IEEE/ACM Transactions on Audio, Speech, and Language Processing

Graph-Based Cross-Granularity Message Passing on Knowledge-Intensive Text
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-10-02 | DOI: 10.1109/TASLP.2024.3473308
Chenwei Yan;Xiangling Fu;Xinxin You;Ji Wu;Xien Liu
In knowledge-intensive fields such as medicine, the text often contains numerous professional terms, specific text fragments, and multidimensional information. However, most existing text representation methods ignore this specialized knowledge and instead adopt methods similar to those used in the general domain. In this paper, we focus on developing a learning module to enhance the representation ability of knowledge-intensive text by leveraging a graph-based cross-granularity message passing mechanism. To this end, we propose a novel learning framework, the Multi-Granularity Graph Neural Network (MG-GNN), to integrate fine-grained and coarse-grained knowledge at the character, word, and phrase levels. The MG-GNN performs learning in two stages: 1) inter-granularity learning and 2) intra-granularity learning. During inter-granularity learning, semantic knowledge is extracted from character, word, and phrase granularity graphs, whereas intra-granularity learning focuses on fusing knowledge across different granularity graphs to achieve comprehensive message integration. To enhance the fusion performance, we propose a context-based gating mechanism to guide cross-graph propagation learning. Furthermore, we apply MG-GNN to address two important medical applications. Experimental results demonstrate that our proposed MG-GNN model significantly enhances the performance in both diagnosis prediction and medical named entity recognition tasks.
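As an illustration of the context-based gating idea, the sketch below fuses character-, word-, and phrase-level node embeddings through sigmoid gates conditioned on a shared context vector. It is a minimal PyTorch sketch under assumed shapes; the class and argument names (GatedGranularityFusion, char_h, word_h, phrase_h, context) are hypothetical and not taken from the MG-GNN implementation.

```python
import torch
import torch.nn as nn

class GatedGranularityFusion(nn.Module):
    """Illustrative context-gated fusion of character-, word-, and
    phrase-level node embeddings (hypothetical layer, not the authors' code)."""

    def __init__(self, dim: int):
        super().__init__()
        # One gate per granularity, conditioned on a shared context vector.
        self.gates = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(3)])

    def forward(self, char_h, word_h, phrase_h, context):
        fused = 0
        for gate, h in zip(self.gates, (char_h, word_h, phrase_h)):
            # The sigmoid gate decides how much of each granularity's message
            # flows into the fused representation.
            g = torch.sigmoid(gate(torch.cat([h, context], dim=-1)))
            fused = fused + g * h
        return fused

if __name__ == "__main__":
    dim, n_nodes = 64, 10
    layer = GatedGranularityFusion(dim)
    char_h = torch.randn(n_nodes, dim)
    word_h = torch.randn(n_nodes, dim)
    phrase_h = torch.randn(n_nodes, dim)
    context = torch.randn(n_nodes, dim)
    print(layer(char_h, word_h, phrase_h, context).shape)  # torch.Size([10, 64])
```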
Citations: 0
Cross-Utterance Conditioned VAE for Speech Generation
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-30 | DOI: 10.1109/TASLP.2024.3453598
Yang Li;Cheng Yu;Guangzhi Sun;Weiqin Zu;Zheng Tian;Ying Wen;Wei Pan;Chao Zhang;Jun Wang;Yang Yang;Fanglei Sun
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
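The cross-utterance conditioning can be pictured as a conditional VAE whose prosody latent is inferred from the current utterance's acoustic features together with a context vector summarizing surrounding utterances. The sketch below is a toy version under assumed dimensions; CrossUtteranceCVAE and its layer sizes are illustrative and much simpler than the actual CUC-VAE S2 model.

```python
import torch
import torch.nn as nn

class CrossUtteranceCVAE(nn.Module):
    """Toy cross-utterance conditional VAE: the prosody latent for the current
    utterance is inferred from its acoustic features plus a context vector
    summarizing surrounding utterances (hypothetical shapes and names)."""

    def __init__(self, acoustic_dim=80, context_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(acoustic_dim + context_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + context_dim, 128), nn.ReLU(),
            nn.Linear(128, acoustic_dim),
        )

    def forward(self, acoustic, context):
        h = self.encoder(torch.cat([acoustic, context], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, context], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

if __name__ == "__main__":
    model = CrossUtteranceCVAE()
    recon, kl = model(torch.randn(4, 80), torch.randn(4, 256))
    print(recon.shape, kl.item())
```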
Citations: 0
Cross Domain Optimization for Speech Enhancement: Parallel or Cascade?
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-26 | DOI: 10.1109/TASLP.2024.3468026
Liang Wan;Hongqing Liu;Liming Shi;Yi Zhou;Lu Gan
This paper introduces five novel deep-learning architectures for speech enhancement. Existing methods typically operate on time-domain representations, time-frequency representations, or a hybrid of the two. Recognizing the unique contributions of each domain to feature extraction and model design, this study investigates the integration of waveform and complex spectrogram models through cross-domain fusion to enhance speech feature learning and noise reduction, thereby improving speech quality. We examine both cascading and parallel configurations of waveform and complex spectrogram models to assess their effectiveness in speech enhancement. Additionally, we employ an orthogonal projection-based error decomposition technique and manage the inputs of individual sub-models to analyze factors affecting speech quality. The network is trained by optimizing three specific loss functions applied across all sub-models. Our experiments, using the DNS Challenge (ICASSP 2021) dataset, reveal that the proposed models surpass existing benchmarks in speech enhancement, offering superior speech quality and intelligibility. These results highlight the efficacy of our cross-domain fusion strategy.
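The cascade-versus-parallel question comes down to how a waveform-domain model and a complex-spectrogram model are wired together. The sketch below uses stand-in one-layer enhancers (TinyWaveModel and TinySpecModel, both hypothetical) purely to show the two wiring patterns; the parallel branch fuses outputs with a plain average rather than a learned fusion.

```python
import torch
import torch.nn as nn

class TinyWaveModel(nn.Module):
    """Stand-in waveform-domain enhancer (a single 1-D conv), not the paper's network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)
    def forward(self, wav):                               # wav: (batch, samples)
        return self.net(wav.unsqueeze(1)).squeeze(1)

class TinySpecModel(nn.Module):
    """Stand-in complex-spectrogram enhancer operating on STFT real/imag channels."""
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Conv2d(2, 2, kernel_size=3, padding=1)
    def forward(self, wav):
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window, return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)     # (batch, 2, freq, frames)
        y = self.net(x)
        spec_hat = torch.complex(y[:, 0], y[:, 1])
        return torch.istft(spec_hat, self.n_fft, self.hop, window=window,
                           length=wav.shape[-1])

def cascade(wav, wave_model, spec_model):
    """Cascade: the waveform stage runs first, its output is refined in the spectrogram domain."""
    return spec_model(wave_model(wav))

def parallel(wav, wave_model, spec_model):
    """Parallel: both domains enhance the noisy input; outputs are fused (here averaged)."""
    return 0.5 * (wave_model(wav) + spec_model(wav))

if __name__ == "__main__":
    wav = torch.randn(2, 16000)
    wm, sm = TinyWaveModel(), TinySpecModel()
    print(cascade(wav, wm, sm).shape, parallel(wav, wm, sm).shape)
```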
Citations: 0
Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3467951
Juliano G. C. Ribeiro;Shoichi Koyama;Ryosuke Horiuchi;Hiroshi Saruwatari
A sound field estimation method based on kernel interpolation with an adaptive kernel function is proposed. The kernel-interpolation-based sound field estimation methods enable physics-constrained interpolation from pressure measurements of distributed microphones with a linear estimator, which constrains interpolation functions to satisfy the Helmholtz equation. However, a fixed kernel function would not be capable of adapting to the acoustic environment in which the measurement is performed, limiting their applicability. To make the kernel function adaptive, we represent it with a sum of directed and residual trainable kernel functions. The directed kernel is defined by a weight function composed of a superposition of exponential functions to capture highly directional components. The weight function for the residual kernel is represented by neural networks to capture unpredictable spatial patterns of the residual components. Experimental results using simulated and real data indicate that the proposed method outperforms the current kernel-interpolation-based methods and a method based on physics-informed neural networks.
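For context, the non-adaptive baseline that the adaptive kernel extends can be written as kernel ridge regression with the Helmholtz-consistent kernel kappa(r, r') = sinc(k * ||r - r'||). The sketch below interpolates pressure at query points from microphone measurements under that fixed kernel; the paper's directed-plus-residual kernel would replace helmholtz_kernel, and all function names here are hypothetical.

```python
import numpy as np

def helmholtz_kernel(r1, r2, k):
    """Baseline isotropic kernel j0(k * distance); interpolants built from it
    satisfy the homogeneous Helmholtz equation in the interior region."""
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)
    out = np.ones_like(d)                      # sinc(0) = 1 at zero distance
    nz = d > 0
    out[nz] = np.sin(k * d[nz]) / (k * d[nz])
    return out

def estimate_pressure(mic_pos, mic_pressure, query_pos, k, reg=1e-3):
    """Kernel ridge regression: fit weights on the microphone measurements,
    then interpolate the pressure at arbitrary query points."""
    K = helmholtz_kernel(mic_pos, mic_pos, k)
    weights = np.linalg.solve(K + reg * np.eye(len(mic_pos)), mic_pressure)
    Kq = helmholtz_kernel(query_pos, mic_pos, k)
    return Kq @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 2 * np.pi * 500 / 343.0                      # wavenumber at 500 Hz
    mics = rng.uniform(-0.5, 0.5, size=(32, 3))      # 32 microphones in a 1 m cube
    src = np.array([2.0, 0.0, 0.0])
    def point_source(r):                             # free-field point-source pressure
        d = np.linalg.norm(r - src, axis=-1)
        return np.exp(1j * k * d) / (4 * np.pi * d)
    p_mics = point_source(mics)
    query = rng.uniform(-0.5, 0.5, size=(5, 3))
    p_hat = estimate_pressure(mics, p_mics, query, k)
    print(np.abs(p_hat - point_source(query)))       # interpolation error
```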
Citations: 0
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoders
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3468005
Yicheng Gu;Xueyao Zhang;Liumeng Xue;Haizhou Li;Zhizheng Wu
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention across different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution for different frequency bands. Between the two, CQT better models pitch information, while CWT better models short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
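A single-scale version of the idea, computing a constant-Q transform and scoring its magnitude with a small convolutional discriminator, can be sketched as follows. It assumes librosa for the CQT and PyTorch for the discriminator; the multi-scale sub-band and temporal-compression components of MS-SB-CQT and MS-TC-CWT are omitted, and the layer sizes are illustrative.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

class CQTDiscriminator(nn.Module):
    """Minimal single-scale CQT-based discriminator (illustrative only; the paper
    uses multi-scale sub-band variants combined with STFT/CWT discriminators)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 1, (3, 3), padding=1),   # patch-wise real/fake logits
        )

    def forward(self, cqt_mag):                    # (batch, 1, bins, frames)
        return self.net(cqt_mag)

def cqt_features(wav: np.ndarray, sr: int) -> torch.Tensor:
    """Constant-Q magnitude: log-spaced bins give finer frequency resolution at
    low frequencies and finer time resolution at high frequencies."""
    C = librosa.cqt(wav, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
    return torch.from_numpy(np.abs(C)).float()[None, None]   # add batch/channel dims

if __name__ == "__main__":
    sr = 22050
    wav = np.random.randn(sr).astype(np.float32)   # 1 s of noise as a stand-in signal
    disc = CQTDiscriminator()
    print(disc(cqt_features(wav, sr)).shape)
```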
Citations: 0
Three-Dimensional Room Transfer Function Parameterization Based on Multiple Concentric Planar Circular Arrays
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3468025
Lu Li;Maoshen Jia;Changchun Bao
This study proposes a three-dimensional room transfer function (RTF) parameterization method based on multiple concentric planar circular arrays, which exhibits robustness to variations in the positions of both the receiver and source. According to the harmonic solution to the wave equation, the RTFs between two spherical regions (sound source and receiver) in a room can be expressed as a weighted sum of spherical harmonics, whose weight coefficients serve as the RTF parameters, which can be estimated by placing multiple concentric planar circular arrays composed of monopole-source pairs (MSPs) and multiple concentric planar circular arrays composed of omnidirectional-microphone pairs (OMPs) in respective source and receiver regions. We use MSP arrays to generate required outgoing soundfields originating from a source region. We derive a method to use OMP arrays to estimate RTF parameters that are concealed within the captured soundfield, which can be employed to reconstruct the RTF from any point in the source region to any point in the receiver region. The accuracy of the RTF parameterization method is validated through simulation testing.
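The interior sound field implied by the parameterization is a weighted sum of spherical harmonics and spherical Bessel functions, p(r) = sum over (n, m) of c_nm * j_n(k r) * Y_n^m(theta, phi). The sketch below evaluates such an expansion at arbitrary points from given coefficients; in the paper the coefficients would be estimated from the OMP arrays, whereas here they are random placeholders.

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn

def interior_field(coeffs, points, k):
    """Evaluate p(r) = sum_{n,m} c_nm * j_n(k r) * Y_n^m(theta, phi) at Cartesian
    points inside the receiver region. `coeffs` is a dict mapping (n, m) to a
    complex coefficient (a hypothetical container, for illustration only)."""
    r = np.linalg.norm(points, axis=-1)
    theta = np.arctan2(points[:, 1], points[:, 0])                      # azimuth
    phi = np.arccos(np.clip(points[:, 2] / np.maximum(r, 1e-12), -1, 1))  # polar angle
    p = np.zeros(len(points), dtype=complex)
    for (n, m), c in coeffs.items():
        p += c * spherical_jn(n, k * r) * sph_harm(m, n, theta, phi)
    return p

if __name__ == "__main__":
    k = 2 * np.pi * 300 / 343.0                     # wavenumber at 300 Hz
    rng = np.random.default_rng(1)
    # Toy coefficients up to order N = 2; in practice they are estimated from measurements.
    coeffs = {(n, m): rng.standard_normal() + 1j * rng.standard_normal()
              for n in range(3) for m in range(-n, n + 1)}
    pts = rng.uniform(-0.1, 0.1, size=(4, 3))
    print(interior_field(coeffs, pts, k))
```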
Citations: 0
On the Quantization of Neural Models for Speaker Verification
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-20 | DOI: 10.1109/TASLP.2024.3463430
Vishal Kumar;Vinayak Abrol;Mathew Magimai Doss
This paper addresses the sub-optimality of current post-training quantization (PTQ) and quantization-aware training (QAT) methods for state-of-the-art speaker verification (SV) models featuring intricate architectural elements such as channel aggregation and squeeze excitation modules. To address these limitations, we propose 1) a data-independent PTQ technique employing iterative low-precision calibration on pre-trained models; and 2) a data-dependent QAT method designed to reduce the performance gap between full-precision and integer models. Our QAT involves two progressive stages where FP-32 weights are initially transformed into FP-8, adapting precision based on the gradient norm, followed by the learning of quantizer parameters (scale and zero-point) for INT8 conversion. Experimental validation underscores the ingenuity of our method in model quantization, demonstrating reduced floating-point operations and INT8 inference time, all while maintaining performance on par with full-precision models.
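The INT8 machinery shared by PTQ and QAT is an affine quantizer defined by a scale and a zero-point. The sketch below shows symmetric per-tensor quantization with naive min-max calibration; the paper's iterative low-precision calibration and gradient-norm-based FP-8 staging would replace the hypothetical calibrate_scale helper.

```python
import torch

def quantize_int8(x: torch.Tensor, scale: torch.Tensor, zero_point: int = 0):
    """Affine quantization to INT8: q = clamp(round(x / scale) + zero_point, -128, 127)."""
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return q.to(torch.int8)

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: int = 0):
    """Map INT8 codes back to floating point for simulated (fake) quantization."""
    return (q.float() - zero_point) * scale

def calibrate_scale(x: torch.Tensor) -> torch.Tensor:
    """Simple min-max calibration for a symmetric quantizer; an iterative
    low-precision calibration would refine this estimate."""
    return x.abs().max() / 127.0

if __name__ == "__main__":
    w = torch.randn(256, 256)            # a weight matrix from a pre-trained model
    scale = calibrate_scale(w)
    q = quantize_int8(w, scale)
    w_hat = dequantize(q, scale)
    print("max abs quantization error:", (w - w_hat).abs().max().item())
```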
Citations: 0
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463395
Haolin Chen;Philip N. Garner
We are motivated primarily by the adaptation of text-to-speech synthesis models; however, we argue that more generic parameter-efficient fine-tuning (PEFT) is an appropriate framework to do such adaptation. Nevertheless, catastrophic forgetting remains an issue with PEFT, damaging the pre-trained model's inherent capabilities. We demonstrate that existing Bayesian learning techniques can be applied to PEFT to prevent catastrophic forgetting as long as the parameter shift of the fine-tuned layers can be calculated differentiably. In a principled series of experiments on language modeling and speech synthesis tasks, we utilize established Laplace approximations, including diagonal and Kronecker-factored approaches, to regularize PEFT with the low-rank adaptation (LoRA) and compare their performance in pre-training knowledge preservation. Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker-factored approximation produces a better preservation of the pre-training knowledge than the diagonal one.
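Because LoRA writes the fine-tuned weights as W0 + BA, the parameter shift BA is available in closed, differentiable form, so a Laplace-style penalty can weight it by an estimate of the pre-training posterior's curvature. The sketch below shows the diagonal variant; a Kronecker-factored variant would multiply the shift by left and right factor matrices instead. The layer and function names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a low-rank adaptation: W_eff = W0 + B @ A.
    Only A and B are trained; the base weights W0 stay frozen."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def delta_w(self):
        return self.B @ self.A                      # differentiable parameter shift

    def forward(self, x):
        return self.base(x) + x @ self.delta_w().t()

def diagonal_laplace_penalty(layer: LoRALinear, precision: torch.Tensor):
    """Diagonal-Laplace regularizer: 0.5 * sum_i precision_i * (delta_theta_i)^2,
    where `precision` approximates the curvature of the pre-training posterior
    (e.g., accumulated squared gradients)."""
    return 0.5 * (precision * layer.delta_w().pow(2)).sum()

if __name__ == "__main__":
    lora = LoRALinear(nn.Linear(32, 32))
    precision = torch.rand(32, 32)                  # stand-in curvature estimate
    x, y = torch.randn(8, 32), torch.randn(8, 32)
    task_loss = nn.functional.mse_loss(lora(x), y)
    loss = task_loss + 1e-2 * diagonal_laplace_penalty(lora, precision)
    loss.backward()
    print(task_loss.item(), loss.item())
```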
Citations: 0
NeuroHeed: Neuro-Steered Speaker Extraction Using EEG Signals
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463498
Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li
Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities. In this work, we study such brain activities measured using affordable and non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail party scenario, in which the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results on KUL dataset two-speaker scenario demonstrate that NeuroHeed effectively extracts brain-attended speech signals with an average scale-invariant signal-to-noise ratio improvement (SI-SDRi) of 14.3 dB and extraction accuracy of 90.8% in offline settings, and SI-SDRi of 11.2 dB and extraction accuracy of 85.1% in online settings.
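Conceptually, the EEG is encoded into an attractor embedding that steers a mask-estimation network applied to the mixture representation. The sketch below uses a FiLM-style scale-and-shift conditioning as a stand-in for NeuroHeed's conditioning; all layers, dimensions, and names are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class EEGConditionedExtractor(nn.Module):
    """Toy neuro-steered extractor: an EEG encoder produces an attractor
    embedding that modulates a mask-estimation network over the mixture's
    learned representation (shapes and layers are illustrative)."""

    def __init__(self, n_eeg_ch=64, feat_dim=128):
        super().__init__()
        self.speech_enc = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.eeg_enc = nn.GRU(n_eeg_ch, feat_dim, batch_first=True)
        self.film = nn.Linear(feat_dim, 2 * feat_dim)      # scale and shift
        self.mask_net = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, eeg):
        feats = self.speech_enc(mixture.unsqueeze(1))       # (batch, feat, frames)
        _, h = self.eeg_enc(eeg)                            # attractor from EEG
        scale, shift = self.film(h[-1]).chunk(2, dim=-1)
        cond = feats * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        mask = torch.sigmoid(self.mask_net(cond))           # attended-speaker mask
        return self.decoder(feats * mask).squeeze(1)

if __name__ == "__main__":
    model = EEGConditionedExtractor()
    mixture = torch.randn(2, 16000)                         # 1 s mixture at 16 kHz
    eeg = torch.randn(2, 128, 64)                           # 128 EEG frames, 64 channels
    print(model(mixture, eeg).shape)
```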
Citations: 0
Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463503
Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee
Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. Given the unmet demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to the speaker's personal identity.
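The phone-level detection can be pictured as contrasting a test consonant segment's embedding with reference realizations of the same phone and aggregating the scores per speaker. The sketch below uses cosine similarity and a simple mean as stand-ins for the learned similarity and aggregation described in the paper; the embeddings are random placeholders.

```python
import numpy as np

def phone_error_score(test_emb: np.ndarray, ref_embs: np.ndarray) -> float:
    """Contrast a test consonant-segment embedding against reference (typical)
    realizations of the target phone: a low maximum cosine similarity suggests a
    phonological error. Embeddings are assumed to come from a trained DNN."""
    test = test_emb / np.linalg.norm(test_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    return 1.0 - float(np.max(refs @ test))          # higher = more error-like

def speaker_level_score(phone_scores: list) -> float:
    """Aggregate phone-level scores into one speaker-level SSD indicator;
    a simple mean stands in for the aggregation used in the paper."""
    return float(np.mean(phone_scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((20, 256))            # 20 reference segment embeddings
    scores = [phone_error_score(rng.standard_normal(256), refs) for _ in range(30)]
    print("speaker-level score:", speaker_level_score(scores))
```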
Citations: 0