Computer Speech and Language: latest articles (page 9)

Complementary regional energy features for spoofed speech detection

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-12-16 DOI: 10.1016/j.csl.2023.101602

Gökay Dişken

Automatic speaker verification systems are vulnerable to spoofing attacks such as voice conversion, text-to-speech, and replayed speech. Because the security of biometric systems is vital, many countermeasures have been developed for spoofed speech detection. To keep pace with recent developments in speech synthesis, publicly available datasets have become increasingly challenging (e.g., the ASVspoof 2019 and 2021 datasets). These datasets also cover a variety of replay attack configurations, since replay attacks require no expertise and are therefore easy to perform. This work utilizes regional energy features, which are shown experimentally to be more effective than traditional frame-based energy features. The proposed energy features are independent of the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the regional energy features’ contribution to performance. First, a light convolutional neural network and long short-term memory (LCNN-LSTM) model with linear frequency cepstral coefficients is used to determine the optimal number of regional energy features. Then, an SE-Res2Net model with log power spectrogram features is used, which achieves results comparable to the state of the art for the ASVspoof 2019 logical access condition. The physical access condition from the ASVspoof 2019 dataset and the logical access and deepfake conditions from the ASVspoof 2021 dataset are also used in the experiments. The regional energy features yield improvements for all conditions with almost no additional computational or memory load (less than a 1% increase in model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as (i) capturing nonspeech segments and (ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.
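As a rough illustration of features extracted over nonoverlapping time-frequency regions, the sketch below splits a magnitude spectrogram into a fixed grid and returns one log-energy per region, so the output length does not depend on the number of frames. The function name `regional_energy`, the 2 x 2 grid, and the log compression are assumptions made for this example; the abstract does not state the paper's actual configuration.

```python
import math

def regional_energy(mag, n_time=2, n_freq=2, eps=1e-10):
    """Return n_time * n_freq log-energies from a magnitude spectrogram.

    mag is a list of frames, each a list of magnitudes per frequency bin.
    Grid size and log compression are illustrative assumptions.
    """
    n_frames, n_bins = len(mag), len(mag[0])
    if n_frames < n_time or n_bins < n_freq:
        raise ValueError("spectrogram is smaller than the region grid")
    feats = []
    for ti in range(n_time):
        # Integer boundaries give nonoverlapping regions that cover all frames.
        t0 = ti * n_frames // n_time
        t1 = (ti + 1) * n_frames // n_time
        for fi in range(n_freq):
            f0 = fi * n_bins // n_freq
            f1 = (fi + 1) * n_bins // n_freq
            cells = [mag[t][f] ** 2 for t in range(t0, t1) for f in range(f0, f1)]
            feats.append(math.log(sum(cells) / len(cells) + eps))
    return feats
```

Because the grid is fixed, a 10-frame and a 57-frame utterance both produce four values here, which is one simple way to obtain length-independent features.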


Citations: 0

Rep-MCA-former: An efficient multi-scale convolution attention encoder for text-independent speaker verification

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-12-10 DOI: 10.1016/j.csl.2023.101600

Xiaohu Liu, Defu Chen, Xianbao Wang, Sheng Xiang, Xuwen Zhou

In many speaker verification tasks, the quality of the speaker embedding is an important factor affecting system performance. Advanced speaker embedding extraction networks aim to capture richer speaker features through a multi-branch network architecture. Recently, speaker verification systems based on transformer encoders have received much attention, and many satisfactory results have been achieved because transformer encoders can efficiently extract the global features of the speaker (e.g., MFA-Conformer). However, a large number of model parameters and high computational latency are common problems of these approaches, which makes them difficult to deploy on resource-constrained edge terminals. To address this issue, this paper proposes an effective, lightweight transformer model (MCA-former) with multi-scale convolutional self-attention (MCA), which can perform multi-scale modeling and channel modeling in the temporal direction of the input at low computational cost. In addition, for the inference phase of the model, we develop a systematic re-parameterization method that converts the multi-branch network structure into a single-path topology, effectively improving inference speed. We investigate the performance of the MCA-former for speaker verification on the VoxCeleb1 test set. The results show that the MCA-based transformer model is more advantageous in terms of the number of parameters and inference efficiency. Applying the re-parameterization increases the inference speed of the model by about 30% and significantly reduces memory consumption.
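The idea of converting a multi-branch structure into a single path can be illustrated with parallel 1-D convolutions: because convolution is linear, branches with different odd kernel sizes can be folded into one kernel by centre-padding and summing. This is a minimal sketch of that general principle only; the helper names are hypothetical, and channels, normalization layers, and the paper's actual re-parameterization procedure are not modelled.

```python
def conv1d_same(x, kernel):
    """Cross-correlate x with an odd-length kernel using zero padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def merge_branches(kernels):
    """Fold parallel odd-length kernels into one by centre-padding and summing."""
    size = max(len(k) for k in kernels)
    merged = [0.0] * size
    for k in kernels:
        offset = (size - len(k)) // 2
        for i, w in enumerate(k):
            merged[offset + i] += w
    return merged

# Three training-time branches: kernel sizes 5 and 3, plus an identity path.
branches = [[0.2, -0.5, 0.1, 0.7, -0.3], [0.4, 0.1, -0.2], [1.0]]
x = [1.0, 2.0, -1.0, 0.5, 3.0]
multi = [sum(vals) for vals in zip(*(conv1d_same(x, k) for k in branches))]
single = conv1d_same(x, merge_branches(branches))
```

In this toy case `single` matches `multi` up to floating-point rounding, while needing one convolution instead of three at inference time.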


Citations: 0

New research on monaural speech segregation based on quality assessment

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-12-05 DOI: 10.1016/j.csl.2023.101601

Xiaoping Xie, Can Li, Dan Tian, Rufeng Shen, Fei Ding

Speech enhancement (SE) is a pivotal technology for improving the quality and intelligibility of speech signals. Nevertheless, when speech signals are processed at a high signal-to-noise ratio (SNR), conventional SE techniques may inadvertently reduce the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores. This article incorporates the Non-Intrusive Speech Quality Assessment (NISQA) algorithm into SE systems. By comparing pre- and post-enhancement speech quality scores, it determines whether the speech signal under consideration warrants enhancement processing, thereby mitigating potential deterioration in PESQ and STOI. Furthermore, this study examines the effects of five prevalent speech features, namely Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Relative Spectral Transformed Perceptual Linear Prediction coefficients (RASTA-PLP), Amplitude Modulation Spectrogram (AMS), and Multi-Resolution Cochleagram (MRCG), on PESQ and STOI under varying noise conditions. The experimental results show that MRCG is consistently the best and most stable feature for STOI, while the feature yielding the highest PESQ score depends in a complex way on the background noise type, the SNR level, and how well the noise matches the speech signal. Consequently, we propose an SE method based on quality assessment and feature selection that adaptively selects the best features for each background noise scenario, thereby consistently maintaining the strongest enhancement effect with regard to PESQ.
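The gating step described above, where enhancement is kept only if a non-intrusive quality score improves, can be sketched as follows. Both `enhance` and `score` are hypothetical callables supplied by the caller; `score` merely stands in for a predictor such as NISQA, and no real NISQA interface is assumed.

```python
def gated_enhance(noisy, enhance, score):
    """Return (signal, used_enhancement).

    The enhanced signal is kept only when the quality predictor rates it
    strictly higher than the unprocessed input.
    """
    enhanced = enhance(noisy)
    if score(enhanced) > score(noisy):
        return enhanced, True
    return noisy, False

# Toy stand-ins for illustration only: lower peak amplitude counts as "better".
toy_score = lambda s: -max(abs(v) for v in s)
helpful = lambda s: [0.5 * v for v in s]
harmful = lambda s: [2.0 * v for v in s]

kept, used = gated_enhance([1.0, -2.0], helpful, toy_score)
skipped, used_bad = gated_enhance([1.0, -2.0], harmful, toy_score)
```

With the toy scorer, the helpful enhancer is accepted and the harmful one is bypassed, which mirrors how the gate is meant to protect high-SNR inputs from degradation.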


Citations: 0

Integrating frame-level boundary detection and deepfake detection for locating manipulated regions in partially spoofed audio forgery attacks

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-12-05 DOI: 10.1016/j.csl.2023.101597

Zexin Cai, Ming Li

Partially fake audio, a variant of deepfake that manipulates audio utterances by incorporating fake or externally sourced bona fide audio clips, constitutes a growing threat as an audio forgery attack affecting both human and artificial intelligence applications. Researchers have recently developed valuable databases to aid in the development of effective countermeasures against such attacks. While existing countermeasures mainly focus on identifying partially fake audio at the level of entire utterances or segments, this paper introduces a paradigm shift by proposing frame-level systems. These systems are designed to detect manipulated utterances and pinpoint the specific regions within partially fake audio where the manipulation occurs. Our approach leverages acoustic features extracted from large-scale self-supervised pre-training models and delivers promising results on diverse, publicly accessible databases. Additionally, we study the integration of boundary and deepfake detection systems, exploring their potential synergies and shortcomings. We achieve state-of-the-art performance on the test dataset of Track 2 of the ADD 2022 challenge with an equal error rate of 4.4%. Furthermore, our methods perform strongly in locating manipulated regions in Track 2 of the ADD 2023 challenge, resulting in a final ADD score of 0.6713 and securing the top position.
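To make the step from frame-level scores to located regions concrete, the sketch below thresholds per-frame fake probabilities and merges consecutive flagged frames into time intervals. The function name, the 0.5 threshold, and the 20 ms frame shift are assumptions for illustration; the abstract does not describe the post-processing the authors actually use.

```python
def locate_regions(frame_scores, threshold=0.5, frame_shift=0.02):
    """Convert per-frame fake probabilities into (start_s, end_s) regions."""
    regions, start = [], None
    for i, p in enumerate(frame_scores):
        if p >= threshold and start is None:
            start = i  # a manipulated run begins at this frame
        elif p < threshold and start is not None:
            regions.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        # The utterance ends while still inside a manipulated run.
        regions.append((start * frame_shift, len(frame_scores) * frame_shift))
    return regions
```

For example, with a one-second frame shift the scores `[0.1, 0.9, 0.8, 0.2, 0.7]` give two regions, one covering frames 1 to 2 and one covering the final frame.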


Citations: 0

A knowledge-augmented heterogeneous graph convolutional network for aspect-level multimodal sentiment analysis

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-11-23 DOI: 10.1016/j.csl.2023.101587

Yujie Wan, Yuzhong Chen, Jiali Lin, Jiayuan Zhong, Chen Dong

Aspect-level multimodal sentiment analysis has become a new challenge in the field of sentiment analysis. Although there has been significant progress on this task with image–text data, existing works do not fully handle implicit sentiment expression in the data. In addition, they do not fully exploit the important information in external knowledge and image tags. To address these problems, we propose a knowledge-augmented heterogeneous graph convolutional network (KAHGCN). First, we propose a dynamic knowledge selection algorithm to select the most relevant external knowledge, thereby enhancing KAHGCN’s ability to understand the implicit sentiment expression in review texts. Second, we propose a graph construction strategy to construct a heterogeneous graph that aggregates review text, image tags and external knowledge. Third, we propose a multilayer heterogeneous graph convolutional network to strengthen the interaction between information from external knowledge, review texts and image tags. Experimental results on two datasets demonstrate the effectiveness of KAHGCN.
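As background for how a graph convolution lets text, image-tag, and knowledge nodes exchange information, here is a minimal sketch of one generic layer that averages each node's features with those of its neighbours and applies a linear map followed by ReLU. It is a textbook-style mean-aggregation layer under assumed inputs, not the KAHGCN architecture, whose node types and edge construction the abstract does not detail.

```python
def gcn_layer(adj, feats, weight):
    """One mean-aggregation graph convolution step with self-loops and ReLU.

    adj:    n x n adjacency matrix (0/1 entries)
    feats:  n x d node feature matrix
    weight: d x o projection matrix
    """
    n, dim, out_dim = len(adj), len(feats[0]), len(weight[0])
    out = []
    for i in range(n):
        # Each node aggregates over itself and its neighbours.
        neigh = [j for j in range(n) if adj[i][j] or j == i]
        agg = [sum(feats[j][d] for j in neigh) / len(neigh) for d in range(dim)]
        row = [max(0.0, sum(agg[d] * weight[d][o] for d in range(dim)))
               for o in range(out_dim)]
        out.append(row)
    return out

# Toy graph: node 0 (review text) linked to node 1 (image tag) and node 2 (knowledge).
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adj, feats, identity)
```

After one layer the text node's representation already mixes in the tag and knowledge features; stacking layers propagates information over longer paths.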


Citations: 0

A semi-supervised high-quality pseudo labels algorithm based on multi-constraint optimization for speech deception detection

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-11-22 DOI: 10.1016/j.csl.2023.101586

Huawei Tao, Hang Yu, Man Liu, Hongliang Fu, Chunhua Zhu, Yue Xie

Deep learning-based speech deception detection research relies on a large amount of labeled data. However, in the process of collecting speech deception detection data, the identification of truth and lies requires researchers to have a professional knowledge reserve, which greatly limits the number of annotated samples. Improving the accuracy of lie detection with insufficient annotation data is the focus of this study at this stage. In this paper, we propose a semi-supervised high-quality pseudo-label algorithm based on multi-constraint optimization (HQPL-MC) for speech deception detection. Firstly, the algorithm exploits the potential feature information of unlabeled data by using deep auto-encoder networks; secondly, it achieves entropy minimization with the help of the pseudo labeling technique to reduce the class overlap distribution of truth and deception data; finally, it improves the quality of pseudo labels by optimizing the unlabeled loss and reconstruction loss to further enhance the classification performance of the model when the labeled data is insufficient. We recorded an interview-style corpus by ourselves and used it in this paper for the experimental demonstration of the algorithm together with the Columbia/SRI/Colorado(CSC) corpus. The detection performance of the proposed algorithm is better than most state-of-the-art algorithms.
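The pseudo-labeling and entropy-minimization ideas mentioned above can be sketched in a few lines: prediction entropy measures how ambiguous a truth/deception decision is, and only confident predictions on unlabeled data are promoted to pseudo labels. The confidence threshold of 0.9 and the function names are assumptions for this example; the constraints actually used by HQPL-MC are not given in the abstract.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, min_confidence=0.9):
    """Return (index, label) pairs for confidently predicted unlabeled samples."""
    selected = []
    for idx, probs in enumerate(predictions):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= min_confidence:
            selected.append((idx, label))
    return selected
```

A prediction such as `[0.6, 0.4]` has high entropy and is left unlabeled, whereas `[0.95, 0.05]` is accepted as class 0; training to lower the entropy on unlabeled data pushes the two classes apart.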


Citations: 0

Representation learning strategies to model pathological speech: Effect of multiple spectral resolutions

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-11-15 DOI: 10.1016/j.csl.2023.101584

Gabriel Figueiredo Miller, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Elmar Nöth

This paper considers a representation learning strategy to model speech signals from patients with Parkinson’s disease, with the goal of predicting the presence of the disease, and evaluating the level of degradation of a patient’s speech. In particular, we propose a novel fusion strategy that combines wideband and narrowband spectral resolutions using a representation learning strategy based on autoencoders, called the multi-spectral autoencoder. The proposed model is able to classify the speech from Parkinson’s disease patients with accuracy up to 97%. The proposed model is also able to assess the dysarthria severity of Parkinson’s disease patients with a Spearman correlation up to 0.79. These results outperform those observed in literature where the same problem was addressed with the same corpus.
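The severity result above is reported as a Spearman correlation, which compares the rank order of predicted and reference scores rather than their raw values. A minimal standard-library sketch follows; the function names are chosen for this example, tied values receive average ranks, and a constant input would cause a division by zero.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Any monotonically increasing relationship between predictions and clinical ratings gives a value of 1, so the metric rewards getting the ordering of patients right even if the scale differs.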


Citations: 0

Though this be hesitant, yet there is method in ’t: Effects of disfluency patterns in neural speech synthesis for cultural heritage presentations

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-11-11 DOI: 10.1016/j.csl.2023.101585

Loredana Schettino, Antonio Origlia, Francesco Cutugno

This study presents the results of two perception experiments aimed at evaluating the effect that specific patterns of disfluencies have on people listening to synthetic speech. We consider the particular case of Cultural Heritage presentations and propose a linguistic model to support the positioning of disfluencies throughout the utterances in the Italian language. A state-of-the-art speech synthesizer, based on Deep Neural Networks, is used to prepare a set of experimental stimuli and two different experiments are presented to provide both subjective evaluations and behavioural assessments from human subjects. Results show that synthetic utterances including disfluencies, predicted by a linguistic model, are identified as more natural and that the presence of disfluencies benefits the listeners’ recall of the provided information.


Citations: 0

Dual Knowledge Distillation for neural machine translation

IF 4.3, Zone 3 (Computer Science), Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-11-09 DOI: 10.1016/j.csl.2023.101583

Yuxian Wan, Wenlin Zhang, Zhen Li, Hao Zhang, Yanxia Li

Existing knowledge distillation methods use large amounts of bilingual data and focus on mining the corresponding knowledge distribution between the source language and the target language. However, for some languages, bilingual data is not abundant. In this paper, to make better use of both monolingual and limited bilingual data, we propose a new knowledge distillation method called Dual Knowledge Distillation (DKD). For monolingual data, we use a self-distillation strategy that combines self-training and knowledge distillation so that the encoder extracts more consistent monolingual representations. For bilingual data, on top of the k Nearest Neighbor Knowledge Distillation (kNN-KD) method, a similar self-distillation strategy is adopted as a consistency regularization method to force the decoder to produce consistent output. Experiments on standard datasets, multi-domain translation datasets, and low-resource datasets show that DKD achieves consistent improvements over state-of-the-art baselines including kNN-KD.
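For readers unfamiliar with the distillation objective that such methods build on, the sketch below computes a generic temperature-softened KL divergence between a teacher and a student output distribution. This is the common textbook formulation with an assumed temperature of 2.0; it does not reproduce DKD's self-distillation terms or the kNN-KD retrieval step.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in a consistency-regularization setting the "teacher" can be another forward pass of the same model.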

Citations: 0

Speaking to remember: Model-based adaptive vocabulary learning using automatic speech recognition

IF 4.3 Zone 3 Computer Science Q1 Mathematics

Computer Speech and Language

Pub Date : 2023-10-31 DOI: 10.1016/j.csl.2023.101578

Thomas Wilschut , Florian Sense , Hedderik van Rijn

Memorizing vocabulary is a crucial aspect of learning a new language. While personalized learning systems or intelligent tutoring systems can assist learners in memorizing vocabulary, the majority of such systems are limited to typing-based learning and do not allow for speech practice. Here, we aim to compare the efficiency of typing- and speech-based vocabulary learning. Furthermore, we explore the possibilities of improving such speech-based learning using an adaptive algorithm based on a cognitive model of memory retrieval. We combined a response time-based algorithm for adaptive item scheduling, originally developed for typing-based learning, with automatic speech recognition technology and tested the system with 50 participants. We show that typing- and speech-based learning result in similar learning outcomes and that using a model-based, adaptive scheduling algorithm improves recall performance relative to traditional learning in both modalities, both immediately after learning and on follow-up tests. These results can inform the development of vocabulary learning applications that, unlike traditional systems, allow for speech-based input.
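The abstract does not describe the scheduling algorithm in detail. As a rough illustration of model-based item scheduling, the sketch below assumes a memory model in which each item's activation decays as a power function of the time since each earlier encounter, and the item with the lowest activation is rehearsed next. The function names, the fixed decay value, and the example vocabulary are all hypothetical. The response-time-based adaptation mentioned in the abstract is not modeled here.

```python
import math

def activation(now, encounter_times, decay=0.5):
    """Log of summed power-law decayed traces of past encounters.

    Assumed model: each earlier encounter at time t contributes
    (now - t) ** -decay. Items never seen before get -inf.
    """
    ages = [now - t for t in encounter_times if t < now]
    if not ages:
        return float("-inf")
    return math.log(sum(age ** -decay for age in ages))

def next_item(now, history, decay=0.5):
    """Pick the item whose estimated activation is currently lowest."""
    return min(history, key=lambda item: activation(now, history[item], decay))

# Hypothetical study history: item -> times (in seconds) it was presented.
history = {"casa": [0.0, 50.0], "perro": [10.0]}
print(next_item(60.0, history))
```

In an adaptive system the decay would usually be estimated per learner and per item from observed responses, so a fixed `decay` is a simplification made only to keep the sketch short.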

Citations: 0