Pub Date : 2024-01-18 | DOI: 10.1016/j.csl.2024.101619
Chang Zeng , Xiaoxiao Miao , Xin Wang , Erica Cooper , Junichi Yamagishi
Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as a time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of these methods have not fully investigated how to model the intra-relationship between multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification especially designed for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model comes equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is utilized to balance the importance of positive and negative samples within a mini-batch and to focus on the difficult samples during the training process. We also utilize several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIR datasets and a unique speaker embedding-level mixup strategy for better optimization.
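The focal loss mentioned here is the standard binary form of Lin et al. (2017); the following is a minimal sketch for intuition only, not the authors' implementation, and the `alpha`/`gamma` values are assumed defaults:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class; y: label in {0, 1}.
    alpha balances positive/negative samples; gamma down-weights easy
    samples so training focuses on the difficult ones.
    """
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    at = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -at * (1.0 - pt) ** gamma * math.log(pt)

# A confidently correct positive (p=0.9) contributes far less loss
# than a hard positive (p=0.3), which is the point of the gamma term.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.3, 1)
```

With `gamma=0` the expression reduces to alpha-weighted cross-entropy, which is a quick sanity check on the formula.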
Title: Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances (Computer Speech and Language, IF 4.3, open access)
Pub Date : 2024-01-13 | DOI: 10.1016/j.csl.2024.101618
Yihao Li , Meng Sun , Xiongwei Zhang , Hugo Van hamme
A key step in single-channel speech enhancement is the orthogonal separation of speech and noise. In this paper, a dual-branch complex convolutional recurrent network (DBCCRN) is proposed to separate the complex spectrograms of speech and noise simultaneously. To model both local and global information, we incorporate conformer modules into our network. The orthogonality of the outputs of the two branches can be improved by optimizing Signal-to-Noise Ratio (SNR)-related losses. However, we found that models trained with two existing versions of the SI-SNR loss yield enhanced speech at a very different scale from that of its clean counterpart, and the SNR loss likewise shrinks the amplitude of the enhanced speech. A simple solution is to normalize the output, but this works only for off-line processing, not for streaming: when streaming speech enhancement is required, the scale error degrades speech quality. From an analytical inspection of the weaknesses of models trained with SNR and SI-SNR losses, a new loss function called scale-aware SNR (SA-SNR) is proposed to cope with the scale variations of the enhanced speech. SA-SNR improves over SI-SNR by introducing an extra regularization term that encourages the model to produce signals of a scale similar to the input, which has little influence on the perceptual quality of the enhanced speech. In addition, the commonly used evaluation recipe for speech enhancement may not sufficiently reflect the performance of methods trained with SI-SNR losses, since amplitude variations of the input speech should be carefully considered; a new evaluation recipe called ScaleError is therefore introduced. Experiments show that our proposed method improves over existing baselines on the evaluation sets of the voice bank corpus, DEMAND, and the Interspeech 2020 Deep Noise Suppression Challenge, obtaining higher scores for PESQ, STOI, SSNR, CSIG, CBAK and COVL.
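The scale insensitivity that SA-SNR is designed to counteract can be seen directly in the standard SI-SNR definition, which projects the estimate onto the reference before scoring. This is an illustrative sketch of that standard metric, not the paper's code; the `eps` guard is an assumption:

```python
import math

def si_snr_db(est, ref, eps=1e-12):
    """Scale-invariant SNR in dB. Because `est` is projected onto `ref`,
    rescaling `est` leaves the score unchanged -- the property that lets
    SI-SNR-trained models drift in output scale."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]   # projection of est onto ref
    noise = [e - t for e, t in zip(est, target)]
    t_pow = sum(t * t for t in target)
    n_pow = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(t_pow / n_pow + eps)

ref = [0.1, -0.4, 0.3, 0.2]
est = [0.1, -0.38, 0.33, 0.18]
scaled = [0.01 * x for x in est]   # 40 dB quieter, yet same SI-SNR
```

Since `est` and `scaled` receive essentially identical SI-SNR scores despite a large amplitude mismatch, a loss built on SI-SNR alone cannot penalize the mismatch; SA-SNR's extra regularization term targets exactly this gap.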
Title: Scale-aware dual-branch complex convolutional recurrent network for monaural speech enhancement
Pub Date : 2024-01-08 | DOI: 10.1016/j.csl.2023.101603
Francesca Alloatti , Francesca Grasso , Roger Ferrod , Giovanni Siragusa , Luigi Di Caro , Federica Cena
Mutual comprehension is a crucial component of a successful conversation. While it is easily reached through the cooperation of the parties in human–human dialogues, such cooperation is often lacking in human–computer interaction due to technical problems, leading to broken conversations. Our goal is to work towards the effective detection of breakdowns in conversations between humans and Conversational Agents (CAs), as well as of the different repair strategies users adopt when such communication problems occur. In this work, we propose a novel tag system designed to map and classify users’ repair attempts while interacting with a CA. We subsequently present a set of machine learning models trained to automate the detection of such repair strategies. The tags are employed in a manual annotation exercise performed on a publicly available dataset of text-based task-oriented conversations. The batch of annotated data was then used to train the neural network-based classifiers. The analysis of the annotations provides interesting insights into users’ behaviour when dealing with breakdowns in a task-oriented dialogue system. The encouraging results obtained from the neural models confirm the possibility of automatically recognizing occurrences of misunderstanding between users and CAs on the fly.
Title: A tag-based methodology for the detection of user repair strategies in task-oriented conversational agents
Pub Date : 2024-01-03 | DOI: 10.1016/j.csl.2023.101617
Asma Mekki, Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith
Over the last two decades, many efforts have been made to provide resources to support Arabic Natural Language Processing (NLP). Some of these resources target specific NLP tasks such as word tokenization, parsing, or sentiment analysis, while others attempt to tackle numerous tasks at once. In this paper, we present TTK, a toolkit for Tunisian linguistic analysis. It consists of a collection of linguistic analysis tools for orthographic normalization, sentence boundary detection, word tokenization, morphological analysis, parsing, and named entity recognition. This paper focuses on the design and implementation of the TTK tools.
Title: TTK: A toolkit for Tunisian linguistic analysis
Pub Date : 2023-12-28 | DOI: 10.1016/j.csl.2023.101616
Pengfei Chen , Biqing Zeng , Yuwu Lu , Yun Xue , Fei Fan , Mayi Xu , Lingcong Feng
Aspect-level sentiment analysis (ALSA) aims to extract the polarity of different aspect terms in a sentence. Previous works leveraging traditional dependency syntax parsing trees (DSPT) to encode contextual syntactic information obtained state-of-the-art results. However, these works may not learn fine-grained syntactic knowledge efficiently, making it difficult for them to take advantage of local context, and they fail to sufficiently exploit the dependency relations in the DSPT. To solve these problems, we propose a novel method, named LCSA, that enhances local knowledge using two extensions: a Local Context Network based on Proximity Values (LCPV) and Syntax-clusters Attention (SCA). LCPV first obtains induced trees from pre-trained models and generates syntactic proximity values between each context word and the aspect to adaptively determine the extent of the local context. Our improved SCA further extracts fine-grained knowledge: it not only focuses on the clusters essential for the target aspect term but also guides the model to learn the essential words inside each cluster of the DSPT. Extensive experimental results on multiple benchmark datasets demonstrate that LCSA is highly robust and achieves state-of-the-art performance for ALSA.
Title: Enhanced local knowledge with proximity values and syntax-clusters for aspect-level sentiment analysis
Pub Date : 2023-12-26 | DOI: 10.1016/j.csl.2023.101605
Vijay Ravi , Jinhan Wang , Jonathan Flint , Abeer Alwan
Speech signals are valuable biomarkers for assessing an individual’s mental health, including automatically identifying Major Depressive Disorder (MDD). A frequently used approach in this regard is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could serve as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods to achieve this: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using Cross-Entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, yielded improved F1 scores for MDD detection and voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (G_VD) and De-Identification Scores (DeID). On the DAIC-WOZ dataset (English), LECE using ComParE16 features yields the best F1-score of 80%, the audio-only SOTA for depression detection, along with a G_VD of −1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV on the raw audio signal achieves an F1-score of 72.38%, surpassing the multi-modal SOTA, along with a G_VD of −0.89 dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.
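As a rough illustration of the ADV variant, the shared encoder's objective subtracts the speaker-ID loss, so embeddings that remain identity-discriminative are penalized. This toy sketch only conveys the sign of that trade-off; the weight `lam` is an assumption, not a value from the paper:

```python
def adv_objective(dep_loss, sid_loss, lam=0.5):
    """ADV-style (SID-loss maximization) objective, sketched: minimize the
    depression-detection loss while *maximizing* the speaker-ID loss.
    `lam` is an assumed privacy/accuracy trade-off weight."""
    return dep_loss - lam * sid_loss

# Of two encoder states with equal depression loss, the objective prefers
# the one whose embeddings are *worse* for speaker identification.
private = adv_objective(dep_loss=0.40, sid_loss=2.0)  # SID branch confused
leaky = adv_objective(dep_loss=0.40, sid_loss=0.5)    # SID branch accurate
```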
Title: Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement (open access)
Pub Date : 2023-12-23 | DOI: 10.1016/j.csl.2023.101606
Long Dai, Jiarong Mao, Liaoran Xu, Xuefeng Fan, Xiaoyi Zhou
The popularity of ChatGPT demonstrates the immense commercial value of natural language processing (NLP) technology. However, NLP models like ChatGPT are vulnerable to piracy and redistribution, which can harm the economic interests of model owners. Existing NLP model watermarking schemes struggle to balance robustness and covertness: robust watermarks typically require embedding more information, which compromises their covertness, while covert watermarks can embed only limited information, which limits their robustness. This paper proposes using multi-task learning (MTL) to resolve this conflict. Specifically, a covert trigger set is established to enable remote verification of the watermarked model, and a covert auxiliary network is designed to enhance the watermarked model’s robustness. The proposed watermarking framework is evaluated on two benchmark datasets and three mainstream NLP models. Compared with existing schemes, the framework not only offers excellent covertness and robustness but also has a lower false positive rate and can effectively resist fraudulent ownership claims by adversaries.
Title: SecNLP: An NLP classification model watermarking framework based on multi-task learning
Pub Date : 2023-12-20 | DOI: 10.1016/j.csl.2023.101604
Asalah Thiab , Luay Alawneh , Mohammad AL-Smadi
Emotion detection from online textual information is gaining more attention due to its usefulness in understanding users’ behaviors and their desires. This is driven by the large amounts of texts from different sources such as social media and shopping websites. Recent studies investigated the benefits of deep learning in the detection of emotions from textual conversations. In this paper, we study the performance of several deep learning and transformer-based models in the classification of emotions in English conversations. Further, we apply ensemble learning using a majority voting technique to improve the overall classification performance. We evaluated our proposed models on the SemEval 2019 Task 3 public dataset that categorizes emotions as Happy, Angry, Sad, and Others. The results show that our models can successfully distinguish the three main classes of emotions and separate them from Others in a highly imbalanced dataset. The transformer-based models achieved a micro-averaged F1-score of up to 75.55%, whereas the RNN-based models only reached 67.03%. Further, we show that the ensemble model significantly improves the overall performance and achieves a micro-averaged F1-score of 77.07%.
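The majority-voting step of the ensemble can be sketched as follows; the label set matches the paper's four classes, while the deterministic tie-breaking rule is an assumption (the abstract does not specify one):

```python
from collections import Counter

LABELS = ["happy", "angry", "sad", "others"]

def majority_vote(predictions):
    """Hard majority vote over the labels predicted by each ensemble
    member. Ties are broken by LABELS order (an assumed convention)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in LABELS:                 # deterministic tie-break
        if counts.get(label, 0) == best:
            return label

# Three classifiers disagree; the majority label wins.
vote = majority_vote(["sad", "sad", "angry"])
```

Combining transformer-based and RNN-based members this way is what lets the ensemble exceed the best single model's micro-averaged F1.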
Title: Contextual emotion detection using ensemble deep learning
Pub Date : 2023-12-18DOI: 10.1016/j.csl.2023.101599
Souvik Sinha, Spandan Dey, Goutam Saha
The use of voice recognition systems has grown considerably with advances in technology. This has allowed adversaries to falsely claim access to these systems by spoofing the identity of a target speaker. Existing supervised learning (SL)-based countermeasures are yet to provide a complete solution against newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. First, we implement widely used SSL frameworks, where our target is identifying spoofed speech. We report a considerable performance improvement over the SL state-of-the-art baseline as a whole. Then, we perform an attack-wise comparative analysis between the SL and SSL frameworks. While SSL performs better in most cases, there are certain attacks where SL outperforms it. Hence, we hypothesize that the complementary information captured by the two models can be jointly exploited for better performance. To do so, we first perform conventional weighted score fusion between the SL and best-performing SSL models, which reduces the EER below that of both the state-of-the-art SL model and the best-performing SSL framework. Then, we propose an embedding fusion scheme that minimizes the distance between the distributions of the selected SL and SSL embeddings. To select the appropriate layers, we perform a comprehensive statistical analysis. The proposed fusion scheme outperforms the score fusion method and shows that SSL performance can be improved by effectively incorporating knowledge learned by the SL framework. The final EER achieved on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. Using ASVspoof 2021 LA as a blind evaluation set, our proposed embedding fusion scheme reduces the EER to 2.666%.
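The weighted score fusion step can be sketched as below. The detection scores, the fusion weight, and the simple threshold-sweep EER are illustrative stand-ins; the paper tunes its own weight on a development set and uses the ASVspoof evaluation protocol:

```python
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate via a threshold sweep: return the midpoint of the
    false-acceptance rate (spoof accepted) and false-rejection rate
    (bona fide rejected) where they are closest. labels: 1=bona fide, 0=spoof."""
    candidates = []
    for t in np.sort(scores):
        far = float(np.mean(scores[labels == 0] >= t))
        frr = float(np.mean(scores[labels == 1] < t))
        candidates.append((abs(far - frr), (far + frr) / 2.0))
    return min(candidates)[1]

def fuse_scores(sl: np.ndarray, ssl: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Convex weighted combination of the two countermeasure scores."""
    return w * sl + (1.0 - w) * ssl

labels = np.array([1, 1, 1, 0, 0, 0])
sl  = np.array([0.9, 0.4, 0.7, 0.6, 0.2, 0.1])  # SL confuses one pair of trials
ssl = np.array([0.8, 0.9, 0.3, 0.2, 0.4, 0.1])  # SSL confuses a different pair
fused = fuse_scores(sl, ssl, w=0.5)
# Each individual system has EER 1/3 on this toy set; because their errors
# fall on different trials, the fused scores separate the classes perfectly.
print(eer(sl, labels), eer(ssl, labels), eer(fused, labels))
```

The toy data is constructed so the two systems err on complementary trials, which is exactly the situation the attack-wise analysis in the abstract motivates.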
{"title":"Improving self-supervised learning model for audio spoofing detection with layer-conditioned embedding fusion","authors":"Souvik Sinha, Spandan Dey, Goutam Saha","doi":"10.1016/j.csl.2023.101599","DOIUrl":"10.1016/j.csl.2023.101599","url":null,"abstract":"<div><p>The use of voice recognition<span><span> systems has grown considerably with advances in technology. This has allowed adversaries to falsely claim access to these systems by spoofing the identity of a target speaker. Existing supervised learning (SL)-based countermeasures<span> are yet to provide a complete solution against newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. First, we implement widely used SSL frameworks, where our target is identifying spoofed speech. We report a considerable performance improvement over the SL state-of-the-art baseline as a whole. Then, we perform an attack-wise comparative analysis between the SL and SSL frameworks. While SSL performs better in most cases, there are certain attacks where SL outperforms it. Hence, we hypothesize that the complementary information captured by the two models can be jointly exploited for better performance. To do so, we first perform conventional weighted score fusion between the SL and best-performing SSL models, which reduces the </span></span>EER below that of both the state-of-the-art SL model and the best-performing SSL framework. Then, we propose an embedding fusion scheme that minimizes the distance between the distributions of the selected SL and SSL embeddings. To select the appropriate layers, we perform a comprehensive statistical analysis. The proposed fusion scheme outperforms the score fusion method and shows that SSL performance can be improved by effectively incorporating knowledge learned by the SL framework. 
The final EER achieved on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. Using the ASVspoof 2021 LA as a blind evaluation dataset, our proposed embedding fusion scheme reduces the EER to 2.666%.</span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138746029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
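One simple way to realize "minimizing the distance between the SL and SSL embedding distributions" is a mean-squared alignment loss on projected embeddings. Everything below (dimensions, the linear projection, the learning rate) is an illustrative assumption rather than the authors' layer-conditioned scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batches of embeddings from the selected SL and SSL layers
# (names and dimensions are illustrative, not from the paper).
sl_emb  = rng.standard_normal((16, 64))        # SL representations
ssl_emb = rng.standard_normal((16, 128))       # SSL representations
proj    = rng.standard_normal((128, 64)) * 0.1 # learnable projection

def alignment_loss(ssl_batch: np.ndarray, sl_batch: np.ndarray,
                   W: np.ndarray) -> float:
    """MSE between projected SSL embeddings and SL embeddings: a simple
    proxy for pulling the two embedding distributions together."""
    return float(np.mean((ssl_batch @ W - sl_batch) ** 2))

loss = alignment_loss(ssl_emb, sl_emb, proj)
# One plain gradient step on the projection reduces the alignment loss.
grad_W = 2.0 * ssl_emb.T @ (ssl_emb @ proj - sl_emb) / sl_emb.size
proj -= 0.01 * grad_W
print(loss > alignment_loss(ssl_emb, sl_emb, proj))  # True
```

In a full system this loss would be added to the spoofing-detection objective so the SSL branch absorbs knowledge from the SL branch during training, which is the intuition the abstract describes.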
Pub Date : 2023-12-16DOI: 10.1016/j.csl.2023.101598
Geoffroy Vanderreydt, Kris Demuynck
We propose a novel technique to estimate the channel characteristics for robust speech recognition. The method focuses on reliable time–frequency speech patches that are highly independent of the noise condition. Combined with a root-based approximation of the logarithm in the MFCC computation, this reduces the variance that noise induces on the spectral features, and therefore also the constraint on the acoustic model in a multi-style training setup. We show that, compared to standard mean normalization, the proposed method estimates the channel equally well under clean conditions and better under noisy conditions. When integrated into the feature extraction pipeline, it improves speech recognition accuracy on noisy speech without degrading accuracy on clean speech. Our experiments reveal that the method helps most for generative models, which need to model the complex noise variability, and less so for discriminative models, which can learn to ignore noise instead of accurately modeling it. Our approach outperforms the state of the art on the noisy Aurora4 task.
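The variance-reducing effect of a root-based approximation of the logarithm can be illustrated as follows. The exponent (1/15) and the toy mel-band energies are assumptions for illustration, not the values used in the paper:

```python
import numpy as np

def compress(mel_energies: np.ndarray, r: float = 15.0) -> np.ndarray:
    """Root-based stand-in for the logarithm in the MFCC pipeline:
    x**(1/r) approximates log's dynamic-range compression while staying
    bounded near zero energy, so additive noise perturbs the compressed
    features far less than it perturbs log features."""
    return mel_energies ** (1.0 / r)

# A small additive-noise floor shifts log features drastically at low
# energies, while the root-compressed features move only slightly:
clean = np.array([1e-6, 1e-2, 1.0])
noisy = np.array([1e-3, 2e-2, 1.1])
print(np.abs(np.log(noisy) - np.log(clean)))      # large shift at low energy
print(np.abs(compress(noisy) - compress(clean)))  # bounded, much smaller shift
```

This is why, per the abstract, the compressed features put a weaker constraint on the acoustic model in multi-style training: the noise-induced spread of the features is smaller to begin with.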
{"title":"A novel channel estimate for noise robust speech recognition","authors":"Geoffroy Vanderreydt, Kris Demuynck","doi":"10.1016/j.csl.2023.101598","DOIUrl":"10.1016/j.csl.2023.101598","url":null,"abstract":"<div><p>We propose a novel technique to estimate the channel characteristics for robust speech recognition<span>. The method focuses on reliable time–frequency speech patches that are highly independent of the noise condition. Combined with a root-based approximation<span> of the logarithm in the MFCC computation, this reduces the variance that noise induces on the spectral features<span>, and therefore also the constraint on the acoustic model in a multi-style training setup. We show that, compared to standard mean normalization, the proposed method estimates the channel equally well under clean conditions and better under noisy conditions. When integrated into the feature extraction pipeline, it improves speech recognition accuracy on noisy speech without degrading accuracy on clean speech. Our experiments reveal that the method helps most for generative models, which need to model the complex noise variability, and less so for discriminative models, which can learn to ignore noise instead of accurately modeling it. Our approach outperforms the state of the art on the noisy Aurora4 task.</span></span></span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138745942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}