A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments
Pub Date: 2024-06-06 | DOI: 10.1016/j.csl.2024.101677
Heming Wang , Ashutosh Pandey , DeLiang Wang
Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs, ARN (attentive recurrent network) and DC-CRN (densely-connected convolutional recurrent network), and systematically investigate how different design components, such as window size, loss function, and feature representation, affect enhancement performance. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that larger window sizes are helpful for dereverberation, and that adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with the proposed techniques achieve superior performance compared with other strong enhancement baselines.
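As a loose illustration of the transform finding above, the sketch below frames a waveform with a configurable window, encodes and decodes the frames with learned linear transforms around a stand-in enhancer, and reconstructs by overlap-add. This is not the authors' ARN or DC-CRN code; the window, hop, and feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDomainCodec(nn.Module):
    """Learned linear analysis/synthesis transforms around windowed frames."""
    def __init__(self, win=512, hop=256, feat=512):
        super().__init__()
        self.win, self.hop = win, hop
        self.encode = nn.Linear(win, feat)      # learned analysis transform
        self.decode = nn.Linear(feat, win)      # learned synthesis transform
        self.enhance = nn.Sequential(           # stand-in for ARN / DC-CRN
            nn.Linear(feat, feat), nn.ReLU(), nn.Linear(feat, feat))

    def forward(self, wav):                     # wav: (batch, samples)
        frames = wav.unfold(-1, self.win, self.hop)   # (B, T, win)
        frames = self.decode(self.enhance(self.encode(frames)))
        b, t, _ = frames.shape
        length = (t - 1) * self.hop + self.win
        # overlap-add (window-power normalization omitted for brevity)
        out = F.fold(frames.transpose(1, 2), output_size=(1, length),
                     kernel_size=(1, self.win), stride=(1, self.hop))
        return out.reshape(b, length)

model = TimeDomainCodec()
print(model(torch.randn(2, 16384)).shape)       # torch.Size([2, 16384])
```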
{"title":"A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments","authors":"Heming Wang , Ashutosh Pandey , DeLiang Wang","doi":"10.1016/j.csl.2024.101677","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101677","url":null,"abstract":"<div><p>Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most of the existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs: ARN (attentive recurrent network) and DC-CRN (densely-connected convolutional recurrent network), and systematically investigate the effects of different components on enhancement performance, such as window sizes, loss functions, and feature representations. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that incorporating larger window sizes is helpful for dereverberation, and adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations, and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with proposed techniques achieve superior performance compared with other strong enhancement baselines.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101677"},"PeriodicalIF":4.3,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000603/pdfft?md5=6f57ae0077f304562bdf74000559d71d&pid=1-s2.0-S0885230824000603-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141325435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPSA-DenseNet: A novel deep learning model for English accent classification
Pub Date: 2024-05-30 | DOI: 10.1016/j.csl.2024.101676
Tianyu Song , Linh Thi Hoai Nguyen , Ton Viet Ta
This paper presents three innovative deep learning models for English accent classification: Multi-task Pyramid Split Attention-DenseNet (MPSA-DenseNet), Pyramid Split Attention-DenseNet (PSA-DenseNet), and Multi-task DenseNet (Multi-DenseNet), which combine multi-task learning and/or the Pyramid Split Attention (PSA) module with DenseNet (densely connected convolutional networks). We applied these models to data collected from five dialects of English across native English-speaking regions (England, the United States) and non-native English-speaking regions (Hong Kong, Germany, India). Our experimental results show a significant improvement in classification accuracy, particularly with MPSA-DenseNet, which outperforms all other models, including the DenseNet and Efficient Pyramid Squeeze Attention (EPSA) models previously used for accent identification. Our findings indicate that MPSA-DenseNet is a highly promising model for accurately identifying English accents.
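For concreteness, here is a minimal PyTorch sketch of the pyramid split attention idea these models attach to DenseNet: channels are split into groups, each group gets a different kernel size, and squeeze-excitation style weights re-weight the groups. The group count, kernel sizes, and attention details are assumptions; the published PSA/EPSA module differs in its internals.

```python
import torch
import torch.nn as nn

class PSA(nn.Module):
    """Split channels into groups, extract multi-scale features per group,
    and re-weight the groups with channel attention."""
    def __init__(self, channels=64, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.g = channels // len(kernels)
        self.convs = nn.ModuleList(
            nn.Conv2d(self.g, self.g, k, padding=k // 2) for k in kernels)
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(self.g, self.g, 1), nn.Sigmoid())
            for _ in kernels)

    def forward(self, x):                       # x: (B, C, H, W)
        parts = x.split(self.g, dim=1)          # pyramid split
        feats = [conv(p) for conv, p in zip(self.convs, parts)]
        atts = torch.softmax(                   # compete across scales
            torch.stack([se(f) for se, f in zip(self.se, feats)], 0), dim=0)
        return torch.cat([a * f for a, f in zip(atts, feats)], dim=1)

psa = PSA(64)
print(psa(torch.randn(1, 64, 32, 32)).shape)    # torch.Size([1, 64, 32, 32])
```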
{"title":"MPSA-DenseNet: A novel deep learning model for English accent classification","authors":"Tianyu Song , Linh Thi Hoai Nguyen , Ton Viet Ta","doi":"10.1016/j.csl.2024.101676","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101676","url":null,"abstract":"<div><p>This paper presents three innovative deep learning models for English accent classification: Multi-task Pyramid Split Attention- Densely Convolutional Networks (MPSA-DenseNet), Pyramid Split Attention- Densely Convolutional Networks (PSA-DenseNet), and Multi-task- Densely Convolutional Networks (Multi-DenseNet), that combine multi-task learning and/or the PSA module attention mechanism with DenseNet. We applied these models to data collected from five dialects of English across native English-speaking regions (England, the United States) and nonnative English-speaking regions (Hong Kong, Germany, India). Our experimental results show a significant improvement in classification accuracy, particularly with MPSA-DenseNet, which outperforms all other models, including Densely Convolutional Networks (DenseNet) and Efficient Pyramid Squeeze Attention (EPSA) models previously used for accent identification. Our findings indicate that MPSA-DenseNet is a highly promising model for accurately identifying English accents.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101676"},"PeriodicalIF":4.3,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000597/pdfft?md5=45eac4ef8fe33cc3af54ca5ce1756899&pid=1-s2.0-S0885230824000597-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141264076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel and secured email classification using deep neural network with bidirectional long short-term memory
Pub Date: 2024-05-27 | DOI: 10.1016/j.csl.2024.101667
A. Poobalan , K. Ganapriya , K. Kalaivani , K. Parthiban
Email data has characteristics that differ from other social media data, such as a large range of answers, formal language, notable length variations, a high degree of anomalies, and indirect relationships. The main goal of this research is to develop a robust and computationally efficient classifier that can distinguish between spam and regular email content. The publicly accessible benchmark Enron dataset was used for the tests; the six distinct Enron data sets we acquired were combined to generate the final seven Enron data sets. The data undergoes early preprocessing to remove superfluous sentences. The proposed model, a deep neural network with bidirectional long short-term memory (DNN-BiLSTM), applies spam labels and examines email documents for spam. On the seven Enron datasets, DNN-BiLSTM outperforms other classifiers in the accuracy comparison: DNN-BiLSTM and convolutional neural networks classified spam with 96.39 % and 98.69 % accuracy, respectively, against other machine learning classifiers. The paper also covers the risks associated with cloud data management and potential security flaws. This research presents hybrid encryption as a means of protecting cloud data while preserving privacy, using the hybrid AES-Rabit encryption algorithm, which is based on symmetric session key exchange.
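A minimal sketch (an assumed architecture, not the paper's exact configuration) of a DNN-BiLSTM spam classifier: token embeddings feed a bidirectional LSTM, whose pooled states feed a small dense network that emits a spam logit.

```python
import torch
import torch.nn as nn

class SpamBiLSTM(nn.Module):
    """Embedding -> BiLSTM -> dense layers -> single spam/ham logit."""
    def __init__(self, vocab=20000, emb=128, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)
        self.bilstm = nn.LSTM(emb, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Sequential(              # the "DNN" part on top
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tokens):                  # tokens: (B, T) integer ids
        h, _ = self.bilstm(self.emb(tokens))    # (B, T, 2*hidden)
        return self.head(h.mean(dim=1)).squeeze(-1)  # pool over time

model = SpamBiLSTM()
logits = model(torch.randint(1, 20000, (4, 200)))   # 4 emails, 200 tokens
spam_prob = torch.sigmoid(logits)               # probability of spam
```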
{"title":"A novel and secured email classification using deep neural network with bidirectional long short-term memory","authors":"A. Poobalan , K. Ganapriya , K. Kalaivani , K. Parthiban","doi":"10.1016/j.csl.2024.101667","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101667","url":null,"abstract":"<div><p>Email data has some characteristics that are different from other social media data, such as a large range of answers, formal language, notable length variations, high degrees of anomalies, and indirect relationships. The main goal in this research is to develop a robust and computationally efficient classifier that can distinguish between spam and regular email content. The benchmark Enron dataset, which is accessible to the public, was used for the tests. The six distinct Enron data sets we acquired were combined to generate the final seven Enron data sets. The dataset undergoes early preprocessing to remove superfluous sentences. The proposed model Bidirectional Long Short-Term Memory (BiLSTM) apply spam labels and to examine email documents for spam. On seven Enron datasets, DNN-BiLSTM performs better than other classifiers in the performance comparison in terms of accuracy. DNN-BiLSTM and convolutional neural networks demonstrated that they can classify spam with 96.39 % and 98.69 % accuracy, respectively, in comparison to other machine learning classifiers. The risks associated with cloud data management and potential security flaws are also covered in the paper. This research presents hybrid encryption as a means of protecting cloud data while preserving privacy by using the hybrid AES-Rabit encryption algorithm which is based on symmetric session key exchange.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101667"},"PeriodicalIF":4.3,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000500/pdfft?md5=93a3ab04f63a63c4343031dc3b1f9eca&pid=1-s2.0-S0885230824000500-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141250220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech emotion recognition in real static and dynamic human-robot interaction scenarios
Pub Date: 2024-05-22 | DOI: 10.1016/j.csl.2024.101666
Nicolás Grágeda , Carlos Busso , Eduardo Alvarado , Ricardo García , Rodrigo Mahu , Fernando Huenupan , Néstor Becerra Yoma
The use of speech-based solutions is an appealing alternative for communication in human-robot interaction (HRI). An important challenge in this area is processing distant speech, which is often noisy and affected by reverberation and time-varying acoustic channels. It is important to investigate effective speech solutions, especially in dynamic environments where the robots and the users move, changing the distance and orientation between a speaker and the microphone. This paper addresses this problem in the context of speech emotion recognition (SER), an important task for understanding the intention of the message and the underlying mental state of the user. We propose a novel setup with a PR2 robot that moves as target speech and ambient noise are simultaneously recorded. Our study not only analyzes the detrimental effect of distant speech in this dynamic robot-user setting for SER but also provides solutions to attenuate it. We evaluate two beamforming schemes to spatially filter the speech signal: delay-and-sum (D&S) and minimum variance distortionless response (MVDR). We consider the original training speech recorded in controlled situations, and simulated conditions where the training utterances are processed to simulate the target acoustic environment. We consider the case where the robot is moving (dynamic case) and not moving (static case). For SER, we explore two state-of-the-art classifiers: one using hand-crafted features with the ladder network strategy, and one using learned features from the wav2vec 2.0 representation. MVDR led to a higher signal-to-noise ratio than the basic D&S method. However, both approaches provided very similar average concordance correlation coefficient (CCC) improvements, equal to 116 %, on the HRI subsets using the ladder network trained with the original MSP-Podcast training utterances. For the wav2vec 2.0-based model, only D&S led to improvements. Surprisingly, the static and dynamic HRI testing subsets resulted in a similar average CCC. Finally, simulating the acoustic environment in the training dataset provided the highest average CCC scores on the HRI subsets, just 29 % and 22 % lower than those obtained with the original training/testing utterances for the ladder network and wav2vec 2.0, respectively.
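For reference, both beamformers can be written as closed-form per-frequency weight vectors applied to the microphone STFTs. The NumPy sketch below (free-field steering and a toy noise covariance, both assumptions) shows D&S as align-and-average and MVDR as the noise-covariance-weighted solution; both keep a distortionless response toward the target.

```python
import numpy as np

def steering_vector(mic_delays, freq):
    """Free-field steering vector for one frequency bin (delays in seconds)."""
    return np.exp(-2j * np.pi * freq * np.asarray(mic_delays))

def das_weights(d):
    """Delay-and-sum: align the microphones toward the target and average."""
    return d / len(d)

def mvdr_weights(d, R):
    """MVDR: minimize output noise power subject to w^H d = 1."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# toy example: 4 mics, 1 kHz bin, diagonally loaded noise covariance
d = steering_vector([0.0, 1e-4, 2e-4, 3e-4], 1000.0)
noise = np.random.randn(4, 1000) + 1j * np.random.randn(4, 1000)
R = np.cov(noise) + 1e-3 * np.eye(4)
for w in (das_weights(d), mvdr_weights(d, R)):
    print(np.abs(w.conj() @ d))   # response toward the target stays ~1
```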
{"title":"Speech emotion recognition in real static and dynamic human-robot interaction scenarios","authors":"Nicolás Grágeda , Carlos Busso , Eduardo Alvarado , Ricardo García , Rodrigo Mahu , Fernando Huenupan , Néstor Becerra Yoma","doi":"10.1016/j.csl.2024.101666","DOIUrl":"10.1016/j.csl.2024.101666","url":null,"abstract":"<div><p>The use of speech-based solutions is an appealing alternative to communicate in human-robot interaction (HRI). An important challenge in this area is processing distant speech which is often noisy, and affected by reverberation and time-varying acoustic channels. It is important to investigate effective speech solutions, especially in dynamic environments where the robots and the users move, changing the distance and orientation between a speaker and the microphone. This paper addresses this problem in the context of speech emotion recognition (SER), which is an important task to understand the intention of the message and the underlying mental state of the user. We propose a novel setup with a PR2 robot that moves as target speech and ambient noise are simultaneously recorded. Our study not only analyzes the detrimental effect of distance speech in this dynamic robot-user setting for speech emotion recognition but also provides solutions to attenuate its effect. We evaluate the use of two beamforming schemes to spatially filter the speech signal using either delay-and-sum (D&S) or minimum variance distortionless response (MVDR). We consider the original training speech recorded in controlled situations, and simulated conditions where the training utterances are processed to simulate the target acoustic environment. We consider the case where the robot is moving (dynamic case) and not moving (static case). For speech emotion recognition, we explore two state-of-the-art classifiers using hand-crafted features implemented with the ladder network strategy and learned features implemented with the wav2vec 2.0 feature representation. MVDR led to a signal-to-noise ratio higher than the basic D&S method. However, both approaches provided very similar average concordance correlation coefficient (CCC) improvements equal to 116 % with the HRI subsets using the ladder network trained with the original MSP-Podcast training utterances. For the wav2vec 2.0-based model, only D&S led to improvements. Surprisingly, the static and dynamic HRI testing subsets resulted in a similar average concordance correlation coefficient. 
Finally, simulating the acoustic environment in the training dataset provided the highest average concordance correlation coefficient scores with the HRI subsets that are just 29 % and 22 % lower than those obtained with the original training/testing utterances, with ladder network and wav2vec 2.0, respectively.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101666"},"PeriodicalIF":4.3,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000494/pdfft?md5=10d8a0faec641adaf8be74271eaf5174&pid=1-s2.0-S0885230824000494-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141134350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An exploratory characterization of speech- and fine-motor coordination in verbal children with Autism spectrum disorder
Pub Date: 2024-05-22 | DOI: 10.1016/j.csl.2024.101665
Tanya Talkar , James R. Williamson , Sophia Yuditskaya , Daniel J. Hannon , Hrishikesh M. Rao , Lisa Nowinski , Hannah Saro , Maria Mody , Christopher J. McDougle , Thomas F. Quatieri
Autism spectrum disorder (ASD) is a neurodevelopmental disorder often associated with difficulties in speech production and fine-motor tasks. Thus, there is a need to develop objective measures to assess and understand speech production and other fine-motor challenges in individuals with ASD. In addition, recent research suggests that difficulties with speech production and fine-motor tasks may contribute to language difficulties in ASD. In this paper, we explore the utility of an off-body recording platform, from which we administer a speech- and fine-motor protocol to verbal children with ASD and neurotypical controls. We utilize a correlation-based analysis technique to develop proxy measures of motor coordination from signals derived from recordings of speech- and fine-motor behaviors. Eigenvalues of the resulting correlation matrix are inputs to Gaussian Mixture Models to discriminate between highly verbal children with ASD and neurotypical controls. These eigenvalues also characterize the complexity (underlying dimensionality) of representative signals of speech- and fine-motor movement dynamics, and form the feature basis to estimate scores on an expressive vocabulary measure. Based on a pilot dataset (15 ASD and 15 controls), features derived from an oral story reading task discriminate between the two groups with AUCs > 0.80, and highlight lower complexity of coordination in children with ASD. Features derived from handwriting and maze tracing tasks led to AUCs of 0.86 and 0.91, respectively; however, features derived from ocular tasks did not aid in discriminating between the ASD and neurotypical groups. In addition, features derived from free speech and sustained vowel tasks are strongly correlated with expressive vocabulary scores. These results indicate the promise of a correlation-based analysis in elucidating motor differences between individuals with ASD and neurotypical controls.
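A rough sketch of the correlation-based pipeline described above, under heavy assumptions (toy signals, an arbitrary delay set, placeholder features): time-delayed copies of each signal are correlated, the eigenvalue spectrum of the correlation matrix summarizes coordination complexity, and per-group Gaussian Mixture Models score new recordings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def coordination_eigenvalues(signals, delays=range(0, 50, 5)):
    """Correlate time-delayed copies of each signal and return the sorted
    eigenvalue spectrum; a faster-decaying spectrum suggests lower-dimensional
    (more tightly coupled) movement dynamics."""
    cut = max(delays)
    rows = [np.roll(s, d)[cut:] for s in signals for d in delays]
    corr = np.corrcoef(np.vstack(rows))
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# toy stand-in for two recorded channels (e.g., acoustic and motor traces)
t = np.linspace(0, 10, 1000)
x = [np.sin(2 * np.pi * 1.3 * t), np.sin(2 * np.pi * 1.3 * t + 0.4)]
eig = coordination_eigenvalues(x)

# with eigenvalue vectors from many recordings, one GMM per group is fit
# and a recording is assigned to whichever group scores it more likely
X = np.random.randn(30, eig.size) + eig        # placeholder feature matrix
gmm_a = GaussianMixture(2, covariance_type="diag").fit(X[:15])
gmm_b = GaussianMixture(2, covariance_type="diag").fit(X[15:])
llr = gmm_a.score_samples(X) - gmm_b.score_samples(X)  # group log-odds
```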
{"title":"An exploratory characterization of speech- and fine-motor coordination in verbal children with Autism spectrum disorder","authors":"Tanya Talkar , James R. Williamson , Sophia Yuditskaya , Daniel J. Hannon , Hrishikesh M. Rao , Lisa Nowinski , Hannah Saro , Maria Mody , Christopher J. McDougle , Thomas F. Quatieri","doi":"10.1016/j.csl.2024.101665","DOIUrl":"10.1016/j.csl.2024.101665","url":null,"abstract":"<div><p>Autism spectrum disorder (ASD) is a neurodevelopmental disorder often associated with difficulties in speech production and fine-motor tasks. Thus, there is a need to develop objective measures to assess and understand speech production and other fine-motor challenges in individuals with ASD. In addition, recent research suggests that difficulties with speech production and fine-motor tasks may contribute to language difficulties in ASD. In this paper, we explore the utility of an off-body recording platform, from which we administer a speech- and fine-motor protocol to verbal children with ASD and neurotypical controls. We utilize a correlation-based analysis technique to develop proxy measures of motor coordination from signals derived from recordings of speech- and fine-motor behaviors. Eigenvalues of the resulting correlation matrix are inputs to Gaussian Mixture Models to discriminate between highly-verbal children with ASD and neurotypical controls. These eigenvalues also characterize the complexity (underlying dimensionality) of representative signals of speech- and fine-motor movement dynamics, and form the feature basis to estimate scores on an expressive vocabulary measure. Based on a pilot dataset (15 ASD and 15 controls), features derived from an oral story reading task are used in discriminating between the two groups with AUCs > 0.80, and highlight lower complexity of coordination in children with ASD. Features derived from handwriting and maze tracing tasks led to AUCs of 0.86 and 0.91, however features derived from ocular tasks did not aid in discrimination between the ASD and neurotypical groups. In addition, features derived from free speech and sustained vowel tasks are strongly correlated with expressive vocabulary scores. These results indicate the promise of a correlation-based analysis in elucidating motor differences between individuals with ASD and neurotypical controls.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101665"},"PeriodicalIF":4.3,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000482/pdfft?md5=6554015220341426a1f33615cd53fd75&pid=1-s2.0-S0885230824000482-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141139333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A potential relation trigger method for entity-relation quintuple extraction in text with excessive entities
Pub Date: 2024-05-15 | DOI: 10.1016/j.csl.2024.101650
Xiaojun Xia , Yujiang Liu , Lijun Fu
In the task of joint entity and relation extraction, the relationship between two entities is determined by specific words in the source text. These words can be viewed as potential triggers: evidence that explains the relationship, but not explicitly marked. However, current models cannot make good use of these potential trigger words to optimize the entity and relation components, and can only give separate results. These models aim to identify the type of relation between two entities mentioned in the source text by encoding the text and the entities. Although some models can generate a weight for every single word by improving the attention mechanism, the weights are inevitably influenced by irrelevant words, which works against enhancing the influence of the triggers. We propose a joint entity-relation quintuple extraction framework based on the Potential Relation Trigger (PRT) method, which selects the highest-probability word as the prompt at every time step and joins the selected words together as relation hints. Specifically, we leverage a polarization mechanism in the probability calculation to avoid non-differentiable points of the functions in our method when choosing words. We find that the resulting representation improves the performance of the relation component together with the exact spans of the entities. Extensive experimental results demonstrate that our proposed model achieves state-of-the-art performance on four RE benchmark datasets.
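The abstract does not spell out the exact polarization mechanism; as one differentiable stand-in, the sketch below uses a temperature-sharpened softmax that approaches a hard argmax pick of the trigger word while keeping gradients flowing (PyTorch; all scores and dimensions are illustrative).

```python
import torch
import torch.nn.functional as F

def polarized_trigger_weights(scores, tau=0.1):
    """Sharpened ("polarized") softmax over token scores: close to a hard
    argmax selection of the trigger word, yet everywhere differentiable."""
    return F.softmax(scores / tau, dim=-1)

# toy example: attention-style scores of 6 source tokens at one time step
scores = torch.tensor([0.2, 1.5, 0.1, 2.9, 0.3, 0.8], requires_grad=True)
w = polarized_trigger_weights(scores)
token_emb = torch.randn(6, 16)       # contextual token embeddings
hint = w @ token_emb                 # soft "relation hint" vector
hint.sum().backward()                # gradients flow through the selection
print(w.detach())                    # mass concentrates on the top token
```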
Room impulse response reshaping-based expectation–maximization in an underdetermined reverberant environment
Pub Date: 2024-05-14 | DOI: 10.1016/j.csl.2024.101664
Yuan Xie , Tao Zou , Junjie Yang , Weijun Sun , Shengli Xie
Source separation in an underdetermined reverberant environment is a very challenging issue. The classical method is based on the expectation–maximization (EM) algorithm; however, its performance degrades in highly reverberant environments, resulting in poor or even invalid separation. To eliminate this restriction, a room impulse response reshaping-based expectation–maximization method is designed to solve the problem of source separation in an underdetermined reverberant environment. First, a room impulse response reshaping technique is designed to eliminate the influence of audible echoes on the reverberant environment, improving the quality of the received signals. Then, a new mathematical model of time-frequency mixing signals is established to reduce the approximation error of the model transformation caused by high reverberation. Furthermore, an improved expectation–maximization method is proposed with real-time update rules for the model parameters, and the sources are then separated using the estimators it provides. Experimental results on the separation of speech and music mixtures demonstrate that the proposed algorithm achieves better separation performance while maintaining much better robustness than popular expectation–maximization methods.
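As background for the EM component, here is a skeleton of the classical local-Gaussian EM for an underdetermined time-frequency mixture (NumPy). This is not the proposed reshaping-based method; the single-channel model, shapes, and simplified update rule are illustrative assumptions.

```python
import numpy as np

def em_local_gaussian(X, n_src=3, n_iter=30, eps=1e-8):
    """Skeleton EM under a local Gaussian source model: the E-step forms
    Wiener-style posteriors, the M-step re-estimates per-source variances."""
    P = np.abs(X) ** 2                          # observed power spectrogram
    V = np.random.rand(n_src, *X.shape) + eps   # source variance estimates
    for _ in range(n_iter):
        total = V.sum(axis=0) + eps             # mixture variance per TF bin
        G = V / total                           # Wiener gains (posterior means)
        # E-step: posterior second moment of each source
        post = G**2 * P + G * (1.0 - G) * total
        V = post                                # M-step: update the variances
    return V / V.sum(axis=0)                    # soft masks, sum to 1 per bin

rng = np.random.default_rng(0)
X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
masks = em_local_gaussian(X)
print(masks.shape)                              # (3, 257, 100)
```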
{"title":"Room impulse response reshaping-based expectation–maximization in an underdetermined reverberant environment","authors":"Yuan Xie , Tao Zou , Junjie Yang , Weijun Sun , Shengli Xie","doi":"10.1016/j.csl.2024.101664","DOIUrl":"10.1016/j.csl.2024.101664","url":null,"abstract":"<div><p>Source separation in an underdetermined reverberation environment is a very challenging issue. The classical method is based on the expectation–maximization algorithm. However, it is limited to high reverberation environments, resulting in bad or even invalid separation performance. To eliminate this restriction, a room impulse response reshaping-based expectation–maximization method is designed to solve the problem of source separation in an underdetermined reverberant environment. Firstly, a room impulse response reshaping technology is designed to eliminate the influence of audible echo on the reverberant environment, improving the quality of the received signals. Then, a new mathematical model of time-frequency mixing signals is established to reduce the approximation error of model transformation caused by high reverberation. Furthermore, an improved expectation–maximization method is proposed for real-time update learning rules of model parameters, and then the sources are separated using the estimators provided by the improved expectation–maximization method. Experimental results based on source separation of speech and music mixtures demonstrate that the proposed algorithm achieves better separation performance while maintaining much better robustness than popular expectation–maximization methods.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"88 ","pages":"Article 101664"},"PeriodicalIF":4.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141047534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection
Pub Date: 2024-05-11 | DOI: 10.1016/j.csl.2024.101663
Julia Ohse , Bakir Hadžić , Parvez Mohammed , Nicolina Peperkorn , Michael Danner , Akihiro Yorita , Naoyuki Kubota , Matthias Rätsch , Youssef Shiban
Depression is a significant global health challenge. Still, many people suffering from depression remain undiagnosed. Furthermore, the assessment of depression can be subject to human bias. Natural Language Processing (NLP) models offer a promising solution. We investigated the potential of four NLP models (BERT, Llama2-13B, GPT-3.5, and GPT-4) for depression detection in clinical interviews. Participants (N = 82) underwent clinical interviews and completed a self-report depression questionnaire. The NLP models inferred depression scores from the interview transcripts, and the questionnaire's cut-off values for depression served as the classification labels. GPT-4 showed the highest accuracy for depression classification (F1 score 0.73), while zero-shot GPT-3.5 initially performed with low accuracy (0.34), improved to 0.82 after fine-tuning, and achieved 0.68 with clustered data. GPT-4 estimates of PHQ-8 symptom severity correlated strongly (r = 0.71) with true symptom severity. These findings demonstrate the potential of AI models for depression detection. However, further research is necessary before widespread deployment can be considered.
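A minimal sketch of transcript-based scoring with the OpenAI chat API, using the current `openai` Python client; the prompt wording, the single-integer reply format, and the file name are assumptions for illustration, not the study's protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You will read a clinical interview transcript. Estimate the speaker's "
    "PHQ-8 depression score (0-24). Reply with a single integer."
)

def estimate_phq8(transcript: str, model: str = "gpt-4") -> int:
    """Ask the model for a PHQ-8 estimate and parse the integer reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": transcript}],
    )
    return int(response.choices[0].message.content.strip())

# hypothetical usage with a transcript file:
# score = estimate_phq8(open("interview_042.txt").read())
```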
{"title":"Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection","authors":"Julia Ohse , Bakir Hadžić , Parvez Mohammed , Nicolina Peperkorn , Michael Danner , Akihiro Yorita , Naoyuki Kubota , Matthias Rätsch , Youssef Shiban","doi":"10.1016/j.csl.2024.101663","DOIUrl":"10.1016/j.csl.2024.101663","url":null,"abstract":"<div><p>Depression is a significant global health challenge. Still, many people suffering from depression remain undiagnosed. Furthermore, the assessment of depression can be subject to human bias. Natural Language Processing (NLP) models offer a promising solution. We investigated the potential of four NLP models (BERT, Llama2-13B, GPT-3.5, and GPT-4) for depression detection in clinical interviews. Participants (N = 82) underwent clinical interviews and completed a self-report depression questionnaire. NLP models inferred depression scores from interview transcripts. Questionnaire cut-off values for depression were used as a classifier for depression. GPT-4 showed the highest accuracy for depression classification (F1 score 0.73), while zero-shot GPT-3.5 initially performed with low accuracy (0.34), improved to 0.82 after fine-tuning, and achieved 0.68 with clustered data. GPT-4 estimates of symptom severity PHQ-8 score correlated strongly (r = 0.71) with true symptom severity. These findings demonstrate the potential of AI models for depression detection. However, further research is necessary before widespread deployment can be considered.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"88 ","pages":"Article 101663"},"PeriodicalIF":4.3,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141043762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two in One: A multi-task framework for politeness turn identification and phrase extraction in goal-oriented conversations
Pub Date: 2024-05-06 | DOI: 10.1016/j.csl.2024.101661
Priyanshu Priya, Mauajama Firdaus, Asif Ekbal
Goal-oriented dialogue systems are becoming pervasive in human lives. To facilitate task completion and human participation in a practical setting, such systems must have extensive technical knowledge and social understanding. Politeness is a socially desirable trait that plays a crucial role in task-oriented conversations for ensuring better user engagement and satisfaction. To this end, we propose a novel task of politeness analysis in goal-oriented dialogues. Politeness analysis consists of two sub-tasks: politeness turn identification and phrase extraction. Politeness turn identification depends on textual triggers denoting politeness or impoliteness. In this regard, we propose a Bidirectional Encoder Representations from Transformers-Directional Graph Convolutional Network (BERT-DGCN) based multi-task learning approach that performs the turn identification and phrase extraction tasks in a unified framework. Our approach employs BERT to encode input turns and DGCN to encode syntactic information; the dependencies among words are incorporated into DGCN to improve its capability to represent input utterances, which in turn benefits the politeness analysis task. Our model classifies each turn of a conversation into one of three pre-defined classes, viz. polite, impolite, and neutral, and simultaneously extracts phrases denoting politeness or impoliteness in that turn. As no such data is readily available, we prepare a conversational dataset, PoDial, covering mental health counseling and legal aid for crime victims in English. Experimental results demonstrate that our approach is effective, improving over baselines on our dataset by 2.04 points in turn identification accuracy and 2.40 points in phrase extraction F1-score.
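A minimal sketch of the multi-task head arrangement (Hugging Face transformers): one shared encoder, a turn-level classification head over the [CLS] state, and a token-level BIO tagging head for phrase extraction. The dependency-aware DGCN branch from the paper is omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PolitenessMultiTask(nn.Module):
    """Shared BERT encoder with two heads: 3-way turn classification and
    token-level BIO politeness-phrase tagging (DGCN branch omitted here)."""
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        h = self.encoder.config.hidden_size
        self.turn_head = nn.Linear(h, 3)   # polite / impolite / neutral
        self.span_head = nn.Linear(h, 3)   # B / I / O phrase tags

    def forward(self, **inputs):
        out = self.encoder(**inputs).last_hidden_state   # (B, T, h)
        return self.turn_head(out[:, 0]), self.span_head(out)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PolitenessMultiTask()
batch = tok(["Could you kindly send the report?"], return_tensors="pt")
turn_logits, span_logits = model(**batch)
# joint loss: cross-entropy on the turn label plus cross-entropy on BIO tags
```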
{"title":"Two in One: A multi-task framework for politeness turn identification and phrase extraction in goal-oriented conversations","authors":"Priyanshu Priya, Mauajama Firdaus, Asif Ekbal","doi":"10.1016/j.csl.2024.101661","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101661","url":null,"abstract":"<div><p>Goal-oriented dialogue systems are becoming pervasive in human lives. To facilitate task completion and human participation in a practical setting, such systems must have extensive technical knowledge and social understanding. Politeness is a socially desirable trait that plays a crucial role in task-oriented conversations for ensuring better user engagement and satisfaction. To this end, we propose a novel task of politeness analysis in goal-oriented dialogues. Politeness analysis consists of two sub-tasks: politeness turn identification and phrase extraction. Politeness turn identification is dependent on textual triggers denoting politeness or impoliteness. In this regard, we propose a Bidirectional Encoder Representations from Transformers-Directional Graph Convolutional Network (BERT-DGCN) based multi-task learning approach that performs turn identification and phrase extraction tasks in a unified framework. Our proposed approach employs BERT for encoding input turns and DGCN for encoding syntactic information, in which dependency among words is incorporated into DGCN to improve its capability to represent input utterances and benefit politeness analysis task accordingly. Our proposed model classifies each turn of a conversation into one of the three pre-defined classes, <em>viz.</em> polite, impolite and neutral, and extracts phrases denoting politeness or impoliteness in that turn simultaneously. As there is no such readily available data, we prepare a conversational dataset, <strong><em>PoDial</em></strong> for mental health counseling and legal aid for crime victims in English for our experiment. Experimental results demonstrate that our proposed approach is effective and achieves 2.04 points improvement on turn identification accuracy and 2.40 points on phrase extraction F1- score on our dataset over baselines.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"88 ","pages":"Article 101661"},"PeriodicalIF":4.3,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140947812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A cross-attention augmented model for event-triggered context-aware story generation
Pub Date: 2024-05-06 | DOI: 10.1016/j.csl.2024.101662
Chen Tang , Tyler Loakman , Chenghua Lin
Despite recent advancements, existing story generation systems continue to encounter difficulties in effectively incorporating contextual and event features, which greatly influence the quality of generated narratives. To tackle these challenges, we introduce a novel neural generation model, EtriCA, that enhances the relevance and coherence of generated stories by employing a cross-attention mechanism to map context features onto event sequences through residual mapping. This feature-capturing mechanism enables our model to exploit logical relationships between events more effectively during the story generation process. To further enhance the model, we employ a post-training framework for knowledge enhancement (KeEtriCA) on a large-scale book corpus, which allows EtriCA to adapt to a wider range of data samples and yields approximately 5% improvement in automatic metrics and over 10% improvement in human evaluation. We conduct extensive experiments, including comparisons with state-of-the-art (SOTA) baseline models, to evaluate the performance of our framework on story generation. The experimental results, encompassing both automated metrics and human assessments, demonstrate the superiority of our model over existing state-of-the-art baselines and underscore its effectiveness in leveraging context and event features to improve the quality of generated narratives.
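A minimal sketch of the core fusion step, assuming standard multi-head attention: event hidden states attend over context hidden states, and a residual connection preserves the original event features ("residual mapping"). Dimensions and the layer arrangement are assumptions, not EtriCA's exact implementation.

```python
import torch
import torch.nn as nn

class ContextEventCrossAttention(nn.Module):
    """Event states query the context states; a residual connection keeps
    the original event features alongside the attended context."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, events, context):   # (B, Te, d), (B, Tc, d)
        fused, _ = self.attn(query=events, key=context, value=context)
        return self.norm(events + fused)  # residual mapping + normalization

layer = ContextEventCrossAttention()
events, context = torch.randn(2, 5, 256), torch.randn(2, 40, 256)
print(layer(events, context).shape)       # torch.Size([2, 5, 256])
```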
{"title":"A cross-attention augmented model for event-triggered context-aware story generation","authors":"Chen Tang , Tyler Loakman , Chenghua Lin","doi":"10.1016/j.csl.2024.101662","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101662","url":null,"abstract":"<div><p>Despite recent advancements, existing story generation systems continue to encounter difficulties in effectively incorporating contextual and event features, which greatly influence the quality of generated narratives. To tackle these challenges, we introduce a novel neural generation model, EtriCA, that enhances the relevance and coherence of generated stories by employing a cross-attention mechanism to map context features onto event sequences through residual mapping. This feature capturing mechanism enables our model to exploit logical relationships between events more effectively during the story generation process. To further enhance our proposed model, we employ a post-training framework for knowledge enhancement (KeEtriCA) on a large-scale book corpus. This allows EtriCA to adapt to a wider range of data samples. This results in approximately 5% improvement in automatic metrics and over 10% improvement in human evaluation. We conduct extensive experiments, including comparisons with state-of-the-art (SOTA) baseline models, to evaluate the performance of our framework on story generation. The experimental results, encompassing both automated metrics and human assessments, demonstrate the superiority of our model over existing state-of-the-art baselines. These results underscore the effectiveness of our model in leveraging context and event features to improve the quality of generated narratives.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"88 ","pages":"Article 101662"},"PeriodicalIF":4.3,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000457/pdfft?md5=6b2981aa01c6fa0779df7e400f7d036a&pid=1-s2.0-S0885230824000457-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141077706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}