
Computer Speech and Language: Latest Publications

TR-Net: Token Relation Inspired Table Filling Network for Joint Entity and Relation Extraction
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-09 | DOI: 10.1016/j.csl.2024.101749
Yongle Kong , Zhihao Yang , Zeyuan Ding , Wenfei Liu , Shiqi Zhang , Jianan Xu , Hongfei Lin
Recently, table filling models have achieved promising performance in jointly extracting relation triplets from complex sentences, leveraging their inherent structural advantage of delineating entities and relations as table cells. Nonetheless, these models predominantly concentrate on the cells corresponding to entity pairs within the predicted tables, neglecting the interrelations among other token pairs. This oversight can potentially lead to the exclusion of essential token information. To address these challenges, we introduce the Token Relation-Inspired Network (TR-Net), a novel framework for the joint extraction of entities and relations. It encompasses a token relation generator that adaptively constructs a token relation table, concentrating on the prominent token cells. It also uses a structure-enhanced encoder that integrates the structural and sequential data of sentences via a highway gate mechanism. Our experimental analysis demonstrates that TR-Net delivers considerable enhancements and achieves state-of-the-art performance on four public datasets.
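As a toy illustration of the table-filling paradigm described above, the sketch below decodes relation triplets from a token-pair table; the tagging scheme and the `decode_triplets` helper are illustrative assumptions, not TR-Net's actual cell labeling.

```python
def decode_triplets(tokens, table):
    """Decode relation triplets from a token-pair table.

    table[i][j] holds a relation label linking the token at position i
    (subject head) to the token at position j (object head); None marks
    a non-target cell.
    """
    triplets = []
    for i, row in enumerate(table):
        for j, relation in enumerate(row):
            if relation is not None:
                triplets.append((tokens[i], relation, tokens[j]))
    return triplets

tokens = ["Paris", "lies", "in", "France"]
table = [[None] * len(tokens) for _ in tokens]
table[0][3] = "located_in"  # cell (Paris, France)
print(decode_triplets(tokens, table))  # [('Paris', 'located_in', 'France')]
```

The point TR-Net makes is that the non-target (`None`) cells also carry signal during training, even though only the labeled cells surface in the decoded output.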
Computer Speech and Language, Volume 90, Article 101749.
Citations: 0
CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-07 | DOI: 10.1016/j.csl.2024.101748
Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang
Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, both of which show good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention has been paid to how multimodality can help text classification. To incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to automatically find the corresponding text based on the classified images. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.
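A minimal sketch of the text–image&text matching idea: fuse a text embedding with an auxiliary image embedding, then match the fused query against label embeddings. The toy two-dimensional vectors and the simple weighted average are assumptions for illustration; CLIPMulti's actual combination strategies and CLIP encoders differ.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(text_emb, image_emb, label_embs, alpha=0.5):
    # Weighted fusion of the two modalities, then nearest label by cosine.
    query = alpha * text_emb + (1.0 - alpha) * image_emb
    return max(range(len(label_embs)), key=lambda i: cosine(query, label_embs[i]))

labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # hypothetical label embeddings
text_emb = np.array([0.9, 0.1])
image_emb = np.array([0.8, 0.2])
print(classify(text_emb, image_emb, labels))  # 0
```

Varying `alpha` corresponds loosely to the different image/text combinations the paper tests.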
Computer Speech and Language, Volume 90, Article 101748.
Citations: 0
UniKDD: A Unified Generative model for Knowledge-driven Dialogue
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-30 | DOI: 10.1016/j.csl.2024.101740
Qian Wang , Yan Chen , Yang Wang , Xu Wang
Knowledge-driven dialogue (KDD) introduces an external knowledge base to generate informative and fluent responses. However, previous works employ different models to conduct the sub-tasks of KDD, ignoring the connection between sub-tasks and making training and inference difficult. To address these issues, we propose UniKDD, a unified generative model for KDD, which casts all sub-tasks as a single generation task, strengthening the connection between tasks and facilitating training and inference. Specifically, UniKDD simplifies the complex KDD task into three main sub-tasks, i.e., entity prediction, attribute prediction, and dialogue generation. These tasks are transformed into a text generation task and trained in an end-to-end manner. In the inference phase, UniKDD first predicts a set of entities for the current dialogue turn according to the dialogue history. Then, for each predicted entity, UniKDD predicts the corresponding attributes from the dialogue history. Finally, UniKDD generates a high-quality and informative response using the dialogue history and the predicted knowledge triplets. The experimental results show that our proposed UniKDD performs the KDD task well and outperforms the baseline on the evaluation of knowledge selection and response generation. The code is available at https://github.com/qianandfei/UniKDD.git.
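The cast-everything-as-generation design can be sketched as a single text-to-text interface whose input string announces the sub-task. The prompt templates below are hypothetical, not UniKDD's actual serialization format.

```python
def build_input(task, history, condition=None):
    """Serialize each KDD sub-task as a text generation input.

    Inference chains the three sub-tasks: predicted entities condition
    attribute prediction, and predicted triplets condition the response.
    """
    if task == "entity":
        return f"[predict entities] history: {history}"
    if task == "attribute":
        return f"[predict attributes] entity: {condition} history: {history}"
    if task == "response":
        return f"[generate response] knowledge: {condition} history: {history}"
    raise ValueError(f"unknown sub-task: {task}")

print(build_input("entity", "Who directed Inception?"))
```

One shared generative model consumes all three input shapes, which is what lets the sub-tasks reinforce each other during end-to-end training.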
Computer Speech and Language, Volume 90, Article 101740.
Citations: 0
Exploring the ability of LLMs to classify written proficiency levels
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-29 | DOI: 10.1016/j.csl.2024.101745
Susanne DeVore
This paper tests the ability of LLMs to classify language proficiency ratings of texts written by learners of English and Mandarin, taking a benchmarking research design approach. First, the impact of five variables (LLM model, prompt version, prompt language, grading scale, and temperature) on rating accuracy is tested using a basic instruction-only prompt. Second, the consistency of results is tested. Third, the top-performing consistent conditions emerging from the first and second tests are used to test the impact on rating accuracy of adding examples and/or proficiency guidelines and of using zero-, one-, and few-shot chain-of-thought prompting techniques. While performance does not meet the levels necessary for real-world use cases, the results can inform ongoing development of LLMs and prompting techniques to improve accuracy. This paper highlights recent research on prompt engineering outside the field of linguistics and selects prompt variables and techniques that are theoretically relevant to proficiency rating. Finally, it discusses key takeaways from these tests that can inform future development, and why approaches that have been effective in other contexts were not as effective for proficiency rating.
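The benchmarking design (sweep prompt conditions, measure rating accuracy per condition) amounts to a grid search. In the sketch below the `rate` stub stands in for a real LLM call, and the scale/temperature values are illustrative assumptions.

```python
from itertools import product

def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def benchmark(rate_fn, texts, gold, scales, temperatures):
    """Score one (grading scale, temperature) condition at a time."""
    results = {}
    for scale, temp in product(scales, temperatures):
        preds = [rate_fn(text, scale, temp) for text in texts]
        results[(scale, temp)] = accuracy(preds, gold)
    return results

# Stub rater: pretend longer texts indicate higher proficiency.
rate = lambda text, scale, temp: min(scale, len(text.split()))
scores = benchmark(rate, ["a b", "a b c d"], [2, 4], scales=[4], temperatures=[0.0])
print(scores)  # {(4, 0.0): 1.0}
```

The paper's additional conditions (examples, guidelines, chain-of-thought variants) would extend the grid with further axes in the same way.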
Computer Speech and Language, Volume 90, Article 101745.
Citations: 0
Entity and relationship extraction based on span contribution evaluation and focusing framework
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-29 | DOI: 10.1016/j.csl.2024.101744
Qibin Li , Nianmin Yao , Nai Zhou , Jian Zhao
Entity and relationship extraction involves identifying named entities and extracting relationships between them. Existing research focuses on enhancing span representations, yet overlooks the impact of non-target spans (i.e., spans that are not entities, or span pairs with no relationship) on model training. In this work, we propose a span contribution evaluation and focusing framework named CEFF, which assigns each non-target span in a sentence a contribution score through pre-training; this score reflects the span's contribution to model performance improvement. To a certain extent, this method considers the impact of different spans on model training, making the training more targeted. Additionally, leveraging the contribution scores of non-target spans, we introduce a simplified variant of the model, termed CEFFs, which achieves performance comparable to models trained with all spans while utilizing fewer spans. This approach reduces training costs and improves training efficiency. Through extensive validation, we demonstrate that our contribution scores accurately reflect span contributions and achieve state-of-the-art results on five benchmark datasets.
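Using pre-computed contribution scores to shrink the set of training spans can be sketched as a top-k filter. The spans, scores, and budget below are illustrative assumptions; CEFF learns the scores via pre-training rather than taking them as given.

```python
def select_spans(spans, contribution, budget):
    """Keep the `budget` non-target spans with the highest contribution scores."""
    ranked = sorted(spans, key=lambda s: contribution[s], reverse=True)
    return ranked[:budget]

spans = [(0, 1), (1, 2), (0, 3), (2, 3)]           # (start, end) token offsets
contribution = {(0, 1): 0.9, (1, 2): 0.1, (0, 3): 0.7, (2, 3): 0.3}
print(select_spans(spans, contribution, budget=2))  # [(0, 1), (0, 3)]
```

Training on the retained spans plus all target spans is what yields the cheaper CEFFs variant with comparable accuracy.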
Computer Speech and Language, Volume 90, Article 101744.
Citations: 0
Taking relations as known conditions: A tagging based method for relational triple extraction
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-24 | DOI: 10.1016/j.csl.2024.101734
Guanqing Kong , Qi Lei
Relational triple extraction refers to extracting entities and relations from natural texts, a crucial task in the construction of knowledge graphs. Recently, tagging-based methods have received increasing attention because of their simple and effective structural form. Among them, two-step extraction methods are prone to category imbalance. To address this issue, we propose a novel two-step extraction method that first extracts subjects, generates a fixed-size embedding for each relation, and then regards these relations as known conditions to extract the objects directly for the identified subjects. To eliminate the influence of irrelevant relations when predicting objects, we use a relation-specific attention mechanism and a gate unit to select appropriate relations. In addition, most current models do not account for two-way interaction between tasks, so we design a feature-interactive network to achieve bidirectional interaction between the subject and object extraction tasks and strengthen their connection. Experimental results on the NYT, WebNLG, NYT⋆, and WebNLG⋆ datasets show that our model is competitive among joint extraction models.
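The "relation as known condition" idea can be sketched as a loop that, for each identified subject and each relation, tags object tokens directly. The tiny lookup-based `tagger` below stands in for the learned, relation-conditioned object tagger; all names and data are illustrative assumptions.

```python
def extract_triples(tokens, subjects, relations, tag_objects):
    """For every (subject, relation) pair, treat the relation as a known
    condition and tag object tokens directly."""
    triples = []
    for subject in subjects:
        for relation in relations:
            for obj in tag_objects(tokens, subject, relation):
                triples.append((subject, relation, obj))
    return triples

# Toy tagger: a tiny lookup standing in for the learned object tagger.
KB = {("Paris", "capital_of"): ["France"]}
tagger = lambda tokens, s, r: [t for t in KB.get((s, r), []) if t in tokens]
tokens = ["Paris", "is", "the", "capital", "of", "France"]
print(extract_triples(tokens, ["Paris"], ["capital_of"], tagger))
# [('Paris', 'capital_of', 'France')]
```

The paper's relation-specific attention and gate unit would prune the inner loop so that only plausible relations are ever used as conditions.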
Computer Speech and Language, Volume 90, Article 101734.
Citations: 0
What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-22 | DOI: 10.1016/j.csl.2024.101738
Julian Linke , Bernhard C. Geiger , Gernot Kubin , Barbara Schuppler
High-performing speech recognition is important for more fluent human–machine interaction (e.g., dialogue systems). Modern ASR architectures achieve human-level recognition performance on read speech but still perform sub-par on conversational speech, which arguably is, or at least will be, instrumental for human–machine interaction. Understanding the factors behind this shortcoming of modern ASR systems may suggest directions for improving them. In this work, we compare the performance of HMM-based vs. transformer-based ASR architectures on a corpus of Austrian German conversational speech. Specifically, we investigate how strongly utterance length, prosody, pronunciation, and utterance complexity as measured by perplexity affect different ASR architectures. Among other findings, we observe that single-word utterances – which are characteristic of conversational speech and constitute roughly 30% of the corpus – are recognized more accurately if their F0 contour is flat; for longer utterances, the effects of the F0 contour tend to be weaker. We further find that zero-shot systems require longer utterance lengths and are less robust to pronunciation variation, which indicates that pronunciation lexicons and fine-tuning on the respective corpus are essential ingredients for the successful recognition of conversational speech.
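Analyzing recognition performance by factors such as utterance length presupposes a per-utterance error metric. A standard word error rate via Levenshtein distance (a generic implementation, not the paper's evaluation code) can be sketched as:

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance over word sequences, normalized
    by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit"))  # one substitution out of three words
```

Bucketing utterances by word count before averaging `wer` reproduces the kind of length-conditioned comparison the study performs.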
Computer Speech and Language, Volume 90, Article 101738.
Citations: 0
Tickling translations: Small but mighty open-sourced transformers bring English PUN-ny entities to life in French!
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-22 | DOI: 10.1016/j.csl.2024.101739
Farhan Dhanani, Muhammad Rafi, Muhammad Atif Tahir
Recent advancements in transformer-based language models have demonstrated substantial progress in producing good translations. Despite these achievements, challenges persist in translating playful requests, especially when users intentionally introduce humor. Deciphering the hidden pun in such playful requests is one of the major difficulties for modern language models, which causes user dissatisfaction. This paper targets a specific niche of humor translation: the translation of English named entities containing puns into French using small-scale open-sourced transformer models. The transformer architecture serves as a foundation for popular language models like ChatGPT. It allows learning long-range contextual relationships within sequences. The main novelty of the paper is the proposed extractive question/answering (Q/A) styled technique based on transformers to find relevant translations for the provided English nouns using openly available parallel corpora. To evaluate the effectiveness of our method, we utilize a dataset provided by the JOKER CLEF automatic pun and humor translation 2022 team. The dataset contains single-word nouns from popular novels, anime, movies, and games, each containing a pun. The discussed methodology and experimental framework are adaptable and can be extended to any language pair for which an open, available parallel corpus exists. This flexibility underscores the broader applicability of our findings and suggests the potential for enhancing humor translation across various language combinations.
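The retrieval step of the extractive Q/A framing, gathering parallel-corpus contexts whose English side mentions the entity so a QA model can then extract the French span, can be sketched as follows. The corpus and the substring matching rule are illustrative assumptions.

```python
def candidate_contexts(entity, parallel_corpus):
    """Return French sentences aligned with English sentences that mention
    the entity; these become contexts for an extractive QA model."""
    entity = entity.lower()
    return [fr for en, fr in parallel_corpus if entity in en.lower()]

corpus = [
    ("Severus Snape brewed a potion.", "Severus Rogue préparait une potion."),
    ("The weather was fine.", "Il faisait beau."),
]
print(candidate_contexts("Severus Snape", corpus))
# ['Severus Rogue préparait une potion.']
```

An extractive QA model would then be asked, in effect, "what is Severus Snape called here?" over each retrieved French context.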
{"title":"Tickling translations: Small but mighty open-sourced transformers bring English PUN-ny entities to life in French!","authors":"Farhan Dhanani,&nbsp;Muhammad Rafi,&nbsp;Muhammad Atif Tahir","doi":"10.1016/j.csl.2024.101739","DOIUrl":"10.1016/j.csl.2024.101739","url":null,"abstract":"<div><div>Recent advancements in transformer-based language models have demonstrated substantial progress in producing good translations. Despite these achievements, challenges persist in translating playful requests, especially when users intentionally introduce humor. Deciphering the hidden pun among such playful requests is one of the major difficulties for modern language models, which causes user dissatisfaction. This paper targets a specific niche of humor translation, which is the translation of English-named entities containing puns into French using small-scale open-sourced transformer models. The transformer architecture serves as a foundation for popular language models like chatGPT. It allows learning long-range contextual relationships within sequences. The main novelty of the paper is the proposed extractive question/answering (Q/A) styled technique based on the transformers to find relevant translations for the provided English nouns using the openly available parallel corpora. To evaluate the effectiveness of our method, we utilize a dataset provided by the JOKER CLEF automatic pun and humor translation 2022 team. The dataset contains single-word nouns from popular novels, anime, movies, and games, each containing a pun. The discussed methodology and experimental framework are adaptable and can be extended to any language pair for which an open, available parallel corpus exists. 
This flexibility underscores the broader applicability of our findings and suggests the potential for enhancing humor translation across various language combinations.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101739"},"PeriodicalIF":3.1,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
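As a concrete illustration of the extractive lookup the abstract describes, the sketch below retrieves parallel sentence pairs containing a pun-bearing English noun and returns the French sides as translation candidates. Everything here is a hedged stand-in: the two-pair corpus, the names `retrieve_pairs` and `extract_candidates`, and the naive substring match all replace the paper's transformer-based Q/A extraction over a real parallel corpus.

```python
# A toy parallel corpus; the paper uses openly available English-French
# corpora instead (this two-pair list is invented for illustration).
TOY_PARALLEL_CORPUS = [
    ("The bat flew out of the cave.", "La chauve-souris est sortie de la grotte."),
    ("He swung the bat at the ball.", "Il a frappé la balle avec la batte."),
]

def retrieve_pairs(noun, corpus):
    """Naive retrieval: keep every pair whose English side mentions the noun.
    A real system would use lemmatized or fuzzy matching."""
    return [(en, fr) for en, fr in corpus if noun.lower() in en.lower()]

def extract_candidates(noun, corpus):
    """Stand-in for the extractive Q/A step: in the paper, a transformer
    answers "how is <noun> translated?" against the retrieved French side;
    here we simply return the French sentences as candidate contexts."""
    return [fr for _, fr in retrieve_pairs(noun, corpus)]

# For a pun word like "bat", both senses surface as separate candidates,
# which is exactly what makes pun-aware translation hard.
candidates = extract_candidates("bat", TOY_PARALLEL_CORPUS)
print(candidates)
```

Both French renderings come back ("chauve-souris" for the animal, "batte" for the club), so a downstream step still has to decide which sense, or which piece of wordplay, the translation should preserve.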
Citations: 0
Combining replay and LoRA for continual learning in natural language understanding
IF 3.1 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2024-10-19 DOI: 10.1016/j.csl.2024.101737
Zeinab Borhanifard, Heshaam Faili, Yadollah Yaghoobzadeh
Large language models have significantly improved dialogue systems through enhanced capabilities in understanding queries and generating responses. Despite these enhancements, task-oriented dialogue systems – which power many intelligent assistants – face challenges when adapting to new domains and applications. This challenge arises from a phenomenon known as catastrophic forgetting, where models forget previously acquired knowledge when learning new tasks. This paper addresses this issue through continual learning techniques that preserve previously learned knowledge while seamlessly integrating new tasks and domains. We propose Experience Replay Informative-Low Rank Adaptation, or ERI-LoRA, a hybrid continual learning method for natural language understanding in dialogue systems that effectively combines replay-based methods with parameter-efficient techniques. Our experiments on intent detection and slot-filling tasks demonstrate that ERI-LoRA significantly outperforms competitive baselines in continual learning. The results of our catastrophic forgetting experiments show that ERI-LoRA maintains robust memory stability in the model, confirming its effectiveness in mitigating these effects.
{"title":"Combining replay and LoRA for continual learning in natural language understanding","authors":"Zeinab Borhanifard,&nbsp;Heshaam Faili,&nbsp;Yadollah Yaghoobzadeh","doi":"10.1016/j.csl.2024.101737","DOIUrl":"10.1016/j.csl.2024.101737","url":null,"abstract":"<div><div>Large language models have significantly improved dialogue systems through enhanced capabilities in understanding queries and generating responses. Despite these enhancements, task-oriented dialogue systems- – which power many intelligent assistants – face challenges when adapting to new domains and applications. This challenge arises from a phenomenon known as catastrophic forgetting, where models forget previously acquired knowledge when learning new tasks. This paper addresses this issue through continual learning techniques to preserve previously learned knowledge while seamlessly integrating new tasks and domains. We propose <strong>E</strong>xperience <strong>R</strong>eplay <strong>I</strong>nformative-<strong>Lo</strong>w <strong>R</strong>ank <strong>A</strong>daptation or ERI-LoRA, a hybrid continual learning method for natural language understanding in dialogue systems that effectively combines the replay-based methods with parameter-efficient techniques. Our experiments on intent detection and slot-filling tasks demonstrate that ERI-LoRA significantly outperforms competitive baselines in continual learning. 
The results of our catastrophic forgetting experiments demonstrate that ERI-LoRA maintains robust memory stability in the model, demonstrating its effectiveness in mitigating these effects.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101737"},"PeriodicalIF":3.1,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142553128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
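The abstract names two ingredients, experience replay and low-rank adaptation (LoRA), and the sketch below illustrates each in isolation. The class names, the tiny shapes, and the `replay_ratio` parameter are assumptions of mine; the informative-example selection that the "I" in ERI-LoRA refers to, and the integration into a transformer, are not modeled. Only the plumbing the combination relies on is shown.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r).
    Only A and B would be trained; W stays frozen, which is what makes
    LoRA parameter-efficient. A toy sketch, not the authors' code."""
    def __init__(self, d_in, d_out, r=2, alpha=4):
        rng = random.Random(0)
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero-init: no drift at start
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

class ReplayBuffer:
    """Holds a bounded sample of past-task examples; mixed_batch blends
    them into each new-task batch to counter catastrophic forgetting."""
    def __init__(self, capacity=100):
        self.items, self.capacity = [], capacity

    def add(self, example):
        # Drops new items once full; real buffers often reservoir-sample.
        if len(self.items) < self.capacity:
            self.items.append(example)

    def mixed_batch(self, new_examples, replay_ratio=0.5):
        k = int(len(new_examples) * replay_ratio)
        replayed = random.sample(self.items, min(k, len(self.items)))
        return list(new_examples) + replayed

buf = ReplayBuffer(capacity=50)
for old_example in ["book a table", "find a hotel"]:   # task-1 utterances
    buf.add(old_example)
batch = buf.mixed_batch(["play some jazz", "set an alarm"])  # task-2 batch
print(len(batch))  # 2 new + 1 replayed -> 3
```

The zero initialization of `B` is the standard LoRA choice: at the start of a new task the adapted layer computes exactly the frozen `W @ x`, so adaptation begins from the previous model rather than from noise.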
Citations: 0
Optimizing pipeline task-oriented dialogue systems using post-processing networks
IF 3.1 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2024-10-19 DOI: 10.1016/j.csl.2024.101742
Atsumoto Ohashi, Ryuichiro Higashinaka
Many studies have proposed methods for optimizing the dialogue performance of an entire pipeline task-oriented dialogue system by jointly training its modules with reinforcement learning. However, these methods are limited in that they can only be applied to modules implemented with trainable neural-based methods. To solve this problem, we propose a method for optimizing the dialogue performance of a pipeline system whose modules may be implemented with arbitrary methods. With our method, neural-based components called post-processing networks (PPNs) are installed inside such a system to post-process the output of each module. All PPNs are updated with reinforcement learning to improve the overall dialogue performance of the system, without requiring each module to be differentiable. Through dialogue simulations and human evaluations on two well-studied task-oriented dialogue datasets, CamRest676 and MultiWOZ, we show that our method can improve the dialogue performance of pipeline systems consisting of various modules.
{"title":"Optimizing pipeline task-oriented dialogue systems using post-processing networks","authors":"Atsumoto Ohashi,&nbsp;Ryuichiro Higashinaka","doi":"10.1016/j.csl.2024.101742","DOIUrl":"10.1016/j.csl.2024.101742","url":null,"abstract":"<div><div>Many studies have proposed methods for optimizing the dialogue performance of an entire pipeline task-oriented dialogue system by jointly training modules in the system using reinforcement learning. However, these methods are limited in that they can only be applied to modules implemented using trainable neural-based methods. To solve this problem, we propose a method for optimizing the dialogue performance of a pipeline system that consists of modules implemented with arbitrary methods for dialogue. With our method, neural-based components called post-processing networks (PPNs) are installed inside such a system to post-process the output of each module. All PPNs are updated to improve the overall dialogue performance of the system by using reinforcement learning, not necessitating that each module be differentiable. Through dialogue simulations and human evaluations on two well-studied task-oriented dialogue datasets, CamRest676 and MultiWOZ, we show that our method can improve the dialogue performance of pipeline systems consisting of various modules. 
In addition, a comprehensive analysis of the results of the MultiWOZ experiments reveals the patterns of post-processing by PPNs that contribute to the overall dialogue performance of the system.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101742"},"PeriodicalIF":3.1,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
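The interface a PPN exposes can be illustrated with a minimal sketch. The binary slot-vector representation of a module's output, the per-slot keep/force-on/force-off actions, and the class name are assumptions made here for illustration; the paper's PPNs are neural networks whose policies are learned with reinforcement learning, which this table-lookup stand-in does not attempt to reproduce.

```python
class PostProcessingNetwork:
    """Toy stand-in for a PPN wrapped around one pipeline module.

    The module's output is flattened to a binary vector (one bit per
    dialogue-act slot, a representation assumed for this sketch); the
    PPN decides per slot whether to keep, assert, or suppress the bit."""

    KEEP, FORCE_ON, FORCE_OFF = 0, 1, 2

    def __init__(self, n_slots):
        # Start as the identity: pass every module decision through.
        self.policy = [self.KEEP] * n_slots

    def post_process(self, module_output):
        out = list(module_output)
        for i, action in enumerate(self.policy):
            if action == self.FORCE_ON:
                out[i] = 1
            elif action == self.FORCE_OFF:
                out[i] = 0
        return out

ppn = PostProcessingNetwork(4)
ppn.policy[2] = PostProcessingNetwork.FORCE_ON  # pretend RL learned this
print(ppn.post_process([1, 0, 0, 0]))  # -> [1, 0, 1, 0]
```

Because the PPN only reads and rewrites a module's output vector, the wrapped module itself can be a rule-based system, a lookup table, or a neural network; the reward signal flows into the PPN, not through the module, which is what removes the differentiability requirement.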
Citations: 0