Pub Date : 2024-03-06DOI: 10.1016/j.csl.2024.101636
Zhenyou Zhou, Zhibin Liu, Zhaoan Dong, Yuhan Liu
Task-oriented dialogue systems use deep reinforcement learning (DRL) to learn policies, and agent interaction with user models can help the agent enhance its generalization capacity. But user models frequently lack the language complexity of human interlocutors and contain generative errors, and their design biases can impair the agent’s ability to function well in certain situations. In this paper, we incorporate an evaluator based on inverse reinforcement learning into the model to determine the quality of the dialogue of user models in order to recruit high-quality user models for training. We can successfully regulate the quality of training trajectories while maintaining their diversity by constructing a sampling environment distribution to pick high-quality user models to participate in policy learning. The evaluation on the Multiwoz dataset demonstrates that it is capable of successfully improving the dialogue agents’ performance.
{"title":"Model discrepancy policy optimization for task-oriented dialogue","authors":"Zhenyou Zhou, Zhibin Liu, Zhaoan Dong, Yuhan Liu","doi":"10.1016/j.csl.2024.101636","DOIUrl":"10.1016/j.csl.2024.101636","url":null,"abstract":"<div><p>Task-oriented dialogue systems use deep reinforcement learning (DRL) to learn policies, and agent interaction with user models can help the agent enhance its generalization capacity. But user models frequently lack the language complexity of human interlocutors and contain generative errors, and their design biases can impair the agent’s ability to function well in certain situations. In this paper, we incorporate an evaluator based on inverse reinforcement learning into the model to determine the quality of the dialogue of user models in order to recruit high-quality user models for training. We can successfully regulate the quality of training trajectories while maintaining their diversity by constructing a sampling environment distribution to pick high-quality user models to participate in policy learning. The evaluation on the Multiwoz dataset demonstrates that it is capable of successfully improving the dialogue agents’ performance.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-02DOI: 10.1016/j.csl.2024.101635
Ramish Shahid, Aamir Wali, Maryam Bashir
Deep learning models are being used for natural language processing. Despite their success, these models have been employed for only a few languages. Pretrained models also exist but they are mostly available for the English language. Low resource languages like Urdu are not able to benefit from these pre-trained deep learning models and their effectiveness in Urdu language processing remains a question. This paper investigates the usefulness of deep learning models for the next word prediction and suggestion model for Urdu. For this purpose, this study considers and proposes two word prediction models for Urdu. Firstly, we propose to use LSTM for neural language modeling of Urdu. LSTMs are a popular approach for language modeling due to their ability to process sequential data. Secondly, we employ BERT which was specifically designed for natural language modeling. We train BERT from scratch using an Urdu corpus consisting of 1.1 million sentences thus paving the way for further studies in the Urdu language. We achieved an accuracy of 52.4% with LSTM and 73.7% with BERT. Our proposed BERT model outperformed two other pre-trained BERT models developed for Urdu. Since this is a multi-class problem and the number of classes is equal to the vocabulary size, this accuracy is still promising. Based on the present performance, BERT seems to be effective for the Urdu language, and this paper lays the groundwork for future studies.
{"title":"Next word prediction for Urdu language using deep learning models","authors":"Ramish Shahid, Aamir Wali, Maryam Bashir","doi":"10.1016/j.csl.2024.101635","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101635","url":null,"abstract":"<div><p>Deep learning models are being used for natural language processing. Despite their success, these models have been employed for only a few languages. Pretrained models also exist but they are mostly available for the English language. Low resource languages like Urdu are not able to benefit from these pre-trained deep learning models and their effectiveness in Urdu language processing remains a question. This paper investigates the usefulness of deep learning models for the next word prediction and suggestion model for Urdu. For this purpose, this study considers and proposes two word prediction models for Urdu. Firstly, we propose to use LSTM for neural language modeling of Urdu. LSTMs are a popular approach for language modeling due to their ability to process sequential data. Secondly, we employ BERT which was specifically designed for natural language modeling. We train BERT from scratch using an Urdu corpus consisting of 1.1 million sentences thus paving the way for further studies in the Urdu language. We achieved an accuracy of 52.4% with LSTM and 73.7% with BERT. Our proposed BERT model outperformed two other pre-trained BERT models developed for Urdu. Since this is a multi-class problem and the number of classes is equal to the vocabulary size, this accuracy is still promising. Based on the present performance, BERT seems to be effective for the Urdu language, and this paper lays the groundwork for future studies.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140016379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-20DOI: 10.1016/j.csl.2024.101627
Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam
In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.
在本研究中,我们引入了一个新的代码切换语音数据库,其中包含 57h 曼尼普尔语-英语注释自发语音。曼尼普尔语是印度的官方语言,主要在印度东北部的曼尼普尔邦使用。如今,大多数曼尼普尔本地人都会说两种语言,并且在日常讨论中经常使用语码转换。通过仔细评估每段视频中的代码转换语音量,我们收集了 YouTube 上的录音。数据库中共有 21,339 个语句和 291,731 个语码转换实例。考虑到数据的代码切换性质,我们使用了适当的注释程序,并分别使用曼尼普尔语和英语的迈特马耶克统一编码字体和罗马字母对数据进行了人工注释。转录包括说话人信息、非语音信息和相应的注释。本研究的目的是构建一个自动语音识别(ASR)系统,并对语音语料库进行全面分析和详细说明。我们相信,我们的研究是首个将 ASR 系统用于曼尼普尔语-英语代码转换语音的研究。为了评估性能,我们为曼尼普尔语-英语开发了基于混合深度神经网络和隐马尔可夫模型(DNN-HMM)、时延神经网络(TDNN)、混合时延神经网络和长短期记忆(TDNN-LSTM)以及三种端到端(E2E)模型(即混合连接主义时间分类和注意力模型(CTC-Attention)、Conformer、wav2vec XLSR)的 ASR 系统。与其他模型相比,纯 TDNN 产生的结果明显更优。
{"title":"MECOS: A bilingual Manipuri–English spontaneous code-switching speech corpus for automatic speech recognition","authors":"Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam","doi":"10.1016/j.csl.2024.101627","DOIUrl":"10.1016/j.csl.2024.101627","url":null,"abstract":"<div><p>In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139925331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-09DOI: 10.1016/j.csl.2024.101623
Sadaf Abdul Rauf , François Yvon
Machine Translation (MT) technologies have improved in many ways and generate usable outputs for a growing number of domains and language pairs. Yet, most sentence based MT systems struggle with contextual dependencies, processing small chunks of texts, typically sentences, in isolation from their textual context. This is likely to cause systematic errors or inconsistencies when processing long documents. While various attempts are made to handle extended contexts in translation, the relevance of these contextual cues, especially those related to the structural organization, and the extent to which they affect translation quality remains an under explored area. In this work, we explore ways to take these structural aspects into account, by integrating document structure as an extra conditioning context. Our experiments on biomedical abstracts, which are usually structured in a rigid way, suggest that this type of structural information can be useful for MT and document structure prediction. We also present in detail the impact of structural information on MT output and assess the degree to which structural information can be learned from the data.
{"title":"Translating scientific abstracts in the bio-medical domain with structure-aware models","authors":"Sadaf Abdul Rauf , François Yvon","doi":"10.1016/j.csl.2024.101623","DOIUrl":"10.1016/j.csl.2024.101623","url":null,"abstract":"<div><p>Machine Translation (MT) technologies have improved in many ways and generate usable outputs for a growing number of domains and language pairs. Yet, most sentence based MT systems struggle with contextual dependencies, processing small chunks of texts, typically sentences, in isolation from their textual context. This is likely to cause systematic errors or inconsistencies when processing long documents. While various attempts are made to handle extended contexts in translation, the relevance of these contextual cues, especially those related to the structural organization, and the extent to which they affect translation quality remains an under explored area. In this work, we explore ways to take these structural aspects into account, by integrating document structure as an extra conditioning context. Our experiments on biomedical abstracts, which are usually structured in a rigid way, suggest that this type of structural information can be useful for MT and document structure prediction. We also present in detail the impact of structural information on MT output and assess the degree to which structural information can be learned from the data.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139884004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-07DOI: 10.1016/j.csl.2024.101625
Lisu Yu , Fei Li , Lixin Yu , Wei Li , Zhicheng Dong , Donghong Cai , Zhen Wang
Rumors can easily propagate through social media, posing potential threats to both individual and public health. Most existing approaches focus on single-language rumor detection, which leads to unsatisfying performance when these are applied to mixed-language rumor detection. Meanwhile, the type of mixed-language (mixture of word-level or sentence-level) is a great challenge for mixed-language rumor detection. In this paper, focusing on a mixed scene of Chinese and Tibetan, the research first provides a Chinese–Tibetan mixed-language rumor detection dataset (Weibo_Ch_Ti) that comprises 1,617 non-rumor tweets and 1,456 rumor tweets in two mixed-language types. Then, the research proposes an effective model with multi-extractors, named “MER-CTRD” for short. This model mainly consists of three extractors. The Multi-task Extractor helps the model to extract feature representations of different mixed-language types adaptively. The Rich-semantic Extractor enriches the semantic features representations of Tibetan in the Chinese–Tibetan-mixed language. The Fusion-feature Extractor fuses the mean and disparity semantic features of Chinese and Tibetan to complement feature representations of the mixed language. Finally, the research conducts experiments on Weibo_Ch_Ti. The results show that the proposed model improves accuracy by about 3%–12% over the baseline models, indicating its effectiveness in the Chinese–Tibetan mixed-language rumor detection scenario.
{"title":"A novel Chinese–Tibetan mixed-language rumor detector with multi-extractor representations","authors":"Lisu Yu , Fei Li , Lixin Yu , Wei Li , Zhicheng Dong , Donghong Cai , Zhen Wang","doi":"10.1016/j.csl.2024.101625","DOIUrl":"10.1016/j.csl.2024.101625","url":null,"abstract":"<div><p>Rumors can easily propagate through social media, posing potential threats to both individual and public health. Most existing approaches focus on single-language rumor detection, which leads to unsatisfying performance when these are applied to mixed-language rumor detection. Meanwhile, the type of mixed-language (mixture of word-level or sentence-level) is a great challenge for mixed-language rumor detection. In this paper, focusing on a mixed scene of Chinese and Tibetan, the research first provides a Chinese–Tibetan mixed-language rumor detection dataset (Weibo_Ch_Ti) that comprises 1,617 non-rumor tweets and 1,456 rumor tweets in two mixed-language types. Then, the research proposes an effective model with multi-extractors, named “MER-CTRD” for short. This model mainly consists of three extractors. The Multi-task Extractor helps the model to extract feature representations of different mixed-language types adaptively. The Rich-semantic Extractor enriches the semantic features representations of Tibetan in the Chinese–Tibetan-mixed language. The Fusion-feature Extractor fuses the mean and disparity semantic features of Chinese and Tibetan to complement feature representations of the mixed language. Finally, the research conducts experiments on Weibo_Ch_Ti. The results show that the proposed model improves accuracy by about 3%–12% over the baseline models, indicating its effectiveness in the Chinese–Tibetan mixed-language rumor detection scenario.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139829040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-07DOI: 10.1016/j.csl.2024.101626
Sania Gul , Muhammad Salman Khan , Muhammad Fazeel
Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it over colored spectrograms of speech to denoise them. After denoising, the colors of spectrograms are translated to magnitudes of short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are later combined with the noisy phases to obtain an enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1 % in the short-term objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to the gain achieved by the baseline methods used for comparison with the proposed model, but at a much reduced computational cost. The proposed solution offers a comparative PESQ score at almost 10 times reduced computational cost than a similar baseline model that has generated the highest PESQ score trained on grayscaled spectrograms, while it provides only a 1 % deficit in STOI at 28 times reduced computational cost when compared to another baseline system based on convolutional neural network-GAN (CNN-GAN) that produces the most intelligible speech.
{"title":"Single-channel speech enhancement using colored spectrograms","authors":"Sania Gul , Muhammad Salman Khan , Muhammad Fazeel","doi":"10.1016/j.csl.2024.101626","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101626","url":null,"abstract":"<div><p>Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it over colored spectrograms of speech to denoise them. After denoising, the colors of spectrograms are translated to magnitudes of short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are later combined with the noisy phases to obtain an enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1 % in the short-term objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to the gain achieved by the baseline methods used for comparison with the proposed model, but at a much reduced computational cost. The proposed solution offers a comparative PESQ score at almost 10 times reduced computational cost than a similar baseline model that has generated the highest PESQ score trained on grayscaled spectrograms, while it provides only a 1 % deficit in STOI at 28 times reduced computational cost when compared to another baseline system based on convolutional neural network-GAN (CNN-GAN) that produces the most intelligible speech.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139749127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-05DOI: 10.1016/j.csl.2024.101624
Bowen Jiang , Qianhui Dong , Guojin Liu
Phonemic annotation is aimed at annotating a speech fragment with phonemic symbols. As the phonetic features of a speech fragment vary greatly among different languages including their dialects, it is a significant way to describe and write down the phonetic system of a language utilizing phonemic symbols. It is meaningful to develop an automatic and effective method for this task. In this paper, we first establish a Chinese dataset where each datum consists of an original speech signal and the corresponding phonemic characters which are annotated manually. Furthermore, we propose a deep learning model to realize automatic phonemic annotation for speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism is applied to communicate the encoder and decoder modules to prevent any loss of features adaptively. Meanwhile, a feature disentangling structure based on a generative adversarial network (GAN) is adopted to attenuate the interference towards the phonemic annotation task caused by unrelated tone features in the original speech signal and further improve the phonemic annotation performance. Extensive experimental results have verified the superiority of our model and proposed strategies over the utilized dataset.
{"title":"A method of phonemic annotation for Chinese dialects based on a deep learning model with adaptive temporal attention and a feature disentangling structure","authors":"Bowen Jiang , Qianhui Dong , Guojin Liu","doi":"10.1016/j.csl.2024.101624","DOIUrl":"10.1016/j.csl.2024.101624","url":null,"abstract":"<div><p>Phonemic annotation is aimed at annotating a speech fragment with phonemic symbols. As the phonetic features of a speech fragment vary greatly among different languages including their dialects, it is a significant way to describe and write down the phonetic system of a language utilizing phonemic symbols. It is meaningful to develop an automatic and effective method for this task. In this paper, we first establish a Chinese dataset where each datum consists of an original speech signal and the corresponding phonemic characters which are annotated manually. Furthermore, we propose a deep learning model to realize automatic phonemic annotation for speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism is applied to communicate the encoder and decoder modules to prevent any loss of features adaptively. Meanwhile, a feature disentangling structure based on a generative adversarial network (GAN) is adopted to attenuate the interference towards the phonemic annotation task caused by unrelated tone features in the original speech signal and further improve the phonemic annotation performance. Extensive experimental results have verified the superiority of our model and proposed strategies over the utilized dataset.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139689402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-03DOI: 10.1016/j.csl.2024.101622
Titouan Parcollet , Ha Nguyen , Solène Evain , Marcely Zanon Boito , Adrien Pupier , Salima Mdhaffar , Hang Le , Sina Alisamir , Natalia Tomashenko , Marco Dinarelli , Shucong Zhang , Alexandre Allauzen , Maximin Coavoux , Yannick Estève , Mickael Rouvier , Jerôme Goulian , Benjamin Lecouteux , François Portet , Solange Rossato , Fabien Ringeval , Laurent Besacier
Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0 an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 h of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark but also required up to four times more energy for pre-training.
{"title":"LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech","authors":"Titouan Parcollet , Ha Nguyen , Solène Evain , Marcely Zanon Boito , Adrien Pupier , Salima Mdhaffar , Hang Le , Sina Alisamir , Natalia Tomashenko , Marco Dinarelli , Shucong Zhang , Alexandre Allauzen , Maximin Coavoux , Yannick Estève , Mickael Rouvier , Jerôme Goulian , Benjamin Lecouteux , François Portet , Solange Rossato , Fabien Ringeval , Laurent Besacier","doi":"10.1016/j.csl.2024.101622","DOIUrl":"10.1016/j.csl.2024.101622","url":null,"abstract":"<div><p>Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces <em>LeBenchmark 2.0</em> an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 h of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. <em>LeBenchmark 2.0</em> also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous <em>LeBenchmark</em> SSL models across the benchmark but also required up to four times more energy for pre-training.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139679924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-01DOI: 10.1016/j.csl.2024.101620
Yi Zhu, Tiago H. Falk
Speech COVID-19 detection systems have gained popularity as they represent an easy-to-use and low-cost solution that is well suited for at-home long-term monitoring of patients with persistent symptoms. Recently, however, the limited generalization capability of existing deep neural network based systems to unseen datasets has been raised as a serious concern, as has their limited interpretability. In this study, we aim to develop an interpretable and generalizable speech-based COVID-19 detection system. First, we propose the use of a 3-dimensional modulation frequency tensor (called modulation tensorgram representation, MTR) as input to a convolutional recurrent neural network for COVID-19 detection. The MTR representation is known to capture long-term dynamics of speech correlated with articulation and respiration, hence being a potential candidate for characterizing COVID-19 speech. The customized network explores both the spectral and temporal pattern from MTR to learn the underlying COVID-19 speech pattern. Next, we design a spectro-temporal saliency masking to aggregate regions of the MTR related to COVID-19, thus helping further improve the generalizability and interpretability of the model. Experiments are conducted on three public datasets and results show the proposed solution consistently outperforming two benchmark systems in within-, across-, and unseen-dataset tests. The learned salient regions have been shown correlated with whispered speech and vocal hoarseness, which explains the increased generalizability. Furthermore, our model relies on a small amount of parameters, thus offering a promising solution for on-device remote monitoring of COVID-19 infection.
{"title":"Spectral–temporal saliency masks and modulation tensorgrams for generalizable COVID-19 detection","authors":"Yi Zhu, Tiago H. Falk","doi":"10.1016/j.csl.2024.101620","DOIUrl":"10.1016/j.csl.2024.101620","url":null,"abstract":"<div><p>Speech COVID-19 detection systems have gained popularity as they represent an easy-to-use and low-cost solution that is well suited for at-home long-term monitoring of patients with persistent symptoms. Recently, however, the limited generalization capability of existing deep neural network based systems to unseen datasets has been raised as a serious concern, as has their limited interpretability. In this study, we aim to develop an interpretable and generalizable speech-based COVID-19 detection system. First, we propose the use of a 3-dimensional modulation frequency tensor (called modulation tensorgram representation, MTR) as input to a convolutional recurrent neural network for COVID-19 detection. The MTR representation is known to capture long-term dynamics of speech correlated with articulation and respiration, hence being a potential candidate for characterizing COVID-19 speech. The customized network explores both the spectral and temporal pattern from MTR to learn the underlying COVID-19 speech pattern. Next, we design a spectro-temporal saliency masking to aggregate regions of the MTR related to COVID-19, thus helping further improve the generalizability and interpretability of the model. Experiments are conducted on three public datasets and results show the proposed solution consistently outperforming two benchmark systems in within-, across-, and unseen-dataset tests. The learned salient regions have been shown correlated with whispered speech and vocal hoarseness, which explains the increased generalizability. Furthermore, our model relies on a small amount of parameters, thus offering a promising solution for on-device remote monitoring of COVID-19 infection.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000032/pdfft?md5=e39e0b3ee7ea45c5b9c50622ff48dbd4&pid=1-s2.0-S0885230824000032-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139664299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-01-24DOI: 10.1016/j.csl.2024.101621
B.M. Mala, Smita Sandeep Darandale
In the present scenario, the recognition of particular emotions or needs from an infant's cry is a difficult process in the field of pattern recognition as it does not have any verbal information. In this article, an automated model is introduced for an effective recognition of infant cries. At first, the infant cry signals are collected from the Baby Chillanto (BC) dataset and the Donate a Cry Corpus (DCC) dataset. These acquired signals are converted into feature vectors by employing nine techniques namely, Zero Crossing Rate (ZCR), acoustic features, audio features, amplitude, energy, Root Mean Square (RMS), statistical moments, autocorrelation, and Mel-Frequency Cepstral Coefficients (MFCCs). These obtained feature vectors are multi-dimensional; therefore, a Simulated Annealing Algorithm (SAA) is employed to select informative feature vectors. The selected informative feature vectors are passed to the leaky Bi-directional Long Short Term Memory (Bi-LSTM) model for classifying the types of infant cries. Specifically, in the leaky Bi-LSTM model, the conventional activation functions (Tangent (Tanh) and sigmoid) are replaced with the leaky Rectified Linear Unit (leaky ReLU) activation function. This process significantly mitigates the vanishing gradient problem and improves convergence during data training, which is vital for signal classification tasks. Furthermore, an Improved Artificial Rabbit's Optimization (IARO) algorithm is proposed to choose optimal hyper-parameters in the leaky Bi-LSTM model, where this mechanism reduces the complexity and training time of the classification model. In the IARO algorithm, selective opposition and Lévy flight strategies are integrated with the conventional ARO algorithm to enhance the dynamics and diversity of the population, along with the model's tracking efficiency. The empirical investigation denotes that the proposed IARO based leaky Bi-LSTM model achieves 99.66 % and 95.92 % of classification accuracy on the BC and DCC datasets, respectively. The proposed IARO based leaky Bi-LSTM model achieves maximum classification results when related to the conventional recognition models.
目前,从婴儿的哭声中识别特定情绪或需求是模式识别领域的一个难题,因为它没有任何语言信息。本文介绍了一种有效识别婴儿哭声的自动化模型。首先,从婴儿 Chillanto(BC)数据集和 Donate a Cry Corpus(DCC)数据集中收集婴儿哭声信号。这些采集到的信号通过九种技术转换成特征向量,即零交叉率(ZCR)、声学特征、音频特征、振幅、能量、均方根(RMS)、统计矩、自相关性和梅尔-频率倒频谱系数(MFCC)。这些获得的特征向量是多维的,因此采用了模拟退火算法(SAA)来选择信息量大的特征向量。筛选出的信息特征向量将传递给泄漏双向长短期记忆(Bi-LSTM)模型,用于对婴儿哭声类型进行分类。具体来说,在泄漏双向长时短记忆模型中,传统的激活函数(正切(Tanh)和sigmoid)被泄漏整流线性单元(leaky ReLU)激活函数所取代。这一过程大大缓解了梯度消失问题,提高了数据训练过程中的收敛性,这对信号分类任务至关重要。此外,还提出了一种改进的人工兔优化(IARO)算法,用于在泄漏 Bi-LSTM 模型中选择最佳超参数,这种机制降低了分类模型的复杂性并缩短了训练时间。在 IARO 算法中,选择性对抗和 Lévy 飞行策略与传统的 ARO 算法相结合,增强了种群的动态性和多样性,同时也提高了模型的跟踪效率。实证研究表明,所提出的基于 IARO 的泄漏 Bi-LSTM 模型在 BC 和 DCC 数据集上的分类准确率分别达到了 99.66% 和 95.92%。与传统识别模型相比,所提出的基于 IARO 的泄漏 Bi-LSTM 模型取得了最大的分类结果。
{"title":"Effective infant cry signal analysis and reasoning using IARO based leaky Bi-LSTM model","authors":"B.M. Mala, Smita Sandeep Darandale","doi":"10.1016/j.csl.2024.101621","DOIUrl":"10.1016/j.csl.2024.101621","url":null,"abstract":"<div><p>In the present scenario, the recognition of particular emotions or needs from an infant's cry is a difficult process in the field of pattern recognition as it does not have any verbal information. In this article, an automated model is introduced for an effective recognition of infant cries. At first, the infant cry signals are collected from the Baby Chillanto (BC) dataset and the Donate a Cry Corpus (DCC) dataset. These acquired signals are converted into feature vectors by employing nine techniques namely, Zero Crossing Rate (ZCR), acoustic features, audio features, amplitude, energy, Root Mean Square (RMS), statistical moments, autocorrelation, and Mel-Frequency Cepstral Coefficients (MFCCs). These obtained feature vectors are multi-dimensional; therefore, a Simulated Annealing Algorithm (SAA) is employed to select informative feature vectors. The selected informative feature vectors are passed to the leaky Bi-directional Long Short Term Memory (Bi-LSTM) model for classifying the types of infant cries. Specifically, in the leaky Bi-LSTM model, the conventional activation functions (Tangent (Tanh) and sigmoid) are replaced with the leaky Rectified Linear Unit (leaky ReLU) activation function. This process significantly mitigates the vanishing gradient problem and improves convergence during data training, which is vital for signal classification tasks. Furthermore, an Improved Artificial Rabbit's Optimization (IARO) algorithm is proposed to choose optimal hyper-parameters in the leaky Bi-LSTM model, where this mechanism reduces the complexity and training time of the classification model. In the IARO algorithm, selective opposition and Lévy flight strategies are integrated with the conventional ARO algorithm to enhance the dynamics and diversity of the population, along with the model's tracking efficiency. The empirical investigation denotes that the proposed IARO based leaky Bi-LSTM model achieves 99.66 % and 95.92 % of classification accuracy on the BC and DCC datasets, respectively. The proposed IARO based leaky Bi-LSTM model achieves maximum classification results when related to the conventional recognition models.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139556448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}