Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain
Pub Date: 2024-05-10 DOI: 10.1007/s10579-024-09738-8
Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov
Given the increase in the production of data for the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one such IE task, useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, a new augmented CRAFT dataset is constructed by replacing 20% of the entities in the original dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best-performing NER system achieved an F1 score of 86.39% on the development set. The novel methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications across under-resourced languages.
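The 20% entity-replacement step lends itself to a brief illustration. The following is a minimal sketch, not the authors' code: it assumes a toy in-memory corpus of sentences annotated with (mention, type) pairs and swaps a fraction of mentions with other mentions of the same type, which is the general idea behind the augmented CRAFT dataset described above.

```python
import random
from collections import defaultdict

# Toy annotated corpus: plain text plus (mention, type) pairs.
# This schema is an assumption for illustration, not the CRAFT format itself.
corpus = [
    {"text": "BRCA1 regulates DNA repair.", "entities": [("BRCA1", "GENE"), ("DNA repair", "PROCESS")]},
    {"text": "TP53 mutations impair apoptosis.", "entities": [("TP53", "GENE"), ("apoptosis", "PROCESS")]},
]

def augment_by_entity_replacement(corpus, ratio=0.2, seed=13):
    """Return a copy of the corpus with roughly `ratio` of entity mentions
    replaced by other mentions of the same type drawn from the whole corpus."""
    rng = random.Random(seed)
    pool = defaultdict(set)                      # surface forms available per entity type
    for sent in corpus:
        for mention, etype in sent["entities"]:
            pool[etype].add(mention)

    augmented = []
    for sent in corpus:
        text, new_entities = sent["text"], []
        for mention, etype in sent["entities"]:
            candidates = [m for m in pool[etype] if m != mention]
            if candidates and rng.random() < ratio:
                replacement = rng.choice(candidates)
                text = text.replace(mention, replacement, 1)
                new_entities.append((replacement, etype))
            else:
                new_entities.append((mention, etype))
        augmented.append({"text": text, "entities": new_entities})
    return augmented

print(augment_by_entity_replacement(corpus))
```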
{"title":"Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain","authors":"Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov","doi":"10.1007/s10579-024-09738-8","DOIUrl":"https://doi.org/10.1007/s10579-024-09738-8","url":null,"abstract":"<p>Given the increase in production of data for the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one of such IE tasks useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges including its high cost of annotation, ambiguity, and lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages the progress in Neural Machine Translation (NMT) to create a synthetic version of the Colorado Richly Annotated Full-Text (CRAFT) dataset in Spanish. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset generating a new augmented dataset. We evaluate two training methods: concatenation of datasets and continuous training to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best performing NER system in the development set achieved an F-1 score of 86.39%. The novel methodology proposed in this paper presents the first bilingual NER system and it has the potential to improve applications across under-resourced languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"47 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140942053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Features in extractive supervised single-document summarization: case of Persian news
Pub Date: 2024-05-08 DOI: 10.1007/s10579-024-09739-7
Hosein Rezaei, Seyed Amid Moeinzadeh Mirhosseini, Azar Shahgholian, Mohamad Saraee
Text summarization has been one of the most challenging areas of research in NLP. Much effort has been made to overcome this challenge by using either abstractive or extractive methods. Extractive methods are preferable due to their simplicity compared with the more elaborate abstractive methods. In extractive supervised single-document approaches, the system does not generate sentences. Instead, via supervised learning, it learns how to score sentences within the document based on some textual features and subsequently selects those with the highest rank. Therefore, the core objective is ranking, which depends heavily on the document structure and context. These dependencies have gone unnoticed in many state-of-the-art solutions. In this work, document-related features such as topic and relative length are integrated into the vector of every sentence to enhance the quality of summaries. Our experimental results show that the system takes contextual and structural patterns into account, which increases the precision of the learned model. Consequently, our method produces more comprehensive and concise summaries.
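As a rough sketch of the ranking setup described above (not the authors' implementation, and with feature choices that are assumptions), the snippet below appends two document-level signals, relative sentence length and similarity to the document centroid as a crude topic feature, to each sentence's TF-IDF vector and trains a regressor to score sentences for extraction.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def sentence_features(doc_sentences, vectorizer):
    """Per-sentence TF-IDF vectors augmented with document-level features."""
    X = vectorizer.transform(doc_sentences).toarray()
    doc_vec = X.mean(axis=0)                      # crude "topic" signal: document centroid
    lengths = np.array([len(s.split()) for s in doc_sentences], dtype=float)
    rel_len = lengths / lengths.sum()             # relative sentence length within the document
    topic_sim = X @ doc_vec                       # similarity of each sentence to the document topic
    return np.hstack([X, rel_len[:, None], topic_sim[:, None]])

# Toy training data: sentences with gold importance scores (e.g., overlap with a reference summary).
docs = [["Parliament passed the budget today.", "The vote was close.", "Weather was mild."]]
gold = [np.array([1.0, 0.6, 0.0])]

vectorizer = TfidfVectorizer().fit([s for d in docs for s in d])
X_train = np.vstack([sentence_features(d, vectorizer) for d in docs])
y_train = np.concatenate(gold)

ranker = Ridge().fit(X_train, y_train)

def summarize(doc_sentences, k=2):
    scores = ranker.predict(sentence_features(doc_sentences, vectorizer))
    top = sorted(np.argsort(scores)[::-1][:k])    # keep top-k sentences in document order
    return [doc_sentences[i] for i in top]

print(summarize(docs[0], k=1))
```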
{"title":"Features in extractive supervised single-document summarization: case of Persian news","authors":"Hosein Rezaei, Seyed Amid Moeinzadeh Mirhosseini, Azar Shahgholian, Mohamad Saraee","doi":"10.1007/s10579-024-09739-7","DOIUrl":"https://doi.org/10.1007/s10579-024-09739-7","url":null,"abstract":"<p>Text summarization has been one of the most challenging areas of research in NLP. Much effort has been made to overcome this challenge by using either abstractive or extractive methods. Extractive methods are preferable due to their simplicity compared with the more elaborate abstractive methods. In extractive supervised single-document approaches, the system will not generate sentences. Instead, via supervised learning, it learns how to score sentences within the document based on some textual features and subsequently selects those with the highest rank. Therefore, the core objective is ranking, which enormously depends on the document structure and context. These dependencies have been unnoticed by many state-of-the-art solutions. In this work, document-related features such as topic and relative length are integrated into the vectors of every sentence to enhance the quality of summaries. Our experiment results show that the system takes contextual and structural patterns into account, which will increase the precision of the learned model. Consequently, our method will produce more comprehensive and concise summaries.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"130 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140930376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mismatching-aware unsupervised translation quality estimation for low-resource languages
Pub Date: 2024-05-05 DOI: 10.1007/s10579-024-09727-x
Fatemeh Azadi, Heshaam Faili, Mohammad Javad Dousti
Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in the practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method; nevertheless, it faces two issues: first, untranslated tokens lead to unexpectedly high translation scores, and second, mismatching errors arise between source and hypothesis tokens when applying greedy matching in XLMRScore. To mitigate these issues, we suggest, respectively, replacing untranslated words with the unknown token and cross-lingually aligning the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English→Persian (En-Fa) test dataset introduced in this paper. Experiments show that our method obtains results comparable to the supervised baseline in two zero-shot scenarios, i.e., with less than a 0.01 difference in Pearson correlation, while outperforming unsupervised rivals by more than 8% on average across all the low-resource language pairs.
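A minimal sketch of the BERTScore-style greedy matching that a metric like XLMRScore builds on is given below. It is not the authors' released implementation; the checkpoint name and the averaging of directional scores are assumptions, and the two mitigations proposed in the paper (unknown-token replacement and cross-lingual alignment) are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "xlm-roberta-base"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(sentence):
    """Contextual embeddings for each real token (special tokens dropped)."""
    enc = tokenizer(sentence, return_tensors="pt", return_special_tokens_mask=True)
    special = enc.pop("special_tokens_mask")[0].bool()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    return hidden[~special]

def xlmr_score(source, hypothesis):
    """Greedy matching of hypothesis and source tokens by cosine similarity."""
    src = torch.nn.functional.normalize(embed(source), dim=-1)
    hyp = torch.nn.functional.normalize(embed(hypothesis), dim=-1)
    sim = hyp @ src.T                                # (|hyp|, |src|) cosine similarities
    precision = sim.max(dim=1).values.mean()         # each hypothesis token -> closest source token
    recall = sim.max(dim=0).values.mean()            # each source token -> closest hypothesis token
    return (2 * precision * recall / (precision + recall)).item()

print(xlmr_score("The cat sat on the mat.", "Die Katze saß auf der Matte."))
```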
{"title":"Mismatching-aware unsupervised translation quality estimation for low-resource languages","authors":"Fatemeh Azadi, Heshaam Faili, Mohammad Javad Dousti","doi":"10.1007/s10579-024-09727-x","DOIUrl":"https://doi.org/10.1007/s10579-024-09727-x","url":null,"abstract":"<p>Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in the practical applications of MT. In this paper, we first propose XLMRScore, which is a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, nevertheless facing two issues: firstly, the untranslated tokens leading to unexpectedly high translation scores, and secondly, the issue of mismatching errors between source and hypothesis tokens when applying the greedy matching in XLMRScore. To mitigate these issues, we suggest replacing untranslated words with the unknown token and the cross-lingual alignment of the pre-trained model to represent aligned words closer to each other, respectively. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English<span>(rightarrow)</span>Persian (En-Fa) test dataset introduced in this paper. Experiments show that our method could get comparable results with the supervised baseline for two zero-shot scenarios, i.e., with less than 0.01 difference in Pearson correlation, while outperforming unsupervised rivals in all the low-resource language pairs for above 8%, on average.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"128 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140930096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing
Pub Date: 2024-04-27 DOI: 10.1007/s10579-024-09741-z
Abubakr H. Ombabi, Wael Ouarda, Adel M. Alimi
With the enormous growth of social data in recent years, sentiment analysis has gained increasing research attention and has been widely explored in various languages. The nature of the Arabic language imposes several challenges, such as its complicated morphological structure and limited resources; thereby, the current state-of-the-art methods for sentiment analysis remain to be enhanced. This inspired us to explore the application of emerging deep-learning architectures to Arabic text classification. In this paper, we present an ensemble model which integrates a convolutional neural network, bidirectional long short-term memory (Bi-LSTM), and an attention mechanism to predict the sentiment orientation of Arabic sentences. The convolutional layer is used for feature extraction from the higher-level sentence representation layer, and the Bi-LSTM is integrated to further capture contextual information from the produced set of features. Two attention mechanism units are incorporated to highlight the critical information in the contextual feature vectors produced by the Bi-LSTM hidden layers. The context-related vectors generated by the attention mechanism layers are then concatenated and passed into a classifier to predict the final label. To disentangle the influence of these components, the proposed model is validated as three variant architectures on a multi-domain corpus, as well as four benchmarks. Experimental results show that incorporating the Bi-LSTM and the attention mechanism improves the model's performance, yielding 96.08% accuracy. Consequently, this architecture consistently outperforms the other state-of-the-art approaches, with up to +14.47%, +20.38%, and +18.45% improvements in accuracy, precision, and recall, respectively. These results demonstrate the strengths of this model in addressing the challenges of text classification tasks.
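A minimal PyTorch sketch of a CNN → Bi-LSTM → attention classifier of the kind described above is shown below. Layer sizes are illustrative assumptions, and a single attention unit is used for brevity where the paper describes two.

```python
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    """Rough sketch of a CNN -> Bi-LSTM -> attention sentiment classifier;
    all layer sizes and names are illustrative assumptions."""

    def __init__(self, vocab_size, emb_dim=128, conv_channels=64, hidden=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_channels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # attention scores over time steps
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                 # (B, T, E)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (B, C, T) local n-gram features
        h, _ = self.bilstm(x.transpose(1, 2))         # (B, T, 2H) contextual states
        weights = torch.softmax(self.attn(h), dim=1)  # (B, T, 1) attention weights
        context = (weights * h).sum(dim=1)            # (B, 2H) attended sentence vector
        return self.classifier(context)

model = CnnBiLstmAttention(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))       # batch of 4 dummy sentences, 20 tokens each
print(logits.shape)                                   # torch.Size([4, 2])
```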
{"title":"Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing","authors":"Abubakr H. Ombabi, Wael Ouarda, Adel M. Alimi","doi":"10.1007/s10579-024-09741-z","DOIUrl":"https://doi.org/10.1007/s10579-024-09741-z","url":null,"abstract":"<p>With the enormous growth of social data in recent years, sentiment analysis has gained increasing research attention and has been widely explored in various languages. Arabic language nature imposes several challenges, such as the complicated morphological structure and the limited resources, Thereby, the current state-of-the-art methods for sentiment analysis remain to be enhanced. This inspired us to explore the application of the emerging deep-learning architecture to Arabic text classification. In this paper, we present an ensemble model which integrates a convolutional neural network, bidirectional long short-term memory (Bi-LSTM), and attention mechanism, to predict the sentiment orientation of Arabic sentences. The convolutional layer is used for feature extraction from the higher-level sentence representations layer, the BiLSTM is integrated to further capture the contextual information from the produced set of features. Two attention mechanism units are incorporated to highlight the critical information from the contextual feature vectors produced by the Bi-LSTM hidden layers. The context-related vectors generated by the attention mechanism layers are then concatenated and passed into a classifier to predict the final label. To disentangle the influence of these components, the proposed model is validated as three variant architectures on a multi-domains corpus, as well as four benchmarks. Experimental results show that incorporating Bi-LSTM and attention mechanism improves the model’s performance while yielding 96.08% in accuracy. Consequently, this architecture consistently outperforms the other State-of-The-Art approaches with up to + 14.47%, + 20.38%, and + 18.45% improvements in accuracy, precision, and recall respectively. These results demonstrated the strengths of this model in addressing the challenges of text classification tasks.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"8 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140809633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text
Pub Date: 2024-04-16 DOI: 10.1007/s10579-024-09732-0
Shankar Biradar, Sunil Saumya, Arun Chauhan
Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by fake content. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD), comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate the relation between the presence of fake narratives and the intensity of hate, this study presents a statistical analysis using the chi-square test. The statistical findings indicate that the calculated χ² value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study presents baseline methods for categorizing the multi-class and multi-label data set, utilizing syntactic and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText- and SVM-based method outperforms other models, with accuracies of 71% and 58% for binary fake-hate and severity prediction, respectively.
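The independence test described above can be illustrated with a small chi-square example. The contingency table below is hypothetical, not FHMLD counts; it only shows the mechanics of testing whether fakeness and hate severity are associated.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table, for illustration only (not FHMLD counts):
# rows = comment carries a fake narrative (yes / no),
# columns = annotated hate severity (low / medium / high).
table = np.array([
    [220, 410, 530],   # fake
    [480, 390, 260],   # not fake
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
# A p-value below the chosen significance level (e.g., 0.05) would lead to
# rejecting the null hypothesis of independence between fakeness and hate severity.
```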
{"title":"Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text","authors":"Shankar Biradar, Sunil Saumya, Arun Chauhan","doi":"10.1007/s10579-024-09732-0","DOIUrl":"https://doi.org/10.1007/s10579-024-09732-0","url":null,"abstract":"<p>Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by the fake. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD) comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate a relation between the presence of fake narratives and its impact on the intensity of the hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated <span>(chi ^2)</span> value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study present baseline methods for categorizing multi-class and multi-label data set, utilizing syntactical and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms others models with an accuracy of 71% and 58% for binary fake–hate and severity prediction respectively.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"38 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140610404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depression symptoms modelling from social media text: an LLM driven semi-supervised learning approach
Pub Date: 2024-04-04 DOI: 10.1007/s10579-024-09720-4
Nawshad Farruque, Randy Goebel, Sudhakar Sivapalan, Osmar R. Zaïane
A fundamental component of user-level, social media language-based clinical depression modelling is depression symptoms detection (DSD). Unfortunately, no DSD dataset exists that reflects both the clinical insights and the distribution of depression symptoms found in samples from the self-disclosed depressed population. In our work, we describe a semi-supervised learning (SSL) framework which uses an initial supervised learning model that leverages (1) a state-of-the-art language model pre-trained on a large mental health forum text corpus and further fine-tuned on a clinician-annotated DSD dataset, and (2) a zero-shot learning model for DSD, and couples them together to harvest depression-symptom-related samples from our large self-curated depressive tweets repository (DTR). Our clinician-annotated dataset is the largest of its kind. Furthermore, DTR is created from samples of tweets in the Twitter timelines of self-disclosed depressed users from two datasets, including one of the largest benchmark datasets for user-level depression detection from Twitter. This further helps preserve the depression symptom distribution of self-disclosed tweets. Subsequently, we iteratively retrain our initial DSD model with the harvested data. We discuss the stopping criteria and limitations of this SSL process, and elaborate on the underlying constructs which play a vital role in the overall SSL process. We show that we can produce a final dataset which is the largest of its kind. Furthermore, a DSD model and a Depression Post Detection model trained on it achieve significantly better accuracy than their initial versions.
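The harvest-and-retrain loop can be sketched as generic pseudo-labelling. In the snippet below, a TF-IDF plus logistic-regression pipeline stands in for the fine-tuned language model, and a confidence threshold stands in for the agreement check with the zero-shot DSD model; all texts, labels, and thresholds are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labelled seed set (stand-in for the clinician-annotated DSD data).
seed_texts = ["I can't sleep at all lately", "had a great run this morning",
              "I feel worthless every day", "excited about the new job"]
seed_labels = [1, 0, 1, 0]                       # 1 = depression-symptom related

# Unlabelled pool (stand-in for the depressive tweets repository, DTR).
pool = ["nothing feels enjoyable anymore", "lunch with friends was fun",
        "so tired of everything", "finally finished the project"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
texts, labels = list(seed_texts), list(seed_labels)

for round_ in range(3):                          # a few self-training rounds
    model.fit(texts, labels)
    if not pool:
        break
    probs = model.predict_proba(pool)
    # Harvest confidently labelled samples; in the paper this harvesting is
    # additionally cross-checked against a zero-shot DSD model.
    confident = [(i, int(p[1] > 0.5)) for i, p in enumerate(probs) if p.max() > 0.8]
    used = set()
    for i, label in confident:
        texts.append(pool[i])
        labels.append(label)
        used.add(i)
    pool = [t for i, t in enumerate(pool) if i not in used]

print(model.predict(["everything feels pointless"]))
```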
{"title":"Depression symptoms modelling from social media text: an LLM driven semi-supervised learning approach","authors":"Nawshad Farruque, Randy Goebel, Sudhakar Sivapalan, Osmar R. Zaïane","doi":"10.1007/s10579-024-09720-4","DOIUrl":"https://doi.org/10.1007/s10579-024-09720-4","url":null,"abstract":"<p>A fundamental component of user-level social media language based clinical depression modelling is depression symptoms detection (DSD). Unfortunately, there does not exist any DSD dataset that reflects both the clinical insights and the distribution of depression symptoms from the samples of self-disclosed depressed population. In our work, we describe a semi-supervised learning (SSL) framework which uses an initial supervised learning model that leverages (1) a state-of-the-art large mental health forum text pre-trained language model further fine-tuned on a clinician annotated DSD dataset, (2) a Zero-Shot learning model for DSD, and couples them together to harvest depression symptoms related samples from our large self-curated depressive tweets repository (DTR). Our clinician annotated dataset is the largest of its kind. Furthermore, DTR is created from the samples of tweets in self-disclosed depressed users Twitter timeline from two datasets, including one of the largest benchmark datasets for user-level depression detection from Twitter. This further helps preserve the depression symptoms distribution of self-disclosed tweets. Subsequently, we iteratively retrain our initial DSD model with the harvested data. We discuss the stopping criteria and limitations of this SSL process, and elaborate the underlying constructs which play a vital role in the overall SSL process. We show that we can produce a final dataset which is the largest of its kind. Furthermore, a DSD and a Depression Post Detection model trained on it achieves significantly better accuracy than their initial version.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"86 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140577863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A longitudinal multi-modal dataset for dementia monitoring and diagnosis
Pub Date: 2024-03-30 DOI: 10.1007/s10579-023-09718-4
Dementia affects the cognitive functions of adults, including memory, language, and behaviour. Standard diagnostic biomarkers such as MRI are costly, whilst neuropsychological tests suffer from sensitivity issues in detecting dementia onset. The analysis of speech and language has emerged as a promising and non-intrusive technology for diagnosing and monitoring dementia. Currently, most work in this direction ignores the multi-modal nature of human communication and the interactive aspects of everyday conversational interaction. Moreover, most studies ignore changes in cognitive status over time due to the lack of consistent longitudinal data. Here we introduce a novel fine-grained longitudinal multi-modal corpus collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We present the data collection process and describe the corpus in detail. Furthermore, we establish baselines for capturing longitudinal changes in language across different modalities for two cohorts, healthy controls and people with dementia, outlining future research directions enabled by the corpus.
{"title":"A longitudinal multi-modal dataset for dementia monitoring and diagnosis","authors":"","doi":"10.1007/s10579-023-09718-4","DOIUrl":"https://doi.org/10.1007/s10579-023-09718-4","url":null,"abstract":"<h3>Abstract</h3> <p>Dementia affects cognitive functions of adults, including memory, language, and behaviour. Standard diagnostic biomarkers such as MRI are costly, whilst neuropsychological tests suffer from sensitivity issues in detecting dementia onset. The analysis of speech and language has emerged as a promising and non-intrusive technology to diagnose and monitor dementia. Currently, most work in this direction ignores the multi-modal nature of human communication and interactive aspects of everyday conversational interaction. Moreover, most studies ignore changes in cognitive status over time due to the lack of consistent longitudinal data. Here we introduce a novel fine-grained longitudinal multi-modal corpus collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We present the data collection process and describe the corpus in detail. Furthermore, we establish baselines for capturing longitudinal changes in language across different modalities for two cohorts, healthy controls and people with dementia, outlining future research directions enabled by the corpus.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"42 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140577783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DILLo: an Italian lexical database for speech-language pathologists
Pub Date: 2024-03-23 DOI: 10.1007/s10579-024-09722-2
Federica Beccaria, Angela Cristiano, Flavio Pisciotta, Noemi Usardi, Elisa Borgogni, Filippo Prayer Galletti, Giulia Corsi, Lorenzo Gregori, Gloria Gagliardi
A novel lexical resource for treating speech impairments from childhood to senility is presented: DILLo—Database Italiano del Lessico per Logopedisti (i.e., Italian Database for Speech-Language Pathologists). DILLo is a free online web application that allows the extraction of filtered wordlists for flexible rehabilitative purposes. Its major aim is to provide Italian speech-language pathologists (SLPs) with a resource that takes advantage of Information and Communication Technologies for language in a healthcare setting. DILLo's design adopts an integrated approach that envisages fruitful cooperation between clinical and linguistic professionals. The 7690 Italian words in the database have been selected based on phonological, phonotactic, and morphological properties, as well as their frequency of use. These linguistic features are encoded in the tool, which includes the orthographic and phonological transcriptions and the phonotactic structure of each word. Moreover, most of the entries are associated with their respective ARASAAC pictogram, providing an additional and inclusive tool for treating speech impairments. The user-friendly interface is structured to allow for different and adaptable search options. DILLo allows SLPs to obtain a rich, tailored, and varied selection of suitable linguistic stimuli. It can be used to customize the treatment of many impairments, e.g., Speech Sound Disorders, Childhood Apraxia of Speech, Specific Learning Disabilities, aphasia, dysarthria, dysphonia, and the auditory training that follows cochlear implantation.
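The kind of filtered wordlist extraction DILLo supports can be illustrated with a toy query over lexical entries. The field names, values, and filtering function below are invented for illustration and do not reflect DILLo's actual schema or web interface.

```python
# Toy lexical entries; field names are hypothetical, not DILLo's real schema.
entries = [
    {"word": "casa",   "ipa": "ˈkaza",   "structure": "CV.CV",   "zipf": 5.9, "pictogram": True},
    {"word": "strada", "ipa": "ˈstrada", "structure": "CCCV.CV", "zipf": 5.1, "pictogram": True},
    {"word": "gnomo",  "ipa": "ˈɲɔmo",   "structure": "CV.CV",   "zipf": 3.2, "pictogram": False},
]

def filter_wordlist(entries, structure=None, contains_phone=None, min_zipf=None, with_pictogram=None):
    """Return the words matching the requested phonotactic/frequency criteria."""
    out = []
    for e in entries:
        if structure is not None and e["structure"] != structure:
            continue
        if contains_phone is not None and contains_phone not in e["ipa"]:
            continue
        if min_zipf is not None and e["zipf"] < min_zipf:
            continue
        if with_pictogram is not None and e["pictogram"] != with_pictogram:
            continue
        out.append(e["word"])
    return out

# e.g., frequent CV.CV words containing /k/, with a pictogram available
print(filter_wordlist(entries, structure="CV.CV", contains_phone="k", min_zipf=4.0, with_pictogram=True))
```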
{"title":"DILLo: an Italian lexical database for speech-language pathologists","authors":"Federica Beccaria, Angela Cristiano, Flavio Pisciotta, Noemi Usardi, Elisa Borgogni, Filippo Prayer Galletti, Giulia Corsi, Lorenzo Gregori, Gloria Gagliardi","doi":"10.1007/s10579-024-09722-2","DOIUrl":"https://doi.org/10.1007/s10579-024-09722-2","url":null,"abstract":"<p>A novel lexical resource for treating speech impairments from childhood to senility: DILLo—<i>Database Italiano del Lessico per Logopedisti</i> (i.e., Italian Database for Speech-Language Pathologists) is presented. DILLo is a free online web application that allows extraction of filtered wordlists for flexible rehabilitative purposes. Its major aim is to provide Italian speech-language pathologists (SLPs) with a resource that takes advantage of Information and Communication Technologies for language in a healthcare setting. DILLo’s design adopts an integrated approach that envisages fruitful cooperation between clinical and linguistic professionals. The 7690 Italian words in the database have been selected based on phonological, phonotactic, and morphological properties, and their frequency of use. These linguistic features are encoded in the tool, which includes the orthographic and phonological transcriptions, and the phonotactic structure of each word. Moreover, most of the entries are associated with their respective ARASAAC pictogram, providing an additional and inclusive tool for treating speech impairments. The user-friendly interface is structured to allow for different and adaptable search options. DILLo allows Speech-Language Pathologists (SLPs) to obtain a rich, tailored, and varied selection of suitable linguistic stimuli. It can be used to customize the treatment of many impairments, e.g., Speech Sound Disorders, Childhood Apraxia of Speech, Specific Learning Disabilities, aphasia, dysarthria, dysphonia, and the auditory training that follows cochlear implantations.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"80 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140200878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VeLeRo: an inflected verbal lexicon of standard Romanian and a quantitative analysis of morphological predictability
Pub Date: 2024-03-23 DOI: 10.1007/s10579-024-09721-3
Borja Herce, Bogdan Pricop
This paper presents VeLeRo, an inflected lexicon of Standard Romanian which contains the full paradigm of 7297 verbs in phonological form. We explain the process by which the resource was compiled, and how stress, diphthongs and hiatus, consonant palatalization, and other relevant issues were handled in phonemization. On the basis of the most token-frequent verbs in VeLeRo, we also perform a quantitative analysis of morphological predictability in Romanian verbs, whose complexity patterns are presented within the broader Romance context.
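The abstract does not spell out the predictability measure used; a common choice in this line of work is the conditional entropy of one paradigm cell's exponent given another. The sketch below computes that quantity over a toy paradigm table; the cells and exponent labels are invented for illustration and are not drawn from VeLeRo.

```python
import math
from collections import Counter

# Toy paradigm table: for each verb, the exponence pattern of two cells.
# Cells and patterns are invented for illustration.
paradigms = [
    {"1sg.prs": "-",   "3pl.prs": "-ă"},
    {"1sg.prs": "-",   "3pl.prs": "-ă"},
    {"1sg.prs": "-ez", "3pl.prs": "-ează"},
    {"1sg.prs": "-ez", "3pl.prs": "-ează"},
    {"1sg.prs": "-",   "3pl.prs": "-"},
]

def conditional_entropy(paradigms, known, target):
    """H(target | known): how unpredictable the target cell is once the known cell is seen."""
    joint = Counter((p[known], p[target]) for p in paradigms)
    marginal = Counter(p[known] for p in paradigms)
    n = len(paradigms)
    h = 0.0
    for (k, t), c in joint.items():
        p_joint = c / n                 # P(known = k, target = t)
        p_cond = c / marginal[k]        # P(target = t | known = k)
        h -= p_joint * math.log2(p_cond)
    return h

print(conditional_entropy(paradigms, known="1sg.prs", target="3pl.prs"))
```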
{"title":"VeLeRo: an inflected verbal lexicon of standard Romanian and a quantitative analysis of morphological predictability","authors":"Borja Herce, Bogdan Pricop","doi":"10.1007/s10579-024-09721-3","DOIUrl":"https://doi.org/10.1007/s10579-024-09721-3","url":null,"abstract":"<p>This paper presents VeLeRo, an inflected lexicon of Standard Romanian which contains the full paradigm of 7297 verbs in phonological form. We explain the process by which the resource was compiled, and how stress, diphthongs and hiatus, consonant palatalization, and other relevant issues were handled in phonemization. On the basis of the most token-frequent verbs in VeLeRo, we also perform a quantitative analysis of morphological predictability in Romanian verbs, whose complexity patterns are presented within the broader Romance context.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140201103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introducing the 3MT_French dataset to investigate the timing of public speaking judgements
Pub Date: 2024-03-23 DOI: 10.1007/s10579-023-09709-5
Beatrice Biancardi, Mathieu Chollet, Chloé Clavel
In most public speaking datasets, judgements are given after watching the entire performance, or on thin slices randomly selected from the presentations, without focusing on the temporal location of these slices. This does not make it possible to investigate how people's judgements develop over time during presentations. This contrasts with primacy and recency theories, which suggest that some moments of the speech could be more salient than others and contribute disproportionately to the perception of the speaker's performance. To provide novel insights into this phenomenon, we present the 3MT_French dataset. It contains a set of public speaking annotations collected on a crowd-sourcing platform through a novel annotation scheme and protocol. Global evaluation, persuasiveness, perceived self-confidence of the speaker, and audience engagement were annotated on different time windows (i.e., the beginning, middle, or end of the presentation, or the full video). This new resource will be useful to researchers working on public speaking assessment and training. It will allow fine-tuning the analysis of presentations from a novel perspective relying on socio-cognitive theories rarely studied before in this context, such as first impressions and primacy and recency theories. An exploratory correlation analysis of the annotations provided in the dataset suggests that the early moments of a presentation have a stronger impact on the judgements.
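The exploratory analysis mentioned above can be sketched as a correlation between annotation window and judgement score. The column names and toy ratings below are assumptions, not the released annotations.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy annotations: one row per rating; window 0/1/2 = beginning/middle/end slice.
ratings = pd.DataFrame({
    "window":      [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "global_eval": [4.5, 4.0, 4.2, 3.8, 3.9, 3.6, 3.5, 3.7, 3.4],
})

# Mean judgement per window, and a rank correlation between window position and score.
print(ratings.groupby("window")["global_eval"].mean())
rho, p = spearmanr(ratings["window"], ratings["global_eval"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# A negative rho here would be consistent with early moments weighing more heavily
# on the judgements, as the exploratory analysis in the paper suggests.
```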
{"title":"Introducing the 3MT_French dataset to investigate the timing of public speaking judgements","authors":"Beatrice Biancardi, Mathieu Chollet, Chloé Clavel","doi":"10.1007/s10579-023-09709-5","DOIUrl":"https://doi.org/10.1007/s10579-023-09709-5","url":null,"abstract":"<p>In most public speaking datasets, judgements are given after watching the entire performance, or on thin slices randomly selected from the presentations, without focusing on the temporal location of these slices. This does not allow to investigate how people’s judgements develop over time during presentations. This contrasts with primacy and recency theories, which suggest that some moments of the speech could be more salient than others and contribute disproportionately to the perception of the speaker’s performance. To provide novel insights on this phenomenon, we present the 3MT_French dataset. It contains a set of public speaking annotations collected on a crowd-sourcing platform through a novel annotation scheme and protocol. Global evaluation, persuasiveness, perceived self-confidence of the speaker and audience engagement were annotated on different time windows (i.e., the beginning, middle or end of the presentation, or the full video). This new resource will be useful to researchers working on public speaking assessment and training. It will allow to fine-tune the analysis of presentations under a novel perspective relying on socio-cognitive theories rarely studied before in this context, such as first impressions and primacy and recency theories. An exploratory correlation analysis on the annotations provided in the dataset suggests that the early moments of a presentation have a stronger impact on the judgements.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"142 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140200795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}