Language Resources and Evaluation: latest articles

Perspectivist approaches to natural language processing: a survey
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09766-4
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, Davide Bernardi

In Artificial Intelligence research, perspectivism is an approach to machine learning that aims at leveraging data annotated by different individuals in order to model varied perspectives that influence their opinions and world view. We present the first survey of datasets and methods relevant to perspectivism in Natural Language Processing (NLP). We review datasets in which individual annotator labels are preserved, as well as research papers focused on analysing and modelling human perspectives for NLP tasks. Our analysis is based on targeted questions that aim to surface how different perspectives are taken into account, what the novelties and advantages of perspectivist approaches/methods are, and the limitations of these works. Most of the included works have a perspectivist goal, even if some of them do not explicitly discuss perspectivism. A sizeable portion of these works are focused on highly subjective phenomena in natural language where humans show divergent understandings and interpretations, for example in the annotation of toxic and otherwise undesirable language. However, in seemingly objective tasks too, human raters often show systematic disagreement. Through the framework of perspectivism we summarize the solutions proposed to extract and model different points of view, and how to evaluate and explain perspectivist models. Finally, we list the key concepts that emerge from the analysis of the sources and several important observations on the impact of perspectivist approaches on future research in NLP.
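
As a minimal sketch of what preserving individual annotator labels makes possible, the Python snippet below contrasts a single majority-vote label with a soft label, i.e. the distribution of annotator votes. The function names and the toy annotations are illustrative assumptions and are not taken from the survey; soft labels are only one of several ways to represent disagreement.

```python
from collections import Counter


def majority_label(annotations):
    """Collapse annotator votes into one label (ties go to the label seen first)."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations, classes):
    """Keep disagreement: return the share of annotators who chose each class."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts.get(c, 0) / total for c in classes}


# Hypothetical item annotated by three people.
votes = ["toxic", "toxic", "ok"]
aggregated = majority_label(votes)               # one label, disagreement lost
distribution = soft_label(votes, ["toxic", "ok"])  # per-class vote shares
print(aggregated, distribution)
```

The aggregated label hides the dissenting vote, while the distribution keeps it available as a training target or for later analysis.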

Chinese-DiMLex: a lexicon of Chinese discourse connectives
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09761-9
Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho

Machine-readable inventories of connectives that provide information on multiple levels are a useful resource for automated discourse parsing, machine translation, text summarization, argumentation mining, and related tasks. Although Chinese is one of the world’s most widely spoken languages and has a wealth of annotated corpora, no such lexicon exists for it yet, whereas lexicons for many other languages have long been established. In this paper, we present 226 Chinese discourse connectives, augmented with morphological variations, syntactic (part-of-speech) and semantic (PDTB 3.0 sense inventory) information, usage examples and English translations. The resulting lexicon, Chinese-DiMLex, is made publicly available in XML format, and is included in connective-lex.info, a platform specifically designed for human-friendly browsing of connective lexicons across languages. We describe the creation process of the lexicon and discuss several Chinese-specific considerations and issues that arose along the way. By documenting the process, we hope not only to support research and education, but also to inspire researchers to use our method as a reference for building lexicons for their (native) language(s).

Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09767-3
Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade

The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval investigates how to handle these datasets efficiently so that pertinent information can be retrieved from them. Relevance Feedback, an important aspect of Information Retrieval systems, uses the relevance information provided by the user to improve document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. Thus, this paper presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this is the first publicly available corpus of its kind for Brazilian Portuguese. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature primarily focus on judicial texts. We also used the corpus to evaluate the performance of the Brazilian Chamber of Deputies’ Information Retrieval system, highlighting the model’s strong performance and the dataset’s significance for the field of Legal Information Retrieval.
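
The abstract does not say which feedback method the Chamber's system uses. As a generic illustration of how user relevance judgments can reshape a query, the sketch below implements the classic Rocchio update over term-weight vectors. The weights `alpha`, `beta`, and `gamma` are conventional textbook defaults and the toy vectors are made up; none of this is taken from the paper.

```python
def centroid(vectors, dim):
    """Mean of a list of equal-length vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones.

    Negative term weights are clipped to zero, a common practical choice.
    """
    dim = len(query)
    rel = centroid(relevant, dim)
    non = centroid(non_relevant, dim)
    return [
        max(0.0, alpha * q + beta * r - gamma * n)
        for q, r, n in zip(query, rel, non)
    ]


# Hypothetical two-term vocabulary: the user marks one document relevant
# and one non-relevant.
updated = rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], non_relevant=[[1.0, 0.0]])
print(updated)
```

A corpus that records which retrieved documents users marked as relevant supplies exactly the `relevant` and `non_relevant` inputs such an update needs.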

PESTS: Persian_English cross lingual corpus for semantic textual similarity
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-08-03 | DOI: 10.1007/s10579-024-09759-3
Mohammad Abdous, Poorya Piroozfar, Behrouz MinaeiBidgoli

In recent years, there has been significant research interest in the subtask of natural language processing called semantic textual similarity. Measuring semantic similarity between words or terms, sentences, paragraphs and documents plays an important role in natural language processing and computational linguistics. It finds applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the extent of similarity in meaning between two textual documents, paragraphs, or sentences, both in the same language and across different languages. To achieve cross-lingual semantic similarity, it is essential to have corpora that consist of sentence pairs in both the source and target languages. These sentence pairs should demonstrate a certain degree of semantic similarity between them. Due to the lack of available cross-lingual semantic similarity datasets, many current models in this field rely on machine translation. However, this dependence on machine translation can reduce model accuracy because translation errors may propagate. In the case of Persian, which is categorized as a low-resource language, little effort has gone into developing models that can comprehend the context of two languages. The demand for a model that can bridge the understanding gap between languages is now more crucial than ever. In this article, we present the first corpus of semantic textual similarity between Persian and English sentences, produced through the collaboration of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned using this dataset. Based on the results obtained from the PESTS dataset, using the XLM-RoBERTa model increases the Pearson correlation from 85.87% to 95.62%.
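
For readers unfamiliar with the metric, the following sketch shows how a Pearson correlation between gold similarity scores and model predictions can be computed. The function name and the sample scores are illustrative assumptions, not data from PESTS. The coefficient lies in [-1, 1]; the abstract reports it multiplied by 100.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold annotations and model outputs for four sentence pairs.
gold = [0.0, 2.5, 4.0, 5.0]
predicted = [0.3, 2.0, 3.8, 4.9]
print(f"Pearson r = {pearson(gold, predicted):.4f}")
```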

DoSLex: automatic generation of all domain semantically rich sentiment lexicon
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09753-9
Minni Jain, Rajni Jindal, Amita Jain

For sentiment analysis, lexicons are among the important resources. Existing sentiment lexicons assign a generic polarity to each word. In fact, many words have different polarities when they are used in different domains. This work proposes, for the first time, the automatic generation of a domain-specific sentiment lexicon named “DoSLex”. In DoSLex, all the words are represented in a circle where the centre stands for the domain, and the x and y axes for the strength and the orientation of the sentiment, respectively. In the circle, the radius is the contextual similarity between the domain and the term, calculated using MuRIL embeddings, and the angle is the prior sentiment score taken from various knowledge bases. The proposed approach is language-independent and can be applied to any domain. Extensive experiments were conducted on three low-resource languages: Hindi, Tamil, and Bangla. The experimental studies discuss the performance of combinations of different word embeddings (FastText, M-Bert and MuRIL) with several sources of prior sentiment knowledge bases on various domains. The performance of DoSLex has also been compared with three sentiment lexicons, and the results demonstrate a significant improvement in sentiment analysis.
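
The circular representation described above amounts to polar coordinates, which the sketch below converts to an (x, y) point. The abstract does not specify how a prior sentiment score is turned into an angle, so the mapping used here (a score in [-1, 1] scaled to an angle in [-π/2, π/2]) and the restriction of similarity to [0, 1] are assumptions made purely for illustration, as is the function name.

```python
import math


def doslex_point(similarity, prior_score):
    """Place a term in the domain circle.

    similarity  -- radius: domain-term contextual similarity, assumed in [0, 1]
    prior_score -- prior sentiment, assumed in [-1, 1]; mapped to an angle
                   in [-pi/2, pi/2] (hypothetical mapping, not from the paper)
    Returns (x, y), where x = r*cos(theta) and y = r*sin(theta).
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    if not -1.0 <= prior_score <= 1.0:
        raise ValueError("prior_score must be in [-1, 1]")
    angle = prior_score * math.pi / 2
    return similarity * math.cos(angle), similarity * math.sin(angle)


# A term moderately tied to the domain with a mildly negative prior score.
x, y = doslex_point(similarity=0.6, prior_score=-0.5)
print(f"x = {x:.3f}, y = {y:.3f}")
```

Under this assumed mapping, the sign of y follows the sign of the prior score, and terms weakly related to the domain stay close to the centre.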

How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09754-8
Xue Li, Paul Groth

When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus shows reduced performance when it is subsequently applied to legal text. While this problem is well known, there has so far been no systematic study of detecting shifts and investigating their impact on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1 performance. Overall, our results indicate that measuring distribution shift can provide guidance on the amount of data needed for fine-tuning and on whether a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.
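
The abstract does not name the statistics used to quantify shift. As one plausible way to measure label shift between two NER datasets, the sketch below compares their entity-label distributions with the Jensen-Shannon divergence (base 2, so the value lies in [0, 1]). The choice of this measure, the function names, and the toy label lists are assumptions for illustration only.

```python
import math
from collections import Counter


def label_distribution(labels, label_set):
    """Relative frequency of each label in label_set, in a fixed order."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts.get(label, 0) / len(labels) for label in label_set]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Made-up entity labels from a source and a target dataset.
label_set = ["PER", "ORG", "LOC"]
source = label_distribution(["PER", "ORG", "LOC", "ORG"], label_set)
target = label_distribution(["PER", "PER", "PER", "ORG"], label_set)
print(f"label shift (JSD) = {js_divergence(source, target):.3f}")
```

A value near 0 means the two datasets use labels in similar proportions; a value near 1 means they barely overlap.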

Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09762-8
Felipe A. Siqueira, Douglas Vitório, Ellen Souza, José A. P. Santos, Hidelberg O. Albuquerque, Márcio S. Dias, Nádia F. F. Silva, André C. P. L. F. de Carvalho, Adriano L. I. Oliveira, Carmelo Bastos-Filho

The increasing use of artificial intelligence methods in the legal field has sparked interest in applying Natural Language Processing techniques to handle legal tasks and reduce the workload of legal professionals. However, the availability of legal corpora in Portuguese, especially for the Brazilian legal domain, is limited. Existing resources offer some legal data but lack comprehensive coverage. To address this gap, we present Ulysses Tesemõ, a large corpus specifically built for the Brazilian legal domain. The corpus consists of over 3.5 million files, totaling 30.7 GiB of raw text, collected from 159 sources encompassing judicial, legislative, academic, news, and other related data. The data was collected by scraping public information from governmental websites, with an emphasis on content generated over the past two decades. We categorized the obtained files into 30 distinct categories, covering various branches of the Brazilian government and different types of texts. The corpus retains the original content with minimal data transformations, addressing the scarcity of Portuguese legal corpora and providing researchers with a valuable resource for advancing research in this area.

Historical Portuguese corpora: a survey
IF 2.7 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09757-5
Tomás Freitas Osório, Henrique Lopes Cardoso

This survey aims to thoroughly examine and evaluate the current landscape of electronic corpora in historical Portuguese. This is achieved through a comprehensive analysis of existing resources. The article makes two main contributions. The first is an exhaustive cataloguing of existing Portuguese historical corpora, where each corpus is meticulously detailed regarding linguistic periods, geographic origins, and thematic contents. The second contribution focuses on the digital accessibility of these corpora for researchers. These contributions are crucial in enhancing and progressing the study of historical corpora in the Portuguese language, laying a critical groundwork for future linguistic research in this field. Our survey identified 20 freely accessible corpora, comprising approximately 63.9 million tokens, and two private corpora, totalling 59.9 million tokens.

Citations: 0
Šolar, the developmental corpus of Slovene
IF 2.7 Zone 3, Computer Science Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-07-18 DOI: 10.1007/s10579-024-09758-4
Špela Arhar Holdt, Iztok Kosem

The paper presents the Šolar developmental corpus of Slovene, comprising the written language production of students in Slovene elementary and secondary schools, along with teacher feedback. The corpus consists of 5485 texts (1,635,407 words) and includes linguistically categorized teacher corrections, making the corpus unique in reflecting authentic classroom correction practices. The paper addresses the corpus compilation, content and format, annotation, availability, and its applicative value. While learner corpora are abundant, developmental corpora are less common. The paper bridges the gap by introducing the evolution from Šolar 1.0 to 3.0, emphasizing improvements in text collection, error and correction annotation, and categorization methodology. It also underlines the challenges and unresolved issues of compiling developmental corpora, most notably the lack of openly available tools and standards for different steps of the compilation process. Overall, the Šolar corpus offers valuable insights into language learning and teaching, contributing to teacher training, empirical studies in applied linguistics, and natural language processing tasks.

Citations: 0
Parlamint-it: an 18-karat UD treebank of Italian parliamentary speeches
IF 2.7 Zone 3, Computer Science Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-07-06 DOI: 10.1007/s10579-024-09748-6
Chiara Alzetta, Simonetta Montemagni, Marta Sartor, Giulia Venturi

The paper presents ParlaMint-It, a new treebank of Italian parliamentary debates, linguistically annotated based on the Universal Dependencies (UD) framework. The resource comprises 20,460 tokens and represents a hybrid language variety that is underrepresented in the UD initiative. ParlaMint-It results from a manual revision process that relies on a semi-automatic methodology able to identify sentences that are most likely to contain inconsistencies and recurrent error patterns generated by the automatic annotation. Such a method made the revision process faster and more efficient than revising the entire treebank. In addition, it allowed the identification and correction of annotation errors resulting from linguistic constructions inconsistently represented in UD treebanks and from characteristics specific to parliamentary speeches. Hence, the treebank is deemed as an 18-karat resource, since, although not fully manually revised, it is a valuable resource for researchers working on Italian language processing tasks.

Citations: 0