Perspectivist approaches to natural language processing: a survey
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, Davide Bernardi
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09766-4
In Artificial Intelligence research, perspectivism is an approach to machine learning that leverages data annotated by different individuals in order to model the varied perspectives that influence their opinions and world views. We present the first survey of datasets and methods relevant to perspectivism in Natural Language Processing (NLP). We review datasets in which individual annotator labels are preserved, as well as research papers focused on analysing and modelling human perspectives for NLP tasks. Our analysis is based on targeted questions that aim to surface how different perspectives are taken into account, what the novelties and advantages of perspectivist approaches and methods are, and what the limitations of these works are. Most of the included works have a perspectivist goal, even if some of them do not explicitly discuss perspectivism. A sizeable portion of these works focus on highly subjective phenomena in natural language where humans show divergent understandings and interpretations, for example in the annotation of toxic and otherwise undesirable language. However, even in seemingly objective tasks, human raters often show systematic disagreement. Through the framework of perspectivism we summarize the solutions proposed to extract and model different points of view, and to evaluate and explain perspectivist models. Finally, we list the key concepts that emerge from the analysis of the sources and several important observations on the impact of perspectivist approaches on future NLP research.
Chinese-DiMLex: a lexicon of Chinese discourse connectives
Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09761-9
Machine-readable inventories of connectives that provide information on multiple levels are a useful resource for automated discourse parsing, machine translation, text summarization, argumentation mining, and related tasks. Despite Chinese being one of the world's most widely spoken languages and having a wealth of annotated corpora, such a lexicon for Chinese has remained absent, whereas lexicons for many other languages have long been established. In this paper, we present 226 Chinese discourse connectives, augmented with morphological variations, syntactic (part-of-speech) and semantic (PDTB3.0 sense inventory) information, usage examples, and English translations. The resulting lexicon, Chinese-DiMLex, is made publicly available in XML format and is included in connective-lex.info, a platform specifically designed for human-friendly browsing of connective lexicons across languages. We describe the creation process of the lexicon and discuss several Chinese-specific considerations and issues that arose in the process. By demonstrating the process, we hope not only to contribute to research and educational purposes, but also to inspire researchers to use our method as a reference for building lexicons for their (native) language(s).
Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies
Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09767-3
The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval investigates how to handle these datasets efficiently, enabling the retrieval of pertinent information from them. Relevance Feedback, an important aspect of Information Retrieval systems, uses the relevance information provided by the user to improve document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. This paper therefore presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this corpus is the first publicly available one of its kind for Brazilian Portuguese. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature focus primarily on judicial texts. We also used the corpus to evaluate the Brazilian Chamber of Deputies' Information Retrieval system, highlighting the system's strong performance and the dataset's significance for Legal Information Retrieval.
PESTS: Persian_English cross lingual corpus for semantic textual similarity
Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei-Bidgoli
Pub Date: 2024-08-03, DOI: 10.1007/s10579-024-09759-3
In recent years, there has been significant research interest in the natural language processing subtask of semantic textual similarity. Measuring the semantic similarity between words, terms, sentences, paragraphs, and documents plays an important role in natural language processing and computational linguistics, with applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the degree of similarity in meaning between two textual documents, paragraphs, or sentences, both within a language and across different languages. Cross-lingual semantic similarity requires corpora of sentence pairs in the source and target languages, where each pair exhibits a known degree of semantic similarity. Due to the lack of such cross-lingual datasets, many current models in this field rely on machine translation, which can reduce model accuracy through the propagation of translation errors. For Persian, which is categorized as a low-resource language, there has been little effort to develop models that can comprehend the context of both languages, and the demand for such a model is now more crucial than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences has been produced for the first time through the collaboration of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned on this dataset. Based on the results obtained on PESTS, fine-tuning the XLM-RoBERTa model increases the Pearson correlation from 85.87% to 95.62%.
DoSLex: automatic generation of all domain semantically rich sentiment lexicon
Minni Jain, Rajni Jindal, Amita Jain
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09753-9
Lexicons are among the most important resources for sentiment analysis. Existing sentiment lexicons assign each word a generic polarity, yet many words carry different polarities when used in different domains. This work proposes, for the first time, the automated construction of a domain-specific sentiment lexicon, named "DoSLex". In DoSLex, every word is represented in a circle whose centre stands for the domain, with the x- and y-axes representing the strength and the orientation of the sentiment, respectively. Within the circle, the radius is the contextual similarity between the domain and the term, calculated using MuRIL embeddings, and the angle is the prior sentiment score taken from various knowledge bases. The proposed approach is language-independent and can be applied to any domain. Extensive experiments were conducted on three low-resource languages: Hindi, Tamil, and Bangla. The experimental studies discuss the performance of combinations of different word embeddings (FastText, M-BERT, and MuRIL) with several sources of prior sentiment knowledge across various domains. The performance of DoSLex has also been compared with three sentiment lexicons, and the results demonstrate a significant improvement in sentiment analysis.
How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
Xue Li, Paul Groth
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09754-8
When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus performs worse when subsequently applied to legal text. While this problem is well known, there has to this point been no systematic study of detecting shifts and investigating their impact on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1. Overall, our results indicate that measuring distribution shift can provide guidance on the amount of data needed for fine-tuning and on whether a model can be used "off-the-shelf" without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in defining NLP model pipelines.
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain
Felipe A. Siqueira, Douglas Vitório, Ellen Souza, José A. P. Santos, Hidelberg O. Albuquerque, Márcio S. Dias, Nádia F. F. Silva, André C. P. L. F. de Carvalho, Adriano L. I. Oliveira, Carmelo Bastos-Filho
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09762-8
The increasing use of artificial intelligence methods in the legal field has sparked interest in applying Natural Language Processing techniques to handle legal tasks and reduce the workload of legal professionals. However, the availability of legal corpora in Portuguese, especially for the Brazilian legal domain, is limited: existing resources offer some legal data but lack comprehensive coverage. To address this gap, we present Ulysses Tesemõ, a large corpus built specifically for the Brazilian legal domain. The corpus consists of over 3.5 million files, totaling 30.7 GiB of raw text, collected from 159 sources encompassing judicial, legislative, academic, news, and other related data. The data was collected by scraping public information from governmental websites, emphasizing content generated over the past two decades. We categorized the obtained files into 30 distinct categories, covering various branches of the Brazilian government and different types of texts. The corpus retains the original content with minimal data transformations, addressing the scarcity of Portuguese legal corpora and providing researchers with a valuable resource for advancing research in the area.
Historical Portuguese corpora: a survey
Tomás Freitas Osório, Henrique Lopes Cardoso
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09757-5
This survey aims to thoroughly examine and evaluate the current landscape of electronic corpora in historical Portuguese, through a comprehensive analysis of existing resources. The article makes two main contributions. The first is an exhaustive cataloguing of existing Portuguese historical corpora, in which each corpus is meticulously detailed regarding linguistic periods, geographic origins, and thematic contents. The second contribution focuses on the digital accessibility of these corpora for researchers. These contributions are crucial in enhancing and progressing the study of historical corpora in the Portuguese language, laying critical groundwork for future linguistic research in this field. Our survey identified 20 freely accessible corpora, comprising approximately 63.9 million tokens, and two private corpora, totalling 59.9 million tokens.
Šolar, the developmental corpus of Slovene
Špela Arhar Holdt, Iztok Kosem
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09758-4
The paper presents the Šolar developmental corpus of Slovene, comprising the written language production of students in Slovene elementary and secondary schools, along with teacher feedback. The corpus consists of 5485 texts (1,635,407 words) and includes linguistically categorized teacher corrections, making it unique in reflecting authentic classroom correction practices. The paper addresses the corpus compilation, content and format, annotation, availability, and applicative value. While learner corpora are abundant, developmental corpora are less common. The paper bridges this gap by introducing the evolution from Šolar 1.0 to 3.0, emphasizing improvements in text collection, error and correction annotation, and categorization methodology. It also underlines the challenges and unresolved issues of compiling developmental corpora, most notably the lack of openly available tools and standards for the different steps of the compilation process. Overall, the Šolar corpus offers valuable insights into language learning and teaching, contributing to teacher training, empirical studies in applied linguistics, and natural language processing tasks.
ParlaMint-It: an 18-karat UD treebank of Italian parliamentary speeches
Chiara Alzetta, Simonetta Montemagni, Marta Sartor, Giulia Venturi
Pub Date: 2024-07-06, DOI: 10.1007/s10579-024-09748-6
The paper presents ParlaMint-It, a new treebank of Italian parliamentary debates, linguistically annotated based on the Universal Dependencies (UD) framework. The resource comprises 20,460 tokens and represents a hybrid language variety that is underrepresented in the UD initiative. ParlaMint-It results from a manual revision process that relies on a semi-automatic methodology for identifying the sentences most likely to contain inconsistencies and recurrent error patterns generated by the automatic annotation. This method made the revision process faster and more efficient than revising the entire treebank. In addition, it allowed the identification and correction of annotation errors arising from linguistic constructions inconsistently represented across UD treebanks and from characteristics specific to parliamentary speeches. Hence, the treebank is deemed an 18-karat resource: although not fully manually revised, it is a valuable resource for researchers working on Italian language processing tasks.