How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09754-8
Xue Li, Paul Groth
When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus exhibits reduced performance when subsequently applied to legal text. While this problem is well known, there has not yet been a systematic study of detecting shifts and investigating their impact on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1. Overall, our results indicate that measuring distribution shift can provide guidance on the amount of data needed for fine-tuning and on whether a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.
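The abstract does not state which divergence measure is used; as a minimal sketch, label shift between a source and a target NER dataset can be quantified by comparing their entity-tag distributions. The Jensen-Shannon distance and the toy tag sequences below are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumed measure, not necessarily the paper's): label shift as
# the Jensen-Shannon distance between the tag distributions of two NER datasets.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon


def label_distribution(tag_sequences, tagset):
    """Relative frequency of each tag across a list of tag sequences."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values()) or 1
    return np.array([counts.get(tag, 0) / total for tag in tagset])


# Toy stand-ins for e.g. OntoNotes (source) and CEREC (target) tag sequences.
source_tags = [["B-PER", "O", "B-ORG", "O"], ["O", "B-LOC", "O", "O"]]
target_tags = [["O", "O", "B-PER", "O"], ["O", "O", "O", "B-PER"]]

tagset = sorted({tag for seq in source_tags + target_tags for tag in seq})
p = label_distribution(source_tags, tagset)
q = label_distribution(target_tags, tagset)

# 0 means identical label distributions; 1 means maximally different (base 2).
print(f"Label shift (JS distance): {jensenshannon(p, q, base=2):.3f}")
```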
DoSLex: automatic generation of all domain semantically rich sentiment lexicon
Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09753-9
Minni Jain, Rajni Jindal, Amita Jain
For sentiment analysis, lexicons are among the most important resources. Existing sentiment lexicons assign a generic polarity to each word, yet many words carry different polarities when used in different domains. In this work, the automatic generation of a domain-specific sentiment lexicon named “DoSLex” is proposed for the first time. In DoSLex, all words are represented in a circle whose centre stands for the domain, while the x and y axes represent the strength and the orientation of the sentiment, respectively. Within the circle, the radius is the contextual similarity between the domain and the term, calculated using MuRIL embeddings, and the angle is the prior sentiment score taken from various knowledge bases. The proposed approach is language-independent and can be applied to any domain. Extensive experiments were conducted on three low-resource languages: Hindi, Tamil, and Bangla. The experimental studies discuss the performance of combinations of different word embeddings (FastText, M-BERT, and MuRIL) with several sources of prior sentiment knowledge across various domains. The performance of DoSLex has also been compared with three sentiment lexicons, and the results demonstrate a significant improvement in sentiment analysis.
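To make the geometric description above concrete, the sketch below places a term in the circle from a precomputed domain–term similarity (the radius) and a prior sentiment score (the angle). The linear mapping of the score onto [0, π] is an assumption for illustration and may differ from the exact DoSLex formulation.

```python
# Illustrative sketch of the polar placement described above (assumed mapping,
# not necessarily the exact DoSLex formulation).
import math


def place_term(similarity: float, prior_sentiment: float):
    """Map a term to (x, y) coordinates.

    similarity: contextual domain-term similarity in [0, 1] (the radius),
                e.g. cosine similarity of MuRIL embeddings.
    prior_sentiment: prior score in [-1, 1] from a knowledge base (the angle).
    Assumption: +1 maps to 0 rad (fully positive), -1 maps to pi rad (fully negative).
    """
    theta = (1.0 - prior_sentiment) * math.pi / 2.0
    return similarity * math.cos(theta), similarity * math.sin(theta)


# Hypothetical term that is highly similar to the target domain but carries a
# mildly negative prior sentiment in a general-purpose lexicon.
x, y = place_term(similarity=0.8, prior_sentiment=-0.3)
print(f"x = {x:.3f}, y = {y:.3f}")
```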
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain
Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09762-8
Felipe A. Siqueira, Douglas Vitório, Ellen Souza, José A. P. Santos, Hidelberg O. Albuquerque, Márcio S. Dias, Nádia F. F. Silva, André C. P. L. F. de Carvalho, Adriano L. I. Oliveira, Carmelo Bastos-Filho
The increasing use of artificial intelligence methods in the legal field has sparked interest in applying Natural Language Processing techniques to handle legal tasks and reduce the workload of legal professionals. However, the availability of legal corpora in Portuguese, especially for the Brazilian legal domain, is limited. Existing resources offer some legal data but lack comprehensive coverage. To address this gap, we present Ulysses Tesemõ, a large corpus specifically built for the Brazilian legal domain. The corpus consists of over 3.5 million files, totaling 30.7 GiB of raw text, collected from 159 sources encompassing judicial, legislative, academic, news, and other related data. The data was collected by scraping public information from governmental websites, emphasizing content generated over the past two decades. We categorized the obtained files into 30 distinct categories, covering various branches of the Brazilian government and different types of texts. The corpus retains the original content with minimal data transformations, addressing the scarcity of Portuguese legal corpora and providing researchers with a valuable resource for advancing research in the area.
Historical Portuguese corpora: a survey
Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09757-5
Tomás Freitas Osório, Henrique Lopes Cardoso
This survey aims to thoroughly examine and evaluate the current landscape of electronic corpora in historical Portuguese. This is achieved through a comprehensive analysis of existing resources. The article makes two main contributions. The first is an exhaustive cataloguing of existing Portuguese historical corpora, where each corpus is meticulously detailed regarding linguistic periods, geographic origins, and thematic contents. The second contribution focuses on the digital accessibility of these corpora for researchers. These contributions are crucial in enhancing and advancing the study of historical corpora in the Portuguese language, laying critical groundwork for future linguistic research in this field. Our survey identified 20 freely accessible corpora, comprising approximately 63.9 million tokens, and two private corpora, totalling 59.9 million tokens.
Šolar, the developmental corpus of Slovene
Pub Date: 2024-07-18 | DOI: 10.1007/s10579-024-09758-4
Špela Arhar Holdt, Iztok Kosem
The paper presents the Šolar developmental corpus of Slovene, comprising the written language production of students in Slovene elementary and secondary schools, along with teacher feedback. The corpus consists of 5485 texts (1,635,407 words) and includes linguistically categorized teacher corrections, making the corpus unique in reflecting authentic classroom correction practices. The paper addresses the corpus compilation, content and format, annotation, availability, and its applicative value. While learner corpora are abundant, developmental corpora are less common. The paper bridges the gap by introducing the evolution from Šolar 1.0 to 3.0, emphasizing improvements in text collection, error and correction annotation, and categorization methodology. It also underlines the challenges and unresolved issues of compiling developmental corpora, most notably the lack of openly available tools and standards for different steps of the compilation process. Overall, the Šolar corpus offers valuable insights into language learning and teaching, contributing to teacher training, empirical studies in applied linguistics, and natural language processing tasks.
ParlaMint-It: an 18-karat UD treebank of Italian parliamentary speeches
Pub Date: 2024-07-06 | DOI: 10.1007/s10579-024-09748-6
Chiara Alzetta, Simonetta Montemagni, Marta Sartor, Giulia Venturi
The paper presents ParlaMint-It, a new treebank of Italian parliamentary debates, linguistically annotated based on the Universal Dependencies (UD) framework. The resource comprises 20,460 tokens and represents a hybrid language variety that is underrepresented in the UD initiative. ParlaMint-It results from a manual revision process that relies on a semi-automatic methodology able to identify the sentences most likely to contain inconsistencies and recurrent error patterns generated by the automatic annotation. Such a method made the revision process faster and more efficient than revising the entire treebank. In addition, it allowed the identification and correction of annotation errors resulting from linguistic constructions inconsistently represented in UD treebanks and from characteristics specific to parliamentary speeches. Hence, the treebank is deemed an 18-karat resource: although not fully manually revised, it is a valuable resource for researchers working on Italian language processing tasks.
BRISE-Plandok: a German legal corpus of building regulations
Pub Date: 2024-07-06 | DOI: 10.1007/s10579-024-09747-7
Gábor Recski, Eszter Iklódi, Björn Lellmann, Ádám Kovács, Allan Hanbury
We present the BRISE-Plandok corpus, a collection of 250 text documents with a total of over 7000 sentences from the Zoning Map of the City of Vienna, annotated manually with formal representations of the rules they convey. The generic rule format used by the corpus enables automated compliance checking of building plans, a process developed as part of the BRISE (https://smartcity.wien.gv.at/en/brise/) project. The format also allows for conversion to multiple logic formalisms, including dyadic deontic logic, enabling automated reasoning. Annotation guidelines were developed in collaboration with experts of the city’s building inspection office, describing nearly 100 domain-specific attributes with examples. Each document was annotated independently by two trained annotators and subsequently reviewed by the authors. A rule-based system for the automatic extraction of rules from text was developed and used in the annotation process to provide suggestions. The reviewed dataset was also used to train a set of baseline machine learning models for the task of attribute extraction, the main step in the rule extraction process. Both the rule-based system and the ML baselines are evaluated on the annotated dataset and released as open-source software. We also describe and release the framework used for generating and parsing the interactive xlsx spreadsheets used by annotators.
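The corpus defines its own schema of nearly 100 domain-specific attributes; the sketch below only illustrates the general shape of an attribute-based rule record that supports a naive compliance check. The class, field, and attribute names are hypothetical, not the BRISE-Plandok format itself.

```python
# Hypothetical sketch of an attribute-based rule record for compliance checking;
# the real BRISE-Plandok schema and attribute names differ.
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    modality: str = "obligation"  # deontic flavour: obligation / permission / prohibition
    limits: dict[str, float] = field(default_factory=dict)  # attribute -> upper bound

    def complies(self, plan: dict[str, float]) -> bool:
        """Naive check: every constrained attribute in the plan stays within its limit."""
        return all(plan.get(name, 0.0) <= bound for name, bound in self.limits.items())


# A rule conveyed by a (hypothetical) sentence capping building height at 7.5 m.
rule = Rule(rule_id="doc_42_sent_3", limits={"MaxBuildingHeight": 7.5})
print(rule.complies({"MaxBuildingHeight": 6.0}))  # True
print(rule.complies({"MaxBuildingHeight": 9.0}))  # False
```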
Research on translation quality self-evaluation by expert translators: an empirical study
Pub Date: 2024-07-06 | DOI: 10.1007/s10579-024-09760-w
Yaya Zheng
The study of translation quality self-evaluation represents a shift of focus from product to process in the field of translation quality research. As this is a new and rarely explored field of investigation, a complete description of the research object will serve as a strong foundation for its future development. This research uses screen-recording software as an instrument to observe the phenomena of translation quality self-evaluation as manifested by expert translators. The results of the observation prove that translation quality self-evaluation is an essential part of any translation process, and that it takes place not only in the revision stage but also in the very first stage of meaning generation. The research also sheds some light on the content, criteria, and external and internal resources used in the process of translation quality self-evaluation.
Construction of Amharic information retrieval resources and corpora
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09719-x
Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
The development of information retrieval systems and natural language processing tools has been made possible for many natural languages by the availability of language resources and corpora. Although Amharic is the working language of Ethiopia, it is still an under-resourced language. To date, there are no adequate resources and corpora for Amharic ad-hoc retrieval evaluation: the existing ones are not publicly accessible and are not suitable for the scientific evaluation of information retrieval systems. To promote the development of Amharic ad-hoc retrieval, we built an ad-hoc retrieval test collection that consists of raw text, morphologically annotated stem-based and root-based corpora, a stopword list, stem-based and root-based lexicons, and WordNet-like resources. We also created word embeddings using the raw text and the morphologically segmented forms of the corpora. When building these resources and corpora, we carefully considered the morphological characteristics of the language. The aim of this paper is to present these Amharic resources and corpora, which we have made available to the research community for information retrieval tasks. These resources and corpora are also evaluated experimentally and by linguists.
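The abstract does not detail the embedding configuration; as a hedged sketch, word vectors could be trained on morphologically segmented sentences with gensim as below, where the toy sentences and hyper-parameters are assumptions rather than the authors' setup.

```python
# Hedged sketch: training embeddings on morphologically segmented Amharic text
# with gensim's Word2Vec (toy data and assumed hyper-parameters).
from gensim.models import Word2Vec

# Each "sentence" is a list of stem- or root-based units rather than surface words.
segmented_sentences = [
    ["ተማሪ", "ቤት", "ኢትዮጵያ"],
    ["ቤት", "ተማሪ"],
    ["ኢትዮጵያ", "ቤት"],
]

model = Word2Vec(
    sentences=segmented_sentences,
    vector_size=100,  # embedding dimensionality
    window=5,
    min_count=1,      # keep every unit in this toy example
    sg=1,             # skip-gram
    workers=2,
)
print(model.wv.most_similar("ቤት", topn=2))
```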
Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09755-7
Pablo Báez, Leonardo Campillos-Llanos, Fredy Núñez, Jocelyn Dunstan
Entity normalization is a common strategy to resolve ambiguity by mapping all synonym mentions to a single concept identifier in a standard terminology. Normalizing medical entities is challenging, especially for languages other than English, where lexical variation is considerably under-represented. Here, we report a new linguistic resource for medical entity normalization in Spanish. We applied a UMLS-based medical lexicon (MedLexSp) to automatically normalize mentions from 2000 medical referrals of the Chilean Waiting List Corpus. Three medical students manually revised the automatic normalization. The inter-coder agreement was computed, and the distribution of concepts, errors, and linguistic sources of variation was analyzed. The automatic method normalized 52% of the mentions, compared to 91% after manual revision. The lowest agreement between the automatic and the manually revised normalization was observed for Finding, Disease, and Procedure entities. Normalization errors were associated with ortho-typographic, semantic, and grammatical inadequacies, mainly of the hyponymy/hyperonymy, polysemy/metonymy, and acronym/abbreviation types. This new resource can enrich dictionaries and lexicons with new mentions to improve the performance of modern entity normalization methods. The linguistic analysis offers insight into the sources of lexical variation in the Spanish clinical environment that lead to errors when using lexicon-based normalization methods. This article also introduces a workflow that can serve as a benchmark for comparison in studies replicating our analysis in Romance languages.
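As a rough illustration of the lexicon-based step described above, the sketch below lower-cases and accent-strips a mention before looking it up in a term-to-concept dictionary; the lexicon entries and the concept identifier are hypothetical examples, not taken from MedLexSp.

```python
# Illustrative sketch of lexicon-based normalization; the lexicon entries and
# the concept identifier below are hypothetical, not MedLexSp content.
import unicodedata
from typing import Optional

LEXICON = {
    # normalized surface form -> concept identifier (hypothetical example)
    "infarto agudo de miocardio": "C0155626",
    "iam": "C0155626",
}


def normalize_mention(mention: str) -> Optional[str]:
    """Return the concept identifier for a mention, or None if it is not covered."""
    key = unicodedata.normalize("NFD", mention.strip().lower())
    key = "".join(ch for ch in key if unicodedata.category(ch) != "Mn")  # strip accents
    return LEXICON.get(key)


print(normalize_mention("Infarto agudo de miocardio"))  # -> C0155626
print(normalize_mention("sospecha de neumonía"))        # -> None: left for manual revision
```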