{"title":"Construction of Amharic information retrieval resources and corpora","authors":"Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie","doi":"10.1007/s10579-024-09719-x","DOIUrl":null,"url":null,"abstract":"<p>The development of information retrieval systems and natural language processing tools has been made possible for many natural languages because of the availability of natural language resources and corpora. Although Amharic is the working language of Ethiopia, it is still an under-resourced language. There are no adequate resources and corpora for Amharic ad-hoc retrieval evaluation to date. The existing ones are not publicly accessible and are not suitable for making scientific evaluation of information retrieval systems. To promote the development of Amharic ad-hoc retrieval, we build an ad-hoc retrieval test collection that consists of raw text, morphologically annotated stem-based and root-based corpora, a stopword list, stem-based and root-based lexicons, and WordNet-like resources. We also created word embeddings using the raw text and morphologically segmented forms of the corpora. When building these resources and corpora, we heavily consider the morphological characteristics of the language. The aim of this paper is to present these Amharic resources and corpora that we made available to the research community for information retrieval tasks. These resources and corpora are also evaluated experimentally and by linguists.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"5 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09719-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
Abstract
The development of information retrieval systems and natural language processing tools has been made possible for many natural languages because of the availability of natural language resources and corpora. Although Amharic is the working language of Ethiopia, it is still an under-resourced language. There are no adequate resources and corpora for Amharic ad-hoc retrieval evaluation to date. The existing ones are not publicly accessible and are not suitable for making scientific evaluation of information retrieval systems. To promote the development of Amharic ad-hoc retrieval, we build an ad-hoc retrieval test collection that consists of raw text, morphologically annotated stem-based and root-based corpora, a stopword list, stem-based and root-based lexicons, and WordNet-like resources. We also created word embeddings using the raw text and morphologically segmented forms of the corpora. When building these resources and corpora, we heavily consider the morphological characteristics of the language. The aim of this paper is to present these Amharic resources and corpora that we made available to the research community for information retrieval tasks. These resources and corpora are also evaluated experimentally and by linguists.
期刊介绍:
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use.
Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.