Perspectivist approaches to natural language processing: a survey
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, Davide Bernardi
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09766-4
In Artificial Intelligence research, perspectivism is an approach to machine learning that leverages data annotated by different individuals in order to model the varied perspectives that influence their opinions and world views. We present the first survey of datasets and methods relevant to perspectivism in Natural Language Processing (NLP). We review datasets in which individual annotator labels are preserved, as well as research papers focused on analysing and modelling human perspectives for NLP tasks. Our analysis is based on targeted questions that aim to surface how different perspectives are taken into account, what the novelties and advantages of perspectivist approaches and methods are, and what the limitations of these works are. Most of the included works have a perspectivist goal, even if some of them do not explicitly discuss perspectivism. A sizeable portion of these works focus on highly subjective phenomena in natural language where humans show divergent understandings and interpretations, for example in the annotation of toxic and otherwise undesirable language. However, even in seemingly objective tasks, human raters often show systematic disagreement. Through the framework of perspectivism we summarize the solutions proposed to extract and model different points of view, and to evaluate and explain perspectivist models. Finally, we list the key concepts that emerge from the analysis of the sources and several important observations on the impact of perspectivist approaches on future NLP research.
Chinese-DiMLex: a lexicon of Chinese discourse connectives
Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09761-9
Machine-readable inventories of connectives that provide information on multiple levels are a useful resource for automated discourse parsing, machine translation, text summarization, argumentation mining, and related tasks. Despite Chinese being one of the world's most widely spoken languages and having a wealth of annotated corpora, such a lexicon for Chinese has remained absent, whereas lexicons for many other languages have long been established. In this paper, we present 226 Chinese discourse connectives, augmented with morphological variations, syntactic (part-of-speech) and semantic (PDTB3.0 sense inventory) information, usage examples, and English translations. The resulting lexicon, Chinese-DiMLex, is made publicly available in XML format and is included in connective-lex.info, a platform specifically designed for human-friendly browsing of connective lexicons across languages. We describe the creation process of the lexicon and discuss several Chinese-specific considerations and issues that arose in the process. By demonstrating the process, we hope not only to contribute to research and educational purposes, but also to inspire researchers to use our method as a reference for building lexicons for their (native) language(s).
Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies
Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade
Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09767-3
The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval investigates how to handle these datasets efficiently, enabling the retrieval of pertinent information from them. Relevance Feedback, an important aspect of Information Retrieval systems, uses the relevance information provided by the user to improve document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. This paper therefore presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this corpus is the first publicly available one of its kind for Brazilian Portuguese. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature focus primarily on judicial texts. We also used the corpus to evaluate the Brazilian Chamber of Deputies' Information Retrieval system, highlighting the system's strong performance and the dataset's significance for Legal Information Retrieval.
PESTS: Persian_English cross lingual corpus for semantic textual similarity
Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei-Bidgoli
Pub Date: 2024-08-03, DOI: 10.1007/s10579-024-09759-3
In recent years, there has been significant research interest in the natural language processing subtask of semantic textual similarity. Measuring the semantic similarity between words, terms, sentences, paragraphs, and documents plays an important role in natural language processing and computational linguistics, with applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the degree of similarity in meaning between two textual documents, paragraphs, or sentences, both within a language and across different languages. Cross-lingual semantic similarity requires corpora of sentence pairs in the source and target languages, where each pair exhibits a known degree of semantic similarity. Due to the lack of such cross-lingual datasets, many current models in this field rely on machine translation, which can reduce model accuracy through the propagation of translation errors. For Persian, which is categorized as a low-resource language, there has been little effort to develop models that can comprehend the context of both languages, and the demand for such a model is now more crucial than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences has been produced for the first time through the collaboration of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned on this dataset. Based on the results obtained on PESTS, fine-tuning the XLM-RoBERTa model increases the Pearson correlation from 85.87% to 95.62%.
DoSLex: automatic generation of all domain semantically rich sentiment lexicon
Minni Jain, Rajni Jindal, Amita Jain
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09753-9
Lexicons are among the most important resources for sentiment analysis. Existing sentiment lexicons assign each word a generic polarity, yet many words carry different polarities when used in different domains. This work proposes, for the first time, the automated construction of a domain-specific sentiment lexicon, named "DoSLex". In DoSLex, every word is represented in a circle whose centre stands for the domain, with the x- and y-axes representing the strength and the orientation of the sentiment, respectively. Within the circle, the radius is the contextual similarity between the domain and the term, calculated using MuRIL embeddings, and the angle is the prior sentiment score taken from various knowledge bases. The proposed approach is language-independent and can be applied to any domain. Extensive experiments were conducted on three low-resource languages: Hindi, Tamil, and Bangla. The experimental studies discuss the performance of combinations of different word embeddings (FastText, M-BERT, and MuRIL) with several sources of prior sentiment knowledge across various domains. The performance of DoSLex has also been compared with three sentiment lexicons, and the results demonstrate a significant improvement in sentiment analysis.
How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
Xue Li, Paul Groth
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09754-8
When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus performs worse when subsequently applied to legal text. While this problem is well known, there has to this point been no systematic study of detecting shifts and investigating their impact on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1. Overall, our results indicate that measuring distribution shift can provide guidance on the amount of data needed for fine-tuning and on whether a model can be used "off-the-shelf" without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in defining NLP model pipelines.
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain
Felipe A. Siqueira, Douglas Vitório, Ellen Souza, José A. P. Santos, Hidelberg O. Albuquerque, Márcio S. Dias, Nádia F. F. Silva, André C. P. L. F. de Carvalho, Adriano L. I. Oliveira, Carmelo Bastos-Filho
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09762-8
The increasing use of artificial intelligence methods in the legal field has sparked interest in applying Natural Language Processing techniques to handle legal tasks and reduce the workload of legal professionals. However, the availability of legal corpora in Portuguese, especially for the Brazilian legal domain, is limited: existing resources offer some legal data but lack comprehensive coverage. To address this gap, we present Ulysses Tesemõ, a large corpus built specifically for the Brazilian legal domain. The corpus consists of over 3.5 million files, totaling 30.7 GiB of raw text, collected from 159 sources encompassing judicial, legislative, academic, news, and other related data. The data was collected by scraping public information from governmental websites, emphasizing content generated over the past two decades. We categorized the obtained files into 30 distinct categories, covering various branches of the Brazilian government and different types of texts. The corpus retains the original content with minimal data transformations, addressing the scarcity of Portuguese legal corpora and providing researchers with a valuable resource for advancing research in the area.
Historical Portuguese corpora: a survey
Tomás Freitas Osório, Henrique Lopes Cardoso
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09757-5
This survey aims to thoroughly examine and evaluate the current landscape of electronic corpora in historical Portuguese, through a comprehensive analysis of existing resources. The article makes two main contributions. The first is an exhaustive cataloguing of existing Portuguese historical corpora, in which each corpus is meticulously detailed regarding linguistic periods, geographic origins, and thematic contents. The second contribution focuses on the digital accessibility of these corpora for researchers. These contributions are crucial in enhancing and progressing the study of historical corpora in the Portuguese language, laying critical groundwork for future linguistic research in this field. Our survey identified 20 freely accessible corpora, comprising approximately 63.9 million tokens, and two private corpora, totalling 59.9 million tokens.
Šolar, the developmental corpus of Slovene
Špela Arhar Holdt, Iztok Kosem
Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09758-4
The paper presents the Šolar developmental corpus of Slovene, comprising the written language production of students in Slovene elementary and secondary schools, along with teacher feedback. The corpus consists of 5485 texts (1,635,407 words) and includes linguistically categorized teacher corrections, making it unique in reflecting authentic classroom correction practices. The paper addresses the corpus compilation, content and format, annotation, availability, and applicative value. While learner corpora are abundant, developmental corpora are less common. The paper bridges this gap by introducing the evolution from Šolar 1.0 to 3.0, emphasizing improvements in text collection, error and correction annotation, and categorization methodology. It also underlines the challenges and unresolved issues of compiling developmental corpora, most notably the lack of openly available tools and standards for the different steps of the compilation process. Overall, the Šolar corpus offers valuable insights into language learning and teaching, contributing to teacher training, empirical studies in applied linguistics, and natural language processing tasks.
ParlaMint-It: an 18-karat UD treebank of Italian parliamentary speeches
Chiara Alzetta, Simonetta Montemagni, Marta Sartor, Giulia Venturi
Pub Date: 2024-07-06, DOI: 10.1007/s10579-024-09748-6
The paper presents ParlaMint-It, a new treebank of Italian parliamentary debates, linguistically annotated based on the Universal Dependencies (UD) framework. The resource comprises 20,460 tokens and represents a hybrid language variety that is underrepresented in the UD initiative. ParlaMint-It results from a manual revision process that relies on a semi-automatic methodology for identifying the sentences most likely to contain inconsistencies and recurrent error patterns generated by the automatic annotation. This method made the revision process faster and more efficient than revising the entire treebank. In addition, it allowed the identification and correction of annotation errors arising from linguistic constructions inconsistently represented across UD treebanks and from characteristics specific to parliamentary speeches. Hence, the treebank is deemed an 18-karat resource: although not fully manually revised, it is a valuable resource for researchers working on Italian language processing tasks.