Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect
Pub Date: 2024-09-11 | DOI: 10.1007/s10579-024-09764-6
Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane
Sentiment analysis, the automated process of determining the emotions or opinions expressed in text, has seen extensive exploration in natural language processing. One aspect that has remained underrepresented, however, is sentiment analysis of the Moroccan dialect, which has a unique linguistic landscape in which multiple scripts coexist. Previous work in sentiment analysis primarily targeted dialects written in Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. Our study therefore emphasizes the importance of extending sentiment analysis to the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis, covering Moroccan dialect written not only in Arabic script but also in Latin characters. By assembling a diverse range of textual data, we constructed a dataset of 19,991 manually labeled texts in Moroccan dialect, and we also release publicly available lists of Moroccan-dialect stop words as a new contribution to Moroccan Arabic resources. We then conducted a comprehensive study of various machine-learning models to assess their compatibility with our dataset. The highest accuracy, 98.42%, was attained with the DarijaBert-mix transfer-learning model; among the deep-learning models we explored, a CNN reached a commendable accuracy of 92%. Furthermore, to affirm the reliability of our dataset, we tested the CNN model on smaller publicly available Moroccan-dialect datasets, with results that proved promising and supportive of our findings.
Studying word meaning evolution through incremental semantic shift detection
Pub Date: 2024-09-09 | DOI: 10.1007/s10579-024-09769-1
Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi
The study of semantic shift, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and the Social Sciences. The increasing availability of large diachronic corpora and advances in computational semantics have accelerated the development of computational approaches to detecting such shift. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called What is Done is Done (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal transactions in word meanings. Existing approaches to SSD (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) treat the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our approach is comparable to state-of-the-art approaches, while outperforming the state of the art for certain languages.
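To make the incremental setting concrete, here is a much-simplified sketch of clustering contextualised occurrence embeddings one period at a time: each new period either reinforces an existing sense cluster or opens a new one. It is only a stand-in for WiDiD's evolutionary clustering; the similarity threshold and the random vectors are assumptions.

```python
# Hedged sketch: incremental sense clustering of a word's contextualised embeddings,
# one time period at a time. A simplified stand-in for evolutionary clustering;
# the similarity threshold is an assumed value.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class IncrementalSenseClusters:
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.centroids = []  # one running-mean centroid per discovered sense cluster
        self.counts = []     # occurrences assigned to each cluster so far

    def add_period(self, embeddings):
        """Assign one period's occurrence embeddings; return this period's cluster counts."""
        period_counts = [0] * len(self.centroids)
        for e in embeddings:
            sims = [cosine(e, c) for c in self.centroids]
            if sims and max(sims) >= self.threshold:
                k = int(np.argmax(sims))
            else:  # nothing close enough: open a new sense cluster
                self.centroids.append(np.array(e, dtype=float))
                self.counts.append(0)
                period_counts.append(0)
                k = len(self.centroids) - 1
            self.counts[k] += 1
            self.centroids[k] += (e - self.centroids[k]) / self.counts[k]
            period_counts[k] += 1
        return period_counts

# Shifts in meaning show up as changes in the per-period cluster distribution.
rng = np.random.default_rng(0)
tracker = IncrementalSenseClusters()
for period in range(3):
    occurrence_embeddings = rng.normal(size=(20, 8))  # stand-in for BERT vectors
    print(period, tracker.add_period(occurrence_embeddings))
```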
{"title":"Studying word meaning evolution through incremental semantic shift detection","authors":"Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi","doi":"10.1007/s10579-024-09769-1","DOIUrl":"https://doi.org/10.1007/s10579-024-09769-1","url":null,"abstract":"<p>The study of <i>semantic shift</i>, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and Social Sciences. The increasing availability of large diachronic corpora and advance in computational semantics have accelerated the development of computational approaches to detecting such shift. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called <i>What is Done is Done</i> (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal <i>transactions</i> in word meanings. Existing approaches to SSD: (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) consider the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our results are comparable to state-of-the-art approaches, while outperforming the state-of-the-art for certain languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"26 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09763-7
Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich
In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for Verbal Multiword Expressions (VMWEs). The annotation process follows the guidelines put forward by PARSEME, a multilingual project covering more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic by training two state-of-the-art systems on our Arabic data.
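Since the abstract highlights the high degree of discontinuity in Arabic VMWEs, the sketch below shows one simple way such a figure can be computed, given each expression's token positions within its sentence. The input layout is an illustrative assumption, not code for the PARSEME file format.

```python
# Hedged sketch: quantifying VMWE discontinuity from token-position annotations.
# Each VMWE is represented as the list of 1-based positions of its tokens; this
# layout is an illustrative assumption.
def is_discontinuous(token_positions):
    """A VMWE is discontinuous if at least one foreign token intervenes."""
    pos = sorted(token_positions)
    return any(b - a > 1 for a, b in zip(pos, pos[1:]))

def discontinuity_rate(vmwes):
    if not vmwes:
        return 0.0
    return sum(is_discontinuous(p) for p in vmwes) / len(vmwes)

# Toy corpus: the second expression skips over the tokens at positions 3 and 4.
vmwes = [[3, 4], [2, 5], [7, 8, 9]]
print(f"discontinuous VMWEs: {discontinuity_rate(vmwes):.0%}")  # -> 33%
```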
{"title":"PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines","authors":"Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich","doi":"10.1007/s10579-024-09763-7","DOIUrl":"https://doi.org/10.1007/s10579-024-09763-7","url":null,"abstract":"<p>In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for Verbal Multiword Expressions (VMWEs). The annotation process is carried out based on guidelines put forward by PARSEME, a multilingual project for more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic, by training two state-of-the-art systems, on our Arabic data.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"146 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Normalized dataset for Sanskrit word segmentation and morphological parsing
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09724-0
Sriram Krishnan, Amba Kulkarni, Gérard Huet
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models, despite relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also has limitations at different levels of sentence analysis, such as the chunk, segment, stem and morphological levels. To overcome these limitations, we look at alternatives such as the Sanskrit Heritage Segmenter (SH) and the Saṃsādhanī tools, which provide information complementing the DCS data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performance of existing segmenters, specifically the Sanskrit Heritage Segmenter.
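The enrichment step described above can be pictured as a record merge: where a DCS analysis leaves a field unspecified, the corresponding Sanskrit Heritage analysis fills it in. The field names, and the assumption that the two analyses are already aligned segment by segment, are illustrative and not the corpus schema.

```python
# Hedged sketch of the enrichment idea: complete a DCS segment record with fields
# from an aligned Sanskrit Heritage (SH) analysis. Field names are assumptions.
from typing import Dict, List

def enrich_segment(dcs_seg: Dict, sh_seg: Dict) -> Dict:
    """Keep DCS values; take SH values only where DCS has none."""
    merged = dict(sh_seg)  # SH supplies the defaults
    merged.update({k: v for k, v in dcs_seg.items() if v is not None})
    return merged

def enrich_sentence(dcs_segs: List[Dict], sh_segs: List[Dict]) -> List[Dict]:
    # Assumes both analyses list the same segments in the same order.
    return [enrich_segment(d, s) for d, s in zip(dcs_segs, sh_segs)]

dcs = [{"segment": "dharmam", "stem": "dharma", "morph": None}]
sh = [{"segment": "dharmam", "stem": "dharma", "morph": "acc. sg. m."}]
print(enrich_sentence(dcs, sh))  # morph field filled in from SH
```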
{"title":"Normalized dataset for Sanskrit word segmentation and morphological parsing","authors":"Sriram Krishnan, Amba Kulkarni, Gérard Huet","doi":"10.1007/s10579-024-09724-0","DOIUrl":"https://doi.org/10.1007/s10579-024-09724-0","url":null,"abstract":"<p>Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts but also comes with its limitations at different levels of a sentence like chunk, segment, stem and morphological analysis. To overcome these limitations, we look at alternatives such as Sanskrit Heritage Segmenter (SH) and <i>Saṃsādhanī</i> tools, that provide information complementing DCS’ data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conversion of the Spanish WordNet databases into a Prolog-readable format
Pub Date: 2024-08-21 | DOI: 10.1007/s10579-024-09752-w
Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo
WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the Prolog programming language. Given the success and usefulness of WordNet, wordnets for other languages have been developed, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with Prolog. This work aims to fill this gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a Prolog-compatible format. Thanks to this translation, a set of Spanish lexical databases is obtained, allowing access to WordNet information using declarative techniques and the deductive capabilities of the Prolog language. This work also facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated Prolog databases, as well as of the databases from which we started, a complete series of integrity constraint tests has been carried out. In this way we have discovered some inconsistency problems in the MCR databases that are reflected in the generated Prolog databases; these problems have been reported to the owners of those databases.
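The differential-testing idea mentioned above can be sketched as follows: two independently written converters are run on the same source records and their outputs compared, so any disagreement exposes a bug in at least one of them. The record layout and the Prolog fact syntax below are illustrative assumptions, not the paper's actual conversion code.

```python
# Hedged sketch of differential testing for a data conversion: two independent
# implementations of the same specification must produce identical fact sets.
# The input records and the fact syntax are illustrative assumptions.
def converter_a(rows):
    return {f"synset('{r['id']}', {r['pos']})." for r in rows}

def converter_b(rows):
    # A second, independently written implementation of the same spec.
    return {"synset('%s', %s)." % (r["id"], r["pos"]) for r in rows}

def differential_test(rows):
    a, b = converter_a(rows), converter_b(rows)
    return {"only_in_a": a - b, "only_in_b": b - a}

rows = [{"id": "spa-30-00001740-n", "pos": "n"}]
diff = differential_test(rows)
assert not diff["only_in_a"] and not diff["only_in_b"], diff
print("converters agree on all", len(rows), "rows")
```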
{"title":"Conversion of the Spanish WordNet databases into a Prolog-readable format","authors":"Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo","doi":"10.1007/s10579-024-09752-w","DOIUrl":"https://doi.org/10.1007/s10579-024-09752-w","url":null,"abstract":"<p>WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the <span>Prolog</span> programming language. Given the success and usefulness of WordNet, wordnets of other languages have been developed, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with <span>Prolog</span>. This work aims to fill this gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a <span>Prolog</span>-compatible format. Thanks to this translation, a set of Spanish lexical databases are obtained, which allows access to WordNet information using declarative techniques and the deductive capabilities of the <span>Prolog</span> language. Also, this work facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated <span>Prolog</span> databases, as well as the databases from which we started, a complete series of integrity constraint tests have been carried out. In this way we have discovered some inconsistency problems in the MCR databases that have a reflection in the generated <span>Prolog</span> databases and have been reported to the owners of those databases.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"5 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09750-y
Ibtissam Touahri, Azzeddine Mazroui
Sentiment analysis is a natural language processing task that aims to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis dataset, the Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, a modern standard and dialectal Arabic corpus has been created and annotated according to three language types: Modern Standard Arabic, Moroccan Arabic dialect, and mixed language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics needed to enhance a sentiment classification system have been outlined. The proposed approach deploys several supervised models, including occurrence vectors, a Recurrent Neural Network with Long Short-Term Memory, and the pre-trained transformer model Arabic Bidirectional Encoder Representations from Transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in manually constructing and annotating a dialectal sentiment corpus and carefully studying its main characteristics, which are then used to feed the classical supervised models. Moreover, GANs that widen the gap between the studied classes have been used to enhance the results obtained with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against the state-of-the-art systems Mazajak and CAMelTools, designed for most Arabic dialects, using the datasets mentioned above. A significant improvement of 30 points in F^NN has been observed. These results affirm the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal and multi-domain datasets, both balanced and unbalanced.
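The GAN component can be pictured with a compact GAN-BERT-style sketch in PyTorch: a generator produces fake sentence-level features, and the discriminator classifies inputs into the sentiment classes plus an extra "fake" class. This is only a simplified illustration under assumed dimensions; it is not the architecture or training schedule used with AraBERT in the paper, and the feature-matching term usually added to the generator loss is omitted.

```python
# Hedged, simplified GAN-BERT-style sketch (PyTorch). Real features would come from
# an AraBERT encoder (e.g., the [CLS] vector); random tensors stand in for them here.
# Dimensions, class count and losses are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES, FEAT_DIM, NOISE_DIM = 3, 768, 100  # positive / negative / mixed

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE_DIM, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, FEAT_DIM))
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Logits over the k sentiment classes plus one extra 'fake' class."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, NUM_CLASSES + 1))
    def forward(self, feats):
        return self.net(feats)

gen, disc = Generator(), Discriminator()
ce = nn.CrossEntropyLoss()

real_feats = torch.randn(8, FEAT_DIM)                          # stand-in for encoder output
real_labels = torch.randint(0, NUM_CLASSES, (8,))
fake_feats = gen(torch.randn(8, NOISE_DIM))
fake_labels = torch.full((8,), NUM_CLASSES, dtype=torch.long)  # index of the 'fake' class

# Discriminator: real features toward their sentiment class, generated ones toward 'fake'.
d_loss = ce(disc(real_feats), real_labels) + ce(disc(fake_feats.detach()), fake_labels)

# Generator: generated features should not be recognised as 'fake'.
p_fake = torch.softmax(disc(fake_feats), dim=-1)[:, NUM_CLASSES]
g_loss = -torch.log(1.0 - p_fake + 1e-8).mean()
print(float(d_loss), float(g_loss))
```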
{"title":"Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers","authors":"Ibtissam Touahri, Azzeddine Mazroui","doi":"10.1007/s10579-024-09750-y","DOIUrl":"https://doi.org/10.1007/s10579-024-09750-y","url":null,"abstract":"<p>Sentiment analysis is a task in natural language processing aiming to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated based on the three language types: modern standard Arabic, Moroccan Arabic Dialect, and Mixed Language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics to enhance a sentiment classification system have been outlined. The proposed approach has involved deploying several models employing the supervised approach, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in constructing and annotating manually a dialectal sentiment corpus and studying carefully its main characteristics, which are used then to feed the classical supervised model. Moreover, GANs that widen the gap between the studied classes have been used to enhance the obtained results with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against Mazajak and CAMelTools state-of-the-art systems, designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F<sup>NN</sup> has been observed. These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chinese-DiMLex: a lexicon of Chinese discourse connectives
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09761-9
Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho
Machine-readable inventories of connectives that provide information on multiple levels are a useful resource for automated discourse parsing, machine translation, text summarization, argumentation mining, and related tasks. Despite Chinese being one of the world's most widely spoken languages and having a wealth of annotated corpora, such a lexicon for Chinese has remained absent, whereas lexicons for many other languages have long been established. In this paper, we present 226 Chinese discourse connectives, augmented with morphological variations, syntactic (part-of-speech) and semantic (PDTB 3.0 sense inventory) information, usage examples and English translations. The resulting lexicon, Chinese-DiMLex, is made publicly available in XML format and is included in connective-lex.info, a platform specifically designed for human-friendly browsing of connective lexicons across languages. We describe the creation process of the lexicon and discuss several Chinese-specific considerations and issues that arose in the process. By demonstrating the process, we hope not only to contribute to research and educational purposes, but also to inspire researchers to use our method as a reference for building lexicons for their (native) language(s).
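To give a feel for what a machine-readable connective entry looks like, here is a small sketch of parsing one with ElementTree. The element and attribute names are assumptions loosely modelled on DiMLex-style lexicons, not the actual Chinese-DiMLex schema.

```python
# Hedged sketch: reading a DiMLex-style connective entry. The XML structure below
# is an illustrative assumption, not the released Chinese-DiMLex schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<entry id="1" word="但是">
  <orth variant="可是"/>
  <syn pos="conjunction"/>
  <sense pdtb3="Comparison.Concession"/>
  <translation lang="en">but</translation>
</entry>
"""

entry = ET.fromstring(SAMPLE)
print(entry.get("word"),                                  # 但是 ("but")
      [o.get("variant") for o in entry.findall("orth")],  # orthographic variants
      entry.find("syn").get("pos"),
      entry.find("sense").get("pdtb3"),
      entry.find("translation").text)
```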
{"title":"Chinese-DiMLex: a lexicon of Chinese discourse connectives","authors":"Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho","doi":"10.1007/s10579-024-09761-9","DOIUrl":"https://doi.org/10.1007/s10579-024-09761-9","url":null,"abstract":"<p>Machine-readable inventories of connectives that provide information on multiple levels are a useful resource for automated discourse parsing, machine translation, text summarization and argumentation mining, etc. Despite Chinese being one of the world’s most widely spoken languages and having a wealth of annotated corpora, such a lexicon for Chinese still remains absent. In contrast, lexicons for many other languages have long been established. In this paper, we present 226 Chinese discourse connectives, augmented with morphological variations, syntactic (part-of-speech) and semantic (PDBT3.0 sense inventory) information, usage examples and English translations. The resulting lexicon, Chinese-DiMLex, is made publicly available in XML format, and is included in <i>connective-lex.info</i>, a platform specifically designed for human-friendly browsing of connective lexicons across languages. We describe the creation process of the lexicon, and discuss several Chinese-specific considerations and issues arising and discussed in the process. By demonstrating the process, we hope not only to contribute to research and educational purposes, but also to inspire researchers to use our method as a reference for building lexicons for their (native) language(s).</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"7 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perspectivist approaches to natural language processing: a survey
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09766-4
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, Davide Bernardi
In Artificial Intelligence research, perspectivism is an approach to machine learning that aims at leveraging data annotated by different individuals in order to model varied perspectives that influence their opinions and world view. We present the first survey of datasets and methods relevant to perspectivism in Natural Language Processing (NLP). We review datasets in which individual annotator labels are preserved, as well as research papers focused on analysing and modelling human perspectives for NLP tasks. Our analysis is based on targeted questions that aim to surface how different perspectives are taken into account, what the novelties and advantages of perspectivist approaches/methods are, and the limitations of these works. Most of the included works have a perspectivist goal, even if some of them do not explicitly discuss perspectivism. A sizeable portion of these works are focused on highly subjective phenomena in natural language where humans show divergent understandings and interpretations, for example in the annotation of toxic and otherwise undesirable language. However, in seemingly objective tasks too, human raters often show systematic disagreement. Through the framework of perspectivism we summarize the solutions proposed to extract and model different points of view, and how to evaluate and explain perspectivist models. Finally, we list the key concepts that emerge from the analysis of the sources and several important observations on the impact of perspectivist approaches on future research in NLP.
Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09767-3
Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade
The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval investigates how to handle these datasets efficiently, enabling the retrieval of pertinent information from them. Relevance Feedback, an important aspect of Information Retrieval systems, utilizes the relevance information provided by the user to enhance document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. Thus, this paper presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this corpus is the first publicly available corpus of its kind for Brazilian Portuguese. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature primarily focus on judicial texts. We also used the corpus to evaluate the performance of the Brazilian Chamber of Deputies' Information Retrieval system. In doing so, we highlighted the system's strong performance and emphasized the dataset's significance for the field of Legal Information Retrieval.
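As one standard illustration of how relevance judgments like those in Ulysses-RFCorpus can be exploited, the sketch below applies classic Rocchio feedback over TF-IDF vectors. It is not the Chamber of Deputies' retrieval system; the documents, query and weights are toy assumptions.

```python
# Hedged sketch: Rocchio relevance feedback over TF-IDF vectors. The documents,
# query and alpha/beta/gamma weights are toy assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["imposto sobre a renda de pessoa física",
        "projeto de lei sobre imposto de renda",
        "regimento interno da câmara dos deputados"]
query = "imposto de renda"

vec = TfidfVectorizer()
D = vec.fit_transform(docs)
q = vec.transform([query])

# Suppose the user judged document 1 relevant and document 2 non-relevant.
relevant, non_relevant = D[1], D[2]
alpha, beta, gamma = 1.0, 0.75, 0.15
q_new = alpha * q + beta * relevant - gamma * non_relevant

print(cosine_similarity(q, D))      # ranking before feedback
print(cosine_similarity(q_new, D))  # ranking after feedback: document 1 moves up
```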
{"title":"Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies","authors":"Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade","doi":"10.1007/s10579-024-09767-3","DOIUrl":"https://doi.org/10.1007/s10579-024-09767-3","url":null,"abstract":"<p>The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval focuses on investigating how to efficiently handle these datasets, enabling the retrieval of pertinent information from them. Relevance Feedback, an important aspect of Information Retrieval systems, utilizes the relevance information provided by the user to enhance document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. Thus, this paper presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this corpus is the first publicly available of its kind for the Brazilian Portuguese language. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature primarily focus on judicial texts. We also used the corpus to evaluate the performance of the Brazilian Chamber of Deputies’ Information Retrieval system. Thereby, we highlighted the model’s strong performance and emphasized the dataset’s significance in the field of Legal Information Retrieval.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"41 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PESTS: Persian_English cross lingual corpus for semantic textual similarity
Pub Date: 2024-08-03 | DOI: 10.1007/s10579-024-09759-3
Mohammad Abdous, Poorya Piroozfar, Behrouz MinaeiBidgoli
In recent years, there has been significant research interest in the natural language processing subtask of semantic textual similarity. Measuring semantic similarity between words or terms, sentences, paragraphs and documents plays an important role in natural language processing and computational linguistics. It finds applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the extent of similarity in meaning between two textual documents, paragraphs, or sentences, both within the same language and across different languages. To achieve cross-lingual semantic similarity, it is essential to have corpora that consist of sentence pairs in both the source and target languages, where the pairs exhibit a certain degree of semantic similarity. Due to the lack of available cross-lingual semantic similarity datasets, many current models in this field rely on machine translation. However, this dependence on machine translation can reduce model accuracy due to the potential propagation of translation errors. In the case of Persian, a low-resource language, there has been little effort to develop models that can comprehend the context of both languages, and the need for a model that can bridge this gap is now more pressing than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences has been produced for the first time, through the collaboration of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned using this dataset. Based on the results obtained on the PESTS dataset, the use of the XLM-RoBERTa model increases the Pearson correlation from 85.87 to 95.62%.
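A small sketch of the evaluation loop the abstract implies: scoring Persian-English pairs with a multilingual sentence encoder and reporting the Pearson correlation against gold similarity scores. The encoder checkpoint, the toy pairs and the gold values are illustrative assumptions, not the PESTS data or the fine-tuned XLM-RoBERTa model.

```python
# Hedged sketch: scoring cross-lingual sentence pairs and evaluating with Pearson
# correlation. The checkpoint, sentence pairs and gold scores are assumptions.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

pairs = [("او به دانشگاه می‌رود", "She goes to the university"),
         ("هوا امروز سرد است", "The weather is cold today"),
         ("کتاب روی میز است", "The cat is sleeping")]
gold = [4.8, 4.5, 0.5]  # similarity on a 0-5 scale (toy values)

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
fa = model.encode([p[0] for p in pairs], convert_to_tensor=True)
en = model.encode([p[1] for p in pairs], convert_to_tensor=True)
pred = [float(util.cos_sim(fa[i], en[i])) for i in range(len(pairs))]

r, _ = pearsonr(pred, gold)
print(f"Pearson r = {r:.3f}")
```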
{"title":"PESTS: Persian_English cross lingual corpus for semantic textual similarity","authors":"Mohammad Abdous, Poorya Piroozfar, Behrouz MinaeiBidgoli","doi":"10.1007/s10579-024-09759-3","DOIUrl":"https://doi.org/10.1007/s10579-024-09759-3","url":null,"abstract":"<p>In recent years, there has been significant research interest in the subtask of natural language processing called semantic textual similarity. Measuring semantic similarity between words or terms, sentences, paragraphs and documents plays an important role in natural language processing and computational linguistics. It finds applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the extent of similarity in meaning between two textual documents, paragraphs, or sentences, both in the same language and across different languages. To achieve cross-lingual semantic similarity, it is essential to have corpora that consist of sentence pairs in both the source and target languages. These sentence pairs should demonstrate a certain degree of semantic similarity between them. Due to the lack of available cross-lingual semantic similarity datasets, many current models in this field rely on machine translation. However, this dependence on machine translation can result in reduced model accuracy due to the potential propagation of translation errors. In the case of Persian, which is categorized as a low-resource language, there has been a lack of efforts in developing models that can comprehend the context of two languages. The demand for such a model that can bridge the understanding gap between languages is now more crucial than ever. In this article, the corpus of semantic textual similarity between sentences in Persian and English languages has been produced for the first time through the collaboration of linguistic experts. We named this dataset PESTS (<b>P</b>ersian <b>E</b>nglish <b>S</b>emantic <b>T</b>extual <b>S</b>imilarity). This corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned using this dataset. Based on the results obtained from the PESTS dataset, it is observed that the utilization of the XLM_ROBERTa model leads to an increase in the Pearson correlation from 85.87 to 95.62%.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"10 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}