Pub Date: 2026-01-01 | Epub Date: 2026-03-14 | DOI: 10.1007/s10579-025-09876-7 | Language Resources and Evaluation 60(2): 33
Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation
Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth J F Jones, Alan F Smeaton, Goran Nenadic
In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, which have a verb as the head of the studied terms. The annotated vMWEs are also manually aligned bilingually and multilingually. The languages covered include Arabic, Chinese, English, German, Italian, and Polish; the Arabic corpus includes both the standard variety and dialectal variants from Egypt and Tunisia. Our original English corpus is taken from the PARSEME shared task of 2018. We performed machine translation of this source corpus, followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors: each MT output sentence first received manual post-editing and annotation, followed by a second manual quality check. One of our findings during corpus preparation is that accurate translation of MWEs presents challenges to MT systems, as reflected by the outcomes of the human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT systems for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Thanks to the noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE dataset will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.
{"title":"Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation.","authors":"Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth J F Jones, Alan F Smeaton, Goran Nenadic","doi":"10.1007/s10579-025-09876-7","DOIUrl":"https://doi.org/10.1007/s10579-025-09876-7","url":null,"abstract":"<p><p>In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include Arabic, Chinese, English, German, Italian, and Polish, of which, the Arabic corpus includes both standard and dialectal variations from Egypt and Tunisia. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post-editing and annotation plus a second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems, as reflected by the outcomes of human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT systems for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Because of the noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE data set will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"60 2","pages":"33"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12989027/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147469850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | Epub Date: 2026-02-28 | DOI: 10.1007/s10579-025-09887-4 | Language Resources and Evaluation 60(2): 27
OjibweMorph: an approachable finite-state transducer for Ojibwe (and beyond)
Christopher Hammerly, Nora Livesay, Antti Arppe, Anna Stacey, Miikka Silfverberg
This paper describes the design, evaluation, and application of OjibweMorph, a finite-state transducer (FST) for generating and analyzing words in the Central Algonquian language Ojibwe. We created a language-general modular system for building FSTs from human- and machine-readable spreadsheets, in which sets of inflectional and derivational morphology can be defined, combined with a lexical database, and automatically compiled into an FST. We show how this system is applied to generate and analyze the complex nominal and verbal morphology of Ojibwe, with an eye towards how our framework and toolkit can be used to create FSTs for other morphologically complex languages. We evaluate the Ojibwe version of the system by checking the model's performance against a set of inflectional forms and example sentences from the Ojibwe People's Dictionary, and describe the application of the FST to create a linguistically analyzed corpus, an automatic verb conjugation tool for education, a spell-checker, and intelligent dictionary search.
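The spreadsheet-to-FST idea can be pictured with a small sketch. The Python below is a minimal, dependency-free illustration of the core pipeline (morphological rows compiled into generation and analysis mappings); the CSV columns, tags, and example forms are invented for illustration, and a real pipeline like the one described would compile such rows into an actual FST with a toolkit such as foma or HFST rather than into dictionaries.

```python
import csv, io

# Hypothetical spreadsheet rows pairing an analysis template with a surface
# template. Column names and tag strings are illustrative only.
SPREADSHEET = """lemma,tags,prefix,suffix
waabam,VTA+Ind+1SgSubj+2SgObj,gi,in
waabam,VTA+Ind+2SgSubj+1SgObj,gi,i
"""

def compile_rules(csv_text):
    """Compile spreadsheet rows into generation/analysis tables.
    A real system would emit FST transitions here instead of dict entries."""
    gen, ana = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        analysis = f"{row['lemma']}+{row['tags']}"
        surface = f"{row['prefix']}{row['lemma']}{row['suffix']}"
        gen[analysis] = surface                         # generation: analysis -> surface
        ana.setdefault(surface, []).append(analysis)    # analysis: surface -> analyses
    return gen, ana

gen, ana = compile_rules(SPREADSHEET)
print(gen["waabam+VTA+Ind+1SgSubj+2SgObj"])  # -> giwaabamin
print(ana["giwaabamin"])                     # -> ['waabam+VTA+Ind+1SgSubj+2SgObj']
```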
{"title":"OjibweMorph: an approachable finite-state transducer for Ojibwe (and beyond).","authors":"Christopher Hammerly, Nora Livesay, Antti Arppe, Anna Stacey, Miikka Silfverberg","doi":"10.1007/s10579-025-09887-4","DOIUrl":"https://doi.org/10.1007/s10579-025-09887-4","url":null,"abstract":"<p><p>This paper describes the design, evaluation, and application of <i>OjibweMorph</i>, a finite-state transducer (FST) for generating and analyzing words in the Central Algonquian language Ojibwe. We created a language-general modular system for creating FSTs from human- and machine-readable spreadsheets, where sets of inflectional and derivational morphology can be defined, combined with a lexical database, and automatically compiled into an FST. We show how this system is applied to generate and analyze the complex nominal and verbal morphology in Ojibwe, with an eye towards how our framework and toolkit can be used to create FSTs for other morphologically complex languages. We evaluate the Ojibwe version of the system by checking the model's performance against a set of inflectional forms and example sentences from the Ojibwe People's Dictionary, and describe the application of the FST to create a linguistically analyzed corpus, an automatic verb conjugation tool for education, a spell-checker, and intelligent dictionary search.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"60 2","pages":"27"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12950098/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147345764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-02-19 | DOI: 10.1007/s10579-025-09813-8 | Language Resources and Evaluation 59(3): 2415-2426
The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war
Serhii Zasiekin, Larysa Zasiekina, Emilie Altman, Mariia Hryntus, Victor Kuperman
Documentation and analysis of psychological states experienced by witnesses and survivors of catastrophic events is a critical concern of psychological research. This paper introduces a new corpus of written testimonies collected from nearly 1500 Ukrainian civilians between May 2022 and January 2024, during Russia's invasion of Ukraine. The texts are available in the original Ukrainian and in English translation. The Narratives of War (NoW) corpus additionally contains demographic and geographic data on respondents, as well as their scores in tests of PTSD symptoms and moral injury. The paper provides a detailed introduction to the method of data collection and the corpus structure. It also reports a quantitative frequency-based "keyness" analysis that identifies words particularly representative of the NoW corpus, compared to a reference corpus of Ukrainian texts that predates the war with Russia. These keywords shed light on the psychological state of witnesses of war. With its materials collected during the ongoing war, the corpus contributes to the body of knowledge for studies of the psychological impact of war and trauma on civilian populations.
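Frequency-based keyness of this kind is typically computed with a log-likelihood ratio comparing a word's frequency in the target and reference corpora. The sketch below illustrates the standard Dunning G2 computation, not the NoW authors' own code; the two tiny word lists are toy stand-ins for the corpora.

```python
import math
from collections import Counter

def g2_keyness(target_counts, ref_counts):
    """Dunning's log-likelihood (G2) keyness: higher scores mark words
    over-represented in the target corpus relative to the reference."""
    n_t, n_r = sum(target_counts.values()), sum(ref_counts.values())
    scores = {}
    for word, a in target_counts.items():
        b = ref_counts.get(word, 0)
        # expected frequencies under the null (word equally likely in both corpora)
        e_t = n_t * (a + b) / (n_t + n_r)
        e_r = n_r * (a + b) / (n_t + n_r)
        scores[word] = 2 * (a * math.log(a / e_t)
                            + (b * math.log(b / e_r) if b else 0))
    return scores

target = Counter("shelling siren basement fear fear siren".split())
reference = Counter("weather garden siren holiday fear".split())
for w, s in sorted(g2_keyness(target, reference).items(), key=lambda kv: -kv[1]):
    print(f"{w}\t{s:.2f}")
```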
{"title":"The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war.","authors":"Serhii Zasiekin, Larysa Zasiekina, Emilie Altman, Mariia Hryntus, Victor Kuperman","doi":"10.1007/s10579-025-09813-8","DOIUrl":"10.1007/s10579-025-09813-8","url":null,"abstract":"<p><p>Documentation and analysis of psychological states experienced by witnesses and survivors of catastrophic events is a critical concern of psychological research. This paper introduces the new corpus of written testimonies collected from nearly 1500 Ukrainian civilians from May 2022-January 2024, during Russia's invasion of Ukraine. The texts are available in the original Ukrainian and the English translation. The Narratives of War (NoW) corpus additionally contains demographic and geographic data on respondents, as well as their scores in tests of PTSD symptoms and moral injury. The paper provides a detailed introduction into the method of data collection and corpus structure. It also reports a quantitative frequency-based \"keyness\" analysis that identifies words particularly representative of the NoW corpus, as compared to the reference corpus of Ukrainian texts that predates the war with Russia. These key words shed light on the psychological state of witnesses of war. With its materials collected during the ongoing war, the corpus contributes to the body of knowledge for studies of the psychological impact of war and trauma on civilian populations.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"59 3","pages":"2415-2426"},"PeriodicalIF":1.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12296800/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144734916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2024-10-09 | DOI: 10.1007/s10579-024-09776-2 | Language Resources and Evaluation 59(2): 1705-1718
VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability
Borja Herce
This paper presents VeLeSpa, a verbal lexicon of Peninsular Spanish, which contains the full paradigms (all 63 cells) of 6553 verbs in phonological form, along with their corresponding frequencies. The process and the decisions involved in building the resource are presented. In addition, based on the 3000+ most frequent verbs, a quantitative analysis of morphological predictability in Spanish verbal inflection is conducted. The results and their drivers are discussed, as well as observed differences from other Romance languages and from Latin.
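Paradigmatic predictability of this kind is commonly quantified as conditional entropy: how uncertain the form of one paradigm cell is once the form of another cell is known. The sketch below illustrates that standard computation on toy data; it is not the paper's code, and the example endings and verb classes are invented.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(B | A) over (exponent_a, exponent_b) pairs, one per lexeme:
    0 bits means cell B is fully predictable from cell A."""
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n                  # joint probability of the (a, b) pattern
        p_b_given_a = c / marg_a[a]   # how predictive cell A's exponent is
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Toy paradigms: (1sg.prs ending, 3sg.prs ending) for a few invented verbs
pairs = [("o", "a"), ("o", "a"), ("o", "e"), ("go", "e"), ("go", "e")]
print(f"H(3SG | 1SG) = {conditional_entropy(pairs):.3f} bits")  # ~0.551
```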
{"title":"VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability.","authors":"Borja Herce","doi":"10.1007/s10579-024-09776-2","DOIUrl":"10.1007/s10579-024-09776-2","url":null,"abstract":"<p><p>This paper presents VeLeSpa, a verbal lexicon of Peninsular Spanish, which contains the full paradigms (all 63 cells) in phonological form of 6553 verbs, along with their corresponding frequencies. In this paper, the process and decisions involved in the building of the resource are presented. In addition, based on the most frequent 3000 + verbs, a quantitative analysis is conducted of morphological predictability in Spanish verbal inflection. The results and their drivers are discussed, as well as observed differences with other Romance languages and Latin.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"59 2","pages":"1705-1718"},"PeriodicalIF":1.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086111/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144112555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-11 | DOI: 10.1007/s10579-024-09764-6
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect
Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane
Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis, one that incorporates Moroccan dialect written not only in Arabic script but also in Latin characters. By assembling a diverse range of textual data, we constructed a dataset of 19,991 manually labeled texts in Moroccan dialect, together with publicly available lists of Moroccan-dialect stop words, a new contribution to Moroccan Arabic resources. In our exploration of sentiment analysis, we undertook a comprehensive study encompassing various machine-learning models to assess their compatibility with our dataset. While our investigation revealed that the highest accuracy, 98.42%, was attained with the DarijaBert-mix transfer-learning model, we also examined deep learning models: a CNN model reached a commendable accuracy of 92%. Furthermore, to affirm the reliability of our dataset, we tested the CNN model on smaller publicly available Moroccan-dialect datasets, with promising results that support our findings.
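The CNN route mentioned above follows the standard text-CNN recipe: embed tokens, run parallel 1-D convolutions of several widths, max-pool over time, and classify. The sketch below is a generic illustration of that recipe, not the authors' model; vocabulary size, filter widths, and class count are placeholder values.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal Kim-style text CNN for sentiment classification."""
    def __init__(self, vocab_size=30000, emb_dim=128, n_filters=100,
                 widths=(3, 4, 5), n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        # one max-pooled feature vector per filter width
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # (batch, n_classes)

model = TextCNN()
batch = torch.randint(1, 30000, (8, 40))  # 8 dummy sentences, 40 token ids each
print(model(batch).shape)                 # torch.Size([8, 2])
```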
{"title":"Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect","authors":"Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane","doi":"10.1007/s10579-024-09764-6","DOIUrl":"https://doi.org/10.1007/s10579-024-09764-6","url":null,"abstract":"<p>Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is the sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis that incorporates not only Moroccan dialect written in Arabic script but also in Latin characters. By assembling a diverse range of textual data, we were able to construct a dataset with a range of 19,991 manually labeled texts in Moroccan dialect and also publicly available lists of stop words in Moroccan dialect as a new contribution to Moroccan Arabic resources. In our exploration of sentiment analysis, we undertook a comprehensive study encompassing various machine-learning models to assess their compatibility with our dataset. While our investigation revealed that the highest accuracy of 98.42% was attained through the utilization of the DarijaBert-mix transfer-learning model, we also delved into deep learning models. Notably, our experimentation yielded a commendable accuracy rate of 92% when employing a CNN model. Furthermore, in an effort to affirm the reliability of our dataset, we tested the CNN model using smaller publicly available datasets of Moroccan dialect, with results that proved to be promising and supportive of our findings.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"6 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-09 | DOI: 10.1007/s10579-024-09769-1
Studying word meaning evolution through incremental semantic shift detection
Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi
The study of semantic shift, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and the Social Sciences. The increasing availability of large diachronic corpora and advances in computational semantics have accelerated the development of computational approaches to detecting such shifts. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called What is Done is Done (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal transactions in word meanings. Existing approaches to SSD (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) treat the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our approach is comparable to state-of-the-art approaches, while outperforming the state of the art for certain languages.
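The incremental flavour of SSD can be pictured as clustering each new period's contextualised embeddings together with a compact memory of earlier periods (for example, cluster centroids), so that sense clusters evolve as the corpus grows. The sketch below illustrates that general idea with k-means as a stand-in clusterer; it is not the WiDiD algorithm itself, and the embedding inputs are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def incremental_ssd(periods, k=3, seed=0):
    """Cluster each period's usage embeddings together with the centroids
    carried over from earlier periods, tracking sense-cluster drift."""
    memory = np.empty((0, periods[0].shape[1]))   # centroids summarising the past
    history = []
    for t, usages in enumerate(periods):
        data = np.vstack([memory, usages])
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        memory = km.cluster_centers_              # compact summary of periods <= t
        # cluster sizes over the current period's usages only (appended last)
        sizes = np.bincount(km.labels_[-len(usages):], minlength=k)
        history.append(sizes / sizes.sum())
        print(f"period {t}: sense distribution {np.round(history[-1], 2)}")
    return history

rng = np.random.default_rng(0)
periods = [rng.normal(loc=m, size=(50, 8)) for m in (0.0, 0.2, 1.0)]
incremental_ssd(periods)
```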
{"title":"Studying word meaning evolution through incremental semantic shift detection","authors":"Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi","doi":"10.1007/s10579-024-09769-1","DOIUrl":"https://doi.org/10.1007/s10579-024-09769-1","url":null,"abstract":"<p>The study of <i>semantic shift</i>, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and Social Sciences. The increasing availability of large diachronic corpora and advance in computational semantics have accelerated the development of computational approaches to detecting such shift. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called <i>What is Done is Done</i> (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal <i>transactions</i> in word meanings. Existing approaches to SSD: (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) consider the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our results are comparable to state-of-the-art approaches, while outperforming the state-of-the-art for certain languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"26 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09763-7
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines
Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich
In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for verbal multiword expressions (VMWEs). The annotation follows the guidelines put forward by PARSEME, a multilingual project covering more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic, obtained by training two state-of-the-art systems on our Arabic data.
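Discontinuity of the kind reported here is straightforward to measure on PARSEME-style data: a VMWE is discontinuous if the token positions it covers are not consecutive. The sketch below illustrates the check using the per-token VMWE column conventions of the PARSEME .cupt format ('*' for no VMWE, '1:VID' opening VMWE 1 with its category, '1' continuing it); the toy sentence is invented and the parsing is simplified.

```python
from collections import defaultdict

def vmwe_spans(mwe_column):
    """Map each VMWE id to the 1-based token positions it covers,
    reading a list of .cupt-style PARSEME:MWE column values."""
    spans = defaultdict(list)
    for pos, cell in enumerate(mwe_column, start=1):
        if cell == "*":
            continue
        for part in cell.split(";"):  # a token may belong to several VMWEs
            spans[int(part.split(":")[0])].append(pos)
    return spans

def discontinuous(positions):
    # a gap exists if the span is wider than the number of member tokens
    return max(positions) - min(positions) + 1 > len(positions)

# Toy sentence: tokens 2 and 5 form VMWE 1 with a gap (discontinuous)
column = ["*", "1:VID", "*", "*", "1", "*"]
for vid, pos in vmwe_spans(column).items():
    print(vid, pos, "discontinuous" if discontinuous(pos) else "contiguous")
```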
{"title":"PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines","authors":"Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich","doi":"10.1007/s10579-024-09763-7","DOIUrl":"https://doi.org/10.1007/s10579-024-09763-7","url":null,"abstract":"<p>In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for Verbal Multiword Expressions (VMWEs). The annotation process is carried out based on guidelines put forward by PARSEME, a multilingual project for more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic, by training two state-of-the-art systems, on our Arabic data.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"146 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09724-0
Normalized dataset for Sanskrit word segmentation and morphological parsing
Sriram Krishnan, Amba Kulkarni, Gérard Huet
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models, despite relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also has limitations at several levels of sentence analysis, such as chunks, segments, stems, and morphological analyses. To overcome these limitations, we look at alternatives such as the Sanskrit Heritage Segmenter (SH) and the Saṃsādhanī tools, which provide information complementing the DCS data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performance of existing segmenters, specifically the Sanskrit Heritage Segmenter.
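Enrichment of this kind amounts to aligning the two resources sentence by sentence and merging fields that one resource has and the other lacks. The sketch below is a schematic illustration of such a merge; the record layout and field names are invented for illustration and are not the actual DCS or SH schemas.

```python
# Schematic merge of two analyses of the same sentence, keyed by sentence id.
# Field names (segments, stems, compound_splits) are illustrative only.
dcs = {101: {"segments": ["rāmo", "gacchati"], "stems": ["rāma", "gam"]}}
sh  = {101: {"segments": ["rāmaḥ", "gacchati"], "compound_splits": [[], []]}}

def enrich(dcs_rec, sh_rec):
    """Keep the DCS lexical/morphological fields, add fields only SH
    provides, and record disagreements (e.g., sandhi-normalised segments)."""
    merged = dict(dcs_rec)
    for field, value in sh_rec.items():
        if field not in merged:
            merged[field] = value  # SH-only information
        elif merged[field] != value:
            merged.setdefault("conflicts", {})[field] = (merged[field], value)
    return merged

for sid in dcs.keys() & sh.keys():
    print(sid, enrich(dcs[sid], sh[sid]))
```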
{"title":"Normalized dataset for Sanskrit word segmentation and morphological parsing","authors":"Sriram Krishnan, Amba Kulkarni, Gérard Huet","doi":"10.1007/s10579-024-09724-0","DOIUrl":"https://doi.org/10.1007/s10579-024-09724-0","url":null,"abstract":"<p>Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts but also comes with its limitations at different levels of a sentence like chunk, segment, stem and morphological analysis. To overcome these limitations, we look at alternatives such as Sanskrit Heritage Segmenter (SH) and <i>Saṃsādhanī</i> tools, that provide information complementing DCS’ data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-21 | DOI: 10.1007/s10579-024-09752-w
Conversion of the Spanish WordNet databases into a Prolog-readable format
Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo
WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the Prolog programming language. Given the success and usefulness of WordNet, wordnets have been developed for other languages, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with Prolog. This work aims to fill that gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a Prolog-compatible format. This translation yields a set of Spanish lexical databases that allow access to WordNet information using declarative techniques and the deductive capabilities of the Prolog language. It also facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated Prolog databases, as well as of the databases from which we started, a complete series of integrity-constraint tests has been carried out. In this way we have discovered some inconsistency problems in the MCR databases that are reflected in the generated Prolog databases; these have been reported to the owners of those databases.
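Differential testing here means extracting the same information through two independent paths and diffing the results: for instance, the (synset, word) pairs read directly from the MCR source files versus the pairs parsed back out of the generated Prolog facts. The sketch below illustrates that idea in Python; the fact layout (a WordNet-style s/6 predicate) and the file contents are simplified placeholders, not the paper's actual formats.

```python
import re

# Source-side view: pairs read from an MCR-style tab-separated file (toy data)
mcr_rows = "spa-30-00001740-n\tentidad\nspa-30-00001740-n\tente\n"
source_pairs = {tuple(line.split("\t")) for line in mcr_rows.splitlines()}

# Target-side view: the same pairs parsed back from generated Prolog facts
prolog_facts = """
s('spa-30-00001740-n', 1, 'entidad', n, 1, 0).
s('spa-30-00001740-n', 2, 'ente', n, 1, 0).
"""
fact = re.compile(r"s\('([^']+)',\s*\d+,\s*'([^']+)'")
target_pairs = {m.groups() for m in fact.finditer(prolog_facts)}

# Differential check: both extraction paths must agree exactly
missing = source_pairs - target_pairs    # lost during conversion
spurious = target_pairs - source_pairs   # introduced by conversion
print("missing:", missing, "spurious:", spurious)
assert not missing and not spurious, "conversion disagrees with the source"
```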
{"title":"Conversion of the Spanish WordNet databases into a Prolog-readable format","authors":"Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo","doi":"10.1007/s10579-024-09752-w","DOIUrl":"https://doi.org/10.1007/s10579-024-09752-w","url":null,"abstract":"<p>WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the <span>Prolog</span> programming language. Given the success and usefulness of WordNet, wordnets of other languages have been developed, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with <span>Prolog</span>. This work aims to fill this gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a <span>Prolog</span>-compatible format. Thanks to this translation, a set of Spanish lexical databases are obtained, which allows access to WordNet information using declarative techniques and the deductive capabilities of the <span>Prolog</span> language. Also, this work facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated <span>Prolog</span> databases, as well as the databases from which we started, a complete series of integrity constraint tests have been carried out. In this way we have discovered some inconsistency problems in the MCR databases that have a reflection in the generated <span>Prolog</span> databases and have been reported to the owners of those databases.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"5 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09750-y
Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers
Ibtissam Touahri, Azzeddine Mazroui
Sentiment analysis is a task in natural language processing that aims to identify the overall polarity of reviews for subsequent analysis. This study uses the Arabic speech-act and sentiment analysis dataset, the Arabic sentiment tweets dataset, and the SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, a modern standard and dialectal Arabic corpus has been created and annotated by language type: Modern Standard Arabic, Moroccan Arabic dialect, and mixed language. Additionally, annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics for enhancing a sentiment classification system are outlined. The proposed approach deploys several supervised models, including occurrence vectors, a recurrent neural network with long short-term memory (LSTM), and the pre-trained transformer model AraBERT (Arabic bidirectional encoder representations from transformers), complemented by generative adversarial networks (GANs). The novelty of the approach lies in manually constructing and annotating a dialectal sentiment corpus, carefully studying its main characteristics, and then feeding these into the classical supervised models. Moreover, GANs that widen the gap between the studied classes are used to enhance the AraBERT results. The classification test results are promising and enable comparison with other systems. The proposed system has been evaluated against the state-of-the-art Mazajak and CAMeL Tools systems, designed for most Arabic dialects, on the datasets mentioned above. A significant improvement of 30 points in F^NN has been observed. These results affirm the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal and multi-domain datasets, both balanced and unbalanced.
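Fine-tuning a pre-trained transformer such as AraBERT for this kind of three-class sentiment task follows the standard sequence-classification recipe. The sketch below is a minimal illustration of one training step with the Hugging Face transformers library, not the authors' pipeline; the model id is assumed to be the public AraBERT checkpoint, and the two reviews and labels are toy examples.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "aubmindlab/bert-base-arabertv02"  # assumed Hugging Face id for AraBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

texts = ["خدمة ممتازة", "تجربة سيئة"]          # toy reviews
labels = torch.tensor([0, 1])                  # 0=positive, 1=negative, 2=mixed
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)            # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
print(float(out.loss))
```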
{"title":"Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers","authors":"Ibtissam Touahri, Azzeddine Mazroui","doi":"10.1007/s10579-024-09750-y","DOIUrl":"https://doi.org/10.1007/s10579-024-09750-y","url":null,"abstract":"<p>Sentiment analysis is a task in natural language processing aiming to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated based on the three language types: modern standard Arabic, Moroccan Arabic Dialect, and Mixed Language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics to enhance a sentiment classification system have been outlined. The proposed approach has involved deploying several models employing the supervised approach, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in constructing and annotating manually a dialectal sentiment corpus and studying carefully its main characteristics, which are used then to feed the classical supervised model. Moreover, GANs that widen the gap between the studied classes have been used to enhance the obtained results with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against Mazajak and CAMelTools state-of-the-art systems, designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F<sup>NN</sup> has been observed. These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}