
Latest publications in Language Resources and Evaluation

An aligned corpus of Spanish bibles
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-03-15 DOI: 10.1007/s10579-024-09726-y
Gerardo Sierra, Gemma Bel-Enguix, Ameyali Díaz-Velasco, Natalia Guerrero-Cerón, Núria Bel

We present a comprehensive and valuable resource in the form of an aligned parallel corpus comprising translations of the Bible in Spanish. Our collection encompasses a total of eleven Bibles, originating from different centuries (XVI, XIX, XX), religious denominations (Protestant, Catholic), and geographical regions (Spain, Latin America). The verses across these translations have been meticulously aligned, ensuring that the content is organized in a coherent manner. As a result, this corpus serves as a convenient resource for various linguistic analyses, including paraphrase detection, semantic clustering, and the exploration of biases present within the texts. To illustrate the utility of this resource, we provide several examples that demonstrate how it can be effectively employed in these applications.
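A minimal sketch of how such a verse-aligned corpus can be consumed. The file layout (one TSV per Bible with book, chapter, verse and text columns) is an assumption for illustration only, not the authors' actual distribution format.

```python
# Sketch: group the same verse across several Bible translations.
# Assumed layout: one TSV per Bible, columns book<TAB>chapter<TAB>verse<TAB>text.
import csv
from collections import defaultdict

def load_bible(path):
    """Map a (book, chapter, verse) key to the verse text of one translation."""
    verses = {}
    with open(path, encoding="utf-8") as fh:
        for book, chapter, verse, text in csv.reader(fh, delimiter="\t"):
            verses[(book, int(chapter), int(verse))] = text
    return verses

def parallel_verses(paths):
    """Collect the same verse across all translations, keeping complete rows only."""
    grouped = defaultdict(dict)
    for path in paths:
        for key, text in load_bible(path).items():
            grouped[key][path] = text
    return {k: v for k, v in grouped.items() if len(v) == len(paths)}
```

Each value returned by `parallel_verses` is a set of mutual paraphrases of one verse, ready for paraphrase-detection or semantic-clustering experiments like those mentioned in the abstract.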

Citations: 0
SOLD: Sinhala offensive language dataset
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-03-06 DOI: 10.1007/s10579-024-09723-1
Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research on offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive or not offensive at both the sentence level and the token level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.

Citations: 0
Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-03-05 DOI: 10.1007/s10579-024-09728-w
François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti

Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, which calls for automated processing to support human analysts. Few annotated corpora are available for the evaluation of information extraction tools in the EBS domain. The main objective of this work was to build a corpus containing documents representative of those collected in current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. Thirty-six of the included documents were in French, representing a total of 11 events, with the remainder in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with significant variation according to the type of entity considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods submitted by EBS research teams for event detection and the annotation of their novelty, aiming at operational improvement of document flow processing. The small size of the corpus makes it less suitable for training natural language processing models, although this limitation tends to fade when few-shot learning methods are used.
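For readers less familiar with the agreement figures quoted above, the sketch below shows how such scores are typically computed: Cohen's kappa for the binary novelty label and an F1 score treating one annotator as the reference. The toy labels are invented; this is not the paper's span-matching procedure for event identification, only an illustration of the metrics.

```python
# Illustrative agreement computation on invented binary novelty judgements.
from sklearn.metrics import cohen_kappa_score, f1_score

annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]  # annotator A: novel (1) / not novel (0)
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]  # annotator B on the same documents

kappa = cohen_kappa_score(annotator_a, annotator_b)  # chance-corrected agreement
f1 = f1_score(annotator_a, annotator_b)              # A taken as the "reference"

print(f"kappa={kappa:.2f}, F1={f1:.2f}")
```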

Citations: 0
Semantic search as extractive paraphrase span detection
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-02-01 DOI: 10.1007/s10579-023-09715-7

Abstract

In this paper, we approach the problem of semantic search by introducing a task of paraphrase span detection: given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as is typically used in extractive question answering. While current work in paraphrasing has focused almost exclusively on sentence-level approaches, the novel span detection approach makes it possible to retrieve a segment of arbitrary length. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, including their original document context, we find that, with an exact match score of 88.73, our paraphrase span detection approach outperforms widely adopted sentence-level retrieval baselines (lexical similarity as well as BERT and SBERT sentence embeddings) by more than 20 percentage points in exact match and 11 percentage points in token-level F-score. This demonstrates a strong advantage of modelling paraphrase retrieval as span extraction rather than the commonly used sentence similarity, the sentence-level approaches being clearly suboptimal for applications where the retrieval targets are not guaranteed to be full sentences. Even when the evaluation is limited to sentence-level retrieval targets only, the span detection model still outperforms the sentence-level baselines by more than 4 percentage points in exact match and almost 6 percentage points in F-score. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.
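The span-detection setup described above mirrors extractive question answering: the query phrase plays the role of the question and the document is the context. The sketch below uses a generic multilingual QA checkpoint purely to illustrate that modelling setup; it is not the model trained on the Turku Paraphrase Corpus, and the checkpoint name is an assumption.

```python
# Illustration of the extractive-QA-style modelling setup for span retrieval.
from transformers import pipeline

span_detector = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # assumed public checkpoint, for illustration
)

result = span_detector(
    question="a segment of text used as the query phrase",
    context="The target document in which an arbitrary-length paraphrase span is retrieved.",
)
print(result["answer"], result["start"], result["end"])  # extracted span and its offsets
```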

Citations: 0
A new methodology for automatic creation of concept maps of Turkish texts
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-28 DOI: 10.1007/s10579-023-09713-9
Merve Bayrak, Deniz Dal

Concept maps are two-dimensional visual tools that describe the relationships between concepts belonging to a particular subject. Creating these maps manually poses problems such as the need for expertise in the relevant field, minimizing visual complexity, and integrating maps, especially for text-intensive documents. Overcoming these problems requires the automatic creation of concept maps. On the other hand, fully automatic production of a concept map of human-authored quality from a document has not yet been achieved satisfactorily. Motivated by this observation, this study aims to develop a new methodology for the automatic creation of concept maps from Turkish text documents, for the first time in the literature. To this end, a new heuristic algorithm has been developed using the Turkish Natural Language Processing software chain and the Graphviz tool to automatically extract concept maps from Turkish texts. The proposed algorithm obtains concepts from the dependencies between Turkish words in sentences. The algorithm also determines the sentences to be added to the concept map with a new sentence scoring mechanism. The developed algorithm has been applied to a total of 20 data sets in the fields of Turkish Literature, Geography, Science, and Computer Sciences. The effectiveness of the algorithm has been analyzed with three different performance evaluation criteria, namely precision, recall and F-score. The findings reveal that the proposed algorithm is quite effective on Turkish texts containing concepts. It has also been observed that the sentence selection algorithm produces results close to the average value in terms of the performance criteria being evaluated. According to the findings, the concept maps obtained automatically by the proposed algorithm are quite similar to concept maps extracted manually. On the other hand, the developed algorithm has a limitation in that it depends on a natural language processing tool and therefore requires manual intervention in some cases.
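A minimal sketch of the final rendering step mentioned above: once (concept, relation, concept) triples have been obtained from dependency parses, Graphviz can lay them out as a concept map. The Python `graphviz` binding and the placeholder triples are assumptions for illustration; they are not the authors' implementation or output.

```python
# Render assumed (concept, relation, concept) triples as a concept map with Graphviz.
from graphviz import Digraph

triples = [
    ("kavram haritası", "gösterir", "ilişkiler"),   # invented placeholder triples
    ("ilişkiler", "bağlar", "kavramlar"),
]

dot = Digraph(comment="concept map", format="png")
for head, relation, tail in triples:
    dot.node(head)                       # concepts become nodes
    dot.node(tail)
    dot.edge(head, tail, label=relation)  # the relation labels the edge

dot.render("concept_map", cleanup=True)  # writes concept_map.png
```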

Citations: 0
Large scale annotated dataset for code-mix abusive short noisy text
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-25 DOI: 10.1007/s10579-023-09707-7

Abstract

With globalization and cultural exchange around the globe, much of the population has gained knowledge of at least two languages. The bilingual user base on social media platforms (SMPs) has significantly contributed to the popularity of code-mixing. However, apart from their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and even more so for code-mixed text. The abusive-post detection problem is more complicated than it seems because of unseemly, noisy data and uncertain context. To analyze such content, the research community needs an appropriate dataset; a small dataset is not a suitable sample for this research. In this paper, we analyze the dimensions of Devanagari-Roman code-mixing in short noisy text. We also discuss the challenges posed by abusive instances. We propose a cost-effective methodology, with a 20.38% relevancy score, to collect and annotate code-mixed abusive text instances. Our dataset is eight times the size of the related state-of-the-art dataset, and it is balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. We have also conducted experiments to verify the usefulness of the dataset, using traditional machine learning techniques, traditional neural network architectures, recurrent neural network architectures, and a pre-trained Large Language Model (LLM). From these experiments, we observe that the dataset is suitable for further scientific work.
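The kind of class-balance check reported above (55.81% abusive vs. 44.19% non-abusive) is straightforward to reproduce on a released dataset. The file name and column name below are assumptions for illustration; the actual release may use different ones.

```python
# Sketch: verify class balance of a labelled dataset release.
import pandas as pd

df = pd.read_csv("code_mix_abusive.csv")           # hypothetical path to the dataset
shares = df["label"].value_counts(normalize=True)  # hypothetical "label" column
print((shares * 100).round(2))                     # class shares in percent
```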

Citations: 0
A flexible tool for a qualia-enriched FrameNet: the FrameNet Brasil WebTool
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-22 DOI: 10.1007/s10579-023-09714-8
Tiago Timponi Torrent, Ely Edison da Silva Matos, Alexandre Diniz da Costa, Maucha Andrade Gamonal, Simone Peron-Corrêa, Vanessa Maria Ramos Lopes Paiva

In this paper we present a database management and annotation tool for running an enriched FrameNet database, the FrameNet Brasil WebTool. We demonstrate how the entity-based model of such a tool allows for the addition of two types of data structure to FrameNet Brasil, both aimed at refining the granularity of the semantic representations: frame element-to-frame relations and ternary qualia relations. We report on three proof-of-concept applications of such an enriched database: a domain-specific structured lexicon, a recommendation system for tourists and a post-editing system for domain adaptation in machine translation.
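A hedged sketch of an entity-based data model able to hold the two additions named above: frame element-to-frame relations and ternary qualia relations. The class and field names are illustrative, not the WebTool's actual schema; in particular, the three slots of the qualia relation are an assumption about what "ternary" covers.

```python
# Illustrative entity-based model for the two enrichment relation types.
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    elements: list = field(default_factory=list)  # frame elements (FEs)

@dataclass
class FEToFrameRelation:
    frame_element: str   # FE in the source frame
    target_frame: str    # frame that FE points to

@dataclass
class TernaryQualiaRelation:
    source_lu: str       # first lexical unit
    quale: str           # mediating element (e.g. a qualia role) -- assumed reading of "ternary"
    target_lu: str       # second lexical unit

lexicon = {
    "fe_to_frame": [FEToFrameRelation("Vehicle", "Vehicle")],
    "qualia": [TernaryQualiaRelation("restaurant", "telic", "eat")],
}
```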

Citations: 0
NewsCom-TOX: a corpus of comments on news articles annotated for toxicity in Spanish
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-17 DOI: 10.1007/s10579-023-09711-x
Mariona Taulé, Montserrat Nofre, Víctor Bargiela, Xavier Bonet

In this article, we present the NewsCom-TOX corpus, a new corpus manually annotated for toxicity in Spanish. NewsCom-TOX consists of 4359 comments in Spanish posted in response to 21 news articles on social media related to immigration, collected in order to analyse and identify messages with racial and xenophobic content. The corpus is annotated at multiple levels with different binary linguistic categories (stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance), taking into account not only the information conveyed in each comment, but also the whole discourse thread in which the comment occurs, as well as the information conveyed in the news article, including its images. These categories allow us to identify the presence of toxicity and its intensity, that is, the level of toxicity of each comment. All this information is available for research purposes upon request. Here we describe the NewsCom-TOX corpus, the annotation tagset used, the criteria applied and the annotation process carried out, including the inter-annotator agreement tests conducted. A quantitative analysis of the results obtained is also provided. NewsCom-TOX is a linguistic resource that will be valuable for both linguistic and computational research in Spanish on NLP tasks for the detection of toxic information.

Citations: 0
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-13 DOI: 10.1007/s10579-023-09708-6
Kiran Babu Nelatoori, Hima Bindu Kommanti

Detecting toxic comments and the rationale for the offensiveness of a social media post supports the moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model, trained through transfer learning, for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed from the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not yet exist. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with the HingRoBERTa encoder are significantly higher than those of the baseline models for both tasks. Caution: This paper may contain words disturbing to some readers.
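The combined objective described above, built from the individual losses of the classification head and the span (rationale) head, can be sketched as a weighted sum. Loss choices, the weight `alpha`, and the toy shapes are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: combining a comment-classification loss and a token-level span loss.
import torch
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()    # toxic vs. non-toxic comment
span_criterion = nn.BCEWithLogitsLoss()  # per-token rationale labels

def combined_loss(cls_logits, cls_labels, span_logits, span_labels, alpha=0.5):
    """Weighted sum of the two task losses; alpha is an assumed hyper-parameter."""
    loss_cls = cls_criterion(cls_logits, cls_labels)
    loss_span = span_criterion(span_logits, span_labels.float())
    return alpha * loss_cls + (1.0 - alpha) * loss_span

# Toy shapes: a batch of 4 comments, 16 tokens each.
cls_logits = torch.randn(4, 2)
cls_labels = torch.tensor([0, 1, 1, 0])
span_logits = torch.randn(4, 16)
span_labels = torch.randint(0, 2, (4, 16))
print(combined_loss(cls_logits, cls_labels, span_logits, span_labels))
```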

Citations: 0
Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-06 DOI: 10.1007/s10579-023-09706-8

Abstract

This paper stems from the project A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language (WoPoss), which takes a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation, and in particular the semantic annotation of modality, is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple layers of annotation that are interconnected, adding complexity to the task. Given the intricacies of our fine-grained semantic annotation, we needed to develop well-documented schemas in order to control the consistency of the annotation, but also to enable efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the descriptions of, and relations between, the different linguistic components were formalised and documented, combining schema languages with XML documentation.
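A minimal sketch of how interconnected annotation layers of the kind described above can be serialised as XML so that a schema can later validate them. The element and attribute names are invented for illustration; the WoPoss schemas define their own vocabulary.

```python
# Sketch: serialise a modal marker and its scope as linked XML annotation layers.
import xml.etree.ElementTree as ET

passage = ET.Element("passage", attrib={"id": "p1"})
marker = ET.SubElement(passage, "modalMarker",
                       attrib={"id": "m1", "lemma": "possum", "tokens": "t3"})
relation = ET.SubElement(passage, "relation",
                         attrib={"marker": "m1", "scope": "t4 t5 t6",
                                 "modality": "dynamic"})  # layer linking marker to scope

print(ET.tostring(passage, encoding="unicode"))
```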

Citations: 0