We present a comprehensive and valuable resource in the form of an aligned parallel corpus comprising translations of the Bible in Spanish. Our collection encompasses a total of eleven Bibles, originating from diverse centuries (XVI, XIX, XX), various religious denominations (Protestant, Catholic), and geographical regions (Spain, Latin America). The verses have been meticulously aligned across these translations, ensuring that the content is organized in a coherent manner. As a result, this corpus serves as a convenient resource for various linguistic analyses, including paraphrase detection, semantic clustering, and the exploration of biases present within the texts. To illustrate the utility of this resource, we provide several examples that demonstrate how it can be effectively employed in these applications.
{"title":"An aligned corpus of Spanish bibles","authors":"Gerardo Sierra, Gemma Bel-Enguix, Ameyali Díaz-Velasco, Natalia Guerrero-Cerón, Núria Bel","doi":"10.1007/s10579-024-09726-y","DOIUrl":"https://doi.org/10.1007/s10579-024-09726-y","url":null,"abstract":"<p>We present a comprehensive and valuable resource in the form of an aligned parallel corpus comprising translations of the Bible in Spanish. Our collection encompasses a total of eleven Bibles, originating from diverse centuries (XVI, XIX, XX), various religious denominations (Protestant, Catholic), and geographical regions (Spain, Latin America). The process of aligning the verses across these translations has been meticulously carried out, ensuring that the content is organized in a coherent manner. As a result, this corpus serves as a useful convenient resource for various linguistic analyses, including paraphrase detection, semantic clustering, and the exploration of biases present within the texts. To illustrate the utility of this resource, we provide several examples that demonstrate how it can be effectively employed in these applications.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"26 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive or not offensive at both the sentence and token level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
{"title":"SOLD: Sinhala offensive language dataset","authors":"Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri","doi":"10.1007/s10579-024-09723-1","DOIUrl":"https://doi.org/10.1007/s10579-024-09723-1","url":null,"abstract":"<p>The widespread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, the research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (<i>SOLD</i>) and present multiple experiments on this dataset. <i>SOLD</i> is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. <i>SOLD</i> is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce <i>SemiSOLD</i>, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-05. DOI: 10.1007/s10579-024-09728-w
François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti
Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, which calls for automated processing to support human analysts. Few annotated corpora are available for the evaluation of information extraction tools in the EBS domain. The main objective of this work was to build a corpus containing documents representative of those collected in current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. The corpus includes 36 documents in French, covering a total of 11 events, with the remainder in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 (F1-score) for event identification and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with significant variation according to the type of entity considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods submitted by EBS research teams for event detection and novelty annotation, aiming at the operational improvement of document flow processing. The small size of the corpus makes it less suitable for training natural language processing models, although this limitation tends to fade when using few-shot learning methods.
{"title":"Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus","authors":"François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti","doi":"10.1007/s10579-024-09728-w","DOIUrl":"https://doi.org/10.1007/s10579-024-09728-w","url":null,"abstract":"<p> Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, requiring automated processing to support human analysts. Few annotated corpora are available for the evaluation of information extraction tools in the EBS domain. The main objective of this work was to build a corpus containing documents which are representative of those collected in the current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts working in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. There were 36 included documents in French, representing a total of 11 events, with the remainder in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-Score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with a significant variation according to the type of entities considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods submitted by EBS research teams for event detection and annotation of their novelties, aiming at the operational improvement of document flow processing. The small size of this corpus makes it less suitable for training natural language processing models, although this limitation tends to fade away when using few-shots learning methods.\u0000</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"116 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-01. DOI: 10.1007/s10579-023-09715-7
Abstract
In this paper, we approach the problem of semantic search by introducing the task of paraphrase span detection: given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as is typically used in extractive question answering. While current work in paraphrasing has focused almost exclusively on sentence-level approaches, the novel span detection approach makes it possible to retrieve a segment of arbitrary length. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, including their original document context, we find that, with an exact match score of 88.73, our paraphrase span detection approach outperforms widely adopted sentence-level retrieval baselines (lexical similarity as well as BERT and SBERT sentence embeddings) by more than 20 pp in exact match and 11 pp in token-level F-score. This demonstrates a strong advantage of modelling paraphrase retrieval as span extraction rather than the commonly used sentence similarity, the sentence-level approaches being clearly suboptimal for applications where the retrieval targets are not guaranteed to be full sentences. Even when limiting the evaluation to sentence-level retrieval targets only, the span detection model still outperforms the sentence-level baselines by more than 4 pp in exact match and almost 6 pp in F-score. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.
{"title":"Semantic search as extractive paraphrase span detection","authors":"","doi":"10.1007/s10579-023-09715-7","DOIUrl":"https://doi.org/10.1007/s10579-023-09715-7","url":null,"abstract":"<h3>Abstract</h3> <p>In this paper, we approach the problem of semantic search by introducing a task of paraphrase span detection, i.e. given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as typically used in extractive question answering. While current work in paraphrasing has almost uniquely focused on sentence-level approaches, the novel span detection approach gives a possibility to retrieve a segment of arbitrary length. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs including their original document context, we find that by achieving an exact match of 88.73 our paraphrase span detection approach outperforms widely adopted sentence-level retrieval baselines (lexical similarity as well as BERT and SBERT sentence embeddings) by more than 20pp in terms of exact match, and 11pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the paraphrase retrieval in terms of span extraction rather than commonly used sentence similarity, the sentence-level approaches being clearly suboptimal for applications where the retrieval targets are not guaranteed to be full sentences. Even when limiting the evaluation to sentence-level retrieval targets only, the span detection model still outperforms the sentence-level baselines by more than 4 pp in terms of exact match, and almost 6pp F-score. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"13 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139661804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-28. DOI: 10.1007/s10579-023-09713-9
Merve Bayrak, Deniz Dal
Concept maps are two-dimensional visual tools that describe the relationships between concepts belonging to a particular subject. The manual creation of these maps entails problems such as requiring expertise in the relevant field, minimizing visual complexity, and integrating maps, especially for text-intensive documents. In order to overcome these problems, automatic creation of concept maps is required. On the other hand, producing a fully automated concept map of hand-made quality from a document has not yet been achieved satisfactorily. Motivated by this observation, this study aims to develop a new methodology for the automatic creation of concept maps from Turkish text documents for the first time in the literature. In this respect, within the scope of this study, a new heuristic algorithm has been developed using the Turkish Natural Language Processing software chain and the Graphviz tool to automatically extract concept maps from Turkish texts. The proposed algorithm works on the principle of obtaining concepts based on the dependencies of Turkish words in sentences. The algorithm also determines the sentences to be added to the concept map with a new sentence scoring mechanism. The developed algorithm has been applied to a total of 20 data sets in the fields of Turkish Literature, Geography, Science, and Computer Sciences. The effectiveness of the algorithm has been analyzed with three different performance evaluation criteria, namely precision, recall and F-score. The findings reveal that the proposed algorithm is quite effective on Turkish texts containing concepts. It has also been observed that the sentence selection algorithm produces results close to the average value in terms of the performance criteria being evaluated. According to the findings, the concept maps automatically obtained by the proposed algorithm are quite similar to concept maps extracted manually. On the other hand, the developed algorithm has a limitation in that it depends on a natural language processing tool and therefore requires manual intervention in some cases.
{"title":"A new methodology for automatic creation of concept maps of Turkish texts","authors":"Merve Bayrak, Deniz Dal","doi":"10.1007/s10579-023-09713-9","DOIUrl":"https://doi.org/10.1007/s10579-023-09713-9","url":null,"abstract":"<p>Concept maps are two-dimensional visual tools that describe the relationships between concepts belonging to a particular subject. The manual creation of these maps entails problems such as requiring expertise in the relevant field, minimizing visual complexity, and integrating maps, especially in terms of text-intensive documents. In order to overcome these problems, automatic creation of concept maps is required. On the other hand, the production of a fully automated and human-hand quality concept map from a document has not yet been achieved satisfactorily. Motivated by this observation, this study aims to develop a new methodology for automatic creation of the concept maps from Turkish text documents for the first time in the literature. In this respect, within the scope of this study, a new heuristic algorithm has been developed using the Turkish Natural Language Processing software chain and the Graphviz tool to automatically extract concept maps from Turkish texts. The proposed algorithm works with the principle of obtaining concepts based on the dependencies of Turkish words in sentences. The algorithm also determines the sentences to be added to the concept map with a new sentence scoring mechanism. The developed algorithm has been applied on a total of 20 data sets in the fields of Turkish Literature, Geography, Science, and Computer Sciences. The effectiveness of the algorithm has been analyzed with three different performance evaluation criteria, namely precision, recall and F-score. The findings have revealed that the proposed algorithm is quite effective in Turkish texts containing concepts. It has also been observed that the sentence selection algorithm produces results close to the average value in terms of the performance criteria being evaluated. According to the findings, the concept maps automatically obtained by the proposed algorithm are quite similar to the concept maps extracted manually. On the other hand, there is a limitation of the developed algorithm since it is dependent on a natural language processing tool and therefore requires manual intervention in some cases.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"41 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139587444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-25. DOI: 10.1007/s10579-023-09707-7
Abstract
With globalization and cultural exchange around the globe, much of the population has gained knowledge of at least two languages. The bilingual user base on Social Media Platforms (SMPs) has significantly contributed to the popularity of code-mixing. However, apart from their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and it is even more challenging for code-mix. The abusive post detection problem is more complicated than it seems due to unseemly, noisy data and uncertain context. To analyze such content, the research community needs an appropriate dataset; a small dataset is not a suitable sample for the research work. In this paper, we analyze the dimensions of Devanagari-Roman code-mix in short noisy text. We also discuss the challenges of abusive instances. We propose a cost-effective methodology with a 20.38% relevancy score to collect and annotate code-mix abusive text instances. Our dataset is eight times the size of the related state-of-the-art dataset, and it is balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. We have also conducted experiments to verify the usefulness of the dataset, using traditional machine learning techniques, traditional neural network architectures, recurrent neural network architectures, and pre-trained Large Language Models (LLMs). From our experiments, we have observed the suitability of the dataset for further scientific work.
{"title":"Large scale annotated dataset for code-mix abusive short noisy text","authors":"","doi":"10.1007/s10579-023-09707-7","DOIUrl":"https://doi.org/10.1007/s10579-023-09707-7","url":null,"abstract":"<h3>Abstract</h3> <p>With globalization and cultural exchange around the globe, most of the population gained knowledge of at least two languages. The bilingual user base on the Social Media Platform (SMP) has significantly contributed to the popularity of code-mixing. However, apart from multiple vital uses, SMP also suffer with abusive text content. Identifying abusive instances for a single language is a challenging task, and even more challenging for code-mix. The abusive posts detection problem is more complicated than it seems due to its unseemly, noisy data and uncertain context. To analyze these contents, the research community needs an appropriate dataset. A small dataset is not a suitable sample for the research work. In this paper, we have analyzed the dimensions of Devanagari-Roman code-mix in short noisy text. We have also discussed the challenges of abusive instances. We have proposed a cost-effective methodology with 20.38% relevancy score to collect and annotate the code-mix abusive text instances. Our dataset is eight times to the related state-of-the-art dataset. Our dataset ensures the balance with 55.81% instances in the abusive class and 44.19% in the non-abusive class. We have also conducted experiments to verify the usefulness of the dataset. We have performed experiments with traditional machine learning techniques, traditional neural network architecture, recurrent neural network architectures, and pre-trained Large Language Model (LLM). From our experiments, we have observed the suitability of the dataset for further scientific work.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"164 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139560843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-22. DOI: 10.1007/s10579-023-09714-8
Tiago Timponi Torrent, Ely Edison da Silva Matos, Alexandre Diniz da Costa, Maucha Andrade Gamonal, Simone Peron-Corrêa, Vanessa Maria Ramos Lopes Paiva
In this paper we present a database management and annotation tool for running an enriched FrameNet database: the FrameNet Brasil WebTool. We demonstrate how the entity-based model of this tool allows for the addition of two types of data structure to FrameNet Brasil, both of which are aimed at refining the granularity of the semantic representations: frame element-to-frame relations and ternary qualia relations. We report on three proof-of-concept applications of the enriched database: a domain-specific structured lexicon, a recommendation system for tourists, and a post-editing system for domain adaptation in machine translation.
{"title":"A flexible tool for a qualia-enriched FrameNet: the FrameNet Brasil WebTool","authors":"Tiago Timponi Torrent, Ely Edison da Silva Matos, Alexandre Diniz da Costa, Maucha Andrade Gamonal, Simone Peron-Corrêa, Vanessa Maria Ramos Lopes Paiva","doi":"10.1007/s10579-023-09714-8","DOIUrl":"https://doi.org/10.1007/s10579-023-09714-8","url":null,"abstract":"<p>In this paper we present a database management and annotation tool for running an enriched FrameNet database, the FrameNet Brasil WebTool. We demonstrate how the entity-based model of such a tool allows for the addition of two types of data-structure to FrameNet Brasil, both of which aimed at refining the granularity of the semantic representations: the frame element-to-frame and the ternary qualia relations. We report on three proof-of-concept applications of such an enriched database: a domain-specific structured lexicon, a recommendation system for tourists and a post-editing system for domain adaptation in machine translation.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"139 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139516379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we present the NewsCom-TOX corpus, a new corpus manually annotated for toxicity in Spanish. NewsCom-TOX consists of 4359 comments in Spanish posted in response to 21 news articles on social media related to immigration, collected in order to analyse and identify messages with racial and xenophobic content. The corpus is annotated at multiple levels with binary linguistic categories (stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance), taking into account not only the information conveyed in each comment, but also the whole discourse thread in which the comment occurs and the information conveyed in the news article, including its images. These categories allow us to identify the presence of toxicity and its intensity, that is, the level of toxicity of each comment. All this information is available for research purposes upon request. Here we describe the NewsCom-TOX corpus, the annotation tagset used, the criteria applied and the annotation process carried out, including the inter-annotator agreement tests conducted. A quantitative analysis of the results obtained is also provided. NewsCom-TOX is a linguistic resource that will be valuable for both linguistic and computational research on Spanish NLP tasks for the detection of toxic information.
{"title":"NewsCom-TOX: a corpus of comments on news articles annotated for toxicity in Spanish","authors":"Mariona Taulé, Montserrat Nofre, Víctor Bargiela, Xavier Bonet","doi":"10.1007/s10579-023-09711-x","DOIUrl":"https://doi.org/10.1007/s10579-023-09711-x","url":null,"abstract":"<p>In this article, we present the NewsCom-TOX corpus, a new corpus manually annotated for toxicity in Spanish. NewsCom-TOX consists of 4359 comments in Spanish posted in response to 21 news articles on social media related to immigration, in order to analyse and identify messages with racial and xenophobic content. This corpus is multi-level annotated with different binary linguistic categories -stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance- taking into account not only the information conveyed in each comment, but also the whole discourse thread in which the comment occurs, as well as the information conveyed in the news article, including their images. These categories allow us to identify the presence of toxicity and its intensity, that is, the level of toxicity of each comment. All this information is available for research purposes upon request. Here we describe the NewsCom-TOX corpus, the annotation tagset used, the criteria applied and the annotation process carried out, including the inter-annotator agreement tests conducted. A quantitative analysis of the results obtained is also provided. NewsCom-TOX is a linguistic resource that will be valuable for both linguistic and computational research in Spanish in NLP tasks for the detection of toxic information.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139497797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-13. DOI: 10.1007/s10579-023-09708-6
Kiran Babu Nelatoori, Hima Bindu Kommanti
Detecting toxic comments and the rationale for the offensiveness of a social media post promotes the moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model, trained through transfer learning, for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed from the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, none exists for Hinglish code-mixed text to date. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with the HingRoBERTa encoder are significantly higher than those of the baseline models for both tasks. Caution: this paper may contain words disturbing to some readers.
{"title":"Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning","authors":"Kiran Babu Nelatoori, Hima Bindu Kommanti","doi":"10.1007/s10579-023-09708-6","DOIUrl":"https://doi.org/10.1007/s10579-023-09708-6","url":null,"abstract":"<p>Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. <i>Caution:</i> This paper may contain words disturbing to some readers.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"27 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139460881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-06. DOI: 10.1007/s10579-023-09706-8
Abstract
This paper stems from the project A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language (WoPoss), which takes a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation and, in particular, the semantic annotation of modality is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple interconnected layers of annotation, adding complexity to the task. Given the intricacies of such fine-grained semantic annotation, we needed to develop well-documented schemas in order to control the consistency of the annotation, but also to enable efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the descriptions of and relations between the different linguistic components were formalised and documented, combining schema languages with XML documentation.
{"title":"Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus","authors":"","doi":"10.1007/s10579-023-09706-8","DOIUrl":"https://doi.org/10.1007/s10579-023-09706-8","url":null,"abstract":"<h3>Abstract</h3> <p>This paper stems from the project <em>A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language</em> (WoPoss) which involves a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation and, in particular, the semantic annotation of modality is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple layers of annotation that are interconnected, adding complexity to the task. Considering the intricacies of our fine-grained semantic annotation, we needed to develop well-documented schemas in order to control the consistency of the annotation, but also to enable an efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the description and the relations between the different linguistic components were formalised and documented, combining schema languages with XML documentation.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"24 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139375818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}