利用共同关注多任务学习在代码混合文本中进行有毒评论分类和理由提取

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Language Resources and Evaluation Pub Date : 2024-01-13 DOI:10.1007/s10579-023-09708-6

Kiran Babu Nelatoori, Hima Bindu Kommanti

{"title":"利用共同关注多任务学习在代码混合文本中进行有毒评论分类和理由提取","authors":"Kiran Babu Nelatoori, Hima Bindu Kommanti","doi":"10.1007/s10579-023-09708-6","DOIUrl":null,"url":null,"abstract":"Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. Caution: This paper may contain words disturbing to some readers.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"27 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning\",\"authors\":\"Kiran Babu Nelatoori, Hima Bindu Kommanti\",\"doi\":\"10.1007/s10579-023-09708-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. Caution: This paper may contain words disturbing to some readers.\",\"PeriodicalId\":49927,\"journal\":{\"name\":\"Language Resources and Evaluation\",\"volume\":\"27 1\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-01-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Language Resources and Evaluation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10579-023-09708-6\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-023-09708-6","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

检测有毒评论和社交媒体帖子的攻击性理由可以促进对社交媒体内容的管理。为此，我们通过迁移学习为低资源印地语-英语（俗称兴英语）有毒文本提出了一种协同多任务学习（CA-MTL）模型。合理性/泛读检测和有毒评论分类这两项合作任务共同创造了一个强大的多任务学习目标。我们设计了一个任务协作模块，以利用分类任务和跨度预测任务之间的双向注意力。模型的综合损失函数是利用这两个任务的单独损失函数构建的。虽然存在英语毒性跨度检测数据集，但到目前为止还没有针对兴英语码混合文本的数据集。因此，我们为混合英语代码文本开发了一个带有有毒跨度注释的数据集。我们使用多语种和兴英语 BERT 变体，将所提出的 CA-MTL 模型与缺乏共同关注机制的单任务和多任务学习模型进行了比较。使用 HingRoBERTa 编码器的 CA-MTL 模型在这两项任务中的 F1 分数都明显高于基线模型。注意事项本文可能包含对某些读者造成困扰的词语。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning

Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. Caution: This paper may contain words disturbing to some readers.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.