Sarcasm detection using news headlines dataset

AI Open (IF 14.8), Pub Date: 2023-01-01, DOI: 10.1016/j.aiopen.2023.01.001

Rishabh Misra , Prahal Arora

AI Open, Volume 4, Pages 13-18. Journal Article. https://www.sciencedirect.com/science/article/pii/S2666651023000013

Citations: 7

Abstract




Sarcasm has long been an elusive concept for humans. Owing to its interesting linguistic properties, sarcasm detection has gained traction in the Natural Language Processing (NLP) research community over the past few years. However, predicting sarcasm in text remains difficult for machines as well, and there are limited insights into what makes a sentence sarcastic. Past studies in sarcasm detection use either large-scale datasets collected with tag-based supervision or small-scale manually annotated datasets. The former are noisy in terms of labels and language, whereas the latter, despite having high-quality labels, do not have enough instances to train deep learning models reliably. To overcome these shortcomings, we introduce a high-quality and relatively larger-scale dataset, which is a collection of news headlines from a sarcastic news website and a real news website. We describe the unique aspects of our dataset and compare its characteristics with other benchmark datasets in the sarcasm detection domain. Furthermore, we produce insights into what constitutes sarcasm in a text using a Hybrid Neural Network architecture. Because the dataset was first released in 2019, we dedicate a section to how the NLP research community has extensively relied upon our contributions to push the state of the art further in the sarcasm detection domain. Lastly, we make the dataset as well as the framework implementation publicly available to facilitate continued research in this domain.
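The labeling idea in the abstract (headlines inherit their label from the website they were collected from, rather than from hashtags or manual annotation) can be illustrated with a minimal sketch. This is not the authors' released framework: the names `LabeledHeadline`, `label_by_source`, and `class_balance` are hypothetical, the field name `is_sarcastic` is an assumption, and the strings below are placeholders rather than real headlines.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LabeledHeadline:
    """One example; the field names are assumptions for illustration."""
    headline: str
    is_sarcastic: int  # 1 = came from the sarcastic site, 0 = real news site


def label_by_source(sarcastic_site: Iterable[str],
                    real_site: Iterable[str]) -> List[LabeledHeadline]:
    """Assign labels purely from the source website of each headline."""
    labeled = [LabeledHeadline(h.strip(), 1) for h in sarcastic_site]
    labeled += [LabeledHeadline(h.strip(), 0) for h in real_site]
    return labeled


def class_balance(dataset: List[LabeledHeadline]) -> float:
    """Return the fraction of examples labeled sarcastic (0.0 if empty)."""
    if not dataset:
        return 0.0
    return sum(item.is_sarcastic for item in dataset) / len(dataset)


# Placeholder inputs standing in for scraped headlines (not real data).
satire = ["placeholder satirical headline one ",
          "placeholder satirical headline two"]
real = ["placeholder real headline one"]

dataset = label_by_source(satire, real)
balance = class_balance(dataset)
```

Because the label comes from the source site, no per-headline annotation step is needed; checking `class_balance` before training is a simple way to see how skewed the resulting dataset is.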


Source journal

AI Open

CiteScore

45.00

Self-citation rate

0.00%

Publication count