Title: HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network
Authors: Akshi Kumar, Abhishek Mallik, Sanjay Kumar
DOI: 10.1145/3661306
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing (JCR Q3, Computer Science, Artificial Intelligence; impact factor 1.8)
Publication date: 2024-04-27
Publication type: Journal Article
Citations: 0
Abstract
Humour is a crucial aspect of human speech, and it is therefore important to build systems that can detect it automatically. While data on humour in English speech is plentiful, the same cannot be said for a low-resource language like Hindi. In this paper, we introduce two multimodal datasets for humour detection in Hindi web series. The datasets were collected from over 500 minutes of conversations among the characters of the Hindi web series *Kota Factory* and *Panchayat*. Each dialogue is manually annotated as Humour or Non-Humour. Along with presenting new Hindi-language humour detection datasets, we propose an improved framework for detecting humour in Hindi conversations. We start by preprocessing both datasets to obtain uniformity across the dialogues and datasets. The processed dialogues are then passed through the Skip-gram model to generate Hindi word embeddings. The generated embeddings are fed simultaneously into three convolutional neural network (CNN) branches, each with a different filter size, for feature extraction. The extracted features are then passed through stacked Long Short-Term Memory (LSTM) layers for further processing, and each dialogue is finally classified as Humour or Non-Humour. We conduct extensive experiments on both proposed Hindi datasets and evaluate several standard performance metrics. The performance of our proposed framework is also compared with several baseline and contemporary humour detection algorithms. The results demonstrate the effectiveness of our datasets as standard benchmarks for humour detection in Hindi web series. The proposed model yields accuracies of 91.79% and 87.32%, and F1 scores of 91.64% and 87.04%, on the *Kota Factory* and *Panchayat* datasets, respectively.
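The multi-branch CNN feature extractor described in the abstract (parallel convolutions with different filter sizes over a sequence of word embeddings, max-pooled over time and concatenated) can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the filter sizes (2, 3, 4), the number of filters, and the embedding dimension are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_branch(embeddings, filter_size, n_filters, rng):
    """One CNN branch: slide n_filters filters of width filter_size
    over the token axis, apply ReLU, then max-pool over time so the
    branch emits one scalar per filter regardless of sequence length."""
    T, d = embeddings.shape
    W = rng.standard_normal((n_filters, filter_size, d)) * 0.1  # random weights for the sketch
    # All length-filter_size windows of the sequence: (T - fs + 1, fs, d)
    windows = np.stack([embeddings[t:t + filter_size]
                        for t in range(T - filter_size + 1)])
    feature_maps = np.einsum('tfd,nfd->tn', windows, W)  # (T', n_filters)
    feature_maps = np.maximum(feature_maps, 0)           # ReLU
    return feature_maps.max(axis=0)                      # max over time

# Toy "dialogue": 12 tokens with 50-dim Skip-gram-style embeddings
# (in the paper these would come from a Skip-gram model trained on Hindi text).
seq = rng.standard_normal((12, 50))

# Three parallel branches with different filter sizes; their pooled
# outputs are concatenated into one fixed-length feature vector.
features = np.concatenate([conv_branch(seq, fs, 32, rng)
                           for fs in (2, 3, 4)])
print(features.shape)  # (96,)
```

In the full model, this fixed-length vector (or the per-window feature maps, depending on the design) would feed the stacked LSTM layers and a final Humour/Non-Humour classifier head.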
About the journal:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers dealing with theory, systems design, evaluation, and applications in the above areas are appropriate for TALLIP. Emphasis is placed on the originality and practical significance of the reported research.