Transformer-based Pouranic topic classification in Indian mythology

Sādhanā, Pub Date: 2024-09-19, DOI: 10.1007/s12046-024-02598-6

Apurba Paul, Srijan Seal, Dipankar Das

Abstract

Topic classification, which aims to identify the subject matter or theme of a text, is a challenging task for Indian mythology. Doing so can enhance the performance of NLP-based systems, such as recommendation and semantic search engines, when they deal with mythological texts. This research develops transformer-based models for automated topic classification of Indian mythological documents, addressing the challenges of organizing and analyzing this rich and diverse corpus. We introduce PouranicTopic, a new annotated dataset containing over 200k verses from 7 major Hindu texts with canto, topic, and sentence labels. Two additional datasets, Similarity-based and Log-likelihood-based, are created using sentence clustering techniques. The BERT, RoBERTa, and DistilBERT models are evaluated for canto and topic classification on these datasets. Clustering greatly improves the results on the Similarity-based dataset, but the Log-likelihood-based dataset remains challenging.
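The abstract does not say how the sentence clustering is carried out. As a purely illustrative sketch, and not the authors' method, the Python below groups sentences by cosine similarity of bag-of-words vectors. The function names, the greedy single-pass strategy, the 0.5 threshold, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(sentence):
    """Lower-cased bag-of-words count vector for one sentence."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_similarity(sentences, threshold=0.5):
    """Greedy single-pass clustering (hypothetical, for illustration).

    Each sentence joins the first cluster whose seed sentence is at least
    `threshold` similar; otherwise it becomes the seed of a new cluster.
    """
    seeds, labels = [], []
    for sentence in sentences:
        vec = bow(sentence)
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing seed was similar enough: open a new cluster.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Made-up example sentences, not drawn from PouranicTopic.
verses = [
    "the king went to the forest",
    "the king went to the river",
    "sages performed a great sacrifice",
]
labels = cluster_by_similarity(verses)
```

In a sketch like this, the resulting cluster labels could be attached to each verse as an extra grouping signal. A real pipeline would more plausibly use learned sentence embeddings than raw word counts.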
