Transformer-based Pouranic topic classification in Indian mythology

Apurba Paul, Srijan Seal, Dipankar Das
Journal: Sādhanā (journal article)
Published: 2024-09-19
DOI: 10.1007/s12046-024-02598-6 (https://doi.org/10.1007/s12046-024-02598-6)
Citations: 0

Abstract

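Topic classification of Indian mythological texts is a challenging task, yet identifying the subject matter or theme of such texts can improve NLP-based systems, such as recommendation and semantic search engines, that handle mythological content. This research develops transformer-based models for automated topic classification of Indian mythological documents, addressing the difficulty of organizing and analyzing this rich and diverse corpus. We introduce PouranicTopic, a new annotated dataset containing over 200k verses from 7 major Hindu texts with canto, topic, and sentence labels. Two additional datasets, Similarity-based and Log-likelihood-based, are created using sentence clustering techniques. The BERT, RoBERTa, and DistilBERT models are evaluated on canto and topic classification across these datasets. Clustering greatly improves the results on the Similarity-based dataset, but the Log-likelihood-based dataset remains challenging.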
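The abstract does not say how the Similarity-based dataset is built, so the following is only a rough sketch of what similarity-based sentence clustering can look like, not the authors' method. It assumes bag-of-words cosine similarity and a greedy threshold rule; the function names, the `threshold` value, and the toy sentences are all hypothetical and are not drawn from PouranicTopic.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercased whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(sentences, threshold=0.5):
    """Greedy clustering (illustrative assumption, not the paper's algorithm):
    each sentence joins the first cluster whose seed sentence is at least
    `threshold` similar to it; otherwise it starts a new cluster."""
    vectors = [bow(s) for s in sentences]
    clusters = []  # each cluster is a list of sentence indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Invented toy sentences, used only to exercise the code.
verses = [
    "the king went to the forest",
    "the king returned to the forest",
    "the sage performed a great sacrifice",
    "a great sacrifice was performed by the sage",
]
clusters = cluster_by_similarity(verses, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

In practice one would likely replace the bag-of-words vectors with sentence embeddings from a transformer encoder, but that choice, like the threshold, is an assumption the abstract neither confirms nor rules out.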
