An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language

Salim Sazzed
Published in: Proceedings of the Workshop on Multilingual Information Access (MIA)
DOI: 10.18653/v1/2022.mia-1.2 (https://doi.org/10.18653/v1/2022.mia-1.2)

Abstract

Discourse modes help readers understand the conventions and purposes of the different forms of language used in communication. In this study, we introduce a discourse-mode-annotated corpus for Bangla (also referred to as Bengali), a low-resource language. The corpus provides sentence-level annotations for three discourse modes (narrative, descriptive, and informative) on text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose linguistic characteristics of the discourse modes, such as class distributions and average sentence lengths. To identify the discourse mode automatically, we apply classical machine learning (CML) classifiers with n-gram-based statistical features and a fine-tuned language model based on BERT (Bidirectional Encoder Representations from Transformers). We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram-based CML classifiers. The annotated dataset, the first of its kind in Bangla, together with the evaluation, provides baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
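The corpus statistics mentioned in the abstract (class distribution and average sentence length per mode) and the word n-grams that feed the CML classifiers can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy `corpus` uses English placeholder sentences rather than the Bangla data, tokenization is plain whitespace splitting, and the helper names are hypothetical, not taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (sentence, discourse-mode label) pairs.
# English placeholders stand in for Bangla sentences; not from the dataset.
corpus = [
    ("He opened the door and walked out", "narrative"),
    ("She ran to the river", "narrative"),
    ("The room was small and dark", "descriptive"),
    ("The river flows south into the bay", "informative"),
]


def class_distribution(samples):
    """Return the fraction of sentences carrying each label."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def average_sentence_length(samples):
    """Return the mean whitespace-token count per label."""
    lengths = defaultdict(list)
    for sentence, label in samples:
        lengths[label].append(len(sentence.split()))
    return {label: sum(v) / len(v) for label, v in lengths.items()}


def word_ngrams(sentence, n):
    """Return the word n-grams of a sentence as a list of tuples."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


distribution = class_distribution(corpus)
avg_lengths = average_sentence_length(corpus)
bigrams = word_ngrams(corpus[1][0], 2)
```

In a full pipeline, n-gram counts like these would typically be weighted and passed to a classifier; the abstract does not specify the weighting scheme or the classifiers used, so none is assumed here.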