Which words are important?: an empirical study of Assamese sentiment analysis

IF 1.7 | CAS Zone 3, Computer Science | JCR Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Language Resources and Evaluation | Pub Date: 2024-06-19 | DOI: 10.1007/s10579-024-09756-6
Ringki Das, Thoudam Doren Singh

Abstract

Sentiment analysis is an important research domain in text analytics and natural language processing. Over the last few decades, it has become a fascinating and salient area for researchers seeking to understand human sentiment. According to the 2011 census, the Assamese language is spoken by 15 million people. Despite being a scheduled language of the Indian Constitution, it is still a resource-constrained language. Although it is an official language with its own script, little work on sentiment analysis has been reported for Assamese. In a linguistically diverse country like India, it is essential to provide systems that help people understand sentiment in their native languages. The multilingual society in India cannot fully leverage the benefits of AI without state-of-the-art NLP systems for its regional languages. The Assamese language has become popular owing to its wide range of applications, and the number of Assamese users on social media and other platforms is increasing day by day. Automatic sentiment analysis systems are useful to individuals, governments, political parties, and other organizations, and can help curb the spread of negativity regardless of language. This paper presents a study of textual sentiment analysis in the Assamese news domain using different lexical features with machine learning and deep learning techniques. In the experiments, baseline models are developed and compared against models with lexical features. The proposed model, which uses AAV lexical features with an XGBoost classifier, achieves the highest accuracy, 86.76%, with the TF-IDF approach. Combining the lexical features with a machine learning classifier is observed to improve sentiment prediction significantly over individual lexical features in a small-dataset scenario.
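
To make the TF-IDF weighting mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the `tfidf` function name, the smoothed IDF variant, and the toy English token lists are all hypothetical stand-ins (the paper works on Assamese news text and does not state its exact weighting in the abstract).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: raw term frequency times the smoothed IDF
    ln((1 + N) / (1 + df)) + 1. This is one common variant and
    not necessarily the one used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

# Made-up English tokens standing in for tokenized news sentences.
docs = [
    ["flood", "damage", "severe"],
    ["festival", "joyful", "celebration"],
    ["flood", "relief", "welcome"],
]
vectors = tfidf(docs)
```

In this toy corpus, "flood" occurs in two of the three documents and therefore receives a lower weight than "damage", which occurs in only one. In a pipeline like the one the abstract describes, such weighted vectors would be passed to a classifier such as XGBoost; restricting the vocabulary to a chosen set of lexical features before weighting is one plausible way to realize the feature-based variants, though the abstract does not specify how this was done.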

Source journal
Language Resources and Evaluation
Category: Engineering and Technology - Computer Science, Interdisciplinary Applications
CiteScore: 6.50
Self-citation rate: 3.70%
Publication volume: 55
Review time: >12 weeks
Journal description: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.
Latest articles from this journal
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect
Studying word meaning evolution through incremental semantic shift detection
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines
Normalized dataset for Sanskrit word segmentation and morphological parsing
Conversion of the Spanish WordNet databases into a Prolog-readable format