Which words are important?: an empirical study of Assamese sentiment analysis

Language Resources and Evaluation · IF 1.8 · Zone 3 (Computer Science) · JCR Q3 (Computer Science, Interdisciplinary Applications) · Pub Date: 2024-06-19 · DOI: 10.1007/s10579-024-09756-6

Ringki Das, Thoudam Doren Singh



Abstract

Sentiment analysis is an important research domain in text analytics and natural language processing. Over the last few decades, it has become a prominent area of research for understanding human sentiment. According to the 2011 census, the Assamese language is spoken by 15 million people. Despite being a scheduled language of the Indian Constitution, it remains a resource-constrained language. Although it is an official language with its own script, little work on sentiment analysis has been reported for Assamese. In a linguistically diverse country like India, it is essential to provide systems that help people understand sentiment in their native languages. Without state-of-the-art NLP systems for its regional languages, India's multilingual society cannot fully leverage the benefits of AI. The Assamese language has become popular due to its wide applications, and the number of Assamese users on social media and other platforms is increasing day by day. Automatic sentiment analysis systems can be useful to individuals, governments, political parties, and other organizations, and can also help stop negativity from spreading, without a language divide. This paper presents a study of textual sentiment analysis in the Assamese news domain using different lexical features with machine learning and deep learning techniques. In the experiments, baseline models are developed and compared against models with lexical features. The proposed model, which combines AAV lexical features with an XGBoost classifier, achieves the highest accuracy, 86.76%, with the TF-IDF approach. The results show that, in a small-dataset scenario, combining lexical features with a machine learning classifier helps sentiment prediction significantly more than individual lexical features do.
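The abstract names TF-IDF as the feature weighting but gives no formula. As a rough illustration only, the sketch below computes one common textbook variant, tf × log(N / df), in plain Python. The function name `tf_idf`, the toy English tokens, and the choice of variant are all assumptions made here for illustration; the paper's actual tokenization, AAV feature extraction, TF-IDF settings, and XGBoost configuration are not described in this listing and are not reproduced.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = (count / doc_length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus (placeholder English tokens, not the paper's data).
docs = [
    ["good", "news", "today"],
    ["bad", "news", "today"],
    ["good", "good", "result"],
]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document gets weight zero, while terms concentrated in few documents are weighted up. Vectors of this kind are what a downstream classifier would consume.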




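Source journal

Language Resources and Evaluation (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.50

Self-citation rate: 3.70%

Articles published:

Review time: >12 weeks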

About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.