A Code-Diverse Tulu-English Dataset For NLP Based Sentiment Analysis Applications

2021 Advanced Communication Technologies and Signal Processing (ACTS). Pub Date: 2021-12-15. DOI: 10.1109/ACTS53447.2021.9708241

Prashanth Kannadaguli


Citations: 3

Abstract

The growing use of social media has raised interest in Natural Language Processing (NLP) of textual content. Code-switching is a ubiquitous phenomenon in multilingual nations, and social media communication often mixes a low-resource language with a high-resource language in the same text, mostly written in a non-native script. Such code-switched text must be processed to support NLP tasks such as Machine Translation, Automated Conversational Systems, and Sentiment Analysis (SA). The primary objective of SA is to identify and analyze the attitude, opinion, emotion, or sentiment expressed in a dataset. Although many systems are trained on monolingual datasets, they all break down on code-diverse data because of the added complexity of mixing at various levels of the text. Few resources exist for modelling such code-mixed data, and Machine Learning or Deep Learning algorithms that use supervised learning yield better results than unsupervised learning. Such datasets are available for Hindi-English, Tamil-English, Malayalam-English, Bengali-English, German-English, Spanish-English, Japanese-English, Arabic-English, etc. Our research focuses on NLP for emotion and sentiment detection in Tulu, a vibrant South Indian language. As a first step, we build the first platinum-standard corpus for NLP applications on code-diverse Tulu-English text, as no such resource exists for our native language. A Krippendorff's Alpha value of 0.9 for our dataset indicates that it can serve as a benchmark in the development of an Automatic Sentiment Analysis system for Tulu.
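The abstract reports annotation quality as a Krippendorff's Alpha value. As a rough illustration of how such an agreement score can be computed, the following is a minimal sketch for nominal labels. The abstract does not state which variant of the coefficient, how many annotators, or which label set was used, so the function name `krippendorff_alpha_nominal`, the nominal distance, and the toy labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per annotated item; each entry is the
    list of labels given by the annotators, with None for a missing label.
    """
    # Coincidence matrix: every ordered pair of labels inside one item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c for each label and the grand total n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (nominal distance: 0 if equal, else 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy data: three annotators labelling five comments.
    toy_units = [
        ["positive", "positive", "positive"],
        ["negative", "negative", "neutral"],
        ["neutral", "neutral", None],
        ["positive", "positive", "positive"],
        ["negative", "negative", "negative"],
    ]
    print(round(krippendorff_alpha_nominal(toy_units), 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a reported value of 0.9 would correspond to annotators disagreeing on only a small share of label pairs.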

