A Code-Diverse Tulu-English Dataset For NLP Based Sentiment Analysis Applications
Prashanth Kannadaguli
2021 Advanced Communication Technologies and Signal Processing (ACTS), published 2021-12-15
DOI: 10.1109/ACTS53447.2021.9708241
The growing use of social media has raised interest in Natural Language Processing (NLP) of textual content. Code-switching is a ubiquitous phenomenon in multilingual nations: social media communication often mixes a low-resource language with a high-resource language in the same text, mostly written in a non-native script. Such code-switched text must be processed to support NLP tasks such as Machine Translation, Automated Conversational Systems and Sentiment Analysis (SA). The primary objective of SA is to identify and analyze the attitude, opinion, emotion or sentiment expressed in a dataset. Although many systems have been trained on monolingual datasets, they all break down on code-diverse data because of the added complexity of mixing at various levels of the text. Few resources exist for modelling such code-mixed data, and Machine Learning or Deep Learning algorithms that use supervised learning yield better results than unsupervised approaches. Such datasets are available for Hindi-English, Tamil-English, Malayalam-English, Bengali-English, German-English, Spanish-English, Japanese-English and Arabic-English, among others. Our research focuses on NLP for emotion and sentiment detection in Tulu, a vibrant South Indian language. As a first step, we build the first platinum-standard corpus for NLP applications of code-diverse Tulu-English text, as no such resource exists for our native language. An analysis of the dataset yields a Krippendorff’s alpha of 0.9, indicating that it can serve as a benchmark for developing Automatic Sentiment Analysis systems for Tulu.
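For readers unfamiliar with the agreement measure cited above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. The abstract does not state which variant of alpha was used, how many annotators took part, or what the label set was, so the nominal metric, the function name `krippendorff_alpha_nominal`, and the toy labels below are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list with one inner list per annotated item; each inner
    list holds the labels assigned by the annotators, with None marking
    a missing annotation. (Hypothetical helper, not from the paper.)
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by different annotators contributes 1 / (m - 1), where m is
    # the number of labels that item received.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two labels are not pairable
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c and the overall number of pairable values n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Made-up labels from three hypothetical annotators for four comments.
    toy_units = [
        ["positive", "positive", "positive"],
        ["negative", "negative", "negative"],
        ["positive", "positive", "negative"],
        ["neutral", "neutral", None],
    ]
    print(round(krippendorff_alpha_nominal(toy_units), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is why a value of 0.9 is read as strong annotator agreement.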