Investigating Cross-Lingual Transfer Learning Techniques for Urdu Text Using Word Embeddings
Shujah Ur Rehman, Bilal Tahir, M. Mehmood
2021 15th International Conference on Open Source Systems and Technologies (ICOSST), 2021-12-15
DOI: https://doi.org/10.1109/ICOSST53930.2021.9683873
The abundance of online content has driven the development of sophisticated Natural Language Processing (NLP) and Information Retrieval (IR) tools. However, such tools are available only for English and other high-resource languages, not for low-resource languages such as Urdu. Cross-lingual transfer learning techniques are therefore commonly adopted to apply tools developed for English to low-resource languages. In this paper, we evaluate three word-level transfer learning methods for Urdu text: OrthoMap, VecMap-supervised, and VecMap-unsupervised. We test these methods on three tasks: propaganda identification, topic classification, and sentiment analysis. For this purpose, we augment an English-Urdu word dictionary and three datasets: Ur-En Propaganda, Ur-En News Dataset, and Ur-En Sentiment Corpus. Our analysis shows that the transfer learning methods perform better on the short text of the Ur-En Sentiment Corpus, reaching a precision of 40.1%. For propaganda detection, the classifier attained an accuracy of 83% after transfer learning, which is competitive with the 87% accuracy achieved by training the model directly on Urdu text data. We believe this work will benefit NLP, IR, and computational linguistics researchers working on Urdu-language content.
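The abstract does not describe how the word-level mappings are computed. As a rough illustration of the general idea behind supervised embedding alignment, the sketch below assumes the mapping is an orthogonal matrix learned from dictionary word pairs by solving the orthogonal Procrustes problem. This is an assumption about this family of methods, not the authors' implementation of OrthoMap or VecMap. The function names, the toy dimensions, and the synthetic "Urdu" and "English" vectors are all hypothetical.

```python
import numpy as np


def learn_orthogonal_map(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are the embeddings of one
    dictionary word pair (hypothetical setup, not taken from the paper).
    """
    # Orthogonal Procrustes: take the SVD of the cross-covariance matrix
    # and drop the singular values.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def nearest_neighbor(mapped: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """For each mapped source vector, return the index of the most
    cosine-similar target vector."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


def toy_demo(n_words: int = 50, dim: int = 8, n_dict: int = 20, seed: int = 0):
    """Synthetic check: the target space is an exact rotation of the source space."""
    rng = np.random.default_rng(seed)
    ur = rng.normal(size=(n_words, dim))            # stand-in for Urdu embeddings
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    en = ur @ rotation                              # stand-in for English embeddings
    # Learn the map from the first n_dict pairs only (the "dictionary").
    w = learn_orthogonal_map(ur[:n_dict], en[:n_dict])
    # Translate the held-out words by nearest-neighbor retrieval.
    predicted = nearest_neighbor(ur[n_dict:] @ w, en)
    expected = np.arange(n_dict, n_words)
    return w, rotation, float(np.mean(predicted == expected))
```

In this toy setting the two spaces differ by an exact rotation, so the learned map can recover it from the dictionary pairs alone. Real Urdu and English embeddings are only approximately related in this way, which is one reason retrieval precision on real data can be far lower than on synthetic data.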