Using Big Data Analytics for Authorship Authentication of Arabic Tweets
Jafar Albadarneh, Bashar Talafha, M. Al-Ayyoub, B. Zaqaibeh, Mohammad Al-Smadi, Y. Jararweh, E. Benkhelifa
2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)
Publication date: 2015-12-01
DOI: 10.1109/UCC.2015.80 (https://doi.org/10.1109/UCC.2015.80)
Citations: 29
Abstract
Authorship authentication is concerned with correctly attributing a text to its author based on its contents. It is an important problem with deep roots in history, as many classical texts have doubtful attributions. The information age and the ubiquitous use of the Internet are further complicating this problem and adding more dimensions to it. We are interested in the modern version of this problem, where the text whose authorship needs authentication is found in online social networks. Specifically, we are interested in the authorship authentication of tweets. This is not the only challenging aspect we consider. Another is the language of the tweets: most current works and existing tools support English, whereas we focus on the very important, yet largely understudied, Arabic language. Finally, we add a further challenge by addressing the problem at a very large scale. We present our effort to employ big data analytics to address the authorship authentication problem for Arabic tweets. We start by crawling a dataset of more than 53K tweets distributed across 20 authors. We then apply preprocessing steps to clean the data and prepare it for analysis. The next step is to compute the feature vector of each tweet. We use the Bag-of-Words (BOW) approach and compute the weights using Term Frequency-Inverse Document Frequency (TF-IDF). Then, we feed the dataset to a Naive Bayes classifier implemented on Hadoop, a parallel and distributed computing framework. To the best of our knowledge, none of the previous works on authorship authentication of Arabic text has addressed the unique challenges associated with (1) tweets and (2) large-scale datasets. This makes our work unique on many levels. The results show that the testing accuracy is not very high (61.6%), which is expected in the very challenging setting that we consider.
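
The pipeline described in the abstract (bag-of-words features, TF-IDF weights, Naive Bayes classification) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation, which runs on Hadoop and is not detailed here. The whitespace tokenizer, the smoothed IDF formula, the additive smoothing parameter `alpha`, all function names, and the toy training data are assumptions introduced only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace splitting stands in for the paper's preprocessing.
    return text.split()


def fit_idf(docs):
    """Compute a smoothed IDF table from the training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    # Assumption: smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Bag-of-words vector {term: tf * idf}; terms unseen in training are dropped."""
    counts = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    idf = fit_idf(docs)
    weight = defaultdict(lambda: defaultdict(float))  # label -> term -> summed weight
    for doc, label in zip(docs, labels):
        for term, w in tfidf(doc, idf).items():
            weight[label][term] += w
    n_docs = len(docs)
    log_prior, log_like = {}, {}
    for label, n in Counter(labels).items():
        total = sum(weight[label].values()) + alpha * len(idf)
        log_prior[label] = math.log(n / n_docs)
        log_like[label] = {
            t: math.log((weight[label].get(t, 0.0) + alpha) / total) for t in idf
        }
    return {"idf": idf, "log_prior": log_prior, "log_like": log_like}


def predict(model, doc):
    """Return the label with the highest log-posterior score for one document."""
    vec = tfidf(doc, model["idf"])
    scores = {
        label: prior + sum(w * model["log_like"][label][t] for t, w in vec.items())
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Made-up toy data (hypothetical authors and texts), not taken from the paper's dataset.
train_docs = [
    "goal match football",
    "football league match",
    "market stocks price",
    "price oil market",
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]

model = train(train_docs, train_labels)
print(predict(model, "football match tonight"))
```

In a distributed setting, the document-frequency counts and the per-author weight sums are the parts that would naturally be computed as aggregate counts over partitions of the tweets. The sketch keeps everything in memory to make the arithmetic easy to follow.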