使用大数据分析进行阿拉伯语推文的作者身份验证

Jafar Albadarneh, Bashar Talafha, M. Al-Ayyoub, B. Zaqaibeh, Mohammad Al-Smadi, Y. Jararweh, E. Benkhelifa
{"title":"使用大数据分析进行阿拉伯语推文的作者身份验证","authors":"Jafar Albadarneh, Bashar Talafha, M. Al-Ayyoub, B. Zaqaibeh, Mohammad Al-Smadi, Y. Jararweh, E. Benkhelifa","doi":"10.1109/UCC.2015.80","DOIUrl":null,"url":null,"abstract":"Authorship authentication of a certain text is concerned with correctly attributing it to its author based on its contents. It is a very important problem with deep root in history as many classical texts have doubtful attributions. The information age and ubiquitous use of the Internet is further complicating this problem and adding more dimensions to it. We are interested in the modern version of this problem where the text whose authorship needs authentication is an online text found in online social networks. Specifically, we are interested in the authorship authentication of tweets. This is not the only challenging aspect we consider here. Another challenging aspect is the language of the tweets. Most current works and existing tools support English. We chose to focus on the very important, yet largely understudied, Arabic language. Finally, we add another challenging aspect to the problem at hand by addressing it at a very large scale. We present our effort to employ big data analytics to address the authorship authentication problem of Arabic tweets. We start by crawling a dataset of more than 53K tweets distributed across 20 authors. We then use preprocessing steps to clean the data and prepare it for analysis. The next step is to compute the feature vectors of each tweet. We use the Bag-Of-Words (BOW) approach and compute the weights using the Term Frequency-Inverse Document Frequency (TF-IDF). Then, we feed the dataset to a Naive Bayes classifier implemented on a parallel and distributed computing framework known as Hadoop. To the best of our knowledge, none of the previous works on authorship authentication of Arabic text addressed the unique challenges associated with (1) tweets and (2) large-scale datasets. This makes our work unique on many levels. The results show that the testing accuracy is not very high (61.6%), which is expected in the very challenging setting that we consider.","PeriodicalId":381279,"journal":{"name":"2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Using Big Data Analytics for Authorship Authentication of Arabic Tweets\",\"authors\":\"Jafar Albadarneh, Bashar Talafha, M. Al-Ayyoub, B. Zaqaibeh, Mohammad Al-Smadi, Y. Jararweh, E. Benkhelifa\",\"doi\":\"10.1109/UCC.2015.80\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Authorship authentication of a certain text is concerned with correctly attributing it to its author based on its contents. It is a very important problem with deep root in history as many classical texts have doubtful attributions. The information age and ubiquitous use of the Internet is further complicating this problem and adding more dimensions to it. We are interested in the modern version of this problem where the text whose authorship needs authentication is an online text found in online social networks. Specifically, we are interested in the authorship authentication of tweets. This is not the only challenging aspect we consider here. Another challenging aspect is the language of the tweets. Most current works and existing tools support English. We chose to focus on the very important, yet largely understudied, Arabic language. Finally, we add another challenging aspect to the problem at hand by addressing it at a very large scale. We present our effort to employ big data analytics to address the authorship authentication problem of Arabic tweets. We start by crawling a dataset of more than 53K tweets distributed across 20 authors. We then use preprocessing steps to clean the data and prepare it for analysis. The next step is to compute the feature vectors of each tweet. We use the Bag-Of-Words (BOW) approach and compute the weights using the Term Frequency-Inverse Document Frequency (TF-IDF). Then, we feed the dataset to a Naive Bayes classifier implemented on a parallel and distributed computing framework known as Hadoop. To the best of our knowledge, none of the previous works on authorship authentication of Arabic text addressed the unique challenges associated with (1) tweets and (2) large-scale datasets. This makes our work unique on many levels. The results show that the testing accuracy is not very high (61.6%), which is expected in the very challenging setting that we consider.\",\"PeriodicalId\":381279,\"journal\":{\"name\":\"2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/UCC.2015.80\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/UCC.2015.80","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

摘要

文本的作者身份认证是指根据文本的内容,正确地将文本归属于作者。这是一个有着深厚历史渊源的重要问题,因为许多经典文本都有可疑的出处。信息时代和互联网的普遍使用使这个问题进一步复杂化,并增加了更多的维度。我们感兴趣的是这个问题的现代版本,其作者身份需要认证的文本是在线社交网络中发现的在线文本。具体来说,我们对tweet的作者身份验证感兴趣。这并不是我们在这里考虑的唯一具有挑战性的方面。另一个具有挑战性的方面是推文的语言。大多数当前的作品和现有的工具支持英语。我们选择把重点放在非常重要、但在很大程度上未被充分研究的阿拉伯语上。最后,我们通过在非常大的范围内解决问题,为手头的问题增加了另一个具有挑战性的方面。我们展示了我们的努力,采用大数据分析来解决阿拉伯语推文的作者身份认证问题。我们首先爬取分布在20位作者的超过53K条推文的数据集。然后,我们使用预处理步骤来清理数据并为分析做准备。下一步是计算每条推文的特征向量。我们使用词袋(BOW)方法,并使用术语频率-逆文档频率(TF-IDF)计算权重。然后,我们将数据集提供给在并行和分布式计算框架Hadoop上实现的朴素贝叶斯分类器。据我们所知,以前关于阿拉伯文本作者身份认证的工作都没有解决与(1)推文和(2)大规模数据集相关的独特挑战。这使得我们的工作在很多层面上都是独一无二的。结果表明,测试精度不是很高(61.6%),这在我们考虑的非常具有挑战性的设置中是预期的。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Using Big Data Analytics for Authorship Authentication of Arabic Tweets
Authorship authentication of a certain text is concerned with correctly attributing it to its author based on its contents. It is a very important problem with deep root in history as many classical texts have doubtful attributions. The information age and ubiquitous use of the Internet is further complicating this problem and adding more dimensions to it. We are interested in the modern version of this problem where the text whose authorship needs authentication is an online text found in online social networks. Specifically, we are interested in the authorship authentication of tweets. This is not the only challenging aspect we consider here. Another challenging aspect is the language of the tweets. Most current works and existing tools support English. We chose to focus on the very important, yet largely understudied, Arabic language. Finally, we add another challenging aspect to the problem at hand by addressing it at a very large scale. We present our effort to employ big data analytics to address the authorship authentication problem of Arabic tweets. We start by crawling a dataset of more than 53K tweets distributed across 20 authors. We then use preprocessing steps to clean the data and prepare it for analysis. The next step is to compute the feature vectors of each tweet. We use the Bag-Of-Words (BOW) approach and compute the weights using the Term Frequency-Inverse Document Frequency (TF-IDF). Then, we feed the dataset to a Naive Bayes classifier implemented on a parallel and distributed computing framework known as Hadoop. To the best of our knowledge, none of the previous works on authorship authentication of Arabic text addressed the unique challenges associated with (1) tweets and (2) large-scale datasets. This makes our work unique on many levels. The results show that the testing accuracy is not very high (61.6%), which is expected in the very challenging setting that we consider.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
CYCLONE Unified Deployment and Management of Federated, Multi-cloud Applications Cloud Orchestration Features: Are Tools Fit for Purpose? Efficient Update of Encrypted Files for Cloud Storage Adaptive Performance Isolation Middleware for Multi-tenant SaaS Agent-Based Modelling as a Service on Amazon EC2: Opportunities and Challenges
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1