Social Media Data Analysis Using MapReduce Programming Model and Training a Tweet Classifier Using Apache Mahout

2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2) Pub Date : 2018-11-01 DOI:10.1109/SC2.2018.00024

Umit Demirbaga, D. Jha

Citations: 6

Abstract

Twitter, a micro-blogging service, generates a large amount of data every minute because it lets people express their thoughts and feelings on any topic quickly and clearly. Extracting the desired information from this big data requires high-performance parallel computing tools supported by machine learning algorithms. Emerging big data processing frameworks (e.g., Hadoop) can handle such data effectively. In this paper, we first introduce a novel approach that uses the MapReduce programming model to automatically classify Twitter data obtained from the British Geological Survey (BGS) by tweet post date and by the country from which each tweet was posted. The data were collected using specific keywords such as landslide, landslides, mudslide, landfall, landslip, and soil sliding. We then propose a model that determines whether a tweet is landslide-related, using the Naïve Bayes machine learning algorithm with an n-gram language model on Mahout. This paper also describes an algorithm for the pre-processing steps that make the semi-structured Twitter text data ready for classification. The proposed methods let the BGS and other interested parties see the names and number of countries from which tweets were sent, the number of tweets sent from each country, and the dates and time intervals of the tweets, and classify tweets by whether they are related to landslides.
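The grouping of tweets by country and post date described in the abstract can be illustrated with a minimal single-process sketch of the map, shuffle, and reduce phases. This is not the authors' Hadoop job: the record layout `(date, country, text)`, the sample tweets, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample records: (ISO date, country code, text).
# These are illustrative values, not data from the paper.
TWEETS = [
    ("2018-03-01", "GB", "Landslide closes road near the coast"),
    ("2018-03-01", "GB", "Another landslip reported this morning"),
    ("2018-03-02", "IN", "Mudslide after heavy rain"),
]


def map_tweet(tweet):
    """Map phase: emit one ((country, date), 1) pair per tweet."""
    date, country, _text = tweet
    yield (country, date), 1


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def count_by_country_and_date(tweets):
    """Run the three phases in-process and return {(country, date): count}."""
    pairs = chain.from_iterable(map_tweet(t) for t in tweets)
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    for (country, date), n in sorted(count_by_country_and_date(TWEETS).items()):
        print(country, date, n)
```

In a real Hadoop deployment the framework performs the shuffle step and distributes the map and reduce calls across nodes; the sketch only shows how the key choice `(country, date)` yields per-country, per-date tweet counts.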

