{"title":"Social Media Data Analysis Using MapReduce Programming Model and Training a Tweet Classifier Using Apache Mahout","authors":"Umit Demirbaga, D. Jha","doi":"10.1109/SC2.2018.00024","DOIUrl":null,"url":null,"abstract":"Twitter, a micro-blogging service, has been generating a large amount of data every minute as it gives people chance to express their thoughts and feelings quickly and clearly about any topics. To obtain the desired information from these available big data, it requires high-performance parallel computing tools along with machine learning algorithms' support. Emerging big data processing frameworks (e.g. Hadoop) can handle such big data effectively. In this paper, we, firstly introduce a novel approach to automatically classify Twitter data obtained from British Geological Survey (BGS), collected using some specific keywords such as landslide, landslides, mudslide, landfall, landslip, soil sliding, based on tweet post date and the countries where tweets are posted using MapReduce algorithm. We then propose a model to distinguish the tweets if they are landslides-related using Naïve-Bayes machine learning algorithm with n-Grams language model on Mahout. This paper also describes an algorithm for the pre-processing steps to make the semi-structured Twitter text data ready for classification. The proposed methods are useful for the BGS and other interested people to be able to see the name and number of the countries where the tweets are sent, the number of tweets sent from each country, the dates and time intervals of the tweets, and to classify the tweets whether they are related to landslides.","PeriodicalId":340244,"journal":{"name":"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC2.2018.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
Twitter, a micro-blogging service, has been generating a large amount of data every minute as it gives people chance to express their thoughts and feelings quickly and clearly about any topics. To obtain the desired information from these available big data, it requires high-performance parallel computing tools along with machine learning algorithms' support. Emerging big data processing frameworks (e.g. Hadoop) can handle such big data effectively. In this paper, we, firstly introduce a novel approach to automatically classify Twitter data obtained from British Geological Survey (BGS), collected using some specific keywords such as landslide, landslides, mudslide, landfall, landslip, soil sliding, based on tweet post date and the countries where tweets are posted using MapReduce algorithm. We then propose a model to distinguish the tweets if they are landslides-related using Naïve-Bayes machine learning algorithm with n-Grams language model on Mahout. This paper also describes an algorithm for the pre-processing steps to make the semi-structured Twitter text data ready for classification. The proposed methods are useful for the BGS and other interested people to be able to see the name and number of the countries where the tweets are sent, the number of tweets sent from each country, the dates and time intervals of the tweets, and to classify the tweets whether they are related to landslides.