Ekapop Verasakulvong, P. Vateekul, Apivadee Piyatumrong, Chatchawal Sangkeettrakarn
Online Emerging Topic Detection on Twitter Using Random Forest with Stock Indicator Features
2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), published 2018-07-01
DOI: 10.1109/JCSSE.2018.8457349 (https://doi.org/10.1109/JCSSE.2018.8457349)
Citations: 3
Abstract
Social media is one of the fastest and most impactful communication channels. By monitoring Twitter streams, we can detect emerging topics and understand events around the world. Prior work has attempted online topic detection on Twitter, but existing methods can only detect bursty topics, using user-defined keywords along with simple rules. In this paper, we propose an algorithm to detect emerging topics on Twitter streams. A clustering technique aggregates sets of keywords into topics. Since an emerging topic persists over time, topics are merged with a stateful technique that accumulates them across different time intervals. To detect both high-signal and small-to-medium-signal topics, we use statistical features based on average, acceleration, and z-score. Moreover, we propose two stock indicator features: Relative Strength Index (RSI) and Stochastic Oscillator (STOCH). Both are commonly used in stock analysis to detect trends (oversold and overbought conditions), a task similar to topic detection on Twitter. To capture varied event patterns, we use a Random Forest (RF) classifier that detects emerging keywords from these five features. For evaluation, we created and published a corpus by collecting Twitter data for 10 days, over 80 million tweets, and labeling 161 events in total as possible topics along with their related keywords. Experiments on this corpus show that, in terms of F1, our model outperforms all baselines (TwitterMonitor, SigniTrend, and TopicSketch) on both detected keywords and topics.
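To make the stock indicator features concrete, the sketch below computes RSI, the Stochastic Oscillator (%K), and a z-score over a keyword's per-interval tweet counts. It is an illustration only: the abstract does not give the paper's exact formulas or window sizes, so the code uses the standard textbook definitions (RSI in its simple-average variant), and the function names, the `period` parameter, and the neutral value of 50 for flat series are assumptions.

```python
from statistics import mean, pstdev


def rsi(counts, period=14):
    """Relative Strength Index of the last `period` changes (simple-average variant).

    `counts` is a hypothetical list of per-interval frequencies for one keyword.
    """
    if len(counts) < period + 1:
        raise ValueError("need at least period + 1 observations")
    window = counts[-(period + 1):]
    deltas = [b - a for a, b in zip(window, window[1:])]
    avg_gain = sum(d for d in deltas if d > 0) / period
    avg_loss = sum(-d for d in deltas if d < 0) / period
    if avg_gain == 0 and avg_loss == 0:
        return 50.0  # assumption: treat a flat series as neutral
    if avg_loss == 0:
        return 100.0  # only gains in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def stochastic_k(counts, period=14):
    """Stochastic Oscillator %K: position of the latest count within the recent range."""
    if len(counts) < period:
        raise ValueError("need at least period observations")
    window = counts[-period:]
    low, high = min(window), max(window)
    if high == low:
        return 50.0  # assumption: treat a flat window as neutral
    return 100.0 * (counts[-1] - low) / (high - low)


def z_score(counts):
    """Z-score of the latest count against the preceding history."""
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    history, current = counts[:-1], counts[-1]
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # assumption: no variation in history means no signal
    return (current - mean(history)) / sigma
```

In a pipeline like the one described, values such as `rsi(counts)`, `stochastic_k(counts)`, and `z_score(counts)` would be combined with the average and acceleration features into one feature vector per keyword and time interval, then passed to a classifier such as Random Forest. A high RSI or %K, by analogy with an "overbought" stock, indicates a keyword whose frequency has been rising sharply.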