{"title":"Pythia: A System for Online Topic Discovery of Social Media Posts","authors":"Juliana Litou, V. Kalogeraki","doi":"10.1109/ICDCS.2017.289","DOIUrl":null,"url":null,"abstract":"Social media constitute nowadays one of the most common communication mediums. Millions of users exploit them daily to share information with their community in the network via messages, referred as posts. The massive volume of information shared is extremely diverse and covers a vast spectrum of topics and interests. Automatically identifying the topics of the posts is of particular interest as this can assist in a variety of applications, such as event detection, trends discovery, expert finding etc. However, designing an automated system that requires no human agent participation to identify the topics covered in posts published in Online Social Networks (OSNs) presents manifold challenges. First, posts are unstructured and commonly short, limited to just a few characters. This prevents existing classification schemes to be directly applied in such cases, due to sparseness of the text. Second, new information emerges constantly, hence building a learning corpus from past posts may fail to capture the ever evolving information emerging in OSNs. To overcome the aforementioned limitations we have designed Pythia, an automated system for short text classification that exploits the Wikipedia structure and articles to identify the topics of the posts. The topic discovery is performed in two phases. In the first step, the system exploits Wikipedia categories and articles of the corresponding categories to build the training corpus for the suppervised learning. In the second step, the text of a given post is augmented using a text enrichment mechanism that extends the post with relevant Wikipedia articles. After the initial steps are performed, we deploy k-NN classifier to determine the topic(s) covered in the original post.","PeriodicalId":127689,"journal":{"name":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","volume":"75 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2017.289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
Social media constitute nowadays one of the most common communication mediums. Millions of users exploit them daily to share information with their community in the network via messages, referred as posts. The massive volume of information shared is extremely diverse and covers a vast spectrum of topics and interests. Automatically identifying the topics of the posts is of particular interest as this can assist in a variety of applications, such as event detection, trends discovery, expert finding etc. However, designing an automated system that requires no human agent participation to identify the topics covered in posts published in Online Social Networks (OSNs) presents manifold challenges. First, posts are unstructured and commonly short, limited to just a few characters. This prevents existing classification schemes to be directly applied in such cases, due to sparseness of the text. Second, new information emerges constantly, hence building a learning corpus from past posts may fail to capture the ever evolving information emerging in OSNs. To overcome the aforementioned limitations we have designed Pythia, an automated system for short text classification that exploits the Wikipedia structure and articles to identify the topics of the posts. The topic discovery is performed in two phases. In the first step, the system exploits Wikipedia categories and articles of the corresponding categories to build the training corpus for the suppervised learning. In the second step, the text of a given post is augmented using a text enrichment mechanism that extends the post with relevant Wikipedia articles. After the initial steps are performed, we deploy k-NN classifier to determine the topic(s) covered in the original post.
社交媒体是当今最常见的交流媒介之一。数以百万计的用户每天利用微博通过消息(即帖子)与他们在网络上的社区分享信息。共享的大量信息极其多样化,涵盖了广泛的主题和兴趣。自动识别帖子的主题是特别有趣的,因为这可以帮助各种应用程序,如事件检测,趋势发现,专家寻找等。然而,设计一个不需要人工参与的自动化系统来识别在线社交网络(OSNs)上发布的帖子所涵盖的主题,面临着多方面的挑战。首先,帖子没有结构,通常很短,只有几个字。由于文本的稀疏性,这阻止了现有的分类方案直接应用于这种情况。其次,新信息不断出现,因此从过去的帖子中构建学习语料库可能无法捕获osn中不断发展的信息。To overcome the aforementioned limitations we have designed Pythia, an automated system for short text classification that exploits the Wikipedia structure and articles to identify the topics of the posts. 主题发现分两个阶段进行。In the first step, the system exploits Wikipedia categories and articles of the corresponding categories to build the training corpus for the suppervised learning. 在第二步中,使用文本充实机制对给定文章的文本进行扩充,该机制用相关的Wikipedia文章扩展文章。在执行初始步骤之后,我们部署k-NN分类器来确定原始帖子中涵盖的主题。