{"title":"新闻事件的分布式和动态聚类","authors":"Vinay Setty","doi":"10.1145/3210284.3219774","DOIUrl":null,"url":null,"abstract":"The primary consumption of news is now increasingly online and has resulted in a large volume of online news from varied news outlets. Consequently news aggregators have become popular for clustering, ranking and personalization of news which process millions of news articles each day. In addition, since news articles stream constantly, there is a need for a scalable event-based system which can facilitate news mining in an online fashion. To address these challenges, we propose a distributed framework to process news articles and cluster them to facilitate many news mining tasks. The core of our system is a novel and scalable distributed clustering algorithm using Locality Sensitive Hashing which is robust to outliers and noise. In addition, we also propose an online version of the clustering algorithm to dynamically maintain the news event clusters. We implement the proposed solution on Apache Spark. Using a large news collection with over 8 million news articles, we show that our approach outperforms widely-used clustering techniques such as K-Means both in run time and clustering quality.","PeriodicalId":412438,"journal":{"name":"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distributed and Dynamic Clustering For News Events\",\"authors\":\"Vinay Setty\",\"doi\":\"10.1145/3210284.3219774\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The primary consumption of news is now increasingly online and has resulted in a large volume of online news from varied news outlets. Consequently news aggregators have become popular for clustering, ranking and personalization of news which process millions of news articles each day. In addition, since news articles stream constantly, there is a need for a scalable event-based system which can facilitate news mining in an online fashion. To address these challenges, we propose a distributed framework to process news articles and cluster them to facilitate many news mining tasks. The core of our system is a novel and scalable distributed clustering algorithm using Locality Sensitive Hashing which is robust to outliers and noise. In addition, we also propose an online version of the clustering algorithm to dynamically maintain the news event clusters. We implement the proposed solution on Apache Spark. Using a large news collection with over 8 million news articles, we show that our approach outperforms widely-used clustering techniques such as K-Means both in run time and clustering quality.\",\"PeriodicalId\":412438,\"journal\":{\"name\":\"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3210284.3219774\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3210284.3219774","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Distributed and Dynamic Clustering For News Events
The primary consumption of news is now increasingly online and has resulted in a large volume of online news from varied news outlets. Consequently news aggregators have become popular for clustering, ranking and personalization of news which process millions of news articles each day. In addition, since news articles stream constantly, there is a need for a scalable event-based system which can facilitate news mining in an online fashion. To address these challenges, we propose a distributed framework to process news articles and cluster them to facilitate many news mining tasks. The core of our system is a novel and scalable distributed clustering algorithm using Locality Sensitive Hashing which is robust to outliers and noise. In addition, we also propose an online version of the clustering algorithm to dynamically maintain the news event clusters. We implement the proposed solution on Apache Spark. Using a large news collection with over 8 million news articles, we show that our approach outperforms widely-used clustering techniques such as K-Means both in run time and clustering quality.