Gossip-based spectral clustering of distributed data streams
Matt Talistu, Teng-Sheng Moh, M. Moh
2015 International Conference on High Performance Computing & Simulation (HPCS), 2015-07-20
DOI: https://doi.org/10.1109/HPCSim.2015.7237058
With the growth of the Internet, social networks, and other distributed systems, an abundance of data about user transactions, network traffic, social interactions, and other areas is available for analysis. Extracting knowledge from this data has recently become a growing field of research, especially as the size of the data makes traditional data mining methods ineffective. Some approaches assume that the data is at a central location or that a complete data set is available for analysis. However, many modern-day applications consume distributed data streams: the data set is spread across multiple locations, and each location has access to only a portion of the data stream. We propose a distributed data stream analysis method that uses hierarchical clustering for local online summaries, a gossip protocol for distributing these summaries, and spectral clustering for offline analysis. The resulting solution avoids the heavy computation and communication requirements of a centralized approach. Through experiments, we demonstrate that the proposed solution accurately clusters the data streams and is highly scalable. Its quality increases significantly as the number of microclusters increases, yet it remains fault-tolerant when this number is small. Finally, it achieves a level of accuracy similar to that of a centralized approach.
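The three stages named in the abstract (local online summary, gossip distribution of summaries, offline spectral clustering) can be illustrated with a small sketch. This is not the authors' implementation, and every name and parameter in it (`summarize`, `gossip`, `spectral_bipartition`, `radius`, `sigma`, the toy streams) is hypothetical. To stay short and dependency-free it makes several simplifying assumptions: the data is one-dimensional, the local summary uses a nearest-centroid threshold rule as a stand-in for the hierarchical clustering the paper describes, the gossip is a basic push-pull exchange with a uniformly random peer, and the spectral step is a two-way cut from the Fiedler vector of the unnormalized graph Laplacian, found by power iteration.

```python
import math
import random

def summarize(stream, radius):
    """Online local summary; each microcluster is [count, linear_sum].
    Assumption: a threshold rule stands in for hierarchical clustering."""
    clusters = []
    for x in stream:
        best = min(clusters, key=lambda c: abs(c[1] / c[0] - x), default=None)
        if best is not None and abs(best[1] / best[0] - x) <= radius:
            best[0] += 1          # absorb the point into the nearest microcluster
            best[1] += x
        else:
            clusters.append([1, x])  # too far from everything: open a new one
    return clusters

def gossip(local, rounds, rng):
    """Push-pull gossip: in each round every node merges its known
    summaries with those of one uniformly random peer."""
    known = [{(i, j): tuple(mc) for j, mc in enumerate(mcs)}
             for i, mcs in enumerate(local)]
    for _ in range(rounds):
        for i in range(len(known)):
            peer = rng.choice([p for p in range(len(known)) if p != i])
            merged = {**known[i], **known[peer]}
            known[i], known[peer] = merged, dict(merged)
    return known

def spectral_bipartition(centroids, sigma=2.0, iters=200):
    """Two-way spectral cut: sign of the Fiedler vector of L = D - W,
    obtained by power iteration on (shift * I - L) with the constant
    vector projected out. Microcluster counts are ignored here."""
    n = len(centroids)
    w = [[0.0 if i == j else
          math.exp(-(centroids[i] - centroids[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    shift = 2 * max(deg)          # upper bound on the largest eigenvalue of L
    v = list(centroids)           # deterministic starting vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # stay orthogonal to the constant vector
        v = [(shift - deg[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]

def cluster_view(view):
    """Offline analysis on one node's gossiped view of all summaries."""
    keys = sorted(view)
    centroids = [view[k][1] / view[k][0] for k in keys]
    return dict(zip(keys, spectral_bipartition(centroids)))

# Toy data: four sites, each seeing part of a stream drawn from two groups.
streams = [
    [0.1, 0.4, 9.8, 0.2, 10.3],
    [10.1, 9.9, 0.3, 10.4],
    [-0.2, 0.0, 0.5, 9.7],
    [10.0, 10.2, -0.1, 0.6],
]
local = [summarize(s, radius=1.0) for s in streams]
views = gossip(local, rounds=20, rng=random.Random(0))
labels = cluster_view(views[0])
```

In this sketch only the compact (count, sum) pairs travel between sites, never the raw points, which is the property that lets a summary-based design avoid centralizing the stream. Any node can run `cluster_view` on its own view once the gossip has spread the summaries; `radius` and `sigma` would need tuning to the scale of real data.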