{"title":"基于spark的宏基因组序列聚类标记扩散和标记选择社区检测算法","authors":"Zhengjiang Wu, Xuyang Wu, Junwei Luo","doi":"10.1007/s44196-023-00348-w","DOIUrl":null,"url":null,"abstract":"Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.","PeriodicalId":54967,"journal":{"name":"International Journal of Computational Intelligence Systems","volume":"233 1","pages":"0"},"PeriodicalIF":2.9000,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spark-Based Label Diffusion and Label Selection Community Detection Algorithm for Metagenome Sequence Clustering\",\"authors\":\"Zhengjiang Wu, Xuyang Wu, Junwei Luo\",\"doi\":\"10.1007/s44196-023-00348-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.\",\"PeriodicalId\":54967,\"journal\":{\"name\":\"International Journal of Computational Intelligence Systems\",\"volume\":\"233 1\",\"pages\":\"0\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2023-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computational Intelligence Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s44196-023-00348-w\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computational Intelligence Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s44196-023-00348-w","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Spark-Based Label Diffusion and Label Selection Community Detection Algorithm for Metagenome Sequence Clustering
Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.
期刊介绍:
The International Journal of Computational Intelligence Systems publishes original research on all aspects of applied computational intelligence, especially targeting papers demonstrating the use of techniques and methods originating from computational intelligence theory. The core theories of computational intelligence are fuzzy logic, neural networks, evolutionary computation and probabilistic reasoning. The journal publishes only articles related to the use of computational intelligence and broadly covers the following topics:
-Autonomous reasoning-
Bio-informatics-
Cloud computing-
Condition monitoring-
Data science-
Data mining-
Data visualization-
Decision support systems-
Fault diagnosis-
Intelligent information retrieval-
Human-machine interaction and interfaces-
Image processing-
Internet and networks-
Noise analysis-
Pattern recognition-
Prediction systems-
Power (nuclear) safety systems-
Process and system control-
Real-time systems-
Risk analysis and safety-related issues-
Robotics-
Signal and image processing-
IoT and smart environments-
Systems integration-
System control-
System modelling and optimization-
Telecommunications-
Time series prediction-
Warning systems-
Virtual reality-
Web intelligence-
Deep learning