基于spark的宏基因组序列聚类标记扩散和标记选择社区检测算法

Zhengjiang Wu, Xuyang Wu, Junwei Luo
{"title":"基于spark的宏基因组序列聚类标记扩散和标记选择社区检测算法","authors":"Zhengjiang Wu, Xuyang Wu, Junwei Luo","doi":"10.1007/s44196-023-00348-w","DOIUrl":null,"url":null,"abstract":"Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.","PeriodicalId":54967,"journal":{"name":"International Journal of Computational Intelligence Systems","volume":"233 1","pages":"0"},"PeriodicalIF":2.9000,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spark-Based Label Diffusion and Label Selection Community Detection Algorithm for Metagenome Sequence Clustering\",\"authors\":\"Zhengjiang Wu, Xuyang Wu, Junwei Luo\",\"doi\":\"10.1007/s44196-023-00348-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.\",\"PeriodicalId\":54967,\"journal\":{\"name\":\"International Journal of Computational Intelligence Systems\",\"volume\":\"233 1\",\"pages\":\"0\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2023-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computational Intelligence Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s44196-023-00348-w\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computational Intelligence Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s44196-023-00348-w","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

在宏基因组学中,如何收集海量的宏基因组数据是一个挑战。通常,宏基因组在组装前的聚类序列加快了整个过程。在SpaRC中,序列被定义为节点,并通过并行标签传播算法(LPA)聚类。为了解决并行LPA在聚类过程中标签选择的随机性,提高宏基因组序列聚类的完整性,本文提出了基于spark的并行标签扩散和标签选择社区检测算法,以获得更准确的聚类结果。本文根据Jaccard相似系数及其程度来定义序列的重要度。核心序列被定义为在其所在社区中最重要的序列。制定了三种策略来降低标签选择的随机性。首先,核心序列标签在其定位的聚类上扩散,并成为其他序列的初始标签。那些没有收到初始标签的序列将在邻居序列中选择最重要的序列标签。其次,我们按照标签频率和序列重要性的顺序进行改进的标签传播,以减少标签选择的随机性。最后,增加了合并小社区的步骤,以提高群集的完整性。实验结果表明,本文提出的算法能够有效降低标签选择的随机性,提高标记的纯度、完整性和F-Measure,缩短宏基因组序列聚类的运行时间。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Spark-Based Label Diffusion and Label Selection Community Detection Algorithm for Metagenome Sequence Clustering
Abstract It is a challenge to assemble an enormous amount of metagenome data in metagenomics. Usually, metagenome cluster sequence before assembly accelerates the whole process. In SpaRC, sequences are defined as nodes and clustered by a parallel label propagation algorithm (LPA). To address the randomness of label selection from the parallel LPA during clustering and improve the completeness of metagenome sequence clustering, Spark-based parallel label diffusion and label selection community detection algorithm is proposed in the paper to obtain more accurate clustering results. In this paper, the importance of sequence is defined based on the Jaccard similarity coefficient and its degree. The core sequence is defined as the one with the largest importance in its located community. Three strategies are formulated to reduce the randomness of label selection. Firstly, the core sequence label diffuses over its located cluster and becomes the initial label of other sequences. Those sequences that do not receive an initial label will select the sequence label with the highest importance in the neighbor sequences. Secondly, we perform improved label propagation in order of label frequency and sequence importance to reduce the randomness of label selection. Finally, a merge small communities step is added to increase the completeness of clustered clusters. The experimental results show that our proposed algorithm can effectively reduce the randomness of label selection, improve the purity, completeness, and F-Measure and reduce the runtime of metagenome sequence clustering.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
International Journal of Computational Intelligence Systems
International Journal of Computational Intelligence Systems 工程技术-计算机:跨学科应用
自引率
3.40%
发文量
94
期刊介绍: The International Journal of Computational Intelligence Systems publishes original research on all aspects of applied computational intelligence, especially targeting papers demonstrating the use of techniques and methods originating from computational intelligence theory. The core theories of computational intelligence are fuzzy logic, neural networks, evolutionary computation and probabilistic reasoning. The journal publishes only articles related to the use of computational intelligence and broadly covers the following topics: -Autonomous reasoning- Bio-informatics- Cloud computing- Condition monitoring- Data science- Data mining- Data visualization- Decision support systems- Fault diagnosis- Intelligent information retrieval- Human-machine interaction and interfaces- Image processing- Internet and networks- Noise analysis- Pattern recognition- Prediction systems- Power (nuclear) safety systems- Process and system control- Real-time systems- Risk analysis and safety-related issues- Robotics- Signal and image processing- IoT and smart environments- Systems integration- System control- System modelling and optimization- Telecommunications- Time series prediction- Warning systems- Virtual reality- Web intelligence- Deep learning
期刊最新文献
A LDA-Based Social Media Data Mining Framework for Plastic Circular Economy Identifying People’s Faces in Smart Banking Systems Using Artificial Neural Networks A GMEE-WFED System: Optimizing Wind Turbine Distribution for Enhanced Renewable Energy Generation in the Future Active Exploration Deep Reinforcement Learning for Continuous Action Space with Forward Prediction Optimized Convolutional Forest by Particle Swarm Optimizer for Pothole Detection
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1