High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

Journal: Genomics and Informatics (Q2, Agricultural and Biological Sciences) | Pub Date: 2021-12-01 | Epub Date: 2021-12-31 | DOI: 10.5808/gi.21056

Anas Oujja, Mohamed Riduan Abid, Jaouad Boumhidi, Safae Bourhnane, Asmaa Mourhir, Fatima Merchant, Driss Benhaddou

Genomics and Informatics, vol. 19, no. 4, e49. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8752974/pdf/

Citations: 0

Abstract


Genomic data is now one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for an adequate high-performance computing (HPC) platform to process it. Given the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community urgently needs ICT tools to process SARS-CoV-2 RNA data, e.g., by clustering it, and thus to assist in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
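To make the pairwise LCS step concrete, the following is a minimal single-machine sketch in Python, not the authors' Hadoop implementation. It computes the LCS length with the classic dynamic-programming recurrence and turns it into a dissimilarity score. The normalization `1 - LCS / max(len)`, the function names, and the toy sequences are all assumptions made for illustration; the abstract does not state how LCS lengths are converted into dissimilarities.

```python
from __future__ import annotations

from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    # Only the previous DP row is kept: O(len(a) * len(b)) time, O(len(b)) memory.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Assumed normalization (not from the paper): 1 - LCS / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """One independent computation per unordered pair of sequence IDs."""
    return {
        (id_a, id_b): dissimilarity(seqs[id_a], seqs[id_b])
        for id_a, id_b in combinations(sorted(seqs), 2)
    }


# Hypothetical toy fragments, not real SARS-CoV-2 data.
toy = {"s1": "AUGGCUA", "s2": "AUGCUA", "s3": "GGGCCC"}
for pair, score in pairwise_dissimilarities(toy).items():
    print(pair, round(score, 3))
```

Each pair is scored independently of every other pair, so the work can be split across workers without coordination, which is one plausible reason a MapReduce formulation fits this step. The resulting pairwise scores could then be handed to any distance-based clustering method.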


Source journal