High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

Genomics and Informatics (Q2, Agricultural and Biological Sciences) | Pub Date: 2021-12-01 | Epub Date: 2021-12-31 | DOI: 10.5808/gi.21056
Anas Oujja, Mohamed Riduan Abid, Jaouad Boumhidi, Safae Bourhnane, Asmaa Mourhir, Fatima Merchant, Driss Benhaddou
Cite as: Anas Oujja, Mohamed Riduan Abid, Jaouad Boumhidi, Safae Bourhnane, Asmaa Mourhir, Fatima Merchant, Driss Benhaddou. High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach. Genomics and Informatics, 2021, 19(4): e49. https://doi.org/10.5808/gi.21056
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8752974/pdf/
Citations: 0

Abstract

Genomic data is one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for adequate high-performance computing (HPC) platforms to process it. Given the recent unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community urgently needs ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering), and thus to help track virus mutations and predict future ones. In this paper, we present an HPC-based tool for clustering SARS-CoV-2 RNAs. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
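
To make the analysis step more concrete, the following is a minimal single-machine sketch in Python of the pairwise LCS-length computation the abstract describes. It is not the authors' Hadoop implementation. The function names (`lcs_length`, `dissimilarity`, `map_pairs`), the sequence IDs, and the normalization `1 - LCS / length of the longer sequence` are all assumptions made for illustration; the abstract says only that LCS lengths are used to measure dissimilarity, without giving a formula.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, keeping only two rows so memory
    is proportional to the shorter sequence.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer one, store rows for the shorter
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Assumed normalization (not from the paper): 1 - LCS / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def map_pairs(sequences: dict):
    """Map-style step: emit ((id_a, id_b), LCS length) for each unordered pair.

    In a MapReduce setting each pair could be handled by a separate
    worker; here the pairs are simply generated in a loop.
    """
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        yield (id_a, id_b), lcs_length(seq_a, seq_b)


if __name__ == "__main__":
    # Hypothetical toy sequences; real SARS-CoV-2 genomes are roughly 30,000 bases.
    toy = {"s1": "ACGU", "s2": "ACGA", "s3": "UUUU"}
    for pair, length in map_pairs(toy):
        print(pair, length)
```

Because every pair is independent, the number of LCS computations grows quadratically with the number of sequences, which is the kind of workload that benefits from being spread across worker nodes. The resulting dissimilarity values could then be passed to any clustering method that accepts a precomputed distance matrix.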

Source journal: Genomics and Informatics (Agricultural and Biological Sciences: Ecology, Evolution, Behavior and Systematics)
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 0
Review time: 12 weeks