AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome.

Genes & genetic systems. Pub Date: 2021-12-16. Epub Date: 2021-09-27. DOI: 10.1266/ggs.21-00025. IF 1.2; JCR Q4 (Biochemistry & Molecular Biology); Region 4 (Biology).

Toshimichi Ikemura, Yuki Iwasaki, Kennosuke Wada, Yoshiko Wada, Takashi Abe

Genes & genetic systems, volume 96, issue 4, pages 165-176. Journal Article.

Citations: 1

Abstract

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and artificial intelligence (AI) suited to big data analysis has become increasingly important. Unsupervised AI, which can reveal novel knowledge from big data without prior knowledge or particular models, is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain data mining with the BLSOM, an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4- to 6-mers) separated the sequences into known clades, and longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. Most 15-mers occur only once in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, which showed a high separation ability among closely related species. When analyzing the human genome, we found enrichment of transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
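
The abstract refers to oligonucleotide compositions of fixed-length fragments and to 15-mers that appear only after mutation. The following is a minimal sketch of those two ideas in Python, not the authors' implementation: the function names `kmer_composition`, `fragments`, and `novel_kmers` are hypothetical, the handling of ambiguous bases (windows containing anything other than A, C, G, T are skipped) is an assumption, and the self-organizing map step itself is not shown.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=4):
    """Return the frequencies of all 4**k k-mers, in lexicographic ACGT order.

    Assumption: windows containing characters other than A, C, G, T
    (for example N) are ignored when normalizing.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def fragments(genome, size=5000):
    """Split a sequence into non-overlapping fragments; a short tail is dropped."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


def novel_kmers(reference, sample, k=15):
    """Return k-mers present in `sample` but absent from `reference`.

    When most k-mers are unique in a genome, a single substitution
    typically creates up to k such k-mers around the changed position.
    """
    ref = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    sam = {sample[i:i + k] for i in range(len(sample) - k + 1)}
    return sam - ref


if __name__ == "__main__":
    # Toy sequences, purely illustrative.
    vec = kmer_composition("ACGTACGT", k=2)
    print(len(vec))
    print(sorted(novel_kmers("AAAACCCCGGGG", "AAAACCTCGGGG", k=4)))
```

In a workflow like the one described, each fragment's composition vector (256 values for tetranucleotides) would serve as one input to the map, while the set difference gives a rough way to tie newly appearing 15-mers to a mutation site.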


