AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome.

IF 1; Zone 4 (Biology); Q4 (BIOCHEMISTRY & MOLECULAR BIOLOGY); Genes & Genetic Systems; Pub Date: 2021-12-16; Epub Date: 2021-09-27; DOI: 10.1266/ggs.21-00025
Toshimichi Ikemura, Yuki Iwasaki, Kennosuke Wada, Yoshiko Wada, Takashi Abe
Citation count: 1

Abstract

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and artificial intelligence (AI) suited to big data analysis has become increasingly important. Unsupervised AI, which can extract novel knowledge from big data without prior knowledge or particular models, is highly desirable for genome sequence analysis, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain data mining with the BLSOM, an unsupervised AI. As a first target, we selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), because a large number of viral genome sequences have accumulated through worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4- to 6-mers) separated the sequences into known clades, while longer oligonucleotides further increased the separation ability and revealed subgroups within known clades. Most 15-mers occur only once in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, which showed a high separation ability among closely related species. In the human genome, we found enrichment of transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
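
To make the idea of a self-organizing map trained on oligonucleotide composition more concrete, the following is a minimal sketch, not the authors' BLSOM implementation. It computes tetranucleotide frequency vectors and trains a generic batch-update SOM on them. The function names, the 4x4 grid, the Gaussian neighborhood schedule, the initialization from randomly sampled data points, and the synthetic AT-rich and GC-rich test sequences are all assumptions made for illustration.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def best_matching_units(data, weights):
    """Index of the closest map node (Euclidean distance) for each sample."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def train_batch_som(data, rows=4, cols=4, epochs=20,
                    sigma0=2.0, sigma_end=0.5, seed=0):
    """Generic batch SOM: every epoch updates all nodes from all samples."""
    rng = np.random.default_rng(seed)
    n_nodes = rows * cols
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    # Assumption: initialize node weights from randomly chosen samples.
    weights = data[rng.integers(0, len(data), size=n_nodes)].copy()
    for epoch in range(epochs):
        # Neighborhood radius shrinks geometrically from sigma0 to sigma_end.
        sigma = sigma0 * (sigma_end / sigma0) ** (epoch / max(epochs - 1, 1))
        bmu = best_matching_units(data, weights)
        # h[i, j]: influence of sample i on node j via its best-matching unit.
        h = np.exp(-grid_d2[bmu] / (2.0 * sigma ** 2))
        denom = h.sum(axis=0)[:, None]
        updated = (h.T @ data) / np.maximum(denom, 1e-12)
        weights = np.where(denom > 1e-12, updated, weights)
    return weights


# Synthetic demo data (hypothetical): 10 AT-rich and 10 GC-rich sequences.
rng = np.random.default_rng(1)


def random_sequence(probs, length=2000):
    """Draw a random sequence with base probabilities in the order A, C, G, T."""
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))


at_rich = [random_sequence([0.4, 0.1, 0.1, 0.4]) for _ in range(10)]
gc_rich = [random_sequence([0.1, 0.4, 0.4, 0.1]) for _ in range(10)]
data = np.array([kmer_frequencies(s) for s in at_rich + gc_rich])

weights = train_batch_som(data)
bmus = best_matching_units(data, weights)
print("AT-rich nodes:", sorted(set(bmus[:10].tolist())))
print("GC-rich nodes:", sorted(set(bmus[10:].tolist())))
```

In this toy setting the two composition classes are expected to land on different map nodes, which is the kind of separation the abstract describes at a much larger scale. The distance computation here builds a samples-by-nodes-by-features array, so it is only suitable for small inputs.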

Source journal: Genes & Genetic Systems (Biology: Biochemistry & Molecular Biology)
CiteScore: 1.50
Self-citation rate: 0.00%
Articles published: 22
Review time: >12 weeks
About the journal: Genes & Genetic Systems, formerly the Japanese Journal of Genetics, is published bimonthly by the Genetics Society of Japan.