DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.

IF 2.9 3区 生物学 Q2 BIOCHEMICAL RESEARCH METHODS BMC Bioinformatics Pub Date : 2024-10-14 DOI:10.1186/s12859-024-05955-8
Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin
{"title":"DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.","authors":"Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin","doi":"10.1186/s12859-024-05955-8","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.</p><p><strong>Results: </strong>DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model using the previously acquired model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. For convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, clearly establishing its superiority over the state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.</p><p><strong>Conclusions: </strong>DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9000,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476100/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-024-05955-8","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.

Results: DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model using the previously acquired model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. For convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, clearly establishing its superiority over the state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.

Conclusions: DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
DNASimCLR:基于对比学习的基因序列数据分类深度学习方法。
背景:深度神经网络模型的快速发展大大提高了从微生物序列数据中提取特征的能力,这对于解决生物学难题至关重要。然而,标注微生物数据的稀缺性和复杂性给监督学习方法带来了巨大困难。为了解决这些问题,我们提出了 DNASimCLR,这是一种无监督框架,旨在高效提取基因序列数据特征:DNASimCLR 利用卷积神经网络和基于对比学习的 SimCLR 框架,从不同的微生物基因序列中提取复杂的特征。预训练在两个经典的大规模无标签数据集上进行,包括元基因组和病毒基因序列。随后的分类任务是利用之前获得的模型对预训练模型进行微调。我们的实验证明,DNASimCLR 在基因序列分类方面至少可以与最先进的技术相媲美。对于基于卷积神经网络的方法,DNASimCLR 超越了现有的最新方法,明确确立了其优于最先进的基于 CNN 的特征提取技术的地位。此外,该模型在分析生物序列数据的各种任务中表现出卓越的性能,展示了其强大的适应性:DNASimCLR 是一种用于基因序列分类的稳健且与数据库无关的解决方案。它的多功能性使其在涉及新基因序列或以前未见过的基因序列的情况下表现出色,成为基因组学中各种应用的重要工具。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
BMC Bioinformatics
BMC Bioinformatics 生物-生化研究方法
CiteScore
5.70
自引率
3.30%
发文量
506
审稿时长
4.3 months
期刊介绍: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.
期刊最新文献
Rare copy number variant analysis in case-control studies using snp array data: a scalable and automated data analysis pipeline. Mining contextually meaningful subgraphs from a vertex-attributed graph. Robust double machine learning model with application to omics data. A mapping-free natural language processing-based technique for sequence search in nanopore long-reads. Closha 2.0: a bio-workflow design system for massive genome data analysis on high performance cluster infrastructure.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1