Incremental and Semi-Supervised Learning of 16S-rRNA Genes For Taxonomic Classification

Emrecan Ozdogan, Norman C. Sabin, Thomas Gracie, Steven Portley, Mali Halac, Thomas Coard, William Trimble, B. Sokhansanj, G. Rosen, R. Polikar
2021 IEEE Symposium Series on Computational Intelligence (SSCI), published 2021-12-05
DOI: 10.1109/SSCI50451.2021.9660093 (https://doi.org/10.1109/SSCI50451.2021.9660093)
Citations: 2

Abstract

Genome sequencing generates large volumes of data and therefore demands ever greater computational resources. The problem is more acute in metagenomics, where data from an environmental sample cover many organisms rather than the single organism of conventional sequencing. Traditional taxonomic classification and clustering approaches and platforms, although designed to be computationally efficient, cannot incrementally update a previously trained system when new data arrive; they must instead be completely retrained on the augmented (old plus new) data. Such complete retraining is inefficient and uses computational resources poorly. Updating a classification system with only the new data yields a much lower run time as new data are presented and avoids retraining on the entire previous dataset. In this paper, we propose Incremental VSEARCH (I-VSEARCH) and its semi-supervised version for taxonomic classification, as well as a threshold-independent VSEARCH (TI-VSEARCH), all as wrappers around VSEARCH, a well-established (unsupervised) clustering algorithm for metagenomics. We show, on a 16S rRNA gene dataset, that I-VSEARCH, running incrementally only on the new batches of data that become available over time, loses no accuracy relative to VSEARCH run on the full data, while providing attractive computational benefits.
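
The idea of updating clusters from new batches alone can be illustrated with a minimal sketch. This is not the authors' I-VSEARCH and does not call VSEARCH; it is a generic greedy centroid-based scheme written under stated assumptions. The names `identity` and `update_clusters` are hypothetical, the 0.8 threshold is arbitrary, and `difflib.SequenceMatcher` stands in for the alignment-based identity a real tool would compute.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumed stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def update_clusters(centroids, clusters, batch, threshold=0.8):
    """Fold a new batch into existing clusters without revisiting old members.

    Each new sequence is compared only against the stored centroids. It joins
    the best-scoring centroid at or above `threshold`, or founds a new cluster.
    Both lists are modified in place and also returned.
    """
    for seq in batch:
        best_idx, best_score = None, threshold
        for i, centroid in enumerate(centroids):
            score = identity(seq, centroid)
            if score >= best_score:
                best_idx, best_score = i, score
        if best_idx is None:
            centroids.append(seq)      # sequence becomes a new centroid
            clusters.append([seq])
        else:
            clusters[best_idx].append(seq)
    return centroids, clusters


# Two batches arriving over time; only centroids carry over between calls.
centroids, clusters = [], []
update_clusters(centroids, clusters, ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"])
update_clusters(centroids, clusters, ["TTTTGGGGCA", "ACGTACGTAC"])
print([len(c) for c in clusters])
```

In this toy, the second call compares each new sequence against two centroids rather than against every earlier sequence, which is where the run-time saving of an incremental update would come from. Whether accuracy is preserved in practice is an empirical question that the sketch does not address.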