Incremental and Semi-Supervised Learning of 16S-rRNA Genes For Taxonomic Classification

2021 IEEE Symposium Series on Computational Intelligence (SSCI). Pub Date: 2021-12-05. DOI: 10.1109/SSCI50451.2021.9660093

Emrecan Ozdogan, Norman C. Sabin, Thomas Gracie, Steven Portley, Mali Halac, Thomas Coard, William Trimble, B. Sokhansanj, G. Rosen, R. Polikar


Citations: 2

Abstract




Genome sequencing generates large volumes of data and hence requires ever-greater computational resources. The growing data problem is even more acute in metagenomics applications, where data from an environmental sample include many organisms rather than the single organism of conventional single-organism sequencing. Traditional taxonomic classification and clustering approaches and platforms, while designed to be computationally efficient, cannot incrementally update a previously trained system when new data arrive; instead, they require complete retraining on the augmented (old plus new) data. Such complete retraining is inefficient and leads to poor utilization of computational resources. The ability to update a classification system with only the new data offers a much lower run time as new data are presented and does not require retraining on the entire previous dataset. In this paper, we propose Incremental VSEARCH (I-VSEARCH) and its semi-supervised version for taxonomic classification, as well as a threshold-independent VSEARCH (TI-VSEARCH), as wrappers around VSEARCH, a well-established (unsupervised) clustering algorithm for metagenomics. We show, on a 16S rRNA gene dataset, that I-VSEARCH, running incrementally only on the new batches of data that become available over time, loses no accuracy relative to VSEARCH run on the full data, while providing attractive computational benefits.
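The abstract does not spell out how I-VSEARCH works, so the following is only a toy sketch of the general idea of incremental, semi-supervised centroid clustering, not the authors' method and not VSEARCH itself. It assumes a greedy scheme in which each new sequence is compared only against stored cluster centroids, so earlier batches never need to be re-clustered. The names `identity`, `IncrementalClusterer`, `update`, and `threshold` are hypothetical, and the position-wise identity score is a crude stand-in for a real alignment-based identity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions, with no alignment.

    This is an illustrative assumption; a real tool would align the sequences.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


@dataclass
class IncrementalClusterer:
    """Hypothetical greedy centroid clusterer that is updated batch by batch."""

    threshold: float = 0.9
    centroids: List[str] = field(default_factory=list)
    labels: List[Optional[str]] = field(default_factory=list)

    def update(self, batch: List[Tuple[str, Optional[str]]]) -> List[int]:
        """Cluster a new batch of (sequence, label_or_None) pairs.

        Only the stored centroids are consulted, never the earlier sequences.
        Returns the cluster index assigned to each sequence in the batch.
        """
        assigned = []
        for seq, label in batch:
            # Find the most similar existing centroid, if any.
            best_id, best_score = -1, 0.0
            for cid, centroid in enumerate(self.centroids):
                score = identity(seq, centroid)
                if score > best_score:
                    best_id, best_score = cid, score

            if best_score >= self.threshold:
                # Semi-supervised step: a labeled sequence names a cluster
                # that has no label yet.
                if self.labels[best_id] is None and label is not None:
                    self.labels[best_id] = label
            else:
                # No centroid is close enough: the sequence seeds a new cluster.
                self.centroids.append(seq)
                self.labels.append(label)
                best_id = len(self.centroids) - 1
            assigned.append(best_id)
        return assigned


if __name__ == "__main__":
    clusterer = IncrementalClusterer(threshold=0.9)
    # First batch: one labeled sequence, two unlabeled ones.
    print(clusterer.update([("ACGTACGTAC", "Genus_A"),
                            ("ACGTACGTAA", None),
                            ("TTTTGGGGCC", None)]))
    # Later batch: processed against the two stored centroids only.
    print(clusterer.update([("TTTTGGGGCC", "Genus_B"),
                            ("ACGAACGTAC", None)]))
    print(clusterer.labels)
```

In this sketch the cost of each update grows with the number of stored centroids rather than with the total number of sequences seen so far, which is the kind of saving the abstract attributes to incremental updating. Whether accuracy matches a full re-clustering depends on the real algorithm and data, and this toy makes no such claim.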

