An efficient comparative machine learning-based metagenomics binning technique via using Random forest

2013 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA)
Pub Date: 2013-07-15
DOI: 10.1109/CIVEMSA.2013.6617419

Helal Saghir, D. Megherbi


Citations: 8

Abstract




Metagenomics is the study of microorganisms collected directly from natural environments. Metagenomic studies use DNA fragments obtained directly from a natural environment by whole genome shotgun (WGS) sequencing. Sorting the random fragments produced by whole genome shotgun sequencing into taxa-based groups is known as binning. Currently, there are two classes of binning methods: sequence similarity methods and sequence composition methods. Sequence similarity methods are usually based on aligning sequences to known genomes, using tools such as BLAST or MEGAN. Because only a very small fraction of species is available in current databases, similarity methods do not yield good results, and as a database of organisms grows, the complexity of the search grows with it. Sequence composition methods are based on compositional features of a given DNA sequence, such as k-mers or other genomic signatures. Most current binning methods have two major issues: they do not work well with short sequences or with closely related genomes. In this paper we propose new machine learning related predictive DNA sequence feature selection algorithms to solve binning problems more accurately and efficiently. We use oligonucleotide frequencies from 2-mers to 4-mers as features to differentiate between sequences: 2-mers produce 16 features, 3-mers produce 64 features, and 4-mers produce 256 features. We did not use features beyond 4-mers because the number of features increases exponentially with k; 5-mers would already require 1024 features. We found that 4-mers produce better results than 2-mers and 3-mers. The data sets used in this work have average sequence lengths of 250, 500, 1000, and 2000 base pairs. Experimental results are presented to show the potential value of the proposed methods. The accuracy of the proposed algorithm is tested on a variety of data sets; the classification/prediction accuracy achieved is between 78% and 99% on various simulated data sets using the Random forest classifier, and between 37% and 95% using the Naïve Bayes classifier. The Random forest classifier outperformed Naïve Bayes on every data set.
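The oligonucleotide-frequency features described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `kmer_features`, the lexicographic ordering of k-mers, the normalization to relative frequencies, and the decision to skip windows containing non-ACGT characters are all assumptions made for illustration. It shows why a k-mer feature vector over the four-letter DNA alphabet has 4^k entries.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # the four DNA bases


def kmer_features(seq, k):
    """Return relative k-mer frequencies of seq as a list of length 4**k.

    Assumptions (not taken from the paper): k-mers are ordered
    lexicographically, counts are normalized to sum to 1, and any
    window containing a character outside ACGT is ignored.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    # Feature counts for k = 2..5, matching the figures in the abstract.
    print([4 ** k for k in (2, 3, 4, 5)])  # [16, 64, 256, 1024]
    # A hypothetical short read, used only to show the vector length.
    print(len(kmer_features("ACGTACGTTAGC", 4)))  # 256
```

Each read then becomes a fixed-length numeric vector regardless of its length in base pairs, which is what lets a general-purpose classifier such as Random forest or Naïve Bayes be trained on reads of 250 to 2000 base pairs alike.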
