Aditi R Durge, Deepti D Shrimankar, Ankush D Sawarkar
{"title":"高效预测基因组序列处理模型的启发式分析:统计学视角。","authors":"Aditi R Durge, Deepti D Shrimankar, Ankush D Sawarkar","doi":"10.2174/1389202923666220927105311","DOIUrl":null,"url":null,"abstract":"<p><p>Genome sequences indicate a wide variety of characteristics, which include species and sub-species type, genotype, diseases, growth indicators, yield quality, <i>etc</i>. To analyze and study the characteristics of the genome sequences across different species, various deep learning models have been proposed by researchers, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Multilayer Perceptrons (MLPs), <i>etc</i>., which vary in terms of evaluation performance, area of application and species that are processed. Due to a wide differentiation between the algorithmic implementations, it becomes difficult for research programmers to select the best possible genome processing model for their application. In order to facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision and recall. Thus, in the present review, various deep learning and machine learning models have been presented that possess different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) outputs 99.7% of accuracy, and for cancer genomic data, it exhibits 99.27% of accuracy using the CNN Bayesian method. Whereas for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy of 99.95%. A similar analysis of precision and recall of different models has been reviewed. Finally, this paper concludes with some interesting observations related to the genomic processing models and recommends applications for their efficient use.</p>","PeriodicalId":10803,"journal":{"name":"Current Genomics","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2022-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/f3/6a/CG-23-299.PMC9878859.pdf","citationCount":"1","resultStr":"{\"title\":\"Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective.\",\"authors\":\"Aditi R Durge, Deepti D Shrimankar, Ankush D Sawarkar\",\"doi\":\"10.2174/1389202923666220927105311\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Genome sequences indicate a wide variety of characteristics, which include species and sub-species type, genotype, diseases, growth indicators, yield quality, <i>etc</i>. To analyze and study the characteristics of the genome sequences across different species, various deep learning models have been proposed by researchers, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Multilayer Perceptrons (MLPs), <i>etc</i>., which vary in terms of evaluation performance, area of application and species that are processed. Due to a wide differentiation between the algorithmic implementations, it becomes difficult for research programmers to select the best possible genome processing model for their application. In order to facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision and recall. Thus, in the present review, various deep learning and machine learning models have been presented that possess different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) outputs 99.7% of accuracy, and for cancer genomic data, it exhibits 99.27% of accuracy using the CNN Bayesian method. Whereas for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy of 99.95%. A similar analysis of precision and recall of different models has been reviewed. Finally, this paper concludes with some interesting observations related to the genomic processing models and recommends applications for their efficient use.</p>\",\"PeriodicalId\":10803,\"journal\":{\"name\":\"Current Genomics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2022-11-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/f3/6a/CG-23-299.PMC9878859.pdf\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Current Genomics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.2174/1389202923666220927105311\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.2174/1389202923666220927105311","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 1
摘要
基因组序列显示了各种各样的特征,包括种和亚种类型、基因型、疾病、生长指标、产量质量等。为了分析和研究不同物种基因组序列的特征,研究者们提出了各种深度学习模型,如卷积神经网络(cnn)、深度信念网络(DBNs)、多层感知器(MLPs)等,这些模型在评估性能、应用领域和处理物种方面都有所不同。由于算法实现之间的广泛差异,研究编程人员很难为其应用选择最佳的基因组处理模型。为了便于选择,本文回顾了各种各样的模型,并从准确性、应用领域、计算复杂性、处理延迟、精度和召回率等方面对它们的性能进行了比较。因此,在目前的回顾中,已经提出了各种深度学习和机器学习模型,这些模型对于不同的应用具有不同的精度。对于多基因组数据,使用支持向量机重复增量修剪产生误差减少(Ripper SVM)的准确率为99.7%,对于癌症基因组数据,使用CNN贝叶斯方法的准确率为99.27%。而对于新冠病毒基因组分析,Bidirectional Long - short Memory with CNN (BiLSTM CNN)的准确率最高,为99.95%。对不同型号的精度和召回率进行了类似的分析。最后,本文总结了与基因组加工模型相关的一些有趣的观察结果,并对其有效利用的应用提出了建议。
Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective.
Genome sequences indicate a wide variety of characteristics, which include species and sub-species type, genotype, diseases, growth indicators, yield quality, etc. To analyze and study the characteristics of the genome sequences across different species, various deep learning models have been proposed by researchers, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Multilayer Perceptrons (MLPs), etc., which vary in terms of evaluation performance, area of application and species that are processed. Due to a wide differentiation between the algorithmic implementations, it becomes difficult for research programmers to select the best possible genome processing model for their application. In order to facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision and recall. Thus, in the present review, various deep learning and machine learning models have been presented that possess different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) outputs 99.7% of accuracy, and for cancer genomic data, it exhibits 99.27% of accuracy using the CNN Bayesian method. Whereas for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy of 99.95%. A similar analysis of precision and recall of different models has been reviewed. Finally, this paper concludes with some interesting observations related to the genomic processing models and recommends applications for their efficient use.
期刊介绍:
Current Genomics is a peer-reviewed journal that provides essential reading about the latest and most important developments in genome science and related fields of research. Systems biology, systems modeling, machine learning, network inference, bioinformatics, computational biology, epigenetics, single cell genomics, extracellular vesicles, quantitative biology, and synthetic biology for the study of evolution, development, maintenance, aging and that of human health, human diseases, clinical genomics and precision medicine are topics of particular interest. The journal covers plant genomics. The journal will not consider articles dealing with breeding and livestock.
Current Genomics publishes three types of articles including:
i) Research papers from internationally-recognized experts reporting on new and original data generated at the genome scale level. Position papers dealing with new or challenging methodological approaches, whether experimental or mathematical, are greatly welcome in this section.
ii) Authoritative and comprehensive full-length or mini reviews from widely recognized experts, covering the latest developments in genome science and related fields of research such as systems biology, statistics and machine learning, quantitative biology, and precision medicine. Proposals for mini-hot topics (2-3 review papers) and full hot topics (6-8 review papers) guest edited by internationally-recognized experts are welcome in this section. Hot topic proposals should not contain original data and they should contain articles originating from at least 2 different countries.
iii) Opinion papers from internationally recognized experts addressing contemporary questions and issues in the field of genome science and systems biology and basic and clinical research practices.