Authors: J. C. Garbelini, D. Sanches, A. Pozo
Published in: 2022 IEEE Congress on Evolutionary Computation (CEC), 2022-07-18
DOI: 10.1109/CEC55065.2022.9870303 (https://doi.org/10.1109/CEC55065.2022.9870303)
Citations: 3
Expectation Maximization based algorithm applied to DNA sequence motif finder
Finding transcription factor binding sites plays an important role in bioinformatics. Correctly identifying them in the promoter regions of co-expressed genes is a crucial step toward understanding gene expression mechanisms and developing new drugs and vaccines. The motif finding problem consists of seeking conserved patterns in datasets of biological sequences using unsupervised learning algorithms. It is considered one of the open problems of computational biology, and its simplest formulation has been proven to be NP-hard. Heuristic and metaheuristic algorithms have been shown to be very promising for combinatorial problems with very large search spaces. In this paper we propose a new algorithm called Biomapp (Biological Motif Application), based on canonical Expectation Maximization, that uses the Kullback-Leibler divergence to re-estimate the parameters of the statistical model. The algorithm is embedded in an Iterated Local Search as the local search step, and a hierarchical perturbation operator is used to escape from local optima. The results obtained with this approach were compared with those of the state-of-the-art algorithm MEME (Multiple EM Motif Elicitation), showing that Biomapp outperformed this classical technique on several datasets.
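The general idea of using Expectation Maximization as the local search step of an Iterated Local Search can be illustrated with a minimal sketch. This is not the Biomapp implementation: the abstract does not give its details, so everything below is an assumption made for illustration. The sketch assumes exactly one motif occurrence per sequence, a uniform background, and a position weight matrix (PWM) as the statistical model. The Kullback-Leibler divergence is used here only as an acceptance score, which may differ from how the paper uses it, and `perturb` is a simple single-column stand-in for the paper's hierarchical perturbation operator. All function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def em_step(seqs, pwm, bg):
    """One EM iteration, assuming exactly one motif occurrence per sequence."""
    w = len(pwm)
    counts = [[0.5] * 4 for _ in range(w)]  # pseudocounts keep probabilities > 0
    for s in seqs:
        # E-step: weight of each start offset = motif/background likelihood ratio.
        weights = []
        for i in range(len(s) - w + 1):
            ratio = 1.0
            for j in range(w):
                k = ALPHABET.index(s[i + j])
                ratio *= pwm[j][k] / bg[k]
            weights.append(ratio)
        total = sum(weights)
        # M-step (accumulation): expected letter counts for each motif column.
        for i, wt in enumerate(weights):
            for j in range(w):
                counts[j][ALPHABET.index(s[i + j])] += wt / total
    # M-step (normalization): turn expected counts into column probabilities.
    return [[c / sum(col) for c in col] for col in counts]


def kl_score(pwm, bg):
    """Sum over columns of KL(column || background), in bits."""
    return sum(p * math.log2(p / bg[k])
               for col in pwm for k, p in enumerate(col) if p > 0)


def random_column(rng):
    """A random, strictly positive probability column over A, C, G, T."""
    raw = [rng.random() + 0.1 for _ in range(4)]
    return [r / sum(raw) for r in raw]


def perturb(pwm, rng):
    """Stand-in perturbation (assumption): re-randomize one motif column."""
    new = [col[:] for col in pwm]
    new[rng.randrange(len(new))] = random_column(rng)
    return new


def local_search(seqs, pwm, bg, iters=30):
    """Local search step: a fixed number of EM iterations."""
    for _ in range(iters):
        pwm = em_step(seqs, pwm, bg)
    return pwm


def iterated_local_search(seqs, w, restarts=10, seed=0):
    """Perturb the best PWM, re-run EM, and keep the result if the score improves."""
    rng = random.Random(seed)
    bg = [0.25] * 4  # uniform background (assumption)
    best = local_search(seqs, [random_column(rng) for _ in range(w)], bg)
    best_score = kl_score(best, bg)
    for _ in range(restarts):
        cand = local_search(seqs, perturb(best, rng), bg)
        cand_score = kl_score(cand, bg)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best, best_score


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    toy = ["ACGTACGTTGACA", "TTTGACATTCCGG", "GGCATGACAGCTA", "CATGACTGACAAA"]
    pwm, score = iterated_local_search(toy, w=5)
    consensus = "".join(ALPHABET[col.index(max(col))] for col in pwm)
    print(consensus, round(score, 3))
```

The sketch accepts a perturbed solution only when its score strictly improves, which is the simplest acceptance rule for an Iterated Local Search. A likelihood-based score would be an equally reasonable choice for this step.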