Expectation Maximization based algorithm applied to DNA sequence motif finder
Authors: J. C. Garbelini, D. Sanches, A. Pozo
Published in: 2022 IEEE Congress on Evolutionary Computation (CEC)
Publication date: 2022-07-18
DOI: 10.1109/CEC55065.2022.9870303
URL: https://doi.org/10.1109/CEC55065.2022.9870303
Citations: 3
Abstract
Finding transcription factor binding sites plays an important role in bioinformatics. Their correct identification in the promoter regions of co-expressed genes is a crucial step toward understanding gene expression mechanisms and creating new drugs and vaccines. The motif finding problem consists of seeking conserved patterns in datasets of biological sequences using unsupervised learning algorithms. It is considered one of the open problems of computational biology, and its simplest formulation has been proven to be NP-hard. Heuristic and metaheuristic algorithms have been shown to be very promising for solving combinatorial problems with very large search spaces. In this paper we propose a new algorithm called Biomapp (Biological Motif Application), based on canonical Expectation Maximization, that uses the Kullback-Leibler divergence to re-estimate the parameters of the statistical model. The algorithm is embedded in an Iterated Local Search as the local search step, and a hierarchical perturbation operator is used to escape from local optima. The results obtained by this new approach were compared with those of the state-of-the-art algorithm MEME (Multiple EM for Motif Elicitation), showing that Biomapp outperformed this classical technique on several datasets.
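To make the ingredients named in the abstract concrete, the following is a minimal sketch of canonical EM for a position weight matrix, used as the local search step inside a simple Iterated Local Search. It is not the Biomapp implementation: the abstract does not specify how the Kullback-Leibler divergence enters the re-estimation or how the hierarchical perturbation works. The sketch assumes exactly one motif site per sequence, a uniform background, sequences over `ACGT` that are at least `width` long, the KL divergence only as the acceptance score, and a hypothetical `perturb` function that blends the current matrix with random noise. All function names are illustrative.

```python
import math
import random

ALPHABET = "ACGT"

def init_pwm(width, rng):
    """Random position weight matrix: one probability dict per motif column."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in ALPHABET]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(ALPHABET, weights)})
    return pwm

def e_step(seqs, pwm, background):
    """Posterior over motif start positions, assuming one site per sequence."""
    width = len(pwm)
    posteriors = []
    for seq in seqs:
        ratios = []
        for start in range(len(seq) - width + 1):
            # Likelihood ratio of this window under the motif vs. the background.
            ratio = 1.0
            for j in range(width):
                base = seq[start + j]
                ratio *= pwm[j][base] / background[base]
            ratios.append(ratio)
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors

def m_step(seqs, posteriors, width, pseudocount=0.1):
    """Re-estimate the matrix from posterior-weighted base counts."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, p in enumerate(post):
            for j in range(width):
                counts[j][seq[start + j]] += p
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def kl_divergence(pwm, background):
    """Sum over columns of KL(column || background), in bits."""
    return sum(p * math.log2(p / background[b])
               for col in pwm for b, p in col.items() if p > 0)

def em_local_search(seqs, pwm, background, iterations=30):
    """Run a fixed number of EM iterations starting from the given matrix."""
    for _ in range(iterations):
        pwm = m_step(seqs, e_step(seqs, pwm, background), len(pwm))
    return pwm

def perturb(pwm, rng, strength=0.5):
    """Hypothetical perturbation: blend each column with a random distribution."""
    noise = init_pwm(len(pwm), rng)
    return [{b: (1 - strength) * col[b] + strength * n[b] for b in ALPHABET}
            for col, n in zip(pwm, noise)]

def iterated_local_search(seqs, width, restarts=10, seed=0):
    """Perturb the best matrix, re-run EM, and keep the result if the score improves."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    best = em_local_search(seqs, init_pwm(width, rng), background)
    best_score = kl_divergence(best, background)
    for _ in range(restarts):
        candidate = em_local_search(seqs, perturb(best, rng), background)
        score = kl_divergence(candidate, background)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy input: three short sequences that each contain the hexamer TATAAT.
toy_seqs = ["ACGTTATAATGC", "GGTATAATCCGA", "TTTATAATACGT"]
toy_pwm, toy_score = iterated_local_search(toy_seqs, width=6, restarts=5)
```

The score returned here lies between 0 and 2 bits per column, since each column is compared against a uniform four-letter background. EM only guarantees convergence to a local optimum, which is why the outer loop perturbs and restarts; whether it recovers a planted motif on a given dataset depends on the initialization and the perturbation strength.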