Maxwell Sanderford, Sudip Sharma, Glen Stecher, Jun Liu, Jieping Ye, Sudhir Kumar
ArXiv. Published 2025-01-09. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760232/pdf/
MyESL: Sparse learning in molecular evolution and phylogenetic analysis.
Evolutionary sparse learning (ESL) uses a supervised machine learning approach, the Least Absolute Shrinkage and Selection Operator (LASSO), to build models that explain the relationship between a hypothesis and the variation across genomic features (e.g., sites) in sequence alignments. ESL imposes sparsity both between and within groups of genomic features (e.g., genomic loci or genes) by using sparse-group LASSO. Although some software packages are available for performing sparse-group LASSO, we found them less well suited to processing and analyzing genome-scale sequence data containing millions of features, such as bases. MyESL fills the need for open-source software for conducting ESL analyses, with facilities to pre-process the input hypotheses and large alignments, make LASSO flexible and computationally efficient, and post-process the output model to produce metrics useful in functional or evolutionary genomics. MyESL takes binary responses or phylogenetic trees as the regression response, processing them into class-balanced hypotheses as required. It also accepts continuous and binary features, as well as sequence alignments, which are transformed into a binary one-hot encoded feature matrix for analysis. The model outputs are processed into user-friendly text and graphical files. The computational core of MyESL, written in C++, offers model building with or without group sparsity, while the pre- and post-processing of inputs and model outputs is performed by customized functions written in Python. An application in phylogenomics showcases the utility of MyESL. Our analysis of empirical genome-scale datasets shows that MyESL can build evolutionary models quickly and efficiently on a personal desktop, whereas other computational packages could not because of their prohibitive requirements for computational resources and time. 
MyESL is available for Python environments on Linux and is distributed as a standalone application for both Windows and macOS, which can be integrated into third-party software and pipelines.
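
To make the two central ideas of the abstract concrete, the following is a minimal Python sketch of one-hot encoding a toy alignment and of a sparse-group LASSO penalty. It is not MyESL code: the function names `one_hot_encode` and `sparse_group_penalty`, the parameters `lam_site` and `lam_group`, the square-root group-size weights, and the toy sequences are all assumptions made for illustration, and the penalty follows the common textbook form rather than any formulation confirmed by the source.

```python
import numpy as np


def one_hot_encode(alignment):
    """Encode aligned sequences as a binary matrix.

    Rows are sequences; each column is one (site, residue) pair observed
    in the alignment. Filtering of invariant sites is deliberately left out.
    """
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    columns, labels = [], []
    for site in range(n_sites):
        # One column per distinct residue seen at this site.
        for residue in sorted({seq[site] for seq in alignment}):
            columns.append([1 if seq[site] == residue else 0 for seq in alignment])
            labels.append((site, residue))
    return np.array(columns, dtype=np.int8).T, labels


def sparse_group_penalty(beta, groups, lam_site, lam_group):
    """Sparse-group LASSO penalty (assumed textbook form).

    lam_site scales the L1 term (sparsity within groups); lam_group scales
    the sum of group-wise L2 norms (sparsity between groups), each weighted
    by the square root of the group size.
    """
    beta = np.asarray(beta, dtype=float)
    l1_term = np.abs(beta).sum()
    group_term = sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)
    return lam_site * l1_term + lam_group * group_term


# Hypothetical toy alignment: three sequences, four sites.
toy_alignment = ["ACGT", "ACGA", "TCGA"]
features, feature_labels = one_hot_encode(toy_alignment)

# Hypothetical coefficients split into two groups (e.g., two genes).
toy_beta = np.array([3.0, 4.0, 0.0])
toy_groups = [[0, 1], [2]]
penalty = sparse_group_penalty(toy_beta, toy_groups, lam_site=1.0, lam_group=1.0)
```

In this sketch, a group could correspond to a gene or locus and its columns to the one-hot encoded sites within it; the L1 term drives individual site coefficients to zero, while the group term can remove an entire gene from the model.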