MyESL: Sparse learning in molecular evolution and phylogenetic analysis.

ArXiv Pub Date : 2025-01-09
Maxwell Sanderford, Sudip Sharma, Glen Stecher, Jun Liu, Jieping Ye, Sudhir Kumar
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760232/pdf/

Abstract

Evolutionary sparse learning (ESL) uses a supervised machine learning approach, Least Absolute Shrinkage and Selection Operator (LASSO), to build models explaining the relationship between a hypothesis and the variation across genomic features (e.g., sites) in sequence alignments. ESL imposes sparsity both between and within groups of genomic features (e.g., genomic loci or genes) by using sparse-group LASSO. Although some software packages are available for performing sparse-group LASSO, we found them poorly suited to processing and analyzing genome-scale sequence data containing millions of features, such as bases. MyESL fills the need for open-source software for conducting ESL analyses, with facilities to pre-process the input hypotheses and large alignments, make LASSO flexible and computationally efficient, and post-process the output model to produce different metrics useful in functional or evolutionary genomics. MyESL takes binary responses or phylogenetic trees as the regression response, processing them into class-balanced hypotheses as required. It also processes continuous and binary features, or sequence alignments that are transformed into a binary one-hot encoded feature matrix for analysis. The model outputs are processed into user-friendly text and graphical files. The computational core of MyESL is written in C++ and offers model building with or without group sparsity, while the pre- and post-processing of inputs and model outputs is performed using customized functions written in Python. One of its applications in phylogenomics showcases the utility of MyESL. Our analysis of empirical genome-scale datasets shows that MyESL can build evolutionary models quickly and efficiently on a personal desktop, whereas other computational packages could not do so because of their prohibitive requirements for computational resources and time. MyESL is available for Python environments on Linux and is distributed as a standalone application for both Windows and macOS; it can be integrated into third-party software and pipelines.
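
To make the two central ideas in the abstract concrete, the following is a minimal Python sketch of (1) one-hot encoding an alignment into a binary feature matrix and (2) evaluating a sparse-group LASSO penalty over grouped coefficients. It is an illustration only, not MyESL's implementation: the function names `one_hot_encode_alignment` and `sparse_group_lasso_penalty`, the choice to skip invariant sites, the treatment of every character (including gaps) as a residue, and the square-root group-size weighting (taken from the standard sparse-group LASSO formulation) are all assumptions.

```python
import numpy as np


def one_hot_encode_alignment(sequences):
    """Turn equal-length aligned sequences into a binary feature matrix.

    Each (site, observed residue) pair becomes one column. Invariant sites
    are skipped here (an assumption) because they cannot separate classes.
    Returns the matrix X (rows = sequences) and a list of column labels.
    """
    n_sites = len(sequences[0])
    if any(len(s) != n_sites for s in sequences):
        raise ValueError("aligned sequences must have equal length")
    columns, labels = [], []
    for site in range(n_sites):
        residues = sorted({s[site] for s in sequences})
        if len(residues) < 2:
            continue  # invariant site: no column emitted
        for r in residues:
            columns.append([1 if s[site] == r else 0 for s in sequences])
            labels.append((site, r))
    if not columns:
        return np.zeros((len(sequences), 0), dtype=np.int8), labels
    return np.array(columns, dtype=np.int8).T, labels


def sparse_group_lasso_penalty(beta, groups, lam_site, lam_group):
    """Penalty = lam_site * ||beta||_1 + lam_group * sum_g sqrt(p_g) * ||beta_g||_2.

    `groups[i]` is the group id (e.g., gene) of coefficient i, and p_g is
    the number of coefficients in group g. The L1 term gives sparsity
    within groups; the group-wise L2 term gives sparsity between groups.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    l1_term = np.abs(beta).sum()
    group_term = 0.0
    for g in np.unique(groups):
        b = beta[groups == g]
        group_term += np.sqrt(b.size) * np.linalg.norm(b)
    return lam_site * l1_term + lam_group * group_term


# Toy alignment of three sequences; sites 1 and 2 are invariant.
X, labels = one_hot_encode_alignment(["ACGT", "ACGA", "TCGA"])
penalty = sparse_group_lasso_penalty(
    beta=[3.0, 4.0, 0.0, 0.0], groups=[0, 0, 1, 1], lam_site=1.0, lam_group=1.0
)
```

In this toy case only sites 0 and 3 vary, so `X` has four columns (two residues per variable site). The second group has all-zero coefficients and therefore contributes nothing to the penalty, which is how an entire gene can drop out of a sparse-group model.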

MyESL is available from https://github.com/kumarlabgit/MyESL.