Bioinformatics (Oxford, England): latest articles

Unsupervised synchronization of molecular dynamics trajectories via graph embedding and time warping.
IF 5.4 Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag017
Manuel Mangoni, Salvatore Daniele Bianco, Francesco Petrizzelli, Michele Pieroni, Pietro Hiram Guzzi, Viviana Caputo, Tommaso Biagini, Tommaso Mazza

Motivation: Molecular dynamics (MD) simulations provide detailed atomistic insights into biomolecular processes, but comparing independent trajectories remains challenging due to stochastic divergence. Misaligned simulations can obscure shared mechanisms or exaggerate differences, limiting reproducibility and mechanistic interpretation. A generalizable, unsupervised method for synchronizing and comparing MD trajectories across systems and conditions is, therefore, needed.

Results: We introduce NetMD, a computational framework that synchronizes and analyzes MD trajectories by integrating graph-based representations with dynamic time warping. Trajectory frames are converted into residue-contact graphs, entropy-filtered to retain variable interactions, and embedded as low-dimensional vectors. NetMD aligns these vectorized trajectories through time-warping barycenter averaging, generating a consensus trajectory while pruning outlier simulations. Applied to transporters, demethylases, and large protein complexes relevant to neurological disease pathways and cancer, NetMD revealed shared multiphase dynamics and identified mutation- or ligand-specific deviations. This unsupervised, time-resolved approach enables direct comparison of MD ensembles across heterogeneous conditions. NetMD is robust and broadly applicable, providing a tool for uncovering conserved patterns and critical divergences in biomolecular dynamics.

Availability and implementation: NetMD is freely available at https://github.com/mazzalab/NetMD.
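
A minimal sketch of two steps named in the Results above: scoring how variable a residue contact is across frames, and aligning two embedded trajectories with dynamic time warping. It is not NetMD's implementation; the function names, the binary contact encoding, and the pairwise (rather than barycenter) alignment are assumptions made for illustration.

```python
import math


def contact_entropy(presence):
    """Shannon entropy (bits) of a 0/1 contact series across frames.

    Assumption: a contact is encoded as 1 (formed) or 0 (absent) per frame.
    """
    p = sum(presence) / len(presence)
    if p in (0.0, 1.0):
        return 0.0  # contact never changes, so it carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def dtw(a, b):
    """Classic dynamic time warping between two sequences of vectors.

    Returns (total cost, warping path as a list of (i, j) frame pairs).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Under this scheme a contact present in every frame scores zero entropy and would be filtered out, while the warping path pairs each frame of one trajectory with its time-matched frame in the other.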

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
ChemGenXplore: an interactive tool for exploring and analysing chemical genomic data.
IF 5.4 Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag021
Huda Ahmad, Hannah M Doherty, Sam T Benedict, James R J Haycocks, Ge Zhou, Patrick J Moynihan, Danesh Moradigaravand, Manuel Banzhaf

Motivation: Chemical genomics is a powerful high-throughput approach to systematically link phenotypes to genotypes. However, the vast datasets generated remain challenging to explore due to the lack of integrated, interactive tools for visualization and analysis. Existing workflows often require multiple independent software tools, limiting data accessibility and collaboration. Therefore, we created a user-friendly platform that enables efficient exploration and sharing of chemical genomics data.

Results: We developed ChemGenXplore, a web-based Shiny application designed to streamline the visualization and analysis of chemical genomic screens. It offers two primary functionalities: one for exploring pre-implemented datasets and another for analysing user-uploaded datasets. ChemGenXplore enables users to visualize phenotypic profiles, assess gene-gene and condition-condition correlations, perform GO and KEGG enrichment analysis, and generate customizable, interactive heatmaps. To further support collaborative research, ChemGenXplore also facilitates the comparative analysis of chemical genomic and other omics datasets. By consolidating these features into a single interactive and accessible tool, ChemGenXplore facilitates data sharing, enhances reproducibility, and promotes collaboration within the research community.

Availability and implementation: ChemGenXplore is freely accessible as a web application at https://chemgenxplore.kaust.edu.sa/. Source code and documentation, including instructions for local installation, are provided on GitHub (https://github.com/Hudaahmadd/ChemGenXplore). A Docker image is also available on DockerHub (https://hub.docker.com/r/hudaahmad/chemgenxplore) to ensure reproducibility and simplify installation.
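
To make the gene-gene correlation step mentioned in the Results concrete, here is a small sketch that computes Pearson correlations between phenotypic profiles. The data layout (one list of fitness scores per gene, one score per condition) and all names are assumptions for illustration; ChemGenXplore itself is a Shiny application and this is not its code.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def gene_gene_correlations(profiles):
    """Correlate every pair of genes.

    profiles: {gene name: [score in condition 1, score in condition 2, ...]}
    (hypothetical layout). Returns {(gene_a, gene_b): r}.
    """
    return {
        (g1, g2): pearson(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }
```

Transposing the input (one list per condition, one score per gene) would give condition-condition correlations with the same function.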

Contact: example@example.org

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
Improving biomedical entity linking with generative relevance feedback.
IF 5.4 Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag011
Darya Shlyk, Lawrence Hunter

Motivation: Biomedical Entity Linking (BEL) maps mentions in biomedical text to standardized identifiers, enabling structured data integration and downstream knowledge discovery. However, current BEL systems remain fundamentally constrained by the recall of the initial candidate pool, where suboptimal retrieval limits the overall effectiveness of the normalization pipeline.

Results: We present the first systematic evaluation of Generative Relevance Feedback (GRF) for enhancing candidate retrieval in state-of-the-art BEL systems. GRF leverages large language models (LLMs) to enrich the expressiveness of the mention in a zero-shot fashion. We assess GRF's impact under two scenarios (direct linking prediction and candidate generation in cascading normalization pipelines) and analyze its sensitivity to different LLMs, feedback types, and integration strategies. Experiments across eight corpora and four biomedical knowledge bases demonstrate that integrating GRF significantly improves both accuracy and recall, thereby increasing the upper bound on normalization performance. Our findings highlight GRF as an efficient, model-agnostic solution and underscore its potential as a key component for advancing BEL.

Availability and implementation: The code to reproduce our experiments can be found at: https://doi.org/10.5281/zenodo.17853541.
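
The idea of enriching a mention with generated text before retrieval can be sketched as follows. Everything here is a toy assumption: the three-entry knowledge base, the bag-of-words cosine retriever, the canned `fake_llm_feedback` stand-in for an LLM call, and the choice to integrate feedback by simple concatenation. The paper evaluates several feedback types and integration strategies that this sketch does not reproduce.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: identifier -> canonical name.
KB = {
    "C002": "cerebral infarction",
    "C003": "migraine headache",
    "C001": "myocardial infarction",
}


def vec(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query, k=2):
    """Return the k knowledge-base ids most similar to the query."""
    q = vec(query)
    ranked = sorted(KB, key=lambda cid: cosine(q, vec(KB[cid])), reverse=True)
    return ranked[:k]


def fake_llm_feedback(mention):
    """Placeholder for a zero-shot LLM call; returns canned text for the demo."""
    canned = {"heart attack": "myocardial infarction of the heart muscle"}
    return canned.get(mention, "")


def retrieve_with_grf(mention, k=2):
    """Expand the mention with generated feedback, then retrieve."""
    return retrieve(mention + " " + fake_llm_feedback(mention), k)
```

In this toy setting the bare mention "heart attack" shares no words with any entry, so retrieval cannot rank the right concept first; the expanded query shares "myocardial" and "infarction" with the intended entry and ranks it first.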

Citations: 0
mimicDetector: a pipeline for protein motif mimicry detection in host-pathogen interactions.
IF 5.4 Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag012
Kaylee D Rich, James D Wasmuth

Motivation: Molecular mimicry is used by pathogens to evade the host immune system and manipulate other host cellular processes. It is often mediated by short motifs in non-homologous proteins, whose detection challenges the sensitivity and specificity of existing bioinformatics tools.

Results: We present mimicDetector, a k-mer-based pipeline for identifying protein-level molecular mimicry between pathogens and their hosts. Applied to 17 globally important pathogens, mimicDetector identified a broad and biologically plausible set of mimicry candidates, including helminth proteins mimicking components of the human complement system and a Leishmania infantum mimic of Reticulon-4, a regulator of immune cell recruitment.

Availability and implementation: mimicDetector is freely available at https://github.com/kayleerich/mimicDetector/, implemented in Python and Snakemake, and compatible with Unix-based systems.
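
As a rough illustration of what "k-mer-based" comparison means, the sketch below finds short peptide k-mers shared exactly between one pathogen protein and a set of host proteins. The function names, the default k, and the exact-match criterion are assumptions; the abstract does not describe mimicDetector's scoring or filtering, and this is not its code.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(pathogen_seq, host_proteins, k=6):
    """Find k-mers a pathogen protein shares with each host protein.

    host_proteins: {protein name: amino-acid sequence} (hypothetical layout).
    Returns {host protein name: sorted list of shared k-mers}, omitting
    host proteins with no shared k-mer.
    """
    query = kmers(pathogen_seq, k)
    hits = {}
    for name, seq in host_proteins.items():
        common = query & kmers(seq, k)
        if common:
            hits[name] = sorted(common)
    return hits
```

A realistic pipeline would also need to exclude homologous proteins and tolerate substitutions, since exact matching alone is neither sensitive nor specific.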

Supplementary information: Data relevant to the results are included in the article and its online supplementary material, available at Bioinformatics online.
Citations: 0
scSNViz: visualization and analysis of cell-specific expressed SNVs.
IF 5.4 Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag023
Siera Martinez, Tushar Sharma, Luke Johnson, Allen Kim, Vania Ballesteros Prieto, Hovhannes Arestakesyan, Sunisha Harish, Jewel Dias, Joseph Goldfrank, Nathan Edwards, Anelia Horvath

Motivation: Accurately characterizing expressed genetic variation at the single-cell level is essential for understanding transcriptional heterogeneity, allelic regulation, and mutational dynamics within complex tissues. However, few tools enable comprehensive visualization and quantitative analysis of expressed variants across individual cells.

Results: scSNViz is an R package for the exploration, quantification, and visualization of expressed single-nucleotide variants (SNVs) from cell-barcoded single-cell RNA sequencing (scRNA-seq) data. The software supports estimation of variant allele fractions, clustering of SNV expression profiles, and 2D and 3D visualization of individual SNVs or user-defined SNV groups. Beyond visualization, scSNViz facilitates investigation of cell-, cluster-, or lineage-specific variant expression patterns, as well as allelic dynamics including imprinting, random allele inactivation, and transcriptional bursting. It interoperates seamlessly with established single-cell frameworks (Seurat for clustering, Slingshot for trajectory inference, scType for cell-type annotation, and CopyKat for copy-number profiling), enabling integrative multi-omic analyses of expressed variation.

Availability and implementation: scSNViz is implemented in R and freely available at https://github.com/HorvathLab/scSNViz (DOI: 10.5281/zenodo.17307516). The package includes comprehensive documentation and example workflows designed for users with limited bioinformatics experience.
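
The variant allele fraction mentioned in the Results is commonly defined as the number of variant-supporting reads divided by the total reads covering the site. The sketch below applies that definition per cell barcode. It is written in Python for illustration only (scSNViz is an R package), and the input layout and the `min_reads` coverage threshold are assumptions.

```python
def variant_allele_fraction(n_var, n_ref, min_reads=1):
    """VAF = variant reads / (variant + reference reads).

    Returns None when coverage is zero or below min_reads, because the
    fraction is then undefined or too noisy to interpret.
    """
    total = n_var + n_ref
    if total == 0 or total < min_reads:
        return None
    return n_var / total


def vaf_by_cell(counts, min_reads=3):
    """Compute the VAF of one SNV in every cell.

    counts: {cell barcode: (variant reads, reference reads)}
    (hypothetical layout).
    """
    return {
        barcode: variant_allele_fraction(n_var, n_ref, min_reads)
        for barcode, (n_var, n_ref) in counts.items()
    }
```

Values near 0 or 1 in many cells with intermediate values in bulk are the kind of pattern that would point towards monoallelic expression or bursting rather than balanced biallelic expression.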

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
EXPLANA: a user-friendly workflow for EXPLoratory ANAlysis and feature selection in cross-sectional and longitudinal microbiome studies.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf658
Jennifer Fouquier, Maggie Stanislawski, John O'Connor, Ashley Scadden, Catherine Lozupone

Motivation: Longitudinal microbiome studies (LMS) are increasingly common but have analytic challenges including nonindependent data requiring mixed-effects models. Furthermore, large amounts of data motivate exploratory analysis to identify factors related to outcome variables. Although change analysis (i.e. calculating feature changes between timepoints) can be powerful, how to best conduct these analyses is often unclear. For example, observational LMS measurements show natural fluctuations, so baseline might not be a reference of primary interest, whereas for interventional LMS, baseline is typically a key reference point, often indicating the start of treatment.

Results: To address these challenges, a feature selection workflow, called EXPLANA (EXPLoratory ANAlysis), was developed for LMS that supports numerical and categorical data, and also accommodates cross-sectional studies. Machine learning methods were combined with different types of change calculations and downstream interpretation methods to identify statistically meaningful variables and explain their relationship to outcomes. EXPLANA generates an interactive report that textually and graphically summarizes methods and results. EXPLANA had good performance on simulated longitudinal data, with a balanced accuracy score of 0.91 (range: 0.79-1.00, SD = 0.05), outperformed an existing tool, QIIME 2 feature-volatility (balanced accuracy: 0.95 versus 0.56) and identified novel order-dependent categorical feature changes (e.g. different effect for A_B versus B_A). EXPLANA is broadly applicable and simplifies analytics for identifying features related to outcomes of interest.

Availability and implementation: Software is available at https://github.com/JTFouquier/explana and https://zenodo.org/records/17478745 (10.5281/zenodo.17478744). Documentation and demos are available at www.explana.io.
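
Two notions from the Results can be made concrete with a short sketch: balanced accuracy (the mean of per-class recall) and order-dependent categorical changes such as A_B versus B_A. The helper names and the underscore label format are assumptions for illustration and are not taken from EXPLANA's code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def numeric_change(before, after):
    """Change in a numeric feature between a reference and a later timepoint."""
    return after - before


def categorical_change(before, after):
    """Order-dependent change label: moving A -> B differs from B -> A."""
    return f"{before}_{after}"
```

Which timepoint serves as `before` is the analytic choice the Motivation discusses: baseline for interventional studies, or for example the previous timepoint for observational ones.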

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
NanoSSL: attention mechanism-based self-supervised learning method for protein identification using nanopores.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf657
Yong Xie, Jindong Li, Ziyan Zhang, Bin Meng, Shuaijian Dai, Yuchen Zhou, Eamonn Kennedy, Niandong Jiao, Haobin Chen, Zhuxin Dong

Motivation: Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning for nanopores are hampered by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.

Results: We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pre-training and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pretraining each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification in fine-tuning. In this study, we retested a publicly available nanopore multiplexed protein sensing dataset for model iteration, and subsequently measured Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated NanoSSL achieved an unprecedented performance across four metrics: accuracy, precision, recall, and F1 score, when classifying two mutated Aβ1-42, E22G and G37R. The self-supervised learning and attention mechanism were verified as the source of performance gains.

Availability and implementation: The main program is available at https://doi.org/10.5281/zenodo.17172822.
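
The pre-training data preparation described in the Results (splitting each translocation event into equal, non-overlapping fragments and randomly masking some of them) can be sketched as below. The function names, the masking ratio, and the decision to drop trailing samples that do not fill a whole fragment are assumptions; the masked autoencoder that reconstructs the hidden fragments is not shown.

```python
import random


def split_into_fragments(signal, size):
    """Cut a 1-D event signal into non-overlapping fragments of equal size.

    Assumption: trailing samples that do not fill a whole fragment are dropped.
    """
    n = len(signal) // size
    return [signal[i * size:(i + 1) * size] for i in range(n)]


def mask_fragments(fragments, ratio, rng):
    """Randomly hide a share of the fragments.

    Returns (visible fragments in original order, sorted masked indices).
    The encoder would see only the visible fragments; the masked indices
    mark the fragments a decoder is trained to reconstruct.
    """
    n_mask = int(round(ratio * len(fragments)))
    masked = set(rng.sample(range(len(fragments)), n_mask))
    visible = [f for i, f in enumerate(fragments) if i not in masked]
    return visible, sorted(masked)
```

Passing a seeded `random.Random` instance keeps the masking reproducible, for example `mask_fragments(split_into_fragments(event, 16), 0.5, random.Random(0))` for a hypothetical list of current samples `event`.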

Citations: 0
MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf678
Genereux Akotenou, Asmaa H Hassan, Morad M Mokhtar, Achraf El Allali

Motivation: Understanding the role of transcription factors (TFs) in plants is essential for the study of gene regulation and various biological processes. However, both TF detection and classification remain challenging due to the great diversity and complexity of these proteins. Conventional approaches, such as BLAST, often suffer from high computational complexity and limited performance on less common TF families.

Results: We introduce MegaPlantTF, the first comprehensive machine learning and deep learning framework for the prediction (TF versus non-TF) and classification (family-level) of plant TFs. Our method employs k-mer-based protein representations and a two-stage architecture combining a deep feed-forward neural network with a stacking ensemble classifier. To ensure robust performance assessment, we report micro-, macro-, and weighted-average performance metrics, providing a holistic evaluation of both frequent and underrepresented TF families. Additionally, we employ threshold-based evaluation to calibrate confidence in TF detection. The results show that MegaPlantTF achieves strong accuracy and precision, particularly with a k-mer size of 3 and a classification threshold of 0.5, and maintains stable performance even under stringent thresholds. In addition to the standard cross-validation tests, a use case study on Sorghum bicolor confirms that our method performs strongly in the genome-wide analysis, making it highly suitable for large-scale TF identification and classification tasks. MegaPlantTF represents a novel contribution by integrating k-mer encoding, binary family-specific classifiers, and a two-stage stacking ensemble into a unified, reproducible framework for large-scale plant TF identification and classification.
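
As a rough illustration of a k-mer-based protein representation with k = 3, the sketch below counts overlapping 3-mers in an amino-acid sequence and normalizes them to frequencies. The abstract does not specify how MegaPlantTF encodes k-mers, so the function names `kmer_counts` and `kmer_frequencies`, the overlapping-window scheme, and the frequency normalization are assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a protein sequence (hypothetical helper)."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequencies(sequence, k=3):
    """Normalize k-mer counts so that the values sum to 1 (an assumed encoding)."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence, not taken from any real transcription factor.
counts = kmer_counts("MKVMKVL")
freqs = kmer_frequencies("MKVMKVL")
```

For the toy sequence, the 3-mer `MKV` occurs twice and `KVM`, `VMK`, and `KVL` once each. A feature vector for a classifier could be built by mapping these frequencies onto a fixed ordering of all possible 3-mers.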

Availability and implementation: MegaPlantTF is freely accessible through a public web server available at https://bioinformatics.um6p.ma/MegaPlantTF. The complete source code, including pretrained models and example datasets, is available at https://github.com/Bioinformatics-UM6P/MegaPlantTF.

Contact and supplementary information: Supplementary data are available online. Correspondence should be sent to the authors by email or by opening an issue on the MegaPlantTF GitHub page.
Citations: 0
BPSS: a Nextflow pipeline for Bacterial Peptide Sequence Selection to detect protein diversity.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf677
Sylvère Bastien, Pauline François, Sara Moussadeq, Jérôme Lemoine, Karen Moreau, François Vandenesch

Motivation: Sequence variability can be extremely high, particularly in bacteria due to the rapid accumulation of mutations linked to their high replication rate and environmental selection pressure, which often favors diversifying selection. For most species, there are no automated, computationally efficient tools available for constructing a nonredundant database covering the allelic variability of target proteins.

Results: We therefore developed Bacterial Peptide Sequence Selection (BPSS), a Nextflow pipeline that defines a minimal list of peptide sequences for detecting all variants of a protein of interest.
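
Choosing a small set of peptides that together detect every variant can be framed as a set-cover problem. The sketch below uses a generic greedy heuristic; the abstract does not state which selection algorithm BPSS uses, so this is an assumption, and a greedy result is small but not guaranteed to be minimal. The function name `greedy_peptide_cover`, the substring-based definition of "detects", and the toy sequences are all hypothetical.

```python
def greedy_peptide_cover(variants, candidate_peptides):
    """Greedily pick peptides until every variant contains at least one of them.

    variants: dict mapping variant name -> protein sequence.
    candidate_peptides: iterable of peptide strings.
    Returns (selected peptides, variants left uncovered).
    """
    # A peptide "detects" a variant if it occurs as a substring (an assumption).
    coverage = {
        p: {name for name, seq in variants.items() if p in seq}
        for p in candidate_peptides
    }
    uncovered = set(variants)
    selected = []
    while uncovered:
        # Pick the peptide covering the most still-uncovered variants;
        # iterating in sorted order makes ties deterministic.
        best = max(sorted(coverage), key=lambda p: len(coverage[p] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            break  # remaining variants cannot be covered by any candidate
        selected.append(best)
        uncovered -= gained
    return selected, uncovered


# Toy variants and candidate peptides, invented for illustration.
variants = {"v1": "MKTAYIAK", "v2": "MKTAYLAK", "v3": "MRTAYLAK"}
candidates = ["MKTAY", "AYLAK", "AYIAK", "MRTAY"]
selected, uncovered = greedy_peptide_cover(variants, candidates)
```

In this toy case, two peptides are enough to cover the three variants, and no variant is left uncovered.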

Availability and implementation: All code and containers are freely available under the GPLv3 open-source license on GitLab (https://gitbio.ens-lyon.fr/ciri/stapath/bpss) or Zenodo (10.5281/zenodo.16894981), and on DockerHub (https://hub.docker.com/u/stapath).

Citations: 0
Hi-Enhancer: a two-stage framework for prediction and localization of enhancers based on Blending-KAN and Stacking-Auto models.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf441
Aimin Li, Haotian Zhou, Rong Fei, Juntao Zou, Xiguo Yuan, Yajun Liu, Saurav Mallik, Xinhong Hei, Lei Wang

Motivation: Gene expression plays a crucial role in cell function, and enhancers can regulate gene expression precisely. Therefore, accurate prediction of enhancers is particularly critical. However, existing prediction methods have low accuracy or rely on fixed multiple epigenetic signals, which may not always be available.

Results: We propose a two-stage framework that accurately predicts enhancers by flexibly combining multiple epigenetic signals. In the first stage, we designed a Blending-KAN model, which integrates the results of various base classifiers and employs Kolmogorov-Arnold Networks (KAN) as a meta-classifier to predict enhancers from flexible combinations of multiple epigenetic signals. In the second stage, we developed a Stacking-Auto model, which extracts sequence features using DNABERT-2 and locates the enhancers using the Stacking strategy and the AutoGluon framework. The accuracy of the Blending-KAN model reached 99.69 ± 0.11% when five epigenetic signals were used. In cross-cell line prediction, the accuracy was greater than or equal to 93.72%. With added Gaussian noise, the model still maintained an accuracy of 98.74 ± 0.03%. In the second stage, the Stacking-Auto model reached an accuracy of 80.50%, outperforming 17 existing methods. The results show that our models can be flexibly used to predict and locate enhancers using a combination of multiple epigenetic signals.

Availability and implementation: The source code is available at https://github.com/emanlee/Hi-Enhancer and https://doi.org/10.6084/m9.figshare.29262158.v1.

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0