Bioinformatics Advances: latest articles

TidyGWAS: a scalable approach for standardized cleaning of genome-wide association study summary statistics.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-27 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf262
Arvid Harder, Jerry Guintivano, Joëlle A Pasman, Patrick F Sullivan, Yi Lu

Motivation: Genome-wide association studies (GWAS) have transformed human genetics by identifying tens of thousands of trait-associated variants, enabling applications from drug discovery to polygenic risk prediction. These advancements depend critically on open sharing of GWAS summary statistics. However, a lack of standardized formats complicates downstream analyses, requiring extensive dataset-specific "munging" before analysis can proceed.

Results: Here we present tidyGWAS, an R package that streamlines this process by cleanly separating data validation and harmonization from quality control. tidyGWAS uses curated data to repair and harmonize variant identifiers across genome builds, imputes missing columns when possible, and validates summary statistics with minimal filters. Outputs are saved as partitioned parquet files, optimized for high-throughput analysis via the arrow package. Benchmarked against existing tools, tidyGWAS is up to 6.5× faster and substantially more memory efficient. Additionally, we implement a fixed-effects meta-analysis directly on tidyGWAS output, achieving up to a 10× speedup over existing software. tidyGWAS simplifies and accelerates statistical genetic workflows, improving reproducibility and scalability for large-scale genetic analyses.

Availability and implementation: The package, reference data, and Docker containers are freely available for broad adoption.
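
The fixed-effects meta-analysis mentioned in the abstract is commonly computed with inverse-variance weighting. The sketch below is a minimal Python illustration of that standard formula for a single variant. It is not tidyGWAS code, and the function name `fixed_effects_meta` and the input values are made up for illustration.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects estimate for one variant.

    betas: per-study effect sizes; ses: per-study standard errors.
    (Hypothetical helper, not part of tidyGWAS.)
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and the same length")
    weights = [1.0 / (se * se) for se in ses]  # w_i = 1 / se_i^2
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Illustrative inputs: two studies reporting the same variant.
beta, se, z, p = fixed_effects_meta([0.10, 0.20], [0.05, 0.10])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.4f}")
```

With these inputs the weights are 400 and 100, so the pooled effect is 0.12 with a standard error of about 0.045. The more precise study dominates the estimate, which is the intended behaviour of a fixed-effects model.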

Citations: 0
Gaining insights into Alzheimer's disease by predicting chromatin spatial organization.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-25 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf268
Camilo Villaman, Irene Cartas-Espinel, Mauricio Saez, Alberto J M Martin

Motivation: CTCF is a conserved protein involved in the establishment and maintenance of topologically associating domains (TADs) and loops. Alzheimer's disease (AD) is the most common form of dementia, affecting over 50 million elderly individuals. Epigenetic alterations are a hallmark of AD, and epigenetic disruptions can affect CTCF binding and looping. Understanding the dynamics of CTCF loops in AD may reveal previously unrecognized contributions of CTCF to the etiology of AD. To study these dynamics, we developed a CTCF loop predictor using genomic and epigenomic features, such as CTCF motif information, CTCF protein binding information, and several histone marks.

Results: We obtained F-scores of over 0.9 in the GM12878 and K562 cell lines. We report the importance of each feature in classification and compare the results with other loop predictors. After testing the predictor, we predicted loops in control and AD data, computed a loop disruption score, and selected the most disrupted loops in AD, all of which had previously been linked to AD in the literature. Our study contributes to a better understanding of the role of CTCF binding and CTCF loops in gene regulation, and highlights new clues about CTCF in the etiology and development of AD.

Availability and implementation: The method is available at https://github.com/networkbiolab/jalpy.

Citations: 0
MutSeqR: an open source R package for standardized analysis of error-corrected next-generation sequencing data in genetic toxicology.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-23 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf265
Annette E Dodge, Andrew Williams, Danielle P M LeBlanc, David M Schuster, Elena Esina, Charles C Valentine, Jesse J Salk, Alex Y Maslov, Chris Bradley, Carole L Yauk, Francesco Marchetti, Matthew J Meier

Motivation: Error-corrected next-generation sequencing (ECS) methods are increasingly used to assess mutagenicity and other genetic toxicology endpoints. The lack of open, standardized bioinformatic workflows and tools hampers the reproducibility, comparability, and consistent interpretation of ECS data in genetic toxicity assessment.

Results: We present MutSeqR, an open source R package to analyse ECS mutation data for genetic toxicology studies. MutSeqR offers practical variant filtering, comparative analysis of mutation frequency between experimental conditions, dose-response assessment via benchmark dose calculations, mutation spectrum analysis, and clonality analyses. We demonstrate MutSeqR's application using published datasets on mice treated with benzo[a]pyrene or benzo[b]fluoranthene, analysed using Duplex Sequencing and SMM-seq, respectively. MutSeqR's flexible functions enable reproducible analyses across ECS platforms, facilitating research and regulatory applications in mutagenicity testing.

Availability and implementation: MutSeqR is freely available under an open source license at https://github.com/EHSRB-BSRSE-Bioinformatics/MutSeqR. Implemented in R (version 3.4.0 or greater), it supports all major operating systems. Sequencing data for Project 1 has been deposited in the Sequence Read Archive under accession number PRJNA803048. Variant call files for Project 2 are available on Mendeley Data (doi: 10.17632/65dnysxym8.1).

Citations: 0
RNA-EFM: energy-based flow matching for protein-conditioned RNA sequence-structure co-design.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf258
Abrar Rahman Abir, Liqing Zhang

Motivation: Designing RNA molecules that can specifically bind to target proteins is fundamental to numerous biological and therapeutic applications. However, existing approaches to protein-conditioned RNA design primarily focus on structural alignment or sequence recovery, often ignoring essential biophysical factors such as molecular stability and thermodynamic feasibility.

Results: To address this gap, we propose RNA-EFM, a novel deep learning framework that integrates energy-based refinement with flow matching for protein-conditioned RNA sequence and structure co-design. RNA-EFM consists of two complementary components: a flow matching objective that supervises geometric alignment between predicted and native RNA backbone structures, and an energy-based idempotent refinement that iteratively improves RNA structure predictions by minimizing both structural error and physical energy. The energy refinement is guided by biophysical priors including the Lennard-Jones potential and sequence-derived free energy, ensuring that the generated RNAs are not only geometrically plausible but also thermodynamically stable. We demonstrate the effectiveness of RNA-EFM through extensive experiments. RNA-EFM significantly outperforms state-of-the-art baselines in terms of RMSD, lDDT, sequence recovery, and binding energy improvement. These results highlight the importance of incorporating biophysical constraints into RNA design and establish RNA-EFM as a promising framework.

Availability and implementation: The source code for RNA-EFM is available at: https://github.com/abrarrahmanabir/RNA-EFM.
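
One of the biophysical priors named in the abstract is the Lennard-Jones potential. As a rough illustration of what such an energy term measures, the sketch below evaluates the standard 12-6 form, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], over all atom pairs. The abstract does not say how RNA-EFM parametrizes or applies this term, so the reduced units (ε = σ = 1), the function names, and the coordinates are assumptions.

```python
import math
from itertools import combinations


def lj_pair(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones energy for one pair at distance r.

    epsilon (well depth) and sigma (zero-crossing distance) default to
    reduced units; these defaults are illustrative assumptions.
    """
    if r <= 0.0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_total(coords, epsilon=1.0, sigma=1.0):
    """Sum the pair energy over all unordered pairs of 3D points."""
    return sum(
        lj_pair(math.dist(a, b), epsilon, sigma)
        for a, b in combinations(coords, 2)
    )


# Three made-up points; a real use would pass backbone atom coordinates.
points = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0)]
print(lj_total(points))
```

The pair energy is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)·σ. A refinement step that lowers this sum therefore pushes atoms apart when they clash and pulls them together when they sit just beyond the optimal distance.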

Citations: 0
Unifying proteomic technologies with ProteinProjector.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf266
Leah V Schaffer, Mayank Jain, Rami Nasser, Roded Sharan, Trey Ideker

Summary: Proteomics has developed many approaches to characterize the subcellular organization of proteins, each with differing coverage and sensitivity to distinct scales. Here, we develop a self-supervised deep learning framework, ProteinProjector, that flexibly integrates all available data for a protein from any number of modalities, resulting in a unified map of protein position. As an initial proof of concept, we integrate four proteome-wide characterizations of HEK293 human embryonic kidney cells, including protein affinity purification, proximity ligation, and size-exclusion-chromatography mass spectrometry (AP-MS, PL-MS, SEC-MS), as well as protein fluorescent imaging. Map coverage and accuracy grow substantially as new data modes are added, with maximal recovery of known complexes observed when using all four proteomic datasets. We find that ProteinProjector outperforms individual modalities and other integration methods in recovery of orthogonal functional and physical associations not used during training. ProteinProjector provides a foundation for integration of diverse modalities that characterize subcellular structure.

Availability and implementation: ProteinProjector is available as part of the Cell Mapping Toolkit at https://github.com/idekerlab/cellmaps_coembedding.

Citations: 0
Genomic optimum contribution selection and mate allocation using JuMP.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf259
Patrik Waldmann

Motivation: Artificial selection improves desired traits, but reduces genetic diversity within populations. Modern breeding programs aim to balance genetic gain with the maintenance of genetic variation to ensure long-term sustainability. Optimum contribution selection (OCS) is a widely adopted strategy that maximizes genetic gain while limiting the rate of inbreeding, traditionally relying on pedigree data. However, genomic relationship matrices offer a more accurate measure of genetic relatedness. A subsequent step to OCS involves mate allocation (MA) to optimize breeding plans, which often presents significant computational challenges for large datasets.

Results: We developed a two-stage genomic OCS and mate allocation (GOCSMA) method implemented in JuMP/Julia. The OCS problem is formulated as a linear program with quadratic constraints and solved efficiently using the conic operator splitting method (COSMO). The subsequent MA problem, expressed as a mixed integer program, is solved with the SCIP framework's branch-cut-and-price algorithm. Applying GOCSMA to the simulated QTLMAS2010 dataset, we observed efficient convergence for OCS, which balanced genetic gain against coancestry constraints better than traditional top selection. The MA stage consistently achieved very low runtimes (< 0.01 seconds), with integer mating constraints providing lower coancestry and higher genetic gain than binary constraints, indicating a more optimal mating scheme. Hence, GOCSMA provides an efficient deterministic mathematical optimization framework for integrated genomic OCS and MA. Using advanced solvers within the flexible JuMP environment, our method offers a robust solution to balance genetic gain and diversity in large-scale breeding programs.

Availability and implementation: Source code and documentation are available at https://github.com/patwa67/GOCSMA.
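
To make the mate allocation step concrete, the sketch below solves a toy version by exhaustive enumeration in Python. This is a deliberate stand-in: the paper formulates MA as a mixed integer program solved with SCIP from JuMP/Julia, whereas this example simply tries every one-to-one pairing. The objective (minimize the summed coancestry of the chosen matings), the function name `allocate_mates`, and the matrix values are all assumptions for illustration.

```python
from itertools import permutations


def allocate_mates(coancestry):
    """Pick one dam per sire so that total pairwise coancestry is minimal.

    coancestry[i][j] is the coancestry between sire i and dam j; the
    matrix must be square. Enumeration costs n! and is only viable for
    tiny examples, unlike a mixed integer programming solver.
    """
    n = len(coancestry)
    best_cost, best_plan = None, None
    for dams in permutations(range(n)):
        cost = sum(coancestry[sire][dam] for sire, dam in enumerate(dams))
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, list(enumerate(dams))
    return best_plan, best_cost


# Made-up coancestry values for three sires (rows) and three dams (columns).
coancestry = [
    [0.25, 0.05, 0.10],
    [0.05, 0.25, 0.10],
    [0.10, 0.10, 0.02],
]
plan, cost = allocate_mates(coancestry)
print(plan, round(cost, 2))
```

Here the best plan pairs sire 0 with dam 1, sire 1 with dam 0, and sire 2 with dam 2, for a total coancestry of 0.12. Each mating is a yes/no choice in this toy, which corresponds to the binary case; allowing a pair to be mated more than once would need integer variables, as in the formulation the abstract compares it with.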

Citations: 0
SUMO: an R package for simulating multi-omics data for methods development and testing.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf264
Bernard Isekah Osang'ir, Surya Gupta, Ziv Shkedy, Jürgen Claesen

Motivation: Insights from integrative multi-omics analyses have fueled demand for innovative computational methods and tools in multi-omics research. However, the scarcity of multi-omics datasets with user-defined signal structures hinders the evaluation of these newly developed tools. SUMO (SimUlating Multi-Omics), an open-source R package, was developed to address this gap by enabling the generation of high-quality factor analysis-based datasets with full control over the dataset's structure such as latent structures, noise, and complexity. Users can configure datasets with distinct and/or shared non-overlapping latent factors, enabling flexible and precise control over the signal structures. Consequently, SUMO allows reproducible testing and validation of methods, fostering methodological innovation.

Availability and implementation: The SUMO R package is freely available on the Comprehensive R Archive Network (CRAN) at https://doi.org/10.32614/CRAN.package.SUMO and on GitHub at https://github.com/lucp12891/SUMO.git under a CC-BY 4.0 license.
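
The idea of a factor analysis-based simulation with a user-defined signal can be sketched in a few lines: each data matrix is a latent factor score multiplied by feature loadings, plus noise. The Python sketch below illustrates that generic model with one factor shared by two omics layers. It does not use or reproduce the SUMO R API; the function name `simulate_omic`, the dimensions, and the noise level are hypothetical.

```python
import random


def simulate_omic(scores, n_features, signal_features, noise_sd, rng):
    """Return an n_samples x n_features matrix (list of rows).

    Each entry is score * loading + Gaussian noise. Only features in
    signal_features get a non-zero loading, so the signal location is
    fully controlled by the caller. (Hypothetical helper, not SUMO.)
    """
    loadings = [
        rng.gauss(0.0, 1.0) if j in signal_features else 0.0
        for j in range(n_features)
    ]
    return [
        [s * loadings[j] + rng.gauss(0.0, noise_sd) for j in range(n_features)]
        for s in scores
    ]


rng = random.Random(42)  # fixed seed for reproducibility
n_samples = 20

# One latent factor, active only in the first 10 samples.
shared_scores = [rng.gauss(0.0, 1.0) if i < 10 else 0.0 for i in range(n_samples)]

# The same scores drive both layers, which makes the factor "shared".
omic1 = simulate_omic(shared_scores, 30, set(range(0, 8)), 0.5, rng)
omic2 = simulate_omic(shared_scores, 15, set(range(5, 10)), 0.5, rng)
print(len(omic1), len(omic1[0]), len(omic2), len(omic2[0]))
```

Because the samples and features carrying signal are known in advance, a method under test can be scored on whether it recovers them. A factor unique to one layer would follow the same pattern with its own score vector passed to only one call.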

Citations: 0
NifFinder: improved Nif protein prediction using SWeeP vectors and neural networks.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-16 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf260
Bruno Thiago de Lima Nichio, Roxana Beatriz Ribeiro Chaves, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz

Motivation: Biological nitrogen fixation is a vital process for global ecosystems and agriculture; however, the diversity and complexity of nif genes present significant challenges for the accurate identification of Nif proteins. Existing computational tools are often limited to a narrow subset of nif genes, leaving many important protein classes unexplored. NifFinder was developed to address this gap, combining SWeeP vector representation with neural network models to predict up to 24 different Nif proteins. By expanding the predictive scope and improving accuracy, NifFinder provides a more comprehensive and reliable framework to study nitrogen fixation, supporting both evolutionary insights and applications in agricultural sustainability.

Results: We present NifFinder, a computational framework that integrates SWeeP vector encoding with neural network classifiers to predict up to 24 different Nif protein classes across Archaea and Bacteria. NifFinder achieved an average accuracy of 84.31%, with sensitivity (86.49%), precision (81.97%), F1-score (82.33%), and a class correlation coefficient of 0.94. Benchmarking against curated Nif resources showed strong agreement and robust classification even under class imbalance. By expanding beyond traditional subsets of nif genes, NifFinder enables more reliable genome-wide identification of Nif proteins.

Availability and implementation: The NifFinder installation instructions and source code can be accessed at https://sourceforge.net/projects/NifFinder.

Citations: 0
StarPepWeb: an integrative, graph-based resource for bioactive peptides.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-16 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf261
Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce

Motivation: The rapid growth of bioactive peptide sequences presents challenges for organization and analysis. Existing repositories often specialize in functions, taxonomic origins, or structural classes, but most remain isolated, use heterogeneous metadata, and lack uniform descriptors or structural models. Few integrative web services exist, offering only partial coverage or depth. As a result, reproducible and comprehensive exploration of the bioactive peptide landscape remains limited, underscoring the need for a unified, source-tracked, extensible platform.

Results: We present StarPepWeb, a freely accessible web application that democratizes access to StarPepDB, one of the largest curated repositories of bioactive peptides. The platform integrates 45 120 non-redundant sequences from 40 public databases into a source-tracked graph enriched with metadata, physicochemical features, and predicted 3D structures from ESMFold. Each peptide is represented with ESM-2 embeddings and iFeature descriptors, while the interface supports metadata-aware filtering, alignment-based similarity searches with single and multiple queries, and interactive visualization. A microservice-oriented architecture ensures scalability, maintainability, and reproducible versioned downloads, including Neo4j exports. StarPepWeb thus overcomes deployment and expertise barriers of the standalone database, providing an extensible, cloud-hosted framework for integrative bioactive peptide analysis.

Availability and implementation: StarPepWeb is freely available at https://starpepweb.org. Source code and documentation are hosted at https://github.com/starpep-web.

Citations: 0
The Hydractinia Genome Project Portal: multi-omic annotation and visualization of Hydractinia genomic datasets.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-15 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf215
R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis

Motivation: Colonial hydroids of the genus Hydractinia exhibit several unique biological properties, including a remarkable regenerative capacity and the ability to distinguish self from non-self, that make them valuable models for studying human disease and aging. The availability of well-annotated multi-omic data, as well as tools to visualize these data, is essential for advancing the use of these model organisms to enhance our understanding of the relationship between genomic and morphological complexity, the evolution of multicellularity, and the emergence of novel cell types.

Results: We present the Hydractinia Genome Project Portal, a comprehensive resource providing genomic, transcriptomic, and proteomic datasets for two widely studied Hydractinia species. The portal provides extensive sequence, structure, and functional annotation resources that are not available elsewhere, including genome browsers, a single-cell gene expression atlas, a protein structure viewer, and a custom BLAST implementation. We demonstrate the portal's utility for biological discovery and have used a subset of Hydractinia-specific stem cell gene markers to explore known gaps in annotation transfer methods, illustrating how structure-based deep learning methods such as DeepFRI can significantly improve the functional annotation of heretofore unannotated i-cell markers.

Availability and implementation: The Hydractinia Genome Project Portal is freely available at https://research.nhgri.nih.gov/hydractinia.

Citations: 0