Bioinformatics Advances: latest articles

Unifying proteomic technologies with ProteinProjector.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf266
Leah V Schaffer, Mayank Jain, Rami Nasser, Roded Sharan, Trey Ideker

Summary: Proteomics has developed many approaches to inform the subcellular organization of proteins, each with differing coverage and sensitivity to distinct scales. Here, we develop a self-supervised deep learning framework, ProteinProjector, that flexibly integrates all available data for a protein from any number of modalities, resulting in a unified map of protein position. As an initial proof of concept, we integrate four proteome-wide characterizations of HEK293 human embryonic kidney cells, including protein affinity purification, proximity ligation, and size-exclusion-chromatography mass spectrometry (AP-MS, PL-MS, SEC-MS), as well as protein fluorescent imaging. Map coverage and accuracy grow substantially as new data modes are added, with maximal recovery of known complexes observed when using all four proteomic datasets. We find that ProteinProjector outperforms individual modalities and other integration methods in recovery of orthogonal functional and physical associations not used during training. ProteinProjector provides a foundation for integration of diverse modalities that characterize subcellular structure.

Availability and implementation: ProteinProjector is available as part of the Cell Mapping Toolkit at https://github.com/idekerlab/cellmaps_coembedding.

Bioinformatics Advances, volume 5, issue 1, article vbaf266. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12680973/pdf/
Citations: 0
Genomic optimum contribution selection and mate allocation using JuMP.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf259
Patrik Waldmann

Motivation: Artificial selection improves desired traits, but reduces genetic diversity within populations. Modern breeding programs aim to balance genetic gain with the maintenance of genetic variation to ensure long-term sustainability. Optimum contribution selection (OCS) is a widely adopted strategy that maximizes genetic gain while limiting the rate of inbreeding, traditionally relying on pedigree data. However, genomic relationship matrices offer a more accurate measure of genetic relatedness. A subsequent step to OCS involves mate allocation (MA) to optimize breeding plans, which often presents significant computational challenges for large datasets.

Results: We developed a two-stage genomic OCS and mate allocation (GOCSMA) method implemented in JuMP/Julia. The OCS problem is formulated as a linear program with quadratic constraints and solved efficiently using the conic operator splitting method (COSMO). The subsequent MA problem, expressed as a mixed integer program, is solved with the SCIP framework's branch-cut-and-price algorithm. Applying GOCSMA to the simulated QTLMAS2010 dataset, we observed efficient convergence for OCS, which balanced genetic gain with coancestry constraints better than traditional top selection. The MA stage consistently achieved very low runtimes (<0.01 seconds), with integer mating constraints providing lower coancestry and higher genetic gain than binary constraints, indicating a more optimal mating scheme. Hence, GOCSMA provides an efficient deterministic mathematical optimization framework for integrated genomic OCS and MA. Using advanced solvers within the flexible JuMP environment, our method offers a robust solution to balance genetic gain and diversity in large-scale breeding programs.
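
To make the mate allocation stage more concrete, the following is a minimal Python sketch of a toy version of the problem: each sire is paired with exactly one distinct dam so that the summed coancestry of the matings is as small as possible. This is not the GOCSMA code. The function name `allocate_mates` and the coancestry values are hypothetical, and exhaustive enumeration stands in for a mixed integer solver, so it is only practical for a handful of animals.

```python
from itertools import permutations


def allocate_mates(coancestry):
    """Pair each sire (row) with one distinct dam (column).

    Returns the pairing with the smallest summed coancestry, found by
    trying every permutation of dams. Illustrative only: the cost grows
    factorially with the number of animals.
    """
    n = len(coancestry)
    best_cost, best_pairs = float("inf"), None
    for dams in permutations(range(n)):
        cost = sum(coancestry[sire][dam] for sire, dam in enumerate(dams))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(dams))
    return best_pairs, best_cost


# Hypothetical sire-by-dam coancestry matrix (values are made up).
K = [
    [0.10, 0.02, 0.30],
    [0.05, 0.20, 0.01],
    [0.25, 0.03, 0.04],
]
pairs, cost = allocate_mates(K)
print(pairs, round(cost, 2))
```

In this toy setting every mating variable is implicitly 0 or 1. Allowing a pair to be mated more than once, as with the integer constraints mentioned in the abstract, would need a different formulation and a proper solver.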

Availability and implementation: Source code and documentation are available at https://github.com/patwa67/GOCSMA.

Bioinformatics Advances, volume 5, issue 1, article vbaf259. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619993/pdf/
Citations: 0
SUMO: an R package for simulating multi-omics data for methods development and testing.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf264
Bernard Isekah Osang'ir, Surya Gupta, Ziv Shkedy, Jürgen Claesen

Motivation: Insights from integrative multi-omics analyses have fueled demand for innovative computational methods and tools in multi-omics research. However, the scarcity of multi-omics datasets with user-defined signal structures hinders the evaluation of these newly developed tools. SUMO (SimUlating Multi-Omics), an open-source R package, was developed to address this gap by enabling the generation of high-quality factor analysis-based datasets with full control over the dataset's structure such as latent structures, noise, and complexity. Users can configure datasets with distinct and/or shared non-overlapping latent factors, enabling flexible and precise control over the signal structures. Consequently, SUMO allows reproducible testing and validation of methods, fostering methodological innovation.
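
As a rough illustration of what a factor-analysis-based simulation means, the sketch below builds two small "omics" matrices that share one latent factor, using the generic model value = latent score × loading + Gaussian noise. It is written in Python and does not use or reproduce the SUMO R package's interface. The function `simulate_omics`, its parameters, and all numbers are assumptions chosen for illustration.

```python
import random


def simulate_omics(n_samples, n_features, signal_features, scores, noise_sd, rng):
    """Return an n_samples x n_features matrix from a one-factor model.

    Each entry is scores[i] * loading[j] + Gaussian noise, where only the
    features listed in `signal_features` have a non-zero loading (1.0).
    """
    loadings = [1.0 if j in signal_features else 0.0 for j in range(n_features)]
    return [
        [scores[i] * loadings[j] + rng.gauss(0.0, noise_sd) for j in range(n_features)]
        for i in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the simulation is reproducible
n = 6
# One latent factor shared by both omics: the first three samples carry it.
shared = [2.0 if i < 3 else 0.0 for i in range(n)]
omic1 = simulate_omics(n, 5, {0, 1}, shared, 0.1, rng)
omic2 = simulate_omics(n, 4, {3}, shared, 0.1, rng)
```

Because the signal samples and signal features are chosen by the user, a method under test can be scored on whether it recovers exactly that block, which is the kind of controlled evaluation the abstract describes.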

Availability and implementation: The SUMO R package is freely available and accessible on the Comprehensive R Archive Network https://doi.org/10.32614/CRAN.package.SUMO and on GitHub https://github.com/lucp12891/SUMO.git under CC-BY 4.0 license.

Bioinformatics Advances, volume 5, issue 1, article vbaf264. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12630132/pdf/
Citations: 0
NifFinder: improved Nif protein prediction using SWeeP vectors and neural networks.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-16 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf260
Bruno Thiago de Lima Nichio, Roxana Beatriz Ribeiro Chaves, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz

Motivation: Biological nitrogen fixation is a vital process for global ecosystems and agriculture; however, the diversity and complexity of nif genes present significant challenges for the accurate identification of Nif proteins. Existing computational tools are often limited to a narrow subset of nif genes, leaving many important protein classes unexplored. NifFinder was developed to address this gap, combining SWeeP vector representation with neural network models to predict up to 24 different Nif proteins. By expanding the predictive scope and improving accuracy, NifFinder provides a more comprehensive and reliable framework to study nitrogen fixation, supporting both evolutionary insights and applications in agricultural sustainability.

Results: We present NifFinder, a computational framework that integrates SWeeP vector encoding with neural network classifiers to predict up to 24 different Nif protein classes across Archaea and Bacteria. NifFinder achieved an average accuracy of 84.31%, with sensitivity (86.49%), precision (81.97%), F1-score (82.33%), and a class correlation coefficient of 0.94. Benchmarking against Nif curated resources showed strong agreement and robust classification even under class imbalance. By expanding beyond traditional subsets of nif genes, NifFinder enables more reliable genome-wide identification of Nif proteins.
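
For readers less familiar with the reported metrics, here is a small Python sketch of how accuracy, sensitivity, precision, and F1-score can be computed for a multi-class classifier from a confusion matrix, using macro-averaging over classes. The abstract does not say which averaging scheme was used, so macro-averaging is an assumption, and the function name `macro_metrics` and the confusion matrix are hypothetical.

```python
def macro_metrics(confusion):
    """Accuracy plus macro-averaged sensitivity, precision, and F1.

    confusion[i][j] is the number of class-i items predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(k))
    recalls, precisions, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        actual = sum(confusion[c])                           # row sum
        predicted = sum(confusion[r][c] for r in range(k))   # column sum
        rec = tp / actual if actual else 0.0
        prec = tp / predicted if predicted else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        recalls.append(rec)
        precisions.append(prec)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "sensitivity": sum(recalls) / k,
        "precision": sum(precisions) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical 3-class confusion matrix (not data from the paper).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 4],
]
m = macro_metrics(cm)
```

Note that a macro-averaged F1 is the mean of per-class F1 values, so it need not equal the harmonic mean of the macro-averaged precision and sensitivity.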

Availability and implementation: The NifFinder installation instructions and source code can be accessed at https://sourceforge.net/projects/NifFinder.

Bioinformatics Advances, volume 5, issue 1, article vbaf260. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664700/pdf/
Citations: 0
StarPepWeb: an integrative, graph-based resource for bioactive peptides.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-16 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf261
Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce

Motivation: The rapid growth of bioactive peptide sequences presents challenges for organization and analysis. Existing repositories often specialize in functions, taxonomic origins, or structural classes, but most remain isolated, use heterogeneous metadata, and lack uniform descriptors or structural models. Few integrative web services exist, offering only partial coverage or depth. As a result, reproducible and comprehensive exploration of the bioactive peptide landscape remains limited, underscoring the need for a unified, source-tracked, extensible platform.

Results: We present StarPepWeb, a freely accessible web application that democratizes access to StarPepDB, one of the largest curated repositories of bioactive peptides. The platform integrates 45 120 non-redundant sequences from 40 public databases into a source-tracked graph enriched with metadata, physicochemical features, and predicted 3D structures from ESMFold. Each peptide is represented with ESM-2 embeddings and iFeature descriptors, while the interface supports metadata-aware filtering, alignment-based similarity searches with single and multiple queries, and interactive visualization. A microservice-oriented architecture ensures scalability, maintainability, and reproducible versioned downloads, including Neo4j exports. StarPepWeb thus overcomes deployment and expertise barriers of the standalone database, providing an extensible, cloud-hosted framework for integrative bioactive peptide analysis.

Availability and implementation: StarPepWeb is freely available at https://starpepweb.org. Source code and documentation are hosted at https://github.com/starpep-web.

Bioinformatics Advances, volume 5, issue 1, article vbaf261. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701796/pdf/
Citations: 0
The Hydractinia Genome Project Portal: multi-omic annotation and visualization of Hydractinia genomic datasets.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-15 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf215
R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis

Motivation: The colonial hydroid Hydractinia exhibits several unique biological properties, including its remarkable regenerative capacity and the ability to distinguish self from non-self, characteristics that make them valuable models for studying human disease and aging. The availability of well-annotated multi-omic data, as well as tools to visualize these data, is essential for advancing the use of these model organisms to enhance our understanding of the relationship between genomic and morphological complexity, the evolution of multicellularity, and the emergence of novel cell types.

Results: We present the Hydractinia Genome Project Portal, a comprehensive resource providing genomic, transcriptomic, and proteomic datasets for two widely studied Hydractinia species. The portal provides extensive sequence, structure, and functional annotation resources that are not available elsewhere, including genome browsers, a single-cell gene expression atlas, a protein structure viewer, and a custom BLAST implementation. We demonstrate the portal's utility for biological discovery and have used a subset of Hydractinia-specific stem cell gene markers to explore known gaps in annotation transfer methods, illustrating how structure-based deep learning methods such as DeepFRI can significantly improve the functional annotation of heretofore unannotated i-cell markers.

Availability and implementation: The Hydractinia Genome Project Portal is freely available at https://research.nhgri.nih.gov/hydractinia.

Bioinformatics Advances, volume 5, issue 1, article vbaf215. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12624445/pdf/
Citations: 0
ICCTax: a hierarchical taxonomic classifier for metagenomic sequences on a large language model.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-15 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf257
Yichun Gao, Jiaxing Bai, Feng Zhou, Yushuang He, Ying Wang, Xiaobing Huang

Motivation: Metagenomic data increasingly reflect the coexistence of species from Archaea, Bacteria, Eukaryotes, and Viruses in complex environments. Taxonomic classification across the four superkingdoms is essential for understanding microbial communities, exploring genomic evolutionary relationships, and identifying novel species. This task is inherently imbalanced, uneven, and hierarchical. Genomic sequences provide crucial information for taxonomy classification, but many existing methods relying on sequence similarity to reference genomes often leave sequences misclassified due to incomplete or absent reference databases. Large language models offer a novel approach to extract intrinsic characteristics from sequences.

Results: We present ICCTax, a classifier integrating the large language model HyenaDNA with complementary-view-based hierarchical metric learning and hierarchical-level compactness loss to identify taxonomic genomic sequences. ICCTax accurately classifies sequences to 155 genera and 43 phyla across the four superkingdoms, including unseen taxa. Across three datasets built with different strategies, ICCTax outperforms baseline methods, particularly on Out-of-Distribution data. On Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets, it demonstrates strong performance, showcasing real-world applicability. ICCTax can further support identification of novel species and functional genes across diverse environments, enhancing understanding of microbial ecology.

Availability and implementation: Code is available at https://github.com/Ying-Lab/ICCTax.

{"title":"ICCTax: a hierarchical taxonomic classifier for metagenomic sequences on a large language model.","authors":"Yichun Gao, Jiaxing Bai, Feng Zhou, Yushuang He, Ying Wang, Xiaobing Huang","doi":"10.1093/bioadv/vbaf257","DOIUrl":"10.1093/bioadv/vbaf257","url":null,"abstract":"<p><strong>Motivation: </strong>Metagenomic data increasingly reflect the coexistence of species from Archaea, Bacteria, Eukaryotes, and Viruses in complex environments. Taxonomic classification across the four superkingdoms is essential for understanding microbial communities, exploring genomic evolutionary relationships, and identifying novel species. This task is inherently imbalanced, uneven, and hierarchical. Genomic sequences provide crucial information for taxonomy classification, but many existing methods relying on sequence similarity to reference genomes often leave sequences misclassified due to incomplete or absent reference databases. Large language models offer a novel approach to extract intrinsic characteristics from sequences.</p><p><strong>Results: </strong>We present ICCTax, a classifier integrating the large language model HyenaDNA with complementary-view-based hierarchical metric learning and hierarchical-level compactness loss to identify taxonomic genomic sequences. ICCTax accurately classifies sequences to 155 genera and 43 phyla across the four superkingdoms, including unseen taxa. Across three datasets built with different strategies, ICCTax outperforms baseline methods, particularly on Out-of-Distribution data. On Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets, it demonstrates strong performance, showcasing real-world applicability. 
ICCTax can further support identification of novel species and functional genes across diverse environments, enhancing understanding of microbial ecology.</p><p><strong>Availability and implementation: </strong>Code is available at https://github.com/Ying-Lab/ICCTax.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf257"},"PeriodicalIF":2.8,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619997/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145543942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
In silico analysis of insect-associated bacterial phytases reveals optimal biochemical properties and function in poultry gut condition.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-15 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf256
Olyad Erba Urgessa, Ketema Tafess Tulu, Mesfin Tafesse Gemeda, Hunduma Dinka

Motivation: Insect guts may harbor phytase-producing bacteria applicable in poultry nutrition, but only Serratia sp. TN49 and its histidine acid phytase (AEQ29498.1) have been studied for this purpose. Therefore, AEQ29498.1 was used as a query to conduct a homology search for insect-associated bacterial phytases, followed by prediction of their structure and function. This in silico analysis of phytase may lead to the isolation of native phytase-producing bacteria from insect guts, potentially facilitating the production of desirable phytases for use in feed additives.

Results: Twenty-six phytases from bacteria associated with the guts of black soldier fly larvae, fruit flies, and honey bees were identified. The mature chains of these phytases, except for the 4-phytase of Bartonella apis PEB0150, were predicted to carry a positive charge under the acidic conditions of the poultry upper gastrointestinal tract. They are stable (instability indices <40) and belong to the histidine acid phosphatase family, members of which have proven to be effective poultry feed additives. The three-dimensional structure of the mature histidine-type phosphatase of Tatumella sp. JGM130 demonstrated the best quality and was found to be a homo-tetrameric protein. Molecular docking confirmed phytate binding at the catalytic motifs of the histidine acid phosphatase family, RHGVRPP/AP/Q and HD.
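
As a rough illustration of the charge prediction mentioned in the Results above, the following Python sketch estimates the net charge of a peptide at a given pH using the Henderson-Hasselbalch relation. It is not the authors' pipeline. The pKa values are one textbook-style set chosen here as an assumption, the function name `net_charge` and the toy peptide are hypothetical, and pH 3.0 only stands in for an acidic gut environment.

```python
# Approximate pKa values for ionizable groups. This particular set is an
# assumption for illustration; published scales differ slightly.
PKA_POSITIVE = {"N_TERM": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"C_TERM": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Estimate the net charge of a single linear peptide chain at a given pH."""
    seq = sequence.upper()

    # One free N-terminus and one free C-terminus per chain.
    positive_counts = {"N_TERM": 1}
    positive_counts.update({aa: seq.count(aa) for aa in "KRH"})
    negative_counts = {"C_TERM": 1}
    negative_counts.update({aa: seq.count(aa) for aa in "DECY"})

    charge = 0.0
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    for group, count in positive_counts.items():
        charge += count / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    for group, count in negative_counts.items():
        charge -= count / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


if __name__ == "__main__":
    toy_peptide = "KKDD"  # hypothetical sequence, not one of the 26 phytases
    print(round(net_charge(toy_peptide, 3.0), 2))   # acidic pH
    print(round(net_charge(toy_peptide, 11.0), 2))  # alkaline pH
```

With this simple model the same peptide is net positive at acidic pH and net negative at alkaline pH, which is the kind of pH-dependent behaviour the abstract refers to. Dedicated tools use more refined pKa scales and would give somewhat different numbers.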

Citations: 0
Pathogenicity patterns in cytochrome P450 family.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf231
Anna Špačková, Nina Kadášová, Ivana Hutařová Vařeková, Karel Berka

Motivation: Cytochrome P450 proteins play a crucial role in human metabolism, ranging from hormone production to drug metabolism. Although several common variants have known effects on the performance of individual cytochrome P450 proteins, experimental pathogenicity information is usually limited to only a few mutations. Current pathogenicity prediction software extends this scope by virtually mutating every amino acid to all possible substitutions. In this work, we comprehensively explore pathogenicity patterns in the human cytochrome P450 family. Pathogenicity analysis was conducted across proteins using the SIFT, AlphaMissense, and PrimateAI-3D algorithms.

Results: Our findings indicate a progressive increase in pathogenicity along protein tunnels (identified via MOLE) toward the cofactor binding site, underscoring the essential role of cofactor interactions in enzymatic function. Notably, the integrity of tunnels and cofactor environment emerges as a critical factor, with even single amino acid alterations potentially disrupting molecular guidance to active sites. These insights highlight the fundamental role of structural pathways in preserving cytochrome P450 functionality, with implications for understanding disease-associated variants and drug metabolism.

Availability and implementation: Data and source code can be found at https://github.com/annaspac/P450_pathogenicity_codes.
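
The idea of virtually mutating every amino acid to all possible substitutions, described in the Motivation above, can be sketched in a few lines of Python. This is only an enumeration step under stated assumptions: the function names and the toy sequence are hypothetical, and scoring each variant with SIFT, AlphaMissense, or PrimateAI-3D is outside the scope of the sketch.

```python
from typing import Iterator, Tuple

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, wild_type, mutant) for every single substitution.

    Positions are 1-based, matching the usual variant notation.
    """
    for position, wild_type in enumerate(sequence.upper(), start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield position, wild_type, mutant


def variant_label(position: int, wild_type: str, mutant: str) -> str:
    """Format a substitution in the common short form, e.g. 'M1A'."""
    return f"{wild_type}{position}{mutant}"


if __name__ == "__main__":
    toy_sequence = "MAL"  # hypothetical fragment, not a real P450 sequence
    variants = [variant_label(*v) for v in single_substitutions(toy_sequence)]
    print(len(variants))  # 19 substitutions per position
    print(variants[:3])
```

A sequence of length L made of standard residues yields 19 × L variants, which is why predictors rather than experiments are used to cover the full substitution space.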

Citations: 0
Asymmetric integration of various cancer datasets for identifying risk-associated variants and genes.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf253
Ruixuan Wang, Lam Tran, Benjamin Brennan, Lars G Fritsche, Kevin He, J Chad Brenner, Hui Jiang

Motivation: Cancer genomic research provides an opportunity to identify cancer risk-associated genes, but often suffers from undesirably low statistical power due to limited sample sizes. Integrated analysis across different cancers has the potential to enhance statistical power for identifying pan-cancer risk genes. However, substantial heterogeneity across cancers makes this challenging.

Results: Recently, a novel asymmetric integration method was developed that can deal with data heterogeneity and exclude unhelpful datasets from the analysis. We adapted and applied this method to integrate genotype datasets with matched case and control individuals from the Michigan Genomics Initiative, using each cancer in turn as the primary dataset of interest and the other cancers as auxiliary datasets. Conditional logistic regression models were coupled with the asymmetric integration framework to handle the matched case-control study design, and permutation tests were performed to control false discovery rates (FDRs). At the same FDR level, the integrated analysis found more potential genetic variants and genes that are associated with the risks of various cancers, showcasing the promise of the proposed approach for integrated analysis of cancer datasets.

Availability and implementation: Our method is available as source code at https://github.com/rxxwang/integrate_cancer.
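
To make the permutation-based FDR control mentioned in the Results above more concrete, here is a minimal Python sketch of one common way to estimate an FDR from permuted test statistics. It is not necessarily the procedure used in the paper: the function name `permutation_fdr`, the thresholding rule, and all numbers in the usage example are assumptions made up for illustration.

```python
from typing import Sequence


def permutation_fdr(
    observed: Sequence[float],
    permuted: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Estimate the FDR of calling every statistic >= threshold significant.

    observed: test statistics from the real data, one per variant.
    permuted: one list of statistics per permutation of the case/control
              labels, serving as an empirical null distribution.
    """
    discoveries = sum(1 for stat in observed if stat >= threshold)
    if discoveries == 0:
        return 0.0  # nothing is called, so there are no false discoveries

    # Average number of null statistics passing the threshold per permutation.
    false_counts = [
        sum(1 for stat in null_stats if stat >= threshold)
        for null_stats in permuted
    ]
    expected_false = sum(false_counts) / len(permuted)

    # The ratio can exceed 1 when the null passes more often than the data.
    return min(1.0, expected_false / discoveries)


if __name__ == "__main__":
    # Toy, made-up statistics purely for illustration.
    observed_stats = [5.0, 4.2, 0.3, 0.1]
    permuted_stats = [[0.2, 4.5, 0.1, 0.3], [0.4, 0.2, 0.1, 0.0]]
    print(permutation_fdr(observed_stats, permuted_stats, threshold=4.0))
```

In a matched case-control design the labels would normally be permuted within matched sets so that the null distribution respects the matching. That detail is left out of this sketch.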

Citations: 0