PangeBlocks：通过最大块定制构建泛基因组图。

IF 2.9 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS BMC Bioinformatics Pub Date : 2024-11-04 DOI:10.1186/s12859-024-05958-5

Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti

{"title":"PangeBlocks：通过最大块定制构建泛基因组图。","authors":"Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti","doi":"10.1186/s12859-024-05958-5","DOIUrl":null,"url":null,"abstract":"Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"344"},"PeriodicalIF":2.9000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533710/pdf/","citationCount":"0","resultStr":"{\"title\":\"PangeBlocks: customized construction of pangenome graphs via maximal blocks.\",\"authors\":\"Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti\",\"doi\":\"10.1186/s12859-024-05958-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.\",\"PeriodicalId\":8958,\"journal\":{\"name\":\"BMC Bioinformatics\",\"volume\":\"25 1\",\"pages\":\"344\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533710/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Bioinformatics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12859-024-05958-5\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-024-05958-5","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

背景：构建庞基因组图是庞基因组学的一项基本任务。一个自然的理论问题是如何将构建最优庞基因组图的计算问题形式化，明确基本优化标准和可行解决方案集。目前的方法是利用一些启发式方法构建庞基因组图，而不假定一些明确的优化标准。因此，具体的优化标准如何影响图拓扑和下游分析（如读取映射和变异调用）尚不清楚：本文利用多重序列比对（MSA）中最大区块的概念，将泛基因组图构建问题重构为区块上的精确覆盖问题，称为最小加权区块覆盖（MWBC）。然后，我们为 MWBC 问题提出了一种整数线性规划（ILP）公式，使我们能够研究构建图的最自然目标函数。我们提供了求解 MWBC 的 ILP 方法的实现，并在 SARS-CoV-2 完整基因组上对其进行了评估，显示了不同的目标函数如何导致具有不同属性的 pangenome 图，暗示了特定的下游任务可以驱动图构建阶段：我们的研究表明，基于目标函数的庞基因组图的定制化构建会对生成的图产生直接影响。特别是，我们基于寻找覆盖 MSA 的最优块子集对 MWBC 问题进行了形式化，为用户可以指导构建 MSA 图表示的新型实用方法铺平了道路。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

PangeBlocks: customized construction of pangenome graphs via maximal blocks.

Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.

Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.

Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Bioinformatics 生物-生化研究方法

CiteScore

5.70

自引率

3.30%

发文量

506

审稿时长

4.3 months

期刊介绍： BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.