HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad535
Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O'Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna
Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.
Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared to alternative methods, HAPNEST is computationally faster and shows a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through a comparison of seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.
Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
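As a rough illustration of how a phenotype with a target heritability and polygenicity can be simulated from genotypes, the sketch below uses a standard additive genetic model. It is not the HAPNEST algorithm (which is implemented in Julia and C); all names, parameters, and the random genotypes are illustrative placeholders.

```python
# Minimal sketch (not the HAPNEST algorithm): simulate one phenotype with a
# target heritability h2 and polygenicity (fraction of causal variants) from
# a genotype matrix under a standard additive model.
import numpy as np

rng = np.random.default_rng(0)

def simulate_phenotype(genotypes, h2=0.5, polygenicity=0.01, rng=rng):
    """genotypes: (n_individuals, n_variants) array of 0/1/2 allele counts."""
    n, m = genotypes.shape
    n_causal = max(1, int(round(polygenicity * m)))
    causal = rng.choice(m, size=n_causal, replace=False)
    betas = rng.normal(0.0, 1.0, size=n_causal)          # causal effect sizes
    # standardize causal genotypes so each variant contributes comparably
    g = genotypes[:, causal].astype(float)
    g = (g - g.mean(axis=0)) / (g.std(axis=0) + 1e-12)
    genetic = g @ betas
    # scale environmental noise so that var(genetic) / var(total) equals h2
    env_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
    return genetic + rng.normal(0.0, env_sd, size=n)

# toy usage with random genotypes
G = rng.integers(0, 3, size=(500, 2000))
y = simulate_phenotype(G, h2=0.3, polygenicity=0.05)
```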
MSDRP: a deep learning model based on multisource data for predicting drug response
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad514
Haochen Zhao, Xiaoyu Zhang, Qichang Zhao, Yaohang Li, Jianxin Wang
Motivation: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g. drug structure), without considering the relationships between drugs and biological entities (e.g. targets, diseases, and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines.
Results: In this paper, we propose MSDRP, a deep learning framework for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion algorithms, outperforming some state-of-the-art models on all performance measures in all experiments. Results from de novo and independent tests demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationale for representing drugs with feature vectors derived from multisource drug similarity matrices, as well as the interpretability of our model.
Availability and implementation: The code for MSDRP is available at https://github.com/xyzhang-10/MSDRP.
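To illustrate the general idea of deriving drug features from multiple drug-entity association sources, the sketch below builds similarity matrices from binary association tables and combines them. MSDRP itself uses similarity network fusion (an iterative cross-diffusion procedure); here the fusion is replaced by a simple average for brevity, and all matrices are random placeholders.

```python
# Simplified sketch of multisource drug features (not MSDRP's implementation):
# Jaccard similarities from drug-entity association matrices, fused by a
# simple average instead of full similarity network fusion.
import numpy as np

def jaccard_similarity(assoc):
    """assoc: (n_drugs, n_entities) binary drug-entity association matrix."""
    inter = assoc @ assoc.T
    counts = assoc.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float),
                     where=union > 0)

def fuse(similarities):
    """Average several (n_drugs, n_drugs) similarity matrices; a stand-in
    for the iterative cross-diffusion used by similarity network fusion."""
    return np.mean(np.stack(similarities), axis=0)

rng = np.random.default_rng(1)
drug_target = rng.integers(0, 2, size=(50, 300))     # placeholder associations
drug_disease = rng.integers(0, 2, size=(50, 120))
fused = fuse([jaccard_similarity(drug_target), jaccard_similarity(drug_disease)])
drug_features = fused    # row i serves as the feature vector of drug i
```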
metGWAS 1.0: an R workflow for network-driven over-representation analysis between independent metabolomic and meta-genome-wide association studies
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad523
Saifur R Khan, Andreea Obersterescu, Erica P Gunderson, Babak Razani, Michael B Wheeler, Brian J Cox
Motivation: Genome-wide association studies (GWAS) combined with metabolomics provide a quantitative approach to pinpoint metabolic pathways and genes linked to specific diseases; however, such analyses require both genomics and metabolomics datasets from the same individuals/samples. In most cases, this approach is not feasible due to high costs, lack of technical infrastructure, unavailability of samples, and other factors. Therefore, an unmet need exists for a bioinformatics tool that can identify gene loci and associated polymorphic variants underlying the metabolite alterations seen in disease states using standalone metabolomics data.
Results: Here, we developed a bioinformatics tool, metGWAS 1.0, that integrates independent GWAS data from the GWAS database and standalone metabolomics data using a network-based systems biology approach to identify novel disease/trait-specific metabolite-gene associations. The tool was evaluated using standalone metabolomics datasets extracted from two metabolomics-GWAS case studies. Compared to the original studies, it recovered the gene loci observed there and identified novel loci with known single-nucleotide polymorphisms.
Availability and implementation: The metGWAS 1.0 framework is implemented as an R pipeline and is available at https://github.com/saifurbd28/metGWAS-1.0.
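The statistical core of an over-representation analysis of this kind can be illustrated with a hypergeometric test of the overlap between genes mapped from altered metabolites and genes reported near GWAS hits. The sketch below is a generic illustration, not the exact metGWAS 1.0 procedure (which is network-driven and implemented in R); the gene symbols are hypothetical.

```python
# Minimal sketch: hypergeometric over-representation test for the overlap
# between metabolite-linked genes and GWAS-associated genes.
from scipy.stats import hypergeom

def overrepresentation_p(background_genes, metabolite_genes, gwas_genes):
    background = set(background_genes)
    metab = set(metabolite_genes) & background
    gwas = set(gwas_genes) & background
    overlap = metab & gwas
    M, n, N, k = len(background), len(gwas), len(metab), len(overlap)
    # P(X >= k) when drawing N genes from a background of M containing n "successes"
    return hypergeom.sf(k - 1, M, n, N)

# toy usage with hypothetical gene symbols
p = overrepresentation_p(
    background_genes=[f"GENE{i}" for i in range(20000)],
    metabolite_genes=["GENE1", "GENE2", "GENE3", "GENE10"],
    gwas_genes=["GENE2", "GENE3", "GENE500"],
)
```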
CellAnn: a comprehensive, super-fast, and user-friendly single-cell annotation web server
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad521
Pin Lyu, Yijie Zhai, Taibo Li, Jiang Qian
Motivation: Single-cell sequencing technology has become routine in studying many biological problems. A core step in analyzing single-cell data is the assignment of cell clusters to specific cell types. Reference-based methods have been proposed for predicting the cell types of single-cell clusters. However, limited scalability and the lack of preprocessed reference datasets prevent them from being practical and easy to use.
Results: Here, we introduce a reference-based cell annotation web server, CellAnn, which is super-fast and easy to use. CellAnn contains a comprehensive reference database with 204 human and 191 mouse single-cell datasets. These reference datasets cover 32 organs. Furthermore, we developed a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which outperforms existing methods in both accuracy and scalability. Finally, CellAnn is an online tool that integrates all the procedures in cell annotation, including reference searching, transferring cell labels, visualizing results, and harmonizing cell annotation labels. Through the user-friendly interface, users can identify the best annotation by cross-validating with multiple reference datasets. We believe that CellAnn can greatly facilitate single-cell sequencing data analysis.
Availability and implementation: The web server is available at www.cellann.io, and the source code is available at https://github.com/Pinlyu3/CellAnn_shinyapp.
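A generic way to picture cluster-to-cluster label transfer is to correlate the average expression profiles of query clusters with those of reference clusters over shared genes and assign each query cluster the label of its best-matching reference cluster. The sketch below illustrates that idea only; it is not CellAnn's alignment algorithm, and the cluster names and random data are placeholders.

```python
# Generic sketch of cluster-to-cluster label transfer (not CellAnn's method):
# Spearman correlation of cluster-averaged expression profiles over shared genes.
import numpy as np
import pandas as pd

def transfer_labels(query_profiles, reference_profiles):
    """Both inputs: DataFrames of cluster-averaged expression, genes x clusters.
    Returns a dict mapping each query cluster to its best reference cluster."""
    shared = query_profiles.index.intersection(reference_profiles.index)
    q = query_profiles.loc[shared]
    r = reference_profiles.loc[shared]
    assignments = {}
    for qc in q.columns:
        corrs = r.corrwith(q[qc], method="spearman")
        assignments[qc] = corrs.idxmax()
    return assignments

# toy usage with random data and hypothetical labels
genes = [f"g{i}" for i in range(100)]
rng = np.random.default_rng(2)
ref = pd.DataFrame(rng.random((100, 3)), index=genes, columns=["Tcell", "Bcell", "NK"])
qry = pd.DataFrame(rng.random((100, 2)), index=genes, columns=["cluster0", "cluster1"])
print(transfer_labels(qry, ref))
```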
Ionmob: a Python package for prediction of peptide collisional cross-section values
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad486
David Teschner, David Gomez-Zepeda, Arthur Declercq, Mateusz K Łącki, Seymen Avci, Konstantin Bob, Ute Distler, Thomas Michna, Lennart Martens, Stefan Tenzer, Andreas Hildebrandt
Motivation: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments improves coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property that depends on the ion's mass, charge, and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, further expanded by posttranslational modifications of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of various machine-learning techniques; however, workflow engineering was of secondary importance. For the sake of applicability, such a tool should be generic, data-driven, and easily adaptable to individual workflows for experimental design and data processing.
Results: We created ionmob, a Python-based framework for data preparation, model training, and prediction of peptide collisional cross-section values. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21 000 unique phosphorylated peptides and ≈17 000 MHC ligand sequence and charge-state pairs, we expand the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS for increasing confidence in identified peptides through re-scoring and demonstrate that predicted CCS values complement existing predictors for that task.
Availability and implementation: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.
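The kind of feature-based CCS regression such a framework wraps can be sketched with a handful of simple peptide features and a standard regressor. The sketch below is illustrative only; ionmob's own models, features, and API may differ, and the assumed column names (sequence, mass, charge, ccs) are placeholders.

```python
# Minimal sketch of feature-based peptide CCS regression (not ionmob's API).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def featurize(df):
    """Build simple numeric features from a peptide table."""
    return pd.DataFrame({
        "mz": df["mass"] / df["charge"],      # mass-to-charge ratio
        "charge": df["charge"],
        "length": df["sequence"].str.len(),   # peptide length
    })

def train_ccs_model(df):
    """df is assumed to hold columns: sequence, mass, charge, ccs."""
    X, y = featurize(df), df["ccs"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    return model
```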
cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad561
Brian Johnson, Yubo Shuai, Jason Schweinsberg, Kit Curtius
Motivation: While evolutionary approaches to medicine show promise, measuring evolution itself is difficult due to experimental constraints and the dynamic nature of body systems. In cancer evolution, continuous observation of clonal architecture is impossible, and longitudinal samples from multiple timepoints are rare. Increasingly available DNA sequencing datasets at single-cell resolution enable the reconstruction of past evolution using mutational history, allowing for a better understanding of dynamics prior to detectable disease. There is an unmet need for an accurate, fast, and easy-to-use method to quantify clone growth dynamics from these datasets.
Results: We derived methods based on coalescent theory for estimating the net growth rate of clones using either reconstructed phylogenies or the number of shared mutations. We applied and validated our analytical methods for estimating the net growth rate of clones, eliminating the need for complex simulations used in previous methods. When applied to hematopoietic data, we show that our estimates may have broad applications to improve mechanistic understanding and prognostic ability. Compared to clones with a single or unknown driver mutation, clones with multiple drivers have significantly increased growth rates (median 0.94 versus 0.25 per year; P = 1.6 × 10⁻⁶). Further, stratifying patients with a myeloproliferative neoplasm (MPN) by the growth rate of their fittest clone shows that higher growth rates are associated with shorter time to MPN diagnosis (median 13.9 versus 26.4 months; P = 0.0026).
Availability and implementation: We developed a publicly available R package, cloneRate, to implement our methods (Package website: https://bdj34.github.io/cloneRate/). Source code: https://github.com/bdj34/cloneRate/.
P-DOR, an easy-to-use pipeline to reconstruct bacterial outbreaks using genomics
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad571
Gherard Batisti Biffignandi, Greta Bellinzona, Greta Petazzoni, Davide Sassera, Gian Vincenzo Zuccotti, Claudio Bandi, Fausto Baldanti, Francesco Comandatore, Stefano Gaiarsa
Summary: Bacterial healthcare-associated infections (HAIs) are a major threat worldwide that can be counteracted by establishing effective infection control measures, guided by constant surveillance and timely epidemiological investigations. Genomics is crucial in modern epidemiology but lacks standard methods and user-friendly software accessible to users without strong bioinformatics proficiency. To overcome these issues, we developed P-DOR, a novel tool for rapid bacterial outbreak characterization. P-DOR accepts genome assemblies as input, automatically selects a background of publicly available genomes using k-mer distances, and adds it to the analysis dataset before inferring a single-nucleotide polymorphism (SNP)-based phylogeny. Epidemiological clusters are identified by considering the phylogenetic tree topology and SNP distances, and by analyzing the SNP-distance distribution the user can choose an appropriate threshold. Patient metadata can also be provided to obtain a spatio-temporal representation of the outbreak. The entire pipeline is fast and scalable and can also be run on low-end computers.
Availability and implementation: P-DOR is implemented in Python 3 and R and can be installed using conda environments. It is available on GitHub at https://github.com/SteMIDIfactory/P-DOR under the GPL-3.0 license.
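The thresholded SNP-distance clustering step can be pictured as computing pairwise SNP distances between isolates and grouping those within a chosen threshold into connected components (single linkage). The sketch below shows that idea only; it is not the full P-DOR pipeline, and the isolate names, sequences, and threshold are placeholders.

```python
# Sketch of SNP-distance clustering of isolates (not the full P-DOR pipeline).
from itertools import combinations
import networkx as nx

def snp_distance(a, b):
    """Number of differing positions, ignoring gaps and ambiguous bases."""
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")

def epi_clusters(alignment, threshold=20):
    """alignment: dict {isolate_name: aligned core-SNP sequence}.
    Returns connected components of isolates within the SNP threshold."""
    g = nx.Graph()
    g.add_nodes_from(alignment)
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        if snp_distance(s1, s2) <= threshold:
            g.add_edge(n1, n2)
    return list(nx.connected_components(g))

# toy usage
aln = {"isoA": "ACGTACGT", "isoB": "ACGTACGA", "isoC": "TTGTACGA"}
print(epi_clusters(aln, threshold=1))
```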
Cardinality optimization in constraint-based modelling: application to human metabolism
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad450
Ronan M T Fleming, Hulda S Haraldsdottir, Le Hoai Minh, Phan Tu Vuong, Thomas Hankemeier, Ines Thiele
Motivation: Several applications in constraint-based modelling can be mathematically formulated as cardinality optimization problems involving the minimization or maximization of the number of nonzeros in a vector. These problems include testing for stoichiometric consistency, testing for flux consistency, testing for thermodynamic flux consistency, computing sparse solutions to flux balance analysis problems and computing the minimum number of constraints to relax to render an infeasible flux balance analysis problem feasible. Such cardinality optimization problems are computationally complex, with no known polynomial time algorithms capable of returning an exact and globally optimal solution.
Results: By approximating the zero-norm with nonconvex continuous functions, we reformulate a set of cardinality optimization problems in constraint-based modelling as differences of convex functions. We implemented and numerically tested novel algorithms that approximately solve the reformulated problems using a sequence of convex programs. Applying these algorithms to various biochemical networks, we demonstrate that they match or outperform existing related approaches. In particular, we illustrate the efficiency and practical utility of our algorithms for cardinality optimization problems that arise when extracting a model ready for thermodynamic flux balance analysis from a human metabolic reconstruction.
Availability and implementation: Open-source scripts to reproduce the results are available at https://github.com/opencobra/COBRA.papers/2023_cardOpt, with general-purpose functions integrated into the COnstraint-Based Reconstruction and Analysis (COBRA) toolbox: https://github.com/opencobra/cobratoolbox.
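One standard way to see how a zero-norm objective becomes a difference of convex functions is the capped-ℓ1 approximation, sketched below. This is an illustration of the general pattern (approximate the zero-norm, split it into two convex terms, then linearize the concave part at each iterate in a DCA-type sequence of convex programs); the specific approximation functions used in the paper may differ.

```latex
% Capped-l1 approximation of the zero-norm and its difference-of-convex form
\|v\|_0 \;\approx\; \Phi_\varepsilon(v)
  \;=\; \sum_i \min\!\Big(\tfrac{|v_i|}{\varepsilon},\, 1\Big)
  \;=\; \underbrace{\tfrac{1}{\varepsilon}\,\|v\|_1}_{g(v)\ \text{convex}}
  \;-\; \underbrace{\sum_i \max\!\Big(\tfrac{|v_i|}{\varepsilon} - 1,\, 0\Big)}_{h(v)\ \text{convex}}

% DCA-type iteration over the feasible set C (e.g. steady-state flux constraints):
% linearize the concave part -h(v) at the current iterate and solve a convex program
v^{k+1} \in \operatorname*{arg\,min}_{v \in \mathcal{C}} \; g(v) - \langle s^k,\, v\rangle,
\qquad s^k \in \partial h(v^k)
```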
DCAlign v1.0: aligning biological sequences using co-evolution models and informed priors
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad537
Anna Paola Muntoni, Andrea Pagnani
Summary: DCAlign is a recently introduced alignment method able to cope with the conservation and co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. In v1.0, we show how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions.
Availability and implementation: DCAlign v1.0 is implemented in Julia and is available at https://github.com/infernet-h2020/DCAlign.
diseaseGPS: auxiliary diagnostic system for genetic disorders based on genotype and phenotype
Daoyi Huang, Jianping Jiang, Tingting Zhao, Shengnan Wu, Pin Li, Yongfen Lyu, Jincai Feng, Mingyue Wei, Zhixing Zhu, Jianlei Gu, Yongyong Ren, Guangjun Yu, Hui Lu
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad517
Summary: Next-generation sequencing has brought new opportunities for the diagnosis of genetic disorders thanks to its high-throughput capabilities. However, most existing methods are limited to sequencing candidate variants, and linking these variants to a diagnosis of a genetic disorder still requires medical professionals to consult databases. We therefore introduce diseaseGPS, an integrated platform for the diagnosis of genetic disorders that combines phenotype and genotype data for analysis. It offers not only a user-friendly GUI web application for users without a programming background but also scripts that can be executed in batch mode by bioinformatics professionals. Genetic and phenotypic data are integrated using the ACMG-Bayes method and a novel phenotypic similarity method to prioritize candidate genetic disorders. diseaseGPS was evaluated on 6085 cases from the Deciphering Developmental Disorders project and 187 cases from Shanghai Children's Hospital, and the results demonstrate that it performs better than other commonly used methods.
Availability and implementation: diseaseGPS is freely accessible at https://diseasegps.sjtu.edu.cn, with source code available at https://github.com/BioHuangDY/diseaseGPS.
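A generic way to picture phenotype-based prioritization is to score the overlap between a patient's HPO terms and a disease's annotated terms, weighting each term by its information content so that rare, specific phenotypes count more. The sketch below is a simple illustration of that idea, not the paper's novel phenotypic similarity method or its ACMG-Bayes integration; the HPO term frequencies are hypothetical placeholders.

```python
# Generic sketch of information-content-weighted phenotype-set similarity
# (not diseaseGPS's method): score overlap of patient and disease HPO terms.
import math

def information_content(term, term_frequency):
    """IC = -log of how often a term annotates diseases; frequencies are
    assumed precomputed from an annotation corpus (placeholder default)."""
    return -math.log(term_frequency.get(term, 1e-6))

def phenotype_similarity(patient_terms, disease_terms, term_frequency):
    shared = set(patient_terms) & set(disease_terms)
    num = sum(information_content(t, term_frequency) for t in shared)
    den = sum(information_content(t, term_frequency)
              for t in set(patient_terms) | set(disease_terms))
    return num / den if den else 0.0

# toy usage with hypothetical HPO term frequencies
freq = {"HP:0001250": 0.05, "HP:0001263": 0.08, "HP:0000252": 0.02}
score = phenotype_similarity(
    patient_terms=["HP:0001250", "HP:0000252"],
    disease_terms=["HP:0001250", "HP:0001263"],
    term_frequency=freq,
)
```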