Bioinformatics: Latest Articles, Page 3

Correction to: GIL: a python package for designing custom indexing primers

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-12-01 DOI: 10.1093/bioinformatics/btad735

Citations: 0

SCORPIO: a utility for defining and classifying mutation constellations of virus genomes.

IF 4.4, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-10-03 DOI: 10.1093/bioinformatics/btad575

Rachel Colquhoun, Ben Jackson, Áine O'Toole, Andrew Rambaut

Summary: Scorpio provides a set of command line utilities for classifying, haplotyping, and defining constellations of mutations for an aligned set of genome sequences. It was developed to enable exploration and classification of variants of concern within the SARS-CoV-2 pandemic, but can be applied more generally to other species.
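To make the idea of classifying a sequence against a constellation concrete, here is a minimal Python sketch. It is not Scorpio's implementation or command-line interface: the `Snp` type, the `count_calls` and `classify` functions, the `min_alt` threshold, and the toy positions are all hypothetical. It only illustrates counting, for one aligned sequence, how many defining substitutions carry the alternate allele.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    """One defining substitution of a (hypothetical) constellation."""
    position: int  # 1-based column in the alignment
    ref: str       # reference base
    alt: str       # alternate (mutated) base


def count_calls(sequence: str, constellation: list[Snp]) -> dict[str, int]:
    """Count alt, ref, and other calls at the constellation's positions."""
    counts = {"alt": 0, "ref": 0, "other": 0}
    for snp in constellation:
        base = sequence[snp.position - 1].upper()
        if base == snp.alt:
            counts["alt"] += 1
        elif base == snp.ref:
            counts["ref"] += 1
        else:
            counts["other"] += 1  # ambiguous base, gap, or a third allele
    return counts


def classify(sequence: str, constellation: list[Snp], min_alt: int) -> bool:
    """Assumed rule: a sequence matches if at least min_alt sites are alt."""
    return count_calls(sequence, constellation)["alt"] >= min_alt


# Toy constellation and aligned query (made-up data).
toy = [Snp(3, "A", "G"), Snp(7, "C", "T"), Snp(10, "G", "A")]
query = "ACGTACTNNNC"
is_match = classify(query, toy, min_alt=2)
```

Keeping the "other" count separate from "ref" matters in practice, because an ambiguous base is missing evidence rather than evidence for the reference allele.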

Availability and implementation: Scorpio is an open-source project distributed under the GNU GPL version 3 license. Source code and binaries are available at https://github.com/cov-lineages/scorpio, and binaries are also available from Bioconda. SARS-CoV-2 specific definitions can be installed as a separate dependency from https://github.com/cov-lineages/constellations.

Citations: 0

A machine learning-based quantitative model (LogBB_Pred) to predict the blood-brain barrier permeability (logBB value) of drug compounds.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-10-03 DOI: 10.1093/bioinformatics/btad577

Bilal Shaker, Jingyu Lee, Yunhyeok Lee, Myeong-Sang Yu, Hyang-Mi Lee, Eunee Lee, Hoon-Chul Kang, Kwang-Seok Oh, Hyung Wook Kim, Dokyun Na

Motivation: Efficient assessment of the blood-brain barrier (BBB) penetration ability of a drug compound is one of the major hurdles in central nervous system drug discovery since experimental methods are costly and time-consuming. To advance and elevate the success rate of neurotherapeutic drug discovery, it is essential to develop an accurate computational quantitative model to determine the absolute logBB value (a logarithmic ratio of the concentration of a drug in the brain to its concentration in the blood) of a drug candidate.
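The logBB definition in the paragraph above can be written out directly. The sketch below assumes a base-10 logarithm (the abstract only says "logarithmic ratio") and that both concentrations are in the same units; the function name `log_bb` is illustrative.

```python
import math


def log_bb(conc_brain: float, conc_blood: float) -> float:
    """Return log10 of the brain-to-blood concentration ratio.

    Assumes base 10 and that both concentrations use the same units.
    """
    if conc_brain <= 0 or conc_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_brain / conc_blood)


# A compound ten times more concentrated in blood than in brain.
example = log_bb(conc_brain=1.0, conc_blood=10.0)
```

Positive values indicate a higher concentration in the brain than in the blood, negative values the reverse.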

Results: Here, we developed a quantitative model (LogBB_Pred) capable of predicting the logBB value of a query compound. The model achieved an R² of 0.61 on an independent test dataset and outperformed other publicly available quantitative models. When compared with the available qualitative (classification) models, which only classify whether a compound is BBB-permeable or not, our model achieved the same accuracy (0.85) as the best qualitative model and far outperformed the other qualitative models (accuracies between 0.64 and 0.70). For further evaluation, our model, the other quantitative models, and the qualitative models were evaluated on a real-world central nervous system drug screening library. Our model showed an accuracy of 0.97, while the other models showed accuracies in the range of 0.29-0.83. Consequently, our model can accurately classify BBB-permeable compounds as well as predict the absolute logBB values of drug candidates.
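The two kinds of metrics reported above, a regression score and a classification accuracy, can both be derived from the same predicted logBB values. The following sketch assumes R² means the coefficient of determination and that a compound counts as BBB-permeable when logBB is at or above a cutoff; the default cutoff of -1.0 and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_at_cutoff(y_true: list[float], y_pred: list[float],
                       cutoff: float = -1.0) -> float:
    """Fraction of compounds whose permeable/non-permeable label agrees.

    The cutoff is an assumed threshold, not one stated in the abstract.
    """
    hits = sum((t >= cutoff) == (p >= cutoff) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up measured and predicted logBB values for four compounds.
measured = [0.5, -0.2, -1.4, -1.8]
predicted = [0.3, -0.6, -0.9, -1.6]
r2 = r_squared(measured, predicted)
acc = accuracy_at_cutoff(measured, predicted)
```

This is why a quantitative model can be compared with qualitative ones: thresholding its output yields a class label, whereas a classifier's label cannot be turned back into a logBB value.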

Availability and implementation: A web server is freely available at http://ssbio.cau.ac.kr/software/logbb_pred/. The data used in this study are available to download at http://ssbio.cau.ac.kr/software/logbb_pred/dataset.zip.

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10560102/pdf/

Citations: 1

FrameD: framework for DNA-based data storage design, verification, and validation.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-10-03 DOI: 10.1093/bioinformatics/btad572

Kevin D Volkel, Kevin N Lin, Paul W Hook, Winston Timp, Albert J Keung, James M Tuck

Motivation: DNA-based data storage is a quickly growing field that hopes to harness the massive theoretical information density of DNA molecules to produce a competitive next-generation storage medium suitable for archival data. In recent years, many DNA-based storage system designs have been proposed. Given that no common infrastructure exists for simulating these storage systems, comparing many different designs along with many different error models is increasingly difficult. To address this challenge, we introduce FrameD, a simulation infrastructure for DNA storage systems that leverages the underlying modularity of DNA storage system designs to provide a framework to express different designs while being able to reuse common components.

Results: We demonstrate the utility of FrameD and the need for a common simulation platform using a case study. Our case study compares designs that utilize strand copies differently, some that align strand copies using multiple sequence alignment algorithms and others that do not. We found that the choice to include multiple sequence alignment in the pipeline is dependent on the error rate and the type of errors being injected and is not always beneficial. In addition to supporting a wide range of designs, FrameD provides the user with transparent parallelism to deal with a large number of reads from sequencing and the need for many fault injection iterations. We believe that FrameD fills a void in the tools publicly available to the DNA storage community by providing a modular and extensible framework with support for massive parallelism. As a result, it will help accelerate the design process of future DNA-based storage systems.
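As a rough illustration of the kind of experiment the case study describes, the sketch below injects random substitution errors into copies of a strand and recovers a consensus by per-position majority vote, without any alignment. It is not FrameD code: the function names, the substitution-only error model, and the parameters are assumptions chosen to keep the example small.

```python
import random
from collections import Counter

BASES = "ACGT"


def inject_substitutions(strand: str, rate: float, rng: random.Random) -> str:
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in strand:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def majority_consensus(copies: list[str]) -> str:
    """Per-position majority vote; assumes all copies have equal length."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*copies))


rng = random.Random(0)
original = "ACGTACGTACGTACGT"
noisy_copies = [inject_substitutions(original, 0.05, rng) for _ in range(9)]
recovered = majority_consensus(noisy_copies)
```

A position-wise vote only works while copies stay in register. Insertions and deletions shift the columns, which is the situation where an alignment step could help, in line with the abstract's point that its benefit depends on the error rate and error type.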

Availability and implementation: The source code for FrameD along with the data generated during the demonstration of FrameD is available in a public GitHub repository at https://github.com/dna-storage/framed (https://dx.doi.org/10.5281/zenodo.7757762).

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10563143/pdf/

Citations: 0

Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach.

IF 4.4, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-05 DOI: 10.1093/bioinformatics/btad542

Yan Hu, Vipina K Keloth, Kalpana Raja, Yong Chen, Hua Xu

Motivation: Automated extraction of participants, intervention, comparison/control, and outcome (PICO) from the randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) for PICO extraction. However, the performance is not optimal due to the complexity of PICO information in RCT abstracts and the challenges involved in their annotation.

Results: We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in abstracts were categorized into four sections, namely background, methods, results, and conclusions. Next, the NER model was applied to extract the PICO elements from the sentences within the title and methods sections, which include >96% of PICO information. We evaluated our proposed NLP pipeline on three datasets: the EBM-NLPmod dataset, a randomly selected and reannotated dataset of 500 RCT abstracts from the EBM-NLP corpus; a dataset of 150 COVID-19 RCT abstracts; and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. The end-to-end evaluation reveals that our proposed approach achieved an overall micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset when measured at the token level, and an overall micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset when measured at the entity level.
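For readers unfamiliar with the reported metric, here is a small sketch of a token-level micro F1 computation. It pools true positives, false positives, and false negatives over all entity labels before computing precision and recall. The label names (`PAR`, `INT`, `OUT`, and `O` for tokens outside any entity) and the exact counting convention are assumptions; the paper's evaluation script may differ in detail.

```python
def micro_f1(gold: list[str], pred: list[str], outside: str = "O") -> float:
    """Token-level micro F1 over all non-outside labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1            # correct entity token
        else:
            if p != outside:
                fp += 1        # predicted an entity label that is wrong
            if g != outside:
                fn += 1        # missed (or mislabeled) a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up six-token example.
gold_labels = ["PAR", "PAR", "O", "INT", "O", "O"]
pred_labels = ["PAR", "O", "O", "INT", "INT", "O"]
score = micro_f1(gold_labels, pred_labels)
```

Entity-level scoring is stricter because a multi-token span counts as correct only when the whole span matches, which is consistent with the lower entity-level numbers reported above.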

Availability: Our code and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO.

Supplementary information: Supplementary data are available at Bioinformatics online.

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500081/pdf/

Citations: 0

ePlatypus: an ecosystem for computational analysis of immunogenomics data.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad553

Tudor-Stefan Cotet, Andreas Agrafiotis, Victor Kreiner, Raphael Kuhn, Danielle Shlesinger, Marcos Manero-Carranza, Keywan Khodaverdi, Evgenios Kladis, Aurora Desideri Perea, Dylan Maassen-Veeters, Wiona Glänzer, Solène Massery, Lorenzo Guerci, Kai-Lin Hong, Jiami Han, Kostas Stiklioraitis, Vittoria Martinolli D'Arcy, Raphael Dizerens, Samuel Kilchenmann, Lucas Stalder, Leon Nissen, Basil Vogelsanger, Stine Anzböck, Daria Laslo, Sophie Bakker, Melinda Kondorosy, Marco Venerito, Alejandro Sanz García, Isabelle Feller, Annette Oxenius, Sai T Reddy, Alexander Yermanos

Motivation: The maturation of systems immunology methodologies requires novel and transparent computational frameworks capable of integrating diverse data modalities in a reproducible manner.

Results: Here, we present the ePlatypus computational immunology ecosystem for immunogenomics data analysis, with a focus on adaptive immune repertoires and single-cell sequencing. ePlatypus is an open-source web-based platform and provides programming tutorials and an integrative database that helps elucidate signatures of B and T cell clonal selection. Furthermore, the ecosystem links novel and established bioinformatics pipelines relevant for single-cell immune repertoires and other aspects of computational immunology such as predicting ligand-receptor interactions, structural modeling, simulations, machine learning, graph theory, pseudotime, spatial transcriptomics, and phylogenetics. The ePlatypus ecosystem helps extract deeper insight into computational immunology and immunogenomics and promotes open science.

Availability and implementation: Platypus code used in this manuscript can be found at github.com/alexyermanos/Platypus.

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10518073/pdf/

Citations: 0

A general framework for powerful confounder adjustment in omics association studies.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad563

Asmita Roy, Jun Chen, Xianyang Zhang

Motivation: Genomic data are subject to various sources of confounding, such as demographic variables, biological heterogeneity, and batch effects. To identify genomic features associated with a variable of interest in the presence of confounders, the traditional approach involves fitting a confounder-adjusted regression model to each genomic feature, followed by multiplicity correction.

Results: This study shows that the traditional approach is suboptimal and proposes a new two-dimensional false discovery rate control framework (2DFDR+) that provides significant power improvement over the conventional method and applies to a wide range of settings. 2DFDR+ uses marginal independence test statistics as auxiliary information to filter out less promising features, and FDR control is performed based on conditional independence test statistics in the remaining features. 2DFDR+ provides (asymptotically) valid inference from samples in settings where the conditional distribution of the genomic variables given the covariate of interest and the confounders is arbitrary and completely unknown. Promising finite sample performance is demonstrated via extensive simulations and real data applications.
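The filter-then-test idea described above can be illustrated with a deliberately simplified sketch. It screens features by the size of a marginal statistic and then applies the standard Benjamini-Hochberg step-up procedure to the conditional P-values of the surviving features. This is not the 2DFDR+ procedure and makes no claim about FDR validity: the fixed screening `cutoff`, the function names, and the toy numbers are assumptions for illustration only.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[int]:
    """Return sorted indices rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # largest rank whose P-value passes its threshold
    return sorted(order[:k])


def filter_then_bh(marginal_stats: list[float], conditional_p: list[float],
                   cutoff: float, alpha: float = 0.05) -> list[int]:
    """Keep features with |marginal statistic| >= cutoff, then run BH."""
    kept = [i for i, t in enumerate(marginal_stats) if abs(t) >= cutoff]
    rejected = benjamini_hochberg([conditional_p[i] for i in kept], alpha)
    return [kept[j] for j in rejected]


# Made-up statistics for four features.
marginal = [3.0, 0.2, 2.5, 0.1]
conditional = [0.001, 0.04, 0.03, 0.5]
plain = benjamini_hochberg(conditional)
screened = filter_then_bh(marginal, conditional, cutoff=2.0)
```

In this toy example the screened analysis rejects more features than plain BH because fewer hypotheses share the multiplicity burden, which is the intuition behind the power gain; handling the dependence between the screening and testing steps is what a valid procedure has to address.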

Availability and implementation: R code and vignettes are available at https://github.com/asmita112358/tdfdr.np.

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10539716/pdf/

Citations: 0

ActivePPI: quantifying protein-protein interaction network activity with Markov random fields.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad567

Chuanyuan Wang, Shiyu Xu, Duanchen Sun, Zhi-Ping Liu

Motivation: Protein-protein interactions (PPI) are crucial components of the biomolecular networks that enable cells to function. Biological experiments have identified a large number of PPI, and these interactions are stored in knowledge bases. However, these interactions are often restricted to specific cellular environments and conditions. Network activity can be characterized as the extent of agreement between a PPI network (PPIN) and a distinct cellular environment measured by protein mass spectrometry, and it can also be quantified as a statistical significance score. Without knowing the activity of these PPI in the cellular environments or specific phenotypes, it is impossible to reveal how these PPI perform and affect cellular functioning.

Results: To calculate the activity of PPIN in different cellular conditions, we proposed a PPIN activity evaluation framework named ActivePPI to measure the consistency between network architecture and protein measurement data. ActivePPI estimates the probability density of protein mass spectrometry abundance and models PPIN using a Markov-random-field-based method. Furthermore, an empirical P-value is derived based on a nonparametric permutation test to quantify the significance of the match between PPIN structure and protein abundance data. Extensive numerical experiments demonstrate the superior performance of ActivePPI in network activity evaluation, pathway activity assessment, and optimal network architecture tuning tasks. In summary, ActivePPI is a versatile tool for evaluating PPI networks that can uncover the functional significance of protein interactions in crucial cellular biological processes and offer further insights into physiological phenomena.
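The permutation-based empirical P-value mentioned above follows a generic recipe that can be sketched independently of the Markov random field model. In the sketch below, the score function `edge_agreement` is a hypothetical stand-in (a mean product of centered abundances across edges), not ActivePPI's likelihood; only the permutation logic, shuffling abundances across proteins and counting how often the permuted score reaches the observed one, is the point.

```python
import random


def edge_agreement(abundance: dict[str, float],
                   edges: list[tuple[str, str]]) -> float:
    """Hypothetical score: mean product of centered abundances over edges."""
    mean = sum(abundance.values()) / len(abundance)
    total = sum((abundance[u] - mean) * (abundance[v] - mean) for u, v in edges)
    return total / len(edges)


def empirical_p_value(abundance: dict[str, float],
                      edges: list[tuple[str, str]],
                      n_perm: int = 999, seed: int = 0) -> float:
    """Permutation P-value with the usual +1 correction."""
    rng = random.Random(seed)
    observed = edge_agreement(abundance, edges)
    proteins = list(abundance)
    values = [abundance[p] for p in proteins]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # break the link between proteins and abundances
        permuted = dict(zip(proteins, values))
        if edge_agreement(permuted, edges) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up network: two pairs of co-abundant proteins.
toy_abundance = {"A": 5.0, "B": 4.8, "C": 1.0, "D": 1.2, "E": 3.0}
toy_edges = [("A", "B"), ("C", "D")]
p_value = empirical_p_value(toy_abundance, toy_edges)
```

Adding one to both numerator and denominator keeps the estimate strictly positive, so a finite number of permutations never reports a P-value of exactly zero.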

Availability and implementation: All source code and data are freely available at https://github.com/zpliulab/ActivePPI.

PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516639/pdf/

Citations: 0

aenmd: annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad556

Jonathan Klonowski, Qianqian Liang, Zeynep Coban-Akdemir, Cecilia Lo, Dennis Kostka

Summary: DNA changes that cause premature termination codons (PTCs) represent a large fraction of clinically relevant pathogenic genomic variation. Typically, PTCs induce transcript degradation by nonsense-mediated mRNA decay (NMD) and render such changes loss-of-function alleles. However, certain PTC-containing transcripts escape NMD and can exert dominant-negative or gain-of-function (DN/GOF) effects. Therefore, systematic identification of human PTC-causing variants and their susceptibility to NMD contributes to the investigation of the role of DN/GOF alleles in human disease. Here we present aenmd, software for annotating PTC-containing transcript-variant pairs for predicted escape from NMD. aenmd is user-friendly and self-contained. It offers functionality not currently available in other methods and is based on established and experimentally validated rules for NMD escape; the software is designed to work at scale and to integrate seamlessly with existing analysis workflows. We applied aenmd to variants in the gnomAD, ClinVar, and GWAS Catalog databases and report the prevalence of human PTC-causing variants in these databases, and the subset of these variants that could exert DN/GOF effects via NMD escape.

Availability and implementation: aenmd is implemented in the R programming language. Code is available on GitHub as an R-package (github.com/kostkalab/aenmd.git), and as a containerized command-line interface (github.com/kostkalab/aenmd_cli.git).
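The abstract refers to "established and experimentally validated rules for NMD escape" without listing them. As an illustration of what rule-based annotation of this kind can look like, here is a small Python sketch. It is not aenmd's code or API (aenmd is an R package); the function name `nmd_escape_rules`, the simplified coordinate model (0-based PTC position within a list of coding-exon lengths), and the default thresholds (50, 150, and 407 nucleotides, values commonly cited in the NMD literature) are all assumptions made for this example and may differ from what aenmd applies.

```python
from itertools import accumulate


def nmd_escape_rules(exon_lengths, ptc_pos,
                     last_junction_window=50,  # assumed threshold
                     start_window=150,         # assumed threshold
                     long_exon=407):           # assumed threshold
    """Return the names of the (assumed) escape rules that a PTC satisfies.

    exon_lengths: lengths of the coding exons, 5' to 3' (hypothetical model).
    ptc_pos: 0-based offset of the first base of the PTC in the joined exons.
    An empty list means no rule fired, i.e. NMD would be predicted.
    """
    if not exon_lengths or not 0 <= ptc_pos < sum(exon_lengths):
        raise ValueError("ptc_pos must fall inside the coding sequence")

    ends = list(accumulate(exon_lengths))  # exclusive end offset of each exon
    idx = next(i for i, end in enumerate(ends) if ptc_pos < end)
    last = len(exon_lengths) - 1

    rules = []
    if idx == last:
        # Also covers single-exon transcripts.
        rules.append("last_exon")
    if idx == last - 1 and ends[idx] - ptc_pos <= last_junction_window:
        rules.append("near_last_junction")
    if ptc_pos < start_window:
        rules.append("near_start")
    if exon_lengths[idx] > long_exon:
        rules.append("long_exon")
    return rules


if __name__ == "__main__":
    exons = [200, 300, 120, 90]
    for pos in (30, 300, 520, 600, 650):
        print(pos, nmd_escape_rules(exons, pos))
```

Returning the list of matching rules, instead of a single boolean, keeps the reason for a predicted escape visible, which is useful when variants are later filtered or reviewed by rule.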



Citations: 0

Tracking and curating putative SARS-CoV-2 recombinants with RIVET.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad538

Kyle Smith, Cheng Ye, Yatish Turakhia

Motivation: Identifying and tracking recombinant strains of SARS-CoV-2 is critical to understanding the evolution of the virus and controlling its spread. But confidently identifying SARS-CoV-2 recombinants from thousands of new genome sequences that are being shared online every day is quite challenging, causing many recombinants to be missed or suffer from weeks of delay in being formally identified while undergoing expert curation.

Results: We present RIVET, a software pipeline and visual platform that takes advantage of recent algorithmic advances in recombination inference to comprehensively and sensitively search for potential SARS-CoV-2 recombinants and organize the relevant information in a web interface that would help greatly accelerate the process of identifying and tracking recombinants.

Availability and implementation: A RIVET-based web interface displaying the most up-to-date analysis of potential SARS-CoV-2 recombinants is available at https://rivet.ucsd.edu/. RIVET's frontend and backend code is freely available under the MIT license at https://github.com/TurakhiaLab/rivet and the documentation for RIVET is available at https://turakhialab.github.io/rivet/. The inputs necessary for running RIVET's backend workflow for SARS-CoV-2 are available through a public database maintained and updated daily by UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/).



Citations: 0