Pub Date: 2023-12-01 DOI: 10.1093/bioinformatics/btad735
Title: Correction to: GIL: a python package for designing custom indexing primers. Journal: Bioinformatics.
Pub Date: 2023-10-03 DOI: 10.1093/bioinformatics/btad575
Rachel Colquhoun, Ben Jackson, Áine O'Toole, Andrew Rambaut
Summary: Scorpio provides a set of command-line utilities for classifying, haplotyping, and defining constellations of mutations for an aligned set of genome sequences. It was developed to enable exploration and classification of variants of concern during the SARS-CoV-2 pandemic, but can be applied more generally to other species.
Availability and implementation: Scorpio is an open-source project distributed under the GNU GPL version 3 license. Source code and binaries are available at https://github.com/cov-lineages/scorpio, and binaries are also available from Bioconda. SARS-CoV-2 specific definitions can be installed as a separate dependency from https://github.com/cov-lineages/constellations.
Title: SCORPIO: a utility for defining and classifying mutation constellations of virus genomes. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10563142/pdf/
Motivation: Efficient assessment of the blood-brain barrier (BBB) penetration ability of a drug compound is one of the major hurdles in central nervous system drug discovery, since experimental methods are costly and time-consuming. To improve the success rate of neurotherapeutic drug discovery, it is essential to develop an accurate computational quantitative model to determine the absolute logBB value (the logarithm of the ratio of a drug's concentration in the brain to its concentration in the blood) of a drug candidate.
Results: Here, we developed a quantitative model (LogBB_Pred) capable of predicting the logBB value of a query compound. The model achieved an R² of 0.61 on an independent test dataset and outperformed other publicly available quantitative models. When compared with the available qualitative (classification) models, which only classify whether a compound is BBB-permeable or not, our model matched the accuracy (0.85) of the best qualitative model and far outperformed the other qualitative models (accuracies between 0.64 and 0.70). For further evaluation, our model, other quantitative models, and the qualitative models were evaluated on a real-world central nervous system drug screening library. Our model showed an accuracy of 0.97, while the other models showed accuracies in the range of 0.29-0.83. Consequently, our model can accurately classify BBB-permeable compounds as well as predict the absolute logBB values of drug candidates.
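The logBB quantity defined above, and the way a quantitative prediction can double as a permeable/non-permeable call, can be illustrated with a minimal Python sketch. It is not the LogBB_Pred model: the base-10 logarithm and the cutoff of -1.0 are assumptions (the abstract states neither), and the function names are hypothetical.

```python
import math


def log_bb(brain_conc: float, blood_conc: float) -> float:
    """Return log10(brain / blood) for two concentrations in the same units.

    Base 10 is an assumption; the abstract only says "logarithmic ratio".
    """
    if brain_conc <= 0 or blood_conc <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(brain_conc / blood_conc)


def is_bbb_permeable(logbb: float, threshold: float = -1.0) -> bool:
    """Turn a logBB value into a binary label.

    The default threshold is a hypothetical cutoff chosen for illustration;
    the abstract does not say which cutoff, if any, the authors used.
    """
    return logbb >= threshold


if __name__ == "__main__":
    value = log_bb(brain_conc=2.0, blood_conc=4.0)
    print(round(value, 3), is_bbb_permeable(value))
```

Thresholding a regression output this way is what lets a quantitative model be scored with the same accuracy metric as a classifier.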
Availability and implementation: The web server is freely available at http://ssbio.cau.ac.kr/software/logbb_pred/. The data used in this study can be downloaded from http://ssbio.cau.ac.kr/software/logbb_pred/dataset.zip.
Title: A machine learning-based quantitative model (LogBB_Pred) to predict the blood-brain barrier permeability (logBB value) of drug compounds. Authors: Bilal Shaker, Jingyu Lee, Yunhyeok Lee, Myeong-Sang Yu, Hyang-Mi Lee, Eunee Lee, Hoon-Chul Kang, Kwang-Seok Oh, Hyung Wook Kim, Dokyun Na. DOI: 10.1093/bioinformatics/btad577. Published: 2023-10-03. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10560102/pdf/
Pub Date: 2023-10-03 DOI: 10.1093/bioinformatics/btad572
Kevin D Volkel, Kevin N Lin, Paul W Hook, Winston Timp, Albert J Keung, James M Tuck
Motivation: DNA-based data storage is a quickly growing field that hopes to harness the massive theoretical information density of DNA molecules to produce a competitive next-generation storage medium suitable for archival data. In recent years, many DNA-based storage system designs have been proposed. Given that no common infrastructure exists for simulating these storage systems, comparing many different designs along with many different error models is increasingly difficult. To address this challenge, we introduce FrameD, a simulation infrastructure for DNA storage systems that leverages the underlying modularity of DNA storage system designs to provide a framework to express different designs while being able to reuse common components.
Results: We demonstrate the utility of FrameD and the need for a common simulation platform using a case study. Our case study compares designs that utilize strand copies differently, some that align strand copies using multiple sequence alignment algorithms and others that do not. We found that the choice to include multiple sequence alignment in the pipeline is dependent on the error rate and the type of errors being injected and is not always beneficial. In addition to supporting a wide range of designs, FrameD provides the user with transparent parallelism to deal with a large number of reads from sequencing and the need for many fault injection iterations. We believe that FrameD fills a void in the tools publicly available to the DNA storage community by providing a modular and extensible framework with support for massive parallelism. As a result, it will help accelerate the design process of future DNA-based storage systems.
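The fault injection mentioned above can be pictured with a small Python sketch that corrupts a DNA strand with substitutions, insertions, and deletions. This is not FrameD's API: the function name, the independent per-base error model, and the rate parameters are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def inject_errors(strand: str, sub_rate: float, ins_rate: float,
                  del_rate: float, rng: random.Random) -> str:
    """Return a noisy copy of `strand` (hypothetical error model).

    Each base independently suffers at most one error: deletion,
    substitution, or an insertion of a random base after it.
    """
    if sub_rate + ins_rate + del_rate > 1.0:
        raise ValueError("error rates must sum to at most 1")
    out = []
    for base in strand:
        r = rng.random()
        if r < del_rate:
            continue  # base is dropped
        if r < del_rate + sub_rate:
            # replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < del_rate + sub_rate + ins_rate:
            out.append(base)
            out.append(rng.choice(BASES))  # spurious extra base
        else:
            out.append(base)  # base survives unchanged
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    copies = [inject_errors("ACGTACGTAC", 0.05, 0.02, 0.02, rng) for _ in range(3)]
    print(copies)
```

Because insertions and deletions change strand length, noisy copies no longer line up position by position, which is why aligning copies before decoding can matter at some error rates and not at others.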
Availability and implementation: The source code for FrameD, along with the data generated during the demonstration of FrameD, is available in a public GitHub repository at https://github.com/dna-storage/framed (https://dx.doi.org/10.5281/zenodo.7757762).
Title: FrameD: framework for DNA-based data storage design, verification, and validation. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10563143/pdf/
Pub Date: 2023-09-05 DOI: 10.1093/bioinformatics/btad542
Yan Hu, Vipina K Keloth, Kalpana Raja, Yong Chen, Hua Xu
Motivation: Automated extraction of participants, intervention, comparison/control, and outcome (PICO) from randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) for PICO extraction. However, the performance is not optimal due to the complexity of PICO information in RCT abstracts and the challenges involved in their annotation.
Results: We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in abstracts were categorized into four sections, namely background, methods, results, and conclusions. Next, the NER model was applied to extract the PICO elements from the sentences within the title and methods sections, which include >96% of PICO information. We evaluated our proposed NLP pipeline on three datasets: the EBM-NLPmod dataset (a randomly selected and reannotated set of 500 RCT abstracts from the EBM-NLP corpus), a dataset of 150 COVID-19 RCT abstracts, and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. The end-to-end evaluation reveals that our proposed approach achieved an overall micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset when measured at the token level, and an overall micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset when measured at the entity level.
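For readers unfamiliar with the token-level metric reported above, the following Python sketch shows one common way to compute a micro-averaged F1 over token labels. It is an illustration under stated assumptions, not the authors' evaluation script: the label names ("P", "I", "O_", and "O" for non-entity tokens) and the choice to exclude the non-entity label are hypothetical.

```python
def micro_f1(gold, pred, ignore="O"):
    """Micro-averaged F1 over token labels, pooling all entity classes.

    `ignore` is the non-entity label (an assumed convention); tokens where
    both gold and prediction carry it contribute nothing to the counts.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore and p == g:
            tp += 1  # correct entity label
        else:
            if p != ignore:
                fp += 1  # predicted an entity label that is wrong
            if g != ignore:
                fn += 1  # missed (or mislabeled) a gold entity token
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    gold = ["P", "P", "O", "I", "O", "O"]
    pred = ["P", "O", "O", "I", "I", "O"]
    print(round(micro_f1(gold, pred), 3))
```

Entity-level scoring is stricter because a multi-token span counts as correct only if the whole span matches, which is consistent with the lower entity-level figures reported.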
Availability: Our code and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500081/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad553
Tudor-Stefan Cotet, Andreas Agrafiotis, Victor Kreiner, Raphael Kuhn, Danielle Shlesinger, Marcos Manero-Carranza, Keywan Khodaverdi, Evgenios Kladis, Aurora Desideri Perea, Dylan Maassen-Veeters, Wiona Glänzer, Solène Massery, Lorenzo Guerci, Kai-Lin Hong, Jiami Han, Kostas Stiklioraitis, Vittoria Martinolli D'Arcy, Raphael Dizerens, Samuel Kilchenmann, Lucas Stalder, Leon Nissen, Basil Vogelsanger, Stine Anzböck, Daria Laslo, Sophie Bakker, Melinda Kondorosy, Marco Venerito, Alejandro Sanz García, Isabelle Feller, Annette Oxenius, Sai T Reddy, Alexander Yermanos
Motivation: The maturation of systems immunology methodologies requires novel and transparent computational frameworks capable of integrating diverse data modalities in a reproducible manner.
Results: Here, we present the ePlatypus computational immunology ecosystem for immunogenomics data analysis, with a focus on adaptive immune repertoires and single-cell sequencing. ePlatypus is an open-source web-based platform and provides programming tutorials and an integrative database that helps elucidate signatures of B and T cell clonal selection. Furthermore, the ecosystem links novel and established bioinformatics pipelines relevant for single-cell immune repertoires and other aspects of computational immunology such as predicting ligand-receptor interactions, structural modeling, simulations, machine learning, graph theory, pseudotime, spatial transcriptomics, and phylogenetics. The ePlatypus ecosystem helps extract deeper insight in computational immunology and immunogenomics and promotes open science.
Availability and implementation: Platypus code used in this manuscript can be found at github.com/alexyermanos/Platypus.
Title: ePlatypus: an ecosystem for computational analysis of immunogenomics data. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10518073/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad563
Asmita Roy, Jun Chen, Xianyang Zhang
Motivation: Genomic data are subject to various sources of confounding, such as demographic variables, biological heterogeneity, and batch effects. To identify genomic features associated with a variable of interest in the presence of confounders, the traditional approach involves fitting a confounder-adjusted regression model to each genomic feature, followed by multiplicity correction.
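The multiplicity-correction step of the traditional approach can be sketched in Python as follows. The abstract does not name a specific procedure, so this sketch assumes the Benjamini-Hochberg step-up rule, a widely used FDR-controlling method; it illustrates the baseline only, not 2DFDR+, and it takes the per-feature p-values from the confounder-adjusted regressions as given.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumed baseline procedure (Benjamini-Hochberg step-up): find the
    largest rank k with p_(k) <= k * alpha / m and reject the k smallest
    p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank  # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, one per genomic feature.
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9]))
```

Every feature enters this correction on equal footing, which is the inefficiency a filtering step based on auxiliary statistics aims to reduce.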
Results: This study shows that the traditional approach is suboptimal and proposes a new two-dimensional false discovery rate control framework (2DFDR+) that provides significant power improvement over the conventional method and applies to a wide range of settings. 2DFDR+ uses marginal independence test statistics as auxiliary information to filter out less promising features, and FDR control is performed based on conditional independence test statistics in the remaining features. 2DFDR+ provides (asymptotically) valid inference from samples in settings where the conditional distribution of the genomic variables given the covariate of interest and the confounders is arbitrary and completely unknown. Promising finite sample performance is demonstrated via extensive simulations and real data applications.
Availability and implementation: R code and vignettes are available at https://github.com/asmita112358/tdfdr.np.
Title: A general framework for powerful confounder adjustment in omics association studies. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10539716/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad567
Chuanyuan Wang, Shiyu Xu, Duanchen Sun, Zhi-Ping Liu
Motivation: Protein-protein interactions (PPI) are crucial components of the biomolecular networks that enable cells to function. Biological experiments have identified a large number of PPI, and these interactions are stored in knowledge bases. However, these interactions are often restricted to specific cellular environments and conditions. Network activity can be characterized as the extent of agreement between a PPI network (PPIN) and a distinct cellular environment measured by protein mass spectrometry, and it can also be quantified as a statistical significance score. Without knowing the activity of these PPI in the cellular environments or specific phenotypes, it is impossible to reveal how these PPI perform and affect cellular functioning.
Results: To calculate the activity of PPIN in different cellular conditions, we propose a PPIN activity evaluation framework named ActivePPI to measure the consistency between network architecture and protein measurement data. ActivePPI estimates the probability density of protein mass spectrometry abundance and models PPIN using a Markov-random-field-based method. Furthermore, an empirical P-value is derived based on a nonparametric permutation test to quantify the significance of the match between PPIN structure and protein abundance data. Extensive numerical experiments demonstrate the superior performance of ActivePPI in network activity evaluation, pathway activity assessment, and optimal network architecture tuning tasks. In summary, ActivePPI is a versatile tool for evaluating PPI networks that can uncover the functional significance of protein interactions in crucial cellular biological processes and offer further insights into physiological phenomena.
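The empirical P-value from a nonparametric permutation test can be illustrated with a generic Python sketch. It is not ActivePPI's implementation: the scoring function is left as a caller-supplied `statistic`, shuffling the labels stands in for whatever ActivePPI actually permutes, and the add-one correction is a common convention assumed here rather than taken from the abstract.

```python
import random


def empirical_p_value(observed, permuted_scores):
    """Share of permuted scores at least as large as the observed score.

    The +1 in numerator and denominator (an assumed convention) keeps the
    estimate strictly positive.
    """
    exceed = sum(1 for s in permuted_scores if s >= observed)
    return (exceed + 1) / (len(permuted_scores) + 1)


def permutation_test(values, labels, statistic, n_perm=1000, seed=0):
    """Shuffle `labels` to build a null distribution for `statistic`.

    `statistic(values, labels)` is any hypothetical score where larger
    means a better match between the data and the labeling.
    """
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_scores.append(statistic(values, shuffled))
    return empirical_p_value(observed, null_scores)
```

A small P-value indicates that the observed agreement is rarely reached by chance rearrangements of the data.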
Availability and implementation: All source code and data are freely available at https://github.com/zpliulab/ActivePPI.
Title: ActivePPI: quantifying protein-protein interaction network activity with Markov random fields. Journal: Bioinformatics. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516639/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad556
Jonathan Klonowski, Qianqian Liang, Zeynep Coban-Akdemir, Cecilia Lo, Dennis Kostka
Summary: DNA changes that cause premature termination codons (PTCs) represent a large fraction of clinically relevant pathogenic genomic variation. Typically, PTCs induce transcript degradation by nonsense-mediated mRNA decay (NMD) and render such changes loss-of-function alleles. However, certain PTC-containing transcripts escape NMD and can exert dominant-negative or gain-of-function (DN/GOF) effects. Therefore, systematic identification of human PTC-causing variants and their susceptibility to NMD contributes to the investigation of the role of DN/GOF alleles in human disease. Here we present aenmd, software for annotating PTC-containing transcript-variant pairs for predicted escape from NMD. aenmd is user-friendly and self-contained. It offers functionality not currently available in other methods and is based on established and experimentally validated rules for NMD escape; the software is designed to work at scale and to integrate seamlessly with existing analysis workflows. We applied aenmd to variants in the gnomAD, ClinVar, and GWAS Catalog databases and report the prevalence of human PTC-causing variants in these databases, and the subset of these variants that could exert DN/GOF effects via NMD escape.
Availability and implementation: aenmd is implemented in the R programming language. Code is available on GitHub as an R package (github.com/kostkalab/aenmd.git) and as a containerized command-line interface (github.com/kostkalab/aenmd_cli.git).
{"title":"aenmd: annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants.","authors":"Jonathan Klonowski, Qianqian Liang, Zeynep Coban-Akdemir, Cecilia Lo, Dennis Kostka","doi":"10.1093/bioinformatics/btad556","DOIUrl":"10.1093/bioinformatics/btad556","url":null,"abstract":"<p><strong>Summary: </strong>DNA changes that cause premature termination codons (PTCs) represent a large fraction of clinically relevant pathogenic genomic variation. Typically, PTCs induce transcript degradation by nonsense-mediated mRNA decay (NMD) and render such changes loss-of-function alleles. However, certain PTC-containing transcripts escape NMD and can exert dominant-negative or gain-of-function (DN/GOF) effects. Therefore, systematic identification of human PTC-causing variants and their susceptibility to NMD contributes to the investigation of the role of DN/GOF alleles in human disease. Here we present aenmd, a software for annotating PTC-containing transcript-variant pairs for predicted escape from NMD. aenmd is user-friendly and self-contained. It offers functionality not currently available in other methods and is based on established and experimentally validated rules for NMD escape; the software is designed to work at scale, and to integrate seamlessly with existing analysis workflows. We applied aenmd to variants in the gnomAD, Clinvar, and GWAS catalog databases and report the prevalence of human PTC-causing variants in these databases, and the subset of these variants that could exert DN/GOF effects via NMD escape.</p><p><strong>Availability and implementation: </strong>aenmd is implemented in the R programming language. 
Code is available on GitHub as an R-package (github.com/kostkalab/aenmd.git), and as a containerized command-line interface (github.com/kostkalab/aenmd_cli.git).</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10534055/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10284138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad538
Kyle Smith, Cheng Ye, Yatish Turakhia
Motivation: Identifying and tracking recombinant strains of SARS-CoV-2 is critical to understanding the evolution of the virus and controlling its spread. However, confidently identifying SARS-CoV-2 recombinants among the thousands of new genome sequences shared online every day is challenging, so many recombinants are missed or wait weeks to be formally identified while undergoing expert curation.
Results: We present RIVET, a software pipeline and visual platform that takes advantage of recent algorithmic advances in recombination inference to comprehensively and sensitively search for potential SARS-CoV-2 recombinants and to organize the relevant information in a web interface that could greatly accelerate the process of identifying and tracking recombinants.
Availability and implementation: A RIVET-based web interface displaying the most up-to-date analysis of potential SARS-CoV-2 recombinants is available at https://rivet.ucsd.edu/. RIVET's frontend and backend code is freely available under the MIT license at https://github.com/TurakhiaLab/rivet, and the documentation for RIVET is available at https://turakhialab.github.io/rivet/. The inputs necessary for running RIVET's backend workflow for SARS-CoV-2 are available through a public database maintained and updated daily by UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/).
{"title":"Tracking and curating putative SARS-CoV-2 recombinants with RIVET.","authors":"Kyle Smith, Cheng Ye, Yatish Turakhia","doi":"10.1093/bioinformatics/btad538","DOIUrl":"https://doi.org/10.1093/bioinformatics/btad538","url":null,"abstract":"<p><strong>Motivation: </strong>Identifying and tracking recombinant strains of SARS-CoV-2 is critical to understanding the evolution of the virus and controlling its spread. But confidently identifying SARS-CoV-2 recombinants from thousands of new genome sequences that are being shared online every day is quite challenging, causing many recombinants to be missed or suffer from weeks of delay in being formally identified while undergoing expert curation.</p><p><strong>Results: </strong>We present RIVET-a software pipeline and visual platform that takes advantage of recent algorithmic advances in recombination inference to comprehensively and sensitively search for potential SARS-CoV-2 recombinants and organize the relevant information in a web interface that would help greatly accelerate the process of identifying and tracking recombinants.</p><p><strong>Availability and implementation: </strong>RIVET-based web interface displaying the most updated analysis of potential SARS-CoV-2 recombinants is available at https://rivet.ucsd.edu/. RIVET's frontend and backend code is freely available under the MIT license at https://github.com/TurakhiaLab/rivet and the documentation for RIVET is available at https://turakhialab.github.io/rivet/. 
The inputs necessary for running RIVET's backend workflow for SARS-CoV-2 are available through a public database maintained and updated daily by UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/).</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10493179/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10285636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}