Bioinformatics Advances: latest articles, page 10

VCAb: a web-tool for structure-guided exploration of antibodies.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-20 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae137

Dongjun Guo, Joseph Chi-Fung Ng, Deborah K Dunn-Walters, Franca Fraternali

Motivation: Effective responses against immune challenges require antibodies of different isotypes performing specific effector functions. Structural information on these isotypes is essential to engineer antibodies with desired physico-chemical features of their antigen-binding properties, and optimal developability as potential therapeutics. In silico mutational scanning profiles on antibody structures would further pinpoint candidate mutations for enhancing antibody stability and function. Current antibody structure databases lack consistent annotations of isotypes and structural coverage of 3D antibody structures, as well as computed deep mutation profiles.

Results: The V and C region bearing antibody (VCAb) web-tool is established to clarify these annotations and provides an accessible resource to facilitate antibody engineering and design. VCAb currently provides data on 7,166 experimentally determined antibody structures including both V and C regions from different species. Additionally, VCAb provides annotations of species and isotypes with numbering schemes applied. This information can be interactively queried or downloaded in batch.

Availability and implementation: VCAb is implemented as an R Shiny application to enable interactive data interrogation. The online application is freely accessible at https://fraternalilab.cs.ucl.ac.uk/VCAb/. The source code to generate the database and the online application is available open-source at https://github.com/Fraternalilab/VCAb.


Citation count: 0

DECOMICS, a shiny application for unsupervised cell type deconvolution and biological interpretation of bulk omic data.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-20 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae136

Slim Karkar, Ashwini Sharma, Carl Herrmann, Yuna Blum, Magali Richard

Summary: Unsupervised deconvolution algorithms are often used to estimate cell composition from bulk tissue samples. However, applying cell-type deconvolution and interpreting the results remain a challenge, even more so without prior training in bioinformatics. Here, we propose a tool for estimating and identifying cell type composition from bulk transcriptomes or methylomes. DECOMICS is a Shiny web application dedicated to unsupervised deconvolution of bulk omic data. It provides (i) a variety of existing algorithms to perform deconvolution on the gene expression or methylation-level matrix, (ii) an enrichment analysis module to aid biological interpretation of the deconvolved components, and (iii) some visualization tools. Input data can be downloaded in csv format and preprocessed in the web application (normalization, transformation, and feature selection). The results of the deconvolution, enrichment, and visualization processes can be downloaded.

Availability and implementation: DECOMICS is an R-shiny web application that can be launched (i) directly from a local R session using the R package available here: https://gitlab.in2p3.fr/Magali.Richard/decomics (either by installing it locally or via a virtual machine and a Docker image that we provide); or (ii) in the Biosphere-IFB Clouds Federation for Life Science, a multi-cloud environment scalable for high-performance computing: https://biosphere.france-bioinformatique.fr/catalogue/appliance/193/.
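To make the idea of unsupervised deconvolution concrete, the following is a minimal sketch using non-negative matrix factorization (NMF) with multiplicative updates, one common reference-free approach. It is an illustration under stated assumptions, not the DECOMICS implementation: the abstract does not say which algorithms the tool wraps, and the function name `nmf_deconvolve`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def nmf_deconvolve(bulk, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative (features x samples) matrix as profiles @ weights.

    Hypothetical helper for illustration only. Returns the component
    profiles (features x k) and per-sample proportions (k x samples).
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = bulk.shape
    profiles = rng.random((n_features, n_components)) + eps
    weights = rng.random((n_components, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        weights *= (profiles.T @ bulk) / (profiles.T @ profiles @ weights + eps)
        profiles *= (bulk @ weights.T) / (profiles @ weights @ weights.T + eps)
    # Normalise each sample's weights so they read as relative proportions.
    proportions = weights / weights.sum(axis=0, keepdims=True)
    return profiles, proportions

# Toy bulk matrix: 30 features, 12 samples, mixed from 3 hidden components.
rng = np.random.default_rng(1)
true_profiles = rng.random((30, 3))
true_proportions = rng.dirichlet(np.ones(3), size=12).T
bulk = true_profiles @ true_proportions

profiles, proportions = nmf_deconvolve(bulk, n_components=3)
print(proportions.shape)  # one column of estimated proportions per sample
```

The recovered components come out in arbitrary order and scale, which is one reason an interpretation step (such as the enrichment module described above) is needed to attach cell-type labels to them.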


Citation count: 0

Investigation of protein family relationships with deep learning.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-18 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae132

Irina Ponamareva, Antonina Andreeva, Maxwell L Bileschi, Lucy Colwell, Alex Bateman

Motivation: In this article, we propose a method for finding similarities between Pfam families based on the pre-trained neural network ProtENN2. We use the per-residue embeddings of the ProtENN2 model to produce new high-dimensional per-family embeddings, develop an approach for calculating inter-family similarity scores based on these embeddings, and evaluate its predictions using structure comparison.

Results: We apply our method to Pfam annotation by refining clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. We investigate some of the failure modes of our approach, which suggests directions for future improvements. Our method is relatively simple with few parameters and could be applied to other protein family classification models. Overall, our work suggests potential benefits of employing deep learning for improving our understanding of protein family relationships and functions of previously uncharacterized families.

Availability and implementation: github.com/iponamareva/ProtCNNSim, 10.5281/zenodo.10091909.
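The general idea of going from per-residue embeddings to per-family embeddings and then to inter-family similarity scores can be sketched as follows. This is a minimal illustration assuming mean pooling and cosine similarity; the abstract does not state which pooling or scoring the authors use, and the function names, family labels, and random toy embeddings are hypothetical stand-ins for real ProtENN2 outputs.

```python
import numpy as np

def family_embedding(per_residue_embeddings):
    """Pool a list of (sequence_length, dim) arrays into one (dim,) vector.

    Assumption: mean over residues, then mean over member sequences.
    """
    per_sequence = [emb.mean(axis=0) for emb in per_residue_embeddings]
    return np.mean(per_sequence, axis=0)

def cosine_similarity_matrix(family_vectors):
    """Pairwise cosine similarity between the rows of family_vectors."""
    norms = np.linalg.norm(family_vectors, axis=1, keepdims=True)
    unit = family_vectors / norms
    return unit @ unit.T

# Toy stand-in for model output: 3 families, 4 sequences each, 8 dimensions.
rng = np.random.default_rng(0)
dim = 8
toy_families = {
    name: [rng.normal(size=(int(rng.integers(20, 40)), dim)) for _ in range(4)]
    for name in ("FAM_A", "FAM_B", "FAM_C")
}

family_names = list(toy_families)
family_vectors = np.stack([family_embedding(toy_families[n]) for n in family_names])
similarity = cosine_similarity_matrix(family_vectors)
print(similarity.round(2))
```

In a setting like the one described, high off-diagonal scores would be candidates for shared clan membership, to be checked against an independent signal such as structure comparison.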


Citation count: 0

EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-17 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae135

Sumit Mukherjee, Zachary R McCaw, Jingwen Pei, Anna Merkoulovitch, Tom Soare, Raghav Tandon, David Amar, Hari Somineni, Christoph Klein, Santhosh Satapati, David Lloyd, Christopher Probert, Daphne Koller, Colm O'Dushlaine, Theofanis Karaletsos

Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear if genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean $\chi^{2}$ statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) a real dataset from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.

Availability and implementation: https://github.com/insitro/EmbedGEM.
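The two heritability measures named in the summary (the number of genome-wide significant associations and the mean chi-squared statistic at significant loci) can be sketched from summary statistics as below. This is an illustration, not the EmbedGEM code: the function name is hypothetical, the threshold of 5e-8 is the conventional genome-wide level and is assumed here, and each input row is assumed to already represent one independent locus.

```python
import numpy as np

GENOME_WIDE_P = 5e-8  # conventional threshold, assumed for this sketch

def heritability_proxies(chi2_stats, p_values, threshold=GENOME_WIDE_P):
    """Return (number of significant loci, mean chi-squared over those loci).

    Hypothetical helper: assumes one row per independent locus.
    """
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    significant = p_values < threshold
    n_hits = int(significant.sum())
    # With no significant loci the mean is undefined, so report NaN.
    mean_chi2 = float(chi2_stats[significant].mean()) if n_hits else float("nan")
    return n_hits, mean_chi2

# Toy values for four loci of one trait (illustrative, not real data).
chi2 = [35.0, 2.1, 40.0, 0.5]
pvals = [3.2e-9, 0.15, 2.5e-10, 0.48]
n_hits, mean_chi2 = heritability_proxies(chi2, pvals)
print(n_hits, mean_chi2)  # 2 37.5
```

Running this per embedding dimension or trait gives one pair of numbers each, which can then be compared across candidate embeddings alongside a separate disease-relevance measure.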


Citation count: 0

Batch-effect correction in single-cell RNA sequencing data using JIVE.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-13 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae134

Joseph Hastings, Donghyung Lee, Michael J O'Connell

Motivation: In single-cell RNA sequencing analysis, addressing batch effects (technical artifacts stemming from factors such as varying sequencing technologies, equipment, and capture times) is crucial. These factors can cause unwanted variation and obfuscate the underlying biological signal of interest. The joint and individual variation explained (JIVE) method can be used to extract shared biological patterns from multi-source sequencing data while adjusting for individual non-biological variations (i.e. batch effect). However, its current implementation was originally designed for bulk sequencing data, making it computationally infeasible for large-scale single-cell sequencing datasets.

Results: In this study, we enhance JIVE for large-scale single-cell data by boosting its computational efficiency. Additionally, we introduce a novel application of JIVE for batch-effect correction on multiple single-cell sequencing datasets. Our enhanced method aims to decompose single-cell sequencing datasets into a joint structure capturing the true biological variability and individual structures, which capture technical variability within each batch. This joint structure is then suitable for use in downstream analyses. We benchmarked the results against four popular tools, Seurat v5, Harmony, LIGER, and ComBat-seq, which were developed for this purpose. JIVE performed best in terms of preserving cell-type effects and in scenarios in which the batch sizes are balanced.

Availability and implementation: The JIVE implementation used for this analysis can be found at https://github.com/oconnell-statistics-lab/scJIVE.
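The decomposition into a joint structure plus per-batch individual structures can be illustrated with a simplified, single-pass sketch. Full JIVE alternates these steps until convergence and selects ranks from the data; the version below runs one pass with fixed ranks and is not the scJIVE implementation. The function names are hypothetical, and the blocks are assumed to share their column dimension; how batches map onto blocks in scJIVE is not stated in the abstract.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Leading `rank` singular triplets of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

def one_pass_jive(blocks, joint_rank, individual_ranks):
    """Single-pass joint/individual split of blocks that share columns.

    Hypothetical simplification: real JIVE iterates these two steps.
    """
    stacked = np.vstack(blocks)
    u, s, vt = truncated_svd(stacked, joint_rank)
    joint_stacked = (u * s) @ vt
    # Projector onto the orthogonal complement of the joint row space.
    complement = np.eye(stacked.shape[1]) - vt.T @ vt
    joint, individual = [], []
    start = 0
    for block, rank in zip(blocks, individual_ranks):
        stop = start + block.shape[0]
        joint_i = joint_stacked[start:stop]
        # Individual structure: low-rank part of what the joint term misses.
        residual = (block - joint_i) @ complement
        ui, si, vti = truncated_svd(residual, rank)
        joint.append(joint_i)
        individual.append((ui * si) @ vti)
        start = stop
    return joint, individual

# Toy data: two blocks with 20 shared columns.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(6, 20)), rng.normal(size=(5, 20))]
joint, individual = one_pass_jive(blocks, joint_rank=2, individual_ranks=[1, 1])
print(joint[0].shape, individual[0].shape)
```

For batch-effect correction as described above, the joint terms would be carried into downstream analysis and the individual terms, which absorb block-specific variation, would be discarded.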


Citation count: 0

Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-11 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae133

Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Jie Zheng, Christianah Jemiyo, Yongqun He, Arzucan Özgür, Junguk Hur

Motivation: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As biomedical literature continues to grow rapidly, there is an increasing need for automated and accurate extraction of these interactions to facilitate scientific discovery. Pretrained language models, such as generative pretrained transformers and bidirectional encoder representations from transformers, have shown promising results in natural language processing tasks.

Results: We evaluated the performance of PPI identification using multiple transformer-based models across three manually curated gold-standard corpora: Learning Language in Logic with 164 interactions in 77 sentences, Human Protein Reference Database with 163 interactions in 145 sentences, and Interaction Extraction Performance Assessment with 335 interactions in 486 sentences. Models based on bidirectional encoder representations achieved the best overall performance, with BioBERT achieving the highest recall of 91.95% and F1 score of 86.84% on the Learning Language in Logic dataset. Despite not being explicitly trained for biomedical texts, GPT-4 showed commendable performance, comparable to the bidirectional encoder models. Specifically, GPT-4 achieved the highest precision of 88.37%, a recall of 85.14%, and an F1 score of 86.49% on the same dataset. These results suggest that GPT-4 can effectively detect protein interactions from text, offering valuable applications in mining biomedical literature.

Availability and implementation: The source code and datasets used in this study are available at https://github.com/hurlab/PPI-GPT-BERT.
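For readers less familiar with the metrics quoted above, precision, recall, and F1 are computed from counts of true positives, false positives, and false negatives, as in the short sketch below. The counts are invented for illustration and are not taken from the study. Note also that recomputing F1 from a reported precision and recall need not reproduce a reported F1 exactly if the published figures are averaged over several runs or folds.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Invented counts: 80 correct predictions, 20 spurious, 10 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# precision=0.8000 recall=0.8889 f1=0.8421
```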


Citation count: 0

AlphaCRV: a pipeline for identifying accurate binder topologies in mass-modeling with AlphaFold.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-09-06 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae131

Francisco J Guzmán-Vega, Stefan T Arold

Motivation: The speed and accuracy of deep learning-based structure prediction algorithms now make it possible to perform in silico "pull-downs" to identify protein-protein interactions on a proteome-wide scale. However, on such a large scale, existing scoring algorithms are often insufficient to discriminate biologically relevant interactions from false positives.

Results: Here, we introduce AlphaCRV, a Python package that helps identify correct interactors in a one-against-many AlphaFold screen by clustering, ranking, and visualizing conserved binding topologies, based on protein sequence and fold.

Availability and implementation: AlphaCRV is a Python package for Linux, freely available at https://github.com/strubelab/AlphaCRV.


Citation count: 0

SPIDER: constructing cell-type-specific protein-protein interaction networks.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-08-30 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae130

Yael Kupershmidt, Simon Kasif, Roded Sharan

Motivation: Protein-protein interactions (PPIs) play essential roles in the buildup of cellular machinery and provide the skeleton for cellular signaling. However, these biochemical roles are context dependent and interactions may change across cell type, time, and space. In contrast, PPI detection assays are run in a single condition that may not even be an endogenous condition of the organism, resulting in static networks that do not reflect full cellular complexity. Thus, there is a need for computational methods to predict cell-type-specific interactions.

Results: Here we present SPIDER (Supervised Protein Interaction DEtectoR), a graph attention-based model for predicting cell-type-specific PPI networks. In contrast to previous attempts at this problem, which were unsupervised in nature, our model's training is guided by experimentally measured cell-type-specific networks, enhancing its performance. We evaluate our method using experimental data of cell-type-specific networks from both humans and mice, and show that it outperforms current approaches by a large margin. We further demonstrate the ability of our method to generalize the predictions to datasets of tissues lacking prior PPI experimental data. We leverage the networks predicted by the model to facilitate the identification of tissue-specific disease genes.

Availability and implementation: Our code and data are available at https://github.com/Kuper994/SPIDER.

Citations: 0

synphage: a pipeline for phage genome synteny graphics focused on gene conservation.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-08-29 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae126

Virginie Grosboillot, Anna Dragoš

Motivation: Visualization and comparison of genome maps of bacteriophages can be very effective, but none of the currently available tools allows gene conservation between multiple sequences to be visualized at a glance. In addition, most bioinformatic tools that run locally are command-line only, making them hard to set up, debug, and monitor.

Results: To address these needs, we developed synphage, an easy-to-use and intuitive tool to generate synteny diagrams from GenBank files. This software has a user-friendly interface and uses metadata to monitor the progress and success of the data transformation process. In the output plot, genes are colour-coded according to their degree of conservation among the group of displayed sequences. The strength of synphage also lies in its modularity and its ability to generate multiple plots with different configurations without having to re-process all the data. In conclusion, synphage reduces the bioinformatic workload of users and allows them to focus on analysis, the most impactful area of their work.

Availability and implementation: The synphage tool is implemented in Python and is available from the GitHub repository at https://github.com/vestalisvirginis/synphage. This software is released under an Apache-2.0 licence. A PyPI synphage package is available at https://pypi.org/project/synphage/ and a containerized version is available at https://hub.docker.com/r/vestalisvirginis/synphage. Contributions to the software are welcome, whether they report a bug or propose new features; the contribution guidelines are available at https://github.com/vestalisvirginis/synphage/blob/main/CONTRIBUTING.md.
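
As a rough illustration of what colouring genes by degree of conservation could look like, here is a minimal Python sketch. It is not taken from synphage and does not use its API: the function name `colour_by_conservation`, the three colour buckets, and the toy gene lists are all assumptions made for this example, and genes are matched by name only, whereas a real pipeline would work from sequence comparisons.

```python
from collections import defaultdict


def colour_by_conservation(genomes):
    """Assign a colour to each gene based on how many genomes contain it.

    `genomes` maps a genome name to a list of gene names (hypothetical input
    format; not synphage's data model).
    """
    counts = defaultdict(int)
    for genes in genomes.values():
        for gene in set(genes):  # count each gene at most once per genome
            counts[gene] += 1

    total = len(genomes)
    colours = {}
    for gene, n in counts.items():
        if n == total:
            colours[gene] = "darkgreen"  # present in every displayed genome
        elif n > 1:
            colours[gene] = "gold"       # shared by some, but not all, genomes
        else:
            colours[gene] = "lightgrey"  # unique to a single genome
    return colours


# Toy input: three phage genomes with made-up gene lists.
example_genomes = {
    "phage_A": ["terL", "portal", "lysin"],
    "phage_B": ["terL", "portal"],
    "phage_C": ["terL", "holin"],
}
example_colours = colour_by_conservation(example_genomes)
```

With this toy input, `terL` falls in the fully conserved bucket, `portal` in the partially shared bucket, and `lysin` and `holin` in the unique bucket. The bucket boundaries are arbitrary; a continuous colour scale over the fraction of genomes carrying each gene would be an equally reasonable choice.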

Citations: 0

ginmappeR: an unified approach for integrating gene and protein identifiers across biological sequence databases.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-08-29 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae129

Fernando Sola, Daniel Ayala, Marina Pulido, Rafael Ayala, Lorena López-Cerero, Inma Hernández, David Ruiz

Summary: The proliferation of biological sequence data, driven by developments in molecular biology techniques, has led to the creation of numerous open-access databases of gene and protein sequences. However, the lack of direct equivalence between identifiers across these databases hinders data integration. To address this challenge, we introduce ginmappeR, an integrated R package that facilitates the translation of gene and protein identifiers between databases. By providing a unified interface, ginmappeR streamlines the integration of diverse data sources into biological workflows, thereby enhancing efficiency and user experience.

Availability and implementation: ginmappeR is available from Bioconductor: https://bioconductor.org/packages/ginmappeR.

Citations: 0