Pub Date: 2024-09-27; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae146
David Greenwood, Marianne Shawe-Taylor, Hermaleigh Townsley, Joshua Gahir, Nikita Sahadeo, Yakubu Alhassan, Charlotte Chaloner, Oliver Galgut, Gavin Kelly, David L V Bauer, Emma C Wall, Mary Y Wu, Edward J Carr
Motivation: Observational cohort studies that track vaccine and infection responses offer real-world data to inform pandemic policy. Translating biological hypotheses, such as whether different patterns of accumulated antigenic exposures confer differing antibody responses, into analysis code can be onerous, particularly when source data are disaggregated.
Results: The R package chronogram introduces the class chronogram, where metadata is seamlessly aggregated with sparse infection episode, clinical and laboratory data. Each experimental modality is added sequentially, allowing the incorporation of new data, such as specialized time-consuming research assays, or their downstream analyses. Source data can be any rectangular data format, including database tables (such as structured query language databases). This supports annotations that aggregate data types/sources, for example, combining symptoms, molecular testing, and sequencing of one or more infectious episodes in a pathogen-agnostic manner. Chronogram arranges observational data to allow the translation of biological hypotheses into their corresponding code via a shared vocabulary.
Availability and implementation: Chronogram is implemented in R and available under an MIT licence at https://www.github.com/FrancisCrickInstitute/chronogram; a user manual is available at https://franciscrickinstitute.github.io/chronogram/.
Title: Chronogram: an R package for data curation and analysis of infection and vaccination cohort studies. Journal: Bioinformatics Advances, 4(1), vbae146. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470235/pdf/
Pub Date: 2024-09-20; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae137
Dongjun Guo, Joseph Chi-Fung Ng, Deborah K Dunn-Walters, Franca Fraternali
Motivation: Effective responses against immune challenges require antibodies of different isotypes performing specific effector functions. Structural information on these isotypes is essential to engineer antibodies with desired physico-chemical features of their antigen-binding properties, and optimal developability as potential therapeutics. In silico mutational scanning profiles on antibody structures would further pinpoint candidate mutations for enhancing antibody stability and function. Current antibody structure databases lack consistent annotations of isotypes and structural coverage of 3D antibody structures, as well as computed deep mutation profiles.
Results: The V and C region bearing antibody (VCAb) web-tool is established to clarify these annotations and provides an accessible resource to facilitate antibody engineering and design. VCAb currently provides data on 7,166 experimentally determined antibody structures including both V and C regions from different species. Additionally, VCAb provides annotations of species and isotypes with numbering schemes applied. This information can be interactively queried or downloaded in batch.
Availability and implementation: VCAb is implemented as an R Shiny application to enable interactive data interrogation. The online application is freely accessible at https://fraternalilab.cs.ucl.ac.uk/VCAb/. The source code to generate the database and the online application is available open-source at https://github.com/Fraternalilab/VCAb.
Title: VCAb: a web-tool for structure-guided exploration of antibodies. Journal: Bioinformatics Advances, 4(1), vbae137. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471263/pdf/
Pub Date: 2024-09-20; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae136
Slim Karkar, Ashwini Sharma, Carl Herrmann, Yuna Blum, Magali Richard
Summary: Unsupervised deconvolution algorithms are often used to estimate cell composition from bulk tissue samples. However, applying cell-type deconvolution and interpreting the results remain a challenge, even more so without prior training in bioinformatics. Here, we propose a tool for estimating and identifying cell type composition from bulk transcriptomes or methylomes. DECOMICS is a Shiny web application dedicated to unsupervised deconvolution approaches of bulk omic data. It provides (i) a variety of existing algorithms to perform deconvolution on the gene expression or methylation-level matrix, (ii) an enrichment analysis module to aid biological interpretation of the deconvolved components, and (iii) visualization tools. Input data can be downloaded in CSV format and preprocessed in the web application (normalization, transformation, and feature selection). The results of the deconvolution, enrichment, and visualization processes can be downloaded.
Availability and implementation: DECOMICS is an R-shiny web application that can be launched (i) directly from a local R session using the R package available here: https://gitlab.in2p3.fr/Magali.Richard/decomics (either by installing it locally or via a virtual machine and a Docker image that we provide); or (ii) in the Biosphere-IFB Clouds Federation for Life Science, a multi-cloud environment scalable for high-performance computing: https://biosphere.france-bioinformatique.fr/catalogue/appliance/193/.
Title: DECOMICS, a shiny application for unsupervised cell type deconvolution and biological interpretation of bulk omic data. Journal: Bioinformatics Advances, 4(1), vbae136. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479579/pdf/
Pub Date: 2024-09-18; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae132
Irina Ponamareva, Antonina Andreeva, Maxwell L Bileschi, Lucy Colwell, Alex Bateman
Motivation: In this article, we propose a method for finding similarities between Pfam families based on the pre-trained neural network ProtENN2. We use ProtENN2's per-residue embeddings to produce new high-dimensional per-family embeddings, develop an approach for calculating inter-family similarity scores based on these embeddings, and evaluate its predictions using structure comparison.
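As a rough illustration of turning per-residue embeddings into a per-family embedding and scoring pairs of families, the sketch below mean-pools residue vectors and compares families by cosine similarity. Both choices are assumptions made for illustration: the abstract does not state which aggregation or similarity measure the authors use. The function names and toy vectors are hypothetical.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / n for d in range(dim)]

def family_embedding(sequences):
    """Hypothetical per-family embedding.

    `sequences` is a list of sequences; each sequence is a list of
    per-residue vectors. Residues are pooled per sequence, then the
    sequence vectors are pooled per family (an assumed scheme).
    """
    return mean_pool([mean_pool(residues) for residues in sequences])

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 2-dimensional toy embeddings for two families.
family_a = [[[1.0, 0.0], [1.0, 0.2]], [[0.9, 0.1]]]
family_b = [[[0.0, 1.0], [0.1, 0.9]]]

score = cosine_similarity(family_embedding(family_a), family_embedding(family_b))
print(f"inter-family similarity: {score:.3f}")
```

In practice the embeddings would have hundreds of dimensions, and a threshold on the score would be needed before proposing that two families belong to the same clan.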
Results: We apply our method to Pfam annotation by refining clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. We investigate some of the failure modes of our approach, which suggests directions for future improvements. Our method is relatively simple with few parameters and could be applied to other protein family classification models. Overall, our work suggests potential benefits of employing deep learning for improving our understanding of protein family relationships and functions of previously uncharacterized families.
Availability and implementation: github.com/iponamareva/ProtCNNSim, 10.5281/zenodo.10091909.
Title: Investigation of protein family relationships with deep learning. Journal: Bioinformatics Advances, 4(1), vbae132. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11467057/pdf/
Pub Date: 2024-09-17; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae135
Sumit Mukherjee, Zachary R McCaw, Jingwen Pei, Anna Merkoulovitch, Tom Soare, Raghav Tandon, David Amar, Hari Somineni, Christoph Klein, Santhosh Satapati, David Lloyd, Christopher Probert, Daphne Koller, Colm O'Dushlaine, Theofanis Karaletsos
Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear if genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ² statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
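The two heritability measures named above can be sketched from a table of association results. This is a minimal illustration, not EmbedGEM's implementation: the 5e-8 threshold is the conventional genome-wide cutoff and is assumed here, the function name and toy records are invented, and the sketch counts raw variants rather than grouping them into independent loci.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide threshold (assumed, not from the abstract)

def heritability_summary(associations, p_threshold=GENOME_WIDE_P):
    """Return (number of significant hits, mean chi-square over those hits).

    `associations` is a list of dicts with keys "chi2" and "p".
    The mean is NaN when nothing passes the threshold.
    """
    significant = [a for a in associations if a["p"] < p_threshold]
    n_significant = len(significant)
    if n_significant == 0:
        return 0, float("nan")
    mean_chi2 = sum(a["chi2"] for a in significant) / n_significant
    return n_significant, mean_chi2

# Invented association results for one embedding component.
toy_results = [
    {"variant": "variant_a", "chi2": 40.0, "p": 2.5e-10},
    {"variant": "variant_b", "chi2": 32.0, "p": 1.5e-8},
    {"variant": "variant_c", "chi2": 10.0, "p": 1.6e-3},
]

n_hits, mean_chi2 = heritability_summary(toy_results)
print(n_hits, mean_chi2)
```

Comparing these two numbers across embeddings gives one axis of the evaluation; the disease-relevance axis requires polygenic risk scores and held-out labels, which this sketch does not attempt.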
Availability and implementation: https://github.com/insitro/EmbedGEM.
Title: EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery. Journal: Bioinformatics Advances, 4(1), vbae135. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632179/pdf/
Pub Date: 2024-09-13; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae134
Joseph Hastings, Donghyung Lee, Michael J O'Connell
Motivation: In single-cell RNA sequencing analysis, addressing batch effects (technical artifacts stemming from factors such as varying sequencing technologies, equipment, and capture times) is crucial. These factors can cause unwanted variation and obfuscate the underlying biological signal of interest. The joint and individual variation explained (JIVE) method can be used to extract shared biological patterns from multi-source sequencing data while adjusting for individual non-biological variations (i.e. batch effect). However, its current implementation was originally designed for bulk sequencing data, making it computationally infeasible for large-scale single-cell sequencing datasets.
Results: In this study, we enhance JIVE for large-scale single-cell data by boosting its computational efficiency. Additionally, we introduce a novel application of JIVE for batch-effect correction on multiple single-cell sequencing datasets. Our enhanced method aims to decompose single-cell sequencing datasets into a joint structure capturing the true biological variability and individual structures, which capture technical variability within each batch. This joint structure is then suitable for use in downstream analyses. We benchmarked the results against four popular tools, Seurat v5, Harmony, LIGER, and ComBat-seq, which were developed for this purpose. JIVE performed best in terms of preserving cell-type effects and in scenarios in which the batch sizes are balanced.
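The decomposition into joint and individual structure can be written as X_i = J_i + A_i + E_i for each batch i. The sketch below performs a single pass of that split with truncated SVDs in NumPy. It is a simplified illustration under stated assumptions, not the scJIVE implementation: real JIVE iterates these steps and selects ranks from the data, whereas here the ranks are fixed by hand, the blocks are assumed to share their column dimension, and the function names and random toy data are invented.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Best rank-`rank` approximation of `matrix` (Eckart-Young)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def jive_one_pass(blocks, joint_rank, individual_ranks):
    """Single-pass JIVE-style split of each block into joint + individual parts.

    blocks: list of 2-D arrays with the same number of columns.
    Returns (joint_parts, individual_parts), one array per block.
    """
    # Joint structure: low-rank approximation of all blocks stacked by rows.
    stacked = np.vstack(blocks)
    joint_stacked = truncated_svd(stacked, joint_rank)

    # Cut the stacked joint estimate back into per-block pieces.
    split_points = np.cumsum([b.shape[0] for b in blocks])[:-1]
    joint_parts = np.split(joint_stacked, split_points, axis=0)

    # Individual structure: low-rank approximation of what the joint part misses.
    individual_parts = [
        truncated_svd(block - joint, rank)
        for block, joint, rank in zip(blocks, joint_parts, individual_ranks)
    ]
    return joint_parts, individual_parts

# Invented toy data: two "batches" with 5 and 4 rows over 30 shared columns.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(5, 30)), rng.normal(size=(4, 30))]
joint_parts, individual_parts = jive_one_pass(blocks, joint_rank=2, individual_ranks=[1, 1])
```

Because the residual of a truncated SVD is orthogonal to the retained right singular vectors, each individual part is orthogonal to the joint part along the shared dimension, which is the constraint that keeps the two kinds of variation separable. For batch correction, the joint parts would be the quantities passed to downstream analysis.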
Availability and implementation: The JIVE implementation used for this analysis can be found at https://github.com/oconnell-statistics-lab/scJIVE.
Title: Batch-effect correction in single-cell RNA sequencing data using JIVE. Journal: Bioinformatics Advances, 4(1), vbae134. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11461915/pdf/
Pub Date: 2024-09-11; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae133
Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Jie Zheng, Christianah Jemiyo, Yongqun He, Arzucan Özgür, Junguk Hur
Motivation: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As biomedical literature continues to grow rapidly, there is an increasing need for automated and accurate extraction of these interactions to facilitate scientific discovery. Pretrained language models, such as generative pretrained transformers and bidirectional encoder representations from transformers, have shown promising results in natural language processing tasks.
Results: We evaluated the performance of PPI identification using multiple transformer-based models across three manually curated gold-standard corpora: Learning Language in Logic with 164 interactions in 77 sentences, Human Protein Reference Database with 163 interactions in 145 sentences, and Interaction Extraction Performance Assessment with 335 interactions in 486 sentences. Models based on bidirectional encoder representations achieved the best overall performance, with BioBERT achieving the highest recall of 91.95% and F1 score of 86.84% on the Learning Language in Logic dataset. Despite not being explicitly trained for biomedical texts, GPT-4 showed commendable performance, comparable to the bidirectional encoder models. Specifically, GPT-4 achieved the highest precision of 88.37%, a recall of 85.14%, and an F1 score of 86.49% on the same dataset. These results suggest that GPT-4 can effectively detect protein interactions from text, offering valuable applications in mining biomedical literature.
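For readers less familiar with the metrics quoted above, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The counts are invented and are not taken from the study. Note also that scores reported over cross-validation folds are often averaged per fold, so a published F1 need not equal the harmonic mean of the published precision and recall.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from raw counts.

    Each score falls back to 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented example: 8 interactions correctly extracted, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```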
Availability and implementation: The source code and datasets used in this study are available at https://github.com/hurlab/PPI-GPT-BERT.
Title: Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text. Journal: Bioinformatics Advances, 4(1), vbae133. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11419952/pdf/
Pub Date: 2024-09-06; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae131
Francisco J Guzmán-Vega, Stefan T Arold
Motivation: The speed and accuracy of deep learning-based structure prediction algorithms now make it possible to perform in silico "pull-downs" to identify protein-protein interactions on a proteome-wide scale. However, on such a large scale, existing scoring algorithms are often insufficient to discriminate biologically relevant interactions from false positives.
Results: Here, we introduce AlphaCRV, a Python package that helps identify correct interactors in a one-against-many AlphaFold screen by clustering, ranking, and visualizing conserved binding topologies, based on protein sequence and fold.
Availability and implementation: AlphaCRV is a Python package for Linux, freely available at https://github.com/strubelab/AlphaCRV.
Title: AlphaCRV: a pipeline for identifying accurate binder topologies in mass-modeling with AlphaFold. Journal: Bioinformatics Advances, 4(1), vbae131. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11405088/pdf/
Pub Date : 2024-08-30eCollection Date: 2024-01-01DOI: 10.1093/bioadv/vbae130
Yael Kupershmidt, Simon Kasif, Roded Sharan
Motivation: Protein-protein interactions (PPIs) play essential roles in the buildup of cellular machinery and provide the skeleton for cellular signaling. However, these biochemical roles are context dependent and interactions may change across cell type, time, and space. In contrast, PPI detection assays are run in a single condition that may not even be an endogenous condition of the organism, resulting in static networks that do not reflect full cellular complexity. Thus, there is a need for computational methods to predict cell-type-specific interactions.
Results: Here we present SPIDER (Supervised Protein Interaction DEtectoR), a graph attention-based model for predicting cell-type-specific PPI networks. In contrast to previous attempts at this problem, which were unsupervised in nature, our model's training is guided by experimentally measured cell-type-specific networks, enhancing its performance. We evaluate our method using experimental data of cell-type-specific networks from both humans and mice, and show that it outperforms current approaches by a large margin. We further demonstrate the ability of our method to generalize the predictions to datasets of tissues lacking prior PPI experimental data. We leverage the networks predicted by the model to facilitate the identification of tissue-specific disease genes.
Availability and implementation: Our code and data are available at https://github.com/Kuper994/SPIDER.
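As a rough illustration of what "graph attention" means in the Results paragraph above, the sketch below computes attention weights over a node's neighbours and uses them to aggregate features. It shows the generic mechanism only and is not SPIDER's architecture or code: the function names, the toy three-protein graph, the feature vectors, and the attention parameters `a_src` and `a_dst` are all assumptions made for illustration, and the learned linear projection and training loop are omitted.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU non-linearity applied to a raw attention score."""
    return x if x > 0 else slope * x


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def attention_weights(z, neighbours, a_src, a_dst):
    """Softmax-normalised attention of each node over its neighbours and itself.

    z          : list of node feature vectors (assumed already projected)
    neighbours : dict mapping a node index to an iterable of neighbour indices
    a_src/a_dst: the two halves of a (hypothetical) attention vector
    """
    weights = {}
    for i, z_i in enumerate(z):
        targets = sorted(set(neighbours.get(i, ())) | {i})  # add a self-loop
        scores = [leaky_relu(dot(a_src, z_i) + dot(a_dst, z[j])) for j in targets]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights[i] = {j: e / norm for j, e in zip(targets, exps)}
    return weights


def aggregate(z, weights):
    """Replace each node's features by the attention-weighted sum of its neighbourhood."""
    dim = len(z[0])
    return [
        [sum(w * z[j][k] for j, w in weights[i].items()) for k in range(dim)]
        for i in range(len(z))
    ]


# Toy graph: three proteins in a chain 0 - 1 - 2 (made-up features).
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = {0: [1], 1: [0, 2], 2: [1]}
a_src = [0.5, -0.5]
a_dst = [1.0, 0.0]

weights = attention_weights(z, neighbours, a_src, a_dst)
updated = aggregate(z, weights)
print(weights[1])
print(updated[1])
```

In a supervised setting such as the one the abstract describes, the attention parameters would be learned from measured cell-type-specific networks rather than fixed by hand as they are here.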
Citation: Yael Kupershmidt, Simon Kasif, Roded Sharan. SPIDER: constructing cell-type-specific protein-protein interaction networks. Bioinformatics Advances, volume 4, issue 1, vbae130, published 2024-08-30. DOI: https://doi.org/10.1093/bioadv/vbae130. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11438548/pdf/
Pub Date: 2024-08-29; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae126
Virginie Grosboillot, Anna Dragoš
Motivation: Visualization and comparison of genome maps of bacteriophages can be very effective, but none of the tools available on the market allow visualization of gene conservation between multiple sequences at a glance. In addition, most bioinformatic tools running locally are command line only, making them hard to set up, debug, and monitor.
Results: To address these needs, we developed synphage, an easy-to-use and intuitive tool to generate synteny diagrams from GenBank files. The software has a user-friendly interface and uses metadata to monitor the progress and success of the data transformation process. In the output plot, genes are colour-coded according to their degree of conservation among the group of displayed sequences. The strength of synphage also lies in its modularity and its ability to generate multiple plots with different configurations without re-processing all the data. In conclusion, synphage reduces the bioinformatic workload of users and allows them to focus on analysis, the most impactful area of their work.
Availability and implementation: The synphage tool is implemented in Python and is available from the GitHub repository at https://github.com/vestalisvirginis/synphage. The software is released under an Apache-2.0 licence. A PyPI synphage package is available at https://pypi.org/project/synphage/ and a containerized version is available at https://hub.docker.com/r/vestalisvirginis/synphage. Contributions to the software are welcome, whether reporting a bug or proposing new features; the contribution guidelines are available at https://github.com/vestalisvirginis/synphage/blob/main/CONTRIBUTING.md.
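The idea of colouring genes by their degree of conservation, as described in the Results paragraph above, can be illustrated with a minimal Python sketch. It is not synphage's code or API: the function names `conservation_degree` and `colour_for`, the toy gene sets, and the three-colour scale are all assumptions made for illustration, and real input would be parsed from GenBank files rather than typed in by hand.

```python
from collections import Counter


def conservation_degree(genomes):
    """Return, for each gene, the fraction of genomes that contain it.

    `genomes` maps a genome name to the set of gene identifiers found in it
    (hypothetical toy input, not synphage's data model).
    """
    counts = Counter(gene for genes in genomes.values() for gene in set(genes))
    total = len(genomes)
    return {gene: n / total for gene, n in counts.items()}


def colour_for(degree):
    """Map a conservation fraction to an illustrative colour name."""
    if degree >= 1.0:
        return "green"   # present in every genome
    if degree >= 0.5:
        return "orange"  # present in at least half of the genomes
    return "grey"        # present in fewer than half of the genomes


# Toy example with three phage genomes (made-up gene content).
genomes = {
    "phage_A": {"terL", "portal", "lysin"},
    "phage_B": {"terL", "portal", "integrase"},
    "phage_C": {"terL", "lysin"},
}

degrees = conservation_degree(genomes)
colours = {gene: colour_for(d) for gene, d in degrees.items()}
for gene in sorted(colours):
    print(f"{gene}: {degrees[gene]:.2f} -> {colours[gene]}")
```

Because the conservation table is computed once and the colour mapping is a separate step, the same table could be reused to draw several plots with different colour scales, which mirrors the re-plotting without re-processing that the abstract highlights.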
Citation: Virginie Grosboillot, Anna Dragoš. synphage: a pipeline for phage genome synteny graphics focused on gene conservation. Bioinformatics Advances, volume 4, issue 1, vbae126, published 2024-08-29. DOI: 10.1093/bioadv/vbae126. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368388/pdf/