Pub Date: 2024-08-29; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae129
Fernando Sola, Daniel Ayala, Marina Pulido, Rafael Ayala, Lorena López-Cerero, Inma Hernández, David Ruiz
Summary: The proliferation of biological sequence data, driven by developments in molecular biology techniques, has led to the creation of numerous open-access databases of gene and protein sequences. However, the lack of direct equivalence between identifiers across these databases hinders data integration. To address this challenge, we introduce ginmappeR, an integrated R package that facilitates the translation of gene and protein identifiers between databases. By providing a unified interface, ginmappeR streamlines the integration of diverse data sources into biological workflows, improving efficiency and user experience.
Availability and implementation: ginmappeR is available from Bioconductor: https://bioconductor.org/packages/ginmappeR.
Title: ginmappeR: an unified approach for integrating gene and protein identifiers across biological sequence databases. Journal: Bioinformatics Advances, 4(1), vbae129. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11387618/pdf/
Pub Date: 2024-08-29; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae128
Trent Dennis, Donghyung Lee
Motivation: With larger and more diverse studies becoming the standard in genome-wide association studies (GWAS), accurate estimation of ancestral proportions is increasingly important for summary-statistics-based methods such as those for imputing association summary statistics, adjusting allele frequencies (AFs) for ancestry, and prioritizing disease candidate variants or genes. Existing methods for estimating ancestral proportions in GWAS rely on the availability of study reference AFs, which are often inaccessible in current GWAS due to privacy concerns.
Results: In this study, we propose ZMIX (Z-score-based estimation of ethnic MIXing proportions), a novel method for estimating ethnic mixing proportions in GWAS using only association Z-scores, and we compare its performance to existing reference AF-based methods in both real-world and simulated GWAS settings. We found that ZMIX offered comparable results to the reference AF-based methods in simulation and real-world studies. When applied to summary-statistics imputation, all three methods produced high-quality imputations with almost identical results.
Availability and implementation: https://github.com/statsleelab/gauss.
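To make the role of mixing proportions concrete, the following minimal Python sketch shows one way estimated proportions can be used when adjusting allele frequencies for ancestry: the cohort-level AF of a variant is modeled as a proportion-weighted average of reference-panel AFs. This is not ZMIX itself (which estimates the proportions from association Z-scores); the function name, population labels, and numbers are hypothetical and chosen only for illustration.

```python
def mixed_allele_frequency(proportions, reference_afs):
    """Expected cohort AF as a proportion-weighted mix of reference-panel AFs.

    proportions   -- dict: population label -> mixing proportion (must sum to 1)
    reference_afs -- dict: population label -> AF of one variant in that panel
    """
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-8:
        raise ValueError("mixing proportions must sum to 1")
    # Weighted average over populations.
    return sum(w * reference_afs[pop] for pop, w in proportions.items())


# Hypothetical cohort: 70% population A, 30% population B (illustrative only).
proportions = {"A": 0.7, "B": 0.3}
reference_afs = {"A": 0.10, "B": 0.40}
af = mixed_allele_frequency(proportions, reference_afs)
```

With these made-up inputs the mixed AF lies between the two panel frequencies, closer to the panel with the larger proportion.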
Title: ZMIX: estimating ancestry proportions using GWAS association Z-scores. Journal: Bioinformatics Advances, 4(1), vbae128. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632184/pdf/
Pub Date: 2024-08-26; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae125
James G Davies, Georgina E Menzies
Motivation: Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, nucleotide excision repair (NER) machinery exhibits inefficiency in recognizing specific bulky DNA adducts including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain inadequately understood. We therefore combined the domains of molecular dynamics and machine learning to conduct a comprehensive assessment of helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving a random forest classification-based analysis and subsequent feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and nonhotspot sites within the TP53 gene, then applied to sites within TP53, cII, and lacZ genes.
Results: We show that our optimized model consistently achieved exceptional performance, with accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach revealed that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among TP53 and lacZ duplexes and appeared to be influenced by the regional GC content. As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and nonhotspot sites, highlighting regional GC content as a potential biomarker for mutation.
Availability and implementation: Code for comparing machine learning classifiers and evaluating their performance is available at https://github.com/jdavies24/ML-Classifier-Comparison, and code for analysing DNA structure with Curves+ and Canal using Random Forest is available at https://github.com/jdavies24/ML-classification-of-DNA-trajectories.
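For readers less familiar with the metrics quoted in the Results, here is a minimal Python sketch of how accuracy, precision, and F1 are computed from confusion-matrix counts for a binary classifier (for example, hotspot versus nonhotspot). The function name and the counts are hypothetical; they are not taken from the study or its repositories.

```python
def classification_scores(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Made-up counts for illustration only.
accuracy, precision, recall, f1 = classification_scores(tp=40, fp=10, tn=45, fn=5)
```

Because F1 combines precision and recall, it stays low whenever either one is low, which is why it is commonly reported alongside accuracy.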
Title: Utilizing biological experimental data and molecular dynamics for the classification of mutational hotspots through machine learning. Journal: Bioinformatics Advances, 4(1), vbae125. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11377099/pdf/
Pub Date: 2024-08-26; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae127
Joseph A Cogan, Natalia Benova, Rene Kuklinkova, James R Boyne, Chinedu A Anene
Motivation: Recent RNA-centric experimental methods have significantly expanded our knowledge of proteins with known RNA-binding functions. However, the complete regulatory network and pathways for many of these RNA-binding proteins (RBPs) in different cellular contexts remain unknown. Although critical to understanding the role of RBPs in health and disease, experimentally mapping the RBP-RNA interactomes in every single context is an impossible task due to the cost and manpower required. Additionally, identifying relevant RNAs bound by RBPs is challenging due to their diverse binding modes and functions.
Results: To address these challenges, we developed the RBP interaction mapper (RBPInper), an integrative framework that discovers the global RBP interactome using statistical data fusion. Experiments on splicing factor proline and glutamine rich (SFPQ) datasets revealed a cogent global SFPQ interactome. Several biological processes associated with this interactome were previously linked with SFPQ function. Furthermore, we conducted tests using an independent dataset to assess the transferability of the SFPQ interactome to another context. The results demonstrated robust utility in generating interactomes that transfer to unseen cellular contexts. Overall, RBPInper is a fast and user-friendly method that enables a systems-level understanding of RBP functions by integrating multiple molecular datasets. The tool is designed with a focus on simplicity, minimal dependencies, and straightforward input requirements. This intentional design aims to empower everyday biologists, making it easy for them to incorporate the tool into their research.
Availability and implementation: The source code, documentation, and installation instructions, as well as results for the use case, are freely available at https://github.com/AneneLab/RBPInper. A user can easily compile similar datasets for a target RBP.
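The abstract does not spell out what "statistical data fusion" involves. As one illustration of the general idea of pooling evidence across datasets, the Python sketch below combines independent p-values with Fisher's method. Treat this as an assumption made for the example, not as a description of RBPInper's actual procedure; the function name and the p-values are hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has a closed form: exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # X / 2
    term, total = 1.0, 1.0  # i = 0 term of the series
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


# Hypothetical evidence for one RNA across three datasets (illustrative only).
combined = fisher_combined_pvalue([0.04, 0.10, 0.03])
```

Fisher's method assumes the inputs are independent; correlated datasets would call for a different combination rule.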
Title: Meta-analysis of RNA interaction profiles of RNA-binding protein using the RBPInper tool. Journal: Bioinformatics Advances, 4(1), vbae127. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11374027/pdf/
Pub Date: 2024-08-24; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae123
Yinqi Zhao, Qiran Jia, Jesse Goodrich, Burcu Darst, David V Conti
Motivation: Latent unknown clustering integrating multi-omics data is a novel statistical model designed for multi-omics data analysis. It integrates omics data with exposures and an outcome through a latent cluster, elucidating how exposures influence processes reflected in multi-omics measurements, ultimately affecting an outcome. A significant challenge in multi-omics analysis is the issue of list-wise missingness. To address this, we extend the model to incorporate list-wise missingness within an integrated imputation framework, which can also handle sporadic missingness when necessary.
Results: Simulation studies demonstrate that our integrated imputation approach produces consistent and less biased estimates, closely reflecting true underlying values. We applied this model to data from the ISGlobal/ATHLETE "Exposome Data Challenge Event" to explore the association between maternal exposure to hexachlorobenzene and childhood body mass index by integrating incomplete proteomics data from 1301 children. The model successfully estimated proteomics profiles for two clusters representing higher and lower body mass index, characterizing the potential profiles linking prenatal hexachlorobenzene levels and childhood body mass index.
Availability and implementation: The proposed methods have been implemented in the R package LUCIDus. The source code is available at https://github.com/USCbiostats/LUCIDus.
Title: An extension of latent unknown clustering integrating multi-omics data (LUCID) incorporating incomplete omics data. Journal: Bioinformatics Advances, 4(1), vbae123. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368387/pdf/
Pub Date: 2024-08-22; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae124
Santiago Prochetto, Renata Reinheimer, Georgina Stegmayer
Motivation: Unraveling the connection between genes and traits is crucial for solving many biological puzzles. Ribonucleic acid molecules and proteins, derived from these genetic instructions, play crucial roles in shaping cell structures, influencing reactions, and guiding behavior. This fundamental biological principle links genetic makeup to observable traits, but integrating and extracting meaningful relationships from these complex, multimodal data presents a significant challenge.
Results: We introduce evolSOM, a novel R package for exploring and visualizing the conservation or displacement of biological variables, easing the integration of phenotypic and genotypic attributes. It enables the projection of multi-dimensional expression profiles onto interpretable two-dimensional grids, aiding in the identification of conserved or displaced genes/phenotypes across multiple conditions. Variables displaced together suggest membership in the same regulatory network, where the nature of the displacement may hold biological significance. The conservation or displacement of variables is automatically calculated and graphically presented by evolSOM. Its user-friendly interface and visualization capabilities enhance the accessibility of complex network analyses.
Availability and implementation: The package is open-source under the GPL (≥ 3) and is available at https://github.com/sanprochetto/evolSOM, along with a step-by-step vignette and a full example dataset that can be accessed at https://github.com/sanprochetto/evolSOM/tree/main/inst/extdata.
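The notion of a variable being "conserved" or "displaced" on a self-organizing map can be sketched in a few lines of Python. The example assumes an already-trained map (training is omitted), assigns each profile to its best-matching grid cell, and reports the shift in grid position between two conditions. It is an illustrative sketch only: evolSOM is an R package, and the function names, grid, and profiles here are hypothetical rather than part of its interface.

```python
def best_matching_unit(vector, grid):
    """Return the (row, col) of the grid prototype closest to `vector`.

    grid -- dict mapping (row, col) -> prototype vector of the same length
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(grid, key=lambda cell: sq_dist(vector, grid[cell]))


def displacement(profile_a, profile_b, grid):
    """Grid-cell shift (d_row, d_col) of one variable between two conditions."""
    ra, ca = best_matching_unit(profile_a, grid)
    rb, cb = best_matching_unit(profile_b, grid)
    return (rb - ra, cb - ca)


# Hypothetical, already-trained 2x2 map with 2-dimensional prototypes.
grid = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}
conserved = displacement([0.1, 0.9], [0.2, 0.8], grid)  # same cell in both conditions
displaced = displacement([0.1, 0.9], [0.9, 0.1], grid)  # moves to a different cell
```

A zero shift marks a variable as conserved; variables sharing the same non-zero shift would be candidates for the "displaced together" pattern described in the Results.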
Title: evolSOM: An R package for analyzing conservation and displacement of biological variables with self-organizing maps. Journal: Bioinformatics Advances, 4(1), vbae124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11361812/pdf/
Pub Date: 2024-08-21; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae122
Jose V Die
Summary: We introduce refseqR, an R package that offers a user-friendly solution, enabling common computational operations on RefSeq entries (GenBank, NCBI). The package is specifically designed to interact with records curated from the RefSeq database. Most importantly, the interoperability and integration with several Bioconductor objects allow connections to be applied to other projects.
Availability and implementation: The package refseqR is implemented in R and published under the MIT open-source license. The source code, documentation, and usage instructions are available on CRAN (https://CRAN.R-project.org/package=refseqR).
Title: refseqR: an R package for common computational operations with records on RefSeq collection. Journal: Bioinformatics Advances, 4(1), vbae122. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368385/pdf/
Pub Date: 2024-08-20; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae116
Katerina Nastou, Mikaela Koutrouli, Sampo Pyysalo, Lars Juhl Jensen
Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.
Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.
Availability and implementation: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
Title: CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. Journal: Bioinformatics Advances, 4(1), vbae116. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11474106/pdf/
Motivation: Enhancers play critical roles in cell-type-specific transcriptional control. Despite the identification of thousands of candidate enhancers, unravelling their regulatory relationships with their target genes remains challenging. Therefore, computational approaches are needed to accurately infer enhancer-gene regulatory relationships.
Results: In this study, we propose a new method, IVEA, that predicts enhancer-gene regulatory interactions by estimating promoter and enhancer activities. Its statistical model is based on the gene regulatory mechanism of transcriptional bursting, which is characterized by burst size and frequency controlled by promoters and enhancers, respectively. Using transcriptional readouts, chromatin accessibility, and chromatin contact data as inputs, promoter and enhancer activities were estimated using variational Bayesian inference, and the contribution of each enhancer-promoter pair to target gene transcription was calculated. Our analysis demonstrates that the proposed method can achieve high prediction accuracy and provide biologically relevant enhancer-gene regulatory interactions.
Availability and implementation: The IVEA code is available on GitHub at https://github.com/yasumasak/ivea. The publicly available datasets used in this study are described in Supplementary Table S4.
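The bursting picture in the Results paragraph can be made concrete with a toy calculation. This is only a minimal sketch under simplifying assumptions, not the IVEA model, which estimates activities by variational Bayesian inference. It assumes that mean transcription equals burst size times burst frequency, that burst size equals promoter activity, and that each enhancer adds its activity times its contact strength to the burst frequency. The function name, argument layout, and numbers are all hypothetical.

```python
def enhancer_contributions(promoter_activity, enhancers):
    """Toy bursting model (illustrative assumption, not IVEA itself).

    enhancers: list of (name, activity, contact) tuples for one target gene.
    Returns the modelled mean expression and each enhancer's share of it.
    """
    # Assumed: each enhancer adds activity * contact to the burst frequency.
    freq_terms = {name: activity * contact for name, activity, contact in enhancers}
    burst_frequency = sum(freq_terms.values())

    # Assumed: burst size is set by the promoter alone.
    burst_size = promoter_activity

    # Mean expression under this toy model: burst size times burst frequency.
    expression = burst_size * burst_frequency

    # Share of the burst frequency (and hence expression) owed to each enhancer.
    shares = {
        name: (term / burst_frequency if burst_frequency > 0 else 0.0)
        for name, term in freq_terms.items()
    }
    return expression, shares


# Hypothetical gene with two candidate enhancers.
expression, shares = enhancer_contributions(
    promoter_activity=2.0,
    enhancers=[("e1", 3.0, 0.5), ("e2", 1.0, 0.5)],
)
print(expression, shares)
```

In this toy setting the stronger enhancer accounts for a proportionally larger share of the target gene's modelled transcription, which is the kind of per-pair contribution the abstract refers to.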
IVEA: an integrative variational Bayesian inference method for predicting enhancer-gene regulatory interactions. Yasumasa Kimura, Yoshimasa Ono, Kotoe Katayama, Seiya Imoto. Bioinformatics advances 4(1): vbae118. Impact factor: 2.4. Pub Date: 2024-08-20. DOI: 10.1093/bioadv/vbae118. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11349192/pdf/
Pub Date: 2024-08-17; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae120
Frimpong Boadu, Jianlin Cheng
Motivation: As fewer than 1% of proteins have experimentally determined function information, computational prediction of protein function is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress the community has made in protein function prediction over the last decade, overall prediction accuracy is still not high, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt.
Results: We introduce TransFew, a new transformer model that learns representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences, and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships. The two representations are combined via cross-attention to predict protein function. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy but also delivers robust performance in predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms.
Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
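To illustrate how cross-attention can combine label and sequence representations, here is a minimal pure-Python sketch of scaled dot-product cross-attention. It is not the TransFew implementation: treating GO-term embeddings as queries and per-residue protein embeddings as keys and values is an assumption made for illustration, and the toy vectors, the `cross_attention` helper, and the sigmoid scoring step are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        # Weighted sum of the value vectors, one output vector per query.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy embeddings (real models use hundreds of dimensions).
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]                  # two GO terms (queries)
residue_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # three residues (keys, values)

attended = cross_attention(label_embeddings, residue_embeddings, residue_embeddings)

# Assumed scoring step: one probability-like score per GO term.
scores = [sigmoid(dot(label, ctx)) for label, ctx in zip(label_embeddings, attended)]
print(scores)
```

Because every GO term gets its own attended summary of the protein, a rare term can still be scored from its embedding even when few training proteins carry that annotation, which is the intuition behind the annotation transfer the abstract describes.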
Improving protein function prediction by learning and integrating representations of protein sequences and function labels. Frimpong Boadu, Jianlin Cheng. Bioinformatics advances 4(1): vbae120. Impact factor: 2.4. Pub Date: 2024-08-17. DOI: 10.1093/bioadv/vbae120. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11374024/pdf/