Pub Date: 2024-08-22 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae124
Santiago Prochetto, Renata Reinheimer, Georgina Stegmayer
Motivation: Unraveling the connection between genes and traits is crucial for solving many biological puzzles. Ribonucleic acid molecules and proteins, derived from these genetic instructions, play crucial roles in shaping cell structures, influencing reactions, and guiding behavior. This fundamental biological principle links genetic makeup to observable traits, but integrating and extracting meaningful relationships from this complex, multimodal data present a significant challenge.
Results: We introduce evolSOM, a novel R package that allows exploring and visualizing the conservation or displacement of biological variables, easing the integration of phenotypic and genotypic attributes. It enables the projection of multi-dimensional expression profiles onto interpretable two-dimensional grids, aiding in the identification of conserved or displaced genes/phenotypes across multiple conditions. Variables displaced together suggest membership to the same regulatory network, where the nature of the displacement may hold biological significance. The conservation or displacement of variables is automatically calculated and graphically presented by evolSOM. Its user-friendly interface and visualization capabilities enhance the accessibility of complex network analyses.
Availability and implementation: The package is open-source under the GPL (≥ 3) and is available at https://github.com/sanprochetto/evolSOM, along with a step-by-step vignette and a full example dataset that can be accessed at https://github.com/sanprochetto/evolSOM/tree/main/inst/extdata.
Title: evolSOM: An R package for analyzing conservation and displacement of biological variables with self-organizing maps. Journal: Bioinformatics Advances 4(1), vbae124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11361812/pdf/
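The idea in the evolSOM abstract above, projecting variable profiles onto a two-dimensional grid and checking whether a variable lands on the same unit under another condition, can be illustrated with a minimal sketch. This is not evolSOM's R API. The function names, the 3x3 grid, the training schedule, the toy profiles, and the rule "same unit means conserved" are all assumptions made for illustration.

```python
import math
import random

def nearest_unit(weights, x):
    """Index of the SOM unit whose weight vector is closest to x (the BMU)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(profiles, rows=3, cols=3, epochs=100, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM on a list of equal-length profiles."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each unit from a randomly chosen input profile (copied).
    weights = [list(rng.choice(profiles)) for _ in grid]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                   # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.3)   # neighbourhood radius shrinks too
        order = list(profiles)
        rng.shuffle(order)
        for x in order:
            br, bc = grid[nearest_unit(weights, x)]
            for (r, c), w in zip(grid, weights):
                # Gaussian neighbourhood: units near the BMU move more.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                for j in range(len(w)):
                    w[j] += lr * h * (x[j] - w[j])
    return grid, weights

def displacement(grid, weights, reference, other):
    """Label each variable by the grid distance between its units in two conditions."""
    labels = {}
    for name in reference:
        r1, c1 = grid[nearest_unit(weights, reference[name])]
        r2, c2 = grid[nearest_unit(weights, other[name])]
        steps = max(abs(r1 - r2), abs(c1 - c2))  # Chebyshev distance on the grid
        labels[name] = ("conserved" if steps == 0 else "displaced", steps)
    return labels

# Hypothetical toy profiles: geneB changes its profile in the second condition.
reference = {"geneA": [1.0, 0.2, 0.1], "geneB": [0.1, 0.9, 0.8], "trait1": [0.9, 0.3, 0.2]}
other = {"geneA": [1.0, 0.2, 0.1], "geneB": [0.9, 0.1, 0.2], "trait1": [0.9, 0.3, 0.2]}

grid, weights = train_som(list(reference.values()))
result = displacement(grid, weights, reference, other)
print(result)
```

The map is trained on the reference condition only and both conditions are projected onto it, so an unchanged profile always maps to the same unit and is labelled conserved. Whether a changed profile crosses a unit boundary depends on the trained map.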
Pub Date: 2024-08-21 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae122
Jose V Die
Summary: We introduce refseqR, an R package that offers a user-friendly solution, enabling common computational operations on RefSeq entries (GenBank, NCBI). The package is specifically designed to interact with records curated from the RefSeq database. Most importantly, the interoperability and integration with several Bioconductor objects allow connections to be applied to other projects.
Availability and implementation: The package refseqR is implemented in R and published under the MIT open-source license. The source code, documentation, and usage instructions are available on CRAN (https://CRAN.R-project.org/package=refseqR).
Title: refseqR: an R package for common computational operations with records on RefSeq collection. Journal: Bioinformatics Advances 4(1), vbae122. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368385/pdf/
Pub Date: 2024-08-20 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae116
Katerina Nastou, Mikaela Koutrouli, Sampo Pyysalo, Lars Juhl Jensen
Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.
Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.
Availability and implementation: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
Title: CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. Journal: Bioinformatics Advances 4(1), vbae116. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11474106/pdf/
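The F-scores quoted in the CoNECo abstract above (73.7% and 61.2%) combine precision and recall into one number. The sketch below shows how such a score is computed from entity-level counts. It assumes exact-span matching, which the abstract does not specify, and the spans are invented for illustration rather than taken from the corpus.

```python
def count_matches(gold, predicted):
    """Exact-match counts over sets of (doc_id, start, end) spans.

    Returns (true positives, false positives, false negatives).
    """
    tp = len(gold & predicted)
    return tp, len(predicted - gold), len(gold - predicted)

def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from counts; returns 0.0 when precision or recall is undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical annotations: two of three predicted spans match the gold standard.
gold = {("doc1", 0, 5), ("doc1", 10, 18), ("doc2", 3, 9)}
predicted = {("doc1", 0, 5), ("doc2", 3, 9), ("doc2", 20, 25)}

tp, fp, fn = count_matches(gold, predicted)
print(tp, fp, fn)                       # 2 1 1
print(round(f_score(tp, fp, fn), 3))    # 0.667
```

With precision and recall both at 2/3, the F1 score is also 2/3. A tagger that finds more entities at the cost of more false positives would raise recall and lower precision, and the harmonic mean penalises whichever of the two is lower.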
Motivation: Enhancers play critical roles in cell-type-specific transcriptional control. Despite the identification of thousands of candidate enhancers, unravelling their regulatory relationships with their target genes remains challenging. Therefore, computational approaches are needed to accurately infer enhancer-gene regulatory relationships.
Results: In this study, we propose a new method, IVEA, that predicts enhancer-gene regulatory interactions by estimating promoter and enhancer activities. Its statistical model is based on the gene regulatory mechanism of transcriptional bursting, which is characterized by burst size and frequency controlled by promoters and enhancers, respectively. Using transcriptional readouts, chromatin accessibility, and chromatin contact data as inputs, promoter and enhancer activities were estimated using variational Bayesian inference, and the contribution of each enhancer-promoter pair to target gene transcription was calculated. Our analysis demonstrates that the proposed method can achieve high prediction accuracy and provide biologically relevant enhancer-gene regulatory interactions.
Availability and implementation: The IVEA code is available on GitHub at https://github.com/yasumasak/ivea. The publicly available datasets used in this study are described in Supplementary Table S4.
Title: IVEA: an integrative variational Bayesian inference method for predicting enhancer-gene regulatory interactions. Authors: Yasumasa Kimura, Yoshimasa Ono, Kotoe Katayama, Seiya Imoto. Pub Date: 2024-08-20 | DOI: 10.1093/bioadv/vbae118. Journal: Bioinformatics Advances 4(1), vbae118. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11349192/pdf/
Pub Date: 2024-08-17 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae120
Frimpong Boadu, Jianlin Cheng
Motivation: As fewer than 1% of proteins have experimentally determined function information, computationally predicting protein function is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt.
Results: We introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships, which are combined to predict protein function via cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy but also delivers robust performance in predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms.
Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
Title: Improving protein function prediction by learning and integrating representations of protein sequences and function labels. Journal: Bioinformatics Advances 4(1), vbae120. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11374024/pdf/
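The TransFew abstract above says that protein and GO-term representations are combined via cross-attention. The sketch below shows the generic scaled dot-product form of that operation, not the TransFew implementation. It assumes, purely for illustration, that label embeddings act as queries and protein features act as keys and values, and it omits the learned projection matrices a real model would have.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Returns one output vector per query, a weighted average of the values.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy inputs: one label query, two identical protein-feature keys.
label_queries = [[1.0, 0.0]]
protein_keys = [[1.0, 1.0], [1.0, 1.0]]
protein_values = [[0.0, 2.0], [4.0, 6.0]]

attended = cross_attention(label_queries, protein_keys, protein_values)
print(attended)  # [[2.0, 4.0]]
```

Because both keys are identical, the query weights them equally and the output is the plain average of the two value vectors. With distinct keys, the output shifts toward the values whose keys align best with the query.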
Pub Date: 2024-08-17 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae119
Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu
Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.
Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
Title: In the twilight zone of protein sequence homology: do protein language models learn protein structure? Journal: Bioinformatics Advances 4(1), vbae119. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344590/pdf/
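One common way to use a protein language model in a zero-shot setting, as in the remote homology evaluation described above, is to embed each sequence and rank candidates by the similarity of their embeddings to the query. The abstract does not state the exact protocol, so the cosine-similarity ranking below is an assumption, and the three-dimensional vectors are invented stand-ins for real model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def rank_candidates(query_vec, candidates):
    """Candidate ids sorted by decreasing cosine similarity to the query."""
    return sorted(candidates,
                  key=lambda name: cosine(query_vec, candidates[name]),
                  reverse=True)

# Hypothetical embeddings; a real pipeline would take these from a model.
query = [1.0, 0.0, 0.0]
candidates = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(query, candidates)
print(ranking)  # ['p1', 'p3', 'p2']
```

No training on homology labels is involved: the ranking depends only on what the embeddings already encode, which is why performance at low sequence identity indicates how much structural signal the model has learned.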
Pub Date: 2024-08-17 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae121
Gregor Rot, Arne Wehling, Roland Schmucki, Nikolaos Berntenis, Jitao David Zhang, Martin Ebeling
Motivation: Analysis of alternative splicing using short-read RNA-seq data is a complex process that involves several steps: alignment of reads to the reference genome, identification of alternatively spliced features, motif discovery, analysis of RNA-protein binding near donor and acceptor splice sites, and exploratory data visualization. To the best of our knowledge, there is currently no integrative open-source software dedicated to this task.
Results: Here, we introduce splicekit, a Python package that integrates a set of existing and novel tools for splicing analysis.
Availability and implementation: The software splicekit is open-source and available on GitHub (https://github.com/bedapub/splicekit) and via the Python Package Index.
Title: splicekit: an integrative toolkit for splicing analysis from short-read RNA-seq. Journal: Bioinformatics Advances 4(1), vbae121. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11364168/pdf/
Pub Date: 2024-08-16 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae112
Yuanli Zuo, Wenrong Liu, Yang Jin, Yitong Pan, Ting Fan, Xin Fu, Jiawei Guo, Shuangyan Tan, Juan He, Yang Yang, Zhang Li, Chenyu Yang, Yong Peng
Motivation: Circular RNAs (circRNAs) play important roles in gene expression, and their involvement in tumorigenesis is emerging. circRNA-related databases are powerful tools for researchers investigating circRNAs. However, existing databases lack an advanced platform that integrates comprehensive information and analysis tools for cancer-related circRNAs.
Results: We developed a comprehensive platform called CircRNA to Cancer Database (C2CDB), encompassing 318 158 cancer-related circRNAs expressed in tumors and adjacent tissues across 30 types of cancers. C2CDB provides basic details such as sequence and expression levels of circRNAs, as well as crucial insights into biological mechanisms, including miRNA binding, RNA-binding protein interaction, coding potential, base modification, mutation, and secondary structure. Moreover, C2CDB collects an extensive compilation of published literature on cancer circRNAs, extracting and presenting pivotal content encompassing biological functions, underlying mechanisms, and molecular tools in these studies. Additionally, C2CDB offers integrated tools to analyse three potential mechanisms: circRNA-miRNA ceRNA interaction, circRNA encoding, and circRNA biogenesis, giving investigators convenient access to highly reliable information. To enhance clarity and organization, C2CDB has meticulously curated and integrated the previously chaotic nomenclature of circRNAs, addressing the prevailing confusion and ambiguity surrounding their designations.
Availability and implementation: C2CDB is freely available at http://pengyonglab.com/c2cdb.
Title: C2CDB: an advanced platform integrating comprehensive information and analysis tools of cancer-related circRNAs. Journal: Bioinformatics Advances 4(1), vbae112. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11379471/pdf/
Pub Date: 2024-08-14 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae099
Marinka Zitnik, Michelle M Li, Aydin Wells, Kimberly Glass, Deisy Morselli Gysi, Arjun Krishnan, T M Murali, Predrag Radivojac, Sushmita Roy, Anaïs Baudot, Serdar Bozdag, Danny Z Chen, Lenore Cowen, Kapil Devkota, Anthony Gitter, Sara J C Gosline, Pengfei Gu, Pietro H Guzzi, Heng Huang, Meng Jiang, Ziynet Nesibe Kesimoglu, Mehmet Koyuturk, Jian Ma, Alexander R Pico, Nataša Pržulj, Teresa M Przytycka, Benjamin J Raphael, Anna Ritz, Roded Sharan, Yang Shen, Mona Singh, Donna K Slonim, Hanghang Tong, Xinan Holly Yang, Byung-Jun Yoon, Haiyuan Yu, Tijana Milenković
Summary: Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology.
{"title":"Current and future directions in network biology.","authors":"Marinka Zitnik, Michelle M Li, Aydin Wells, Kimberly Glass, Deisy Morselli Gysi, Arjun Krishnan, T M Murali, Predrag Radivojac, Sushmita Roy, Anaïs Baudot, Serdar Bozdag, Danny Z Chen, Lenore Cowen, Kapil Devkota, Anthony Gitter, Sara J C Gosline, Pengfei Gu, Pietro H Guzzi, Heng Huang, Meng Jiang, Ziynet Nesibe Kesimoglu, Mehmet Koyuturk, Jian Ma, Alexander R Pico, Nataša Pržulj, Teresa M Przytycka, Benjamin J Raphael, Anna Ritz, Roded Sharan, Yang Shen, Mona Singh, Donna K Slonim, Hanghang Tong, Xinan Holly Yang, Byung-Jun Yoon, Haiyuan Yu, Tijana Milenković","doi":"10.1093/bioadv/vbae099","DOIUrl":"10.1093/bioadv/vbae099","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae099"},"PeriodicalIF":2.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11321866/pdf/","platform":"Semanticscholar","paperid":"141984030","PMCID":"OA"}
Summary: This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention, yielding an efficient model capable of sequence-to-sequence transformations and generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model on reference genome sequences and apply it to the following downstream tasks: (i) identification of enhancers, promoters, and splice sites; (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision; (iii) identification of biological function annotations of genomic sequences; and (iv) generation of mutations of the Influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvement over existing state-of-the-art results.
Availability and implementation: The source code used to develop and fine-tune the foundation model has been released on GitHub (https://github.itap.purdue.edu/Clan-labs/ENBED).
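The contrast the abstract draws between byte-level and multi-base tokenization can be illustrated with a small sketch. This is not ENBED's code: the function names, the non-overlapping 3-mer scheme, the mask id of 0, and the 15% masking rate are all assumptions chosen for illustration, and the abstract does not specify the model's actual tokenizer or masking procedure.

```python
import random


def byte_tokens(seq: str) -> list[int]:
    """Encode each nucleotide as its own byte value (one token per base)."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split the sequence into non-overlapping k-mers (one token per k bases)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mismatch_positions(a, b) -> list[int]:
    """Return the token indices at which two equal-length token lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def mask_tokens(tokens, mask_id=0, rate=0.15, seed=0):
    """Replace a random subset of tokens with mask_id (hypothetical id).

    Returns the masked copy and the list of masked positions, which is the
    basic input/target setup used in masked language modeling.
    """
    rng = random.Random(seed)
    positions = [i for i in range(len(tokens)) if rng.random() < rate]
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_id
    return masked, positions


REFERENCE = "ACGTACGTAC"
SUBSTITUTED = "ACGTTCGTAC"  # single-base mismatch at base index 4

if __name__ == "__main__":
    # Byte level: the differing token index is the differing base index.
    print(mismatch_positions(byte_tokens(REFERENCE), byte_tokens(SUBSTITUTED)))  # [4]
    # 3-mer level: only the token covering bases 3..5 is flagged.
    print(mismatch_positions(kmer_tokens(REFERENCE), kmer_tokens(SUBSTITUTED)))  # [1]
    masked, positions = mask_tokens(byte_tokens(REFERENCE))
    print(masked, positions)
```

With one token per base, a mismatch is located to the exact nucleotide, whereas the 3-mer tokenizer can only say which block of three bases changed.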
{"title":"Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision.","authors":"Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal","doi":"10.1093/bioadv/vbae117","DOIUrl":"10.1093/bioadv/vbae117","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae117"},"PeriodicalIF":2.4,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11341122/pdf/","platform":"Semanticscholar","paperid":"142037895","PMCID":"OA"}