Pub Date: 2024-04-13 DOI: 10.1093/bioinformatics/btae204
Yurui Chen, Louxin Zhang
MOTIVATION Personalized cancer treatments require accurate drug response predictions. Existing deep learning methods show promise, but higher accuracy is needed to serve the purpose of precision medicine. Prediction accuracy can be improved by using not only the topological but also the geometrical information of drugs. RESULTS A novel deep learning methodology for drug response prediction, named Hi-GeoMVP, is presented. It combines hierarchical drug representations with multi-omics data, using graph neural networks and variational autoencoders to build detailed drug and cell line representations. Multi-task learning is employed to improve prediction, while both 2D and 3D molecular representations capture comprehensive drug information. Testing on the GDSC dataset confirms Hi-GeoMVP's enhanced performance: it surpasses prior state-of-the-art methods, improving the Pearson correlation coefficient from 0.934 to 0.941 and decreasing the root mean square error from 0.969 to 0.931. In blind tests, Hi-GeoMVP proved robust, outperforming the best previous models with a higher Pearson correlation coefficient in the drug-blind test. These results underscore Hi-GeoMVP's capabilities in drug response prediction and its potential for precision medicine. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/matcyr/Hi-GeoMVP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"Hi-GeoMVP: a hierarchical geometry-enhanced deep learning model for drug response prediction.","authors":"Yurui Chen, Louxin Zhang","doi":"10.1093/bioinformatics/btae204","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae204","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-13","publicationTypes":"Journal Article"}
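The Hi-GeoMVP abstract reports its gains as a Pearson correlation coefficient and a root mean square error. As a reminder of what these two metrics measure, here is a minimal Python sketch; the function names and the toy values are illustrative assumptions and are not taken from the Hi-GeoMVP code base or the GDSC dataset.

```python
import math

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy drug-response values, made up for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(pearson(observed, predicted), rmse(observed, predicted))
```

A higher Pearson coefficient means predictions track the ordering and spread of the observed responses more closely, while a lower RMSE means the absolute errors are smaller; the two can move independently, which is presumably why the abstract reports both.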
Pub Date: 2024-04-13 DOI: 10.1093/bioinformatics/btae205
Anja Mösch, Filippo Grazioli, Pierre Machart, Brandon Malone
MOTIVATION Neoantigen vaccines make use of tumor-specific mutations to enable the patient's immune system to recognize and eliminate cancer. Selecting vaccine elements, however, is a complex task that must take into account not only the underlying antigen presentation pathway but also tumor heterogeneity. RESULTS Here, we present NeoAgDT, a two-step approach consisting of: (1) simulating individual cancer cells to create a digital twin of the patient's tumor cell population and (2) optimizing the vaccine composition by integer linear programming based on this digital twin. NeoAgDT shows improved selection of experimentally validated neoantigens over ranking-based approaches in a study of seven patients. AVAILABILITY The NeoAgDT code is published on GitHub: https://github.com/nec-research/neoagdt. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"NeoAgDT: Optimization of personal neoantigen vaccine composition by digital twin simulation of a cancer cell population.","authors":"Anja Mösch, Filippo Grazioli, Pierre Machart, Brandon Malone","doi":"10.1093/bioinformatics/btae205","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae205","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-13","publicationTypes":"Journal Article"}
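The NeoAgDT abstract describes optimizing vaccine composition over a simulated cell population rather than ranking candidates one by one. The sketch below illustrates why that can matter, under loudly stated assumptions: the objective (maximize the number of simulated cells presenting at least one selected element) is a guess at a plausible formulation, and the exhaustive search stands in for the integer linear program the authors actually use. None of the names come from the NeoAgDT code.

```python
from itertools import combinations

def best_vaccine(cells, candidates, budget):
    """Return the subset of at most `budget` candidates covering the most cells.

    `cells` is a list of sets; each set holds the candidates that one
    simulated cell presents. Brute force is used here only as a stand-in
    for an integer linear programming solver.
    """
    best_subset, best_covered = (), -1
    for size in range(budget + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            covered = sum(1 for presented in cells if presented & chosen)
            if covered > best_covered:
                best_subset, best_covered = subset, covered
    return best_subset, best_covered

# Toy heterogeneous population (hypothetical): three cells present A and B,
# two cells present only C.
cells = [{"A", "B"}] * 3 + [{"C"}] * 2
print(best_vaccine(cells, ["A", "B", "C"], budget=2))
```

In this toy population, ranking candidates by how many cells present them would favor A and B, which together reach only three cells, while the set-level search picks A and C and reaches all five. Exhaustive search scales exponentially, so it is only viable for very small candidate lists.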
Pub Date: 2024-04-13 DOI: 10.1093/bioinformatics/btae206
Kailing Tu, Xuemei Li, Qilin Zhang, Wei Huang, Dan Xie
MOTIVATION Identifying chromatin accessibility is one of the key steps in studying the regulation of eukaryotic genomes. The combination of exogenous methyltransferase and nanopore sequencing provides a strategy to identify open chromatin over long genomic ranges at the single-molecule scale. However, endogenous methylation, non-open-chromatin-specific exogenous methylation and base-calling errors limit its accuracy and hinder its application to complex genomes. RESULTS We systematically evaluated the impact of these three factors and developed a model-based computational method, methyltransferase accessible genome region finder (MAGNIFIER), to address them. By incorporating control data, MAGNIFIER attenuates the three factors with a data-adaptive comparison strategy. We demonstrate that MAGNIFIER not only identifies open chromatin sensitively and with much improved accuracy, but also detects the chromatin accessibility of repetitive regions that are missed by NGS-based methods. By incorporating long-read RNA-seq data, we revealed the association between accessible Alu elements and non-classic gene isoforms. AVAILABILITY Freely available on the web at https://github.com/Goatofmountain/MAGNIFIER. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"A data-adaptive methods in detecting exogenous methyltransferase accessible chromatin in human genome using nanopore sequencing.","authors":"Kailing Tu, Xuemei Li, Qilin Zhang, Wei Huang, Dan Xie","doi":"10.1093/bioinformatics/btae206","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae206","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-13","publicationTypes":"Journal Article"}
MOTIVATION Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and the lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape (HCL) project (Han et al., 2020), researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates. RESULTS To circumvent this, we mined ∼18 million PubMed abstracts published up to May 2019 and automatically selected ∼4.5 million of them that describe roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained Bidirectional Encoder Representations from Transformers (BERT) language model from the domain of Natural Language Processing (NLP) to learn vector representations of entities such as genes, diseases, tissues, and cell types, such that their relationships are preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in-silico synthesis of hypotheses linking different biological entities such as genes and conditions. AVAILABILITY AND IMPLEMENTATION PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model. BioSentVec-based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model. Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"Literature mining discerns latent disease-gene relationships.","authors":"Priyadarshini Rai, Atishay Jain, Shivani Kumar, Divya Sharma, Neha Jha, Smriti Chawla, Abhijith S. Raj, Apoorva Gupta, Sarita Poonia, A. Majumdar, Tanmoy Chakraborty, Gaurav Ahuja, Debarka Sengupta","doi":"10.1093/bioinformatics/btae185","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae185","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-12","publicationTypes":"Journal Article"}
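The literature-mining abstract rests on the idea that entity relationships are preserved in a vector space. One common way to query such a space is cosine similarity between embedding vectors; the sketch below shows that generic technique only. It is an assumption that the authors score associations this way, and the vectors and entity names are placeholders, not output of the PathoBERT model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, embeddings):
    """Sort entity names by decreasing similarity to the query vector."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )

# Placeholder two-dimensional embeddings; real models use hundreds of dimensions.
disease_vector = [1.0, 0.0]
gene_vectors = {"gene_x": [0.0, 1.0], "gene_y": [1.0, 0.1]}
print(rank_candidates(disease_vector, gene_vectors))
```

Genes ranked near the top for a disease but never co-mentioned with it in the training abstracts would be the kind of candidate hypothesis the abstract describes.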
MOTIVATION Clustering analysis for single-cell RNA sequencing (scRNA-seq) data is an important step in revealing cellular heterogeneity. Many clustering methods have been proposed to discover heterogeneous cell types from scRNA-seq data. However, adaptive clustering that yields an accurate number of clusters, reflecting the intrinsic biology of large-scale scRNA-seq data, remains quite challenging. RESULTS Here we propose a single-cell Deep Adaptive Clustering (scDAC) model that couples an Autoencoder (AE) with a Dirichlet Process Mixture Model (DPMM). By jointly optimizing the model parameters of the AE and the DPMM, scDAC achieves adaptive clustering with accurate cluster numbers on scRNA-seq data. We verify the performance of scDAC on five subsampled datasets with different numbers of cell types and compare it with fifteen widely used clustering methods across nine scRNA-seq datasets. Our results demonstrate that scDAC can adaptively find accurate numbers of cell types or subtypes and outperforms other methods. Moreover, the performance of scDAC is robust to hyperparameter changes. AVAILABILITY scDAC is implemented in Python. The source code is available at https://github.com/labomics/scDAC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and dirichlet process mixture model.","authors":"Sijing An, Jinhui Shi, Runyan Liu, Yaowen Chen, Jing Wang, Shuofeng Hu, Xinyu Xia, Guohua Dong, Xiaochen Bo, Zhen He, Xiaomin Ying","doi":"10.1093/bioinformatics/btae198","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae198","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-11","publicationTypes":"Journal Article"}
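The scDAC abstract relies on a Dirichlet process mixture model so that the number of clusters need not be fixed in advance. A minimal sketch of that property is the Chinese restaurant process, the standard sequential view of the Dirichlet process prior. This shows the prior only, not scDAC's joint optimization with the autoencoder, and the function name and parameters are illustrative assumptions.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels from a Chinese restaurant process.

    Each new item joins existing cluster k with probability proportional
    to its current size, or opens a new cluster with probability
    proportional to the concentration parameter `alpha` (must be > 0).
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        weights = counts + [alpha]  # last slot stands for "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n_items=200, alpha=1.0)
print(len(counts), counts)
```

The number of clusters that emerges depends on the data size and on `alpha` rather than being supplied by the user, which is the behavior the abstract calls adaptive clustering. In a full mixture model the assignment weights would also include the likelihood of each cell's latent representation under each cluster.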
Pub Date: 2024-04-10 DOI: 10.1093/bioinformatics/btae192
Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li
MOTIVATION Discovering disease-causing pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS We developed a fast XGBoost method and software, VirusPredictor, leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e., 0.76, 0.93, and 0.98 for 150-350 bp (Illumina short reads), 850-950 bp (Sanger sequencing data), and 2,000-5,000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to > 0.98 when query sequences increased from 150-350 to > 850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software package, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. It is also the first study to combine virus subgroup predictions. AVAILABILITY www.dllab.org/software/VirusPredictor.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.","authors":"Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li","doi":"10.1093/bioinformatics/btae192","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae192","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-10","publicationTypes":"Journal Article"}
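The VirusPredictor abstract describes a two-step scheme: a first model assigns each query to one of three groups, and only sequences called infectious virus are passed to a second model for subgroup assignment. The routing logic can be sketched as follows. This is not the VirusPredictor API; the function names, label strings, and the two placeholder models are hypothetical and stand in for trained XGBoost classifiers.

```python
def two_step_predict(sequence, group_model, subgroup_model):
    """Route a query sequence through a two-step classification cascade."""
    group = group_model(sequence)
    if group != "infectious_virus":
        return group, None  # ERV and non-ERV human sequences stop at step one
    return group, subgroup_model(sequence)

# Placeholder models for illustration only; they carry no biological meaning.
def toy_group_model(sequence):
    return "infectious_virus" if sequence.startswith("ACG") else "non_erv_human"

def toy_subgroup_model(sequence):
    return "subgroup_1"

print(two_step_predict("ACGTTT", toy_group_model, toy_subgroup_model))
print(two_step_predict("TTTTTT", toy_group_model, toy_subgroup_model))
```

One consequence of this design is that errors in the first step propagate: a viral sequence misclassified as human never reaches the subgroup model, which is consistent with the abstract's advice to assemble short reads into longer contigs before prediction.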
Pub Date: 2024-04-10 DOI: 10.1093/bioinformatics/btae193
Jonathan L. Price, Omer Ziv, M. Pinckert, Andrew Lim, Eric A. Miska
SUMMARY RNA (Ribonucleic Acid) molecules have secondary and tertiary structures in vivo which play a crucial role in cellular processes such as the regulation of gene expression, RNA processing and localisation. The ability to investigate these structures will enhance our understanding of their function and contribute to the diagnosis and treatment of diseases caused by RNA dysregulation. However, there are no mature pipelines or packages for processing and analysing complex in vivo RNA structural data. Here, we present rnaCrosslinkOO (RNA Crosslink Object-Oriented), a novel software package for the comprehensive analysis of data derived from the COMRADES (Crosslinking of Matched RNA and Deep Sequencing) method. rnaCrosslinkOO offers a comprehensive pipeline from raw sequencing reads to the identification and comparison of RNA structural features. It includes read processing and alignment, clustering of duplexes, data exploration, folding and comparisons of RNA structures. rnaCrosslinkOO also enables comparisons between conditions, the identification of inter-RNA interactions, and the incorporation of reactivity data to improve structure prediction. AVAILABILITY AND IMPLEMENTATION rnaCrosslinkOO is freely available to non-commercial users and implemented in R, with the source code and documentation accessible at [https://CRAN.R-project.org/package=rnaCrosslinkOO]. The software is supported on Linux, macOS, and Windows platforms.
{"title":"rnaCrosslinkOO: An Object-Oriented R Package for the Analysis of RNA Structural Data Generated by RNA Crosslinking Experiments.","authors":"Jonathan L. Price, Omer Ziv, M. Pinckert, Andrew Lim, Eric A. Miska","doi":"10.1093/bioinformatics/btae193","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae193","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-10","publicationTypes":"Journal Article"}
Pub Date: 2024-04-10 DOI: 10.1093/bioinformatics/btae201
Christian Carrizosa, Dag E Undlien, Magnus D Vigeland
MOTIVATION Cosegregation analysis is a powerful tool for identifying pathogenic genetic variants, but its implementation remains challenging. Existing software is either limited in scope or too demanding for many end users. Moreover, current solutions lack methods for assessing the robustness of cosegregation evidence, which is important due to its reliance on uncertain estimates. RESULTS We present shinyseg, a comprehensive web application for clinical cosegregation analysis. Our app streamlines penetrance specification based on either liability classes or epidemiological data such as risks, hazard ratios, and age of onset distribution. In addition, it incorporates sensitivity analyses to assess the robustness of cosegregation evidence, and offers support in clinical interpretation. AVAILABILITY AND IMPLEMENTATION The shinyseg app is freely available at https://chrcarrizosa.shinyapps.io/shinyseg, with documentation and complete R source code on https://chrcarrizosa.github.io/shinyseg and https://github.com/chrcarrizosa/shinyseg. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"shinyseg: a web application for flexible cosegregation and sensitivity analysis.","authors":"Christian Carrizosa, Dag E Undlien, Magnus D Vigeland","doi":"10.1093/bioinformatics/btae201","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae201","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-10","publicationTypes":"Journal Article"}
Pub Date : 2024-04-10 DOI: 10.1093/bioinformatics/btae194
Antonio Di Maria, Lorenzo Bellomo, Fabrizio Billeci, Alfio Cardillo, S. Alaimo, Paolo Ferragina, Alfredo Ferro, A. Pulvirenti
MOTIVATION The rapid growth of biomedical literature makes it increasingly hard for scientists to keep pace with the discoveries on which they build their studies. Computational tools have therefore become more widespread, and among them network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks for user-defined biomedical topics on top of the available literature remains challenging. RESULTS We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts (full texts or abstracts of PubMed Central papers, free text, or PDFs uploaded by users) and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and distills well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant Precision-Recall metrics when compared to state-of-the-art approaches. AVAILABILITY https://netme.click/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics.
{"title":"A web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph.","authors":"Antonio Di Maria, Lorenzo Bellomo, Fabrizio Billeci, Alfio Cardillo, S. Alaimo, Paolo Ferragina, Alfredo Ferro, A. Pulvirenti","doi":"10.1093/bioinformatics/btae194","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae194","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-10","publicationTypes":"Journal Article"}
Pub Date : 2024-04-05 DOI: 10.1093/bioinformatics/btae183
F. De Paoli, Silvia Berardelli, I. Limongelli, E. Rizzo, S. Zucca
MOTIVATION In the modern era of genomic research, the scientific community is witnessing an explosive growth in the volume of published findings. While this abundance of data offers invaluable insights, it also places a pressing responsibility on genetic professionals and researchers to stay informed about the latest findings and their clinical significance. Genomic variant interpretation currently faces the challenge of identifying the most up-to-date and relevant scientific papers while also extracting meaningful information to accelerate the process from clinical assessment to reporting. Computer-aided literature search and summarization can play a pivotal role in this context. By synthesizing complex genomic findings into concise, interpretable summaries, this approach facilitates the translation of extensive genomic datasets into clinically relevant insights. RESULTS To bridge this gap, we present VarChat (varchat.engenome.com), an innovative tool based on generative AI, developed to find the fragmented scientific literature associated with genomic variants and summarize it into brief yet informative texts. VarChat provides users with a concise description of specific genetic variants, detailing their impact on related proteins and possible effects on human health. Additionally, VarChat offers direct links to related, trustworthy scientific sources and encourages deeper research. AVAILABILITY varchat.engenome.com.
{"title":"VarChat: the generative AI assistant for the interpretation of human genomic variations.","authors":"F. De Paoli, Silvia Berardelli, I. Limongelli, E. Rizzo, S. Zucca","doi":"10.1093/bioinformatics/btae183","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae183","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-05","publicationTypes":"Journal Article"}