Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae683
Timo Rachel, Eva Brombacher, Svenja Wöhrle, Olaf Groß, Clemens Kreutz
Motivation: Mathematical modelling plays a crucial role in understanding inter- and intracellular signalling processes. Currently, ordinary differential equations (ODEs) are the predominant approach in systems biology for modelling such pathways. While ODE models offer mechanistic interpretability, they also suffer from limitations, including the need to consider all relevant compounds, which results in large models that are difficult to handle numerically and require extensive data.
Results: In previous work, we introduced the retarded transient function (RTF) as an alternative method for modelling temporal responses of signalling pathways. Here, we extend the RTF approach to integrate concentration or dose dependencies into the modelling of dynamics. With this advancement, RTF modelling now fully encompasses the application range of ODE models, which comprises predictions in both time and concentration domains. Moreover, characterizing dose dependencies provides an intuitive way to investigate signalling differences between biological conditions or cell types based on their response to stimulating inputs. To demonstrate the applicability of our extended approach, we employ data from time- and dose-dependent inflammasome activation in bone marrow-derived macrophages (BMDMs) treated with nigericin sodium salt. Our results show the effectiveness of the extended RTF approach as a generic framework for modelling dose-dependent kinetics in cellular signalling. The approach results in intuitively interpretable parameters that describe signal dynamics and enables predictive modelling of time and dose dependencies even if only individual cellular components are quantified.
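To illustrate what "integrating dose dependencies into the modelling of dynamics" can look like, here is a minimal sketch. It is not the published RTF formula: it assumes a simplified response built from a sustained and a transient exponential component, with both amplitudes depending on dose through a Hill function. All function names, parameter names and values are hypothetical.

```python
import math

def hill(dose, max_value, ec50, hill_coeff):
    """Hill-type dose dependence: 0 at zero dose, max_value at saturating dose."""
    if dose <= 0:
        return 0.0
    return max_value * dose**hill_coeff / (ec50**hill_coeff + dose**hill_coeff)

def response(t, dose, p):
    """Simplified sustained-plus-transient response (assumed form, NOT the RTF)."""
    if t <= 0:
        return p["baseline"]
    # Dose enters only through the amplitudes in this sketch.
    a_sus = hill(dose, p["a_sus_max"], p["ec50"], p["h"])
    a_trans = hill(dose, p["a_trans_max"], p["ec50"], p["h"])
    rise = 1.0 - math.exp(-t / p["tau_rise"])   # onset of the signal
    decay = math.exp(-t / p["tau_decay"])       # fading of the transient part
    return p["baseline"] + a_sus * rise + a_trans * rise * decay

# Hypothetical parameter values, chosen only for illustration.
params = {
    "baseline": 0.1, "a_sus_max": 1.0, "a_trans_max": 2.0,
    "ec50": 2.0, "h": 3.0, "tau_rise": 1.0, "tau_decay": 4.0,
}

for dose in (0.0, 1.0, 10.0):
    curve = [round(response(t, dose, params), 3) for t in (0, 1, 5, 20)]
    print(f"dose={dose}: {curve}")
```

The point of the design is that a handful of parameters (amplitudes, time constants, a half-maximal dose) describe the whole time-by-dose surface, so conditions can be compared parameter by parameter.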
Availability: The presented approach is available within the MATLAB-based Data2Dynamics modelling toolbox at https://github.com/Data2Dynamics and https://zenodo.org/records/14008247 and as R code at https://github.com/kreutz-lab/RTF.
Title: Dynamic modelling of signalling pathways when ODEs are not feasible. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae694
Thomas Enzlein, Alexander Geisel, Carsten Hopf, Stefan Schmidt
Summary: Fast computational evaluation and classification of concentration responses for hundreds of metabolites represented by their mass-to-charge (m/z) ratios is indispensable for unraveling complex metabolomic drug actions in label-free, whole-cell Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI MS) bioassays. In particular, the identification of novel pharmacodynamic biomarkers to determine target engagement, potency and potential polypharmacology of drug-like compounds in high-throughput applications requires robust data interpretation pipelines. Given the large number of mass features in cell-based MALDI MS bioassays, reliable identification of true biological response patterns and their differentiation from any measurement artefacts that may be present is critical. To facilitate the exploration of metabolomic responses in complex MALDI MS datasets, we present a novel software tool, M2ara. Implemented as a user-friendly R-based shiny application, it enables rapid evaluation of Molecular High Content Screening (MHCS) assay data. Furthermore, we introduce the concept of Curve Response Score (CRS) and CRS fingerprints to enable rapid visual inspection and ranking of mass features. In addition, these CRS fingerprints allow direct comparison of cellular effects among different compounds. Beyond cellular assays, our computational framework can also be applied to MALDI MS-based (cell-free) biochemical assays in general.
Availability and implementation: The software tool, code and examples are available at https://github.com/CeMOS-Mannheim/M2ara and https://dx.doi.org/10.6084/m9.figshare.25736541.
Supplementary information: Supplementary material is available at Bioinformatics online.
Title: M2ara: unraveling metabolomic drug responses in whole-cell MALDI mass spectrometry bioassays. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae685
Lars Gabriel, Felix Becker, Katharina J Hoff, Mario Stanke
Motivation: For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs) that take a DNA sequence directly as input. Recently, Holst et al. demonstrated with their program Helixer that the accuracy of ab initio eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.
Results: We present Tiberius, a novel deep learning-based ab initio gene predictor that end-to-end integrates convolutional and long short-term memory layers with a differentiable HMM layer. Tiberius uses a custom gene prediction loss and was trained for prediction in mammalian genomes and evaluated on human and two other genomes. It significantly outperforms existing ab initio methods, achieving F1-scores of 62% at gene level for the human genome, compared to 21% for the next best ab initio method. In de novo mode, Tiberius predicts the exon-intron structure of two out of three human genes without error. Remarkably, even Tiberius's ab initio accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius's highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.
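The gene-level F1-score quoted above combines precision and recall over whole genes. The abstract does not give the exact matching rule, so the sketch below assumes that a predicted gene counts as correct only if its chromosome, strand and full exon coordinates match a reference gene exactly. The data structure and the toy coordinates are hypothetical.

```python
def gene_level_f1(predicted, reference):
    """Return (precision, recall, F1) using exact-match gene structures."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A gene is modelled as (chromosome, strand, tuple of (exon_start, exon_end)).
reference_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1000),)),
    ("chr2", "+", ((50, 150), (250, 350), (500, 600))),
]
predicted_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),   # exact match
    ("chr1", "-", ((900, 1010),)),             # exon end off by 10: counted as wrong
]

precision, recall, f1 = gene_level_f1(predicted_genes, reference_genes)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because a single wrong exon boundary invalidates the whole gene under this rule, gene-level scores are much stricter than exon- or nucleotide-level scores.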
Availability and implementation: https://github.com/Gaius-Augustus/Tiberius.
Title: Tiberius: End-to-end deep learning with an HMM for gene prediction. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae690
Soroush Mozaffari, Paula Nazarena Arrías, Damiano Clementel, Damiano Piovesan, Carlo Ferrari, Silvio C E Tosatto, Alexander Miguel Monzon
Motivation: Structured Tandem Repeats Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D-models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures.
Results: We introduce STRPsearch, a novel tool for the rapid identification, classification, and mapping of STRPs. Leveraging manually curated entries from RepeatsDB as the known conformational space of STRPs, STRPsearch employs the latest advances in structural alignment for a fast and accurate detection of repeated structural motifs in proteins, followed by an innovative approach to map units and insertions through the generation of TM-score profiles. STRPsearch is highly scalable, efficiently processing large datasets, and can be applied to both experimental structures and predicted models. Additionally, it demonstrates superior performance compared to existing tools, offering researchers a reliable and comprehensive solution for STRP analysis across diverse proteomes.
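As background for the TM-score profiles mentioned above, the sketch below computes a TM-score from per-residue distances of an already superimposed alignment, using the standard length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8. The abstract does not describe how STRPsearch turns a profile into unit boundaries; the local-maximum rule in `profile_peaks`, the clamp of d0 to 0.5 for short chains, and the toy profile are assumptions made only for illustration.

```python
def tm_d0(length):
    """Length-dependent distance scale; clamped to 0.5 for short chains (assumption)."""
    if length <= 15:
        return 0.5
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, target_length):
    """TM-score from distances (in angstroms) between aligned residue pairs."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

def profile_peaks(profile, threshold):
    """Indices of local maxima at or above threshold (hypothetical mapping rule)."""
    peaks = []
    for i in range(1, len(profile) - 1):
        is_local_max = profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]
        if profile[i] >= threshold and is_local_max:
            peaks.append(i)
    return peaks

# Perfect superposition of all 50 residues gives the maximum score.
print(tm_score([0.0] * 50, 50))

# Toy profile: score of a repeat unit aligned at successive positions of a target.
toy_profile = [0.1, 0.6, 0.2, 0.1, 0.7, 0.3]
print(profile_peaks(toy_profile, threshold=0.5))
```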
Availability and implementation: STRPsearch is coded in Python. All scripts and associated documentation are available from: https://github.com/BioComputingUP/STRPsearch.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: STRPsearch: fast detection of structured tandem repeat proteins. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae695
Caitlin G Page, Andrew Londsdale, Katrina A Mitchell, Jan Schröder, Kieran F Harvey, Alicia Oshlack
Summary: DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end-to-end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.
Availability: The R package Damsel is available for installation through the Bioconductor project at https://bioconductor.org/packages/release/bioc/html/Damsel.html and the code is available on GitHub at https://github.com/Oshlack/Damsel/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Damsel: Analysis and visualisation of DamID sequencing in R. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-18 | DOI: 10.1093/bioinformatics/btae691
Samira van den Bogaard, Pedro A Saa, Tobias B Alter
Motivation: Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in simulated phenotypes in PAMs.
Results: Here, we introduce a systematic framework for identifying the most sensitive enzyme concentrations (sEnz) in PAMs. The framework exploits the primal and dual formulations of these models to derive sensitivity coefficients based on relations between variables, constraints, and the objective function. This approach enhances our understanding of the growth-limiting factors of metabolic phenotypes under specific environmental or genetic conditions. Compared to traditional methods for calculating sensitivities, sEnz requires substantially less computation time and facilitates more intuitive comparison and analysis of sensitivities. The sensitivities calculated by sEnz cover enzymes, reactions and protein sectors, enabling a holistic overview of the factors influencing metabolism. When applied to an Escherichia coli PAM, sEnz revealed major pathways and enzymes driving overflow metabolism. Overall, sEnz offers a computationally efficient framework for understanding PAM predictions and unravelling the factors governing a particular metabolic phenotype.
Availability and implementation: sEnz is implemented in the modular toolbox for the generation and analysis of PAMs in Python (PAModelpy; v.0.0.3.3), available on PyPI (https://pypi.org/project/PAModelpy/). The source code, together with all other Python scripts and notebooks, is available on GitHub (https://github.com/iAMB-RWTH-Aachen/PAModelpy).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-15 | DOI: 10.1093/bioinformatics/btae681
Antoine Neuraz, Ghislain Vaillant, Camila Arias, Olivier Birot, Kim-Tam Huynh, Thibaut Fabacher, Alice Rogier, Nicolas Garcelon, Ivan Lerner, Bastien Rance, Adrien Coulet
Summary: Phenotyping consists of applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically from a collection of Electronic Health Records (EHRs). Because much of the clinical information in EHRs resides in free text, phenotyping from text plays an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized nature of both the content and form of clinical texts make this task particularly tedious and are a source of time and cost constraints in observational studies.
Results: To facilitate the development, evaluation and reproducibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, called medkit operations. In addition to the core of the library, we share the operations and pipelines we have already developed and invite the phenotyping community to reuse and enrich them.
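The idea of composing a pipeline from small reusable operations can be sketched generically as follows. This is not medkit's API: `Document`, `normalize_whitespace`, `make_keyword_tagger` and `run_pipeline` are hypothetical names invented for this illustration, and the keyword rule is a deliberately naive stand-in for a real phenotyping step.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Document:
    text: str
    annotations: List[str] = field(default_factory=list)

# An operation is any function that takes a document and returns a new one.
Operation = Callable[[Document], Document]

def normalize_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace and line breaks into single spaces."""
    return Document(" ".join(doc.text.split()), list(doc.annotations))

def make_keyword_tagger(keyword: str, label: str) -> Operation:
    """Build an operation that adds `label` when `keyword` occurs in the text."""
    def tag(doc: Document) -> Document:
        annotations = list(doc.annotations)
        if keyword.lower() in doc.text.lower():
            annotations.append(label)
        return Document(doc.text, annotations)
    return tag

def run_pipeline(doc: Document, operations: List[Operation]) -> Document:
    """Apply the operations in order, feeding each output to the next step."""
    for operation in operations:
        doc = operation(doc)
    return doc

pipeline = [normalize_whitespace, make_keyword_tagger("diabetes", "DIABETES")]
result = run_pipeline(Document("Patient  has type 2\n diabetes."), pipeline)
print(result.text, result.annotations)
```

Because each operation has the same input and output type, steps can be reordered, swapped or shared between pipelines without touching the others.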
Availability and implementation: medkit is available at https://github.com/medkit-lib/medkit.
Supplementary information: Documentation, examples and tutorials are available at https://medkit-lib.org/.
Title: Facilitating phenotyping from clinical texts: the medkit library. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-15 | DOI: 10.1093/bioinformatics/btae682
Sri Devan Appasamy, Craig L Zirbel
Motivation: Recent progress in RNA structure determination methods has resulted in a surge of newly solved RNA 3D structures. However, there is no user-friendly, browser-based tool for comparing and visualizing RNA motifs across multiple 3D structures.
Results: We introduce R3DMCS, a web server that allows users to compare selected RNA nucleotides across all 3D structures of a given molecule from a given species, or across all 3D structures mapped to a single Rfam family. Starting from one instance of the motif, R3DMCS retrieves, aligns, annotates, organizes, and displays 3D coordinates of corresponding sets of nucleotides from other 3D structures. With R3DMCS, one can explore conformational changes of motifs due to 3D structures being solved in different functional states or different experimental conditions. One can also investigate conservation of 3D structure across species, or changes in 3D structure due to changes in sequence.
Availability: R3DMCS is open-source software and freely available at https://rna.bgsu.edu/correspondence/ and https://github.com/BGSU-RNA/RNA-3D-correspondence.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: R3DMCS: a web server for visualizing structural variation in RNA motifs across experimental 3D structures from the same organism or across species. Journal: Bioinformatics (Oxford, England).
Pub Date: 2024-11-15 DOI: 10.1093/bioinformatics/btae679
Douglas B Craig, Sorin Drăghici
Motivation: Large Language Models (LLMs) have provided spectacular results across a wide variety of domains. However, persistent concerns about hallucination and fabrication of authoritative sources raise serious issues for their integral use in scientific research. Retrieval-augmented generation (RAG) is a technique for making data and documents, otherwise unavailable during training, available to the LLM for reasoning tasks. In addition to making dynamic and quantitative data available to the LLM, RAG provides the means by which to carefully control and trace source material, thereby ensuring results are accurate, complete and authoritative.
Results: Here we introduce LmRaC, an LLM-based tool capable of answering complex scientific questions in the context of a user's own experimental results. LmRaC allows users to dynamically build domain-specific knowledge bases from PubMed sources (RAGdom). Answers are drawn solely from this RAG with citations to the paragraph level, virtually eliminating any chance of hallucination or fabrication. These answers can then be used to construct an experimental context (RAGexp) that, along with user-supplied documents (e.g., design, protocols) and quantitative results, can be used to answer questions about the user's specific experiment. Questions about quantitative experimental data are integral to LmRaC and are supported by a user-defined and functionally extensible REST API server (RAGfun).
Availability and implementation: Detailed documentation for LmRaC along with a sample REST API server for defining user functions can be found at https://github.com/dbcraig/LmRaC. The LmRaC web application image can be pulled from Docker Hub (https://hub.docker.com) as dbcraig/lmrac.
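The retrieval step behind paragraph-level citations, as described in the abstract above, can be illustrated with a minimal sketch. This is not LmRaC's code: a simple bag-of-words cosine similarity stands in for whatever retrieval method the tool actually uses, and the function names (`tokenize`, `cosine`, `retrieve`), the `document:paragraph` identifiers, and the toy knowledge base are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, paragraphs, k=2):
    """Return up to k (citation_id, text) pairs with a non-zero match score."""
    q = Counter(tokenize(question))
    scored = [
        (cosine(q, Counter(tokenize(text))), cid, text)
        for cid, text in paragraphs.items()
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(cid, text) for score, cid, text in scored[:k] if score > 0]


# Hypothetical knowledge base keyed by "document:paragraph" identifiers.
paragraphs = {
    "docA:p1": "Retrieval-augmented generation makes documents available "
               "to the LLM for reasoning tasks.",
    "docA:p2": "Non-negative matrix factorization uncovers patterns in omics data.",
    "docB:p1": "RNA motifs can be compared across multiple 3D structures.",
}

hits = retrieve("How does retrieval-augmented generation help an LLM?", paragraphs)

# Each retrieved paragraph keeps its identifier, so an answer built from
# this context can cite its sources at the paragraph level.
prompt_context = "\n".join(f"[{cid}] {text}" for cid, text in hits)
```

Keeping the identifier attached to every retrieved paragraph is the design point: the answer can only draw on text that carries a traceable citation.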
LmRaC: a functionally extensible tool for LLM interrogation of user experimental results. Douglas B Craig, Sorin Drăghici. Bioinformatics (Oxford, England). Pub Date: 2024-11-15. DOI: 10.1093/bioinformatics/btae679
Motivation: Integrating multiple omics datasets can significantly advance our understanding of disease mechanisms, physiology, and treatment responses. However, a major challenge in multi-omics studies is the disparity in sample sizes across different datasets, which can introduce bias and reduce statistical power. To address this issue, we propose a novel framework, OmicsNMF, designed to impute missing omics data and enhance disease phenotype prediction. OmicsNMF integrates Generative Adversarial Networks (GANs) with Non-Negative Matrix Factorization (NMF). NMF is a well-established method for uncovering underlying patterns in omics data, while GANs enhance the imputation process by generating realistic data samples. This synergy aims to more effectively address sample size disparity, thereby improving data integration and prediction accuracy.
Results: For evaluation, we focused on predicting breast cancer subtypes using the imputed data generated by our proposed framework, OmicsNMF. Our results indicate that OmicsNMF consistently outperforms baseline methods. We further assessed the quality of the imputed data through survival analysis, revealing that the imputed omics profiles provide significant prognostic power for both overall survival and disease-free status. Overall, OmicsNMF effectively leverages GANs and NMF to impute missing samples while preserving key biological features. This approach shows potential for advancing precision oncology by improving data integration and analysis.
Availability and implementation: Source code is available at: https://github.com/compbiolabucf/OmicsNMF.
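As a hedged illustration of the NMF component named in the Motivation above, the following is a minimal NumPy sketch of non-negative matrix factorization using the standard Lee-Seung multiplicative updates for the Frobenius objective. It is not the OmicsNMF implementation, and the GAN part is omitted entirely; the function name `nmf`, its parameters, and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (samples x features) as W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Random non-negative initialization of both factors.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step;
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy non-negative matrix of exact rank 2 (hypothetical data, not real omics).
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 5))

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In an imputation setting, the rows of `W` would act as low-dimensional sample representations and `H` as shared feature patterns; how OmicsNMF couples these factors with a GAN is described in the paper and repository, not in this sketch.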
Optimizing Multi-Omics Data Imputation with NMF and GAN Synergy. Md Istiaq Ansari, Khandakar Tanvir Ahmed, Wei Zhang. Bioinformatics (Oxford, England). Pub Date: 2024-11-15. DOI: 10.1093/bioinformatics/btae674