Pub Date: 2024-03-01; Epub Date: 2024-02-27; DOI: 10.1089/cmb.2023.0283
Xinyu Gu, Yuanyuan Qi, Mohammed El-Kebir
The design of an RNA sequence v that encodes an input target protein sequence w is a crucial aspect of messenger RNA (mRNA) vaccine development. There is an exponential number of possible RNA sequences for a single target protein due to codon degeneracy. These potential RNA sequences can assume various secondary structure conformations, each with a distinct minimum free energy (MFE), which affects thermodynamic stability and mRNA half-life. Furthermore, species-specific codon usage bias, quantified by the codon adaptation index (CAI), plays a vital role in translation efficiency. While earlier studies focused on optimizing either MFE or CAI, recent research has underscored the advantages of optimizing both objectives simultaneously. However, optimizing one objective comes at the expense of the other. In this work, we present the Pareto Optimal RNA Design problem, which aims to identify the set of Pareto optimal solutions, that is, solutions for which no alternative solution exhibits better MFE and CAI values. Our algorithm DEsign RNA (DERNA) uses the weighted sum method to enumerate the Pareto front by optimizing convex combinations of both objectives. We use dynamic programming to solve each convex combination in O(|w|^3) time and O(|w|^2) space. Compared with CDSfold, a previous approach that only optimizes MFE, we show on a benchmark data set that DERNA obtains solutions with identical MFE but superior CAI. Moreover, we show that DERNA matches the solution quality of LinearDesign, a recent approach that similarly seeks to balance MFE and CAI. We conclude by demonstrating our method's potential for mRNA vaccine design for the SARS-CoV-2 spike protein.
Title: DERNA Enables Pareto Optimal RNA Design; Journal: Journal of Computational Biology
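The weighted sum idea described in the DERNA abstract can be illustrated on a toy set of candidate designs. The sketch below is not DERNA itself: it skips the dynamic program entirely and assumes each candidate has already been summarized as a hypothetical (MFE, CAI) pair. The function names `pareto_front` and `weighted_sum_best`, the scalarization `lam * MFE - (1 - lam) * CAI`, and all numbers are illustrative assumptions.

```python
from typing import List, Tuple

# (mfe, cai): lower MFE is better, higher CAI is better. Values are made up.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


def weighted_sum_best(points: List[Point], lam: float) -> Point:
    """Minimize one convex combination: lam * MFE - (1 - lam) * CAI (assumed form)."""
    return min(points, key=lambda p: lam * p[0] - (1.0 - lam) * p[1])


candidates = [(-10.0, 0.60), (-8.0, 0.80), (-5.0, 0.95), (-7.0, 0.70), (-9.0, 0.62)]
front = pareto_front(candidates)
# Sweep the weight over a grid and collect the distinct optima.
sweep = sorted({weighted_sum_best(candidates, k / 100) for k in range(101)})
print(front)
print(sweep)
```

On this toy set every point returned by the sweep is Pareto optimal, but the sweep returns only three of the four Pareto optimal points: (-9.0, 0.62) is never the minimizer of any convex combination because it does not lie on the convex hull of the front. This is a general property of weighted sum scalarization, not a statement about DERNA's results.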
Pub Date: 2024-02-01; Epub Date: 2023-11-28; DOI: 10.1089/cmb.2023.0112
Jianhua Jia, Genqiang Wu, Meifang Li
Lysine glycation is one of the most significant protein post-translational modifications; it changes the properties of proteins and causes them to become dysfunctional. Accurately identifying glycation sites helps to understand the biological function and potential mechanism of glycation in disease treatments. However, experimental methods are generally inefficient and costly, so effective computational methods need to be developed. In this study, we propose a new model called iGly-IDN based on an improved densely connected convolutional network (DenseNet). First, one-hot encoding is adopted to obtain the original feature maps. Afterward, the improved DenseNet is adopted to capture feature information, weighted by importance, during feature learning. According to the experimental results, accuracy (Acc) reaches 66% and the Matthews correlation coefficient reaches 0.33 on the independent testing data set, which indicates that iGly-IDN can provide more effective glycation site identification than current predictors.
Title: iGly-IDN: Identifying Lysine Glycation Sites in Proteins Based on Improved DenseNet; Journal: Journal of Computational Biology
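Two ingredients named in the iGly-IDN abstract, one-hot encoding and the Matthews correlation coefficient, are simple enough to sketch directly. The code below is a minimal illustration, not the authors' pipeline: the 20-letter alphabet order, the all-zero row for unknown residues, and the function names `one_hot` and `mcc` are assumptions.

```python
import math
from typing import List

# Assumed residue order; any fixed order works as long as it is used consistently.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a peptide as a len(seq) x 20 binary matrix.

    Residues outside the alphabet (for example a padding symbol) map to an
    all-zero row; this is one common convention, assumed here.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


encoded = one_hot("AKX")
print(len(encoded), len(encoded[0]))
print(mcc(50, 50, 0, 0), mcc(25, 25, 25, 25))
```

An MCC of 1.0 corresponds to perfect prediction and 0.0 to chance-level prediction on balanced counts, which helps put a reported value such as 0.33 in context.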
The effective reproduction number (Rt) is one of the most important epidemiological parameters; it informs the monitoring of disease trends and the adjustment of prevention and control policies. However, few studies have focused on the performance of common computational methods for estimating Rt. The purpose of this article is to compare the performance of three computational methods for Rt: the time-dependent (TD) method, the new time-varying (NT) method, and the sequential Bayesian (SB) method. Four evaluation metrics (accuracy, correlation coefficient, similarity based on trend, and dynamic time warping distance) were used to compare the effectiveness of the three methods under different time lags and time windows. The results showed that the NT method was a better choice for real-time monitoring and analysis of the epidemic in the middle and late stages of the infectious disease. The TD method reflected changes in the number of cases stably and accurately, and was more suitable for monitoring the change of Rt during the whole course of the outbreak. When the data were relatively stable, the SB method also provided a reliable estimate of Rt, but its error increased when the fluctuation in the number of cases increased. These results can guide the selection of appropriate Rt estimation methods and help make policy adjustments more timely and effective as Rt changes.
Title: Comparing the Performance of Three Computational Methods for Estimating the Effective Reproduction Number; Authors: Zihan Wang, Mengxia Xu, Zonglin Yang, Yu Jin, Yong Zhang; DOI: 10.1089/cmb.2023.0065; Pub Date: 2024-02-01; Journal: Journal of Computational Biology
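Of the four evaluation metrics listed in the abstract, dynamic time warping (DTW) distance is the least self-explanatory, so a small sketch may help. This is the textbook DTW recurrence with an absolute-difference local cost; whether the article uses this exact variant (cost function, window constraints, normalization) is not stated in the abstract, so treat those choices and the name `dtw_distance` as assumptions. The Rt values are invented.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


reference_rt = [1.8, 1.5, 1.2, 0.9]      # hypothetical reference curve
lagged_rt = [1.8, 1.8, 1.5, 1.2, 0.9]    # same shape, delayed by one step
print(dtw_distance(reference_rt, lagged_rt))
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))
```

Because DTW may stretch one series against the other, an estimate that has the right shape but lags the reference scores well, which is why it complements point-wise measures such as accuracy or correlation when time lags are varied.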
Pub Date: 2024-02-01; Epub Date: 2023-12-13; DOI: 10.1089/cmb.2023.0097
Xiaoyang Xiang, Jiaxuan Gao, Yanrui Ding
Using wet-lab experimental methods to discover new thermophilic proteins or improve protein thermostability is time-consuming and expensive. Machine learning methods have shown powerful performance in the study of protein thermostability in recent years. However, making full use of multiview sequence information to predict thermostability effectively remains a challenge. In this study, we propose a deep learning-based classifier named DeepPPThermo that fuses classical sequence features with deep learning representation features to classify thermophilic and mesophilic proteins. In this model, a deep neural network (DNN) and a bidirectional long short-term memory (Bi-LSTM) network are used to mine hidden features. Furthermore, local attention and global attention mechanisms assign different importance to the multiview features. The fused features are fed to a fully connected network classifier to distinguish thermophilic from mesophilic proteins. Our model is comprehensively compared with advanced machine learning and deep learning algorithms and is shown to perform better. We further compare the effects of removing different features on the classification results, demonstrating the importance of each feature and the robustness of the model. Our DeepPPThermo model can be further used to explore protein diversity, identify new thermophilic proteins, and guide directed mutations of mesophilic proteins.
Title: DeepPPThermo: A Deep Learning Framework for Predicting Protein Thermostability Combining Protein-Level and Amino Acid-Level Features; Journal: Journal of Computational Biology
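The statement that attention mechanisms "give different importance to multiview features" can be made concrete with a generic softmax-weighted fusion. The sketch below is not the DeepPPThermo architecture, whose layers and score functions are not given in the abstract: the fixed scores stand in for values a trained network would produce, and `softmax` and `attention_fuse` are hypothetical helper names.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(views: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted sum of equally sized feature vectors, one weight per view."""
    weights = softmax(scores)
    dim = len(views[0])
    return [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]


# Two hypothetical views of one protein, e.g. classical and learned features.
views = [[1.0, 0.0], [0.0, 1.0]]
print(attention_fuse(views, [0.0, 0.0]))   # equal scores: plain average
print(attention_fuse(views, [10.0, 0.0]))  # first view dominates
```

With equal scores the fusion reduces to averaging the views; a large score gap lets one view dominate, which is the sense in which attention reweights feature sources.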
Pub Date: 2024-02-01; Epub Date: 2024-01-25; DOI: 10.1089/cmb.2023.0115
Junrong Song, Zhiming Song, Jinpeng Zhang, Yuanli Gong
Identifying cancer subtype-specific driver genes among a large number of irrelevant passengers is crucial for targeted therapy in cancer treatment. Recently, the rapid accumulation of large-scale cancer genomics data from multiple institutions has presented remarkable opportunities for identifying cancer subtype-specific driver genes. However, insufficient subtype samples, privacy issues, and the heterogeneity of aberration events pose great challenges to identifying these genes precisely. To address this, we introduce privatedriver, the first model for identifying subtype-specific driver genes that integrates genomics data from multiple institutions in a privacy-preserving, collaborative manner. Identifying subtype-specific cancer driver genes with privatedriver involves two steps: genomics data integration and collaborative training. In the integration step, the aberration events from multiple genomics data sources are combined for each institution using the forward and backward propagation method of NetICS. In the collaborative training step, each institution uses the federated learning framework to upload encrypted model parameters, rather than raw data, to train a global model with the non-negative matrix factorization algorithm. We applied privatedriver to head and neck squamous cell and colon cancer data from The Cancer Genome Atlas website and evaluated it against two benchmarks using the macro F-score. The comparison demonstrates that privatedriver achieves results comparable to centralized learning models and outperforms most other non-privacy-preserving models, all while ensuring the confidentiality of patient information. We also demonstrate that, for varying predicted driver gene distributions across subtypes, our model fully considers subtype heterogeneity and identifies subtype-specific driver genes corresponding to the given prognosis and therapeutic effect. The success of privatedriver shows that identifying cancer subtype-specific driver genes while protecting data is feasible and effective, providing new insights for future privacy-preserving driver gene identification studies.
Title: Privacy-Preserving Identification of Cancer Subtype-Specific Driver Genes Based on Multigenomics Data with Privatedriver; Journal: Journal of Computational Biology
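The evaluation metric named in the privatedriver abstract, the macro F-score, averages per-class F1 scores so that small subtypes count as much as large ones. The following is a minimal sketch of the standard definition; the function name `macro_f1` and the toy labels are assumptions, and the abstract does not say how the benchmarks define the classes.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores over the classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(0.0 if denom == 0 else 2 * tp / denom)
    return sum(scores) / len(scores)


truth = ["driver", "driver", "passenger", "passenger"]
predicted = ["driver", "passenger", "passenger", "passenger"]
print(round(macro_f1(truth, predicted), 4))
```

In this toy case the "driver" class has F1 = 2/3 and the "passenger" class has F1 = 0.8, so the macro average is about 0.7333.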
Pub Date: 2024-02-01; Epub Date: 2024-02-02; DOI: 10.1089/cmb.2023.0149
Aleksey Ogurtsov, Gelio Alves, Alex Rubio, Brendan Joyce, Björn Andersson, Roger Karlsson, Edward R B Moore, Yi-Kuo Yu
Although many user-friendly workflows exist for identifying peptides and proteins in mass-spectrometry-based proteomics, there is a need for easy-to-use, fast, and accurate workflows for identifying microorganisms and antimicrobial-resistant proteins and for estimating biomass. Identification of microorganisms is a computationally demanding task that requires querying thousands of MS/MS spectra against a database containing thousands to tens of thousands of microorganisms. Existing software cannot handle such a task in a time-efficient manner, taking hours to process a single MS/MS experiment. Another paramount factor is the need for accurate statistical significance, both to properly control the proportion of false discoveries among the identified microorganisms and antimicrobial-resistant proteins and to provide robust biomass estimation. Recently, we developed the Microorganism Classification and Identification (MiCId) workflow, which assigns accurate statistical significance to identified microorganisms, antimicrobial-resistant proteins, and biomass estimates. MiCId's workflow is also computationally efficient, taking about 6-17 minutes to process a tandem mass-spectrometry (MS/MS) experiment using computer resources that are available in most laptop and desktop computers, making it a portable workflow. To make data analysis accessible to a broader range of users, beyond those familiar with the Linux environment, we have developed a graphical user interface (GUI) for MiCId's workflow. The GUI brings all the functionality of MiCId's workflow to users in a friendly interface, along with tools for data analysis, visualization, and export of results.
Title: MiCId GUI: The Graphical User Interface for MiCId, a Fast Microorganism Classification and Identification Workflow with Accurate Statistics and High Recall; Journal: Journal of Computational Biology; Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10874827/pdf/
Pub Date: 2024-02-01; Epub Date: 2023-10-27; DOI: 10.1089/cmb.2023.0122
Guy Karlebach, Peter N Robinson
Models of gene regulatory networks (GRNs) capture the dynamics of the regulatory processes that occur within the cell as a means of understanding the variability observed in gene expression between different conditions. Arguably the simplest mathematical construct used for modeling is the Boolean network, which dictates a set of logical rules for transitions between states described as Boolean vectors. Due to the complexity of gene regulation and the limitations of experimental technologies, in most cases knowledge about regulatory interactions and Boolean states is partial. In addition, the logical rules themselves are not known a priori. Our goal in this work is to create an algorithm that finds the network that fits the data optimally and identifies the network states that correspond to the noise-free data. We present a novel methodology for integrating experimental data and searching for the optimal consistent structure by optimizing a linear objective function under a set of linear constraints. In addition, we extend our methodology into a heuristic that alleviates the computational complexity of the problem for datasets generated by single-cell RNA-Sequencing (scRNA-Seq). We demonstrate the effectiveness of these tools using simulated data as well as a publicly available scRNA-Seq dataset and the GRN associated with it. Our methodology will enable researchers to obtain a better understanding of the dynamics of GRNs and their biological role.
Title: Computing Minimal Boolean Models of Gene Regulatory Networks; Journal: Journal of Computational Biology
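For readers unfamiliar with Boolean networks, the "set of logical rules for transition between states" can be shown with a three-gene toy model. This sketch only simulates a given network under synchronous updates; it does not implement the paper's optimization over linear constraints, and the genes, rules, and helper names `step` and `trajectory` are invented for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical rules: C represses A, A activates B, and C needs both A and B.
RULES: Dict[str, Callable[[State], bool]] = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}


def step(state: State, rules: Dict[str, Callable[[State], bool]] = RULES) -> State:
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def trajectory(state: State, n_steps: int) -> List[State]:
    """The start state followed by n_steps successive updates."""
    states = [state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


start = {"A": True, "B": False, "C": False}
for s in trajectory(start, 5):
    print("".join("1" if s[g] else "0" for g in "ABC"))
```

Starting from 100, this toy network visits 110, 111, 011, and 000 before returning to 100, so the trajectory is a cycle of length five. Fitting a model means finding rules whose trajectories agree with noisy, partially observed states of this kind.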
Pub Date: 2024-01-01; Epub Date: 2024-01-04; DOI: 10.1089/cmb.2021.0613
Ivan A Croydon-Veleslavov, Michael P H Stumpf
Single-cell data afford unprecedented insights into molecular processes. But the complexity and size of these data sets have proved challenging and have given rise to a large armory of statistical and machine learning approaches. The majority of approaches focus on either describing features of these data or making predictions and classifying unlabeled samples. In this study, we introduce repeated decision stumping (ReDX) as a method to distill simple models from single-cell data. We develop decision trees of depth one (hence "stumps") to identify, in an inductive manner, gene products involved in driving cell fate transitions; in applications to published data, we are able to discover the key players involved in these processes in an unbiased manner without prior knowledge. Our algorithm deliberately targets the simplest possible candidate hypotheses that can be extracted from complex high-dimensional data. There are three reasons for this: (1) the predictions become straightforwardly testable hypotheses; (2) the identified candidates form the basis for further mechanistic model development, for example, for engineering and synthetic biology interventions; and (3) this approach complements existing descriptive modeling approaches and frameworks. The approach is computationally efficient, has remarkable predictive power, including in simulation studies where the ground truth is known, and yields robust and statistically stable predictors; the same set of candidates is generated by applying the algorithm to different subsamples of experimental data.
{"title":"Repeated Decision Stumping Distils Simple Rules from Single-Cell Data.","authors":"Ivan A Croydon-Veleslavov, Michael P H Stumpf","doi":"10.1089/cmb.2021.0613","journal":{"name":"Journal of Computational Biology"},"PeriodicalIF":1.7,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139087092","RegionNum":4,"RegionCategory":"Biology"}
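The depth-one decision trees described in the ReDX abstract above can be illustrated with a short sketch. The abstract does not spell out the procedure, so the following is only one plausible reading: fit the best single-gene threshold split by Gini impurity, record that gene, remove it, and repeat. The function names (`gini`, `best_stump`, `repeated_stumps`), the impurity criterion, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

Stump = Tuple[float, int, float]  # (weighted impurity, feature index, threshold)


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(X: List[List[float]], y: List[int],
               features: Sequence[int]) -> Optional[Stump]:
    """Best depth-one split over the given features, or None if no split exists."""
    best: Optional[Stump] = None
    n = len(y)
    for f in features:
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def repeated_stumps(X: List[List[float]], y: List[int], rounds: int) -> List[Stump]:
    """Fit a stump, drop the feature it used, and repeat (hypothetical reading of ReDX)."""
    remaining = list(range(len(X[0])))
    picked: List[Stump] = []
    for _ in range(rounds):
        stump = best_stump(X, y, remaining)
        if stump is None:  # no remaining feature can be split
            break
        picked.append(stump)
        remaining.remove(stump[1])
    return picked


# Toy data: rows are cells, columns are three hypothetical genes.
X = [
    [0.1, 5.0, 1.0],
    [0.9, 6.0, 1.0],
    [0.2, 1.0, 1.0],
    [0.3, 0.5, 1.0],
]
y = [1, 1, 0, 0]  # hypothetical cell-fate labels
ranked = repeated_stumps(X, y, rounds=3)
print([feature for _, feature, _ in ranked])  # [1, 0]
```

Gene 1 separates the two labels perfectly and is picked first, gene 0 gives a weaker split, and the constant gene 2 offers no split at all, so the loop stops after two rounds. Each recorded stump is a single gene-and-threshold rule, which is what makes the output directly testable.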
Pub Date: 2024-01-01; Epub Date: 2023-11-17; DOI: 10.1089/cmb.2023.0212
Minh Hoang, Guillaume Marçais, Carl Kingsford
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but these lack the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
{"title":"Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme.","authors":"Minh Hoang, Guillaume Marçais, Carl Kingsford","doi":"10.1089/cmb.2023.0212","journal":{"name":"Journal of Computational Biology"},"PeriodicalIF":1.7,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10794853/pdf/","platform":"Semanticscholar","paperid":"136397678","RegionNum":4,"RegionCategory":"Biology","PMCID":"OA"}
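The two baseline schemes named in the abstract above can be sketched briefly to make the density and coverage terms concrete. This is a minimal illustration under stated assumptions: k-mers are ordered lexicographically (practical tools typically use a hash-based order), ties are broken toward the leftmost position, and the function names are hypothetical. It covers plain minimizers and closed syncmers only; it does not implement masked minimizers or the optimization algorithm from the paper.

```python
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """All length-k substrings of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer_positions(seq: str, k: int, w: int) -> List[int]:
    """Positions of the smallest k-mer in every window of w consecutive k-mers."""
    km = kmers(seq, k)
    chosen = set()
    for start in range(len(km) - w + 1):
        window = range(start, start + w)
        # Lexicographic order with leftmost tie-breaking (an assumption).
        chosen.add(min(window, key=lambda i: (km[i], i)))
    return sorted(chosen)


def closed_syncmer_positions(seq: str, k: int, s: int) -> List[int]:
    """Positions of k-mers whose smallest s-mer is their first or last s-mer."""
    chosen = []
    for i, km in enumerate(kmers(seq, k)):
        sub = kmers(km, s)
        smallest = min(sub)
        if sub[0] == smallest or sub[-1] == smallest:
            chosen.append(i)
    return chosen


def density(positions: List[int], seq: str, k: int) -> float:
    """Fraction of all k-mers in seq that were sampled."""
    return len(positions) / (len(seq) - k + 1)


seq = "ACGTACGTGA"  # toy sequence
mins = minimizer_positions(seq, k=3, w=3)
syncs = closed_syncmer_positions(seq, k=4, s=2)
print(mins, density(mins, seq, 3))
print(syncs, density(syncs, seq, 4))
```

Because a minimizer is chosen in every window, consecutive sampled positions are never more than `w` apart, which is the coverage guarantee the abstract refers to. A syncmer is selected by looking only at the k-mer itself, so a substitution outside that k-mer cannot change whether it is sampled, but nothing bounds the gap between selected positions.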
Pub Date: 2024-01-01; Epub Date: 2023-11-21; DOI: 10.1089/cmb.2023.0041
Hajoung Lee, Jaejik Kim
The analysis of gene expression data has made significant contributions to understanding disease mechanisms and developing new drugs and therapies. In such analysis, gene selection is often required for identifying informative and relevant genes and removing redundant and irrelevant ones. However, this is not an easy task, as gene expression data pose inherent challenges such as ultra-high dimensionality, biological noise, and measurement errors. This study focuses on the measurement errors in gene selection problems. Typically, high-throughput experiments have their own intrinsic measurement errors, which can increase the number of falsely discovered genes. To alleviate this problem, this study proposes a gene selection method that takes measurement errors into account using generalized linear measurement error models. The method iterates filtering and selection steps until convergence, leading to fewer false positives and providing stable results under measurement errors. The performance of the proposed method is demonstrated through simulation studies, and the method is applied to a lung cancer data set.
{"title":"A Gene Selection Method Considering Measurement Errors.","authors":"Hajoung Lee, Jaejik Kim","doi":"10.1089/cmb.2023.0041","journal":{"name":"Journal of Computational Biology"},"PeriodicalIF":1.7,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"138444862","RegionNum":4,"RegionCategory":"Biology"}
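To give a concrete sense of why measurement error matters for the gene selection abstract above, the sketch below simulates the classical additive error model for a single gene in the simplest linear case. It is not the authors' method, which uses generalized linear measurement error models with iterative filtering and selection. All names, the simulated data, and the assumption that the error variance `sigma_u ** 2` is known are illustrative choices.

```python
import random
from typing import List


def slope(xs: List[float], ys: List[float]) -> float:
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def variance(xs: List[float]) -> float:
    """Unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


rng = random.Random(0)   # fixed seed so the run is reproducible
n = 20000
beta = 2.0               # true effect of expression on the outcome
sigma_u = 1.0            # measurement error s.d., assumed known here

x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # true expression
w = [xi + rng.gauss(0.0, sigma_u) for xi in x]     # observed expression W = X + U
y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]  # outcome

naive = slope(w, y)  # regression on the error-prone measurement
# Reliability ratio var(X) / var(W), estimated by the method of moments.
reliability = (variance(w) - sigma_u ** 2) / variance(w)
corrected = naive / reliability

print(f"true-covariate slope: {slope(x, y):.3f}")
print(f"naive slope:          {naive:.3f}")
print(f"corrected slope:      {corrected:.3f}")
```

With equal signal and error variances the reliability ratio is about 0.5, so the naive slope is pulled to roughly half of `beta`, and dividing by the estimated ratio approximately recovers it. This single-gene case only shows the bias; in settings with many correlated genes, ignoring the error can distort which genes appear associated, which is the problem the abstract targets.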