Pub Date: 2024-04-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae053
Anže Božič, Rudolf Podgornik
Motivation: Charged amino acid residues on the spike protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have been shown to influence its binding to different cell surface receptors, its non-specific electrostatic interactions with the environment, and its structural stability and conformation. It is therefore important to obtain a good understanding of amino acid mutations that affect the total charge on the spike protein which have arisen across different SARS-CoV-2 lineages during the course of the virus' evolution.
Results: We analyse the change in the number of ionizable amino acids and the corresponding total charge on the spike proteins of almost 2200 SARS-CoV-2 lineages that have emerged over the span of the pandemic. Our results show that the previously observed trend toward an increase in the positive charge on the spike protein of SARS-CoV-2 variants of concern has essentially stopped with the emergence of the early omicron variants. Furthermore, recently emerged lineages show a greater diversity in terms of their composition of ionizable amino acids. We also demonstrate that the patterns of change in the number of ionizable amino acids on the spike protein are characteristic of related lineages within the broader clade division of the SARS-CoV-2 phylogenetic tree. Due to the ubiquity of electrostatic interactions in the biological environment, our findings are relevant for a broad range of studies dealing with the structural stability of SARS-CoV-2 and its interactions with the environment.
Availability and implementation: The data underlying the article are available in the Supplementary material.
Title: Changes in total charge on spike protein of SARS-CoV-2 in emerging lineages. Journal: Bioinformatics advances. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11031363/pdf/
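The link between ionizable residue counts and total charge can be illustrated with a minimal sketch. It assumes textbook side-chain pKa values and a simple Henderson-Hasselbalch model, ignores chain termini, and treats every cysteine as free; none of these choices, nor the function names, are taken from the article.

```python
from collections import Counter

# Approximate side-chain pKa values (textbook figures; an assumption, not from the article).
PKA_ACIDIC = {"D": 3.7, "E": 4.3, "Y": 10.1, "C": 8.3}  # carry -1 when deprotonated
PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0}            # carry +1 when protonated


def ionizable_counts(seq):
    """Count each ionizable residue type in a one-letter amino acid sequence."""
    counts = Counter(seq.upper())
    return {aa: counts.get(aa, 0) for aa in (*PKA_ACIDIC, *PKA_BASIC)}


def net_charge(seq, ph=7.4):
    """Estimate the net charge (in units of e) at the given pH; termini are ignored."""
    counts = ionizable_counts(seq)
    charge = 0.0
    for aa, pka in PKA_BASIC.items():
        # Fraction of basic side chains that are protonated (+1).
        charge += counts[aa] / (1.0 + 10 ** (ph - pka))
    for aa, pka in PKA_ACIDIC.items():
        # Fraction of acidic side chains that are deprotonated (-1).
        charge -= counts[aa] / (1.0 + 10 ** (pka - ph))
    return charge


def charge_change(reference, variant, ph=7.4):
    """Difference in estimated net charge between a variant and a reference sequence."""
    return net_charge(variant, ph) - net_charge(reference, ph)
```

Under this model, replacing an aspartate or glutamate with a lysine at neutral pH shifts the estimated charge by roughly +2, which is the kind of per-lineage change such an analysis tallies.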
Pub Date: 2024-04-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae052
George Glidden-Handgis, Travis J Wheeler
Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.
Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.
Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
Title: WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Journal: Bioinformatics advances. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11099658/pdf/
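The comparison between palindrome length and longest common substring (LCS) length can be explored with a small simulation. The sketch below is an illustration only: the function names, sequence length, alphabet, and trial count are assumptions, and it measures exact palindromes and exact common substrings rather than the alignment scores studied in the article.

```python
import random


def longest_palindrome_length(s):
    """Length of the longest substring that reads the same forwards and backwards."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        # Even values of `center` sit on a character, odd values sit between two.
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        best = max(best, right - left - 1)
    return best


def longest_common_substring_length(a, b):
    """Length of the longest contiguous run shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def compare(length=200, trials=20, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Mean longest-palindrome length vs mean LCS length against a shuffled copy."""
    rng = random.Random(seed)
    pal_total = 0
    lcs_total = 0
    for _ in range(trials):
        seq = [rng.choice(alphabet) for _ in range(length)]
        shuffled = seq[:]
        rng.shuffle(shuffled)  # permuted variant with the same letter composition
        pal_total += longest_palindrome_length(seq)
        lcs_total += longest_common_substring_length(seq, shuffled)
    return pal_total / trials, lcs_total / trials
```

Calling `compare()` with larger `length` and `trials` values gives two averages that can be checked against the pattern the abstract reports; the quadratic LCS step dominates the running time.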
Pub Date: 2024-04-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae054
Sherry Dong, Kaiwen Deng, Xiuzhen Huang
Motivation: Annotating cell types is a challenging yet essential task in analyzing single-cell RNA sequencing data. However, due to the lack of a gold standard, it is difficult to evaluate algorithms fairly, and an overfitting algorithm may be favored in benchmarks. To address this challenge, we developed a deep learning-based single-cell type prediction tool that assigns each cell to one of 265 human cell types, based on data from approximately five million cells.
Results: We achieved a median area under the ROC curve (AUC) of 0.93 when evaluated across datasets. We found that inconsistent labeling in the existing database, generated by different labs, contributed to the mistakes of the model. Therefore, we used cell ontology to correct the annotations and retrained the model, which raised the median AUC to 0.971. Our study reveals a limiting factor of the accuracy one may achieve with the current database annotation and points toward an algorithm-based correction of the gold standard for future automated cell annotation approaches.
Availability and implementation: The code is available at: https://github.com/SherrySDong/Hierarchical-Correction-Improves-Automated-Single-cell-Type-Annotation. Data used in this study are listed in Supplementary Table S1 and are retrievable at the CZI database.
Title: Single-cell type annotation with deep learning in 265 cell types for humans. Journal: Bioinformatics advances. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11031354/pdf/
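For readers unfamiliar with the metric, a median AUC can be computed as sketched below. This is a generic illustration of the pairwise-ranking definition of AUC, not the authors' evaluation code; the function names and the idea of passing one label/score pair per dataset are assumptions.

```python
from statistics import median


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def median_auc(evaluations):
    """Median AUC over an iterable of (labels, scores) pairs, e.g. one per dataset."""
    return median(roc_auc(labels, scores) for labels, scores in evaluations)
```

The pairwise form is quadratic in the number of cells, so it suits small checks; a rank-based implementation would be preferable at the scale described in the abstract.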
Motivation: Mechanistic modeling based on ordinary differential equations has led to numerous findings in systems biology by integrating prior knowledge and experimental data. However, the manual curation of knowledge necessary when constructing models poses a bottleneck. As the speed of knowledge accumulation continues to grow, there is a demand for a scalable means of constructing executable models.
Results: We previously introduced BioMASS, an open-source, Python-based framework, to construct, simulate, and analyze mechanistic models of signaling networks. With one of its features, Text2Model, BioMASS allows users to define models in a natural language-like format, thereby facilitating the construction of large-scale models. We demonstrate that Text2Model can serve as a tool for integrating external knowledge for mathematical modeling by generating Text2Model files from a pathway database or through the use of a large language model, and simulating the resulting models through BioMASS. Our findings reveal the tool's capabilities to encourage exploration from prior knowledge and pave the way for a fully data-driven approach to constructing mathematical models.
Availability and implementation: The code and documentation for BioMASS are available at https://github.com/biomass-dev/biomass and https://biomass-core.readthedocs.io, respectively. The code used in this article is available at https://github.com/okadalabipr/text2model-from-knowledge.
Title: Extending BioMASS to construct mathematical models from external knowledge. Authors: Kiwamu Arakane, Hiroaki Imoto, Fabian Ormersbach, Mariko Okada. Journal: Bioinformatics advances. Pub Date: 2024-04-04; DOI: 10.1093/bioadv/vbae042
Yingxiao Yan, T. Schillemans, Viktor Skantze, Carl Brunius
Motivation: Machine learning (ML) methods are frequently used in omics research to examine associations between molecular data and, for example, exposures and health conditions. ML is also used for feature selection to facilitate biological interpretation. Our previous MUVR algorithm was shown to generate predictions and variable selections at state-of-the-art performance. However, a general framework for assessing modeling fitness is still lacking. In addition, the ability to adjust for covariates is highly desired but largely lacking in ML. We aimed to address these issues in the new MUVR2 framework.
Results: The MUVR2 algorithm was developed to include the regularized regression framework elastic net in addition to partial least squares and random forest modeling. Compared with other cross-validation strategies, MUVR2 consistently showed state-of-the-art performance, including variable selection, while minimizing overfitting. Testing on simulated and real-world data, we also showed that MUVR2 allows for adjustment for covariates using elastic net modeling, but not using partial least squares or random forest.
Availability and implementation: Algorithms, data, scripts, and a tutorial are open source under the GPL-3 license and available in the MUVR2 R package at https://github.com/MetaboComp/MUVR2.
Title: Adjusting for covariates and assessing modeling fitness in machine learning using MUVR2. Journal: Bioinformatics advances. Pub Date: 2024-04-04; DOI: 10.1093/bioadv/vbae051
S. M. Al Sium, Estefania Torrejón, S. Chowdhury, Rubaiat Ahmed, Aakriti Jain, Mirko Treccani, Laura Veschetti, Arsalan Riaz, Pradeep Eranti, Gabriel J Olguín-Orellana
Summary: The 19th ISCB Student Council Symposium (SCS2023) organized by ISCB-SC adopted a hybrid format for the first time, allowing participants to engage in-person in Lyon, France, and virtually via an interactive online platform. The symposium prioritized inclusivity, featuring on-site sessions, poster presentations, and social activities for in-person attendees, while virtual participants accessed live sessions, interactive Q&A, and a virtual exhibit hall. Attendee statistics revealed a global reach, with Europe as the major contributor. SCS2023’s success in bridging in-person and virtual experiences sets a precedent for future events in Computational Biology and Bioinformatics.
Availability and implementation: The details of the symposium, speaker information, schedules, and accepted abstracts are available in the program booklet (https://doi.org/10.5281/zenodo.8173977). For organizers interested in adopting a similar hybrid model, it would be beneficial to have access to details regarding the online platform used, the types of sessions offered, and the challenges faced. Future iterations of SCS can address these aspects to further enhance accessibility and inclusivity.
Title: Adapting beyond borders: Insights from the 19th Student Council Symposium (SCS2023), the first hybrid ISCB Student Council global event. Journal: Bioinformatics advances. Pub Date: 2024-04-03; DOI: 10.1093/bioadv/vbae028
Pub Date: 2024-03-26; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae049
Ramtin Zargari Marandi
Summary: SHapley Additive exPlanations (SHAP) is a widely used method for model interpretation. However, its full potential often remains untapped due to the absence of dedicated software tools. In response, ExplaineR, an R package to facilitate interpretation of binary classification and regression models based on clustering functionality for SHAP analysis is introduced here. It additionally offers user-interactive elements in visualizations for evaluating model performance, fairness analysis, decision-curve analysis, and a diverse range of SHAP plots. It facilitates in-depth post-prediction analysis of models, enabling users to pinpoint potentially significant patterns in SHAP plots and subsequently trace them back to instances through SHAP clustering. This functionality is particularly valuable for identifying patient subgroups in clinical cohorts, thus enhancing its role as a robust profiling tool. ExplaineR empowers users to generate comprehensive reports on machine learning outcomes, ensuring consistent and thorough documentation of model performance and interpretations.
Availability and implementation: ExplaineR 1.0.0 is available on GitHub (https://persimune.github.io/explainer/) and CRAN (https://cran.r-project.org/web/packages/explainer/index.html).
Title: ExplaineR: an R package to explain machine learning models. Journal: Bioinformatics advances. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10994716/pdf/
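The attribution idea behind SHAP can be shown with a brute-force Shapley value computation. The sketch below is a simplified illustration and not the ExplaineR API: it is written in Python rather than R, replaces "missing" features with a single baseline vector instead of averaging over background data, and all function and parameter names are assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attributions for one instance, by enumerating all feature subsets.

    predict  -- callable taking a list of feature values and returning a number
    instance -- feature values of the prediction being explained
    baseline -- reference values used for features that are "absent" from a subset
    """
    n = len(instance)

    def value(subset):
        # Features in the subset take the instance value, the rest the baseline value.
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline. The enumeration is exponential in the number of features, which is why practical SHAP tools rely on approximations or model-specific algorithms.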
Motivation: Predicting anticancer treatment response from baseline genomic data is a critical obstacle in personalized medicine. Machine learning methods are commonly used for predicting drug response from gene expression data. In the process of constructing these machine learning models, one of the most significant challenges is identifying appropriate features among a massive number of genes.
Results: In this study, we utilize features (genes) extracted by text-mining the scientific literature. Using two independent cancer pharmacogenomic datasets, we demonstrate that text-mining-based features outperform traditional feature selection techniques in machine learning tasks. In addition, our analysis reveals that text-mining feature-based machine learning models trained on in vitro data also perform well when predicting the response of in vivo cancer models. Our results demonstrate that text-mining-based feature selection is an easy-to-implement approach that is suitable for building machine learning models for anticancer drug response prediction.
Availability and implementation: https://github.com/merlab/text_features.
Title: Text-mining-based feature selection for anticancer drug response prediction. Authors: Grace Wu, Arvin Zaker, Amirhosein Ebrahimi, Shivanshi Tripathi, Arvind Singh Mer. Journal: Bioinformatics advances. Pub Date: 2024-03-26; DOI: 10.1093/bioadv/vbae047. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11009020/pdf/
Motivation: Cell-type deconvolution methods aim to infer cell composition from bulk transcriptomic data. The proliferation of developed methods, coupled with the inconsistent results obtained in many cases, highlights the pressing need for guidance in selecting appropriate methods. Additionally, the growing accessibility of single-cell RNA sequencing datasets, often accompanied by bulk expression data from related samples, enables the benchmarking of existing methods.
Results: In this study, we conduct a comprehensive assessment of 31 methods, utilizing single-cell RNA sequencing data from diverse human and mouse tissues. Employing various simulation scenarios, we reveal the efficacy of regression-based deconvolution methods, highlighting their sensitivity to reference choices. We investigate the impact of bulk-reference differences, incorporating variables such as sample, study and technology. We provide validation using a gold standard dataset from mononuclear cells and suggest a consensus prediction of proportions when ground truth is not available. We validate the consensus method on data from the stomach and study its spillover effect. Importantly, we propose the use of the critical assessment of transcriptomic deconvolution (CATD) pipeline, which encompasses functionalities for generating references and pseudo-bulks and for running the implemented deconvolution methods. CATD streamlines simultaneous deconvolution of numerous bulk samples, providing a practical solution for speeding up the evaluation of newly developed methods.
Availability and implementation: https://github.com/Papatheodorou-Group/CATD_snakemake.
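One simple way to form a consensus prediction of proportions when no ground truth is available is to average the estimates of several methods per cell type and renormalize. The sketch below assumes exactly that rule; the abstract does not specify how CATD builds its consensus, and the function name `consensus_proportions`, the method names, and the numbers are all hypothetical.

```python
def consensus_proportions(predictions):
    """Average per-cell-type proportions across methods, then renormalize to sum to 1."""
    cell_types = sorted({ct for p in predictions.values() for ct in p})
    mean = {
        ct: sum(p.get(ct, 0.0) for p in predictions.values()) / len(predictions)
        for ct in cell_types
    }
    total = sum(mean.values())
    if total == 0:
        raise ValueError("all predicted proportions are zero")
    return {ct: v / total for ct, v in mean.items()}


# Hypothetical outputs of three deconvolution methods for one bulk sample.
preds = {
    "method_a": {"T cell": 0.50, "B cell": 0.30, "Monocyte": 0.20},
    "method_b": {"T cell": 0.40, "B cell": 0.40, "Monocyte": 0.20},
    "method_c": {"T cell": 0.60, "B cell": 0.20, "Monocyte": 0.20},
}

consensus = consensus_proportions(preds)
```

A cell type missing from one method's output is treated as a proportion of zero for that method, which is a design choice of this sketch rather than a documented behaviour of the pipeline.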
CATD: a reproducible pipeline for selecting cell-type deconvolution methods across tissues. Bioinformatics Advances. Pub Date: 2024-03-23. DOI: 10.1093/bioadv/vbae048
Anna Vathrakokoili Pournara, Zhichao Miao, Ozgur Yilimaz Beker, Nadja Nolte, Alvis Brazma, Irene Papatheodorou
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11023940/pdf/
Pub Date: 2024-03-22. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae037
Apostolos Chalkis, Vissarion Fisikopoulos, Elias Tsigaridas, Haris Zafeiropoulos
We present dingo, a Python package that supports a variety of methods to sample from the flux space of metabolic models, based on state-of-the-art random walks and rounding methods. For uniform sampling, dingo's sampling methods provide significant speed-ups and outperform existing software. For example, dingo can sample from the flux space of the largest metabolic model to date (Recon3D) in less than a day on a personal computer, under several statistical guarantees; this computation is out of reach for other similar software. In addition, dingo supports common analysis methods, such as flux balance analysis and flux variability analysis, and provides visualization components. dingo contributes to the arsenal of tools in metabolic modelling by enabling flux sampling in high dimensions (on the order of thousands).
Availability and implementation: The dingo Python library is available on GitHub at https://github.com/GeomScale/dingo and the data underlying this article are available at https://doi.org/10.5281/zenodo.10423335.
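To give a feel for what sampling from a flux space involves: once steady-state constraints are eliminated (for instance by working in the null space of the stoichiometric matrix), the feasible fluxes form a bounded polytope {x : A x <= b}. The sketch below runs a plain hit-and-run random walk on a two-dimensional toy polytope. It is not dingo's API and not the random walks or rounding methods the package implements; the function `hit_and_run`, the constraint values, and the starting point are assumptions chosen for illustration.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, rng):
    """Sample points from the bounded polytope {x : A x <= b}, starting at interior point x0."""
    dim = len(x0)
    x = list(x0)
    samples = []
    for _ in range(n_samples):
        # Draw a uniformly random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Find the chord {x + t d} inside the polytope: t_min <= t <= t_max.
        t_min, t_max = -math.inf, math.inf
        for row, bi in zip(A, b):
            ad = sum(r * v for r, v in zip(row, d))
            slack = bi - sum(r * v for r, v in zip(row, x))
            if ad > 1e-12:
                t_max = min(t_max, slack / ad)
            elif ad < -1e-12:
                t_min = max(t_min, slack / ad)
        # Move to a uniformly random point on the chord.
        t = rng.uniform(t_min, t_max)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples


# Hypothetical toy flux space: 0 <= v1 <= 10, 0 <= v2 <= 10, v1 + v2 <= 12.
A = [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1]]
b = [10, 0, 10, 0, 12]

samples = hit_and_run(A, b, x0=[1.0, 1.0], n_samples=1000, rng=random.Random(0))
```

In thousands of dimensions a walk like this mixes slowly on elongated polytopes, which is the kind of difficulty that rounding steps and more advanced random walks are meant to address.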
dingo: a Python package for metabolic flux sampling. Bioinformatics Advances. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10997433/pdf/