Network-based methods utilize protein-protein interaction information to identify significantly perturbed subnetworks in cancer and to propose key molecular pathways. Numerous methods have been developed, but to date, a rigorous benchmark analysis to compare the performance of existing approaches is lacking. In this paper, we proposed a novel benchmarking framework using synthetic data and conducted a comprehensive analysis to investigate the ability of existing methods to detect target genes and subnetworks and to control false positives, and how they perform in the presence of topological biases at both gene and subnetwork levels. Our analysis revealed insights into algorithmic performance that were previously unattainable. Based on the results of the benchmark study, we presented a practical guide for users on how to select appropriate detection methods and protein-protein interaction networks for cancer pathway identification, and provided suggestions for future algorithm development.
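The gene-level part of such a benchmark can be illustrated with a small scoring routine. The following is a minimal sketch, assuming the synthetic data come with a known set of planted target genes; the function name, gene identifiers, and choice of metrics are hypothetical and are not taken from the paper's framework.

```python
def gene_level_scores(detected, planted):
    """Score one method's detected genes against the planted (ground-truth) genes."""
    detected, planted = set(detected), set(planted)
    true_pos = len(detected & planted)    # planted genes that were recovered
    false_pos = len(detected - planted)   # reported genes that were never planted
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(planted) if planted else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_positives": false_pos}


# Hypothetical example: four planted genes, three reported by a method.
planted_genes = {"G1", "G2", "G3", "G4"}
detected_genes = {"G1", "G2", "G7"}
scores = gene_level_scores(detected_genes, planted_genes)
```

Running the same routine on data with no planted signal would give a direct count of false positives, which is one simple way to examine false-positive control.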
{"title":"A comprehensive benchmark study of methods for identifying significantly perturbed subnetworks in cancer.","authors":"Le Yang, Runpu Chen, Steve Goodison, Yijun Sun","doi":"10.1093/bib/bbae692","DOIUrl":"10.1093/bib/bbae692","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684898/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the context of the global damage caused by coronavirus disease 2019 (COVID-19) and the emergence of the monkeypox virus (MPXV) outbreak as a public health emergency of international concern, research into methods that can rapidly test potential therapeutics during an outbreak of a new infectious disease is urgently needed. Computational drug discovery is an effective way to solve such problems. The existence of various large open databases has mitigated the time and resource consumption of traditional drug development and improved the speed of drug discovery. However, the diversity of cell lines used in various databases remains limited, and previous drug discovery methods are ineffective for cross-cell prediction. In this study, we propose a correlation-dependent connectivity map (CDCM) to achieve cross-cell predictions of drug similarity. The CDCM mainly identifies drug-drug or disease-drug relationships from the perspective of gene networks by exploring the correlation changes between genes and identifying similarities in the effects of drugs or diseases on gene expression. We validated the CDCM on multiple datasets and found that it performed well for drug identification across cell lines. A comparison with the Connectivity Map revealed that our method was more stable and performed better across different cell lines. In the application of the CDCM to COVID-19 and MPXV data, the predictions of potential therapeutic compounds for COVID-19 were consistent with several previous studies, and most of the predicted drugs were found to be experimentally effective against MPXV. This result confirms the practical value of the CDCM. With the ability to predict across cell lines, the CDCM outperforms the Connectivity Map, and it has wider application prospects and a reduced cost of use.
{"title":"CDCM: a correlation-dependent connectivity map approach to rapidly screen drugs during outbreaks of infectious diseases.","authors":"Junlei Liao, Hongyang Yi, Hao Wang, Sumei Yang, Duanmei Jiang, Xin Huang, Mingxia Zhang, Jiayin Shen, Hongzhou Lu, Yuanling Niu","doi":"10.1093/bib/bbae659","DOIUrl":"10.1093/bib/bbae659","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11658818/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142863338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chi Zhang, Yiran Cheng, Kaiwen Feng, Fa Zhang, Renmin Han, Jieqing Feng
Automatic single particle picking is a critical step in the data processing pipeline of cryo-electron microscopy structure reconstruction. In recent years, several deep learning-based algorithms have been developed, demonstrating their potential to solve this challenge. However, current methods depend heavily on manually labeled training data, which is labor-intensive to produce and prone to bias, especially for high-noise and low-contrast micrographs, resulting in suboptimal precision and recall. To address these problems, we propose UPicker, a semi-supervised transformer-based particle-picking method with a two-stage training process: unsupervised pretraining and supervised fine-tuning. During the unsupervised pretraining, an Adaptive Laplacian of Gaussian region proposal generator obtains pseudo-labels from unlabeled data for initial feature learning. For the supervised fine-tuning, UPicker needs only a small amount of labeled data to achieve high accuracy in particle picking. To further enhance model performance, UPicker employs a contrastive denoising training strategy to reduce redundant detections and accelerate convergence, along with a hybrid data augmentation strategy to deal with limited labeled data. Comprehensive experiments on both simulated and experimental datasets demonstrate that UPicker outperforms state-of-the-art particle-picking methods in terms of accuracy and robustness while requiring less labeled data than other transformer-based models. Furthermore, ablation studies demonstrate the effectiveness and necessity of each component of UPicker. The source code and data are available at https://github.com/JachyLikeCoding/UPicker.
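To make the idea of a Laplacian of Gaussian (LoG) region proposal concrete, here is a minimal sketch of plain, fixed-scale LoG blob detection on a tiny synthetic image. It is not UPicker's adaptive generator: the kernel truncation, the zero padding, the single scale, and all function names are assumptions made for illustration only.

```python
import math


def log_kernel(sigma):
    """Discrete LoG kernel (up to a constant factor), truncated at 3 * sigma."""
    radius = int(math.ceil(3 * sigma))
    kernel = []
    for dy in range(-radius, radius + 1):
        row = []
        for dx in range(-radius, radius + 1):
            r2 = dx * dx + dy * dy
            gauss = math.exp(-r2 / (2 * sigma ** 2))
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4 * gauss)
        kernel.append(row)
    return kernel, radius


def log_response(image, sigma):
    """Filter a list-of-lists image with the LoG kernel, using zero padding."""
    kernel, radius = log_kernel(sigma)
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ky in range(-radius, radius + 1):
                yy = y + ky
                if not 0 <= yy < height:
                    continue
                for kx in range(-radius, radius + 1):
                    xx = x + kx
                    if 0 <= xx < width:
                        total += kernel[ky + radius][kx + radius] * image[yy][xx]
            out[y][x] = total
    return out


def strongest_blob(image, sigma):
    """Return (row, col) of the most negative LoG response (a bright blob)."""
    response = log_response(image, sigma)
    _, position = min(
        (response[y][x], (y, x))
        for y in range(len(image))
        for x in range(len(image[0]))
    )
    return position


# Synthetic 15 x 15 "micrograph" with one bright Gaussian blob centred at (7, 7).
size, centre, spread = 15, 7, 1.5
image = [
    [math.exp(-((y - centre) ** 2 + (x - centre) ** 2) / (2 * spread ** 2))
     for x in range(size)]
    for y in range(size)
]
proposal = strongest_blob(image, sigma=2.0)
```

A bright blob produces a strongly negative LoG response at its centre; for dark particles on a bright background the sign flips, so the maximum would be taken instead. Proposals of this kind could then serve as pseudo-labels for pretraining.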
{"title":"UPicker: a semi-supervised particle picking transformer method for cryo-EM micrographs.","authors":"Chi Zhang, Yiran Cheng, Kaiwen Feng, Fa Zhang, Renmin Han, Jieqing Feng","doi":"10.1093/bib/bbae636","DOIUrl":"10.1093/bib/bbae636","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631311/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142806025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cathal Ormond, Niamh M Ryan, Mathieu Cap, William Byerley, Aiden Corvin, Elizabeth A Heron
Next-generation sequencing is widely applied to the investigation of pedigree data for gene discovery. However, identifying plausible disease-causing variants within a robust statistical framework is challenging. Here, we introduce BICEP: a Bayesian inference tool for rare variant causality evaluation in pedigree-based cohorts. BICEP calculates the posterior odds that a genomic variant is causal for a phenotype based on the variant cosegregation as well as a priori evidence such as deleteriousness and functional consequence. BICEP can correctly identify causal variants for phenotypes with both Mendelian and complex genetic architectures, outperforming existing methodologies. Additionally, BICEP can correctly down-weight common variants that are unlikely to be involved in phenotypic liability in the context of a pedigree, even if they have reasonable cosegregation patterns. The output metrics from BICEP allow for the quantitative comparison of variant causality within and across pedigrees, which is not possible with existing approaches.
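The posterior odds mentioned above follow the usual Bayesian form: posterior odds equal prior odds multiplied by a Bayes factor. The sketch below shows only that arithmetic. Mapping the prior to annotation evidence (deleteriousness, functional consequence) and the Bayes factor to cosegregation evidence is an assumption based on the abstract, and the numbers and function name are hypothetical, not BICEP's actual models or output.

```python
import math


def posterior_odds(prior_probability, bayes_factor):
    """Posterior odds of causality = prior odds * Bayes factor."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    return prior_odds * bayes_factor


# Hypothetical inputs for a single variant in one pedigree.
prior = 0.01          # assumed prior probability of causality from annotations
bayes_factor = 500.0  # assumed strength of the cosegregation evidence

odds = posterior_odds(prior, bayes_factor)
log10_odds = math.log10(odds)            # log scale, convenient for comparison
posterior_probability = odds / (1.0 + odds)
```

Because odds are on a common scale, values computed this way for different variants can be ranked against each other, which is the kind of quantitative comparison the abstract describes.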
{"title":"BICEP: Bayesian inference for rare genomic variant causality evaluation in pedigrees.","authors":"Cathal Ormond, Niamh M Ryan, Mathieu Cap, William Byerley, Aiden Corvin, Elizabeth A Heron","doi":"10.1093/bib/bbae624","DOIUrl":"10.1093/bib/bbae624","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11645550/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The accurate estimation of cell type proportions in tissues is crucial for various downstream analyses. With the increasing availability of single-cell sequencing data, numerous deconvolution methods that use single-cell RNA sequencing data as a reference have been developed. However, a unified understanding of how these deconvolution approaches perform in practical applications is still lacking. To address this, we systematically assessed the accuracy and robustness of nine deconvolution methods that use single-cell RNA sequencing data as a reference, evaluating them on real bulk data with cell proportions verified through flow cytometry, as well as simulated bulk data generated from five single-cell RNA sequencing datasets. Our study highlights the influence of several factors (including reference dataset construction strategies, dataset size, cell type subdivision, and cell type inconsistency) on the accuracy and robustness of deconvolution results. We also propose a set of recommended guidelines for software users in diverse scenarios.
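Evaluation on simulated bulk data can be pictured as mixing reference profiles in known proportions and then comparing a method's estimates with those proportions. The following is a minimal sketch of that loop under simplifying assumptions: per-cell-type mean profiles stand in for single-cell data, root-mean-square error is used as the accuracy metric, and the profiles, proportions, and "estimated" values are all hypothetical rather than taken from the study.

```python
import math


def simulate_bulk(profiles, proportions):
    """Mix per-cell-type mean expression profiles into one pseudo-bulk sample."""
    n_genes = len(next(iter(profiles.values())))
    bulk = [0.0] * n_genes
    for cell_type, fraction in proportions.items():
        for gene_index, value in enumerate(profiles[cell_type]):
            bulk[gene_index] += fraction * value
    return bulk


def rmse(estimated, truth):
    """Root-mean-square error between estimated and true cell-type proportions."""
    cell_types = sorted(truth)
    squared = sum((estimated[c] - truth[c]) ** 2 for c in cell_types)
    return math.sqrt(squared / len(cell_types))


# Hypothetical reference: mean expression of three genes in two cell types.
profiles = {"T cell": [10.0, 0.0, 2.0], "B cell": [0.0, 8.0, 2.0]}
true_proportions = {"T cell": 0.75, "B cell": 0.25}

bulk = simulate_bulk(profiles, true_proportions)

# Stand-in for the output of any deconvolution method run on `bulk`.
estimated_proportions = {"T cell": 0.70, "B cell": 0.30}
error = rmse(estimated_proportions, true_proportions)
```

Repeating this with different reference construction strategies or reference sizes would show how each factor changes the error, in the spirit of the comparison described above.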
{"title":"Cell-type deconvolution for bulk RNA-seq data using single-cell reference: a comparative analysis and recommendation guideline.","authors":"Xintian Xu, Rui Li, Ouyang Mo, Kai Liu, Justin Li, Pei Hao","doi":"10.1093/bib/bbaf031","DOIUrl":"10.1093/bib/bbaf031","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11789683/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143122256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ezechiel B Tibiri, Palwende R Boua, Issiaka Soulama, Christine Dubreuil-Tranchant, Ndomassi Tando, Charlotte Tollenaere, Christophe Brugidou, Romaric K Nanema, Fidèle Tiendrebeogo
Bioinformatics, an interdisciplinary field combining biology and computer science, enables meaningful information to be extracted from complex biological data. The exponential growth of biological data, driven by high-throughput omics technologies and advanced sequencing methods, requires robust computational resources. Worldwide, bioinformatics skills and computational clusters are essential for managing and analysing large-scale biological datasets across health, agriculture, and environmental science, which are crucial for the African continent. In Burkina Faso, the establishment of bioinformatics infrastructure has been a gradual process. Initial training initiatives in 2015-2016, including bioinformatics courses and the establishment of the BurkinaBioinfo (BBi) platform, marked significant progress. Over 250 scientists have been trained at diverse levels in bioinformatics, and 105 user accounts have been created for high-performance computing access. Operational since 2019, this platform has significantly facilitated training programs for scientists and system administrators in West Africa, covering data production, introductory bioinformatics, phylogenetic analysis, and metagenomics. Financial and technical support from various sources has facilitated the rapid development of the platform to meet the growing need for bioinformatics analysis, particularly in conjunction with local 'wet labs'. Establishing a bioinformatics cluster in Burkina Faso involved identifying the needs of researchers, selecting appropriate hardware, and installing the necessary bioinformatics tools. At present, the main challenges for the BBi platform include ongoing staff training in bioinformatics skills and high-level IT infrastructure management in the face of growing infrastructure demands. Despite these challenges, the establishment of a bioinformatics platform in Burkina Faso offers significant opportunities for scientific research and economic development in the country.
{"title":"Challenges and opportunities of developing bioinformatics platforms in Africa: the case of BurkinaBioinfo at Joseph Ki-Zerbo University, Burkina Faso.","authors":"Ezechiel B Tibiri, Palwende R Boua, Issiaka Soulama, Christine Dubreuil-Tranchant, Ndomassi Tando, Charlotte Tollenaere, Christophe Brugidou, Romaric K Nanema, Fidèle Tiendrebeogo","doi":"10.1093/bib/bbaf040","DOIUrl":"10.1093/bib/bbaf040","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11789681/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143122300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jing Li, Qinglin Mei, Chaoxia Yang, Naibo Zhu, Guojun Li
Biclustering has emerged as a promising approach for analyzing high-dimensional expression data, offering unique advantages in uncovering localized co-expression patterns that traditional clustering methods often miss and thus facilitating advancements in complex disease research and other biomedical applications. However, state-of-the-art methods identify distinct patterns at the expense of losing information about specific patterns, some of which have been used to define cancer subtypes or reflect the progression of a disease or cellular processes. Additionally, these methods perform poorly in noisy environments. To address these limitations, we propose the bucket trend-preserving (BTP) pattern, a novel generalization of existing patterns. We have also developed an algorithm, TransBic, to extract significant biclusters of BTP-patterns. Specifically, TransBic transforms the problem into identifying common multipartite acyclic tournament subdigraphs shared by distinct subsets of acyclic tournament digraphs derived from a given expression matrix. Compared with prominent tools, TransBic demonstrates superior performance in identifying biclusters of all non-row-constant patterns, especially under noise and data fluctuations. Furthermore, TransBic successfully identifies the most disease-related pathways for type 2 diabetes (T2D), colorectal cancer, hepatocellular carcinoma, and breast cancer, outperforming other tools in this regard. Unlike previous generalizations, BTP-patterns capture specific up-regulation and down-regulation dynamics. Through targeted analysis of BTP-patterns in T2D expression data, TransBic uncovers biological processes affected by disease risk factors, extending the application of trend-preserving biclustering in expression data analysis.
{"title":"TransBic: bucket trend-preserving biclustering for finding local and interpretable expression patterns.","authors":"Jing Li, Qinglin Mei, Chaoxia Yang, Naibo Zhu, Guojun Li","doi":"10.1093/bib/bbaf050","DOIUrl":"10.1093/bib/bbaf050","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11794469/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143188339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Because current genome-wide association studies are primarily conducted in individuals of European ancestry and information disparities exist among different populations, polygenic scores derived from Europeans exhibit poor transferability. Borrowing the idea of transfer learning, which enables the utilization of knowledge acquired from auxiliary samples to enhance learning capability in target samples, we propose transPGS, a novel polygenic score method for genetic prediction in underrepresented populations that leverages genetic similarity shared between the European and non-European populations while accounting for trans-ethnic differences in linkage disequilibrium (LD) and effect sizes. We demonstrate the usefulness and robustness of transPGS in improving prediction accuracy via individual-level and summary-level simulations and apply it to seven continuous phenotypes and three diseases in the African, Chinese, and East Asian populations of the UK Biobank and Genetic Epidemiology Research Study on Adult Health and Aging cohorts. We further reveal that distinct LD and minor allele frequency patterns across ancestral groups are responsible for the unsatisfactory portability of polygenic scores (PGS).
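For readers unfamiliar with polygenic scores, the quantity being predicted is, in its standard additive form, a weighted sum of an individual's allele dosages. The sketch below shows only that final scoring step. It does not implement the transfer-learning estimation of effect sizes that transPGS proposes, and the variant identifiers and weights are hypothetical.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive score: sum over variants of effect size * allele dosage (0, 1, or 2)."""
    return sum(
        effect_sizes[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in effect_sizes  # skip variants that have no estimated weight
    )


# Hypothetical per-variant effect sizes, e.g. estimated for the target population.
effect_sizes = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical genotype of one individual, coded as effect-allele counts.
individual = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_score(individual, effect_sizes)
```

Poor transferability arises because weights estimated in one population may not suit another; a method in this family changes how `effect_sizes` is estimated, not how the score is summed.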
{"title":"Polygenic prediction for underrepresented populations through transfer learning by utilizing genetic similarity shared with European populations.","authors":"Yiyang Zhu, Wenying Chen, Kexuan Zhu, Yuxin Liu, Shuiping Huang, Ping Zeng","doi":"10.1093/bib/bbaf048","DOIUrl":"10.1093/bib/bbaf048","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11794457/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143188337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, with ESM2 performing best on highly divergent sequences at remote evolutionary distances. Fewer than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation, the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations of analyses are available at https://github.com/santule/pLMEvo.
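One simple way to ask whether embeddings reflect phylogeny is to compare pairwise embedding distances with pairwise tree distances using a rank correlation. The sketch below illustrates that comparison on toy data. The use of cosine distance and Spearman correlation is an assumption for illustration, since the abstract does not state which measures the study uses, and the embeddings and tree distances are invented.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy per-sequence embeddings (e.g. mean-pooled pLM outputs) and tree distances.
embeddings = {"seqA": [1.0, 0.0], "seqB": [0.9, 0.1], "seqC": [0.0, 1.0]}
tree_distance = {("seqA", "seqB"): 0.1, ("seqA", "seqC"): 1.2, ("seqB", "seqC"): 1.0}

pairs = sorted(tree_distance)
embedding_d = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
phylo_d = [tree_distance[pair] for pair in pairs]
rho = spearman(embedding_d, phylo_d)
```

A correlation near 1 means the embedding space orders sequence pairs in the same way as the tree; restricting the vectors to a subset of dimensions would allow a similar check for individual neurons.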
{"title":"Do protein language models learn phylogeny?","authors":"Sanjana Tule, Gabriel Foley, Mikael Bodén","doi":"10.1093/bib/bbaf047","DOIUrl":"https://doi.org/10.1093/bib/bbaf047","url":null,"abstract":"<p><p>Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation, the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations of analyses are available at https://github.com/santule/pLMEvo.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143482209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheyu Ding, Rong Wei, Jianing Xia, Yonghao Mu, Jiahuan Wang, Yingying Lin
Ribosome profiling (Ribo-seq) provides transcriptome-wide insights into protein synthesis dynamics, yet its analysis poses challenges, particularly for nonbioinformatics researchers. Large language model-based chatbots offer promising solutions by leveraging natural language processing. This review explores their convergence, highlighting opportunities for synergy. We discuss challenges in Ribo-seq analysis and how chatbots mitigate them, facilitating scientific discovery. Through case studies, we illustrate chatbots' potential contributions, including data analysis and result interpretation. Despite the absence of applied examples, existing software underscores the value of chatbots and the large language model. We anticipate their pivotal role in future Ribo-seq analysis, overcoming limitations. Challenges such as model bias and data privacy require attention, but emerging trends offer promise. The integration of large language models and Ribo-seq analysis holds immense potential for advancing translational regulation and gene expression understanding.
{"title":"Exploring the potential of large language model-based chatbots in challenges of ribosome profiling data analysis: a review.","authors":"Zheyu Ding, Rong Wei, Jianing Xia, Yonghao Mu, Jiahuan Wang, Yingying Lin","doi":"10.1093/bib/bbae641","DOIUrl":"10.1093/bib/bbae641","url":null,"abstract":"<p><p>Ribosome profiling (Ribo-seq) provides transcriptome-wide insights into protein synthesis dynamics, yet its analysis poses challenges, particularly for nonbioinformatics researchers. Large language model-based chatbots offer promising solutions by leveraging natural language processing. This review explores their convergence, highlighting opportunities for synergy. We discuss challenges in Ribo-seq analysis and how chatbots mitigate them, facilitating scientific discovery. Through case studies, we illustrate chatbots' potential contributions, including data analysis and result interpretation. Despite the absence of applied examples, existing software underscores the value of chatbots and the large language model. We anticipate their pivotal role in future Ribo-seq analysis, overcoming limitations. Challenges such as model bias and data privacy require attention, but emerging trends offer promise. The integration of large language models and Ribo-seq analysis holds immense potential for advancing translational regulation and gene expression understanding.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11638007/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142817162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}