Pub Date : 2022-12-01 DOI: 10.1016/j.gpb.2022.05.010
Duo Chen , Xue Yuan , Xuehai Zheng , Jingping Fang , Gang Lin , Rongmao Li , Jiannan Chen , Wenjin He , Zhen Huang , Wenfang Fan , Limin Liang , Chentao Lin , Jinmao Zhu , Youqiang Chen , Ting Xue
Isochrysis galbana is considered an ideal source of functional foods and nutraceuticals for humans because of its high fucoxanthin (Fx) content. However, multi-omics analysis of the regulatory networks for Fx biosynthesis in I. galbana has not been reported. In this study, we report a high-quality genome assembly of I. galbana LG007, which has a genome size of 92.73 Mb, with a contig N50 of 6.99 Mb and 14,900 protein-coding genes. Phylogenetic analysis confirmed the monophyly of Haptophyta, with I. galbana sister to Emiliania huxleyi and Chrysochromulina tobinii. Evolutionary analysis estimated that I. galbana and E. huxleyi diverged ~133 million years ago. Gene family analysis indicated that lipid metabolism-related genes, including IgPLMT, IgOAR1, and IgDEGS1, were significantly expanded. Metabolome analysis showed that the carotenoid content of I. galbana cultured under green light for 7 days was higher than that under white light, and that β-carotene was the main carotenoid, accounting for 79.09% of total carotenoids. Comprehensive multi-omics analysis revealed that green light induction increased the content of β-carotene, antheraxanthin, zeaxanthin, and Fx, and that this increase was significantly correlated with the expression of IgMYB98, IgZDS, IgPDS, IgLHCX2, IgZEP, IgLCYb, and IgNSY. These findings contribute to the understanding of Fx biosynthesis and its regulation, providing a valuable reference for food and pharmaceutical applications.
Title: Multi-omics Analyses Provide Insight into the Biosynthesis Pathways of Fucoxanthin in Isochrysis galbana. Journal: Genomics, Proteomics & Bioinformatics, 20(6), Pages 1138-1153. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10225490/pdf/
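The contig N50 quoted in the abstract above is a standard assembly-contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal Python sketch of that definition follows; the contig lengths are hypothetical toy values, not data from the I. galbana assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [8, 7, 5, 3, 2]
print(n50(toy_contigs))  # 8 + 7 = 15 covers more than half of 25, so N50 is 7
```

A larger N50 for the same total length means the assembly is split into fewer, longer pieces.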
Pub Date : 2022-12-01 DOI: 10.1016/j.gpb.2021.06.003
Wan-Ping Lee , Qihui Zhu , Xiaofei Yang , Silvia Liu , Eliza Cerveira , Mallory Ryan , Adam Mil-Homens , Lauren Bellfy , Kai Ye , Charles Lee , Chengsheng Zhang
We aimed to develop a whole-genome sequencing (WGS)-based copy number variant (CNV) calling algorithm with the potential to replace the chromosomal microarray assay (CMA) for clinical diagnosis. JAX-CNV was thus developed for CNV detection from WGS data. The performance of this CNV calling algorithm was evaluated in a blinded manner on 31 samples and compared to the 112 CNVs reported by clinically validated CMAs for these 31 samples. The results showed that JAX-CNV recalled 100% of these CNVs. In addition, JAX-CNV identified an average of 30 CNVs per individual, representing an approximately seven-fold increase compared to calls of clinically validated CMAs. Experimental validation of 24 randomly selected CNVs showed one false positive, i.e., a false discovery rate (FDR) of 4.17%. A robustness test on lower-coverage data revealed 100% sensitivity for CNVs larger than 300 kb (the current threshold of the College of American Pathologists) down to 10× coverage. For CNVs larger than 50 kb, sensitivities were 100% for coverages deeper than 20×, 97% for 15×, and 95% for 10×. We developed a WGS-based CNV pipeline, including this newly developed CNV caller JAX-CNV, and found it capable of detecting CMA-reported CNVs at a sensitivity of 100% with an FDR of about 4%. We propose that JAX-CNV could be further examined in a multi-institutional study to justify the transition of first-tier genetic testing from CMAs to WGS. JAX-CNV is available at https://github.com/TheJacksonLaboratory/JAX-CNV.
Title: JAX-CNV: A Whole-genome Sequencing-based Algorithm for Copy Number Detection at Clinical Grade Level. Journal: Genomics, Proteomics & Bioinformatics, 20(6), Pages 1197-1206. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10225484/pdf/
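The two headline figures in the JAX-CNV abstract can be checked with simple arithmetic: one false positive among 24 validated calls gives the reported FDR, and recalling all 112 CMA-reported CNVs gives the reported sensitivity. The sketch below only reproduces that arithmetic; the function names are illustrative and are not part of the JAX-CNV software.

```python
def false_discovery_rate(false_positives, validated_calls):
    """Fraction of experimentally validated calls that were false positives."""
    if validated_calls <= 0:
        raise ValueError("validated_calls must be positive")
    return false_positives / validated_calls


def sensitivity(recalled, reported):
    """Fraction of reference (CMA-reported) CNVs recovered by the caller."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return recalled / reported


# Counts taken from the abstract: 1 false positive in 24 validated CNVs,
# and 112 of 112 CMA-reported CNVs recalled.
fdr = false_discovery_rate(1, 24)
sens = sensitivity(112, 112)
print(f"FDR = {fdr:.2%}, sensitivity = {sens:.0%}")  # FDR = 4.17%, sensitivity = 100%
```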
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.11.008
Wubing Zhang , Shourya S. Roy Burman , Jiaye Chen , Katherine A. Donovan , Yang Cao , Chelsea Shu , Boning Zhang , Zexian Zeng , Shengqing Gu , Yi Zhang , Dian Li , Eric S. Fischer , Collin Tokheim , X. Shirley Liu
Targeted protein degradation (TPD) has rapidly emerged as a therapeutic modality to eliminate previously undruggable proteins by repurposing the cell’s endogenous protein degradation machinery. However, the susceptibility of proteins to targeting by TPD approaches, termed “degradability”, is largely unknown. Here, we developed a machine learning model, model-free analysis of protein degradability (MAPD), to predict degradability from features intrinsic to protein targets. MAPD accurately predicts kinases that are degradable by TPD compounds [with an area under the precision–recall curve (AUPRC) of 0.759 and an area under the receiver operating characteristic curve (AUROC) of 0.775] and is likely generalizable to independent non-kinase proteins. We found that five statistically significant features achieve optimal prediction, with ubiquitination potential being the most predictive. By structural modeling, we found that E2-accessible ubiquitination sites, but not lysine residues in general, are particularly associated with kinase degradability. Finally, we extended MAPD predictions to the entire proteome to find 964 disease-causing proteins (including proteins encoded by 278 cancer genes) that may be tractable to TPD drug development.
Title: Machine Learning Modeling of Protein-intrinsic Features Predicts Tractability of Targeted Protein Degradation. Journal: Genomics, Proteomics & Bioinformatics, 20(5), Pages 882-898. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/cc/c8/main.PMC10025769.pdf
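The AUROC and AUPRC values quoted for MAPD are ranking metrics computed from predicted scores and true labels. The sketch below shows one common way to compute each from scratch: AUROC as the fraction of positive-negative pairs ranked correctly, and AUPRC estimated as average precision. The labels and scores are hypothetical toy values; this is not the MAPD evaluation code, and the paper may use a different AUPRC estimator.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive.

    Assumes no tied scores; tied scores would need an explicit convention.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return precision_sum / hits


# Hypothetical toy data: 1 = degradable, 0 = not degradable.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(auroc(toy_labels, toy_scores))              # 0.75 (3 of 4 pairs ordered correctly)
print(average_precision(toy_labels, toy_scores))  # (1/1 + 2/3) / 2
```

AUROC is insensitive to class balance, whereas average precision depends on the fraction of positives, which is one reason abstracts such as this one report both.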
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.11.004
Shao-Wu Zhang, Jing-Yu Xu, Tong Zhang
Identification of cancer driver genes plays an important role in precision oncology research and helps to understand cancer initiation and progression. However, most existing computational methods either rely mainly on protein–protein interaction (PPI) networks or treat directed gene regulatory networks (GRNs) as undirected gene–gene association networks when identifying cancer driver genes, which loses the unique regulatory structure information in directed GRNs and in turn affects the outcome of cancer driver gene identification. Here, based on multi-omics pan-cancer data (i.e., gene expression, mutation, copy number variation, and DNA methylation), we propose a novel method (called DGMP) to identify cancer driver genes by combining a directed graph convolutional network (DGCN) and a multilayer perceptron (MLP). DGMP learns the multi-omics features of genes as well as the topological structure features of the GRN with the DGCN model, and uses the MLP to give more weight to gene features, mitigating the bias toward graph topological features in the DGCN learning process. Results on three GRNs show that DGMP outperforms other existing state-of-the-art methods. Ablation experiments on the DawnNet network indicate that introducing the MLP into the DGCN can offset the performance degradation of the DGCN, and that combining the MLP and DGCN effectively improves the performance of cancer driver gene identification. DGMP can identify not only the highly mutated cancer driver genes but also the driver genes harboring other kinds of alterations (e.g., differential expression and aberrant DNA methylation) or genes involved in GRNs with other cancer genes. The source code of DGMP can be freely downloaded from https://github.com/NWPU-903PR/DGMP.
Title: DGMP: Identifying Cancer Driver Genes by Jointing DGCN and MLP from Multi-omics Genomic Data. Journal: Genomics, Proteomics & Bioinformatics, 20(5), Pages 928-938. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/17/4d/main.PMC10025764.pdf
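The DGMP abstract argues that treating a directed GRN as undirected discards regulatory information. The toy sketch below illustrates that point with a single round of neighbor averaging. It is not the DGCN used in DGMP; the `aggregate` function, the gene names, and the feature values are all hypothetical.

```python
def aggregate(features, edges, directed=True):
    """Average each gene's feature with the features of its regulators.

    edges are (regulator, target) pairs. With directed=False, every edge
    is also followed backwards, as in an undirected association network.
    """
    incoming = {gene: [] for gene in features}
    for regulator, target in edges:
        incoming[target].append(regulator)
        if not directed:
            incoming[regulator].append(target)
    aggregated = {}
    for gene, value in features.items():
        values = [value] + [features[source] for source in incoming[gene]]
        aggregated[gene] = sum(values) / len(values)
    return aggregated


# Hypothetical toy network: one transcription factor regulating two genes.
toy_features = {"TF": 1.0, "G1": 0.0, "G2": 0.0}
toy_edges = [("TF", "G1"), ("TF", "G2")]

print(aggregate(toy_features, toy_edges, directed=True))   # TF keeps its own signal (1.0)
print(aggregate(toy_features, toy_edges, directed=False))  # TF is diluted by its targets
```

In the directed case the regulator's signal flows only to its targets; in the undirected case the targets also flow back into the regulator, which blurs the distinction between the two roles.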
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.11.011
Matthew Brendel , Chang Su , Zilong Bai , Hao Zhang , Olivier Elemento , Fei Wang
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profiles of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance in artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it can extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review surveys recently developed deep learning techniques for scRNA-seq data analysis, identifies key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explains the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges that current deep learning approaches face with scRNA-seq data and discuss potential directions for improving deep learning algorithms for scRNA-seq data analysis.
Title: Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Journal: Genomics, Proteomics & Bioinformatics, 20(5), Pages 814-835. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025684/pdf/
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.11.012
Yiran Shan , Qian Zhang , Wenbo Guo , Yanhong Wu , Yuxin Miao , Hongyi Xin , Qiuyu Lian , Jin Gu
Sequencing-based spatial transcriptomics (ST) is an emerging technology to study in situ gene expression patterns at the whole-genome scale. Currently, ST data analysis is still complicated by high technical noise and low resolution. In addition to the transcriptomic data, matched histopathological images are usually generated for the same tissue sample alongside the ST experiment. The matched high-resolution histopathological images provide complementary cellular phenotypical information, offering an opportunity to mitigate the noise in ST data. We present a novel ST data analysis method called transcriptome and histopathological image integrative analysis for ST (TIST), which enables the identification of spatial clusters (SCs) and the enhancement of spatial gene expression patterns by integrative analysis of matched transcriptomic data and images. TIST devises a histopathological feature extraction method based on Markov random field (MRF) to learn the cellular features from histopathological images, and integrates them with the transcriptomic data and location information as a network, termed TIST-net. Based on TIST-net, SCs are identified by a random walk-based strategy, and gene expression patterns are enhanced by neighborhood smoothing. We benchmark TIST on both simulated datasets and 32 real samples against several state-of-the-art methods. Results show that TIST is robust to technical noise on multiple analysis tasks for sequencing-based ST data and can find interesting microstructures in different biological scenarios. TIST is available at http://lifeome.net/software/tist/ and https://ngdc.cncb.ac.cn/biocode/tools/BT007317.
Title: TIST: Transcriptome and Histopathological Image Integrative Analysis for Spatial Transcriptomics. Journal: Genomics, Proteomics & Bioinformatics, 20(5), Pages 974-988. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025771/pdf/
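The TIST abstract says gene expression patterns are enhanced by neighborhood smoothing over a network of spots. The general idea can be sketched as blending each spot's value with a weighted mean of its network neighbors. This is an assumption-laden illustration, not the TIST implementation: the `smooth_expression` function, the `alpha` mixing weight, and the spot names and edge weights are all hypothetical.

```python
def smooth_expression(expression, neighbors, alpha=0.5):
    """Blend each spot's expression with the weighted mean of its neighbors.

    expression: {spot: value} for a single gene.
    neighbors:  {spot: {neighbor_spot: edge_weight}}.
    alpha:      hypothetical mixing weight; 0 keeps the raw value,
                1 replaces it with the neighborhood mean.
    """
    smoothed = {}
    for spot, value in expression.items():
        weights = neighbors.get(spot, {})
        total_weight = sum(weights.values())
        if total_weight == 0:
            smoothed[spot] = value  # isolated spot: nothing to borrow from
            continue
        neighbor_mean = sum(
            weight * expression[other] for other, weight in weights.items()
        ) / total_weight
        smoothed[spot] = (1 - alpha) * value + alpha * neighbor_mean
    return smoothed


# Hypothetical toy data: spot "b" is a dropout next to two expressing spots.
toy_expression = {"a": 10.0, "b": 0.0, "c": 2.0}
toy_neighbors = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {}}
print(smooth_expression(toy_expression, toy_neighbors))
# {'a': 5.0, 'b': 3.0, 'c': 2.0}
```

In a method like the one described, the edge weights would come from the combined network of image features, expression, and location rather than from spatial adjacency alone.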
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.11.003
Yawei Li , Xin Wu , Ping Yang , Guoqian Jiang , Yuan Luo
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
Title: Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. Journal: Genomics, Proteomics & Bioinformatics, 20(5), Pages 850-866. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025752/pdf/
Pub Date : 2022-10-01 DOI: 10.1016/j.gpb.2022.03.001
Yi-Heng Zhu , Chengxin Zhang , Yan Liu , Gilbert S. Omenn , Peter L. Freddolino , Dong-Jun Yu , Yang Zhang
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
{"title":"TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction","authors":"Yi-Heng Zhu , Chengxin Zhang , Yan Liu , Gilbert S. Omenn , Peter L. Freddolino , Dong-Jun Yu , Yang Zhang","doi":"10.1016/j.gpb.2022.03.001","PeriodicalId":12528,"journal":{"name":"Genomics, Proteomics & Bioinformatics","volume":"20 5","pages":"Pages 1013-1027"},"PeriodicalIF":9.5,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025770/pdf/","platform":"Semanticscholar","paperid":"9834982","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA"}
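The TripletGO abstract above mentions a triplet network for profiling transcript expression. As a rough illustration of the general idea only, the sketch below computes a standard margin-based triplet loss with NumPy: an anchor embedding is pulled toward a "positive" example and pushed away from a "negative" one. The function name, the margin value, and the toy vectors are assumptions made for this example; the abstract does not specify the loss, distance, or architecture that TripletGO actually uses.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances, averaged over a batch.

    Each argument is an array of shape (batch, embedding_dim).
    The margin of 1.0 is an illustrative default, not a value from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-to-negative distance
    # The loss is zero once the negative is farther than the positive by at least `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy embeddings (hypothetical): the positive is close to the anchor, the negative is far.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
negative = np.array([[3.0, 4.0]])

loss_easy = triplet_margin_loss(anchor, positive, negative)  # well-separated triplet
loss_hard = triplet_margin_loss(anchor, negative, positive)  # roles swapped, so it is penalized
```

In this toy case the well-separated triplet incurs no loss, while swapping the positive and negative roles produces a positive penalty, which is the signal a triplet network would train on.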
Pub Date: 2022-10-01 DOI: 10.1016/j.gpb.2022.07.003
Yongbing Zhao, Jinfeng Shao, Yan W. Asmann
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, it remains unclear how well these model explainers apply to biological data. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models that predict tissue types from transcriptomic data and by identifying, for each sample, the top contributing genes with the greatest impact on model prediction. To improve the reproducibility and interpretability of the results generated by model explainers, we propose a series of optimization strategies for each explainer on two model architectures: multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers applied to aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
{"title":"Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data","authors":"Yongbing Zhao , Jinfeng Shao , Yan W. Asmann","doi":"10.1016/j.gpb.2022.07.003","PeriodicalId":12528,"journal":{"name":"Genomics, Proteomics & Bioinformatics","volume":"20 5","pages":"Pages 899-911"},"PeriodicalIF":9.5,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025763/pdf/","platform":"Semanticscholar","paperid":"9152040","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA"}
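The abstract above describes ranking, for each sample, the genes that contribute most to a model's tissue prediction. A minimal sketch of that idea follows, using one simple attribution rule (gradient × input) on a toy linear classifier. This is an illustrative stand-in under stated assumptions: the abstract does not name the explainers evaluated, and the gene names, weights, expression values, and the function `top_contributing_genes` are all hypothetical.

```python
import numpy as np


def top_contributing_genes(weights, expression, gene_names, target_class, k=2):
    """Rank genes by |gradient x input| for one sample and one target class.

    Assumes a linear model with logits z = W x + b, so the gradient of the
    target logit with respect to the input is simply the row W[target_class].
    """
    attribution = weights[target_class] * expression  # gradient x input, per gene
    order = np.argsort(-np.abs(attribution))[:k]      # largest absolute contribution first
    return [(gene_names[i], float(attribution[i])) for i in order]


# Hypothetical toy data: 4 genes, 2 tissue classes, 1 sample.
gene_names = ["geneA", "geneB", "geneC", "geneD"]
weights = np.array([
    [0.5, -2.0, 0.0, 1.0],   # class 0 weights
    [1.0, 0.1, 3.0, -0.5],   # class 1 weights
])
expression = np.array([2.0, 1.0, 5.0, 4.0])

top = top_contributing_genes(weights, expression, gene_names, target_class=0)
```

For a deep MLP or CNN the gradient would come from automatic differentiation rather than a weight row, but the ranking step (sort genes by absolute attribution and keep the top k) would stay the same.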