Pub Date: 2020-09-11; eCollection Date: 2020-01-01; DOI: 10.1177/1176934320941495
Jun Yang, Peng Xu, Diqiu Yu
Rice (Oryza sativa) yield is correlated with various factors. Important among these are transcription regulators such as the typical SHORT INTERNODES-related sequences (SRSs), which encode proteins with single zinc finger motifs. Nevertheless, knowledge of the evolutionary and functional characteristics of the SRS gene family members in rice is insufficient. Therefore, we performed a genome-wide screening and characterization of the OsSRS gene family in Oryza sativa japonica rice. We also examined the SRS proteins from 11 rice subspecies, consisting of 3 cultivars, 6 wild varieties, and 2 other genome types. SRS members from maize, sorghum, Brachypodium distachyon, and Arabidopsis were also investigated. Phylogenetic analysis showed that these SRS proteins have species-specific characteristics as well as monocot- and dicot-specific characteristics, a result further supported by gene structure and motif analyses. Genome comparisons revealed that segmental duplications may have played significant roles in the recombination of the OsSRS gene family and in the expression levels of its members. The family was mainly subjected to purifying selective pressure. In addition, the expression data demonstrated distinct responses of OsSRS genes to various abiotic stresses and hormonal treatments, indicating their functional divergence. Our study provides a reference for elucidating the functions of SRS genes in rice.
{"title":"Genome-Wide Identification and Characterization of the SHI-Related Sequence Gene Family in Rice.","authors":"Jun Yang, Peng Xu, Diqiu Yu","doi":"10.1177/1176934320941495","DOIUrl":"https://doi.org/10.1177/1176934320941495","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320941495"},"publicationDate":"2020-09-11","publicationTypes":"Journal Article"}
Pub Date: 2020-09-02; eCollection Date: 2020-01-01; DOI: 10.1177/1176934320941500
Kabita Baral, Peter Rotwein
Recent advances in genetics present unique opportunities for enhancing our understanding of human physiology and disease predisposition through detailed analysis of gene structure, expression, and population variation via examination of data in publicly accessible genome and gene expression repositories. Yet, the vast majority of human genes remain understudied. Here, we show the scope of these genomic and genetic resources by evaluating ZMAT2, a member of a 5-gene family that through May 2020 had been the focus of only 4 peer-reviewed scientific publications. Using analysis of information extracted from public databases, we show that human ZMAT2 is a 6-exon gene and find that it exhibits minimal genetic variation in human populations and in disease states, including cancer. We further demonstrate that the gene and its encoded protein are highly conserved among nonhuman primates and define a cohort of ZMAT2 pseudogenes in the marmoset genome. Collectively, our investigations illustrate how complementary use of genomic, gene expression, and population genetic resources can lead to new insights about human and mammalian biology and evolution, and when coupled with data supporting key roles for ZMAT2 in keratinocyte differentiation and pre-RNA splicing argue that this gene is worthy of further study.
{"title":"<i>ZMAT2</i> in Humans and Other Primates: A Highly Conserved and Understudied Gene.","authors":"Kabita Baral, Peter Rotwein","doi":"10.1177/1176934320941500","DOIUrl":"https://doi.org/10.1177/1176934320941500","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320941500"},"publicationDate":"2020-09-02","publicationTypes":"Journal Article"}
Triple-negative breast cancer (TNBC) is the most aggressive and fatal subtype of breast cancer. This study aimed to identify metastasis-associated genes that could serve as biomarkers for TNBC diagnosis and prognosis. RNA-seq data and clinical information on TNBC from the Cancer Genome Atlas were used for the analyses. Expression data were used to establish co-expression modules by average linkage hierarchical clustering. We used weighted gene co-expression network analysis to explore the associations between gene sets and clinical features and to identify metastasis-associated candidate biomarkers. The K-M plotter website was used to explore the association between the expression of candidate biomarkers and patient survival. In addition, receiver operating characteristic curve analysis was used to illustrate the diagnostic performance of candidate genes. The pale turquoise module was significantly associated with the occurrence of metastasis. This module contained 64 genes, and functional enrichment analysis revealed that they were mainly associated with transcriptional misregulation in cancer, microRNAs in cancer, and negative regulation of angiogenesis. Further, 4 genes, IGSF10, RUNX1T1, XIST, and TSHZ2, which were negatively associated with relapse-free survival and have seldom been reported before in TNBC, were selected. In addition, the mRNA expression levels of the 4 candidate genes were significantly lower in TNBC tumor tissues than in healthy tissues. Based on the K-M plotter, these 4 genes were correlated with poor prognosis of TNBC. The areas under the curve of IGSF10, RUNX1T1, TSHZ2, and XIST were 0.918, 0.957, 0.977, and 0.749, respectively. These findings provide new insight into TNBC metastasis. IGSF10, RUNX1T1, TSHZ2, and XIST could be used as candidate biomarkers for the diagnosis and prognosis of TNBC metastasis.
{"title":"Identification of Metastasis-Associated Genes in Triple-Negative Breast Cancer Using Weighted Gene Co-expression Network Analysis.","authors":"Wenting Xie, Zhongshi Du, Yijie Chen, Naxiang Liu, Zhaoming Zhong, Youhong Shen, Lina Tang","doi":"10.1177/1176934320954868","DOIUrl":"https://doi.org/10.1177/1176934320954868","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320954868"},"publicationDate":"2020-09-01","publicationTypes":"Journal Article"}
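The area-under-the-curve figures in the abstract above summarize how well each gene's expression separates tumor from healthy samples. As a minimal sketch (not the authors' pipeline), the AUC can be computed as the fraction of tumor/healthy sample pairs that a score ranks correctly. The expression values below are invented for illustration, and because the candidate genes are reported as lower in tumors, the score is assumed here to be the negated expression level.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical expression values (NOT taken from the study).
tumor = [1.2, 0.8, 1.5, 2.9]
healthy = [3.1, 2.7, 3.5, 2.2]

# Assumption: lower expression indicates tumor, so negate expression to score.
auc = roc_auc([-x for x in tumor], [-x for x in healthy])
print(auc)  # 0.875 (14 of 16 tumor/healthy pairs ranked correctly)
```

An AUC of 0.5 corresponds to a score that ranks pairs no better than chance, and 1.0 to perfect separation.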
Pub Date: 2020-07-27; eCollection Date: 2020-01-01; DOI: 10.1177/1176934320942192
Su Xu, Jianjun Cheng, Xiangchen Meng, Yan Xu, Ying Mu
Lactobacillus reuteri YSJL-12 was isolated from the fresh feces of a healthy sow and has previously been used as a probiotic additive. To investigate the genetic basis of its probiotic potential and to identify the genes of the strain, the complete genome of YSJL-12 was sequenced, and a comparative genome analysis of 9 strains of Lactobacillus reuteri was then performed. The genome of YSJL-12 consists of a circular 2,084,748 bp chromosome and 2 circular plasmids (51,906 and 15,134 bp). Among the 2065 protein-coding sequences (CDSs), genes conferring resistance to environmental stress were identified. Gene functions were predicted by COG (Clusters of Orthologous Groups) classification, and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways were analyzed. The comparative genome analysis indicated that the pan-genome contains a core genome of 1257 orthologous gene clusters, an accessory genome of 1064 orthologous gene clusters, and 1148 strain-specific genes, and that the antibacterial mechanisms of Lactobacillus reuteri strains may differ. Phylogenetic analysis and genomic collinearity revealed that the phylogenetic relationships among the 9 strains of Lactobacillus reuteri are connected with host species and show host specificity. This research may help to better predict gene function and to understand the genetic basis of adaptation to the host gut in Lactobacillus reuteri YSJL-12.
{"title":"Complete Genome and Comparative Genome Analysis of <i>Lactobacillus reuteri</i> YSJL-12, a Potential Probiotics Strain Isolated From Healthy Sow Fresh Feces.","authors":"Su Xu, Jianjun Cheng, Xiangchen Meng, Yan Xu, Ying Mu","doi":"10.1177/1176934320942192","DOIUrl":"https://doi.org/10.1177/1176934320942192","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320942192"},"publicationDate":"2020-07-27","publicationTypes":"Journal Article"}
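The core, accessory, and strain-specific counts reported in the abstract above come from sorting orthologous gene clusters by how many strains carry them. The sketch below illustrates that partition under simple assumptions: clusters present in every strain are core, clusters present in exactly one strain are strain-specific, and everything in between is accessory. The strain names and cluster IDs are hypothetical, and the authors' actual clustering tool and thresholds are not stated in the abstract.

```python
from collections import Counter


def partition_pangenome(strain_clusters):
    """Split gene clusters into core, accessory, and strain-specific sets.

    strain_clusters maps a strain name to the set of cluster IDs it carries.
    """
    n_strains = len(strain_clusters)
    # Number of strains in which each cluster occurs.
    counts = Counter(c for clusters in strain_clusters.values() for c in clusters)
    core = {c for c, k in counts.items() if k == n_strains}
    specific = {c for c, k in counts.items() if k == 1}
    accessory = set(counts) - core - specific
    return core, accessory, specific


# Hypothetical toy input (not data from the study).
toy = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g3", "g5"},
}
core, accessory, specific = partition_pangenome(toy)
print(sorted(core), sorted(accessory), sorted(specific))
# ['g1'] ['g2', 'g3'] ['g4', 'g5']
```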
Pub Date: 2020-07-27; eCollection Date: 2020-01-01; DOI: 10.1177/1176934320924681
Efraín Hernando Pinzón-Reyes, Daniel Alfonso Sierra-Bueno, Miguel Orlando Suarez-Barrera, Nohora Juliana Rueda-Forero, Sebastián Abaunza-Villamizar, Paola Rondón-Villareal
Directed evolution methods mimic Darwinian evolution in vitro, inducing random mutations and applying selective pressure to genes to obtain proteins with enhanced characteristics. These techniques are developed by trial-and-error testing at the experimental level, with a high degree of uncertainty. Therefore, in silico modeling of directed evolution is required to support experimental assays. Several in silico approaches have reproduced directed evolution, using statistical, thermodynamic, and kinetic models in an attempt to recreate experimental conditions. Likewise, optimization techniques using heuristic models have been used to understand and find the best scenarios of directed evolution. Our study uses an in silico model named HeurIstics DirecteD EvolutioN, which is based on a genetic algorithm designed to generate chimeric libraries from 2 parental genes, cry11Aa and cry11Ba, of Bacillus thuringiensis. These genes encode crystal-shaped δ-endotoxins with 3 conserved domains. Cry11 toxins are of biotechnological interest because they have been shown to be effective as biopesticides against disease-spreading vectors. With our heuristic model, we considered experimental parameters such as DNA fragmentation length, number of generations or simulation cycles, and mutation rate, to obtain characteristics of Cry11 chimeric libraries such as percentage of population identity, truncation of variants caused by internal stop codons, percentage of thermodynamic diversity, and stability of variants. Our study allowed us to focus on experimental conditions that may be useful for the design of in vitro and in silico directed evolution experiments with Cry toxins of 3 conserved domains. Furthermore, we obtained in silico libraries of Cry11 variants, and a review of a sample of these in silico sequences showed structural characteristics of wild Cry families. We consider that future studies could use our in silico libraries and heuristic computational models, such as the one suggested here, to support in vitro experiments of directed evolution.
{"title":"Generation of Cry11 Variants of <i>Bacillus thuringiensis</i> by Heuristic Computational Modeling.","authors":"Efraín Hernando Pinzón-Reyes, Daniel Alfonso Sierra-Bueno, Miguel Orlando Suarez-Barrera, Nohora Juliana Rueda-Forero, Sebastián Abaunza-Villamizar, Paola Rondón-Villareal","doi":"10.1177/1176934320924681","DOIUrl":"https://doi.org/10.1177/1176934320924681","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320924681"},"publicationDate":"2020-07-27","publicationTypes":"Journal Article"}
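The abstract above names three tunable parameters (fragmentation length, number of generations, mutation rate) and one readout (truncation by internal stop codons). The following is a toy sketch of one genetic-algorithm cycle built around those ideas; it is not the authors' HeurIstics DirecteD EvolutioN model. The function names, the equal-length parents, and the placeholder sequences are all assumptions made for illustration; the real cry11Aa and cry11Ba genes differ in length and would need alignment-aware recombination.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def recombine(parent_a, parent_b, fragment_len, rng):
    """Build a chimera by taking each fragment from a randomly chosen parent.

    Simplifying assumption: both parents have the same length.
    """
    pieces = []
    for start in range(0, len(parent_a), fragment_len):
        source = rng.choice((parent_a, parent_b))
        pieces.append(source[start:start + fragment_len])
    return "".join(pieces)


def mutate(seq, rate, rng):
    """Redraw each base with probability `rate` (may redraw the same base)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


def has_internal_stop(seq):
    """True if any in-frame codon other than the last one is a stop codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return any(c in STOP_CODONS for c in codons[:-1])


# Hypothetical placeholder parents (not real cry11 sequences).
rng = random.Random(0)
parent_a = "ATGGCAGCAGCAGCATAA"
parent_b = "ATGCTTCTTCTTCTTTAA"
library = [mutate(recombine(parent_a, parent_b, 6, rng), 0.05, rng) for _ in range(20)]
truncated = sum(has_internal_stop(v) for v in library)
print(f"{truncated} of {len(library)} variants carry an internal stop codon")
```

Repeating the recombine-and-mutate step on the surviving variants for several generations would correspond to the "simulation cycles" mentioned in the abstract.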
Pub Date: 2020-07-27; eCollection Date: 2020-01-01; DOI: 10.1177/1176934320944932
Yi-Pin Lai, Thomas R Ioerger
Many antibacterial drugs have multiple mechanisms of resistance, which are often represented simultaneously by a mixture of resistance mutations (some more frequent than others) in a clinical population. This presents a challenge for Genome-Wide Association Studies (GWAS) methods, making it difficult to detect less prevalent resistance mechanisms purely through (weak) statistical associations. Homoplasy, or the occurrence of multiple independent mutations at the same site, is often observed with drug resistance mutations and can be a strong indicator of positive selection. However, traditional GWAS methods, such as those based on allele counting or linear regression, are not designed to take homoplasy into account. In this article, we present a new method, called ECAT (for Evolutionary Cluster-based Association Test), that extends traditional regression-based GWAS methods with the ability to take advantage of homoplasy. This is achieved through a preprocessing step which identifies hypervariable regions in the genome exhibiting statistically significant clusters of distinct evolutionary changes, to which association testing by a linear mixed model (LMM) is applied using GEMMA (a well-established LMM-based GWAS tool). Thus, the approach can be viewed as extending GEMMA from the usual site- or gene-level analysis to focusing on clustered regions of mutations. This approach was evaluated on a large collection of more than 600 clinical isolates of multidrug-resistant (MDR) Mycobacterium tuberculosis from Lima, Peru. We show that ECAT does a better job of detecting known resistance mutations for several antitubercular drugs (including less prevalent mutations with weaker associations), compared with (site- or gene-based) GEMMA, as representative of existing GWAS methods. The power of the multiphase approach in ECAT comes from focusing association testing on the hypervariable regions of the genome, which reduces complexity in the model and increases statistical power.
{"title":"Exploiting Homoplasy in Genome-Wide Association Studies to Enhance Identification of Antibiotic-Resistance Mutations in Bacterial Genomes.","authors":"Yi-Pin Lai, Thomas R Ioerger","doi":"10.1177/1176934320944932","DOIUrl":"https://doi.org/10.1177/1176934320944932","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320944932"},"publicationDate":"2020-07-27","publicationTypes":"Journal Article"}
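To see why homoplasy carries information that allele counting misses, consider a toy phylogeny in which two sites are each carried by two isolates. The sketch below counts the branches on which the allele changes; it assumes ancestral alleles at internal nodes have already been reconstructed (for example by parsimony), and the tree, node names, and alleles are invented. It illustrates the concept only and is not the ECAT method.

```python
def count_changes(parent, allele):
    """Count branches on which the allele differs between parent and child."""
    return sum(1 for child, par in parent.items() if allele[child] != allele[par])


def count_carriers(tips, allele, derived):
    """Count sampled isolates (tips) that carry the derived allele."""
    return sum(1 for t in tips if allele[t] == derived)


# Hypothetical rooted tree: child -> parent.
parent = {"n1": "root", "n2": "root",
          "s1": "n1", "s2": "n1", "s3": "n2", "s4": "n2"}
tips = ["s1", "s2", "s3", "s4"]

# Site 1: C->T arose independently on two branches (homoplasy).
homoplastic = {"root": "C", "n1": "C", "n2": "C",
               "s1": "T", "s2": "C", "s3": "T", "s4": "C"}
# Site 2: C->T arose once and was inherited by two sister isolates.
clonal = {"root": "C", "n1": "T", "n2": "C",
          "s1": "T", "s2": "T", "s3": "C", "s4": "C"}

for name, site in (("homoplastic", homoplastic), ("clonal", clonal)):
    print(name, count_carriers(tips, site, "T"), count_changes(parent, site))
# homoplastic 2 2
# clonal 2 1
```

Both sites have the same allele count, but only the first shows repeated independent origin, which is the signal of positive selection the abstract describes.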
RNA N6-methyladenosine (m6A) has emerged as an important epigenetic modification for its role in regulating the stability, structure, processing, and translation of RNA. Instability of m6A homeostasis may result in flaws in stem cell regulation, decreased fertility, and risk of cancer. To this day, experimental detection and quantification of RNA m6A modification remain time-consuming and labor-intensive. Existing databases contain only a limited number of epitranscriptome samples, and a matched RNA methylation profile is often unavailable for a biological problem of interest. As gene expression data are usually readily available for most biological problems, it would be appealing to estimate RNA methylation status from gene expression data using in silico methods. In this study, we explored the possibility of computationally predicting RNA methylation status from gene expression data using classification and regression methods, based on mouse RNA methylation data collected from 73 experimental conditions. Elastic Net-regularized Logistic Regression (ENLR), Support Vector Machine (SVM), and Random Forests (RF) models were constructed for classification. SVM and RF both achieved the best performance, with a mean area under the curve (AUC) of 0.84 across samples; SVM had a narrower AUC spread. Gene Site Enrichment Analysis was conducted on the sites selected by ENLR as predictors to assess the biological significance of the model. Three functional annotation terms were found to be statistically significant: phosphoprotein, SRC Homology 3 (SH3) domain, and endoplasmic reticulum. All 3 terms were found to be closely related to the m6A pathway. For regression analysis, Elastic Net was implemented, which yielded a mean Pearson correlation coefficient of 0.68 and a mean Spearman correlation coefficient of 0.64. Our exploratory study suggests that gene expression data can be used to construct predictors of m6A methylation status with adequate accuracy. Our work shows for the first time that RNA methylation status may be predicted from matched gene expression data. This finding may facilitate RNA modification research in various biological contexts when a matched RNA methylation profile is not available, especially in the very early stage of a study.
{"title":"Prediction of RNA Methylation Status From Gene Expression Data Using Classification and Regression Methods.","authors":"Hao Xue, Zhen Wei, Kunqi Chen, Yujiao Tang, Xiangyu Wu, Jionglong Su, Jia Meng","doi":"10.1177/1176934320915707","DOIUrl":"https://doi.org/10.1177/1176934320915707","journal":{"name":"Evolutionary Bioinformatics","volume":"16","pages":"1176934320915707"},"publicationDate":"2020-07-20","publicationTypes":"Journal Article"}
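The regression results in the abstract above are reported as both Pearson and Spearman correlations between predicted and observed methylation levels. The two can diverge: Pearson measures linear agreement, while Spearman measures agreement of ranks. A minimal standard-library sketch of both follows; the vectors are invented and merely stand in for observed and predicted values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical observed vs predicted values (not data from the study).
observed = [1, 2, 3, 4, 5]
predicted = [1, 4, 9, 16, 25]  # monotonic but non-linear in `observed`
print(round(pearson(observed, predicted), 3), round(spearman(observed, predicted), 3))
```

Here the ranks agree perfectly, so Spearman is 1, while Pearson falls slightly below 1 because the relationship is not linear.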
Pub Date : 2020-07-10eCollection Date: 2020-01-01DOI: 10.1177/1176934320939945
Qin-Long Dai, Jian-Wei Li, Yi Yang, Min Li, Kan Zhang, Liu-Yang He, Jun Zhang, Bo Tang, Hui-Ping Liu, Yu-Xia Li, Li-Feng Zhu, Zhi-Song Yang, Qiang Dai
Release of individuals is an effective conservation approach to protect endangered species. To save the small isolated giant panda population in Liziping Nature Reserve, a few giant pandas have been released into it. Here we assess genetic diversity and future changes in the population after these releases, using noninvasive genetic sampling. In this study, a total of 28 giant pandas (including 4 released individuals) were identified in Liziping, China. Compared with other giant panda populations, this population has medium-level genetic diversity; however, a Bayesian-coalescent method clearly detected, quantified, and dated a recent decrease in population size. The predictions for genetic diversity and survival of the population over the next 100 years indicate that this population has a high risk of extinction. We show that released giant pandas can preserve genetic diversity and improve the probability of survival in this small isolated giant panda population. To promote the recovery of this population, we suggest that panda releases should continue and that 10 males and 20 females will need to be released into this population in the future.
{"title":"Genetic Diversity and Prediction Analysis of Small Isolated Giant Panda Populations After Release of Individuals.","authors":"Qin-Long Dai, Jian-Wei Li, Yi Yang, Min Li, Kan Zhang, Liu-Yang He, Jun Zhang, Bo Tang, Hui-Ping Liu, Yu-Xia Li, Li-Feng Zhu, Zhi-Song Yang, Qiang Dai","doi":"10.1177/1176934320939945","DOIUrl":"10.1177/1176934320939945","url":null,"PeriodicalId":50472,"journal":{"name":"Evolutionary Bioinformatics","volume":"16 ","pages":"1176934320939945"},"PeriodicalIF":2.6,"publicationDate":"2020-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1176934320939945","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38189798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-07-09eCollection Date: 2020-01-01DOI: 10.1177/1176934320939943
Akshay Yadav, David Fernández-Baca, Steven B Cannon
Protein domains can be regarded as sections of protein sequences capable of folding independently and performing specific functions. In addition to amino-acid level changes, protein sequences can also evolve through domain shuffling events such as domain insertion, deletion, or duplication. The evolution of protein domains can be studied by tracking domain changes in a selected set of species with known phylogenetic relationships. Here, we conduct such an analysis by defining domains as “features” or “descriptors,” and considering the species (target + outgroup) as instances or data-points in a data matrix. We then look for features (domains) that are significantly different between the target species and the outgroup species. We study the domain changes in 2 large, distinct groups of plant species: legumes (Fabaceae) and grasses (Poaceae), with respect to selected outgroup species. We evaluate 4 types of domain feature matrices: domain content, domain duplication, domain abundance, and domain versatility. These matrices attempt to capture different ways in which protein sequences may evolve: gain or loss of domains (content), increase or decrease in the copy number of domains along the sequences (duplication), expansion or contraction of domains (abundance), and changes in the number of adjacent domain partners (versatility). All the feature matrices were analyzed using feature selection techniques and statistical tests to select protein domains that have significantly different feature values in legumes and grasses. We report the biological functions of the top selected domains from the analysis of all the feature matrices. We also perform domain-centric gene ontology (dcGO) enrichment analysis on all selected domains from all 4 feature matrices to study the gene ontology terms associated with the significantly evolving domains in legumes and grasses. 
Domain content analysis revealed a striking loss of protein domains from the Fanconi anemia (FA) pathway, the pathway responsible for the repair of interstrand DNA crosslinks. The abundance analysis of domains found in legumes revealed an increase in glutathione synthase enzyme, an antioxidant required for nitrogen fixation, and a decrease in xanthine oxidizing enzymes, a phenomenon confirmed by previous studies. In grasses, the abundance analysis showed increases in domains related to gene silencing, which could be due to polyploidy or to an enhanced response to viral infection. We provide a Docker container that can be used to perform this analysis workflow on any user-defined sets of species, available at https://cloud.docker.com/u/akshayayadav/repository/docker/akshayayadav/protein-domain-evolution-project.
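The 4 domain feature types named in this abstract can be illustrated with a small Python sketch. It is not the authors' workflow (that is distributed in their Docker container); the function name, the input layout, and the exact definitions below are assumptions chosen to match the abstract's wording: content as presence, abundance as the number of proteins carrying a domain, duplication as the maximum copy number within one protein, and versatility as the number of distinct adjacent domain partners.

```python
from collections import defaultdict

def domain_features(proteomes):
    """proteomes: {species: [[domain_id, ...], ...]}, domains listed in sequence order.

    Returns {species: {domain_id: {feature_name: value}}} for domains present in that species.
    """
    features = {}
    for species, proteins in proteomes.items():
        abundance = defaultdict(int)    # number of proteins containing the domain
        duplication = defaultdict(int)  # maximum copies of the domain in one protein
        partners = defaultdict(set)     # distinct domains found directly adjacent
        for domains in proteins:
            for d in set(domains):
                abundance[d] += 1
                duplication[d] = max(duplication[d], domains.count(d))
            for left, right in zip(domains, domains[1:]):
                if left != right:
                    partners[left].add(right)
                    partners[right].add(left)
        features[species] = {
            d: {
                "content": 1,  # absent domains are simply missing from the dict
                "abundance": abundance[d],
                "duplication": duplication[d],
                "versatility": len(partners[d]),
            }
            for d in abundance
        }
    return features

# Hypothetical toy proteomes; "PF1".."PF3" are placeholder domain identifiers.
toy = {
    "legume_A": [["PF1", "PF2"], ["PF1", "PF1", "PF3"]],
    "outgroup_B": [["PF2"], ["PF2", "PF3"]],
}
feats = domain_features(toy)
print(feats["legume_A"]["PF1"])
```

Stacking one such feature across all species gives a species-by-domain matrix, on which target and outgroup rows can then be compared with a feature selection method or statistical test of choice.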
{"title":"Family-Specific Gains and Losses of Protein Domains in the Legume and Grass Plant Families.","authors":"Akshay Yadav, David Fernández-Baca, Steven B Cannon","doi":"10.1177/1176934320939943","DOIUrl":"https://doi.org/10.1177/1176934320939943","url":null,"PeriodicalId":50472,"journal":{"name":"Evolutionary Bioinformatics","volume":"16 ","pages":"1176934320939943"},"PeriodicalIF":2.6,"publicationDate":"2020-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1176934320939943","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38186090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-06-30eCollection Date: 2020-01-01DOI: 10.1177/1176934320934498
Xin-Ke Zhan, Zhu-Hong You, Li-Ping Li, Yang Li, Zheng Wang, Jie Pan
Protein-protein interactions (PPIs) play a crucial role in the life cycles of living cells. Thus, it is important to understand the underlying mechanisms of PPIs. Although many high-throughput technologies have generated large amounts of PPI data in different organisms, the experiments for detecting PPIs are still costly and time-consuming. Therefore, novel computational methods for predicting PPIs are urgently needed and are drawing increasing attention. In this study, we propose a novel computational method for predicting PPIs based on texture features of protein sequences. Specifically, the Gabor feature is used to extract texture features and protein evolutionary information from the Position-Specific Scoring Matrix, which is generated by the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then, random forest-based classifiers are used to infer the protein interactions. When applied to PPI data sets of yeast, human, and Helicobacter pylori, the method achieved average accuracies of 92.10%, 97.03%, and 86.45%, respectively. To better evaluate the proposed method, we compared the Gabor feature with Discrete Cosine Transform and Local Phase Quantization. Our results show that the proposed method is both feasible and stable and that the Gabor feature descriptor is reliable in extracting protein sequence information. Furthermore, additional experiments were conducted to predict PPIs on data sets from 4 other species. The promising results indicate that our proposed method is both powerful and robust.
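The feature-extraction step described in this abstract, treating a Position-Specific Scoring Matrix as an image and summarising its Gabor filter responses, can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the kernel size, wavelength, sigma, number of orientations, and the mean/standard-deviation summary are all hypothetical choices, and the toy matrix stands in for a real L x 20 PSSM produced by PSI-BLAST.

```python
import math
from statistics import mean, pstdev

def gabor_kernel(size, theta, wavelength, sigma, gamma=0.5):
    """Real part of a 2D Gabor filter: Gaussian envelope times a cosine carrier."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)   # rotated coordinates
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def filter_valid(matrix, kernel):
    """Slide the kernel over the matrix, keeping only fully overlapping positions."""
    k = len(kernel)
    rows, cols = len(matrix), len(matrix[0])
    return [
        sum(kernel[a][b] * matrix[i + a][j + b] for a in range(k) for b in range(k))
        for i in range(rows - k + 1)
        for j in range(cols - k + 1)
    ]

def gabor_features(pssm, orientations=4, size=5, wavelength=4.0, sigma=2.0):
    """Fixed-length descriptor: mean and std of |response| for each orientation.

    The protein must have at least `size` residues (rows) for the filter to fit.
    """
    feats = []
    for n in range(orientations):
        theta = n * math.pi / orientations
        kernel = gabor_kernel(size, theta, wavelength, sigma)
        response = [abs(v) for v in filter_valid(pssm, kernel)]
        feats += [mean(response), pstdev(response)]
    return feats

# Hypothetical 12-residue "PSSM" with 20 columns (one per amino acid).
toy_pssm = [[((i * 7 + j * 3) % 11) - 5 for j in range(20)] for i in range(12)]
print(len(gabor_features(toy_pssm)))  # 8
```

Because the descriptor length does not depend on protein length, the vectors of two proteins could be concatenated (an assumption about pair encoding) and passed to a random forest classifier such as scikit-learn's `RandomForestClassifier`.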
{"title":"Using Random Forest Model Combined With Gabor Feature to Predict Protein-Protein Interaction From Protein Sequence.","authors":"Xin-Ke Zhan, Zhu-Hong You, Li-Ping Li, Yang Li, Zheng Wang, Jie Pan","doi":"10.1177/1176934320934498","DOIUrl":"https://doi.org/10.1177/1176934320934498","url":null,"PeriodicalId":50472,"journal":{"name":"Evolutionary Bioinformatics","volume":"16 ","pages":"1176934320934498"},"PeriodicalIF":2.6,"publicationDate":"2020-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1176934320934498","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38150704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}