Deciphering gene expression patterns using large-scale transcriptomic data and its applications.

IF 7.7 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS Briefings in bioinformatics Pub Date : 2024-09-23 DOI:10.1093/bib/bbae590

Shunjie Chen, Pei Wang, Haiping Guo, Yujie Zhang

{"title":"Deciphering gene expression patterns using large-scale transcriptomic data and its applications.","authors":"Shunjie Chen, Pei Wang, Haiping Guo, Yujie Zhang","doi":"10.1093/bib/bbae590","DOIUrl":null,"url":null,"abstract":"<p><p>Gene expression varies stochastically across genders, racial groups, and health statuses. Deciphering these patterns is crucial for identifying informative genes, classifying samples, and understanding diseases like cancer. This study analyzes 11,252 bulk RNA-seq samples to explore expression patterns of 19,156 genes, including 10,512 cancer tissue samples and 740 normal samples. Additionally, 4,884 single-cell RNA-seq samples are examined. Statistical analysis using 16 probability distributions shows that normal samples display a wider range of distributions compared to cancer samples. Cancer samples tend to favor asymmetric distributions such as generalized extreme value, logarithmic normal, and Gaussian mixture distributions. In contrast, certain genes in normal samples exhibit symmetric distributions. Remarkably, more than 95.5% of genes exhibit non-normal distributions, which challenges traditional assumptions. Furthermore, distributions differ significantly between bulk and single-cell RNA-seq data. Many cancer driver genes exhibit distinct distribution patterns across sample types, suggesting potential for gene selection and classification based on distribution characteristics. A novel skewness-based metric is proposed to quantify distribution variation across datasets, showing genes with significant skewness differences have biological relevance. Finally, an improved naïve Bayes method incorporating gene-specific distributions demonstrates superior performance in simulations over traditional methods. This work enhances understanding of gene expression and its application in omics-based gene selection and sample classification.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":7.7000,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562847/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbae590","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Gene expression varies stochastically across genders, racial groups, and health statuses. Deciphering these patterns is crucial for identifying informative genes, classifying samples, and understanding diseases like cancer. This study analyzes 11,252 bulk RNA-seq samples to explore expression patterns of 19,156 genes, including 10,512 cancer tissue samples and 740 normal samples. Additionally, 4,884 single-cell RNA-seq samples are examined. Statistical analysis using 16 probability distributions shows that normal samples display a wider range of distributions compared to cancer samples. Cancer samples tend to favor asymmetric distributions such as generalized extreme value, logarithmic normal, and Gaussian mixture distributions. In contrast, certain genes in normal samples exhibit symmetric distributions. Remarkably, more than 95.5% of genes exhibit non-normal distributions, which challenges traditional assumptions. Furthermore, distributions differ significantly between bulk and single-cell RNA-seq data. Many cancer driver genes exhibit distinct distribution patterns across sample types, suggesting potential for gene selection and classification based on distribution characteristics. A novel skewness-based metric is proposed to quantify distribution variation across datasets, showing genes with significant skewness differences have biological relevance. Finally, an improved naïve Bayes method incorporating gene-specific distributions demonstrates superior performance in simulations over traditional methods. This work enhances understanding of gene expression and its application in omics-based gene selection and sample classification.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

利用大规模转录组数据破译基因表达模式及其应用。

不同性别、种族群体和健康状况的基因表达随机变化。破解这些模式对于识别信息基因、对样本进行分类以及了解癌症等疾病至关重要。本研究分析了 11,252 份大容量 RNA-seq 样本，探索了 19,156 个基因的表达模式，其中包括 10,512 份癌症组织样本和 740 份正常样本。此外，还研究了 4884 个单细胞 RNA-seq 样本。使用 16 种概率分布进行的统计分析显示，与癌症样本相比，正常样本的分布范围更广。癌症样本倾向于非对称分布，如广义极值分布、对数正态分布和高斯混合分布。相比之下，正常样本中的某些基因则呈现对称分布。值得注意的是，超过 95.5% 的基因呈现非正态分布，这对传统假设提出了挑战。此外，大量 RNA 序列数据和单细胞 RNA 序列数据的分布也有很大不同。许多癌症驱动基因在不同样本类型中表现出不同的分布模式，这表明基于分布特征的基因选择和分类具有潜力。研究人员提出了一种基于偏度的新指标来量化不同数据集的分布差异，结果表明具有显著偏度差异的基因具有生物学相关性。最后，一种包含基因特异性分布的改进型天真贝叶斯方法在模拟中表现出优于传统方法的性能。这项研究加深了人们对基因表达的理解，并将其应用于基于 omics 的基因选择和样本分类。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.