Two subtle problems with overrepresentation analysis.

Bioinformatics Advances (IF 2.4, JCR Q2, Mathematical & Computational Biology). Pub Date: 2024-10-21. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae159
Mark Ziemann, Barry Schroeter, Anusuiya Bora
Bioinformatics Advances, volume 4, issue 1, article vbae159. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11557902/pdf/

Abstract

Motivation: Overrepresentation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of 'omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the 'background problem', because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the 'false discovery rate problem', because some tools underestimate the true number of parallel tests conducted.

Results: Here, we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset. These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.

Availability and implementation: An R/Shiny tool has been provided at https://oratool.ziemann-lab.net/ and the supporting materials are available from Zenodo (https://zenodo.org/records/13823301).
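
The two problems named in the Motivation paragraph can be illustrated with a small numerical sketch. The code below is not taken from the paper or from any ORA tool: the gene counts, the p-values, and the helper names `hypergeom_upper_tail` and `bh_adjust` are hypothetical, chosen only to show how the choice of background and the assumed number of tests can change the result. It uses a one-sided hypergeometric test and the Benjamini-Hochberg adjustment, which are common choices for ORA but are assumptions here.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which belong to the gene set (one-sided enrichment test)."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests lets the caller state the true number of tests when only a
    subset of p-values was reported. Assumption: every unreported test
    has a p-value at least as large as the reported ones.
    """
    m = len(pvalues) if n_tests is None else n_tests
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Background problem (all numbers hypothetical).
# 15000 detected genes, 9000 of them annotated to at least one category.
# The gene list has 500 genes, 400 of them annotated.
# One gene set has 100 members, 10 of which are in the gene list.
p_full = hypergeom_upper_tail(k=10, K=100, n=500, N=15000)
p_annotated = hypergeom_upper_tail(k=10, K=100, n=400, N=9000)
print("p-value, full background:          ", p_full)
print("p-value, annotated-only background:", p_annotated)

# False discovery rate problem (all numbers hypothetical).
# A tool reports 4 gene sets, but 20 gene sets were actually tested.
reported = [0.001, 0.004, 0.02, 0.03]
adj_reported = bh_adjust(reported)            # correction assumes 4 tests
adj_true = bh_adjust(reported, n_tests=20)    # correction assumes 20 tests
print("significant at FDR < 0.05, 4 tests: ", sum(q < 0.05 for q in adj_reported))
print("significant at FDR < 0.05, 20 tests:", sum(q < 0.05 for q in adj_true))
```

In this toy setting the same overlap gives a different p-value once unannotated genes are dropped from the background, and fewer gene sets pass the FDR threshold once the full number of tests is counted. The direction and size of these effects depend on the chosen numbers and should not be read as results from the paper.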
