Two subtle problems with overrepresentation analysis.

Bioinformatics Advances (IF 2.8, JCR Q2, Mathematical & Computational Biology). Pub Date: 2024-10-21. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae159

Mark Ziemann, Barry Schroeter, Anusuiya Bora



Abstract

Motivation: Overrepresentation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of 'omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the 'background problem', because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the 'false discovery rate problem', because some tools underestimate the true number of parallel tests conducted.
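The two behaviours described above can be illustrated with a minimal sketch. This is not the authors' code or any particular tool's implementation; the function names (`hypergeom_sf`, `bh_adjust`) and every count and p-value below are hypothetical, chosen only for illustration. The sketch assumes ORA is carried out as a one-sided hypergeometric test and that p-values are adjusted with the Benjamini-Hochberg procedure, treating any untested gene set as having p = 1.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which belong to the gene set (one-sided hypergeometric test)."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the number of parallel tests; it defaults to len(pvalues).
    Passing a larger n_tests is equivalent to padding the input with
    p = 1 for every test that was conducted but not reported.
    """
    m = len(pvalues) if n_tests is None else n_tests
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# --- Background problem (all counts hypothetical) ---
# 15000 detected genes, of which 9000 carry at least one annotation.
# The gene list has 500 genes, 350 of them annotated; the gene set has
# 100 members, 8 of which appear in the gene list.
p_full_background = hypergeom_sf(k=8, N=15000, K=100, n=500)
p_annotated_only = hypergeom_sf(k=8, N=9000, K=100, n=350)
print("full background:      ", p_full_background)
print("annotated genes only: ", p_annotated_only)

# --- False discovery rate problem (all values hypothetical) ---
# Four gene sets are reported, but the library contained 40 sets.
reported_p = [0.0005, 0.004, 0.02, 0.03]
q_undercounted = bh_adjust(reported_p)              # assumes 4 tests
q_all_tests = bh_adjust(reported_p, n_tests=40)     # counts all 40 sets
print("significant at 5% FDR, 4 tests counted: ",
      sum(q < 0.05 for q in q_undercounted))
print("significant at 5% FDR, 40 tests counted:",
      sum(q < 0.05 for q in q_all_tests))
```

In this toy setting, the same overlap yields a different p-value depending on whether unannotated genes are kept in the background, and fewer gene sets pass the 5% threshold once every set in the library is counted as a test. The direction and size of both effects in real data depend on the dataset and gene set library.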

Results: Here, we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset. These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.

Availability and implementation: An R/Shiny tool has been provided at https://oratool.ziemann-lab.net/ and the supporting materials are available from Zenodo (https://zenodo.org/records/13823301).

