Performance Comparison of Computational Methods for the Prediction of the Function and Pathogenicity of Non-coding Variants

IF 7.9; Zone 2 (Biology); Q1 GENETICS & HEREDITY; Genomics, Proteomics & Bioinformatics; Pub Date: 2023-06-01; DOI: 10.1016/j.gpb.2022.02.002

Zheng Wang , Guihu Zhao , Bin Li , Zhenghuan Fang , Qian Chen , Xiaomeng Wang , Tengfei Luo , Yijing Wang , Qiao Zhou , Kuokuo Li , Lu Xia , Yi Zhang , Xun Zhou , Hongxu Pan , Yuwen Zhao , Yige Wang , Lin Wang , Jifeng Guo , Beisha Tang , Kun Xia , Jinchen Li

Genomics, Proteomics & Bioinformatics, Volume 21, Issue 3, Pages 649-661. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10787016/pdf/ Publisher page: https://www.sciencedirect.com/science/article/pii/S167202292200016X


Abstract

Non-coding variants in the human genome significantly influence human traits and complex diseases through their regulatory and modifying effects. Consequently, a growing number of computational methods have been developed to predict the effects of variants in human non-coding sequences. However, inexperienced users find it difficult to choose an appropriate method from the dozens available. To address this issue, we assessed 12 performance metrics of 24 methods on four independent non-coding variant benchmark datasets: (1) rare germline variants from clinically relevant sequence variants (ClinVar), (2) rare somatic variants from the Catalogue Of Somatic Mutations In Cancer (COSMIC), (3) common regulatory variants from curated expression quantitative trait locus (eQTL) data, and (4) disease-associated common variants from curated genome-wide association studies (GWAS). The 24 tested methods performed differently across conditions, indicating that each has distinct strengths and weaknesses in different scenarios. Importantly, the performance of existing methods was acceptable for rare germline variants from ClinVar, with an area under the receiver operating characteristic curve (AUROC) of 0.4481–0.8033, and poor for rare somatic variants from COSMIC (AUROC = 0.4984–0.7131), common regulatory variants from curated eQTL data (AUROC = 0.4837–0.6472), and disease-associated common variants from curated GWAS (AUROC = 0.4766–0.5188). We also compared the prediction performance of the 24 methods for non-coding de novo mutations in autism spectrum disorder and found that the combined annotation-dependent depletion (CADD) and context-dependent tolerance score (CDTS) methods performed better than the others. In summary, we assessed the performance of 24 computational methods under diverse scenarios, providing preliminary advice for proper tool selection and guiding the development of new techniques for interpreting non-coding variants.
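To make the AUROC ranges quoted in the abstract easier to interpret, here is a minimal sketch of how AUROC can be computed from a method's scores and binary pathogenic/benign labels. It uses the rank-based (Mann-Whitney) formulation, which is an illustrative choice and not necessarily the procedure the authors used; the function name `auroc` and the toy scores and labels are hypothetical and do not come from the paper's benchmark datasets.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the AUROC as the probability that a randomly chosen positive
    (label 1) receives a higher score than a randomly chosen negative
    (label 0). Tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the paper): 1 = pathogenic, 0 = benign.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"toy AUROC = {toy_auroc:.4f}")
```

Under this definition, a value of 0.5 corresponds to ranking pathogenic and benign variants no better than chance, which is why the GWAS range of 0.4766–0.5188 is described as poor.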




Source journal

Genomics, Proteomics & Bioinformatics (Biochemistry, Genetics and Molecular Biology: Biochemistry)

CiteScore: 14.30

Self-citation rate: 4.20%

Articles published: 844

Review time: 61 days

About the journal: Genomics, Proteomics and Bioinformatics (GPB) is the official journal of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China. It aims to disseminate new developments in the field of omics and bioinformatics, publish high-quality discoveries quickly, and promote open access and online publication. GPB welcomes submissions in all areas of life science, biology, and biomedicine, with a focus on large data acquisition, analysis, and curation. Manuscripts covering omics and related bioinformatics topics are particularly encouraged. GPB is indexed/abstracted by PubMed/MEDLINE, PubMed Central, Scopus, BIOSIS Previews, Chemical Abstracts, and CSCD, among others.