Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets

Proteomics (IF 3.9; Zone 4, Biology; JCR Q2, Biochemical Research Methods). Pub Date: 2024-12-30. DOI: 10.1002/pmic.202400100

Yannis Schumann, Antonia Gocke, Julia E. Neumann

Proteomics, vol. 25, no. 1-2. Publication type: Journal Article. Full text: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/pmic.202400100


Abstract

Molecular profiling of different omic modalities (e.g., DNA methylomics, transcriptomics, proteomics) in biological systems represents the basis for research and clinical decision-making. Measurement-specific biases, so-called batch effects, often hinder the integration of independently acquired datasets, and missing values further hamper the applicability of typical data processing algorithms. In addition to careful experimental design and well-defined standards for data acquisition and data exchange, alleviating these phenomena requires a dedicated data integration and preprocessing pipeline. This review aims to give a comprehensive overview of computational methods for data integration and missing value imputation in omic data analyses.

We provide formal definitions for missing value mechanisms and propose a novel statistical taxonomy for batch effects, especially in the presence of missing data. Based on an automated document search and systematic literature review, we describe 32 distinct data integration methods from five main methodological categories, as well as 37 algorithms for missing value imputation from five separate categories. Additionally, this review highlights multiple quantitative evaluation methods to aid researchers in selecting a suitable set of methods for their work. This work also provides an integrated discussion of the relevance of batch effects and missing values in omics, with corresponding method recommendations. We then propose a comprehensive three-step workflow from study conception to final data analysis and deduce perspectives for future research. Lastly, we present a comprehensive flow chart as well as exemplary decision trees to aid practitioners in selecting specific approaches for imputation and data integration in their studies.
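
To make the terms above concrete, the following is a minimal sketch and not a method taken from the review. It assumes that a batch effect is a simple per-feature additive shift, removes it by centering each batch on its observed values, fills the remaining gaps with the per-feature mean, and scores an imputation function by hiding known values uniformly at random (a setting often called "missing completely at random"). All function names and the toy data are hypothetical.

```python
import numpy as np

def center_batches(x, batches):
    """Subtract each batch's per-feature mean, ignoring missing values (NaN).

    x: (samples, features) array; batches: (samples,) array of batch labels.
    Assumption: the batch effect is a purely additive per-feature shift.
    """
    out = x.astype(float).copy()
    for b in np.unique(batches):
        rows = batches == b
        out[rows] -= np.nanmean(out[rows], axis=0)
    return out

def impute_feature_mean(x):
    """Replace each NaN with the mean of the observed values of that feature."""
    out = x.copy()
    col_means = np.nanmean(out, axis=0)
    missing = np.isnan(out)
    # Look up the column mean for every missing position (row-major order).
    out[missing] = np.take(col_means, np.nonzero(missing)[1])
    return out

def masked_rmse(x_complete, impute, frac=0.2, seed=0):
    """Hide a random fraction of known values, impute, and return the RMSE.

    x_complete must contain no NaN; entries are hidden uniformly at random.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(x_complete.shape) < frac
    x_masked = np.where(mask, np.nan, x_complete)
    x_imputed = impute(x_masked)
    return float(np.sqrt(np.mean((x_imputed[mask] - x_complete[mask]) ** 2)))

# Toy data: 6 samples x 3 features; the second batch is shifted by +5.
rng = np.random.default_rng(0)
batches = np.array([0, 0, 0, 1, 1, 1])
data = rng.normal(size=(6, 3)) + 5.0 * batches[:, None]
data[1, 2] = np.nan  # one missing measurement

corrected = center_batches(data, batches)   # NaN entries stay NaN
completed = impute_feature_mean(corrected)  # no NaN entries remain
```

Centering and mean imputation are deliberately simplistic stand-ins here. The same `masked_rmse` harness could be pointed at any other imputation function with the same array-in, array-out signature to compare candidates on data where the true values are known.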


Source journal: Proteomics (Biology, Biochemical Research Methods)

CiteScore: 6.30
Self-citation rate: 5.90%
Articles published: 193
Review time: 3 months

Journal description: PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.