Guiding questions to avoid data leakage in biological machine learning applications

Nature Methods | Pub Date: 2024-08-09 | DOI: 10.1038/s41592-024-02362-y | IF 36.1 | Zone 1 (Biology) | JCR Q1 (Biochemical Research Methods)

Judith Bernett, David B. Blumenthal, Dominik G. Grimm, Florian Haselbeck, Roman Joeres, Olga V. Kalinina, Markus List


Citations: 0

Abstract

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets because of their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine-learning-based research in biology.

This Perspective discusses the issue of data leakage in machine-learning-based models and presents seven questions designed to identify and avoid the problems resulting from data leakage.
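As an illustration of the kind of leakage the abstract describes, the sketch below shows one common safeguard: splitting data by group so that related samples never land on both sides of the train/test boundary. This is not taken from the paper, and it does not reproduce the seven questions. The function name `group_split`, the toy samples, and the group labels (imagined here as protein families) are all hypothetical, and choosing an appropriate grouping for a real biological dataset is exactly the hard part.

```python
import random
from collections import defaultdict


def group_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples so that no group appears in both train and test.

    `group_of` maps a sample to its group label (hypothetical here:
    a protein family, a patient, a cell line, ...).
    """
    # Collect samples by group so that whole groups move together.
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle group labels reproducibly, then assign whole groups to test.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test_groups = max(1, round(test_fraction * len(group_ids)))
    test_ids = set(group_ids[:n_test_groups])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test


# Hypothetical toy data: (sample_id, group_label) pairs.
samples = [
    ("s1", "famA"), ("s2", "famA"), ("s3", "famB"), ("s4", "famC"),
    ("s5", "famC"), ("s6", "famD"), ("s7", "famE"),
]
train, test = group_split(samples, group_of=lambda s: s[1])
train_groups = {g for _, g in train}
test_groups = {g for _, g in test}
```

A plain random split over individual samples could place `s1` in training and `s2` in testing even though both belong to `famA`; splitting on groups rules that out by construction. With very few groups the training set can end up small or empty, so this sketch only suits data with many groups.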


Source journal: Nature Methods (Biology, Biochemical Research Methods)

CiteScore: 58.70

Self-citation rate: 1.70%

Articles published: 326

Review time: 1 month

About the journal: Nature Methods is a monthly journal that focuses on publishing innovative methods and substantial enhancements to fundamental life sciences research techniques. Geared towards a diverse, interdisciplinary readership of researchers in academia and industry engaged in laboratory work, the journal offers new tools for research and emphasizes the immediate practical significance of the featured work. It publishes primary research papers and reviews recent technical and methodological advancements, with a particular interest in primary methods papers relevant to the biological and biomedical sciences. This includes methods rooted in chemistry with practical applications for studying biological problems.
