The effects of data leakage on connectome-based machine learning models.

IF 3.2 | Zone 2 (Earth Sciences) | Q1 GEOCHEMISTRY & GEOPHYSICS | Geophysics | Pub Date: 2023-12-28 | DOI: 10.1101/2023.06.09.544383

Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, Dustin Scheinost

Abstract

Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
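The feature-selection leakage described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' pipeline: the data are pure noise, the "model" is a simplified sum score over selected features, and every name here (`select_features`, `cross_validated_r`, `run_demo`, and the default sizes) is a hypothetical chosen for illustration. The only difference between the two runs is whether features are selected using all subjects (leaky) or only the training folds.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_features(X, y, rows, k):
    """Return (index, sign) for the k features most correlated with y,
    computed only from the subjects listed in `rows`."""
    y_rows = [y[i] for i in rows]
    scored = []
    for j in range(len(X[0])):
        r = pearson([X[i][j] for i in rows], y_rows)
        scored.append((abs(r), j, 1.0 if r >= 0 else -1.0))
    scored.sort(reverse=True)
    return [(j, sign) for _, j, sign in scored[:k]]


def cross_validated_r(X, y, k, n_folds, leaky, rng):
    """Correlation between held-out sum-score predictions and y."""
    order = list(range(len(y)))
    rng.shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    preds = [0.0] * len(y)
    for test in folds:
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        # Leaky: selection sees the test subjects. Proper: training subjects only.
        rows = order if leaky else train
        features = select_features(X, y, rows, k)
        for i in test:
            preds[i] = sum(sign * X[i][j] for j, sign in features)
    return pearson(preds, y)


def run_demo(seed=0, n_subjects=60, n_features=500, k=10, n_folds=5):
    """Noise-only data (an assumption): there is no true signal to find."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
    y = [rng.gauss(0, 1) for _ in range(n_subjects)]
    leaky_r = cross_validated_r(X, y, k, n_folds, leaky=True, rng=rng)
    proper_r = cross_validated_r(X, y, k, n_folds, leaky=False, rng=rng)
    return leaky_r, proper_r


if __name__ == "__main__":
    leaky_r, proper_r = run_demo()
    print(f"leaky feature selection:  r = {leaky_r:.2f}")
    print(f"proper feature selection: r = {proper_r:.2f}")
```

Because the features and the target are independent noise, any apparent performance in the leaky run comes from selecting features on subjects that are later used for testing. The small sample size in this sketch is deliberate, in line with the abstract's observation that small datasets are more sensitive to leakage.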
