Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, Dustin Scheinost
{"title":"The effects of data leakage on connectome-based machine learning models.","authors":"Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, Dustin Scheinost","doi":"10.1101/2023.06.09.544383","DOIUrl":null,"url":null,"abstract":"<p><p>Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.</p>","PeriodicalId":55102,"journal":{"name":"Geophysics","volume":"1 1","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10793416/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Geophysics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.06.09.544383","RegionNum":2,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GEOCHEMISTRY & GEOPHYSICS","Score":null,"Total":0}
引用次数: 0
Abstract
Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
期刊介绍:
Geophysics, published by the Society of Exploration Geophysicists since 1936, is an archival journal encompassing all aspects of research, exploration, and education in applied geophysics.
Geophysics articles, generally more than 275 per year in six issues, cover the entire spectrum of geophysical methods, including seismology, potential fields, electromagnetics, and borehole measurements. Geophysics, a bimonthly, provides theoretical and mathematical tools needed to reproduce depicted work, encouraging further development and research.
Geophysics papers, drawn from industry and academia, undergo a rigorous peer-review process to validate the described methods and conclusions and ensure the highest editorial and production quality. Geophysics editors strongly encourage the use of real data, including actual case histories, to highlight current technology and tutorials to stimulate ideas. Some issues feature a section of solicited papers on a particular subject of current interest. Recent special sections focused on seismic anisotropy, subsalt exploration and development, and microseismic monitoring.
The PDF format of each Geophysics paper is the official version of record.