Some combinatorics of data leakage induced by clusters

IF 3.9 · Zone 3 (Environmental Science and Ecology) · Q1 ENGINEERING, CIVIL · Stochastic Environmental Research and Risk Assessment · Pub Date: 2024-04-11 · DOI: 10.1007/s00477-024-02715-1
Fabian Guignard, David Ginsbourger, Lilia Levy Häner, Juan Manuel Herrera
Citations: 0

Abstract

Data leakage is a common issue that can lead to misleading generalisation error estimation and incorrect hyperparameter tuning. However, its mechanisms are not always well understood. In this work, we consider the case of clustered data and investigate the distribution of the number of elements in leakage when the data set is uniformly split. For both the validation and test sets, the first and second moments of the number of elements in leakage are derived analytically. Modelling consequences are investigated and exemplified on simulated data. In addition, the case of an actual agronomic feasibility study is presented. We demonstrate how data leakage can distort model performance estimation when an inadequate data splitting strategy is used. We provide an understanding of data leakage in the context of clustered data by quantifying its role in predictive modelling. This sheds light on related challenges that may impact the practice in agronomy and beyond.
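
To make the quantity studied above more concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not spell out: a test element is counted as "in leakage" when at least one element of the same cluster lands in the training set, and the data are split uniformly at random into train and test only (no validation set). All function names are hypothetical. The closed form in `exact_mean_leakage` is a plain linearity-of-expectation computation under these assumptions; it is not taken from the paper and may differ from the authors' derivation. `cluster_wise_split` illustrates one splitting strategy that avoids this kind of leakage by construction.

```python
import math
import random
from collections import Counter


def count_leaked(cluster_ids, test_idx, train_idx):
    """Count test elements whose cluster also has a member in the training set.

    Assumed definition of "in leakage"; the paper's definition may differ.
    """
    train_clusters = {cluster_ids[i] for i in train_idx}
    return sum(1 for i in test_idx if cluster_ids[i] in train_clusters)


def simulate_mean_leakage(cluster_ids, n_test, n_rep=5000, seed=0):
    """Monte Carlo estimate of the mean leaked count under a uniform split."""
    rng = random.Random(seed)
    indices = list(range(len(cluster_ids)))
    total = 0
    for _ in range(n_rep):
        rng.shuffle(indices)  # uniform random permutation of the elements
        total += count_leaked(cluster_ids, indices[:n_test], indices[n_test:])
    return total / n_rep


def exact_mean_leakage(cluster_ids, n_test):
    """Expected leaked count by linearity of expectation (same assumptions)."""
    n = len(cluster_ids)
    n_train = n - n_test
    if n_test == 0 or n_train == 0:
        return 0.0  # nothing can leak without both a train and a test set
    expected = 0.0
    for size in Counter(cluster_ids).values():
        # P(no cluster-mate is in train | the element itself is in test)
        p_none = math.comb(n - size, n_train) / math.comb(n - 1, n_train)
        expected += size * (n_test / n) * (1.0 - p_none)
    return expected


def cluster_wise_split(cluster_ids, n_test_clusters, seed=0):
    """Assign whole clusters to the test set, so no cluster straddles the split."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    test_clusters = set(clusters[:n_test_clusters])
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    return test_idx, train_idx


if __name__ == "__main__":
    # Toy data set: three clusters of sizes 5, 3 and 2 (illustrative only).
    ids = [0] * 5 + [1] * 3 + [2] * 2
    print("simulated mean:", simulate_mean_leakage(ids, n_test=3))
    print("exact mean:    ", exact_mean_leakage(ids, n_test=3))
    test_idx, train_idx = cluster_wise_split(ids, n_test_clusters=1)
    print("leaked with cluster-wise split:", count_leaked(ids, test_idx, train_idx))
```

Under these assumptions, the simulated and exact means should agree up to Monte Carlo error, and the cluster-wise split should report zero leaked elements, since no cluster contributes to both sides of the split.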

Source journal
CiteScore: 7.10
Self-citation rate: 9.50%
Articles published: 189
Review time: 3.8 months
Journal description: Stochastic Environmental Research and Risk Assessment (SERRA) publishes research papers, reviews and technical notes on stochastic and probabilistic approaches to environmental sciences and engineering, including interactions of earth and atmospheric environments with people and ecosystems. The basic idea is to bring together research papers on stochastic modelling in various fields of environmental sciences and to provide an interdisciplinary forum for the exchange of ideas, for communicating on issues that cut across disciplinary barriers, and for the dissemination of stochastic techniques used in different fields to the community of interested researchers. Original contributions dealing with modelling (theoretical and computational), measurements and instrumentation will be considered in one or more of the following topical areas:
- Spatiotemporal analysis and mapping of natural processes.
- Enviroinformatics.
- Environmental risk assessment, reliability analysis and decision making.
- Surface and subsurface hydrology and hydraulics.
- Multiphase porous media domains and contaminant transport modelling.
- Hazardous waste site characterization.
- Stochastic turbulence and random hydrodynamic fields.
- Chaotic and fractal systems.
- Random waves and seafloor morphology.
- Stochastic atmospheric and climate processes.
- Air pollution and quality assessment research.
- Modern geostatistics.
- Mechanisms of pollutant formation, emission, exposure and absorption.
- Physical, chemical and biological analysis of human exposure from single and multiple media and routes; control and protection.
- Bioinformatics.
- Probabilistic methods in ecology and population biology.
- Epidemiological investigations.
- Models using stochastic differential equations or stochastic partial differential equations.
Latest articles from this journal
- Hybrid method for rainfall-induced regional landslide susceptibility mapping
- Prediction of urban flood inundation using Bayesian convolutional neural networks
- Unravelling complexities: a study on geopolitical dynamics, economic complexity, R&D impact on green innovation in China
- AHP and FAHP-based multi-criteria analysis for suitable dam location analysis: a case study of the Bagmati Basin, Nepal
- Risk and retraction: asymmetric nexus between monetary policy uncertainty and eco-friendly investment