原则数据预处理:在水污染水生生物指标中的应用

Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac
{"title":"原则数据预处理:在水污染水生生物指标中的应用","authors":"Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac","doi":"10.1109/DEXA.2017.27","DOIUrl":null,"url":null,"abstract":"In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline, but most importantly the quality of the data with appropriate data preprocessing choices. Since various preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps that depends both on the input data distributional characteristics and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that selects, given a dataset, the optimal set of preprocessing tasks to apply to the data such that the overall data preprocessing output maximizes the quality of the analytical results for various techniques of clustering, regression, and classification. We present some promising results that validate our approach on biomonitoring data preparation.","PeriodicalId":127009,"journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Principled Data Preprocessing: Application to Biological Aquatic Indicators of Water Pollution\",\"authors\":\"Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac\",\"doi\":\"10.1109/DEXA.2017.27\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline, but most importantly the quality of the data with appropriate data preprocessing choices. Since various preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps that depends both on the input data distributional characteristics and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that selects, given a dataset, the optimal set of preprocessing tasks to apply to the data such that the overall data preprocessing output maximizes the quality of the analytical results for various techniques of clustering, regression, and classification. We present some promising results that validate our approach on biomonitoring data preparation.\",\"PeriodicalId\":127009,\"journal\":{\"name\":\"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DEXA.2017.27\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DEXA.2017.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

摘要

在许多生物学研究中,统计和数据挖掘方法被广泛用于分析数据和发现可操作的知识。但是,糟糕的数据质量导致不正确的分析结果和错误的解释可能导致误导性的结论和不充分的决策。为了保证结果的有效性,避免偏差和数据误用,不仅需要控制整个分析管道,最重要的是要通过适当的数据预处理选择来控制数据的质量。由于各种预处理技术和替代策略可能导致截然不同的输出,因此依赖于原则和严格的方法来选择最佳的数据预处理步骤集至关重要,这些步骤取决于输入数据分布特征和目标统计或数据挖掘方法的固有特征。在本文中,我们提出了一种方法,在给定数据集的情况下,选择一组最优的预处理任务来应用于数据,从而使总体数据预处理输出最大化各种聚类、回归和分类技术的分析结果的质量。我们提出了一些有希望的结果,验证了我们在生物监测数据准备方面的方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Principled Data Preprocessing: Application to Biological Aquatic Indicators of Water Pollution
In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline, but most importantly the quality of the data with appropriate data preprocessing choices. Since various preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps that depends both on the input data distributional characteristics and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that selects, given a dataset, the optimal set of preprocessing tasks to apply to the data such that the overall data preprocessing output maximizes the quality of the analytical results for various techniques of clustering, regression, and classification. We present some promising results that validate our approach on biomonitoring data preparation.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
MuMs: Energy-Aware VM Selection Scheme for Cloud Data Center Biclustering of Biological Sequences Global and Local Feature Learning for Ego-Network Analysis Evaluation of Contextualization and Diversification Approaches in Aggregated Search Towards a Cloud of Clouds Elasticity Management System
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1