Facing & mitigating common challenges when working with real-world data: The Data Learning Paradigm

IF 3.7 | Zone 3 (Computer Science) | JCR Q2, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Journal of Computational Science | Pub Date: 2025-02-01 | Epub Date: 2025-01-08 | DOI: 10.1016/j.jocs.2024.102523

Jake Lever, Sibo Cheng, César Quilodrán Casas, Che Liu, Hongwei Fan, Robert Platt, Andrianirina Rakotoharisoa, Eleda Johnson, Siyi Li, Zhendan Shang, Rossella Arcucci

Journal of Computational Science, volume 85, Article 102523. Publication type: Journal Article. Open access: no. https://www.sciencedirect.com/science/article/pii/S1877750324003168

Citations: 0

Abstract

Data-driven applications are growing rapidly across virtually all scientific domains, which has led to an increasing demand for effective methods to handle data deficiencies and mitigate the effects of imperfect data. This paper presents a guide for researchers working on real-world data-driven applications and the challenges associated with them. It proposes the concept of the Data Learning Paradigm, which combines the principles of machine learning, data science and data assimilation to tackle real-world challenges in data-driven applications. Models are a product of the data upon which they are trained, and no data collected from real-world scenarios is perfect, owing to the natural limitations of sensing and collection. Thus, computational modelling of real-world systems is intrinsically limited by the various deficiencies encountered in real data. The Data Learning Paradigm aims to leverage the strengths of data improvement to enhance the accuracy, reliability, and interpretability of data-driven models. We outline a range of machine learning and data science methods currently being implemented in the field of Data Learning, and discuss how they mitigate the various problems associated with data-driven models, illustrating improved results in a multitude of real-world applications. We highlight examples where these methods have led to significant advancements in fields such as environmental monitoring, planetary exploration, healthcare analytics, linguistic analysis, social networks, and smart manufacturing. We offer a guide to how these methods may be implemented to deal with general types of limitations in data, alongside their current and potential applications.

Source journal

Journal of Computational Science
Categories: COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS

CiteScore: 5.50
Self-citation rate: 3.00%
Articles published: 227
Review time: 41 days

About the journal: Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory. Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation. This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods. Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g. problem solving environments).