When we talk about Big Data, what do we really mean? Toward a more precise definition of Big Data
Pub Date: 2024-09-10 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1441869
Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos
Despite the lack of consensus on an official definition of Big Data, research has continued to progress on this "no consensus" basis over the years. However, the absence of a clear definition and scope for Big Data leaves scientific research and communication without a common ground. Even with the popular "V" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to develop a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) of secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite general agreement on the "V" characteristics, researchers in different scientific fields hold varied implicit understandings of Big Data, and these implicit understandings significantly influence the content and discussions of studies involving Big Data, even though they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.
{"title":"When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data.","authors":"Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos","doi":"10.3389/fdata.2024.1441869","DOIUrl":"https://doi.org/10.3389/fdata.2024.1441869","url":null,"abstract":"<p><p>Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this \"no consensus\" stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular \"V\" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the \"V\" characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1441869"},"PeriodicalIF":2.4,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11420115/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142332189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SparkDWM: a scalable design of a Data Washing Machine using Apache Spark
Pub Date: 2024-09-09 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1446071
Nicholas Kofi Akortia Hagan, John R Talburt
Data volume has become one of the fastest-growing aspects of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by identifying records that refer to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is scaling to meet rising data needs. This research refactors a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Datasets (RDDs) and improve the Data Washing Machine design to use intrinsic metadata information from references. We show that our system achieves the same results as the legacy Data Washing Machine on 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compare our system with Famer and find that it can identify more clusters when given optimal starting parameters for clustering.
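The RDD-based redesign is easiest to see in miniature. The sketch below expresses a token-blocking pass, a standard first step in distributed entity resolution, as PySpark RDD transformations; it is illustrative only, and the records and variable names are hypothetical rather than taken from SparkDWM:

```python
# A minimal token-blocking pass over references as PySpark RDD
# transformations. Illustrative only: SparkDWM's actual pipeline is more
# involved, and the records and names here are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="er-blocking-sketch")

# Each reference is (record_id, list_of_normalized_tokens).
refs = sc.parallelize([
    ("r1", ["john", "smith", "dallas"]),
    ("r2", ["jon", "smith", "dallas"]),
    ("r3", ["mary", "jones", "austin"]),
])

# Blocking: group records that share a token, so detailed comparisons
# happen only within blocks instead of across all possible pairs.
blocks = (refs.flatMap(lambda kv: [(tok, kv[0]) for tok in kv[1]])
              .groupByKey()
              .mapValues(list)
              .filter(lambda kv: len(kv[1]) > 1))

# Candidate pairs from each block, deduplicated across blocks.
pairs = (blocks.flatMap(lambda kv: [tuple(sorted((a, b)))
                                    for i, a in enumerate(kv[1])
                                    for b in kv[1][i + 1:]])
               .distinct())

print(pairs.collect())  # e.g., [('r1', 'r2')]
sc.stop()
```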
{"title":"SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.","authors":"Nicholas Kofi Akortia Hagan, John R Talburt","doi":"10.3389/fdata.2024.1446071","DOIUrl":"10.3389/fdata.2024.1446071","url":null,"abstract":"<p><p>Data volume has been one of the fast-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our systems achieve the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets from a few thousand to millions. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1446071"},"PeriodicalIF":2.4,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416992/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deepfake: definitions, performance metrics and standards, datasets, and a meta-review
Pub Date: 2024-09-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1400024
Enes Altuncu, Virginia N L Franqueira, Shujun Li
Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term "deepfake." Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper also reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the mentioned aspects but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of the aspects covered.
{"title":"Deepfake: definitions, performance metrics and standards, datasets, and a meta-review.","authors":"Enes Altuncu, Virginia N L Franqueira, Shujun Li","doi":"10.3389/fdata.2024.1400024","DOIUrl":"https://doi.org/10.3389/fdata.2024.1400024","url":null,"abstract":"<p><p>Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term \"deepfake.\" Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper also reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the mentioned aspects but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of the aspects covered.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1400024"},"PeriodicalIF":2.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11408348/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse and Expandable Network for Google's Pathways
Pub Date: 2024-08-29 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1348030
Charles X Ling, Ganyu Wang, Boyu Wang
Introduction: Recently, Google introduced Pathways as its next-generation AI architecture. Pathways must address three critical challenges: learning one general model for several continuous tasks, ensuring tasks can leverage each other without forgetting old tasks, and learning from multi-modal data such as images and audio. Additionally, Pathways must maintain sparsity in both learning and deployment. Current lifelong multi-task learning approaches are inadequate in addressing these challenges.
Methods: To address these challenges, we propose SEN, a Sparse and Expandable Network. SEN is designed to handle multiple tasks concurrently by maintaining sparsity and enabling expansion when new tasks are introduced. The network leverages multi-modal data, integrating information from different sources while preventing interference between tasks.
Results: The proposed SEN model demonstrates significant improvements in multi-task learning, successfully managing task interference and forgetting. It effectively integrates data from various modalities and maintains efficiency through sparsity during both the learning and deployment phases.
Discussion: SEN offers a straightforward yet effective solution to the limitations of current lifelong multi-task learning methods. By addressing the challenges identified in the Pathways architecture, SEN provides a promising approach for developing AI systems capable of learning and adapting over time without sacrificing performance or efficiency.
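As a rough illustration of the expand-and-freeze idea behind SEN, the sketch below adds a new sub-network per task, freezes previously learned tasks to prevent forgetting, and applies an L1 penalty for sparsity. It assumes a PyTorch setting and is a simplification; the paper's actual architecture, including its multi-modal handling and deployment-time sparsity, is not reproduced here:

```python
# Minimal sketch of a sparse, expandable multi-task network in PyTorch.
# Illustrative of the general expand-and-freeze idea only, not SEN itself.
import torch
import torch.nn as nn

class ExpandableNet(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.columns = nn.ModuleList()   # one sub-network per task
        self.in_dim, self.hidden = in_dim, hidden

    def add_task(self, n_classes: int) -> int:
        # Freeze previously learned tasks so new learning cannot overwrite them.
        for col in self.columns:
            for p in col.parameters():
                p.requires_grad = False
        self.columns.append(nn.Sequential(
            nn.Linear(self.in_dim, self.hidden), nn.ReLU(),
            nn.Linear(self.hidden, n_classes)))
        return len(self.columns) - 1     # task id

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.columns[task_id](x)

    def l1_sparsity(self, task_id: int) -> torch.Tensor:
        # L1 penalty on the active column keeps its weights sparse.
        return sum(p.abs().sum() for p in self.columns[task_id].parameters())

net = ExpandableNet(in_dim=32)
t0 = net.add_task(n_classes=10)
x = torch.randn(4, 32)
loss = net(x, t0).sum() + 1e-4 * net.l1_sparsity(t0)
loss.backward()
```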
{"title":"Sparse and Expandable Network for Google's Pathways.","authors":"Charles X Ling, Ganyu Wang, Boyu Wang","doi":"10.3389/fdata.2024.1348030","DOIUrl":"https://doi.org/10.3389/fdata.2024.1348030","url":null,"abstract":"<p><strong>Introduction: </strong>Recently, Google introduced Pathways as its next-generation AI architecture. Pathways must address three critical challenges: learning one general model for several continuous tasks, ensuring tasks can leverage each other without forgetting old tasks, and learning from multi-modal data such as images and audio. Additionally, Pathways must maintain sparsity in both learning and deployment. Current lifelong multi-task learning approaches are inadequate in addressing these challenges.</p><p><strong>Methods: </strong>To address these challenges, we propose SEN, a Sparse and Expandable Network. SEN is designed to handle multiple tasks concurrently by maintaining sparsity and enabling expansion when new tasks are introduced. The network leverages multi-modal data, integrating information from different sources while preventing interference between tasks.</p><p><strong>Results: </strong>The proposed SEN model demonstrates significant improvements in multi-task learning, successfully managing task interference and forgetting. It effectively integrates data from various modalities and maintains efficiency through sparsity during both the learning and deployment phases.</p><p><strong>Discussion: </strong>SEN offers a straightforward yet effective solution to the limitations of current lifelong multi-task learning methods. By addressing the challenges identified in the Pathways architecture, SEN provides a promising approach for developing AI systems capable of learning and adapting over time without sacrificing performance or efficiency.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1348030"},"PeriodicalIF":2.4,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11390433/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient use of binned data for imputing univariate time series data
Pub Date: 2024-08-21 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1422650
Jay Darji, Nupur Biswas, Vijay Padul, Jaya Gill, Santosh Kesari, Shashaanka Ashili
Time series data are recorded in various sectors, resulting in a large amount of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and their performance varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data for different time spans and imputed them using different algorithms with binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing spans, with the greatest reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can support imputation for a wide variety of data, including biological heart rate data derived from an Internet of Things (IoT) smartwatch and non-biological data such as household power consumption data.
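The binning idea can be sketched briefly: impute a missing span using only a window of data around it, and compare RMSE against imputing from the entire series. The sketch below uses pandas with simple mean imputation as a stand-in for the EM and other algorithms the study evaluates; the synthetic series, bin size, and gap length are all illustrative:

```python
# Compare imputation RMSE using the full series vs. a bin around the gap.
# Mean imputation stands in for the paper's EM and other algorithms;
# the synthetic series, bin size, and gap length are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.Series(np.sin(np.linspace(0, 20, 2000)) + 0.1 * rng.standard_normal(2000))

gap_idx = np.arange(1000, 1015)        # 15 consecutive missing samples
truth = ts.iloc[gap_idx].copy()
damaged = ts.copy()
damaged.iloc[gap_idx] = np.nan

def rmse(est: pd.Series) -> float:
    return float(np.sqrt(((est.loc[gap_idx] - truth) ** 2).mean()))

full_fit = damaged.fillna(damaged.mean())        # impute from the entire series
local = damaged.iloc[900:1115]                   # 215-point bin around the gap
bin_fit = damaged.fillna(local.mean())           # impute from the bin only
print("full-series RMSE:", rmse(full_fit))
print("binned RMSE:     ", rmse(bin_fit))
```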
{"title":"Efficient use of binned data for imputing univariate time series data.","authors":"Jay Darji, Nupur Biswas, Vijay Padul, Jaya Gill, Santosh Kesari, Shashaanka Ashili","doi":"10.3389/fdata.2024.1422650","DOIUrl":"10.3389/fdata.2024.1422650","url":null,"abstract":"<p><p>Time series data are recorded in various sectors, resulting in a large amount of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods is widely varied. Apart from the choice of algorithm, the effective imputation depends on the nature of missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated the missing data for different time spans and imputed using different algorithms with binned data of different sizes. The performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. We found that RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends precisely on the span of missing data, sampling frequency of the data, and fluctuation within data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can impute a wide variety of data, including biological heart rate data derived from the Internet of Things (IoT) device smartwatch and non-biological data such as household power consumption data.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1422650"},"PeriodicalIF":2.4,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11371617/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142134416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Equitable differential privacy
Pub Date: 2024-08-16 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1420344
Vasundhara Kaul, Tamalika Mukherjee
Differential privacy (DP) has been in the public spotlight since the announcement of its use in the 2020 U.S. Census. While DP algorithms have substantially improved the confidentiality protections provided to Census respondents, concerns have been raised about the accuracy of the DP-protected Census data. The extent to which the use of DP distorts policy-relevant inferences about small populations, especially marginalized communities, has been of particular concern to researchers and policymakers. After all, inaccurate information about marginalized populations can often engender policies that exacerbate rather than ameliorate social inequities. Consequently, computer science experts have focused on developing mechanisms that help achieve equitable privacy, i.e., mechanisms that mitigate the data distortions introduced by privacy protections to ensure equitable outcomes and benefits for all groups, particularly marginalized groups. Our paper extends the conversation on equitable privacy by highlighting the importance of inclusive communication in ensuring equitable outcomes for all social groups through all the stages of deploying a differentially private system. We conceptualize Equitable DP as the design, communication, and implementation of DP algorithms that ensure equitable outcomes. Thus, in addition to adopting computer scientists' recommendations of incorporating equity parameters within DP algorithms, we suggest that it is critical for an organization to facilitate inclusive communication throughout the design, development, and implementation stages of a DP algorithm to ensure that it has an equitable impact on social groups and does not hinder the redressal of social inequities. To demonstrate the importance of communication for Equitable DP, we undertake a case study of the process through which DP was adopted as the newest disclosure avoidance system for the 2020 U.S. Census. Drawing on the Inclusive Science Communication (ISC) framework, we examine the extent to which the Census Bureau's communication strategies encouraged engagement across the diverse groups of users that employ the decennial Census data for research and policymaking. Our analysis provides lessons that can be used by other government organizations interested in incorporating the Equitable DP approach in their data collection practices.
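The small-population concern has a simple quantitative core: a Laplace mechanism adds noise whose scale depends only on the privacy budget, not on the size of the count it protects, so relative error grows as groups get smaller. A minimal illustration of that point follows (a plain Laplace mechanism for count queries, not the Census Bureau's actual TopDown algorithm):

```python
# Why fixed-scale DP noise distorts small populations more, in miniature.
# Plain Laplace mechanism for counts (sensitivity 1); not the Census
# Bureau's TopDown algorithm.
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.5                      # privacy budget
noise_scale = 1.0 / epsilon        # Laplace scale = sensitivity / epsilon

for true_count in (50, 5000, 500000):
    noisy = true_count + rng.laplace(0.0, noise_scale, size=10000)
    rel_err = np.mean(np.abs(noisy - true_count)) / true_count
    print(f"count={true_count:>7}: mean relative error = {rel_err:.4%}")
```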
{"title":"Equitable differential privacy.","authors":"Vasundhara Kaul, Tamalika Mukherjee","doi":"10.3389/fdata.2024.1420344","DOIUrl":"10.3389/fdata.2024.1420344","url":null,"abstract":"<p><p>Differential privacy (DP) has been in the public spotlight since the announcement of its use in the 2020 U.S. Census. While DP algorithms have substantially improved the confidentiality protections provided to Census respondents, concerns have been raised about the accuracy of the DP-protected Census data. The extent to which the use of DP distorts the ability to draw inferences that drive policy about small-populations, especially marginalized communities, has been of particular concern to researchers and policy makers. After all, inaccurate information about marginalized populations can often engender policies that exacerbate rather than ameliorate social inequities. Consequently, computer science experts have focused on developing mechanisms that help achieve equitable privacy, i.e., mechanisms that mitigate the data distortions introduced by privacy protections to ensure equitable outcomes and benefits for all groups, particularly marginalized groups. Our paper extends the conversation on equitable privacy by highlighting the importance of inclusive communication in ensuring equitable outcomes for all social groups through all the stages of deploying a differentially private system. We conceptualize Equitable DP as the design, communication, and implementation of DP algorithms that ensure equitable outcomes. Thus, in addition to adopting computer scientists' recommendations of incorporating equity parameters within DP algorithms, we suggest that it is critical for an organization to also facilitate inclusive communication throughout the design, development, and implementation stages of a DP algorithm to ensure it has an equitable impact on social groups and does not hinder the redressal of social inequities. To demonstrate the importance of communication for Equitable DP, we undertake a case study of the process through which DP was adopted as the newest disclosure avoidance system for the 2020 U.S. Census. Drawing on the Inclusive Science Communication (ISC) framework, we examine the extent to which the Census Bureau's communication strategies encouraged engagement across the diverse groups of users that employ the decennial Census data for research and policy making. Our analysis provides lessons that can be used by other government organizations interested in incorporating the Equitable DP approach in their data collection practices.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1420344"},"PeriodicalIF":2.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363707/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142114688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data science's cultural construction: qualitative ideas for quantitative work
Pub Date: 2024-08-14 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1287442
Philipp Brandt
Introduction: "Data scientists" quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies data science's collective definition on Twitter.
Methods: The analysis addresses the challenges of studying an emergent case with unclear boundaries and substance by combining a cultural perspective with complementary datasets ranging from 1,025 to 752,815 tweets. It brings together relations between accounts that tweeted about data science, the hashtags they used (indicating purposes), and the topics they discussed.
Results: The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science.
Discussion: The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.
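For readers wanting a concrete handle on the Methods above, the sketch below builds the account-hashtag relations described there as a bipartite graph and projects it onto hashtags to see which purposes co-occur. The tweet records and field names are hypothetical, and this is only the skeleton of such an analysis:

```python
# Build account-hashtag relations from tweets as a bipartite graph and
# project onto hashtags. Records and field names are hypothetical.
import networkx as nx

tweets = [
    {"account": "alice", "hashtags": ["datascience", "ml"]},
    {"account": "bob",   "hashtags": ["datascience", "ethics"]},
    {"account": "carol", "hashtags": ["ethics"]},
]

G = nx.Graph()
for t in tweets:
    for tag in t["hashtags"]:
        # Tag each node with its side of the bipartite graph.
        G.add_edge(("acct", t["account"]), ("tag", tag))

# Project onto hashtags: edges weighted by how many accounts share them.
tag_nodes = [n for n in G if n[0] == "tag"]
proj = nx.bipartite.weighted_projected_graph(G, tag_nodes)
print(sorted(proj.edges(data="weight")))
```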
{"title":"Data science's cultural construction: qualitative ideas for quantitative work.","authors":"Philipp Brandt","doi":"10.3389/fdata.2024.1287442","DOIUrl":"https://doi.org/10.3389/fdata.2024.1287442","url":null,"abstract":"<p><strong>Introduction: </strong>\"Data scientists\" quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies data science's collective definition on Twitter.</p><p><strong>Methods: </strong>The analysis responds to the challenges of studying an emergent case with unclear boundaries and substance through a cultural perspective and complementary datasets ranging from 1,025 to 752,815 tweets. It brings together relations between accounts that tweeted about data science, the hashtags they used, indicating purposes, and the topics they discussed.</p><p><strong>Results: </strong>The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science.</p><p><strong>Discussion: </strong>The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1287442"},"PeriodicalIF":2.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11349665/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142114687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The development and application of a novel E-commerce recommendation system used in electric power B2B sector
Pub Date: 2024-07-31 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1374980
Wenjun Meng, Lili Chen, Zhaomin Dong
The advent of the digital era has transformed E-commerce platforms into critical tools for industry, yet traditional recommendation systems often fall short in the specialized context of the electric power industry. These systems typically struggle with the industry's unique challenges, such as infrequent and high-stakes transactions, prolonged decision-making processes, and sparse data. This research develops a novel recommendation engine tailored to these specific conditions, such as the low-frequency, long-cycle nature of Business-to-Business (B2B) transactions. The approach includes algorithmic enhancements to better process and interpret the limited data available, and data pre-processing techniques designed to enrich the sparse datasets characteristic of this industry. The research also introduces a methodological innovation that integrates multi-dimensional data, combining user E-commerce activities, product specifics, and essential non-tendering information. The proposed engine employs advanced machine learning techniques to provide more accurate and relevant recommendations. The results demonstrate a marked improvement over traditional models, offering a more robust and effective tool for facilitating B2B transactions in the electric power industry. This research not only addresses the sector's unique challenges but also provides a blueprint for adapting recommendation systems to other industries with similar B2B characteristics.
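The paper does not publish its model, so the sketch below only illustrates the multi-dimensional feature idea it describes: represent each buyer-product pair by user-activity, product, and non-tendering features, then rank candidates by a learned purchase score. The features, synthetic data, and choice of a gradient-boosted classifier are assumptions for illustration, not the paper's engine:

```python
# Score buyer-product pairs from multi-dimensional features. The feature
# set, synthetic data, and gradient-boosted model are illustrative
# assumptions; the paper's actual engine is not public.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.poisson(2, n),          # user activity: recent views of the product
    rng.exponential(1.0, n),    # product attribute: log-scaled price band
    rng.integers(0, 2, n),      # non-tendering signal: existing supplier link
])
# Synthetic label: whether the pair led to a purchase.
y = (X[:, 0] + 2 * X[:, 2] + rng.normal(0, 1, n) > 3).astype(int)

model = GradientBoostingClassifier().fit(X, y)
# Rank candidate products for one buyer by predicted purchase probability.
candidates = np.array([[5, 0.8, 1], [0, 1.2, 0], [3, 0.5, 1]])
print(model.predict_proba(candidates)[:, 1])
```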
{"title":"The development and application of a novel E-commerce recommendation system used in electric power B2B sector.","authors":"Wenjun Meng, Lili Chen, Zhaomin Dong","doi":"10.3389/fdata.2024.1374980","DOIUrl":"https://doi.org/10.3389/fdata.2024.1374980","url":null,"abstract":"<p><p>The advent of the digital era has transformed E-commerce platforms into critical tools for industry, yet traditional recommendation systems often fall short in the specialized context of the electric power industry. These systems typically struggle with the industry's unique challenges, such as infrequent and high-stakes transactions, prolonged decision-making processes, and sparse data. This research has developed a novel recommendation engine tailored to these specific conditions, such as to handle the low frequency and long cycle nature of Business-to-Business (B2B) transactions. This approach includes algorithmic enhancements to better process and interpret the limited data available, and data pre-processing techniques designed to enrich the sparse datasets characteristic of this industry. This research also introduces a methodological innovation that integrates multi-dimensional data, combining user E-commerce activities, product specifics, and essential non-tendering information. The proposed engine employs advanced machine learning techniques to provide more accurate and relevant recommendations. The results demonstrate a marked improvement over traditional models, offering a more robust and effective tool for facilitating B2B transactions in the electric power industry. This research not only addresses the sector's unique challenges but also provides a blueprint for adapting recommendation systems to other industries with similar B2B characteristics.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1374980"},"PeriodicalIF":2.4,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11322496/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141983886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient enhancement of low-rank tensor completion via thin QR decomposition
Pub Date: 2024-07-02 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1382144
Yan Wu, Yunzhi Jin
Low-rank tensor completion (LRTC), which aims to complete missing entries of tensors with partially observed terms by exploiting the low-rank structure of tensors, has been widely applied to real-world problems. The core tensor nuclear norm minimization (CTNM) method based on Tucker decomposition is one of the common LRTC methods. However, CTNM methods based on Tucker decomposition often incur a large computational cost because the standard factor-matrix update involves multiple singular value decompositions (SVDs) in each iteration. To address this problem, this article enhances the method and proposes an effective CTNM method based on thin QR decomposition (CTNM-QR) with lower computational complexity. The proposed method extends CTNM by introducing tensor versions of the auxiliary variables instead of matrices, while using thin QR decomposition to solve for the factor matrices rather than the SVD, which reduces computational cost and improves tensor completion accuracy. In addition, the convergence and complexity of the CTNM-QR method are analyzed. Numerous experiments on synthetic data, real color images, and brain MRI data at different missing rates demonstrate that the proposed method not only outperforms most state-of-the-art LRTC methods in completion accuracy and visualization, but also runs more efficiently.
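The computational point is that a thin QR factorization (numpy's "reduced" mode) yields an orthonormal basis far more cheaply than an SVD of the unfolding. The sketch below contrasts the two routes for a single Tucker factor; pairing thin QR with a random sketch matrix is one common way to get such a basis and is an assumption here, not necessarily the paper's exact CTNM-QR update:

```python
# Contrast SVD vs. thin QR for obtaining an orthonormal mode-0 factor.
# The random sketch matrix G is a common companion to thin QR and is an
# assumption here; the paper's exact CTNM-QR update is not reproduced.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 40, 50))   # stand-in for a partially observed tensor
r = 5                                   # Tucker rank along mode 0

X0 = X.reshape(30, -1)                  # mode-0 unfolding: (30, 40*50)

# SVD route: leading left singular vectors (costly for large unfoldings).
U_svd = np.linalg.svd(X0, full_matrices=False)[0][:, :r]

# Thin QR route: orthonormal basis of a sketched unfolding.
G = rng.standard_normal((X0.shape[1], r))
Q = np.linalg.qr(X0 @ G, mode="reduced")[0]   # (30, r) orthonormal columns

print(U_svd.shape, Q.shape, np.allclose(Q.T @ Q, np.eye(r)))
```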
{"title":"Efficient enhancement of low-rank tensor completion via thin QR decomposition.","authors":"Yan Wu, Yunzhi Jin","doi":"10.3389/fdata.2024.1382144","DOIUrl":"10.3389/fdata.2024.1382144","url":null,"abstract":"<p><p>Low-rank tensor completion (LRTC), which aims to complete missing entries from tensors with partially observed terms by utilizing the low-rank structure of tensors, has been widely used in various real-world issues. The core tensor nuclear norm minimization (CTNM) method based on Tucker decomposition is one of common LRTC methods. However, the CTNM methods based on Tucker decomposition often have a large computing cost due to the fact that the general factor matrix solving technique involves multiple singular value decompositions (SVDs) in each loop. To address this problem, this article enhances the method and proposes an effective CTNM method based on thin QR decomposition (CTNM-QR) with lower computing complexity. The proposed method extends the CTNM by introducing tensor versions of the auxiliary variables instead of matrices, while using the thin QR decomposition to solve the factor matrix rather than the SVD, which can save the computational complexity and improve the tensor completion accuracy. In addition, the CTNM-QR method's convergence and complexity are analyzed further. Numerous experiments in synthetic data, real color images, and brain MRI data at different missing rates demonstrate that the proposed method not only outperforms in terms of completion accuracy and visualization, but also conducts more efficiently than most state-of-the-art LRTC methods.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1382144"},"PeriodicalIF":2.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11250652/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141629268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random kernel k-nearest neighbors regression
Pub Date: 2024-07-01 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1402384
Patchanok Srisuradetchai, Korn Suksrikran
The k-nearest neighbors (KNN) regression method, known for its nonparametric nature, is highly valued for its simplicity and its effectiveness in handling complex structured data, particularly in big data contexts. However, the method is susceptible to overfitting and fit discontinuity, which present significant challenges. This paper introduces random kernel k-nearest neighbors (RK-KNN) regression as a novel approach well suited to big data applications. It integrates kernel smoothing with bootstrap sampling to enhance prediction accuracy and model robustness. The method aggregates multiple predictions using random sampling from the training dataset and selects subsets of input variables for kernel KNN (K-KNN). A comprehensive evaluation of RK-KNN on 15 diverse datasets, employing various kernel functions including Gaussian and Epanechnikov, demonstrates its superior performance. Compared to standard KNN and random KNN (R-KNN) models, it significantly reduces the root mean square error (RMSE) and mean absolute error, and improves R-squared values. The RK-KNN variant that employs the kernel function yielding the lowest RMSE is also benchmarked against state-of-the-art methods, including support vector regression, artificial neural networks, and random forests.
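Since the abstract spells out the ingredients (bootstrap resamples, random feature subsets, kernel-weighted neighbors, aggregation), a compact sketch is possible. The hyperparameters and the Gaussian kernel below are illustrative choices, not the paper's tuned settings:

```python
# Sketch of the RK-KNN idea as described: bootstrap resamples, random
# feature subsets, Gaussian-kernel-weighted KNN, averaged over models.
# Hyperparameter choices here are illustrative, not the paper's.
import numpy as np

def rk_knn_predict(X, y, X_new, k=5, n_models=25, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    preds = np.zeros((n_models, len(X_new)))
    for m in range(n_models):
        rows = rng.integers(0, n, n)                   # bootstrap sample
        cols = rng.choice(p, max(1, int(np.sqrt(p))), replace=False)
        Xb, yb = X[rows][:, cols], y[rows]
        for i, x in enumerate(X_new[:, cols]):
            d = np.linalg.norm(Xb - x, axis=1)
            nn = np.argsort(d)[:k]                     # k nearest neighbors
            w = np.exp(-((d[nn] / bandwidth) ** 2) / 2)  # Gaussian kernel
            preds[m, i] = np.sum(w * yb[nn]) / np.sum(w)
    return preds.mean(axis=0)                          # aggregate models

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.standard_normal(200)
print(rk_knn_predict(X, y, X[:5]))
```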
{"title":"Random kernel k-nearest neighbors regression.","authors":"Patchanok Srisuradetchai, Korn Suksrikran","doi":"10.3389/fdata.2024.1402384","DOIUrl":"10.3389/fdata.2024.1402384","url":null,"abstract":"<p><p>The k-nearest neighbors (KNN) regression method, known for its nonparametric nature, is highly valued for its simplicity and its effectiveness in handling complex structured data, particularly in big data contexts. However, this method is susceptible to overfitting and fit discontinuity, which present significant challenges. This paper introduces the random kernel k-nearest neighbors (RK-KNN) regression as a novel approach that is well-suited for big data applications. It integrates kernel smoothing with bootstrap sampling to enhance prediction accuracy and the robustness of the model. This method aggregates multiple predictions using random sampling from the training dataset and selects subsets of input variables for kernel KNN (K-KNN). A comprehensive evaluation of RK-KNN on 15 diverse datasets, employing various kernel functions including Gaussian and Epanechnikov, demonstrates its superior performance. When compared to standard KNN and the random KNN (R-KNN) models, it significantly reduces the root mean square error (RMSE) and mean absolute error, as well as improving R-squared values. The RK-KNN variant that employs a specific kernel function yielding the lowest RMSE will be benchmarked against state-of-the-art methods, including support vector regression, artificial neural networks, and random forests.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1402384"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11246867/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141622134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}