arXiv - CS - Databases: Latest Papers, Page 8

PGB: Benchmarking Differentially Private Synthetic Graph Generation Algorithms

arXiv - CS - Databases

Pub Date : 2024-08-06 DOI: arxiv-2408.02928

Shang Liu, Hao Du, Yang Cao, Bo Yan, Jinfei Liu, Masatoshi Yoshikawa

Differentially private graph analysis is a powerful tool for deriving insights from diverse graph data while protecting individual information. Designing private analytic algorithms for different graph queries often requires starting from scratch. In contrast, differentially private synthetic graph generation offers a general paradigm that supports one-time generation for multiple queries. Although a rich set of differentially private graph generation algorithms has been proposed, comparing them effectively remains challenging due to various factors, including differing privacy definitions, diverse graph datasets, varied privacy requirements, and multiple utility metrics. To this end, we propose PGB (Private Graph Benchmark), a comprehensive benchmark designed to enable researchers to compare differentially private graph generation algorithms fairly. We begin by identifying four essential elements of existing works as a 4-tuple: mechanisms, graph datasets, privacy requirements, and utility metrics. We discuss principles regarding these elements to ensure the comprehensiveness of a benchmark. Next, we present a benchmark instantiation that adheres to all principles, establishing a new method to evaluate existing and newly proposed graph generation algorithms. Through extensive theoretical and empirical analysis, we gain valuable insights into the strengths and weaknesses of prior algorithms. Our results indicate that there is no universal solution for all possible cases. Finally, we provide guidelines to help researchers select appropriate mechanisms for various scenarios.
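For readers unfamiliar with the idea of generating a synthetic graph once and querying it many times, here is a minimal sketch of one textbook baseline: randomized response applied to every potential edge, which gives an edge-level privacy guarantee for a chosen epsilon. This is an illustration only; the abstract does not say which mechanisms PGB evaluates, and the function names `flip_probability` and `randomized_response_graph` are hypothetical.

```python
import math
import random
from itertools import combinations


def flip_probability(epsilon: float) -> float:
    """Probability of flipping one adjacency bit: 1 / (1 + e^epsilon).

    Written with exp(-epsilon) so that large epsilon values do not overflow.
    """
    decay = math.exp(-epsilon)
    return decay / (1.0 + decay)


def randomized_response_graph(nodes, edges, epsilon, rng=None):
    """Return a synthetic undirected edge set (assumed baseline, not PGB's method).

    Every unordered node pair is treated as one bit (edge present or absent);
    each bit is flipped independently with probability flip_probability(epsilon).
    """
    rng = rng or random.Random()
    p = flip_probability(epsilon)
    true_edges = {frozenset(e) for e in edges}
    synthetic = set()
    for u, v in combinations(sorted(nodes), 2):
        bit = frozenset((u, v)) in true_edges
        if rng.random() < p:
            bit = not bit  # perturb this adjacency bit
        if bit:
            synthetic.add((u, v))
    return synthetic
```

Any number of graph queries can then be run on the returned edge set without spending further privacy budget, which is the "one-time generation for multiple queries" property the abstract describes. The cost is that small epsilon values add many spurious edges to sparse graphs.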

Citations: 0

Automatic String Data Validation with Pattern Discovery

arXiv - CS - Databases

Pub Date : 2024-08-06 DOI: arxiv-2408.03005

Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin

In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream services by enabling early detection of errors and providing detailed error messages for quick resolution. This paper proposes a self-validate data management system with automatic pattern discovery techniques to verify the correctness of semi-structural string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information of historical data is analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specification of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
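The notion of a "format skeleton" learned from historical data can be illustrated with a small sketch. The abstraction used here (runs of digits become `D`, runs of letters become `A`, punctuation is kept verbatim) and the names `skeleton`, `learn_skeletons`, and `validate` are assumptions made for illustration; the abstract does not specify the paper's actual pattern language or its fine-grained semantic patterns.

```python
from collections import Counter


def skeleton(value: str) -> str:
    """Abstract a string into a coarse format skeleton (assumed scheme)."""
    out = []
    for ch in value:
        if ch.isdigit():
            token = "D"
        elif ch.isalpha():
            token = "A"
        else:
            token = ch  # keep separators such as '-', '/', ':' as-is
        # Collapse consecutive characters of the same class into one token.
        if token in ("D", "A") and out and out[-1] == token:
            continue
        out.append(token)
    return "".join(out)


def learn_skeletons(history, min_support=0.05):
    """Keep skeletons that cover at least min_support of the historical values."""
    counts = Counter(skeleton(v) for v in history)
    total = len(history)
    return {s for s, c in counts.items() if c / total >= min_support}


def validate(value: str, accepted) -> bool:
    """Flag an incoming value as valid only if its skeleton was seen before."""
    return skeleton(value) in accepted
```

With a history of ISO-style dates such as `2024-08-06`, the only learned skeleton is `D-D-D`, so `08/06/2024` is rejected. A skeleton this coarse over-generalizes (it would also accept `1-2-3`), which is the kind of gap the finer-grained patterns mentioned in the abstract are meant to close.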

Citations: 0

NeurDB: On the Design and Implementation of an AI-powered Autonomous Database

arXiv - CS - Databases

Pub Date : 2024-08-06 DOI: arxiv-2408.03013

Zhanhao Zhao, Shaofeng Cai, Haotian Gao, Hexiang Pan, Siqi Xiang, Naili Xing, Gang Chen, Beng Chin Ooi, Yanyan Shen, Yuncheng Wu, Meihui Zhang

Databases are increasingly embracing AI to provide autonomous system optimization and intelligent in-database analytics, aiming to relieve end-user burdens across various industry sectors. Nonetheless, most existing approaches fail to account for the dynamic nature of databases, which renders them ineffective for real-world applications characterized by evolving data and workloads. This paper introduces NeurDB, an AI-powered autonomous database that deepens the fusion of AI and databases with adaptability to data and workload drift. NeurDB establishes a new in-database AI ecosystem that seamlessly integrates AI workflows within the database. This integration enables efficient and effective in-database AI analytics and fast-adaptive learned system components. Empirical evaluations demonstrate that NeurDB substantially outperforms existing solutions in managing AI analytics tasks, with the proposed learned components more effectively handling environmental dynamism than state-of-the-art approaches.

Citations: 0

Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]

arXiv - CS - Databases

Pub Date : 2024-08-05 DOI: arxiv-2408.02243

Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.

Citations: 0

Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation

arXiv - CS - Databases

Pub Date : 2024-08-05 DOI: arxiv-2408.02213

Yiyan Li, Haoyang Li, Zhao Pu, Jing Zhang, Xinyi Zhang, Tao Ji, Luming Sun, Cuiping Li, Hong Chen

Knob tuning plays a crucial role in optimizing databases by adjusting knobs to enhance database performance. However, traditional tuning methods often follow a Try-Collect-Adjust approach, proving inefficient and database-specific. Moreover, these methods are often opaque, making it challenging for DBAs to grasp the underlying decision-making process. The emergence of large language models (LLMs) like GPT-4 and Claude-3 has excelled in complex natural language tasks, yet their potential in database knob tuning remains largely unexplored. This study harnesses LLMs as experienced DBAs for knob-tuning tasks with carefully designed prompts. We identify three key subtasks in the tuning system: knob pruning, model initialization, and knob recommendation, proposing LLM-driven solutions to replace conventional methods for each subtask. We conduct extensive experiments to compare LLM-driven approaches against traditional methods across the subtasks to evaluate LLMs' efficacy in the knob tuning domain. Furthermore, we explore the adaptability of LLM-based solutions in diverse evaluation settings, encompassing new benchmarks, database engines, and hardware environments. Our findings reveal that LLMs not only match or surpass traditional methods but also exhibit notable interpretability by generating responses in a coherent "chain-of-thought" manner. We further observe that LLMs exhibit remarkable generalizability through simple adjustments in prompts, eliminating the necessity for additional training or extensive code modifications. Drawing insights from our experimental findings, we identify several opportunities for future research aimed at advancing the utilization of LLMs in the realm of database management.

Citations: 0

AMIDER: A Multidisciplinary Research Database and Its Application to Promote Open Science

arXiv - CS - Databases

Pub Date : 2024-08-05 DOI: arxiv-2408.02246

Masayoshi Kozai (Polar Environment Data Science Center, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Tachikawa, Japan), Yoshimasa Tanaka (Polar Environment Data Science Center, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Tachikawa, Japan; National Institute of Polar Research, Research Organization of Information and Systems, Tachikawa, Japan), Shuji Abe (International Research Center for Space and Planetary Environmental Science, Kyushu University, Fukuoka, Japan), Yasuyuki Minamiyama (Research Center for Open Science and Data Platform, National Institute of Informatics, Research Organization of Information and Systems, Tokyo, Japan), Atsuki Shinbori (Institute for Space and Earth Environmental Research Center for Integrated Data Science, Nagoya University, Nagoya, Japan), Akira Kadokura (Polar Environment Data Science Center, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Tachikawa, Japan)

The AMIDER, Advanced Multidisciplinary Integrated-Database for Exploring new Research, is a newly developed research data catalog to demonstrate an advanced database application. AMIDER is characterized as a multidisciplinary database equipped with a user-friendly web application. Its catalog view displays diverse research data at once beyond any limitation of each individual discipline. Some useful functions, such as a selectable data download, data format conversion, and display of data visual information, are also implemented. Further advanced functions, such as visualization of dataset mutual relationship, are also implemented as a preliminary trial. These characteristics and functions are expected to enhance the accessibility to individual research data, even from non-expertized users, and be helpful for collaborations among diverse scientific fields beyond individual disciplines. Multidisciplinary data management is also one of AMIDER's uniqueness, where various metadata schemas can be mapped to a uniform metadata table, and standardized and self-describing data formats are adopted. AMIDER website (https://amider.rois.ac.jp/) had been launched in April 2024. As of July 2024, over 15,000 metadata in various research fields of polar science have been registered in the database, and approximately 500 visitors are viewing the website every day on average. Expansion of the database to further multidisciplinary scientific fields, not only polar science, is planned, and advanced attempts, such as applying Natural Language Processing (NLP) to metadata, have also been considered.

Citations: 0

Earth System Data Cubes: Avenues for advancing Earth system research

arXiv - CS - Databases

Pub Date : 2024-08-05 DOI: arxiv-2408.02348

David Montero, Guido Kraemer, Anca Anghelea, César Aybar, Gunnar Brandt, Gustau Camps-Valls, Felix Cremer, Ida Flik, Fabian Gans, Sarah Habershon, Chaonan Ji, Teja Kattenborn, Laura Martínez-Ferrer, Francesco Martinuzzi, Martin Reinhardt, Maximilian Söchting, Khalil Teber, Miguel D. Mahecha

Recent advancements in Earth system science have been marked by the exponential increase in the availability of diverse, multivariate datasets characterised by moderate to high spatio-temporal resolutions. Earth System Data Cubes (ESDCs) have emerged as one suitable solution for transforming this flood of data into a simple yet robust data structure. ESDCs achieve this by organising data into an analysis-ready format aligned with a spatio-temporal grid, facilitating user-friendly analysis and diminishing the need for extensive technical data processing knowledge. Despite these significant benefits, the completion of the entire ESDC life cycle remains a challenging task. Obstacles are not only of a technical nature but also relate to domain-specific problems in Earth system research. There exist barriers to realising the full potential of data collections in light of novel cloud-based technologies, particularly in curating data tailored for specific application domains. These include transforming data to conform to a spatio-temporal grid with minimum distortions and managing complexities such as spatio-temporal autocorrelation issues. Addressing these challenges is pivotal for the effective application of Artificial Intelligence (AI) approaches. Furthermore, adhering to open science principles for data dissemination, reproducibility, visualisation, and reuse is crucial for fostering sustainable research. Overcoming these challenges offers a substantial opportunity to advance data-driven Earth system research, unlocking the full potential of an integrated, multidimensional view of Earth system processes. This is particularly true when such research is coupled with innovative research paradigms and technological progress.

Citations: 0

Mining Path Association Rules in Large Property Graphs (with Appendix)

arXiv - CS - Databases

Pub Date : 2024-08-04 DOI: arxiv-2408.02029

Yuya Sasaki, Panagiotis Karras

How can we mine frequent path regularities from a graph with edge labels and vertex attributes? The task of association rule mining successfully discovers regular patterns in item sets and substructures. Still, to our best knowledge, this concept has not yet been extended to path patterns in large property graphs. In this paper, we introduce the problem of path association rule mining (PARM). Applied to any reachability path between two vertices within a large graph, PARM discovers regular ways in which path patterns, identified by vertex attributes and edge labels, co-occur with each other. We develop an efficient and scalable algorithm PIONEER that exploits an anti-monotonicity property to effectively prune the search space. Further, we devise approximation techniques and employ parallelization to achieve scalable path association rule mining. Our experimental study using real-world graph data verifies the significance of path association rules and the efficiency of our solutions.
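To make the anti-monotonicity idea concrete, here is a minimal Python sketch of level-wise path-pattern mining. It is not PIONEER and does not follow the paper's definitions: as simplifying assumptions, a pattern is just a sequence of edge labels (vertex attributes are ignored), matches are walks rather than simple paths, and the support of a pattern is the number of source vertices that start a matching walk. The names `frequent_label_paths` and `confidence` are hypothetical. Under these assumptions, extending a pattern can never increase its support, so only frequent patterns need to be extended.

```python
from collections import defaultdict


def frequent_label_paths(edges, min_support, max_len):
    """Level-wise mining of edge-label sequences (illustrative only).

    edges: iterable of (source, label, target) triples.
    Returns {pattern: set of source vertices that start a matching walk}.
    """
    # Adjacency list: vertex -> list of (label, target).
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))

    # Level 1: pattern -> {source: set of end vertices of matching walks}.
    level = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        level[(label,)][src].add(dst)

    frequent = {}
    for _ in range(max_len):
        # Anti-monotone pruning: drop infrequent patterns before extending,
        # because no extension of them can reach min_support.
        level = {p: ends for p, ends in level.items() if len(ends) >= min_support}
        if not level:
            break
        frequent.update({p: set(ends) for p, ends in level.items()})

        # Extend every surviving pattern by one more edge label.
        nxt = defaultdict(lambda: defaultdict(set))
        for pattern, ends in level.items():
            for src, tails in ends.items():
                for v in tails:
                    for label, dst in out[v]:
                        nxt[pattern + (label,)][src].add(dst)
        level = nxt
    return frequent


def confidence(frequent, p, q):
    """Fraction of sources matching pattern p that also match pattern q."""
    return len(frequent[p] & frequent[q]) / len(frequent[p])


# Toy property graph (made-up data for illustration).
edges = [
    ("a", "follows", "b"), ("b", "likes", "x"),
    ("c", "follows", "d"), ("d", "likes", "y"),
    ("e", "follows", "f"), ("a", "likes", "x"),
]
patterns = frequent_label_paths(edges, min_support=2, max_len=2)
rule_conf = confidence(patterns, ("follows",), ("follows", "likes"))
```

In this toy graph, three vertices start a `follows` edge and two of them continue with a `likes` edge, so the rule "`follows` implies `follows`, `likes`" has confidence 2/3. Raising `min_support` to 3 prunes the two-label pattern without ever enumerating longer extensions of it.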


Citations: 0

Towards Tractability of the Diversity of Query Answers: Ultrametrics to the Rescue

arXiv - CS - Databases

Pub Date : 2024-08-03 DOI: arxiv-2408.01657

Marcelo Arenas, Timo Camillo Merkl, Reinhard Pichler, Cristian Riveros

The set of answers to a query may be very large, potentially overwhelming users when presented with the entire set. In such cases, presenting only a small subset of the answers to the user may be preferable. A natural requirement for this subset is that it should be as diverse as possible to reflect the variety of the entire population. To achieve this, the diversity of a subset is measured using a metric that determines how different two solutions are and a diversity function that extends this metric from pairs to sets. In the past, several studies have shown that finding a diverse subset from an explicitly given set is intractable even for simple metrics (like Hamming distance) and simple diversity functions (like summing all pairwise distances). This complexity barrier becomes even more challenging when trying to output a diverse subset from a set that is only implicitly given such as the query answers of a query and a database. Until now, tractable cases have been found only for restricted problems and particular diversity functions. To overcome these limitations, we focus on the notion of ultrametrics, which have been widely studied and used in many applications. Starting from any ultrametric $d$ and a diversity function $\delta$ extending $d$, we provide sufficient conditions over $\delta$ for having polynomial-time algorithms to construct diverse answers. To the best of our knowledge, these conditions are satisfied by all diversity functions considered in the literature. Moreover, we complement these results with lower bounds that show specific cases when these conditions are not satisfied and finding diverse subsets becomes intractable. We conclude by applying these results to the evaluation of conjunctive queries, demonstrating efficient algorithms for finding a diverse subset of solutions for acyclic conjunctive queries when the attribute order is used to measure diversity.
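The terms in this abstract can be illustrated with a small Python sketch. An ultrametric is a metric that satisfies the strong triangle inequality $d(x,z) \le \max(d(x,y), d(y,z))$. The function `prefix_distance` below is an assumed example of a distance driven by attribute order (two tuples are farther apart the earlier they first differ); it is not taken from the paper. The sum-of-pairwise-distances function is the simple diversity function the abstract mentions, and `most_diverse_subset` is a plain exhaustive search, not the polynomial-time algorithm the authors describe. All function names are hypothetical.

```python
from itertools import combinations, product


def prefix_distance(a, b):
    """Distance from the first attribute (in order) where two tuples differ.

    Tuples that differ at the first attribute are farthest apart;
    identical tuples are at distance 0.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return len(a) - i
    return 0


def hamming(a, b):
    """Number of attribute positions where two tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def is_ultrametric(points, d):
    """Check the strong triangle inequality on every triple of points."""
    return all(
        d(x, z) <= max(d(x, y), d(y, z))
        for x, y, z in product(points, repeat=3)
    )


def sum_diversity(subset, d):
    """Diversity of a set as the sum of all pairwise distances."""
    return sum(d(x, y) for x, y in combinations(subset, 2))


def most_diverse_subset(points, k, d):
    """Exhaustive search over all k-subsets (exponential; illustration only)."""
    return max(combinations(points, k), key=lambda s: sum_diversity(s, d))


# Made-up query answers with three attributes each.
answers = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)]
best = most_diverse_subset(answers, 3, prefix_distance)
```

On these four tuples, `prefix_distance` passes the ultrametric check while `hamming` does not: `(1, 1, 2)` and `(1, 2, 1)` are at Hamming distance 2, yet each is at distance 1 from `(1, 1, 1)`.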



Citations: 0

Controlling Dataflows with a Bolt-on Data Escrow

arXiv - CS - Databases

Pub Date : 2024-08-02 DOI: arxiv-2408.01580

Zhiru Zhu, Raul Castro Fernandez

The data-driven economy has created tremendous value in our society. Individuals share their data with platforms in exchange for services such as search, social networks, and health recommendations. Platforms use the data to provide those services and create other revenue-generating opportunities, e.g., selling the data to data brokers. With the ever-expanding data economy comes the growing concern about potential data misuse. While most platforms give individuals certain control over their data (i.e., what data is being shared), individuals do not know how the data will be used once shared; they cannot control the purpose. In this paper, we introduce a data escrow design that permits individuals to observe all dataflows - not just what is shared but for what purpose. Rather than data flowing to the platform, the platform delegates their computation to the escrow, where individuals can observe and manage their data. To make the data escrow practical, we design and implement a prototype that works alongside the Apple ecosystem; specifically, we retrofit the Apple SDKs with a programming interface to enable delegated computation. Our solution does not depend on Apple's software and can be applied to other platforms, but building for Apple lets us study the main hypothesis of our work: whether such a data escrow solution is a feasible alternative to today's data governance. We show that our escrow prototype implementation is efficient, and we analyze the dataflows in real-world apps and show that the escrow's programming interface supports implementing a wide range of dataflows.
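The delegated-computation idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the paper's prototype or its programming interface: the class `Escrow`, its methods, and the purpose strings are all hypothetical. The platform never receives the raw data; it hands a function and a declared purpose to the escrow, which records the dataflow and runs the function only if the individual has allowed that purpose for that data.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Dataflow:
    """One recorded request: who asked for which data, and why."""
    platform: str
    data_key: str
    purpose: str
    allowed: bool


@dataclass
class Escrow:
    """Hypothetical escrow holding an individual's data and policy."""
    _data: dict = field(default_factory=dict)
    _policy: set = field(default_factory=set)  # allowed (data_key, purpose) pairs
    log: list = field(default_factory=list)    # every request, allowed or not

    def store(self, key: str, value: Any) -> None:
        self._data[key] = value

    def allow(self, key: str, purpose: str) -> None:
        self._policy.add((key, purpose))

    def revoke(self, key: str, purpose: str) -> None:
        self._policy.discard((key, purpose))

    def run(self, platform: str, key: str, purpose: str,
            computation: Callable[[Any], Any]) -> Any:
        """Run the platform's computation inside the escrow, if permitted."""
        allowed = (key, purpose) in self._policy
        self.log.append(Dataflow(platform, key, purpose, allowed))
        if not allowed:
            raise PermissionError(f"{purpose!r} is not allowed for {key!r}")
        # Only the computation's result leaves the escrow.
        return computation(self._data[key])


# Made-up example: daily step counts shared for one purpose only.
escrow = Escrow()
escrow.store("steps", [4000, 8000, 6000])
escrow.allow("steps", "health_recommendation")
average = escrow.run("fitness_app", "steps", "health_recommendation",
                     lambda xs: sum(xs) / len(xs))
```

The sketch only checks the purpose a platform declares and logs each request so the individual can inspect it later. It does not verify that the submitted function matches the declared purpose, and nothing stops a function from returning the raw data, so a real design would need additional safeguards.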



Citations: 0