
arXiv - CS - Databases: Latest Papers

MEVDT: Multi-Modal Event-Based Vehicle Detection and Tracking Dataset
Pub Date : 2024-07-29 DOI: arxiv-2407.20446
Zaid A. El Shair, Samir A. Rawashdeh
In this data article, we introduce the Multi-Modal Event-based Vehicle Detection and Tracking (MEVDT) dataset. This dataset provides a synchronized stream of event data and grayscale images of traffic scenes, captured using the Dynamic and Active-Pixel Vision Sensor (DAVIS) 240c hybrid event-based camera. MEVDT comprises 63 multi-modal sequences with approximately 13k images, 5M events, 10k object labels, and 85 unique object tracking trajectories. Additionally, MEVDT includes manually annotated ground truth labels (object classifications, pixel-precise bounding boxes, and unique object IDs), which are provided at a labeling frequency of 24 Hz. Designed to advance research in event-based vision, MEVDT aims to address the critical need for high-quality, real-world annotated datasets that enable the development and evaluation of object detection and tracking algorithms in automotive environments.
Citations: 0
Shapley Value Computation in Ontology-Mediated Query Answering
Pub Date : 2024-07-29 DOI: arxiv-2407.20058
Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade
The Shapley value, originally introduced in cooperative game theory for wealth distribution, has found use in KR and databases for the purpose of assigning scores to formulas and database tuples based upon their contribution to obtaining a query result or inconsistency. In the present paper, we explore the use of Shapley values in ontology-mediated query answering (OMQA) and present a detailed complexity analysis of Shapley value computation (SVC) in the OMQA setting. In particular, we establish a PF/#P-hard dichotomy for SVC for ontology-mediated queries (T,q) composed of an ontology T formulated in the description logic $\mathcal{ELHI}_\bot$ and a connected constant-free homomorphism-closed query q. We further show that the #P-hardness side of the dichotomy can be strengthened to cover possibly disconnected queries with constants. Our results exploit recently discovered connections between SVC and probabilistic query evaluation and allow us to generalize existing results on probabilistic OMQA.
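To make the scoring idea in this abstract concrete, the following is a minimal sketch of Shapley value computation straight from the textbook definition, treating database facts as players and a Boolean query as the value function. It is a brute-force, exponential-time illustration, not the algorithm or the complexity analysis of the paper, and it involves no ontology. The fact names `r`, `s`, `t` and the toy query are assumptions made up for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential time)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that p joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal contribution of p to this coalition.
                total += weight * (value(coalition | {p}) - value(coalition))
        result[p] = total
    return result


def toy_query(facts):
    """Hypothetical Boolean query: holds iff r is present with s or t."""
    return 1 if "r" in facts and ("s" in facts or "t" in facts) else 0


scores = shapley_values(["r", "s", "t"], toy_query)
print(scores)
```

In this toy instance the fact `r` is required by every way of satisfying the query, so it receives the largest share, while `s` and `t` are interchangeable and split the remainder equally; the three scores sum to the value of the full database.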
Citations: 0
Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload
Pub Date : 2024-07-28 DOI: arxiv-2407.19517
Limin Ma, Ken Pu, Ying Zhu
This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We utilized 11 distinct large language models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We compared the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.
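The abstract does not list its structural complexity measures, so the sketch below only illustrates what a feature-based measure of this kind could look like: it counts a few syntactic features of a SQL string with regular expressions. The feature set, the function name `structural_features`, and the sample query are all assumptions for illustration, and keyword counting is a rough proxy that a real SQL parser would do more reliably.

```python
import re


def structural_features(sql: str) -> dict:
    """Count a few syntactic features of a SQL query (rough keyword proxy)."""
    # Blank out string literals so keywords inside them are not counted.
    text = re.sub(r"'[^']*'", "''", sql).upper()

    def count(pattern: str) -> int:
        return len(re.findall(pattern, text))

    selects = count(r"\bSELECT\b")
    return {
        "selects": selects,
        "nested_selects": max(selects - 1, 0),
        "joins": count(r"\bJOIN\b"),
        "group_by": count(r"\bGROUP\s+BY\b"),
        "set_ops": count(r"\b(?:UNION|INTERSECT|EXCEPT)\b"),
    }


sample = (
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.cid = c.id "
    "WHERE o.total > (SELECT AVG(total) FROM orders) "
    "GROUP BY c.name"
)
features = structural_features(sample)
print(features)
```

Feature vectors like this one could also serve as the basis for a fuzzy comparison between a generated query and a reference query, although the matching techniques actually used in the study are not described in the abstract.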
Citations: 0
Turning Multidimensional Big Data Analytics into Practice: Design and Implementation of ClustCube Big-Data Tools in Real-Life Scenarios
Pub Date : 2024-07-26 DOI: arxiv-2407.18604
Alfredo Cuzzocrea, Abderraouf Hafsaoui, Ismail Benlaredj
Multidimensional Big Data Analytics is an emerging area that marries the capabilities of OLAP with modern Big Data Analytics. Essentially, the idea is to engraft multidimensional models into Big Data analytics processes to increase the expressive power of the overall discovery task. ClustCube is a state-of-the-art model that combines OLAP and clustering, thus offering practical and well-understood advantages in the context of real-life applications and systems. In this paper, we show how ClustCube can effectively and efficiently realize tools for supporting Multidimensional Big Data Analytics, and we assess these tools in the context of real-life research projects.
Citations: 0
Towards A More Reasonable Semantic Web
Pub Date : 2024-07-26 DOI: arxiv-2407.19095
Vleer Doing, Ryan Wisnesky
We aim to accelerate the original vision of the semantic web by revisiting design decisions that have defined the semantic web up until now. We propose a shift in direction that more broadly embraces existing data infrastructure by reconsidering the semantic web's logical foundations. We argue for shifting attention away from description logic, which has so far underpinned the semantic web, to a different fragment of first-order logic. We argue, using examples from the (geo)spatial domain, that by doing so, the semantic web can be approached as a traditional data migration and integration problem at a massive scale. That way, a huge amount of existing tools and theories can be deployed to the semantic web's benefit, and the original vision of ontology as shared abstraction can be reinvigorated.
Citations: 0
Partial Adaptive Indexing for Approximate Query Answering
Pub Date : 2024-07-26 DOI: arxiv-2407.18702
Stavros Maroulis, Nikos Bikakis, Vassilis Stamatopoulos, George Papastefanatos
In data exploration, users need to analyze large data files quickly, aiming to minimize data-to-analysis time. While recent adaptive indexing approaches address this need, there are cases where they demonstrate poor performance, particularly during the initial queries, in regions with a high density of objects, and in very large files over commodity hardware. This work introduces an approach for adaptive indexing driven by both query workload and user-defined accuracy constraints to support approximate query answering. The approach is based on partial index adaptation, which reduces the costs associated with reading data files and refining indexes. We leverage a hierarchical tile-based indexing scheme and its stored metadata to provide efficient query evaluation, ensuring accuracy within user-specified bounds. Our preliminary evaluation demonstrates improvement in query evaluation time, especially during initial user exploration.
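One way to picture how stored tile metadata can answer a query approximately within a user-specified bound is the one-dimensional sketch below. It is an illustration under stated assumptions, not the indexing scheme of the paper: the `Tile` class, the flat (non-hierarchical) tile list, the COUNT-only query, and the relative-error test in `needs_refinement` are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    lo: float   # inclusive lower edge of the tile
    hi: float   # exclusive upper edge of the tile
    count: int  # stored metadata: number of objects inside the tile


def count_bounds(tiles, q_lo, q_hi):
    """Lower and upper bounds for COUNT over [q_lo, q_hi) from metadata only."""
    lower = upper = 0
    for t in tiles:
        if t.hi <= q_lo or t.lo >= q_hi:
            continue  # tile does not overlap the query range
        if q_lo <= t.lo and t.hi <= q_hi:
            lower += t.count  # fully covered: metadata is exact
            upper += t.count
        else:
            upper += t.count  # partially covered: anywhere from 0 to count
    return lower, upper


def needs_refinement(lower, upper, max_rel_error):
    """True if the midpoint estimate may exceed the allowed relative error."""
    if upper == 0:
        return False
    estimate = (lower + upper) / 2
    return (upper - lower) / 2 > max_rel_error * estimate


tiles = [Tile(0, 10, 5), Tile(10, 20, 8), Tile(20, 30, 3)]
bounds = count_bounds(tiles, 5, 20)
print(bounds, needs_refinement(*bounds, 0.1), needs_refinement(*bounds, 0.5))
```

Under these assumptions, only partially covered tiles introduce uncertainty, so a looser accuracy constraint lets the query be answered without reading the data file or splitting those tiles, which is the cost the abstract says partial adaptation reduces.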
Citations: 0
A survey of open-source data quality tools: shedding light on the materialization of data quality dimensions in practice
Pub Date : 2024-07-26 DOI: arxiv-2407.18649
Vasileios Papastergios, Anastasios Gounaris
Data Quality (DQ) describes the degree to which data characteristics meet requirements and are fit for use by humans and/or systems. There are several aspects in which DQ can be measured, called DQ dimensions (e.g., accuracy, completeness, consistency), also referred to as characteristics in the literature. The ISO/IEC 25012 standard defines a data quality model with fifteen such dimensions, setting the requirements a data product should meet. In this short report, we aim to bridge the gap between lower-level functionalities offered by DQ tools and higher-level dimensions in a systematic manner, revealing the many-to-many relationships between them. To this end, we examine 6 open-source DQ tools and emphasize providing a mapping between the functionalities they offer and the DQ dimensions, as defined by the ISO standard. Wherever applicable, we also provide insights into the software engineering details that tools leverage in order to address DQ challenges.
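As a rough illustration of the gap the report describes between low-level checks and high-level dimensions, the sketch below implements two simple checks in plain Python and records which dimensions each check could inform. The function names, the sample rows, and the `CHECK_TO_DIMENSIONS` table are assumptions made for this example; they are not taken from the surveyed tools or from the report's actual mapping.

```python
def completeness(rows, column):
    """Fraction of rows whose value for `column` is present (not None)."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def consistency(rows, rule):
    """Fraction of rows that satisfy a user-supplied cross-field rule."""
    if not rows:
        return 1.0
    return sum(bool(rule(r)) for r in rows) / len(rows)


# Hypothetical many-to-many mapping: one check can inform several
# dimensions, and one dimension can be informed by several checks.
CHECK_TO_DIMENSIONS = {
    "not_null": ["completeness"],
    "range_check": ["accuracy"],
    "cross_field_rule": ["consistency", "accuracy"],
}

rows = [
    {"id": 1, "start": 1, "end": 5},
    {"id": 2, "start": 4, "end": 2},
    {"id": 3, "start": None, "end": 9},
    {"id": 4, "start": 0, "end": 0},
]

start_completeness = completeness(rows, "start")
interval_consistency = consistency(
    rows, lambda r: r["start"] is None or r["start"] <= r["end"]
)
print(start_completeness, interval_consistency)
```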
Citations: 0
Enhanced Privacy Bound for Shuffle Model with Personalized Privacy
Pub Date : 2024-07-25 DOI: arxiv-2407.18157
Yixuan Liu, Yuhan Liu, Li Xiong, Yujie Gu, Hong Chen
The shuffle model of Differential Privacy (DP) is an enhanced privacy protocol which introduces an intermediate trusted server between local users and a central data curator. It significantly amplifies the central DP guarantee by anonymizing and shuffling the local randomized data. Yet, deriving a tight privacy bound is challenging due to its complicated randomization protocol. While most existing work focuses on unified local privacy settings, this work focuses on deriving the central privacy bound for a more practical setting where personalized local privacy is required by each user. To bound the privacy after shuffling, we first need to capture the probability of each user generating clones of the neighboring data points. Second, we need to quantify the indistinguishability between two distributions of the number of clones on neighboring datasets. Existing works either inaccurately capture the probability or underestimate the indistinguishability between neighboring datasets. Motivated by this, we develop a more precise analysis, which yields a general and tighter bound for arbitrary DP mechanisms. Firstly, we derive the clone-generating probability by hypothesis testing from a randomizer-specific perspective, which leads to a more accurate characterization of the probability. Secondly, we analyze the indistinguishability in the context of $f$-DP, where the convexity of the distributions is leveraged to achieve a tighter privacy bound. Theoretical and numerical results demonstrate that our bound remarkably outperforms the existing results in the literature.
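The protocol this abstract analyzes can be pictured with the minimal sketch below: each user randomizes a private bit locally with their own privacy parameter, and a shuffler then permutes the reports so the curator cannot link a report to its sender. Binary randomized response is used here as one standard local randomizer; it is an assumption of this example, and the code does not compute the privacy bound derived in the paper. The function names and the sample inputs are hypothetical.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Binary randomized response: report the true bit with prob e^eps/(e^eps+1)."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if rng.random() < p_truth else 1 - bit


def shuffle_reports(bits, epsilons, rng):
    """Randomize each bit with its user's own epsilon, then shuffle the reports."""
    reports = [randomized_response(b, e, rng) for b, e in zip(bits, epsilons)]
    rng.shuffle(reports)  # the shuffler removes the link between user and report
    return reports


rng = random.Random(0)
bits = [0, 1, 1, 0, 1]
epsilons = [0.5, 1.0, 2.0, 0.5, 1.5]  # personalized local privacy levels
reports = shuffle_reports(bits, epsilons, rng)
print(reports)
```

Users with a smaller epsilon flip their bit more often, so in a personalized setting the reports in the shuffled batch come from different distributions, which is what makes the central guarantee harder to bound than in the uniform case.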
Citations: 0
My Ontologist: Evaluating BFO-Based AI for Definition Support
Pub Date : 2024-07-24 DOI: arxiv-2407.17657
Carter Benson, Alec Sculley, Austin Liebers, John Beverley
Generative artificial intelligence (AI), exemplified by the release of GPT-3.5 in 2022, has significantly advanced the potential applications of large language models (LLMs), including in the realms of ontology development and knowledge graph creation. Ontologies, which are structured frameworks for organizing information, and knowledge graphs, which combine ontologies with actual data, are essential for enabling interoperability and automated reasoning. However, current research has largely overlooked the generation of ontologies extending from established upper-level frameworks like the Basic Formal Ontology (BFO), risking the creation of non-integrable ontology silos. This study explores the extent to which LLMs, particularly GPT-4, can support ontologists trained in BFO. Through iterative development of a specialized GPT model named "My Ontologist," we aimed to generate BFO-conformant ontologies. Initial versions faced challenges in maintaining definition conventions and leveraging foundational texts effectively. My Ontologist 3.0 showed promise by adhering to structured rules and modular ontology suites, yet the release of GPT-4o disrupted this progress by altering the model's behavior. Our findings underscore the importance of aligning LLM-generated ontologies with top-level standards and highlight the complexities of integrating evolving AI capabilities in ontology engineering.
Citations: 0
Dynamic Subgraph Matching via Cost-Model-based Vertex Dominance Embeddings (Technical Report)
Pub Date : 2024-07-23 DOI: arxiv-2407.16660
Yutong Ye, Xiang Lian, Nan Zhang, Mingsong Chen
In many real-world applications, such as social network analysis, knowledge graph discovery, and biological network analytics, graph data management has become increasingly important and has drawn much attention from the database community. Because many graphs (e.g., Twitter, Wikipedia) evolve over time, it is important to study the dynamic subgraph matching (DSM) problem, a fundamental yet challenging graph operator that continuously monitors subgraph matching results over dynamic graphs with a stream of edge updates. To tackle the DSM problem efficiently, we design a novel vertex dominance embedding approach, which effectively encodes vertex labels and can be incrementally maintained upon graph updates. Motivated by the low pruning power for high-degree vertices, we propose a new degree grouping technique over basic subgraph patterns in different degree groups (i.e., groups of star substructures), and devise degree-aware star substructure synopses (DAS^3) to facilitate our vertex dominance and range pruning strategies. We develop efficient algorithms to incrementally maintain dynamic graphs and answer DSM queries. Through extensive experiments, we confirm the efficiency of the proposed approaches over both real and synthetic graphs.
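To make the idea of dominance-based pruning concrete, here is a minimal Python sketch. It is not the embedding or the DAS^3 synopsis from the paper, whose details the abstract does not give. It swaps in a much simpler, well-known filter: each vertex keeps a count of its neighbors per label (a star-shaped profile), and a data vertex stays a candidate for a query vertex only if its profile dominates the query vertex's profile label by label. The class `LabeledGraph` and the function `dominates` are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict


class LabeledGraph:
    """Undirected, vertex-labeled graph with per-vertex neighbor-label counts.

    Illustrative assumption: the "embedding" of a vertex is simply a Counter
    of its neighbors' labels. This is NOT the paper's embedding.
    """

    def __init__(self):
        self.label = {}                      # vertex -> label
        self.adj = defaultdict(set)          # vertex -> set of neighbors
        self.profile = defaultdict(Counter)  # vertex -> neighbor-label counts

    def add_vertex(self, v, label):
        self.label[v] = label

    def add_edge(self, u, v):
        # An edge insertion touches only the two endpoint profiles.
        if u == v or v in self.adj[u]:
            return
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.profile[u][self.label[v]] += 1
        self.profile[v][self.label[u]] += 1

    def remove_edge(self, u, v):
        # An edge deletion is the mirror image of an insertion.
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self.profile[u][self.label[v]] -= 1
        self.profile[v][self.label[u]] -= 1


def dominates(data, v, query, q):
    """Return True if data vertex v may still match query vertex q.

    A subgraph match maps the neighbors of q to distinct neighbors of v
    with the same labels, so v needs at least as many neighbors of each
    label as q has. Failing this test safely prunes v.
    """
    if data.label[v] != query.label[q]:
        return False
    return all(data.profile[v][lab] >= cnt
               for lab, cnt in query.profile[q].items())


# Query: a star with center q0 (label A) and leaves labeled B and C.
query = LabeledGraph()
for name, lab in [("q0", "A"), ("q1", "B"), ("q2", "C")]:
    query.add_vertex(name, lab)
query.add_edge("q0", "q1")
query.add_edge("q0", "q2")

# Data graph receiving a stream of edge insertions.
data = LabeledGraph()
for name, lab in [(1, "A"), (2, "B"), (3, "C")]:
    data.add_vertex(name, lab)
data.add_edge(1, 2)
before = dominates(data, 1, query, "q0")  # vertex 1 has no C-labeled neighbor yet
data.add_edge(1, 3)
after = dominates(data, 1, query, "q0")   # vertex 1 now has both a B and a C neighbor
```

The filter is only a necessary condition: vertices that pass it still have to be verified by an actual matching step, which this sketch omits.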
Citations: 0