
Latest publications in arXiv - CS - Databases

Process Trace Querying using Knowledge Graphs and Notation3
Pub Date : 2024-08-26 DOI: arxiv-2409.04452
William Van Woensel
In process mining, a log exploration step allows making sense of the event traces; e.g., identifying event patterns and illogical traces, and gaining insight into their variability. To support expressive log exploration, the event log can be converted into a Knowledge Graph (KG), which can then be queried using general-purpose languages. We explore the creation of a semantic KG using the Resource Description Framework (RDF) as a data model, combined with the general-purpose Notation3 (N3) rule language for querying. We show how typical trace querying constraints, inspired by the state of the art, can be implemented in N3. We convert case- and object-centric event logs into a trace-based semantic KG; OCEL2 logs are hereby "flattened" into traces based on object paths through the KG. This solution offers (a) expressivity, as queries can instantiate constraints in multiple ways and arbitrarily constrain attributes and relations (e.g., actors, resources); (b) flexibility, as OCEL2 event logs can be serialized as traces in arbitrary ways based on the KG; and (c) extensibility, as others can extend our library by leveraging the same implementation patterns.
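For a concrete feel of this style of querying, here is a minimal sketch (not from the paper) that builds a toy trace KG with the rdflib Python library and checks a directly-follows constraint. The paper expresses such constraints as N3 rules; a SPARQL query stands in here, and all names (the ex: namespace, the activity/next predicates, the toy events) are invented for illustration.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/log#")
g = Graph()
# Toy trace: register -> check -> approve, linked by a "next" relation.
g.add((EX.e1, EX.activity, Literal("register")))
g.add((EX.e2, EX.activity, Literal("check")))
g.add((EX.e3, EX.activity, Literal("approve")))
g.add((EX.e1, EX.next, EX.e2))
g.add((EX.e2, EX.next, EX.e3))

# Directly-follows constraint: "check" occurs immediately after "register".
q = """
PREFIX ex: <http://example.org/log#>
SELECT ?a ?b WHERE {
  ?a ex:activity "register" .
  ?a ex:next ?b .
  ?b ex:activity "check" .
}"""
for row in g.query(q):
    print(row.a, row.b)  # prints the matching event pair
```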
Citations: 0
Multi-variable Quantification of BDDs in External Memory using Nested Sweeping (Extended Paper)
Pub Date : 2024-08-26 DOI: arxiv-2408.14216
Steffan Christ Sølvsten, Jaco van de Pol
Previous research on the Adiar BDD package has been successful at designing algorithms capable of handling large Binary Decision Diagrams (BDDs) stored in external memory. To do so, it uses consecutive sweeps through the BDDs to resolve computations. Yet, this approach has kept algorithms for multi-variable quantification, the relational product, and variable reordering out of its scope. In this work, we address this by introducing the nested sweeping framework. Here, multiple concurrent sweeps pass information between each other to compute the result. We have implemented the framework in Adiar and used it to create a new external memory multi-variable quantification algorithm. Compared to conventional depth-first implementations, Adiar with nested sweeping is able to solve more instances of our benchmarks and/or solve them faster.
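The contrast with depth-first recursion can be sketched in a few lines: external-memory BDD algorithms in the Adiar style visit nodes level by level, forwarding partial results through a priority queue so that accesses stay sequential rather than random. The toy sweep below (a hypothetical illustration with a hard-coded four-node BDD, not Adiar's actual code) counts satisfying paths with a single top-down, level-ordered pass.

```python
import heapq

# Toy BDD: node -> (level, low_child, high_child); leaves are True/False.
bdd = {
    "n1": (0, "n2", "n3"),
    "n2": (1, False, "n4"),
    "n3": (1, "n4", True),
    "n4": (2, False, True),
}

def count_sat_paths(root):
    """One top-down sweep in level order: partial results are forwarded
    to children via a priority queue instead of depth-first recursion."""
    pq = [(bdd[root][0], root)]
    incoming = {root: 1}   # number of paths reaching each node so far
    total, seen = 0, set()
    while pq:
        level, node = heapq.heappop(pq)
        if node in seen:   # duplicate queue entries are harmless
            continue
        seen.add(node)
        paths = incoming.pop(node)
        for child in bdd[node][1:]:        # (low, high) children
            if child is True:
                total += paths
            elif child is not False:       # internal node: forward and enqueue
                incoming[child] = incoming.get(child, 0) + paths
                heapq.heappush(pq, (bdd[child][0], child))
    return total

print(count_sat_paths("n1"))  # 3 paths reach the True leaf
```

Because parents in a BDD always sit on strictly smaller levels than their children, the level-ordered pop guarantees every node's incoming count is complete before it is processed.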
Citations: 0
$\boldsymbol{Steiner}$-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes
Pub Date : 2024-08-25 DOI: arxiv-2408.13899
Zeyu Wang, Qitong Wang, Xiaoxing Cheng, Peng Wang, Themis Palpanas, Wei Wang
Graph-based indexes have been widely employed to accelerate approximate similarity search of high-dimensional vectors. However, the performance of graph indexes in answering different queries varies vastly, leading to an unstable quality of service for downstream applications. This necessitates an effective measure to test query hardness on graph indexes. Nonetheless, popular distance-based hardness measures like LID lose their effectiveness because they ignore the graph structure. In this paper, we propose $Steiner$-hardness, a novel connection-based, graph-native query hardness measure. Specifically, we first propose a theoretical framework to analyze the minimum query effort on graph indexes and then define $Steiner$-hardness as the minimum effort on a representative graph. Moreover, we prove that our $Steiner$-hardness is highly relevant to the classical Directed $Steiner$ Tree (DST) problem. In this case, we design a novel algorithm to reduce our problem to DST problems and then leverage their solvers to help calculate $Steiner$-hardness efficiently. Compared with LID and other similar measures, $Steiner$-hardness shows a significantly better correlation with the actual query effort on various datasets. Additionally, an unbiased evaluation designed based on $Steiner$-hardness reveals new ranking results, indicating a meaningful direction for enhancing the robustness of graph indexes. This paper is accepted by PVLDB 2025.
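As a rough illustration of the connection to Steiner trees: if query effort is the set of graph nodes a search must traverse to connect an entry point with all true answers, a minimal Steiner tree over those terminals lower-bounds it. The sketch below approximates this on a toy undirected graph with networkx; note the paper's actual measure is defined on a representative graph and uses Directed Steiner Tree solvers, so this is only a flavor, and the graph and neighbor sets are made up.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Toy proximity graph of an ANN index (undirected here for simplicity;
# the paper works with *directed* Steiner trees on a representative graph).
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 3), (1, 5)])

entry_point = 0
true_neighbors = [3, 5]            # hypothetical ground-truth kNN of a query
terminals = [entry_point] + true_neighbors

# Hardness proxy: nodes any search must visit to connect the entry point
# to all answers -- an (approximate) Steiner tree over the terminals.
tree = steiner_tree(G, terminals)
print("hardness proxy:", tree.number_of_nodes())
```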
Citations: 0
Towards a Converged Relational-Graph Optimization Framework
Pub Date : 2024-08-24 DOI: arxiv-2408.13480
Yunkai Lou, Longbin Lai, Bingqing Lyu, Yufan Yang, Xiaoli Zhou, Wenyuan Yu, Ying Zhang, Jingren Zhou
The recent ISO SQL:2023 standard adopts SQL/PGQ (Property Graph Queries), facilitating graph-like querying within relational databases. This advancement, however, underscores a significant gap in how to effectively optimize SQL/PGQ queries within relational database systems. To address this gap, we extend the foundational SPJ (Select-Project-Join) queries to SPJM queries, which include an additional matching operator for representing graph pattern matching in SQL/PGQ. Although SPJM queries can be converted to SPJ queries and optimized using existing relational query optimizers, our analysis shows that such a graph-agnostic method fails to benefit from graph-specific optimization techniques found in the literature. To address this issue, we develop a converged relational-graph optimization framework called RelGo for optimizing SPJM queries, leveraging joint efforts from both relational and graph query optimizations. Using DuckDB as the underlying relational execution engine, our experiments show that RelGo can generate efficient execution plans for SPJM queries. On well-established benchmarks, these plans exhibit an average speedup of 21.90$\times$ compared to those produced by the graph-agnostic optimizer.
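To see what the graph-agnostic lowering looks like, the sketch below (illustrative only; the table and data are made up) expresses a two-hop SQL/PGQ-style pattern (a)-[:knows]->(b)-[:knows]->(c) as plain SPJ self-joins in DuckDB. This is the kind of plan a purely relational optimizer would produce and that RelGo's dedicated matching operator aims to outperform.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE knows (src INT, dst INT)")
con.execute("INSERT INTO knows VALUES (1, 2), (2, 3), (1, 3), (3, 4)")

# Two-hop graph pattern lowered to relational self-joins over the edge table.
rows = con.execute("""
    SELECT k1.src AS a, k1.dst AS b, k2.dst AS c
    FROM knows k1
    JOIN knows k2 ON k1.dst = k2.src
""").fetchall()
print(rows)  # each tuple is one match of the pattern
```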
Citations: 0
Targeted Least Cardinality Candidate Key for Relational Databases
Pub Date : 2024-08-24 DOI: arxiv-2408.13540
Vasileios Nakos, Hung Q. Ngo, Charalampos E. Tsourakakis
Functional dependencies (FDs) are a central theme in databases, playing a major role in the design of database schemas and the optimization of queries. In this work, we introduce the {\it targeted least cardinality candidate key problem} (TCAND). This problem is defined over a set of functional dependencies $F$ and a target variable set $T \subseteq V$, and it aims to find the smallest set $X \subseteq V$ such that the FD $X \to T$ can be derived from $F$. The TCAND problem generalizes the well-known NP-hard problem of finding the least cardinality candidate key~\cite{lucchesi1978candidate}, which has been previously demonstrated to be at least as difficult as the set cover problem. We present an integer programming (IP) formulation for the TCAND problem, analogous to a layered set cover problem. We analyze its linear programming (LP) relaxation from two perspectives: we propose two approximation algorithms and investigate the integrality gap. Our findings indicate that the approximation upper bounds for our algorithms are not significantly improvable through LP rounding, a notable distinction from the standard set cover problem. Additionally, we discover that a generalization of the TCAND problem is equivalent to a variant of the set cover problem, named red-blue set cover~\cite{carr1999red}, which cannot be approximated within a sub-polynomial factor in polynomial time under plausible conjectures~\cite{chlamtavc2023approximating}. Despite the extensive history surrounding the issue of identifying the least cardinality candidate key, our research contributes new theoretical insights, novel algorithms, and demonstrates that the general TCAND problem poses complexities beyond those encountered in the set cover problem.
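The feasibility side of TCAND is the textbook attribute-closure test: a candidate $X$ works iff $T \subseteq X^+$ under $F$. The hard part, which the paper's IP formulation tackles, is searching for the smallest such $X$; the check itself is a few lines (a generic sketch of the standard closure algorithm, not the paper's code):

```python
def closure(attrs, fds):
    """Attribute closure X+ under FDs; each FD is a (lhs_set, rhs_set) pair."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire any FD whose left side is already in the closure.
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

# F = {A -> B, B -> C}; does X = {A} determine T = {C}?
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print({"C"} <= closure({"A"}, fds))  # True: {A} is a feasible answer for T = {C}
```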
Citations: 0
GNN: Graph Neural Network and Large Language Model Based for Data Discovery
Pub Date : 2024-08-24 DOI: arxiv-2408.13609
Thomas Hoang
Our algorithm GNN: Graph Neural Network and Large Language Model Based for Data Discovery inherits the benefits of \cite{hoang2024plod} (PLOD: Predictive Learning Optimal Data Discovery) and \cite{Hoang2024BODBO} (BOD: Blindly Optimal Data Discovery) in overcoming the challenges of having to predefine a utility function and rely on human input for attribute ranking, which helps avoid the time-consuming loop process. Beyond these previous works, our algorithm GNN leverages the advantages of graph neural networks and large language models to understand text-type values that cannot be understood by PLOD and BOD, thus making the task of predicting outcomes more reliable. GNN can be seen as an extension of PLOD that understands text-type values and the user's preferences based not only on numerical values but also on text values, delivering on the promise of data science and analytics.
Citations: 0
Cost-Aware Uncertainty Reduction in Schema Matching with GPT-4: The Prompt-Matcher Framework
Pub Date : 2024-08-24 DOI: arxiv-2408.14507
Longyu Feng, Huahang Li, Chen Jason Zhang
Schema matching is the process of identifying correspondences between the elements of two given schemata; it is essential for database management systems, data integration, and data warehousing. The inherent uncertainty of current schema matching algorithms leads to the generation of a set of candidate matches. Storing these results necessitates databases and systems capable of handling probabilistic queries, which complicates the querying process and increases the associated storage costs. Motivated by GPT-4's outstanding performance, we explore its potential to reduce this uncertainty. Our proposal is to supplant crowdworkers with GPT-4 for querying the set of candidate matches. To elicit more precise correspondence-verification responses from GPT-4, we have crafted Semantic-match and Abbreviation-match prompts, achieving state-of-the-art recall on two benchmark datasets: DeepMDatasets 100% (+0.0) and Fabricated-Datasets 91.8% (+2.2). To optimise budget utilisation, we have devised a cost-aware solution: within the constraints of the budget, it delivers favourable outcomes with minimal time expenditure. We introduce a novel framework, Prompt-Matcher, to reduce the uncertainty in integrating multiple automatic schema matching algorithms and selecting complex parameterizations. It assists users in diminishing the uncertainty associated with candidate schema match results and in optimally ranking the most promising matches. We formally define the Correspondence Selection Problem (CSP), aiming to optimise the revenue within the confines of the GPT-4 budget. We demonstrate that CSP is NP-hard and propose an approximation algorithm with minimal time expenditure. Ultimately, we demonstrate the efficacy of Prompt-Matcher through rigorous experiments.
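A minimal sketch of the crowdworker-replacement step, using the OpenAI Python client: one call verifies one candidate correspondence. The prompt wording here is a guess at the spirit of the Semantic-match prompt, not the paper's actual text, and the attribute names are invented.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_match(attr_a: str, attr_b: str) -> bool:
    """Ask GPT-4 whether two schema attributes correspond (yes/no),
    standing in for a crowdworker vote on one candidate match."""
    prompt = (
        "Do these two database attributes refer to the same real-world "
        f"concept? Answer yes or no.\nA: {attr_a}\nB: {attr_b}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(verify_match("cust_addr", "customer.address"))
```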
Citations: 0
BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems
Pub Date : 2024-08-22 DOI: arxiv-2408.12414
Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, Lars Hoemke
Performance testing in large-scale database systems like SAP HANA is a crucial yet labor-intensive task, involving extensive manual analysis of thousands of measurements, such as CPU time and elapsed time. Manual maintenance of these metrics is time-consuming and susceptible to human error, making early detection of performance regressions challenging. We address these issues by proposing an automated approach to detect performance regressions in such measurements. Our approach integrates Bayesian inference with the Pruned Exact Linear Time (PELT) algorithm, enhancing the detection of change points and performance regressions with higher precision and efficiency than previous approaches. Our method minimizes false negatives and ensures the reliability and performance quality of SAP HANA. The proposed solution can accelerate testing and contribute to more sustainable performance management practices in large-scale data management environments.
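The PELT half of such a pipeline is available off the shelf, e.g. in the ruptures Python library; the sketch below detects a change point in a synthetic elapsed-time series with an injected regression. The Bayesian screening that BIPeC layers on top is not reproduced here, and the data and penalty value are illustrative.

```python
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Synthetic "elapsed time" series with a performance regression after index 200.
signal = np.concatenate([rng.normal(1.0, 0.05, 200),
                         rng.normal(1.3, 0.05, 100)])

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=5)  # the penalty controls sensitivity
print(breakpoints)  # e.g. [200, 300]; the last entry is the series length
```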
Citations: 0
SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging
Pub Date : 2024-08-22 DOI: arxiv-2408.12733
Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress, primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
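One plausible reading of the gate-initialization idea, sketched in PyTorch with entirely invented names and a stand-in embedding table: seed each expert's routing direction with the mean embedding of that dialect's keywords, so inputs containing dialect-specific tokens tend to route to the matching expert from the start. The paper's self-attention merging is not reproduced here.

```python
import torch
import torch.nn as nn

hidden = 64
dialect_keywords = {            # one expert per dialect (illustrative lists)
    "sqlite":   ["AUTOINCREMENT", "GLOB"],
    "bigquery": ["STRUCT", "UNNEST"],
    "postgres": ["ILIKE", "SERIAL"],
}

embed = nn.Embedding(1000, hidden)      # stand-in for real token embeddings

def keyword_vector(words):              # mean embedding of a keyword list
    ids = torch.tensor([hash(w) % 1000 for w in words])  # toy tokenization
    return embed(ids).mean(dim=0)

gate = nn.Linear(hidden, len(dialect_keywords), bias=False)
with torch.no_grad():                   # bias routing toward each dialect expert
    for i, words in enumerate(dialect_keywords.values()):
        gate.weight[i] = keyword_vector(words)

scores = gate(torch.randn(hidden))      # router logits for one hidden state
print(scores.softmax(dim=-1))           # initial expert routing distribution
```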
Citations: 0
Unlocking Sustainability Compliance: Characterizing the EU Taxonomy for Business Process Management
Pub Date : 2024-08-21 DOI: arxiv-2408.11386
Finn Klessascheck, Stephan A. Fahrenkrog-Petersen, Jan Mendling, Luise Pufahl
To promote sustainable business practices and achieve climate neutrality by 2050, the EU has developed its taxonomy of sustainable activities, which describes when exactly business practices can be considered sustainable. While the taxonomy has only recently been established, progressively more companies will have to report how much of their revenue was created via sustainably executed business processes. To help companies prepare to assess whether their business processes comply with the constraints outlined in the taxonomy, we investigate to what extent these criteria can be used for conformance checking, that is, assessing in a data-driven manner whether business process executions adhere to regulatory constraints. For this, we develop a few-shot learning pipeline to characterize the constraints of the taxonomy, with the help of an LLM, in terms of the process dimensions they relate to. We find that many constraints of the taxonomy are usable for conformance checking, particularly in the sectors of energy, manufacturing, and transport. This will aid companies in preparing to monitor regulatory compliance with the taxonomy automatically, by characterizing what kind of information they need to extract, and by providing a better understanding of the sectors where such an assessment is feasible and where it is not.
Citations: 0