In process mining, a log exploration step allows making sense of the event traces; e.g., identifying event patterns and illogical traces, and gaining insight into their variability. To support expressive log exploration, the event log can be converted into a Knowledge Graph (KG), which can then be queried using general-purpose languages. We explore the creation of a semantic KG using the Resource Description Framework (RDF) as a data model, combined with the general-purpose Notation3 (N3) rule language for querying. We show how typical trace querying constraints, inspired by the state of the art, can be implemented in N3. We convert case- and object-centric event logs into a trace-based semantic KG; OCEL2 logs are hereby "flattened" into traces based on object paths through the KG. This solution offers (a) expressivity, as queries can instantiate constraints in multiple ways and arbitrarily constrain attributes and relations (e.g., actors, resources); (b) flexibility, as OCEL2 event logs can be serialized as traces in arbitrary ways based on the KG; and (c) extensibility, as others can extend our library by leveraging the same implementation patterns.
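To make the kind of trace constraint involved concrete, here is a minimal Python sketch (not the paper's N3 implementation) of a "directly follows" query over a toy triple-based event KG; the predicate names (`case`, `activity`, `timestamp`) are illustrative assumptions:

```python
def directly_follows(triples, first, second):
    """Return case ids where activity `first` is immediately followed by `second`."""
    # index attribute triples once
    activity = {s: o for s, p, o in triples if p == "activity"}
    ts = {s: o for s, p, o in triples if p == "timestamp"}
    # group events per case
    events = {}  # case -> list of (timestamp, activity)
    for s, p, o in triples:
        if p == "case":
            events.setdefault(o, []).append((ts[s], activity[s]))
    hits = []
    for case, evs in events.items():
        evs.sort()  # order each trace by timestamp
        acts = [a for _, a in evs]
        if any(a == first and b == second for a, b in zip(acts, acts[1:])):
            hits.append(case)
    return sorted(hits)

log = [
    ("e1", "case", "c1"), ("e1", "activity", "register"), ("e1", "timestamp", 1),
    ("e2", "case", "c1"), ("e2", "activity", "approve"),  ("e2", "timestamp", 2),
    ("e3", "case", "c2"), ("e3", "activity", "register"), ("e3", "timestamp", 1),
    ("e4", "case", "c2"), ("e4", "activity", "reject"),   ("e4", "timestamp", 2),
]
print(directly_follows(log, "register", "approve"))  # ['c1']
```

In the paper's setting the same pattern would be expressed declaratively as an N3 rule over RDF triples rather than imperatively as above.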
{"title":"Process Trace Querying using Knowledge Graphs and Notation3","authors":"William Van Woensel","doi":"arxiv-2409.04452","DOIUrl":"https://doi.org/arxiv-2409.04452","url":null,"abstract":"In process mining, a log exploration step allows making sense of the event\u0000traces; e.g., identifying event patterns and illogical traces, and gaining\u0000insight into their variability. To support expressive log exploration, the\u0000event log can be converted into a Knowledge Graph (KG), which can then be\u0000queried using general-purpose languages. We explore the creation of semantic KG\u0000using the Resource Description Framework (RDF) as a data model, combined with\u0000the general-purpose Notation3 (N3) rule language for querying. We show how\u0000typical trace querying constraints, inspired by the state of the art, can be\u0000implemented in N3. We convert case- and object-centric event logs into a\u0000trace-based semantic KG; OCEL2 logs are hereby \"flattened\" into traces based on\u0000object paths through the KG. 
This solution offers (a) expressivity, as queries\u0000can instantiate constraints in multiple ways and arbitrarily constrain\u0000attributes and relations (e.g., actors, resources); (b) flexibility, as OCEL2\u0000event logs can be serialized as traces in arbitrary ways based on the KG; and\u0000(c) extensibility, as others can extend our library by leveraging the same\u0000implementation patterns.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Previous research on the Adiar BDD package has been successful at designing algorithms capable of handling large Binary Decision Diagrams (BDDs) stored in external memory. To do so, it uses consecutive sweeps through the BDDs to resolve computations. Yet, this approach has kept algorithms for multi-variable quantification, the relational product, and variable reordering out of its scope. In this work, we address this by introducing the nested sweeping framework. Here, multiple concurrent sweeps pass information between each other to compute the result. We have implemented the framework in Adiar and used it to create a new external memory multi-variable quantification algorithm. Compared to conventional depth-first implementations, Adiar with nested sweeping is able to solve more instances of our benchmarks and/or solve them faster.
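The level-ordered, priority-queue-driven processing behind such sweeps can be sketched as follows. This is a simplified in-memory illustration of a single top-down product sweep for conjunction (not Adiar's external-memory implementation, and omitting the bottom-up reduction sweep); resolving node-pair requests in ascending variable-level order is what lets the real implementation read and write its files sequentially:

```python
import heapq
from itertools import count

TRUE, FALSE = "T", "F"  # terminal markers

def apply_and(nodes_f, nodes_g, root_f, root_g):
    """Count the product nodes created by one top-down sweep that resolves
    (u, v) requests in ascending variable-level order."""
    def level(nodes, u):
        return float("inf") if u in (TRUE, FALSE) else nodes[u][0]

    tie = count()  # tie-breaker so heapq never compares node ids
    def push(pq, u, v):
        heapq.heappush(pq, (min(level(nodes_f, u), level(nodes_g, v)), next(tie), u, v))

    pq, seen = [], set()
    push(pq, root_f, root_g)
    while pq:
        _, _, u, v = heapq.heappop(pq)
        if (u, v) in seen or FALSE in (u, v) or (u, v) == (TRUE, TRUE):
            continue  # already resolved, or a terminal of the conjunction
        seen.add((u, v))
        lvl = min(level(nodes_f, u), level(nodes_g, v))
        # expand whichever side is at the current level; the other stays put
        lo_u, hi_u = nodes_f[u][1:] if level(nodes_f, u) == lvl else (u, u)
        lo_v, hi_v = nodes_g[v][1:] if level(nodes_g, v) == lvl else (v, v)
        push(pq, lo_u, lo_v)
        push(pq, hi_u, hi_v)
    return len(seen)

# nodes: id -> (level, low_child, high_child);  f = x0 AND x1,  g = x0
f = {"f0": (0, FALSE, "f1"), "f1": (1, FALSE, TRUE)}
g = {"g0": (0, FALSE, TRUE)}
print(apply_and(f, g, "f0", "g0"))  # 2 product nodes
```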
{"title":"Multi-variable Quantification of BDDs in External Memory using Nested Sweeping (Extended Paper)","authors":"Steffan Christ Sølvsten, Jaco van de Pol","doi":"arxiv-2408.14216","DOIUrl":"https://doi.org/arxiv-2408.14216","url":null,"abstract":"Previous research on the Adiar BDD package has been successful at designing\u0000algorithms capable of handling large Binary Decision Diagrams (BDDs) stored in\u0000external memory. To do so, it uses consecutive sweeps through the BDDs to\u0000resolve computations. Yet, this approach has kept algorithms for multi-variable\u0000quantification, the relational product, and variable reordering out of its\u0000scope. In this work, we address this by introducing the nested sweeping framework.\u0000Here, multiple concurrent sweeps pass information between eachother to compute\u0000the result. We have implemented the framework in Adiar and used it to create a\u0000new external memory multi-variable quantification algorithm. Compared to\u0000conventional depth-first implementations, Adiar with nested sweeping is able to\u0000solve more instances of our benchmarks and/or solve them faster.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based indexes have been widely employed to accelerate approximate similarity search of high-dimensional vectors. However, the performance of graph indexes in answering different queries varies vastly, leading to an unstable quality of service for downstream applications. This necessitates an effective measure of query hardness on graph indexes. Nonetheless, popular distance-based hardness measures like LID lose their effectiveness because they ignore the graph structure. In this paper, we propose Steiner-hardness, a novel connection-based, graph-native query hardness measure. Specifically, we first propose a theoretical framework to analyze the minimum query effort on graph indexes and then define Steiner-hardness as the minimum effort on a representative graph. Moreover, we prove that Steiner-hardness is closely related to the classical Directed Steiner Tree (DST) problem. Accordingly, we design a novel algorithm that reduces our problem to DST instances and then leverages DST solvers to calculate Steiner-hardness efficiently. Compared with LID and other similar measures, Steiner-hardness shows a significantly better correlation with the actual query effort on various datasets. Additionally, an unbiased evaluation designed based on Steiner-hardness reveals new ranking results, indicating a meaningful direction for enhancing the robustness of graph indexes. This paper is accepted by PVLDB 2025.
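To make "query effort on a graph index" concrete, here is a minimal beam-search sketch that counts distance computations on a toy proximity graph. The count is one crude proxy for per-query effort, which the paper instead lower-bounds via its Directed Steiner Tree formulation; the graph and parameters here are illustrative assumptions:

```python
import heapq, math

def search_effort(graph, coords, query, entry, k=1, ef=3):
    """Greedy beam search over a proximity graph (HNSW/NSG-style).
    Returns (top-k ids, number of distance computations performed)."""
    dist = lambda u: math.dist(coords[u], query)
    computed = {entry: dist(entry)}       # every distance we ever compute
    candidates = [(computed[entry], entry)]   # min-heap: expansion frontier
    results = [(-computed[entry], entry)]     # max-heap: best `ef` so far
    visited = {entry}
    while candidates:
        d, u = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # frontier can no longer improve the result set
        for v in graph[u]:
            if v in visited:
                continue
            visited.add(v)
            computed[v] = dist(v)
            heapq.heappush(candidates, (computed[v], v))
            heapq.heappush(results, (-computed[v], v))
            if len(results) > ef:
                heapq.heappop(results)  # drop the current worst
    top = sorted((-d, v) for d, v in results)[:k]
    return [v for _, v in top], len(computed)

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
coords = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
ids, effort = search_effort(graph, coords, (3, 0), "a")
print(ids, effort)  # ['d'] 4
```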
{"title":"$boldsymbol{Steiner}$-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes","authors":"Zeyu Wang, Qitong Wang, Xiaoxing Cheng, Peng Wang, Themis Palpanas, Wei Wang","doi":"arxiv-2408.13899","DOIUrl":"https://doi.org/arxiv-2408.13899","url":null,"abstract":"Graph-based indexes have been widely employed to accelerate approximate\u0000similarity search of high-dimensional vectors. However, the performance of\u0000graph indexes to answer different queries varies vastly, leading to an unstable\u0000quality of service for downstream applications. This necessitates an effective\u0000measure to test query hardness on graph indexes. Nonetheless, popular\u0000distance-based hardness measures like LID lose their effects due to the\u0000ignorance of the graph structure. In this paper, we propose $Steiner$-hardness,\u0000a novel connection-based graph-native query hardness measure. Specifically, we\u0000first propose a theoretical framework to analyze the minimum query effort on\u0000graph indexes and then define $Steiner$-hardness as the minimum effort on a\u0000representative graph. Moreover, we prove that our $Steiner$-hardness is highly\u0000relevant to the classical Directed $Steiner$ Tree (DST) problems. In this case,\u0000we design a novel algorithm to reduce our problem to DST problems and then\u0000leverage their solvers to help calculate $Steiner$-hardness efficiently.\u0000Compared with LID and other similar measures, $Steiner$-hardness shows a\u0000significantly better correlation with the actual query effort on various\u0000datasets. Additionally, an unbiased evaluation designed based on\u0000$Steiner$-hardness reveals new ranking results, indicating a meaningful\u0000direction for enhancing the robustness of graph indexes. 
This paper is accepted\u0000by PVLDB 2025.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"95 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent ISO SQL:2023 standard adopts SQL/PGQ (Property Graph Queries), facilitating graph-like querying within relational databases. This advancement, however, underscores a significant gap in how to effectively optimize SQL/PGQ queries within relational database systems. To address this gap, we extend the foundational SPJ (Select-Project-Join) queries to SPJM queries, which include an additional matching operator for representing graph pattern matching in SQL/PGQ. Although SPJM queries can be converted to SPJ queries and optimized using existing relational query optimizers, our analysis shows that such a graph-agnostic method fails to benefit from graph-specific optimization techniques found in the literature. To address this issue, we develop a converged relational-graph optimization framework called RelGo for optimizing SPJM queries, leveraging joint efforts from both relational and graph query optimizations. Using DuckDB as the underlying relational execution engine, our experiments show that RelGo can generate efficient execution plans for SPJM queries. On well-established benchmarks, these plans exhibit an average speedup of 21.90× compared to those produced by the graph-agnostic optimizer.
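The graph-agnostic baseline, rewriting a matching operator into a plain SPJ query, can be sketched as follows; the table and column names (`person`, `src`, `dst`, `id`) are illustrative assumptions, not RelGo's actual schema:

```python
def match_to_spj(pattern, vertex_table="person"):
    """Rewrite a MATCH path pattern, given as (src_var, edge_label, dst_var)
    triples, into a plain select-project-join query over vertex/edge tables."""
    tables, preds, seen = [], [], set()
    for i, (src, label, dst) in enumerate(pattern):
        for var in (src, dst):
            if var not in seen:  # one vertex-table alias per pattern variable
                seen.add(var)
                tables.append(f"{vertex_table} AS {var}")
        e = f"e{i}"  # one edge-table alias per pattern hop
        tables.append(f"{label} AS {e}")
        preds.append(f"{e}.src = {src}.id")
        preds.append(f"{e}.dst = {dst}.id")
    return "SELECT * FROM " + ", ".join(tables) + " WHERE " + " AND ".join(preds)

# (a)-[:knows]->(b)-[:works_at]->(c)
sql = match_to_spj([("a", "knows", "b"), ("b", "works_at", "c")])
print(sql)
```

A relational optimizer then sees only generic joins; the paper's point is that this erases graph-specific opportunities (e.g., pattern-aware join orders) that a converged optimizer can exploit.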
{"title":"Towards a Converged Relational-Graph Optimization Framework","authors":"Yunkai Lou, Longbin Lai, Bingqing Lyu, Yufan Yang, Xiaoli Zhou, Wenyuan Yu, Ying Zhang, Jingren Zhou","doi":"arxiv-2408.13480","DOIUrl":"https://doi.org/arxiv-2408.13480","url":null,"abstract":"The recent ISO SQL:2023 standard adopts SQL/PGQ (Property Graph Queries),\u0000facilitating graph-like querying within relational databases. This advancement,\u0000however, underscores a significant gap in how to effectively optimize SQL/PGQ\u0000queries within relational database systems. To address this gap, we extend the\u0000foundational SPJ(Select-Project-Join) queries to SPJM queries, which include an\u0000additional matching operator for representing graph pattern matching in\u0000SQL/PGQ. Although SPJM queries can be converted to SPJ queries and optimized\u0000using existing relational query optimizers, our analysis shows that such a\u0000graph-agnostic method fails to benefit from graph-specific optimization\u0000techniques found in the literature. To address this issue, we develop a\u0000converged relational-graph optimization framework called RelGo for optimizing\u0000SPJM queries, leveraging joint efforts from both relational and graph query\u0000optimizations. Using DuckDB as the underlying relational execution engine, our\u0000experiments show that RelGo can generate efficient execution plans for SPJM\u0000queries. 
On well-established benchmarks, these plans exhibit an average speedup\u0000of 21.90$times$ compared to those produced by the graph-agnostic optimizer.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vasileios Nakos, Hung Q. Ngo, Charalampos E. Tsourakakis
Functional dependencies (FDs) are a central theme in databases, playing a major role in the design of database schemas and the optimization of queries. In this work, we introduce the targeted least cardinality candidate key problem (TCAND). This problem is defined over a set of functional dependencies $F$ and a target variable set $T \subseteq V$, and it aims to find the smallest set $X \subseteq V$ such that the FD $X \to T$ can be derived from $F$. The TCAND problem generalizes the well-known NP-hard problem of finding the least cardinality candidate key [lucchesi1978candidate], which has previously been shown to be at least as difficult as the set cover problem. We present an integer programming (IP) formulation for the TCAND problem, analogous to a layered set cover problem. We analyze its linear programming (LP) relaxation from two perspectives: we propose two approximation algorithms and investigate the integrality gap. Our findings indicate that the approximation upper bounds for our algorithms are not significantly improvable through LP rounding, a notable distinction from the standard set cover problem. Additionally, we discover that a generalization of the TCAND problem is equivalent to a variant of the set cover problem named red-blue set cover [carr1999red], which cannot be approximated within a sub-polynomial factor in polynomial time under plausible conjectures [chlamtavc2023approximating]. Despite the extensive history surrounding the problem of identifying the least cardinality candidate key, our research contributes new theoretical insights and novel algorithms, and demonstrates that the general TCAND problem poses complexities beyond those encountered in the set cover problem.
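The membership test underlying TCAND, deciding whether $X \to T$ is derivable from $F$, is the classical attribute-closure computation, sketched here in Python (TCAND itself then asks for a smallest $X$ passing this test, which is the hard part):

```python
def derives(F, X, T):
    """Decide whether the FD X -> T follows from the FD set F, given as
    pairs (lhs, rhs) of attribute sets, by computing the closure X+."""
    closure, changed = set(X), True
    while changed:  # apply FDs until no new attribute is derived
        changed = False
        for lhs, rhs in F:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return set(T) <= closure

F = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(derives(F, {"A"}, {"C"}), derives(F, {"B"}, {"A"}))  # True False
```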
{"title":"Targeted Least Cardinality Candidate Key for Relational Databases","authors":"Vasileios Nakos, Hung Q. Ngo, Charalampos E. Tsourakakis","doi":"arxiv-2408.13540","DOIUrl":"https://doi.org/arxiv-2408.13540","url":null,"abstract":"Functional dependencies (FDs) are a central theme in databases, playing a\u0000major role in the design of database schemas and the optimization of queries.\u0000In this work, we introduce the {it targeted least cardinality candidate key\u0000problem} (TCAND). This problem is defined over a set of functional dependencies\u0000$F$ and a target variable set $T subseteq V$, and it aims to find the smallest\u0000set $X subseteq V$ such that the FD $X to T$ can be derived from $F$. The\u0000TCAND problem generalizes the well-known NP-hard problem of finding the least\u0000cardinality candidate key~cite{lucchesi1978candidate}, which has been\u0000previously demonstrated to be at least as difficult as the set cover problem. We present an integer programming (IP) formulation for the TCAND problem,\u0000analogous to a layered set cover problem. We analyze its linear programming\u0000(LP) relaxation from two perspectives: we propose two approximation algorithms\u0000and investigate the integrality gap. Our findings indicate that the\u0000approximation upper bounds for our algorithms are not significantly improvable\u0000through LP rounding, a notable distinction from the standard set cover problem.\u0000Additionally, we discover that a generalization of the TCAND problem is\u0000equivalent to a variant of the set cover problem, named red-blue set\u0000cover~cite{carr1999red}, which cannot be approximated within a sub-polynomial\u0000factor in polynomial time under plausible\u0000conjectures~cite{chlamtavc2023approximating}. 
Despite the extensive history\u0000surrounding the issue of identifying the least cardinality candidate key, our\u0000research contributes new theoretical insights, novel algorithms, and\u0000demonstrates that the general TCAND problem poses complexities beyond those\u0000encountered in the set cover problem.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Our algorithm GNN (Graph Neural Network and Large Language Model Based for Data Discovery) inherits the benefits of PLOD (Predictive Learning Optimal Data Discovery) [hoang2024plod] and BOD (Blindly Optimal Data Discovery) [Hoang2024BODBO] in overcoming the challenges of having to predefine a utility function and of requiring human input for attribute ranking, which helps avoid a time-consuming loop process. Beyond these previous works, GNN leverages graph neural networks and large language models to understand text-valued attributes that PLOD and BOD cannot, making outcome prediction more reliable. GNN can be seen as an extension of PLOD that captures the user's preferences based not only on numerical values but also on text values, furthering the promise of data science and analytics.
{"title":"GNN: Graph Neural Network and Large Language Model Based for Data Discovery","authors":"Thomas Hoang","doi":"arxiv-2408.13609","DOIUrl":"https://doi.org/arxiv-2408.13609","url":null,"abstract":"Our algorithm GNN: Graph Neural Network and Large Language Model Based for\u0000Data Discovery inherits the benefits of cite{hoang2024plod} (PLOD: Predictive\u0000Learning Optimal Data Discovery), cite{Hoang2024BODBO} (BOD: Blindly Optimal\u0000Data Discovery) in terms of overcoming the challenges of having to predefine\u0000utility function and the human input for attribute ranking, which helps prevent\u0000the time-consuming loop process. In addition to these previous works, our\u0000algorithm GNN leverages the advantages of graph neural networks and large\u0000language models to understand text type values that cannot be understood by\u0000PLOD and MOD, thus making the task of predicting outcomes more reliable. GNN\u0000could be seen as an extension of PLOD in terms of understanding the text type\u0000value and the user's preferences based on not only numerical values but also\u0000text values, making the promise of data science and analytics purposes.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Schema matching is the process of identifying correspondences between the elements of two given schemata; it is essential for database management systems, data integration, and data warehousing. The inherent uncertainty of current schema matching algorithms leads to the generation of a set of candidate matches. Storing these results necessitates databases and systems capable of handling probabilistic queries, which complicates the querying process and increases the associated storage costs. Motivated by GPT-4's outstanding performance, we explore its potential to reduce this uncertainty. Our proposal is to supplant the role of crowdworkers with GPT-4 for verifying the set of candidate matches. To obtain more precise correspondence-verification responses from GPT-4, we crafted Semantic-match and Abbreviation-match prompts, achieving state-of-the-art recall on two benchmark datasets: 100% (+0.0) on DeepMDatasets and 91.8% (+2.2) on Fabricated-Datasets. To optimise budget utilisation, we devised a cost-aware solution that, within the constraints of the budget, delivers favourable outcomes with minimal time expenditure. We introduce a novel framework, Prompt-Matcher, to reduce the uncertainty in integrating multiple automatic schema matching algorithms and in selecting complex parameterizations. It assists users in diminishing the uncertainty associated with candidate schema match results and in optimally ranking the most promising matches. We formally define the Correspondence Selection Problem (CSP), aiming to optimise the revenue within the confines of the GPT-4 budget. We demonstrate that CSP is NP-hard and propose an approximation algorithm with minimal time expenditure. Ultimately, we demonstrate the efficacy of Prompt-Matcher through rigorous experiments.
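Budget-constrained selection of which candidate matches to send to GPT-4 can be sketched with a simple greedy gain-per-cost heuristic. This is an illustrative baseline only, not the paper's approximation algorithm, and the gain/cost numbers are hypothetical:

```python
def select_for_verification(candidates, budget):
    """Pick candidate correspondences to verify, ordered by expected gain
    per unit cost, until the verification budget is exhausted.
    `candidates` holds (match_id, expected_gain, cost) tuples."""
    chosen, spent = [], 0.0
    for name, gain, cost in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:  # skip items that would overshoot the budget
            chosen.append(name)
            spent += cost
    return chosen, spent

candidates = [("m1", 0.9, 10.0), ("m2", 0.5, 2.0), ("m3", 0.4, 8.0)]
chosen, spent = select_for_verification(candidates, budget=10.0)
print(chosen, spent)  # ['m2', 'm3'] 10.0
```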
{"title":"Cost-Aware Uncertainty Reduction in Schema Matching with GPT-4: The Prompt-Matcher Framework","authors":"Longyu Feng, Huahang Li, Chen Jason Zhang","doi":"arxiv-2408.14507","DOIUrl":"https://doi.org/arxiv-2408.14507","url":null,"abstract":"Schema matching is the process of identifying correspondences between the\u0000elements of two given schemata, essential for database management systems, data\u0000integration, and data warehousing. The inherent uncertainty of current schema\u0000matching algorithms leads to the generation of a set of candidate matches.\u0000Storing these results necessitates the use of databases and systems capable of\u0000handling probabilistic queries. This complicates the querying process and\u0000increases the associated storage costs. Motivated by GPT-4 outstanding\u0000performance, we explore its potential to reduce uncertainty. Our proposal is to\u0000supplant the role of crowdworkers with GPT-4 for querying the set of candidate\u0000matches. To get more precise correspondence verification responses from GPT-4,\u0000We have crafted Semantic-match and Abbreviation-match prompt for GPT-4,\u0000achieving state-of-the-art results on two benchmark datasets DeepMDatasets 100%\u0000(+0.0) and Fabricated-Datasets 91.8% (+2.2) recall rate. To optimise budget\u0000utilisation, we have devised a cost-aware solution. Within the constraints of\u0000the budget, our solution delivers favourable outcomes with minimal time\u0000expenditure. We introduce a novel framework, Prompt-Matcher, to reduce the uncertainty in\u0000the process of integration of multiple automatic schema matching algorithms and\u0000the selection of complex parameterization. It assists users in diminishing the\u0000uncertainty associated with candidate schema match results and in optimally\u0000ranking the most promising matches. We formally define the Correspondence\u0000Selection Problem, aiming to optimise the revenue within the confines of the\u0000GPT-4 budget. 
We demonstrate that CSP is NP-Hard and propose an approximation\u0000algorithm with minimal time expenditure. Ultimately, we demonstrate the\u0000efficacy of Prompt-Matcher through rigorous experiments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, Lars Hoemke
Performance testing in large-scale database systems like SAP HANA is a crucial yet labor-intensive task, involving extensive manual analysis of thousands of measurements, such as CPU time and elapsed time. Manual maintenance of these metrics is time-consuming and susceptible to human error, making early detection of performance regressions challenging. We address these issues by proposing an automated approach to detect performance regressions in such measurements. Our approach integrates Bayesian inference with the Pruned Exact Linear Time (PELT) algorithm, detecting change points and performance regressions with higher precision and efficiency than previous approaches. Our method minimizes false negatives and helps ensure the reliability and performance quality of SAP HANA. The proposed solution can accelerate testing and contribute to more sustainable performance management practices in large-scale data management environments.
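The PELT component can be illustrated with a simplified in-memory version using a squared-error segment cost; this sketch omits the paper's Bayesian step and is not BIPeC's implementation:

```python
def pelt(y, beta):
    """Simplified PELT: optimal segmentation of y under a squared-error
    segment cost plus penalty `beta` per change point."""
    n = len(y)
    s1 = [0.0] * (n + 1)  # prefix sums of y
    s2 = [0.0] * (n + 1)  # prefix sums of y squared
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):  # sum of squared deviations of y[a:b] from its mean
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    F = [-beta] + [0.0] * n   # optimal total cost up to index t
    prev = [0] * (n + 1)      # last change point before t
    cands = [0]               # surviving candidates (PELT pruning)
    for t in range(1, n + 1):
        F[t], prev[t] = min((F[s] + cost(s, t) + beta, s) for s in cands)
        # prune candidates that can never be optimal again
        cands = [s for s in cands if F[s] + cost(s, t) <= F[t]] + [t]
    cps, t = [], n
    while prev[t] > 0:        # backtrack the optimal segmentation
        cps.append(prev[t])
        t = prev[t]
    return sorted(cps)

print(pelt([0, 0, 0, 0, 10, 10, 10, 10], beta=5.0))  # [4]
```

The pruning step is what gives PELT its (typically linear) running time over the naive quadratic dynamic program.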
{"title":"BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems","authors":"Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, Lars Hoemke","doi":"arxiv-2408.12414","DOIUrl":"https://doi.org/arxiv-2408.12414","url":null,"abstract":"Performance testing in large-scale database systems like SAP HANA is a\u0000crucial yet labor-intensive task, involving extensive manual analysis of\u0000thousands of measurements, such as CPU time and elapsed time. Manual\u0000maintenance of these metrics is time-consuming and susceptible to human error,\u0000making early detection of performance regressions challenging. We address these\u0000issues by proposing an automated approach to detect performance regressions in\u0000such measurements. Our approach integrates Bayesian inference with the Pruned\u0000Exact Linear Time (PELT) algorithm, enhancing the detection of change points\u0000and performance regressions with high precision and efficiency compared to\u0000previous approaches. Our method minimizes false negatives and ensures SAP\u0000HANA's system's reliability and performance quality. The proposed solution can\u0000accelerate testing and contribute to more sustainable performance management\u0000practices in large-scale data management environments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
{"title":"SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging","authors":"Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik","doi":"arxiv-2408.12733","DOIUrl":"https://doi.org/arxiv-2408.12733","url":null,"abstract":"Text-to-SQL systems, which convert natural language queries into SQL\u0000commands, have seen significant progress primarily for the SQLite dialect.\u0000However, adapting these systems to other SQL dialects like BigQuery and\u0000PostgreSQL remains a challenge due to the diversity in SQL syntax and\u0000functions. We introduce SQL-GEN, a framework for generating high-quality\u0000dialect-specific synthetic data guided by dialect-specific tutorials, and\u0000demonstrate its effectiveness in creating training datasets for multiple\u0000dialects. Our approach significantly improves performance, by up to 20%, over\u0000previous methods and reduces the gap with large-scale human-annotated datasets.\u0000Moreover, combining our synthetic data with human-annotated data provides\u0000additional performance boosts of 3.3% to 5.6%. 
We also introduce a novel\u0000Mixture of Experts (MoE) initialization method that integrates dialect-specific\u0000models into a unified system by merging self-attention layers and initializing\u0000the gates with dialect-specific keywords, further enhancing performance across\u0000different SQL dialects.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finn Klessascheck, Stephan A. Fahrenkrog-Petersen, Jan Mendling, Luise Pufahl
To promote sustainable business practices and to achieve climate neutrality by 2050, the EU has developed the taxonomy of sustainable activities, which describes when exactly business practices can be considered sustainable. While the taxonomy has only recently been established, progressively more companies will have to report how much of their revenue was created via sustainably executed business processes. To help companies prepare to assess whether their business processes comply with the constraints outlined in the taxonomy, we investigate to what extent these criteria can be used for conformance checking, that is, assessing in a data-driven manner whether business process executions adhere to regulatory constraints. To this end, we develop a few-shot learning pipeline that uses an LLM to characterize the constraints of the taxonomy in terms of the process dimensions they relate to. We find that many constraints of the taxonomy are usable for conformance checking, particularly in the energy, manufacturing, and transport sectors.
{"title":"Unlocking Sustainability Compliance: Characterizing the EU Taxonomy for Business Process Management","authors":"Finn Klessascheck, Stephan A. Fahrenkrog-Petersen, Jan Mendling, Luise Pufahl","doi":"arxiv-2408.11386","DOIUrl":"https://doi.org/arxiv-2408.11386","url":null,"abstract":"To promote sustainable business practices, and to achieve climate neutrality\u0000by 2050, the EU has developed the taxonomy of sustainable activities, which\u0000describes when exactly business practices can be considered sustainable. While\u0000the taxonomy has only been recently established, progressively more companies\u0000will have to report how much of their revenue was created via sustainably\u0000executed business processes. To help companies prepare to assess whether their\u0000business processes comply with the constraints outlined in the taxonomy, we\u0000investigate in how far these criteria can be used for conformance checking,\u0000that is, assessing in a data-driven manner, whether business process executions\u0000adhere to regulatory constraints. For this, we develop a few-shot learning\u0000pipeline to characterize the constraints of the taxonomy with the help of an\u0000LLM as to the process dimensions they relate to. We find that many constraints\u0000of the taxonomy are useable for conformance checking, particularly in the\u0000sectors of energy, manufacturing, and transport. 
This will aid companies in\u0000preparing to monitor regulatory compliance with the taxonomy automatically, by\u0000characterizing what kind of information they need to extract, and by providing\u0000a better understanding of sectors where such an assessment is feasible and\u0000where it is not.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}