
Latest Publications in Proc. VLDB Endow.

ZKSQL: Verifiable and Efficient Query Evaluation with Zero-Knowledge Proofs
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594513
Xiling Li, Chenkai Weng, Yongxin Xu, Xiao Wang, Jennie Duggan
Individuals and organizations are using databases to store personal information at an unprecedented rate. This creates a quandary for data providers. On the one hand, they are responsible for protecting the privacy of the individuals described in their databases. On the other hand, they are sometimes required to provide statistics about their data, rather than sharing it wholesale, with strong assurances that these answers are correct and complete, such as in regulatory filings for the US SEC and other government organizations. We introduce a system, ZKSQL, that provides authenticated answers to ad-hoc SQL queries with zero-knowledge proofs. Its proofs show that the answers are correct and sound with respect to the database's contents, and they do not divulge any information about its input records. The system constructs proofs over the steps in a query's evaluation and accelerates this process with authenticated set operations. We validate the efficiency of this approach over a suite of TPC-H queries, and our results show that ZKSQL achieves a two-order-of-magnitude speedup over the baseline.
Pages: 1804-1816
Citations: 0
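
The zero-knowledge machinery that makes ZKSQL's proofs private and succinct cannot be reproduced in a few lines, but the structural idea the abstract describes, building a proof over each step of a query's evaluation, can be illustrated without any cryptography. The sketch below (plain Python, with a made-up relation and two made-up operators) only shows a verifier replaying a claimed per-operator trace; it provides neither zero-knowledge nor soundness against a malicious prover.

```python
# Illustrative only: step-wise validation of a tiny query plan
# (no cryptography, no zero-knowledge -- just the per-operator structure).

def op_filter(rows, pred):
    return [r for r in rows if pred(r)]

def op_project(rows, cols):
    return [{c: r[c] for c in cols} for r in rows]

# Hypothetical input relation.
orders = [
    {"id": 1, "price": 120, "region": "EU"},
    {"id": 2, "price": 80,  "region": "US"},
    {"id": 3, "price": 200, "region": "EU"},
]

# A query plan as a list of (name, operator, argument, claimed_output) steps
# that a prover would commit to; the verifier replays each operator on the
# previous step's output and checks the claim.
step1 = op_filter(orders, lambda r: r["price"] > 100)
step2 = op_project(step1, ["id", "region"])
plan = [
    ("filter price>100", op_filter, lambda r: r["price"] > 100, step1),
    ("project id,region", op_project, ["id", "region"], step2),
]

def verify(plan, base):
    current = base
    for name, op, arg, claimed in plan:
        recomputed = op(current, arg)
        assert recomputed == claimed, f"step '{name}' does not check out"
        current = claimed
    return current

print(verify(plan, orders))   # [{'id': 1, 'region': 'EU'}, {'id': 3, 'region': 'EU'}]
```
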
Collective Grounding: Applying Database Techniques to Grounding Templated Models
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594516
Eriq Augustine, L. Getoor
The process of instantiating, or "grounding", a first-order model is a fundamental component of reasoning in logic. It has been widely studied in the context of theorem proving, database theory, and artificial intelligence. Within the relational learning community, the concept of grounding has been expanded to apply to models that use more general templates in place of first-order logical formulae. In order to perform inference, grounding of these templates is required for instantiating a distribution over possible worlds. However, because of the complex data dependencies that stem from instantiating generalized templates with interconnected data, grounding is often the key computational bottleneck in relational learning. While we motivate our work in the context of relational learning, similar issues arise in probabilistic databases, particularly those that do not make strong tuple-independence assumptions. In this paper, we investigate how key techniques from relational database theory can be utilized to improve the computational efficiency of the grounding process. We introduce the notion of collective grounding, which treats logical programs not as a collection of independent rules, but instead as a joint set of interdependent workloads that can be shared. We introduce the theoretical concept of collective grounding, the components necessary in a collective grounding system, and implementations of these components, and show how to use database theory to speed up these components. We demonstrate collective grounding's effectiveness on seven popular datasets, showing up to a 70% reduction in runtime. Our results are fully reproducible, and all code, data, and experimental scripts are included.
Pages: 1843-1855
Citations: 0
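
Grounding a templated rule amounts to evaluating a conjunctive query over the data, and the "interdependent workloads" view means that rules with overlapping bodies can share that evaluation. The toy below (made-up relations and rules, plain Python) grounds two rules that share one body pattern with a single join; the actual collective grounding system generalizes this far beyond such a hand-picked pair.

```python
# Toy grounding: instantiate templated rules over small relations.
# Hypothetical data and rules, for illustration only.
friend = {("ann", "bob"), ("bob", "carl"), ("ann", "carl")}
smokes = {"ann", "bob"}

# Two rules that share the same body pattern Friend(A, B) & Smokes(A):
#   r1: Friend(A, B) & Smokes(A) -> Smokes(B)
#   r2: Friend(A, B) & Smokes(A) -> Influences(A, B)

def shared_body_join():
    # Evaluate the shared body once instead of once per rule.
    return [(a, b) for (a, b) in friend if a in smokes]

body = shared_body_join()
ground_r1 = [("Smokes", b) for (a, b) in body]          # heads of rule 1
ground_r2 = [("Influences", a, b) for (a, b) in body]   # heads of rule 2

print(ground_r1)
print(ground_r2)
```
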
BASE: Bridging the Gap between Cost and Latency for Query Optimization
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594525
Xu Chen, Zhen Wang, Shuncheng Liu, Yaliang Li, Kai Zeng, Bolin Ding, Jingren Zhou, Han Su, Kai Zheng
Some recent works have shown the advantages of reinforcement learning (RL) based learned query optimizers. These works often use the cost (i.e., the estimate of a cost model) or the latency (i.e., execution time) as the guidance signal for training their learned models. However, cost-based learning underperforms in latency, while latency-based learning is time-intensive. To bypass this dilemma, researchers have attempted to transfer a learned value network from the cost domain to the latency domain. We identify critical insights in cost- and latency-based training that prompt us to transfer the reward function rather than the value network. Based on this idea, we propose a two-stage RL-based framework, BASE, to bridge the gap between cost and latency. After learning a policy based on cost signals in its first stage, BASE formulates transferring the reward function as a variant of inverse reinforcement learning. Intuitively, BASE learns to calibrate the reward function and updates the policy with respect to the calibrated one in a mutually improving manner. Extensive experiments exhibit the superiority of BASE on two benchmark datasets: our optimizer outperforms traditional DBMSs while using 30% less training time than SOTA methods. Meanwhile, our approach can enhance the efficiency of other learning-based optimizers.
Pages: 1958-1966
Citations: 0
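
BASE transfers the reward function via a variant of inverse reinforcement learning, which is well beyond a few lines of code. As a deliberately simplified stand-in, the sketch below fits a linear calibration from cost-model scores to a few observed latencies and ranks candidate plans by the calibrated score; the plan names and numbers are invented, and this is not the paper's method, only an illustration of mapping cost-domain signals into the latency domain before using them as a training signal.

```python
# Toy stand-in for reward calibration (not the paper's inverse-RL method):
# fit a simple linear map from cost-model scores to observed latencies,
# then rank candidate plans by the calibrated score. Numbers are made up.
import statistics

observed = [            # (cost-model estimate, measured latency in ms)
    (100.0, 40.0),
    (250.0, 95.0),
    (400.0, 170.0),
]

# Least-squares fit: latency ~ a * cost + b.
xs = [c for c, _ in observed]
ys = [lat for _, lat in observed]
mx, my = statistics.mean(xs), statistics.mean(ys)
a = sum((x - mx) * (y - my) for x, y in observed) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def calibrated_latency(cost):
    return a * cost + b

candidates = {"plan_A": 180.0, "plan_B": 320.0, "plan_C": 140.0}
best = min(candidates, key=lambda p: calibrated_latency(candidates[p]))
print(best, {p: round(calibrated_latency(c), 1) for p, c in candidates.items()})
```
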
Sim-Piece: Highly Accurate Piecewise Linear Approximation through Similar Segment Merging
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594521
Xenophon Kitsios, Panagiotis Liakos, Katia Papakonstantinopoulou, Y. Kotidis
Approximating a series of timestamped data points using a sequence of line segments with a maximum error guarantee is a fundamental data compression problem, termed piecewise linear approximation (PLA). Due to the increasing need to analyze massive collections of time-series data in diverse domains, the problem has recently received significant attention, and the PLA algorithms that have emerged do help us handle the overwhelming amount of information, at the cost of some precision loss. More specifically, these algorithms entail a trade-off between the maximum precision loss and the space savings achieved. However, advances in the area of lossless compression are undercutting the offerings of PLA techniques on real datasets. In this work, we propose Sim-Piece, a novel lossy compression algorithm for time-series data that optimizes the space required to represent PLA line segments by finding the minimum number of groups these segments can be organized into so as to represent them jointly. Our experimental evaluation demonstrates that our approach readily outperforms competing techniques, attaining compression ratios with more than twofold improvement on average over what PLA algorithms can offer. This allows for significantly higher accuracy with equivalent space requirements. Moreover, our algorithm, due to the simplicity of its merging phase, imposes little overhead while compacting the PLA description, offering a significantly improved trade-off between space and running time. These benefits substantially improve the efficiency with which we can store time-series data, while allowing a tight maximum error in the representation of their values.
Pages: 1910-1922
Citations: 5
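
Sim-Piece's contribution is the merging of similar segments after a PLA pass; that pass itself follows the classical single-scan, maximum-error idea, which is easy to sketch. The code below (plain Python, made-up series and error bound) implements only that greedy PLA step, anchoring each segment at its first point and narrowing the feasible slope range point by point; the paper's grouping and joint representation of similar segments are not shown.

```python
# Greedy piecewise linear approximation with a maximum error bound eps.
# This is the classical single-pass idea Sim-Piece builds on, not the
# paper's similar-segment merging; data and eps are made up.

def greedy_pla(points, eps):
    """points: list of (t, v) with strictly increasing t. Returns segments
    as (t_start, v_start, slope_lo, slope_hi, t_end); any slope in the
    range keeps every covered point within +/- eps."""
    segments = []
    t0, v0 = points[0]
    lo, hi = float("-inf"), float("inf")
    t_end = t0
    for t, v in points[1:]:
        s_lo = (v - eps - v0) / (t - t0)   # smallest slope keeping |error| <= eps
        s_hi = (v + eps - v0) / (t - t0)   # largest slope keeping |error| <= eps
        if max(lo, s_lo) > min(hi, s_hi):  # no slope fits all points: close segment
            segments.append((t0, v0, lo, hi, t_end))
            t0, v0 = t, v
            lo, hi = float("-inf"), float("inf")
        else:
            lo, hi = max(lo, s_lo), min(hi, s_hi)
        t_end = t
    segments.append((t0, v0, lo, hi, t_end))
    return segments

series = [(0, 1.0), (1, 1.4), (2, 1.9), (3, 5.0), (4, 5.2), (5, 5.1)]
print(greedy_pla(series, eps=0.5))   # two segments: one up to t=2, one from t=3
```
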
Learning and Deducing Temporal Orders
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594524
W. Fan, Resul Tugay, Yaoshu Wang, Min Xie, M. Ali
This paper studies how to determine temporal orders on attribute values in a set of tuples that pertain to the same entity, in the absence of complete timestamps. We propose a creator-critic framework to learn and deduce temporal orders by combining deep learning and rule-based deduction, referred to as GATE (Get the lATEst). The creator of GATE trains a ranking model via deep learning, to learn temporal orders and rank attribute values based on correlations among the attributes. The critic then validates the temporal orders learned and deduces more ranked pairs by chasing the data with currency constraints; it also provides augmented training data as feedback for the creator to improve the ranking in the next round. The process proceeds until the temporal order obtained becomes stable. Using real-life and synthetic datasets, we show that GATE is able to determine temporal orders with F-measure above 80%, improving over deep learning by 7.8% and over rule-based methods by 34.4%.
Pages: 1944-1957
Citations: 0
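
The critic's deduction step, deriving additional ranked pairs by chasing the data with currency constraints, has a very simple core: temporal orders compose transitively. The toy below (made-up attribute values, plain Python) just closes a small set of learned "older-than" pairs under transitivity; GATE's learned ranking model and its actual currency constraints are not modeled.

```python
# Minimal illustration of the deduction step: given a few ranked pairs
# (a, b) meaning "value a is older than value b", derive more pairs by
# transitive closure. The pairs below are made up.

def transitive_closure(pairs):
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# e.g., address values of one person, partially ordered by the creator.
learned = {("Maple St", "Oak Ave"), ("Oak Ave", "Pine Rd")}
print(sorted(transitive_closure(learned)))
# adds ("Maple St", "Pine Rd"), so "Pine Rd" is deduced to be the latest value
```
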
Pollock: A Data Loading Benchmark
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594518
Gerardo Vitagliano, Mazhar Hameed, Lan Jiang, Lucas Reisener, Eugene Wu, Felix Naumann
Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is csv. Yet, the plain-text and flexible nature of this format often makes such files difficult to parse and their content hard to load correctly, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard csv formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process that generates dialects for any given grammar. Our benchmark leverages this pollution framework for the csv format. To guide pollution, we have surveyed thousands of real-world, publicly available csv files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular csv parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
Pages: 1870-1882
Citations: 2
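
The pollution idea can be shown in miniature with the standard csv module: render one table under a few alternative dialects and score a naive loader on whether it recovers the original rows. The dialects and table below are invented and cover only delimiter and quote-character changes, a tiny slice of the structural inconsistencies the real benchmark models.

```python
# Tiny "pollution" experiment in the spirit of the benchmark: render one
# table under several csv dialects and score a loader on whether it
# recovers the original rows. Real Pollock covers far more issue types.
import csv, io

rows = [["id", "name", "note"], ["1", "Ada", "loves, commas"], ["2", "Bob", ""]]

dialects = {
    "standard":      {"delimiter": ",", "quotechar": '"'},
    "semicolon":     {"delimiter": ";", "quotechar": '"'},
    "single_quoted": {"delimiter": ",", "quotechar": "'"},
}

def render(rows, **fmt):
    buf = io.StringIO()
    csv.writer(buf, **fmt, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    return buf.getvalue()

def naive_loader(text):
    # Deliberately simplistic loader: always assumes comma + double quote.
    return [r for r in csv.reader(io.StringIO(text))]

for name, fmt in dialects.items():
    polluted = render(rows, **fmt)
    ok = naive_loader(polluted) == rows
    print(f"{name:14s} loaded correctly: {ok}")
```

Only the standard dialect round-trips; the naive loader mangles the other two, which is exactly the kind of failure the benchmark is designed to measure systematically.
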
An Experimental Evaluation of Process Concept Drift Detection
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594517
Jan Niklas Adams, Cameron Pitsch, T. Brockhoff, Wil M.P. van der Aalst
Process mining provides techniques to learn models from event data. These models can be descriptive (e.g., Petri nets) or predictive (e.g., neural networks). The learned models offer operational support to process owners through conformance checking, process enhancement, or predictive monitoring. However, processes are frequently subject to significant changes, making the learned models outdated and less valuable over time. To tackle this problem, Process Concept Drift (PCD) detection techniques are employed. By identifying when process changes occur, one can replace learned models by relearning, updating, or discounting pre-drift knowledge. Various techniques to detect PCDs have been proposed. However, each technique's evaluation focuses on a different subset of evaluation goals, drawn from accuracy, latency, versatility, scalability, parameter sensitivity, and robustness. Furthermore, the employed evaluation techniques and data sets differ. Since many techniques are not evaluated against more than one other technique, this lack of comparability raises one question: how do PCD detection techniques compare against each other? With this paper, we propose, implement, and apply a unified evaluation framework for PCD detection. We do this by collecting evaluation goals and evaluation techniques together with data sets. We derive a representative sample of techniques from a taxonomy for PCD detection. The implemented techniques and the proposed evaluation framework are provided in a publicly available repository. We present the results of our experimental evaluation and observe that none of the implemented techniques works well across all evaluation goals. However, the results indicate future improvement points for the algorithms and guide practitioners.
Pages: 1856-1869
Citations: 0
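
Many drift detectors share a simple skeleton: summarize behavior in two adjacent windows of the event log and flag a drift when the summaries diverge. The sketch below (made-up log, window size, and threshold) compares activity-frequency distributions with an L1 distance; it is a generic illustration of that skeleton, not any specific technique evaluated in the paper.

```python
# Generic windowed drift check over an event log (activity names only).
# This mirrors a common detector skeleton, not any technique from the
# study; log, window size, and threshold are made up.
from collections import Counter

def activity_distribution(window):
    counts = Counter(window)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def l1_distance(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

log = ["A", "B", "C"] * 20 + ["A", "D", "D"] * 20   # drift: activity C replaced by D
window_size, threshold = 30, 0.5

for i in range(0, len(log) - 2 * window_size + 1, window_size):
    ref = activity_distribution(log[i:i + window_size])
    cur = activity_distribution(log[i + window_size:i + 2 * window_size])
    if l1_distance(ref, cur) > threshold:
        print(f"possible drift after event {i + window_size}")
```
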
Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces
Pub Date : 2023-04-01 DOI: 10.14778/3594512.3594527
Xi Zhao, Yao Tian, Kai Huang, Bolong Zheng, Xiaofang Zhou
The approximate nearest neighbor (ANN) search problem in high-dimensional spaces is fundamental but computationally very expensive. Many methods have been designed to solve the ANN problem, such as LSH-based methods and graph-based methods. LSH-based methods can be costly when high query quality is required, due to hash-boundary issues, while graph-based methods can achieve better query performance through greedy expansion in an approximate proximity graph (APG). However, the construction cost of these APGs can be one or two orders of magnitude higher than that of building hash-based indexes. In addition, they fall short in incrementally maintaining APGs as the underlying dataset evolves. In this paper, we propose a novel approach named LSH-APG to build APGs and facilitate fast ANN search using a lightweight LSH framework. LSH-APG builds an APG by consecutively inserting points based on their nearest-neighbor relationships, with an efficient and accurate LSH-based search strategy. A high-quality entry point selection technique and an LSH-based pruning condition are developed to accelerate index construction and query processing by reducing the number of points accessed during the search. LSH-APG supports fast maintenance of APGs in lieu of building them from scratch as the dataset evolves. Its maintenance and query costs for a point are proven to be less affected by dataset cardinality. Extensive experiments on real-world and synthetic datasets demonstrate that LSH-APG incurs significantly less construction cost yet achieves better query performance than existing graph-based methods.
Pages: 1979-1991
Citations: 0
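
The query-side primitive shared by APG-based indexes is greedy best-first expansion from an entry point. The sketch below runs that primitive on a tiny hand-built proximity graph (made-up vectors and neighbor lists); the LSH-based construction, entry-point selection, and pruning that distinguish LSH-APG are not reproduced.

```python
# Greedy best-first search on a small proximity graph -- the query-side
# primitive used by graph-based ANN indexes. The graph and vectors are
# hand-made for illustration.
import heapq

points = {0: (0.0, 0.0), 1: (1.0, 0.2), 2: (2.1, 0.1), 3: (2.0, 2.0), 4: (0.1, 2.2)}
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [0, 3]}   # neighbor lists

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_search(query, entry, k=2, beam=3):
    visited = {entry}
    candidates = [(dist(query, points[entry]), entry)]   # min-heap by distance
    results = [(-candidates[0][0], entry)]               # max-heap of current best
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= beam:
            break                                        # cannot improve the beam
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(query, points[nb])
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > beam:
                    heapq.heappop(results)               # keep only the best `beam`
    return sorted((-d, n) for d, n in results)[:k]

print(greedy_search(query=(1.8, 0.3), entry=0))          # nodes 2 and 1 are nearest
```
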
Semi-Oblivious Chase Termination for Linear Existential Rules: An Experimental Study
Pub Date : 2023-03-22 DOI: 10.48550/arXiv.2303.12851
M. Calautti, Mostafa Milani, Andreas Pieris
The chase procedure is a fundamental algorithmic tool in databases that allows us to reason with constraints, such as existential rules, with a plethora of applications. It takes as input a database and a set of constraints, and iteratively completes the database as dictated by the constraints. A key challenge, though, is the fact that it may not terminate, which leads to the problem of checking whether it terminates given a database and a set of constraints. In this work, we focus on the semi-oblivious version of the chase, which is well-suited for practical implementations, and linear existential rules, a central class of constraints with several applications. In this setting, there is a mature body of theoretical work that provides syntactic characterizations of when the chase terminates, algorithms for checking chase termination, and precise complexity results. Our main objective is to experimentally evaluate the existing chase termination algorithms with the aim of understanding which input parameters affect their performance, clarifying whether they can be used in practice, and revealing their performance limitations.
Pages: 2858-2870
Citations: 0
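
For readers unfamiliar with the procedure being benchmarked, the sketch below is a minimal semi-oblivious chase restricted to linear rules (a single body atom), run with an explicit step budget on a deliberately non-terminating toy rule. The rule encoding is invented for this example, and actual termination checkers analyze the rule set symbolically rather than executing the chase like this.

```python
# Minimal semi-oblivious chase for linear existential rules (single body atom).
# Atoms are (predicate, args) tuples; body arguments are variables ("?x"),
# head arguments may reuse body variables or introduce existentials ("!y")
# that receive fresh labelled nulls. The toy rule below does not terminate,
# so the run stops at the step budget.
import itertools

fresh = itertools.count()

def chase(instance, rules, max_steps=100):
    instance = set(instance)
    fired = set()                                   # (rule id, frontier binding)
    for _ in range(max_steps):
        new_atoms = set()
        for rid, (body, head) in enumerate(rules):
            b_pred, b_args = body
            for pred, args in instance:
                if pred != b_pred or len(args) != len(b_args):
                    continue
                binding = dict(zip(b_args, args))
                # Semi-oblivious: fire once per assignment of the frontier
                # variables (body variables that also appear in the head).
                frontier = tuple(sorted((v, c) for v, c in binding.items()
                                        if any(v in h_args for _, h_args in head)))
                if (rid, frontier) in fired:
                    continue
                fired.add((rid, frontier))
                nulls = {}
                for h_pred, h_args in head:
                    out = []
                    for a in h_args:
                        if a.startswith("!"):
                            if a not in nulls:
                                nulls[a] = f"_N{next(fresh)}"
                            out.append(nulls[a])
                        else:
                            out.append(binding.get(a, a))
                    new_atoms.add((h_pred, tuple(out)))
        if new_atoms <= instance:
            return instance, True                   # chase terminated
        instance |= new_atoms
    return instance, False                          # step budget exhausted

# Person(?x) -> exists !y: HasParent(?x, !y), Person(!y)
rules = [(("Person", ("?x",)),
          [("HasParent", ("?x", "!y")), ("Person", ("!y",))])]
result, terminated = chase({("Person", ("alice",))}, rules, max_steps=5)
print(terminated, len(result))   # False 11: the budget ran out, as expected here
```
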
SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning
Pub Date : 2023-03-06 DOI: 10.48550/arXiv.2303.03379
Haoteng Yin, Muhan Zhang, Jianguo Wang, Pan Li
Subgraph-based graph representation learning (SGRL) has recently emerged as a powerful tool in many prediction tasks on graphs due to its advantages in model expressiveness and generalization ability. Most previous SGRL models face computational issues related to the high cost of extracting a subgraph for each training or testing query. Recently, SUREL was proposed to accelerate SGRL: it samples random walks offline and joins these walks online as a proxy for subgraphs during prediction. Thanks to the reusability of sampled walks across different queries, SUREL achieves state-of-the-art performance in terms of scalability and prediction accuracy. However, SUREL still suffers from high computational overhead caused by node redundancy in the sampled walks. In this work, we propose a novel framework, SUREL+, that upgrades SUREL by using node sets instead of walks to represent subgraphs. By definition, such set-based representations avoid repeated nodes, but node sets can be irregular in size. To solve this issue, we design a dedicated sparse data structure to efficiently store and access node sets, and provide a specialized operator to join them in parallel batches. SUREL+ is modularized to support multiple types of set samplers, structural features, and neural encoders to compensate for the loss of structural information after the reduction from walks to sets. Extensive experiments verify the effectiveness of SUREL+ in the prediction tasks of links, relation types, and higher-order patterns. SUREL+ achieves 3-11× speedups over SUREL while maintaining comparable or even better prediction performance; compared to other SGRL baselines, SUREL+ achieves ~20× speedups and significantly improves the prediction accuracy.
Pages: 2939-2948
Citations: 2
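
The shift the abstract describes, from storing sampled walks to storing the set of nodes they visit, is easy to picture: walks revisit nodes, sets do not. The sketch below (made-up graph and sampling parameters) samples a few short random walks around a node and collapses them into one deduplicated node set; SUREL+'s sparse set storage, parallel join operator, and neural encoders are not modeled.

```python
# Illustrating walks -> node sets: sample a few random walks from a node,
# then keep only the set of visited nodes (deduplicated, irregular size).
# Graph and sampling parameters are made up.
import random

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}

def sample_walks(start, num_walks=4, length=3, seed=7):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        walk, node = [start], start
        for _ in range(length):
            node = rng.choice(graph[node])
            walk.append(node)
        walks.append(walk)
    return walks

walks = sample_walks(0)
node_set = sorted({n for w in walks for n in w})   # the set-based view
print("walks   :", walks)                          # redundant nodes across walks
print("node set:", node_set)                       # compact, no repetition
```
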