Identifying similar-bicliques in bipartite graphs
Pub Date: 2024-01-25 | DOI: 10.1007/s00778-023-00834-9
Kai Yao, Lijun Chang, Jeffrey Xu Yu
Bipartite graphs have been widely used to model relationships between entities of two different types, where the vertices are partitioned into two disjoint sets (sides). Finding dense subgraphs in a bipartite graph is of great significance and has many applications. However, none of the existing dense bipartite subgraph models considers the similarity between vertices on the same side; as a result, the identified subgraphs may contain vertices that are not similar to each other. In this work, we formulate the notion of a similar-biclique, a special kind of biclique in which all vertices on a designated side are similar to each other, and we aim to enumerate all similar-bicliques. The naive approach of first enumerating all maximal bicliques and then extracting all maximal similar-bicliques from them is inefficient, as enumerating maximal bicliques is already time consuming. We propose a backtracking algorithm, MSBE, that directly enumerates maximal similar-bicliques, and we accelerate it with vertex reduction and other optimization techniques. In addition, we design a novel index structure to speed up a time-critical operation of MSBE as well as vertex reduction, and we develop efficient index construction algorithms. To handle dynamic graph updates, we also propose algorithms and optimization techniques for maintaining the index. Finally, we parallelize our index construction algorithms to exploit multiple CPU cores. Extensive experiments on 17 bipartite graphs, together with case studies, demonstrate the effectiveness and efficiency of our model and algorithms.
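The abstract leaves the similarity measure abstract; a common choice for same-side similarity in bipartite graphs is structural (Jaccard) similarity over neighborhoods. The following minimal Python sketch, under that assumption, checks whether a candidate pair of vertex sets forms a similar-biclique; the graph, the threshold eps, and all names are illustrative, not the paper's API.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def is_similar_biclique(adj, left, right, eps):
    """Check whether (left, right) is a similar-biclique:
    every left vertex is adjacent to every right vertex, and
    all left vertices are pairwise similar (Jaccard >= eps)."""
    # Biclique condition: complete bipartite connectivity.
    if not all(right <= adj[u] for u in left):
        return False
    # Similarity condition on the designated (left) side.
    return all(jaccard(adj[u], adj[v]) >= eps
               for u, v in combinations(left, 2))

# Toy bipartite graph: adjacency from left vertices to right vertices.
adj = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}
print(is_similar_biclique(adj, {"u1", "u2"}, {"a", "b"}, eps=0.6))  # True
```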
{"title":"Identifying similar-bicliques in bipartite graphs","authors":"Kai Yao, Lijun Chang, Jeffrey Xu Yu","doi":"10.1007/s00778-023-00834-9","DOIUrl":"https://doi.org/10.1007/s00778-023-00834-9","url":null,"abstract":"<p>Bipartite graphs have been widely used to model the relationship between entities of different types, where vertices are partitioned into two disjoint sets/sides. Finding dense subgraphs in a bipartite graph is of great significance and encompasses many applications. However, none of the existing dense bipartite subgraph models consider similarity between vertices from the same side, and as a result, the identified results may include vertices that are not similar to each other. In this work, we formulate the notion of similar-biclique which is a special kind of biclique where all vertices from a designated side are similar to each other and aim to enumerate all similar-bicliques. The naive approach of first enumerating all maximal bicliques and then extracting all maximal similar-bicliques from them is inefficient, as enumerating maximal bicliques is already time consuming. We propose a backtracking algorithm <span>(textsf{MSBE})</span> to directly enumerate maximal similar-bicliques and power it by vertex reduction and optimization techniques. In addition, we design a novel index structure to speed up a time-critical operation of <span>(textsf{MSBE})</span>, as well as to speed up vertex reduction. Efficient index construction algorithms are developed. To handle dynamic graph updates, we also propose algorithms and optimization techniques for maintaining our index. Finally, we parallelize our index construction algorithms to exploit multiple CPU cores. Extensive experiments on 17 bipartite graphs as well as case studies are conducted to demonstrate the effectiveness and efficiency of our model and algorithms.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139561030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards flexibility and robustness of LSM trees
Pub Date: 2024-01-11 | DOI: 10.1007/s00778-023-00826-9
Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis
Log-structured merge trees (LSM trees) are increasingly used as part of the storage engine behind several data systems and are frequently deployed in the cloud. As the number of applications relying on LSM-based storage backends grows, the problem of performance tuning of LSM trees is receiving increasing attention. We consider both nominal tunings, where the workload and execution environment are accurately known a priori, and robust tunings, which account for uncertainty in the workload knowledge. This type of workload uncertainty is common in modern applications, notably in shared infrastructure environments like the public cloud. To address this problem, we introduce Endure, a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policy, size ratio, and memory allocation on overall performance. Endure considers a robust formulation of the throughput maximization problem and recommends a tuning that offers near-optimal throughput when the executed workload is not identical to the expected one but lies in a neighborhood of it. Additionally, we explore the robustness of flexible LSM designs by proposing a new unified design called K-LSM that encompasses existing designs. We deploy our robust tuning system, Endure, on a state-of-the-art key-value store, RocksDB, and demonstrate throughput improvements of up to 5× in the presence of uncertainty. Our results indicate that the tunings obtained by Endure are more robust than tunings obtained under our expanded LSM design space. This suggests that robustness may not be inherent to a design; instead, it is an outcome of a tuning process that explicitly accounts for uncertainty.
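Endure's core idea, as described, is a max-min formulation: choose the tuning that maximizes throughput under the worst workload in a neighborhood of the expected one. A minimal Python sketch of that formulation follows; the cost model, the neighborhood definition, and the parameter grids are invented toys, not Endure's actual model.

```python
import itertools

def throughput(tuning, workload):
    """Toy stand-in for an LSM cost model: higher size ratios favor
    reads, more memtable memory favors writes. Purely illustrative."""
    size_ratio, mem_mb = tuning
    reads, writes = workload  # fractions summing to 1
    return reads * size_ratio + writes * (mem_mb / 10.0)

def neighborhood(expected, rho, step=0.05):
    """Workloads whose read fraction is within rho of the expected one."""
    r0, _ = expected
    k = int(rho / step)
    for i in range(-k, k + 1):
        r = min(1.0, max(0.0, r0 + i * step))
        yield (r, 1.0 - r)

def robust_tuning(expected, rho, candidates):
    """Pick the tuning maximizing worst-case throughput over the
    neighborhood, rather than throughput at the expected workload."""
    return max(candidates,
               key=lambda t: min(throughput(t, w)
                                 for w in neighborhood(expected, rho)))

candidates = list(itertools.product([2, 4, 8, 10], [16, 64, 256]))
print(robust_tuning(expected=(0.7, 0.3), rho=0.2, candidates=candidates))
```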
{"title":"Towards flexibility and robustness of LSM trees","authors":"Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis","doi":"10.1007/s00778-023-00826-9","DOIUrl":"https://doi.org/10.1007/s00778-023-00826-9","url":null,"abstract":"<p>Log-structured merge trees (LSM trees) are increasingly used as part of the storage engine behind several data systems, and are frequently deployed in the cloud. As the number of applications relying on LSM-based storage backends increases, the problem of performance tuning of LSM trees receives increasing attention. We consider both <i>nominal</i> tunings—where workload and execution environment are accurately known a priori—and <i>robust</i> tunings—which consider <i>uncertainty</i> in the workload knowledge. This type of workload uncertainty is common in modern applications, notably in shared infrastructure environments like the public cloud. To address this problem, we introduce <span>Endure</span>, a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policy, size ratio, and memory allocation on the overall performance. <span>Endure</span> considers a robust formulation of the throughput maximization problem and recommends a tuning that offers near-optimal throughput when the executed workload is not the same, instead in a <i>neighborhood</i> of the expected workload. Additionally, we explore the robustness of flexible LSM designs by proposing a new unified design called K-LSM that encompasses existing designs. We deploy our robust tuning system, <span>Endure</span>, on a state-of-the-art key-value store, RocksDB, and demonstrate throughput improvements of up to 5<span>(times )</span> in the presence of uncertainty. Our results indicate that the tunings obtained by <span>Endure</span> are more robust than tunings obtained under our expanded LSM design space. This indicates that robustness may not be inherent to a design, instead, it is an outcome of a tuning process that explicitly accounts for uncertainty.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139423062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DB-BERT: making database tuning tools “read” the manual
Pub Date: 2023-12-27 | DOI: 10.1007/s00778-023-00831-y
Immanuel Trummer
DB-BERT is a database tuning tool that exploits information gained via natural language analysis of manuals and other relevant text documents. It uses text to identify database system parameters to tune as well as recommended parameter values. DB-BERT applies large, pre-trained language models (specifically, the BERT model) for text analysis. During an initial training phase, it fine-tunes model weights in order to translate natural language hints into recommended settings. At run time, DB-BERT learns to aggregate, adapt, and prioritize hints to achieve optimal performance for a specific database system and benchmark. Both phases are iterative and use reinforcement learning to guide the selection of tuning settings to evaluate (penalizing settings that the database system rejects while rewarding settings that improve performance). In our experiments, we leverage hundreds of text documents about database tuning as input for DB-BERT. We compare DB-BERT against various baselines, considering different benchmarks (TPC-C and TPC-H), metrics (throughput and run time), as well as database systems (PostgreSQL and MySQL). The experiments demonstrate clearly that DB-BERT benefits from combining general information about database tuning, mined from text documents, with scenario-specific insights, gained via trial runs. The full source code of DB-BERT is available online at https://itrummer.github.io/dbbert/.
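At a high level, the run-time phase described above is a trial-and-reward loop over candidate settings. The Python sketch below caricatures such a loop under heavy assumptions: the hint list, the trial_run stub, and the bandit-style weight update are invented placeholders, not DB-BERT's actual model or API.

```python
import random

def trial_run(settings):
    """Stand-in for benchmarking the DBMS under the given settings;
    returns a reward such as measured throughput."""
    return random.random()  # placeholder for a real benchmark run

# Hints as (parameter, recommended value) pairs mined from text.
hints = [("shared_buffers", "8GB"), ("work_mem", "64MB"),
         ("random_page_cost", "1.1")]
weights = {h: 1.0 for h in hints}  # learned priority per hint

# Simplified episodic loop: pick a subset of hints by weight,
# benchmark, and reinforce hints that appeared in good episodes.
for episode in range(20):
    chosen = [h for h in hints
              if random.random() < weights[h] / (1 + weights[h])]
    reward = trial_run(dict(chosen))
    for h in chosen:  # crude positive/negative reinforcement
        weights[h] = max(0.1, weights[h] + (reward - 0.5))

print(sorted(weights.items(), key=lambda kv: -kv[1]))
```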
{"title":"DB-BERT: making database tuning tools “read” the manual","authors":"Immanuel Trummer","doi":"10.1007/s00778-023-00831-y","DOIUrl":"https://doi.org/10.1007/s00778-023-00831-y","url":null,"abstract":"<p>DB-BERT is a database tuning tool that exploits information gained via natural language analysis of manuals and other relevant text documents. It uses text to identify database system parameters to tune as well as recommended parameter values. DB-BERT applies large, pre-trained language models (specifically, the BERT model) for text analysis. During an initial training phase, it fine-tunes model weights in order to translate natural language hints into recommended settings. At run time, DB-BERT learns to aggregate, adapt, and prioritize hints to achieve optimal performance for a specific database system and benchmark. Both phases are iterative and use reinforcement learning to guide the selection of tuning settings to evaluate (penalizing settings that the database system rejects while rewarding settings that improve performance). In our experiments, we leverage hundreds of text documents about database tuning as input for DB-BERT. We compare DB-BERT against various baselines, considering different benchmarks (TPC-C and TPC-H), metrics (throughput and run time), as well as database systems (PostgreSQL and MySQL). The experiments demonstrate clearly that DB-BERT benefits from combining general information about database tuning, mined from text documents, with scenario-specific insights, gained via trial runs. The full source code of DB-BERT is available online at https://itrummer.github.io/dbbert/.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139055027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable decoupling graph neural network with feature-oriented optimization
Pub Date: 2023-12-27 | DOI: 10.1007/s00778-023-00829-6
Ningyi Liao, Dingheng Mo, Siqiang Luo, Xiang Li, Pengcheng Yin
Recent advances in data processing have stimulated the demand for learning on graphs of very large scale. Graph neural networks (GNNs), an emerging and powerful approach to graph learning tasks, are known to be difficult to scale up. Most scalable models apply node-based techniques to simplify the expensive graph message-passing propagation procedure of GNNs. However, we find such acceleration insufficient when applied to million- or even billion-scale graphs. In this work, we propose SCARA, a scalable GNN with feature-oriented optimization for graph computation. SCARA efficiently computes graph embeddings from the dimension of node features, and further selects and reuses feature computation results to reduce overhead. Theoretical analysis indicates that our model achieves sub-linear time complexity with guaranteed precision in the propagation process as well as in GNN training and inference. We conduct extensive experiments on various datasets to evaluate the efficacy and efficiency of SCARA. Performance comparison with baselines shows that SCARA achieves up to 800× faster graph propagation than current state-of-the-art methods, with fast convergence and comparable accuracy. Most notably, it can complete the precomputation on the largest available billion-scale GNN dataset, Papers100M (111 M nodes, 1.6 B edges), in 13 s.
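SCARA's stated idea is to run propagation along the feature dimension and to reuse repeated feature-computation results. A rough Python sketch of that idea follows, assuming a PPR-style propagation operator; the abstract does not give the exact operator, precision control, or reuse rule, so all of those are illustrative.

```python
import numpy as np

def propagate_column(A_norm, x, alpha=0.15, iters=10):
    """PPR-style propagation of one feature column by power iteration."""
    h = x.copy()
    for _ in range(iters):
        h = (1 - alpha) * (A_norm @ h) + alpha * x
    return h

def feature_oriented_embedding(A_norm, X, alpha=0.15):
    """Propagate feature-by-feature (columns of X) instead of
    node-by-node, reusing results for identical columns."""
    cache, cols = {}, []
    for j in range(X.shape[1]):
        key = X[:, j].tobytes()        # reuse duplicate columns
        if key not in cache:
            cache[key] = propagate_column(A_norm, X[:, j], alpha)
        cols.append(cache[key])
    return np.stack(cols, axis=1)

# Toy graph: 4 nodes, symmetric normalized adjacency, 3 features.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0],
              [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
d = A.sum(1)
A_norm = A / np.sqrt(np.outer(d, d))
X = np.random.rand(4, 3)
print(feature_oriented_embedding(A_norm, X).shape)  # (4, 3)
```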
{"title":"Scalable decoupling graph neural network with feature-oriented optimization","authors":"Ningyi Liao, Dingheng Mo, Siqiang Luo, Xiang Li, Pengcheng Yin","doi":"10.1007/s00778-023-00829-6","DOIUrl":"https://doi.org/10.1007/s00778-023-00829-6","url":null,"abstract":"<p>Recent advances in data processing have stimulated the demand for learning graphs of very large scales. Graph neural networks (GNNs), being an emerging and powerful approach in solving graph learning tasks, are known to be difficult to scale up. Most scalable models apply node-based techniques in simplifying the expensive graph message-passing propagation procedure of GNNs. However, we find such acceleration insufficient when applied to million- or even billion-scale graphs. In this work, we propose <span>SCARA</span>, a scalable GNN with feature-oriented optimization for graph computation. <span>SCARA</span> efficiently computes graph embedding from the dimension of node features, and further selects and reuses feature computation results to reduce overhead. Theoretical analysis indicates that our model achieves sub-linear time complexity with a guaranteed precision in propagation process as well as GNN training and inference. We conduct extensive experiments on various datasets to evaluate the efficacy and efficiency of <span>SCARA</span>. Performance comparison with baselines shows that <span>SCARA</span> can reach up to <span>(800times )</span> graph propagation acceleration than current state-of-the-art methods with fast convergence and comparable accuracy. Most notably, it is efficient to process precomputation on the largest available billion-scale GNN dataset Papers100M (111 M nodes, 1.6 B edges) in 13 s.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139054985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hypergraph motifs and their extensions beyond binary
Pub Date: 2023-12-26 | DOI: 10.1007/s00778-023-00827-8
Geon Lee, Seokbum Yoon, Jihoon Ko, Hyunju Kim, Kijung Shin
Hypergraphs naturally represent group interactions, which are omnipresent in many domains: collaborations of researchers, co-purchases of items, and joint interactions of proteins, to name a few. In this work, we propose tools for answering the following questions in a systematic manner: (Q1) What are the structural design principles of real-world hypergraphs? (Q2) How can we compare local structures of hypergraphs of different sizes? (Q3) How can we identify the domains from which hypergraphs originate? We first define hypergraph motifs (h-motifs), which describe the overlapping patterns of three connected hyperedges. Then, we define the significance of each h-motif in a hypergraph as its number of occurrences relative to that in properly randomized hypergraphs. Lastly, we define the characteristic profile (CP) as the vector of the normalized significance of every h-motif. Regarding Q1, we find that the occurrences of h-motifs in 11 real-world hypergraphs from 5 domains are clearly distinguished from those in randomized hypergraphs. In addition, we demonstrate that CPs capture local structural patterns unique to each domain, so comparing the CPs of hypergraphs addresses Q2 and Q3. The concept of CP extends naturally to representing the connectivity pattern of each node or hyperedge as a vector, which proves useful in node classification and hyperedge prediction. Our algorithmic contribution is MoCHy, a family of parallel algorithms for counting h-motif occurrences in a hypergraph. We analyze their speed and accuracy theoretically, and we show empirically that the advanced approximate version MoCHy-A+ is up to 25× more accurate and 32× faster than the basic approximate and exact versions, respectively. Furthermore, we explore ternary hypergraph motifs, which extend h-motifs by taking into account not only the presence but also the cardinality of intersections among hyperedges. This extension proves beneficial for all previously mentioned applications.
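Concretely, the overlapping pattern of three connected hyperedges can be encoded by which of the seven Venn regions of the three sets are empty, which is the formulation the h-motif definition relies on. A small Python illustration follows (the equivalence up to permutation of the three hyperedges is omitted for brevity).

```python
def h_motif_signature(e1, e2, e3):
    """Emptiness pattern of the seven Venn regions of three hyperedges.
    Triples with the same pattern (up to permuting the hyperedges)
    instantiate the same h-motif."""
    e1, e2, e3 = set(e1), set(e2), set(e3)
    regions = [
        e1 - e2 - e3, e2 - e3 - e1, e3 - e1 - e2,        # exclusive parts
        (e1 & e2) - e3, (e2 & e3) - e1, (e3 & e1) - e2,  # pairwise only
        e1 & e2 & e3,                                    # common core
    ]
    return tuple(len(r) > 0 for r in regions)

# Two connected triples with different overlap structure:
print(h_motif_signature({1, 2}, {2, 3}, {3, 4}))     # open chain
print(h_motif_signature({1, 2}, {2, 3}, {1, 2, 3}))  # nested overlap
```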
{"title":"Hypergraph motifs and their extensions beyond binary","authors":"Geon Lee, Seokbum Yoon, Jihoon Ko, Hyunju Kim, Kijung Shin","doi":"10.1007/s00778-023-00827-8","DOIUrl":"https://doi.org/10.1007/s00778-023-00827-8","url":null,"abstract":"<p>Hypergraphs naturally represent group interactions, which are omnipresent in many domains: collaborations of researchers, co-purchases of items, and joint interactions of proteins, to name a few. In this work, we propose tools for answering the following questions in a systematic manner: (Q1) what are the structural design principles of real-world hypergraphs? (Q2) how can we compare local structures of hypergraphs of different sizes? (Q3) how can we identify domains from which hypergraphs are? We first define <i>hypergraph motifs</i> (h-motifs), which describe the overlapping patterns of three connected hyperedges. Then, we define the significance of each h-motif in a hypergraph as its occurrences relative to those in properly randomized hypergraphs. Lastly, we define the <i>characteristic profile</i> (CP) as the vector of the normalized significance of every h-motif. Regarding Q1, we find that h-motifs ’ occurrences in 11 real-world hypergraphs from 5 domains are clearly distinguished from those of randomized hypergraphs. In addition, we demonstrate that CPs capture local structural patterns unique to each domain, thus comparing CPs of hypergraphs addresses Q2 and Q3. The concept of CP is naturally extended to represent the connectivity pattern of each node or hyperedge as a vector, which proves useful in node classification and hyperedge prediction. Our algorithmic contribution is to propose <span>MoCHy</span>, a family of parallel algorithms for counting h-motifs ’ occurrences in a hypergraph. We theoretically analyze their speed and accuracy and show empirically that the advanced approximate version <span>MoCHy-A</span><span>(^{+})</span> is up to <span>(25times )</span> more accurate and <span>(32times )</span> faster than the basic approximate and exact versions, respectively. Furthermore, we explore <i>ternary hypergraph motifs</i> that extends h-motifs by taking into account not only the presence but also the cardinality of intersections among hyperedges. This extension proves beneficial for all previously mentioned applications.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139054986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPCache: memory-efficient OLAP through proportional caching revisited
Pub Date: 2023-12-22 | DOI: 10.1007/s00778-023-00828-7
Hamish Nicholson, Periklis Chrysogelos, Anastasia Ailamaki
Analytical engines rely on in-memory data caching to avoid storage accesses and provide timely responses by keeping the most frequently accessed data in memory. Purely frequency- and time-based caching decisions, however, are a proxy of the expected query execution speedup only when storage accesses are significantly slower than in-memory query processing. On the other hand, fast storage offers loading times that approach fully in-memory query response times, rendering purely frequency-based statistics incapable of capturing the impact of a caching decision on query execution. For example, caching the input of a frequent query that spends most of its time processing joins is less beneficial than caching a page for a slightly less frequent but scan-heavy query. Thus, existing caching policies waste valuable memory space to cache input data that offer little-to-no acceleration for analytics. This paper proposes HPCache, a buffer management policy that enables fast analytics on high-bandwidth storage by efficiently using the available in-memory space. HPCache caches data based on the speedup potential instead of relying on frequency-based statistics. We show that, with fast storage, the benefit of in-memory caching varies significantly across queries; therefore, we quantify the efficiency of caching decisions and formulate an optimization problem. We implement HPCache in Proteus and show that (i) estimating speedup potential improves memory space utilization, and (ii) simple runtime statistics suffice to infer speedup. We show that HPCache achieves up to a 1.75× speed-up over frequency-based caching policies by caching column proportions and automatically tuning them. Overall, HPCache enables efficient use of the in-memory space for input caching in the presence of fast storage, without requiring workload predictions.
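The principle stated above is to rank cache candidates by speedup potential rather than by raw access frequency. A minimal sketch of that budget-allocation idea follows; the column names, sizes, and speedup estimates are invented, and HPCache's actual policy (caching column proportions with runtime tuning) is richer than this greedy knapsack.

```python
def plan_cache(inputs, budget_bytes):
    """Greedy knapsack: cache the inputs with the highest estimated
    speedup per byte until the memory budget is exhausted."""
    ranked = sorted(inputs, key=lambda i: i["speedup"] / i["bytes"],
                    reverse=True)
    plan, used = [], 0
    for item in ranked:
        if used + item["bytes"] <= budget_bytes:
            plan.append(item["name"])
            used += item["bytes"]
    return plan

inputs = [
    # speedup = est. time saved per run x access frequency (invented)
    {"name": "lineitem.l_extendedprice", "bytes": 8 << 30, "speedup": 40.0},
    {"name": "orders.o_orderdate", "bytes": 1 << 30, "speedup": 9.0},
    {"name": "part.p_type", "bytes": 256 << 20, "speedup": 1.0},
]
print(plan_cache(inputs, budget_bytes=2 << 30))
```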
{"title":"HPCache: memory-efficient OLAP through proportional caching revisited","authors":"Hamish Nicholson, Periklis Chrysogelos, Anastasia Ailamaki","doi":"10.1007/s00778-023-00828-7","DOIUrl":"https://doi.org/10.1007/s00778-023-00828-7","url":null,"abstract":"<p>Analytical engines rely on in-memory data caching to avoid storage accesses and provide timely responses by keeping the most frequently accessed data in memory. Purely frequency- and time-based caching decisions, however, are a proxy of the expected query execution speedup only when storage accesses are significantly slower than in-memory query processing. On the other hand, fast storage offers loading times that approach fully in-memory query response times, rendering purely frequency-based statistics incapable of capturing the impact of a caching decision on query execution. For example, caching the input of a frequent query that spends most of its time processing joins is less beneficial than caching a page for a slightly less frequent but scan-heavy query. Thus, existing caching policies waste valuable memory space to cache input data that offer little-to-no acceleration for analytics. This paper proposes HPCache, a buffer management policy that enables fast analytics on high-bandwidth storage by efficiently using the available in-memory space. HPCache caches data based on the speedup potential instead of relying on frequency-based statistics. We show that, with fast storage, the benefit of in-memory caching varies significantly across queries; therefore, we quantify the efficiency of caching decisions and formulate an optimization problem. We implement HPCache in Proteus and show that (i) estimating speedup potential improves memory space utilization, and (ii) simple runtime statistics suffice to infer speedup. We show that HPCache achieves up to a 1.75x speed-up over frequency-based caching policies by caching column proportions and automatically tuning them. Overall, HPCache enables efficient use of the in-memory space for input caching in the presence of fast storage, without requiring workload predictions.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139020483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new window Clause for SQL++
Pub Date: 2023-12-19 | DOI: 10.1007/s00778-023-00830-z
James Fang, Dmitry Lychagin, Michael J. Carey, Vassilis J. Tsotras
Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner.
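As a flavor of what such queries look like, the sketch below submits a window query over a hypothetical orders dataset to AsterixDB's HTTP query service; the dataset, its fields, and the port are assumptions, and the query uses only the standard OVER(...) form, whereas the paper's new window clause covers a wider range of window queries.

```python
import json
import urllib.request

# An illustrative SQL++ window query: rank each customer's orders by total.
statement = """
SELECT o.custid, o.orderno,
       ROW_NUMBER() OVER (PARTITION BY o.custid
                          ORDER BY o.total DESC) AS rank_in_cust
FROM orders AS o;
"""

# AsterixDB's HTTP query service (default port 19002; adjust as needed).
req = urllib.request.Request(
    "http://localhost:19002/query/service",
    data=json.dumps({"statement": statement}).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["results"])
```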
{"title":"A new window Clause for SQL++","authors":"James Fang, Dmitry Lychagin, Michael J. Carey, Vassilis J. Tsotras","doi":"10.1007/s00778-023-00830-z","DOIUrl":"https://doi.org/10.1007/s00778-023-00830-z","url":null,"abstract":"<p>Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138820864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Label-constrained shortest path query processing on road networks
Pub Date: 2023-12-16 | DOI: 10.1007/s00778-023-00825-w
Computing the shortest path between two vertices is a fundamental problem in road networks. Most existing works assume that the edges in the road network carry no labels, but in many real applications edges do have labels, and label constraints may be placed on the edges appearing on a valid shortest path. Hence, we study label-constrained shortest path queries in this paper. To process such queries efficiently, we adopt an index-based approach and propose a novel index structure, LSD-Index, based on tree decomposition. With LSD-Index, we design efficient query processing and index construction algorithms with good performance guarantees. Moreover, due to the dynamic nature of real-world networks, we also devise maintenance algorithms that keep the index up to date efficiently. To evaluate the proposed methods, we conduct extensive experimental studies on large real road networks, including the whole USA road network. Compared with the state-of-the-art approach, the experimental results demonstrate that our algorithm not only achieves up to two orders of magnitude speedup in query processing time but also consumes much less index space. The results also show that the index can be efficiently constructed and maintained for dynamic graphs.
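For the query semantics (not the paper's LSD-Index), a label-constrained shortest path can be computed by a Dijkstra search that simply skips edges whose label violates the constraint. A minimal index-free Python baseline, with an invented toy graph:

```python
import heapq

def lc_shortest_path(graph, src, dst, allowed):
    """Index-free baseline: Dijkstra restricted to edges whose label
    is in the allowed set. graph[u] = [(v, weight, label), ...]."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w, lab in graph.get(u, []):
            if lab in allowed and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return float("inf")  # no valid path under the label constraint

g = {"a": [("b", 2.0, "highway"), ("c", 1.0, "toll")],
     "b": [("d", 2.0, "highway")],
     "c": [("d", 1.0, "toll")]}
print(lc_shortest_path(g, "a", "d", {"highway"}))  # 4.0, avoiding toll roads
```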
{"title":"Label-constrained shortest path query processing on road networks","authors":"","doi":"10.1007/s00778-023-00825-w","DOIUrl":"https://doi.org/10.1007/s00778-023-00825-w","url":null,"abstract":"<h3>Abstract</h3> <p>Computing the shortest path between two vertices is a fundamental problem in road networks. Most of the existing works assume that the edges in the road networks have no labels, but in many real applications, the edges have labels and label constraints may be placed on the edges appearing on a valid shortest path. Hence, we study the label-constrained shortest path queries in this paper. In order to process such queries efficiently, we adopt an index-based approach and propose a novel index structure, <span> <span>(textsf{LSD})</span> </span>-<span> <span>(textsf{Index})</span> </span>, based on <em>tree decomposition</em>. With <span> <span>(textsf{LSD})</span> </span>-<span> <span>(textsf{Index})</span> </span>, we design efficient query processing and index construction algorithms with good performance guarantees. Moreover, due to the dynamic properties of real-world networks, we also devise index maintenance algorithms that can maintain the index efficiently. To evaluate the performance of proposed methods, we conduct extensive experimental studies using large real road networks including the whole USA road network. Compared with the state-of-the-art approach, the experimental results demonstrate that our algorithm not only achieves up to two orders of magnitude speedup in query processing time but also consumes much less index space. Meanwhile, the experimental results also show that the index can also be efficiently constructed and maintained for dynamic graphs.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"201 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138691920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RCBench: an RDMA-enabled transaction framework for analyzing concurrency control algorithms
Pub Date: 2023-12-14 | DOI: 10.1007/s00778-023-00821-0
Hongyao Zhao, Jingyao Li, Wei Lu, Qian Zhang, Wanqing Yang, Jiajia Zhong, Meihui Zhang, Haixiang Li, Xiaoyong Du, Anqun Pan
Distributed transaction processing over TCP/IP networks suffers from the weak transaction scalability problem: performance drops significantly as the number of data nodes involved per transaction increases. Although quite a few works targeting high-performance RDMA-capable networks have been proposed, they mainly focus on accelerating distributed transaction processing rather than on solving the weak transaction scalability problem. In this paper, we propose RCBench, an RDMA-enabled transaction framework that serves as a unified evaluation tool for assessing the transaction scalability of various concurrency control algorithms. The usability and advancement of RCBench primarily come from its proposed concurrency control primitives, which facilitate the convenient implementation of RDMA-enabled concurrency control algorithms. Various optimization principles are proposed to ensure that concurrency control algorithms in RCBench fully benefit from the advantages offered by RDMA-capable networks. We conduct extensive experiments to evaluate the scalability of mainstream concurrency control algorithms. The results show that, by exploiting the capabilities of RDMA, concurrency control algorithms in RCBench can obtain a 42× performance improvement, and transaction scalability can be achieved in RCBench.
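The abstract does not list RCBench's primitives, but RDMA-enabled concurrency control typically builds on one-sided atomics such as compare-and-swap on a record's lock word. The Python sketch below illustrates that general pattern with an in-process stand-in for RDMA; FakeRDMA and every name here are toys, not RCBench's interface.

```python
import threading

class FakeRDMA:
    """In-process stand-in for one-sided RDMA verbs. A real framework
    would post an RDMA atomic compare-and-swap to remote memory; here
    a lock-protected dict emulates remote memory for illustration."""
    def __init__(self):
        self.mem, self._mu = {}, threading.Lock()

    def compare_and_swap(self, addr, expected, new):
        with self._mu:
            cur = self.mem.get(addr, 0)
            if cur == expected:
                self.mem[addr] = new
            return cur  # RDMA CAS returns the prior value

def acquire_record_lock(rdma, lock_addr, txn_id, retries=100):
    """Lock primitive in the spirit of RDMA-enabled concurrency control:
    a one-sided CAS on the record's lock word, no remote CPU involved."""
    for _ in range(retries):
        if rdma.compare_and_swap(lock_addr, 0, txn_id) == 0:
            return True  # 0 means unlocked; we installed our txn id
    return False  # caller aborts or backs off

rdma = FakeRDMA()
print(acquire_record_lock(rdma, lock_addr=0x10, txn_id=7))  # True
print(acquire_record_lock(rdma, lock_addr=0x10, txn_id=9))  # False (held)
```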
{"title":"RCBench: an RDMA-enabled transaction framework for analyzing concurrency control algorithms","authors":"Hongyao Zhao, Jingyao Li, Wei Lu, Qian Zhang, Wanqing Yang, Jiajia Zhong, Meihui Zhang, Haixiang Li, Xiaoyong Du, Anqun Pan","doi":"10.1007/s00778-023-00821-0","DOIUrl":"https://doi.org/10.1007/s00778-023-00821-0","url":null,"abstract":"<p>Distributed transaction processing over the TCP/IP network suffers from the <i>weak transaction scalability</i> problem, i.e., its performance drops significantly when the number of involved data nodes per transaction increases. Although quite a few of works over the high-performance RDMA-capable network are proposed, they mainly focus on accelerating distributed transaction processing, rather than solving the weak transaction scalability problem. In this paper, we propose <i>RCBench</i>, an RDMA-enabled transaction framework, which serves as a unified evaluation tool for assessing the transaction scalability of various concurrency control algorithms. The usability and advancement of RCBench primarily come from the proposed concurrency control primitives , which facilitate the convenient implementation of RDMA-enabled concurrency control algorithms. Various optimization principles are proposed to ensure that concurrency control algorithms in RCBench can fully benefit from the advantages offered by RDMA-capable networks. We conduct extensive experiments to evaluate the scalability of mainstream concurrency control algorithms. The results show that by exploiting the capabilities of RDMA, concurrency control algorithms in RCBench can obtain 42X performance improvement, and transaction scalability can be achieved in RCBench.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138691305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Morphtree: a polymorphic main-memory learned index for dynamic workloads
Pub Date: 2023-12-01 | DOI: 10.1007/s00778-023-00823-y
Yongping Luo, Peiquan Jin, Zhaole Chu, Xiaoliang Wang, Yigui Yuan, Zhou Zhang, Yun Luo, Xufei Wu, Peng Zou
Modern database systems rely on indexes to accelerate data access. Recently proposed learned indexes can offer higher search performance with lower space costs than traditional indexes like the B+-tree. We observe that existing main-memory learned indexes are particularly optimized for read-heavy workloads. However, such optimization comes at the cost of model training and of handling out-of-range key insertions, which worsens overall performance. We argue that workloads are not always read-heavy in real applications, and that it is more important and practical to make learned indexes work efficiently for dynamic workloads with changing access patterns and data distributions. In this paper, we aim to improve the practicality of learned indexes by making them adaptive to dynamic workloads. Specifically, we propose a new polymorphic learned index named Morphtree, which can adaptively change its index structure to provide stable, high performance for dynamic workloads. The novelty of Morphtree lies in three aspects: (1) a decoupled tree structure that separates the inner search tree from the data layer consisting of leaf nodes, (2) a read-optimized learned inner tree that improves the performance of index search, and (3) an evolving data layer that automatically transforms node layouts into read-friendly or write-friendly forms according to workload changes. We evaluate these new ideas of Morphtree on various datasets and workloads. Comparative results with six up-to-date learned indexes, including ALEX, PGM-index, FITing-tree, LIPP, FINEdex, and XIndex, show that Morphtree achieves, on average, 0.56x and 3x improvements in lookup and insertion performance, respectively. Moreover, when evaluated on dynamic workloads with changing lookup ratios and data distributions, Morphtree sustains high throughput across different real-world datasets and query patterns, owing to its ability to automatically adjust the index structure according to workload changes.
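To make the evolving data layer concrete, the toy Python sketch below shows one way a leaf could morph between a write-friendly insert buffer and a read-friendly sorted array based on the observed read/write mix; the thresholds, merge policy, and structure are invented for illustration and are not Morphtree's actual design.

```python
import bisect

class AdaptiveLeaf:
    """Toy leaf in the spirit of an evolving data layer: a sorted
    array (read-friendly) plus an unsorted insert buffer
    (write-friendly), re-deciding its layout from the workload mix."""
    def __init__(self):
        self.sorted_keys, self.buffer = [], []
        self.reads = self.writes = 0

    def insert(self, key):
        self.writes += 1
        self.buffer.append(key)          # O(1) append, write-friendly
        self._maybe_morph()

    def lookup(self, key):
        self.reads += 1
        self._maybe_morph()
        i = bisect.bisect_left(self.sorted_keys, key)
        hit = i < len(self.sorted_keys) and self.sorted_keys[i] == key
        return hit or key in self.buffer  # buffer scan: the write-side tax

    def _maybe_morph(self):
        total = self.reads + self.writes
        read_heavy = total >= 16 and self.reads / total > 0.7
        if read_heavy and self.buffer:   # pay one merge, speed up reads
            self.sorted_keys = sorted(self.sorted_keys + self.buffer)
            self.buffer.clear()
            self.reads = self.writes = 0  # restart the workload window

leaf = AdaptiveLeaf()
for k in [5, 1, 9]:
    leaf.insert(k)
print(leaf.lookup(9))  # True, served from buffer or merged array
```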
{"title":"Morphtree: a polymorphic main-memory learned index for dynamic workloads","authors":"Yongping Luo, Peiquan Jin, Zhaole Chu, Xiaoliang Wang, Yigui Yuan, Zhou Zhang, Yun Luo, Xufei Wu, Peng Zou","doi":"10.1007/s00778-023-00823-y","DOIUrl":"https://doi.org/10.1007/s00778-023-00823-y","url":null,"abstract":"<p>Modern database systems rely on indexes to accelerate data access. The recently proposed learned indexes can offer higher search performance with lower space costs than traditional indexes like B+-tree. We observe that existing main-memory learned indexes are particularly optimized for read-heavy workloads. However, such an optimization comes at the cost of model training and handling out-of-range key insertions, which will worsen the overall performance. We argue that workloads are not always read-heavy in real applications, and it is more important and practical to make learned indexes work efficiently for dynamic workloads with changing access patterns and data distributions. In this paper, we aim to improve the practicality of learned indexes by making them adaptive to dynamic workloads. Specifically, we propose a new polymorphic learned index named <i>Morphtree</i>, which can adaptively change the index structure to provide stable and high performance for dynamic workloads. The novelty of Morphtree lies in three aspects: (1) <i>a decoupled tree structure</i> for separating the inner search tree from the data layer consisting of leaf nodes, (2) <i>a read-optimized learned inner tree</i> for improving the performance of index search, and (3) <i>an evolving data layer</i> for automatically transforming node layouts into read friendly or write friendly according to workload changes. We evaluate these new ideas of Morphtree on various datasets and workloads. The comparative results with six up-to-date learned indexes, including ALEX, PGM-index, FITing-tree, LIPP, FINEdex, and XIndex, show that Morphtree can achieve, on average, 0.56x and 3x improvements in lookup and insertion performance, respectively. Moreover, when evaluated on dynamic workloads with changing lookup ratios and data distributions, Morphtree can achieve a sustained high throughput across different real-world datasets and query patterns, owing to its ability to automatically adjust the index structure according to workload changes.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}