Proceedings of the ACM on Management of Data: Latest Articles

One Seed, Two Birds: A Unified Learned Structure for Exact and Approximate Counting

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639270

Yingze Li, Hongzhi Wang, Xianglong Liu

Modern databases have many exact and approximate counting requirements. Nevertheless, a single multidimensional index or cardinality estimator is insufficient to meet the growing demands across all counting scenarios. Such approaches are constrained either by query selectivity or by the trade-off between query accuracy and efficiency. We propose CardIndex, a unified learned structure to solve the above problems. CardIndex serves as a versatile solution that not only functions as a multidimensional learned index for exact counting but also doubles as an adaptive cardinality estimator, catering to varying counting scenarios with diverse requirements for precision and efficiency. Extensive experiments demonstrate its advantages. Compared to the state-of-the-art (SOTA) autoregressive data-driven cardinality estimation baselines, our structure achieves training and updating times that are two orders of magnitude faster. Additionally, our CPU-based query estimation latency is two to three times lower than that of GPU-based baselines. Notably, the estimation accuracy of low-selectivity queries is up to 314 times better than the current SOTA estimator. In terms of indexing tasks, the construction speed of our structure is two orders of magnitude faster than RSMI and 1.9 times faster than R-tree. Furthermore, it exhibits a point query processing speed that is 3%-17% faster than RSMI and 1.07 to 2.75 times faster than R-tree and KDB-tree. Range queries under specific loads are 20% faster than the SOTA indexes.

Citations: 0

PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639268

Xianzhi Zeng, Shuhao Zhang, Hongbin Zhong, Hao Zhang, Mian Lu, Zhao Zheng, Yuqiang Chen

Stream Window Join (SWJ), a vital operation in stream analytics, struggles with achieving a balance between accuracy and latency due to out-of-order data arrivals. Existing methods predominantly rely on adaptive buffering, but often fall short in performance, thereby constraining practical applications. We introduce PECJ, a solution that proactively incorporates unobserved data to enhance accuracy while reducing latency, thus requiring robust predictive modeling of stream oscillation. At the heart of PECJ lies a mathematical formulation of the posterior distribution approximation (PDA) problem using variational inference (VI). This approach circumvents error propagation while meeting the low-latency demands of SWJ. We detail the implementation of PECJ, striking a balance between complexity and generality, and discuss both analytical and learning-based approaches. Experimental evaluations reveal PECJ's superior performance. The successful integration of PECJ into a multi-threaded SWJ benchmark testbed further establishes its practical value, demonstrating promising advancements in enhancing data stream processing capabilities amidst out-of-order data.

Citations: 0

Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSs

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639291

Matthias Jasny, Lasse Thostrup, Sajjad Tamimi, Andreas Koch, Z. István, Carsten Binnig

In this paper, we present a novel communication scheme called zero-sided RDMA, enabling data exchange as a native network service using a programmable switch. In contrast to one- or two-sided RDMA, in zero-sided RDMA, neither the sender nor the receiver is actively involved in data exchange. Zero-sided RDMA thus enables efficient RDMA-based data shuffling between heterogeneous hardware devices in a disaggregated setup without the need to implement a complete RDMA stack on each heterogeneous device or the need for a CPU that is co-located with the accelerator to coordinate the data transfer. As such, we think that zero-sided RDMA is a major building block to make efficient use of heterogeneous accelerators in future cloud DBMSs. In our evaluation, we show that zero-sided RDMA can outperform existing one-sided RDMA-based schemes for accelerator-to-accelerator communication and thus speed up typical distributed database operations such as joins.

Citations: 0

Determining Exact Quantiles with Randomized Summaries

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639280

Ziling Chen, Haoquan Guan, Shaoxu Song, Xiangdong Huang, Chen Wang, Jianmin Wang

Quantiles are fundamental statistics in various data science tasks, but they are costly to compute, e.g., by loading the entire dataset into memory for ranking. With limited memory space, as is common in end devices or heavily loaded databases, the data must be scanned in multiple passes. The idea is to gradually shrink the range of the queried quantile until it is small enough to fit in memory for ranking the result. Existing methods use deterministic sketches to determine the exact range of the quantile, known as a deterministic filter, which can be inefficient at shrinking the range. In this study, we propose to shrink the ranges more aggressively, using randomized summaries such as the KLL sketch. That is, with high probability the quantile lies in a smaller range, namely a probabilistic filter, determined by the randomized sketch. Specifically, we estimate the expected number of passes for determining the exact quantiles with probabilistic filters, and select a probability that minimizes it. Analyses show that our exact quantile determination method can terminate in $P$ passes with $1-\delta$ confidence, storing $O(N^{1/P} \log^{(P-1)/(2P)}(1/\delta))$ items, close to the lower bound $\Omega(N^{1/P})$ for a fixed $\delta$. The approach has been deployed as a function in the LSM-tree based time-series database Apache IoTDB. Remarkably, the randomized sketches can be pre-computed for the immutable SSTables in the LSM-tree. Moreover, multiple quantile queries can share the data passes for probabilistic filters in range estimation. Extensive experiments on real and synthetic datasets demonstrate the superiority of our proposal compared to the existing methods with deterministic filters. On average, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch).
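
The multi-pass range-shrinking idea can be illustrated with a minimal Python sketch. It is not the authors' method: a plain reservoir sample stands in for the KLL sketch, and the names `exact_quantile`, `scan`, `memory`, and `slack` are hypothetical. Each pass counts the items below the current range and samples the items inside it; the sample then proposes a narrower range (the probabilistic filter), and the next pass verifies that the target rank still lies inside it, falling back to the last verified range if the guess missed.

```python
import math
import random


def exact_quantile(scan, rank, memory, slack=3.0, seed=0):
    """Return (value, passes): the value at 0-based `rank` in sorted order.

    `scan()` must yield the same items in the same order on every call; each
    call counts as one pass. At most `memory` items are buffered at a time.
    """
    margin = slack * math.sqrt(memory)   # safety margin, in sample positions
    if 2 * margin + 3 >= memory:
        raise ValueError("memory too small for this slack")
    rng = random.Random(seed)
    safe = (None, None)   # range verified to contain the target (None = open)
    trial = safe          # range tested by the next pass
    passes = 0
    while True:
        lo, hi = trial
        passes += 1
        below = inside = 0
        sample = []       # reservoir sample of the keys inside [lo, hi]
        # Pair each value with its position so that all keys are distinct.
        for key in ((x, i) for i, x in enumerate(scan())):
            if lo is not None and key < lo:
                below += 1
            elif hi is None or key <= hi:
                inside += 1
                if len(sample) < memory:
                    sample.append(key)
                else:
                    j = rng.randrange(inside)
                    if j < memory:
                        sample[j] = key
        if not (below <= rank < below + inside):
            if trial == safe:
                raise IndexError("rank out of range")
            trial = safe  # the probabilistic filter missed: retry
            continue
        safe = trial
        sample.sort()
        if inside <= memory:  # the whole range fits in memory: rank exactly
            return sample[rank - below][0], passes
        # The target sits near this position of the sorted sample.
        pos = (rank - below) / inside * memory
        i_lo, i_hi = math.floor(pos - margin), math.ceil(pos + margin)
        new_lo = sample[i_lo] if i_lo >= 1 else lo
        new_hi = sample[i_hi] if i_hi <= memory - 2 else hi
        trial = (new_lo, new_hi)
```

A call such as `exact_quantile(lambda: iter(data), len(data) // 2, memory=200)` returns the exact median of `data` together with the number of passes used. A larger `slack` makes a wasted pass less likely but shrinks the range more slowly, which is the trade-off the abstract describes as choosing a proper probability.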

Citations: 0

Privacy Amplification by Sampling under User-level Differential Privacy

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639289

Juanru Fang, Ke Yi

Random sampling is an effective tool for reducing the computational costs of query processing in large databases. It has also been used frequently for private data analysis, in particular, under differential privacy (DP). An interesting phenomenon identified in the literature is that sampling can amplify the privacy guarantee of a mechanism, which in turn reduces the scale of the noise that has to be injected. All existing privacy amplification results only hold in the standard, record-level DP model. Recently, user-level differential privacy (user-DP) has gained a lot of attention as it protects all data records contributed by any particular user, thus offering stronger privacy protection. Sampling-based mechanisms under user-DP have not been explored so far, except for naively running the mechanism on a sample without privacy amplification, which results in large DP noise. In fact, sampling is in even more demand under user-DP, since all state-of-the-art user-DP mechanisms have high computational costs due to the complex relationships between users and records. In this paper, we take the first step towards the study of privacy amplification by sampling under user-DP, and give the amplification results for two common user-DP sampling strategies: simple sampling and sample-and-explore. The experimental results show that these sampling-based mechanisms can be a useful tool to obtain quick and reasonably accurate estimates on large private datasets.
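
To make the amplification effect concrete, the sketch below evaluates the classical record-level bound, not the user-level results of this paper: if a mechanism is ε-DP and is run on a Poisson sample that keeps each record independently with probability q, the overall guarantee is ln(1 + q(e^ε − 1)). The function names are illustrative assumptions.

```python
import math


def amplified_epsilon(eps, q):
    """Privacy parameter of an eps-DP mechanism run on a Poisson sample
    with rate q (classical record-level bound)."""
    return math.log1p(q * math.expm1(eps))


def base_epsilon(target_eps, q):
    """Inverse of amplified_epsilon: the (larger) budget the mechanism may
    spend on the sample so that the overall guarantee is target_eps."""
    return math.log1p(math.expm1(target_eps) / q)
```

For small ε the amplified value is close to q·ε, so a 10% sample lets the mechanism run with roughly ten times the budget on the sample, which is why less noise has to be injected.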

Citations: 0

Towards Buffer Management with Tiered Main Memory

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639286

Xiangpeng Hao, Xinjing Zhou, Xiangyao Yu, Michael Stonebraker

The scaling of per-GB DRAM cost has slowed down in recent years. Recent research has suggested that adding remote memory to a system can further reduce the overall memory cost while maintaining good performance. Remote memory (i.e., tiered memory), connected to host servers via high-speed interconnect protocols such as RDMA and CXL, is expected to deliver 100x (less than 1µs) lower latency than SSD and be more cost-effective than local DRAM through pooling or adopting cheaper memory technologies. Tiered memory opens up a large number of potential use cases within database systems, but previous work has only explored limited ways of using it. We present a systematic study of how a DBMS can build tiered-memory buffer management across a wide range of hardware performance characteristics. Specifically, we study five different indexing designs that leverage remote memory in different ways and evaluate them through a wide range of metrics including performance, tiered-memory latency sensitivity, and cost-effectiveness. In addition, we propose a new memory provisioning strategy that allocates an optimal amount of local and remote memory for a given workload. Our evaluations show that while some designs achieve higher performance than others, no design can win in all measured dimensions.

Citations: 0

Discovering Functional Dependencies through Hitting Set Enumeration

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639298

Tobias Bleifuß, Thorsten Papenbrock, Thomas Bläsius, Martin Schirneck, Felix Naumann

Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown, automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances in the past few years, but they still require a significant amount of time and memory to process datasets of practically relevant sizes. We present FDHits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational dataset. FDHits is based on several discovery optimizations that include a hybrid validation approach, effective hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show that FDHits, even without parallel execution, has a median speedup of 8.1 compared to state-of-the-art FD discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on datasets that could not be processed by the current state-of-the-art.
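
The link between FDs and hitting sets can be shown with a brute-force Python sketch. It is not FDHits and has none of its optimizations; the function name `minimal_fds` and the tuple-based row layout are assumptions made for illustration. For a right-hand-side attribute A, every pair of rows that differs on A yields a "difference set" of the other attributes on which the pair also differs, and X → A holds exactly when X intersects (hits) every such set.

```python
from itertools import combinations


def minimal_fds(rows, n_attrs):
    """Return all minimal FDs as (lhs_attribute_indices, rhs_index) pairs."""
    fds = []
    for rhs in range(n_attrs):
        others = [a for a in range(n_attrs) if a != rhs]
        # Difference sets of all row pairs that disagree on the RHS attribute.
        diffs = {
            frozenset(a for a in others if r[a] != s[a])
            for r, s in combinations(rows, 2)
            if r[rhs] != s[rhs]
        }
        found = []  # minimal hitting sets, discovered by increasing size
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                lhs_set = set(lhs)
                if any(set(f) <= lhs_set for f in found):
                    continue  # a subset already works, so lhs is not minimal
                if all(lhs_set & d for d in diffs):
                    found.append(lhs)
        fds.extend((lhs, rhs) for lhs in found)
    return fds
```

On the rows `(1, "a", "x")`, `(1, "b", "y")`, `(2, "a", "x")`, `(2, "b", "y")`, the sketch reports that attribute 2 determines attribute 1 and vice versa, and no FD for attribute 0. The enumeration is exponential in the number of attributes, which is the cost that dedicated hitting set enumeration techniques aim to avoid.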

Citations: 0

FedKNN: Secure Federated k-Nearest Neighbor Search

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639266

Xinyi Zhang, Qichen Wang, Cheng Xu, Yun Peng, Jianliang Xu

Nearest neighbor search is a fundamental task in various domains, such as federated learning, data mining, information retrieval, and biomedicine. With the increasing need to utilize data from different organizations while respecting privacy regulations, private data federation has emerged as a promising solution. However, it is costly to directly apply existing approaches to federated k-nearest neighbor (kNN) search with difficult-to-compute distance functions, like graph or sequence similarity. To address this challenge, we propose FedKNN, a system that supports secure federated kNN search queries with a wide range of similarity measurements. Our system is equipped with a new Distribution-Aware kNN (DANN) algorithm to minimize unnecessary local computations while protecting data privacy. We further develop DANN*, a secure version of DANN that satisfies differential obliviousness. Extensive evaluations show that FedKNN outperforms state-of-the-art solutions, achieving up to 4.8× improvement on federated graph kNN search and up to 2.7× improvement on federated sequence kNN search. Additionally, our approach offers a trade-off between privacy and efficiency, providing strong privacy guarantees with minimal overhead.

Citations: 0

Modeling Shifting Workloads for Learned Database Systems

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639293

Peizhi Wu, Z. Ives

Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization: they learn a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, where query structure changes and/or different regions of the data contribute to query answers. A predictive model may perform poorly on these out-of-distribution inputs. In this paper, we study how a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
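To make the replay-buffer idea concrete, the following is a minimal sketch of one way a fixed-size buffer can stay representative of a query stream while being maintained online. It assumes reservoir sampling (Algorithm R) as the maintenance policy; the abstract does not say which online algorithms the authors use, so the class name `ReservoirReplayBuffer`, its methods, and the tuple-shaped query records are all hypothetical.

```python
import random
from typing import Generic, List, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Fixed-capacity buffer holding a uniform random sample of a stream.

    Hypothetical illustration: reservoir sampling (Algorithm R) stands in
    for the paper's buffer-management algorithms, which the abstract does
    not specify.
    """

    def __init__(self, capacity: int, seed: int = 0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # number of stream elements observed so far
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        """Observe one stream element, keeping it with probability capacity/seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the element.
            self.items.append(item)
            return
        # Draw a slot in [0, seen); the element survives only if the slot
        # falls inside the buffer, in which case it evicts that slot.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, k: int) -> List[T]:
        """Return up to k distinct buffered elements for a training batch."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Usage: stream 10,000 made-up query records through a 100-slot buffer.
buffer = ReservoirReplayBuffer(capacity=100)
for query_id in range(10_000):
    buffer.add(("query", query_id))
batch = buffer.sample(32)
```

Uniform sampling treats old and new queries alike; a buffer meant to track workload shift would likely bias retention toward recent or poorly predicted queries, which is the kind of policy choice the paper studies.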

Citations: 0

STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically

Proceedings of the ACM on Management of Data

Pub Date : 2024-03-12 DOI: 10.1145/3639323

Jin Fang, Yanyan Shen, Yue Wang, Lei Chen

Sparse operators, i.e., operators that take sparse tensors as input, are of great importance in deep learning models. Because sparsity patterns differ widely across sparse tensors, it is challenging to optimize a sparse operator by finding its optimal sparse format, i.e., the one that yields the lowest operator latency. Existing works decompose a sparse tensor into several parts and search for a hybrid of sparse formats to handle diverse sparsity patterns. However, they often trade search space for search time: their search spaces are limited in some cases, which limits the operator efficiency they can achieve. In this paper, we extend the search space in breadth (through flexible sparse tensor transformations) and in depth (through multi-level decomposition). We formally define the multi-level sparse format decomposition problem, which is NP-hard, and propose a framework, STile, to solve it. To search efficiently, STile uses a greedy algorithm guided by a cost model of the latency of computing a sub-task of the original operator after the sparse tensor is decomposed. We run experiments with two common sparse operators, SpMM and SDDMM, on various sparsity patterns, and achieve a 2.1-18.0× speedup over cuSPARSE on SpMM and a 1.5-6.9× speedup over DGL on SDDMM. The search time is less than one hour for any tested sparse operator, a cost that can be amortized.
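For readers unfamiliar with the two operators named above, here is a small reference sketch of what SpMM and SDDMM compute, using the standard definitions: SpMM multiplies a sparse matrix by a dense one, and SDDMM evaluates a dense-dense product only at the stored positions of a sparse matrix and scales each result by the stored value. The `CSR` class and both function names are illustrative assumptions, not STile's API, and the code makes no attempt at the format search or performance tuning the paper is about.

```python
from dataclasses import dataclass
from typing import List

Dense = List[List[float]]


@dataclass
class CSR:
    """Compressed sparse row matrix (illustrative container, not a library type)."""
    n_rows: int
    n_cols: int
    indptr: List[int]    # row i owns entries indptr[i] .. indptr[i + 1] - 1
    indices: List[int]   # column index of each stored entry
    data: List[float]    # value of each stored entry


def spmm(a: CSR, b: Dense) -> Dense:
    """SpMM: dense result of (sparse a) @ (dense b)."""
    n_out = len(b[0])
    out = [[0.0] * n_out for _ in range(a.n_rows)]
    for i in range(a.n_rows):
        for p in range(a.indptr[i], a.indptr[i + 1]):
            j, v = a.indices[p], a.data[p]
            for k in range(n_out):
                out[i][k] += v * b[j][k]
    return out


def sddmm(s: CSR, x: Dense, y: Dense) -> CSR:
    """SDDMM: s * (x @ y) elementwise, computed only at stored entries of s."""
    inner = len(y)
    data = []
    for i in range(s.n_rows):
        for p in range(s.indptr[i], s.indptr[i + 1]):
            j = s.indices[p]
            dot = sum(x[i][k] * y[k][j] for k in range(inner))
            data.append(s.data[p] * dot)
    # The output shares the sparsity pattern of s.
    return CSR(s.n_rows, s.n_cols, list(s.indptr), list(s.indices), data)


# The 2x3 matrix [[1, 0, 2], [0, 0, 3]] stored in CSR form.
a = CSR(2, 3, indptr=[0, 2, 3], indices=[0, 2, 2], data=[1.0, 2.0, 3.0])
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(spmm(a, b))            # [[11.0, 14.0], [15.0, 18.0]]

x = [[1.0, 2.0], [3.0, 4.0]]
y = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
print(sddmm(a, x, y).data)   # [1.0, 6.0, 21.0]
```

CSR is only one storage format; the point of a hybrid-format search is that different regions of the same tensor may run faster under different formats, so a single loop nest like the ones above is rarely the fastest choice.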


Citations: 0