
Latest Publications from Proc. VLDB Endow.

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Pub Date: 2023-04-21 DOI: 10.48550/arXiv.2304.11277
Yanli Zhao, A. Gu, R. Varma, Liangchen Luo, Chien-chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Y. Hao, Shen Li
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
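The non-intrusive user experience the abstract refers to is visible in how little code a basic FSDP setup needs. Below is a minimal, hedged sketch of such a setup; the toy model, optimizer choice, and the assumption of a torchrun-style launch (one GPU per process, LOCAL_RANK set) are ours, not details taken from the paper.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A toy stand-in for a large model; FSDP shards its parameters, gradients,
# and optimizer state across ranks instead of replicating them per GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)  # one-line, non-intrusive wrap with default sharding

# Build the optimizer after wrapping so it references the sharded parameters.
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()  # gradients are reduce-scattered into per-rank shards
optim.step()     # each rank updates only the shard it owns
dist.destroy_process_group()
```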
{"title":"PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel","authors":"Yanli Zhao, A. Gu, R. Varma, Liangchen Luo, Chien-chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Y. Hao, Shen Li","doi":"10.48550/arXiv.2304.11277","DOIUrl":"https://doi.org/10.48550/arXiv.2304.11277","url":null,"abstract":"It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"40 1","pages":"3848-3860"},"PeriodicalIF":0.0,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89610502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Similarity search in the blink of an eye with compressed indices
Pub Date: 2023-04-07 DOI: 10.48550/arXiv.2304.04759
C. Aguerrebere, Ishwar Bhati, Mark Hildebrand, Mariano Tepper, Ted L. Willke
Nowadays, data is represented by vectors. Retrieving the vectors, among millions or billions, that are similar to a given query is a ubiquitous problem known as similarity search, relevant to a wide range of applications. Graph-based indices are currently the best-performing techniques for billion-scale similarity search. However, their random memory-access pattern makes it challenging to realize their full potential. In this work, we present new techniques and systems for creating faster and smaller graph-based indices. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance with fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x memory footprint reduction, and (2) in the high-throughput regime by 5.8x with 1.4x less memory.
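To make the compression idea concrete, here is an illustrative sketch of per-vector scaling followed by scalar quantization, in the spirit of what the abstract describes as LVQ. The 8-bit code width, storage layout, and function names are our assumptions; the paper's actual scheme and its system integration are more involved.

```python
import numpy as np

def lvq_encode(x: np.ndarray, bits: int = 8):
    """Quantize one float vector with its own affine scale (per-vector)."""
    lo, hi = float(x.min()), float(x.max())
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale  # compact codes plus two floats per vector

def lvq_decode(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

x = np.random.default_rng(0).standard_normal(128).astype(np.float32)
codes, lo, scale = lvq_encode(x)
# 4x smaller than float32, with a small per-component reconstruction error:
print(np.abs(x - lvq_decode(codes, lo, scale)).max())
```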
{"title":"Similarity search in the blink of an eye with compressed indices","authors":"C. Aguerrebere, Ishwar Bhati, Mark Hildebrand, Mariano Tepper, Ted L. Willke","doi":"10.48550/arXiv.2304.04759","DOIUrl":"https://doi.org/10.48550/arXiv.2304.04759","url":null,"abstract":"Nowadays, data is represented by vectors. Retrieving those vectors, among millions and billions, that are similar to a given query is a ubiquitous problem, known as similarity search, of relevance for a wide range of applications. Graph-based indices are currently the best performing techniques for billion-scale similarity search. However, their random-access memory pattern presents challenges to realize their full potential. In this work, we present new techniques and systems for creating faster and smaller graph-based indices. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance with fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x memory footprint reduction, and (2) in the high-throughput regime by 5.8x with 1.4x less memory.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"116 2 1","pages":"3433-3446"},"PeriodicalIF":0.0,"publicationDate":"2023-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84224984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fine-Grained Re-Execution for Efficient Batched Commit of Distributed Transactions
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594523
Zhiyuan Dong, Zhaoguo Wang, Xiaodong Zhang, Xian Xu, Changgeng Zhao, Haibo Chen, Aurojit Panda, Jinyang Li
Distributed transaction systems incur extensive cross-node communication to execute and commit serializable OLTP transactions. As a result, their performance greatly suffers. Caching data at nodes that execute transactions can cut down remote reads. Batching transactions for validation and persistence can amortize the communication cost during committing. However, caching and batching can significantly increase the likelihood of conflicts, causing expensive aborts. In this paper, we develop Hackwrench to address the challenge of caching and batching. Instead of aborting conflicted transactions, Hackwrench tries to repair them using fine-grained re-execution by tracking the dependencies of operations among a batch of transactions. Tracked dependencies allow Hackwrench to selectively invalidate and re-execute only those operations necessary to "fix" the conflict, which is cheaper than aborting and executing an entire batch of transactions. Evaluations using TPC-C and other micro-benchmarks show that Hackwrench can outperform existing commercial and research systems including FoundationDB, Calvin, COCO, and Sundial under comparable settings.
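As a rough, hedged illustration of fine-grained re-execution (not Hackwrench's actual protocol or data model), the toy sketch below records which operation reads the output of which earlier operation, and on a conflict replays only the affected subgraph of a batch instead of aborting it wholesale.

```python
from collections import defaultdict

class Batch:
    """Toy batch that tracks read-from dependencies between operations."""

    def __init__(self):
        self.ops = []                        # (name, fn, reads, writes), in order
        self.last_writer = {}                # key -> index of op that last wrote it
        self.downstream = defaultdict(set)   # op index -> dependent op indices
        self.state = {}

    def run(self, name, fn, reads, writes):
        idx = len(self.ops)
        self.ops.append((name, fn, reads, writes))
        for key in reads:                    # idx depends on the writer of each read key
            if key in self.last_writer:
                self.downstream[self.last_writer[key]].add(idx)
        for key in writes:
            self.last_writer[key] = idx
        self.state.update(fn(self.state))

    def repair(self, conflicted):
        """Re-execute only the ops transitively downstream of a conflict."""
        todo, frontier = {conflicted}, [conflicted]
        while frontier:
            for dep in self.downstream[frontier.pop()]:
                if dep not in todo:
                    todo.add(dep)
                    frontier.append(dep)
        for idx in sorted(todo):             # replay affected ops in original order
            _name, fn, _reads, _writes = self.ops[idx]
            self.state.update(fn(self.state))

# e.g. after remote validation reports a stale read in op 0:
b = Batch()
b.run("credit", lambda s: {"x": s.get("x", 100) + 10}, reads=["x"], writes=["x"])
b.run("copy",   lambda s: {"y": s["x"]},               reads=["x"], writes=["y"])
b.repair(0)  # re-runs "credit" and its dependent "copy", not the whole batch
```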
{"title":"Fine-Grained Re-Execution for Efficient Batched Commit of Distributed Transactions","authors":"Zhiyuan Dong, Zhaoguo Wang, Xiaodong Zhang, Xian Xu, Changgeng Zhao, Haibo Chen, Aurojit Panda, Jinyang Li","doi":"10.14778/3594512.3594523","DOIUrl":"https://doi.org/10.14778/3594512.3594523","url":null,"abstract":"Distributed transaction systems incur extensive cross-node communication to execute and commit serializable OLTP transactions. As a result, their performance greatly suffers. Caching data at nodes that execute transactions can cut down remote reads. Batching transactions for validation and persistence can amortize the communication cost during committing. However, caching and batching can significantly increase the likelihood of conflicts, causing expensive aborts.\u0000 \u0000 In this paper, we develop Hackwrench to address the challenge of caching and batching. Instead of aborting conflicted transactions, Hackwrench tries to repair them using\u0000 fine-grained re-execution\u0000 by tracking the dependencies of operations among a batch of transactions. Tracked dependencies allow Hackwrench to selectively invalidate and re-execute only those operations necessary to \"fix\" the conflict, which is cheaper than aborting and executing an entire batch of transactions. Evaluations using TPC-C and other micro-benchmarks show that Hackwrench can outperform existing commercial and research systems including FoundationDB, Calvin, COCO, and Sundial under comparable settings.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"33 1","pages":"1930-1943"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82550975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Benchmarking the Utility of w-event Differential Privacy Mechanisms - When Baselines Become Mighty Competitors
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594515
Christine Schäler, Thomas Hütter, Martin Schäler
The w-event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Following the proposition of w-event differential privacy, various mechanisms to implement the framework have been proposed. Their comparability in empirical studies is vital both for practitioners to choose a suitable mechanism and for researchers to identify current limitations and propose novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and partially intrinsically inconsistent. To this end, we formalize an empirical study of w-event mechanisms by re-occurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected, yet explainable, result is a baseline supremacy, i.e., using one of the two baseline mechanisms is expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers.
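For readers unfamiliar with the setting, a hedged sketch of the simplest w-event mechanism shape helps: split the budget ε uniformly so each timestamp spends ε/w, guaranteeing that any window of w consecutive releases consumes at most ε. The names and parameters below are ours for illustration; the paper's two baseline mechanisms may differ in detail.

```python
import numpy as np

def uniform_w_event(stream, eps=1.0, w=10, rng=np.random.default_rng(0)):
    """Release one noisy count per timestamp, spending eps/w each time."""
    per_ts = eps / w  # any w consecutive releases together spend <= eps
    # Laplace scale = sensitivity / budget; unit-sensitivity counts assumed.
    return [c + rng.laplace(scale=1.0 / per_ts) for c in stream]

print(uniform_w_event([120, 130, 128, 140], eps=1.0, w=10))
```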
{"title":"Benchmarking the Utility of w-event Differential Privacy Mechanisms - When Baselines Become Mighty Competitors","authors":"Christine Schäler, Thomas Hütter, Martin Schäler","doi":"10.14778/3594512.3594515","DOIUrl":"https://doi.org/10.14778/3594512.3594515","url":null,"abstract":"\u0000 The\u0000 w\u0000 -event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Following the proposition of\u0000 w\u0000 -event differential privacy, various mechanisms to implement the framework are proposed. Their comparability in empirical studies is vital for both practitioners to choose a suitable mechanism, and researchers to identify current limitations and propose novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and partially intrinsically inconsistent.\u0000 \u0000 \u0000 To this end, we formalize an empirical study of\u0000 w\u0000 -event mechanisms by re-occurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected - yet explainable - result is a baseline supremacy, i.e., using one of the two baseline mechanisms is expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"2 1","pages":"1830-1842"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78674664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Computing Graph Edit Distance via Neural Graph Matching
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594514
Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, Hongtao Cheng
Graph edit distance (GED) computation is a fundamental NP-hard problem in graph theory. Given a graph pair (G1, G2), GED is defined as the minimum number of primitive operations converting G1 to G2. Due to its NP-hardness, early studies focus on search-based inexact algorithms such as A*-beam search and greedy algorithms using bipartite matching. They can obtain a sub-optimal solution by constructing an edit path (the sequence of operations that converts G1 to G2). Recent studies convert the GED between a given graph pair (G1, G2) into a similarity score in the range (0, 1) by a well-designed function. Then machine learning models (mostly based on graph neural networks) are applied to predict the similarity score. They achieve a much higher numerical precision than the sub-optimal solutions found by classical algorithms. However, a major limitation is that these machine learning models cannot generate an edit path. They treat the GED computation as a pure regression task to bypass its intrinsic complexity, but ignore the essential task of converting G1 to G2. This severely limits the interpretability and usability of the solution. In this paper, we propose a novel deep learning framework that solves the GED problem in a two-step manner: 1) the proposed graph neural network GEDGNN predicts the GED value and a matching matrix; and 2) a post-processing algorithm based on k-best matching derives k possible node matchings from the matching matrix generated by GEDGNN. The best matching finally leads to a high-quality edit path. Extensive experiments on three real graph data sets and synthetic power-law graphs demonstrate the effectiveness of our framework. Compared to the best result of existing GNN-based models, the mean absolute error (MAE) on GED value prediction decreases by 4.9% ~ 74.3%. Compared to the state-of-the-art search algorithm Noah, the MAE on GED value based on edit path reduces by 53.6% ~ 88.1%.
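For contrast with the learned approach, the classical definition can be evaluated exactly on toy graphs; its cost explodes on realistic graphs, which is precisely the motivation for predicting GED instead. This example uses networkx's built-in exact solver, not the paper's GEDGNN:

```python
import networkx as nx

G1 = nx.path_graph(4)   # edges 0-1, 1-2, 2-3
G2 = nx.cycle_graph(4)  # edges 0-1, 1-2, 2-3, 3-0

# Exact GED: one edge insertion (3-0) converts G1 into G2, so this prints 1.0.
print(nx.graph_edit_distance(G1, G2))
```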
{"title":"Computing Graph Edit Distance via Neural Graph Matching","authors":"Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, Hongtao Cheng","doi":"10.14778/3594512.3594514","DOIUrl":"https://doi.org/10.14778/3594512.3594514","url":null,"abstract":"\u0000 Graph edit distance (GED) computation is a fundamental NP-hard problem in graph theory. Given a graph pair (\u0000 G\u0000 1\u0000 ,\u0000 G\u0000 2\u0000 ), GED is defined as the minimum number of primitive operations converting\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 . Early studies focus on search-based inexact algorithms such as A*-beam search, and greedy algorithms using bipartite matching due to its NP-hardness. They can obtain a sub-optimal solution by constructing an edit path (the sequence of operations that converts\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 ). Recent studies convert the GED between a given graph pair (\u0000 G\u0000 1\u0000 ,\u0000 G\u0000 2\u0000 ) into a similarity score in the range (0, 1) by a well designed function. Then machine learning models (mostly based on graph neural networks) are applied to predict the similarity score. They achieve a much higher numerical precision than the sub-optimal solutions found by classical algorithms. However, a major limitation is that these machine learning models cannot generate an edit path. They treat the GED computation as a pure regression task to bypass its intrinsic complexity, but ignore the essential task of converting\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 . This severely limits the interpretability and usability of the solution.\u0000 \u0000 \u0000 In this paper, we propose a novel deep learning framework that solves the GED problem in a two-step manner: 1) The proposed graph neural network GEDGNN is in charge of predicting the GED value and a matching matrix; and 2) A post-processing algorithm based on\u0000 k\u0000 -best matching is used to derive\u0000 k\u0000 possible node matchings from the matching matrix generated by GEDGNN. The best matching will finally lead to a high-quality edit path. Extensive experiments are conducted on three real graph data sets and synthetic power-law graphs to demonstrate the effectiveness of our framework. Compared to the best result of existing GNN-based models, the mean absolute error (MAE) on GED value prediction decreases by 4.9% ~ 74.3%. Compared to the state-of-the-art searching algorithm Noah, the MAE on GED value based on edit path reduces by 53.6% ~ 88.1%.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"21 1","pages":"1817-1829"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73153903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Towards Migration-Free Just-In-Case Data Archival for Future Cloud Data Lakes
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594522
Eugenio Marinelli, Yiqing Yan, V. Magnone, Charlotte Dumargne, P. Barbry, T. Heinis, Raja Appuswamy
Given the growing adoption of AI, cloud data lakes are facing the need to support cost-effective "just-in-case" data archival over long time periods to meet regulatory compliance requirements. Unfortunately, current media technologies suffer from fundamental issues that will soon, if not already, make cost-effective data archival infeasible. In this paper, we present a vision for redesigning the archival tier of cloud data lakes based on a novel, obsolescence-free storage medium: synthetic DNA. In doing so, we make two contributions: (i) we highlight the challenges in using DNA for data archival and list several open research problems, and (ii) we outline OligoArchive-DSM (OA-DSM), an end-to-end DNA storage pipeline that we are developing to demonstrate the feasibility of our vision.
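As a toy illustration of the storage medium's basic premise only (OA-DSM's real pipeline adds constrained encoding, error correction, and indexing, none of which is shown here), two bits can be mapped onto each nucleotide:

```python
BASES = "ACGT"  # 2 bits per nucleotide: 00->A, 01->C, 10->G, 11->T

def encode(data: bytes) -> str:
    """Map each byte to four bases, most significant bits first."""
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

def decode(strand: str) -> bytes:
    two_bit = [BASES.index(c) for c in strand]
    return bytes((a << 6) | (b << 4) | (c << 2) | d
                 for a, b, c, d in zip(*[iter(two_bit)] * 4))

assert decode(encode(b"VLDB")) == b"VLDB"
print(encode(b"VLDB"))  # "CCCGCATACACACAAG"
```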
{"title":"Towards Migration-Free Just-In-Case Data Archival for Future Cloud Data Lakes","authors":"Eugenio Marinelli, Yiqing Yan, V. Magnone, Charlotte Dumargne, P. Barbry, T. Heinis, Raja Appuswamy","doi":"10.14778/3594512.3594522","DOIUrl":"https://doi.org/10.14778/3594512.3594522","url":null,"abstract":"Given the growing adoption of AI, cloud data lakes are facing the need to support cost-effective \"just-in-case\" data archival over long time periods to meet regulatory compliance requirements. Unfortunately, current media technologies suffer from fundamental issues that will soon, if not already, make cost-effective data archival infeasible. In this paper, we present a vision for redesigning the archival tier of cloud data lakes based on a novel, obsolescence-free storage medium-synthetic DNA. In doing so, we make two contributions: (i) we highlight the challenges in using DNA for data archival and list several open research problems, (ii) we outline OligoArchive-DSM (OA-DSM)-an end-to-end DNA storage pipeline that we are developing to demonstrate the feasibility of our vision.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"99 1","pages":"1923-1929"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73844961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594530
John Paparrizos, Kaize Wu, Aaron J. Elmore, C. Faloutsos, M. Franklin
Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) measures prune unnecessary comparisons with elastic distances to accelerate similarity search. Despite decades of attention, there has never been a study to assess the progress in this area. In addition, the research has disproportionately focused on one popular elastic measure, while other accurate measures have received little or no attention. Therefore, there is merit in developing a framework to accumulate knowledge from previously developed LBs and eliminate the notoriously challenging task of designing separate LBs for each elastic measure. In this paper, we perform the first comprehensive study of 11 LBs spanning 5 elastic measures using 128 datasets. We identify four properties that constitute the effectiveness of LBs and propose the Generalized Lower Bounding (GLB) framework to satisfy all desirable properties. GLB creates cache-friendly summaries, adaptively exploits summaries of both query and target time series, and captures boundary distances in an unsupervised manner. GLB outperforms all LBs in speedup (e.g., up to 13.5× faster against the strongest LB in terms of pruning power), establishes new state-of-the-art results for the 5 elastic measures, and provides the first LBs for 2 elastic measures with no known LBs. Overall, GLB enables the effective development of LBs to facilitate fast similarity search.
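The pruning pattern that LB measures enable, and that GLB generalizes, looks roughly like the hedged sketch below: a cheap bound is checked first, and the expensive elastic distance is computed only when the bound cannot rule the candidate out. The bound shown is the classic first/last-point bound for DTW (valid for series of length at least 2), not GLB itself.

```python
import numpy as np

def dtw(x, y):
    """Exact O(nm) dynamic time warping with L1 ground cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def lb_first_last(x, y):
    """Cheap lower bound: any DTW path must align the first and last points."""
    return abs(x[0] - y[0]) + abs(x[-1] - y[-1])

def nn_search(query, dataset):
    best, best_d = None, np.inf
    for i, cand in enumerate(dataset):
        if lb_first_last(query, cand) >= best_d:
            continue  # pruned: true DTW is at least the bound, so it cannot win
        d = dtw(query, cand)
        if d < best_d:
            best, best_d = i, d
    return best, best_d

rng = np.random.default_rng(0)
data = [rng.standard_normal(64) for _ in range(100)]
print(nn_search(rng.standard_normal(64), data))
```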
{"title":"Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances","authors":"John Paparrizos, Kaize Wu, Aaron J. Elmore, C. Faloutsos, M. Franklin","doi":"10.14778/3594512.3594530","DOIUrl":"https://doi.org/10.14778/3594512.3594530","url":null,"abstract":"\u0000 Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) measures prune unnecessary comparisons with elastic distances to accelerate similarity search. Despite decades of attention, there has never been a study to assess the progress in this area. In addition, the research has disproportionately focused on one popular elastic measure, while other accurate measures have received little or no attention. Therefore, there is merit in developing a framework to accumulate knowledge from previously developed LBs and eliminate the notoriously challenging task of designing separate LBs for each elastic measure. In this paper, we perform the first comprehensive study of 11 LBs spanning 5 elastic measures using 128 datasets. We identify four properties that constitute the effectiveness of LBs and propose the Generalized Lower Bounding (GLB) framework to satisfy all desirable properties. GLB creates cache-friendly summaries, adaptively exploits summaries of both query and target time series, and captures boundary distances in an unsupervised manner. GLB outperforms\u0000 all\u0000 LBs in speedup (e.g., up to 13.5× faster against the strongest LB in terms of pruning power), establishes new state-of-the-art results for the 5 elastic measures, and provides the first LBs for 2 elastic measures with no known LBs. Overall, GLB enables the effective development of LBs to facilitate fast similarity search.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"26 1","pages":"2019-2032"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75057549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Efficient framework for operating on data sketches
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594526
Jakub Lemiesz
We study the problem of analyzing massive data streams based on concise data sketches. Recently, a number of papers have investigated how to estimate the results of set-theory operations based on sketches. In this paper we present a framework that allows estimating the result of any sequence of set-theory operations. The starting point for our solution is the solution from 2021. Compared to this solution, the newly presented sketching algorithm is much more computationally efficient, as it requires on average O(log n) rather than O(n) comparisons for n stream elements. We also show that the estimator dedicated to sketches proposed in that reference solution is, in fact, a maximum likelihood estimator.
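A concrete, hedged example of a sketch that composes under set operations is the classical k-minimum-values (KMV) summary, shown below; unions of sketches merge exactly, and the distinct count is estimated as (k-1)/h_(k). This illustrates the general idea only and is not the paper's algorithm.

```python
import hashlib

K = 256  # sketch size; larger K gives lower estimation error

def h(item) -> float:
    """Hash an item to a pseudo-uniform value in (0, 1]."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def kmv(items):
    return sorted({h(x) for x in items})[:K]  # the K smallest hash values

def union(s1, s2):
    return sorted(set(s1) | set(s2))[:K]  # exact: equals the KMV of the union

def estimate(sketch):
    # With a full sketch, the K-th smallest hash value pins down the density.
    return (K - 1) / sketch[-1] if len(sketch) == K else float(len(sketch))

a = kmv(range(0, 60_000))
b = kmv(range(40_000, 100_000))
print(round(estimate(union(a, b))))  # close to the true 100,000 distinct items
```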
{"title":"Efficient framework for operating on data sketches","authors":"Jakub Lemiesz","doi":"10.14778/3594512.3594526","DOIUrl":"https://doi.org/10.14778/3594512.3594526","url":null,"abstract":"We study the problem of analyzing massive data streams based on concise data sketches. Recently, a number of papers have investigated how to estimate the results of set-theory operations based on sketches. In this paper we present a framework that allows to estimate the result of any sequence of set-theory operations.\u0000 \u0000 The starting point for our solution is the solution from 2021. Compared to this solution, the newly presented sketching algorithm is much more computationally efficient as it requires on average\u0000 O\u0000 (log\u0000 n\u0000 ) rather than\u0000 O\u0000 (\u0000 n\u0000 ) comparisons for\u0000 n\u0000 stream elements. We also show that the estimator dedicated to sketches proposed in that reference solution is, in fact, a maximum likelihood estimator.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"18 1","pages":"1967-1978"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79484598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Learned Index: A Comprehensive Experimental Evaluation
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594528
Zhaoyan Sun, Xuanhe Zhou, Guoliang Li
Indexes can improve query-processing performance by avoiding full table scans. Although traditional indexes (e.g., the B+-tree) have been widely used, learned indexes have been proposed, which adopt machine learning models to reduce query latency and index size. However, existing learned indexes are (1) not thoroughly evaluated under the same experimental framework and (2) not comprehensively compared across different settings (e.g., key lookup, key insert, concurrent operations, bulk loading). Moreover, it is hard for practitioners to select appropriate learned indexes in different settings. To address these problems, this paper reviews existing learned indexes in detail and discusses the design choices of their key components, including key lookup (position inference, which predicts the position of a key, and position refinement, which re-searches the position if the predicted one is incorrect), key insert, concurrency, and bulk loading. Moreover, we provide a testbed to facilitate the design and testing of new learned indexes by researchers. We compare state-of-the-art learned indexes in the same experimental framework and provide findings for selecting suitable learned indexes under various practical scenarios.
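The two lookup stages discussed above can be sketched with a single linear model standing in for the learned structures the survey evaluates; the model family, error-bound handling, and synthetic data are our assumptions for illustration only.

```python
import bisect

import numpy as np

keys = np.sort(np.random.default_rng(0).integers(0, 10**9, 1_000_000))
positions = np.arange(len(keys))

# "Train": one linear model mapping key -> position over the sorted array.
slope, intercept = np.polyfit(keys, positions, 1)
# Worst-case prediction error, used as the refinement search radius.
err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

def lookup(key):
    guess = int(slope * key + intercept)  # stage 1: position inference
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)  # stage 2: refinement
    return i if i < len(keys) and keys[i] == key else None

print(lookup(int(keys[123_456])))  # recovers the key's slot in the array
```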
{"title":"Learned Index: A Comprehensive Experimental Evaluation","authors":"Zhaoyan Sun, Xuanhe Zhou, Guoliang Li","doi":"10.14778/3594512.3594528","DOIUrl":"https://doi.org/10.14778/3594512.3594528","url":null,"abstract":"Indexes can improve query-processing performance by avoiding full table scans. Although traditional indexes (e.g., B+-tree) have been widely used, learned indexes are proposed to adopt machine learning models to reduce the query latency and index size. However, existing learned indexes are (1) not thoroughly evaluated under the same experimental framework and are (2) not comprehensively compared with different settings (e.g., key lookup, key insert, concurrent operations, bulk loading). Moreover, it is hard to select appropriate learned indexes for practitioners in different settings. To address those problems, this paper detailedly reviews existing learned indexes and discusses the design choices of key components in learned indexes, including key lookup (position inference which predicts the position of a key, and position refinement which re-searches the position if the predicted position is incorrect), key insert, concurrency, and bulk loading. Moreover, we provide a testbed to facilitate the design and test of new learned indexes for researchers. We compare state-of-the-art learned indexes in the same experimental framework, and provide findings to select suitable learned indexes under various practical scenarios.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"94 1","pages":"1992-2004"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77554100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Longshot: Indexing Growing Databases using MPC and Differential Privacy
Pub Date: 2023-04-01 DOI: 10.14778/3594512.3594529
Yanping Zhang, Johes Bater, Kartik Nayak, Ashwin Machanavajjhala
In this work, we propose Longshot, a novel design for secure outsourced database systems that supports ad-hoc queries through the use of secure multi-party computation and differential privacy. By combining these two techniques, we build and maintain data structures (i.e., synopses, indexes, and stores) that improve query execution efficiency while maintaining strong privacy and security guarantees. As new data records are uploaded by data owners, these data structures are continually updated by Longshot using novel algorithms that leverage bounded information leakage to minimize the use of expensive cryptographic protocols. Furthermore, Longshot organizes the data structures as a hierarchical tree based on when the update occurred, allowing for update strategies that provide logarithmic error over time. Through this approach, Longshot introduces a tunable three-way trade-off between privacy, accuracy, and efficiency. Our experimental results confirm that our optimizations are not only asymptotic improvements but also observable in practice. In particular, we see a 5x efficiency improvement when updating our data structures even when the number of updates is less than 200. Moreover, the data structures significantly improve query runtimes over time, about 10^3x faster compared to the baseline after 20 updates.
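The hierarchical tree with logarithmic error over time evokes the classical binary-tree mechanism for private continual counting, sketched below in plaintext form as a hedged illustration; Longshot's actual design runs under MPC and its update strategies differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def tree_counter(stream, eps=1.0):
    """Noisy prefix sums with O(log T) error via dyadic (tree) nodes."""
    T = len(stream)
    L = int(np.ceil(np.log2(max(T, 2)))) + 1  # number of tree levels
    noisy = {}   # (level, index) -> noisy sum over one dyadic block
    answers = []
    for t in range(1, T + 1):
        for lvl in range(L):  # close every dyadic block that ends at t
            if t % (1 << lvl) == 0:
                lo = t - (1 << lvl)
                noisy[(lvl, t >> lvl)] = sum(stream[lo:t]) + rng.laplace(scale=L / eps)
        total, rem = 0.0, t   # prefix [1..t] decomposes into <= L blocks
        while rem > 0:
            lvl = (rem & -rem).bit_length() - 1
            total += noisy[(lvl, rem >> lvl)]
            rem -= 1 << lvl
        answers.append(total)
    return answers

print(tree_counter([1, 0, 2, 1], eps=1.0))
```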
{"title":"Longshot: Indexing Growing Databases using MPC and Differential Privacy","authors":"Yanping Zhang, Johes Bater, Kartik Nayak, Ashwin Machanavajjhala","doi":"10.14778/3594512.3594529","DOIUrl":"https://doi.org/10.14778/3594512.3594529","url":null,"abstract":"\u0000 In this work, we propose Longshot, a novel design for secure outsourced database systems that supports ad-hoc queries through the use of secure multi-party computation and differential privacy. By combining these two techniques, we build and maintain data structures (i.e., synopses, indexes, and stores) that improve query execution efficiency while maintaining strong privacy and security guarantees. As new data records are uploaded by data owners, these data structures are continually updated by Longshot using novel algorithms that leverage bounded information leakage to minimize the use of expensive cryptographic protocols. Furthermore, Long-shot organizes the data structures as a hierarchical tree based on when the update occurred, allowing for update strategies that provide logarithmic error over time. Through this approach, Longshot introduces a tunable three-way trade-off between privacy, accuracy, and efficiency. Our experimental results confirm that our optimizations are not only asymptotic improvements but also observable in practice. In particular, we see a 5x efficiency improvement to update our data structures even when the number of updates is less than 200. Moreover, the data structures significantly improve query runtimes over time, about ~10\u0000 3\u0000 x faster compared to the baseline after 20 updates.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"51 1","pages":"2005-2018"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85094874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0