
Latest Publications from Proc. VLDB Endow.

Can Large Language Models Predict Data Correlations from Column Names?
Pub Date: 2023-09-01, DOI: 10.14778/3625054.3625066
Immanuel Trummer
Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
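As a rough illustration of the kind of experiment the abstract describes, the sketch below embeds column-name pairs with a generic pre-trained text encoder and uses embedding similarity as a correlation-likelihood score. The encoder choice (sentence-transformers, all-MiniLM-L6-v2), the toy column pairs, and the similarity-based scoring rule are assumptions made for illustration; the paper does not prescribe them.

```python
# Hypothetical setup: score column-name pairs with a pre-trained text encoder
# and check how well the scores separate correlated from uncorrelated pairs.
# Model choice, data, and scoring rule are illustrative, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder would do

# (column_name_a, column_name_b, is_correlated) -- toy stand-in for the benchmark
pairs = [
    ("order_total_usd", "order_total_eur", 1),
    ("passenger_age", "ticket_class", 0),
    ("height_cm", "height_in", 1),
    ("user_id", "signup_country", 0),
]

emb_a = encoder.encode([a for a, _, _ in pairs])
emb_b = encoder.encode([b for _, b, _ in pairs])
labels = np.array([y for _, _, y in pairs])

# Cosine similarity of the two name embeddings, used as the correlation score.
scores = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
)

print("AUC of name similarity as a correlation predictor:",
      roc_auc_score(labels, scores))
```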
Citations: 0
A Tutorial on Visual Representations of Relational Queries
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611578
Wolfgang Gatterbauer
Query formulation is increasingly performed by systems that need to guess a user's intent (e.g., via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries. This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state of the art of relationally complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping the visual alphabets they use to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).
Citations: 0
JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Pub Date: 2023-07-01, DOI: 10.48550/arXiv.2307.00422
Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized into a single table, materialized, and exported. This process does not scale, is slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer tree training performance competitive with specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support RMSE, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
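The split-finding step can be made concrete with a small DuckDB sketch in the spirit described above: residuals live in an extra column, and the statistics needed to rank candidate splits are plain SUM/COUNT aggregates computed in a single SQL query. The table and the single-feature split search are invented for illustration; this is not the JoinBoost library or its factorized algorithm.

```python
# Illustration only: variance-reduction split search expressed as plain SQL
# aggregates over a residual column. Table and column names are made up;
# this is not the JoinBoost API or its factorized join handling.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE train AS
    SELECT x, 2.0 * x + 1.0 + 10.0 * (random() - 0.5) AS y
    FROM range(1000) t(x)
""")

# Keep residuals against the current model (here: the global mean) as a column.
con.execute("ALTER TABLE train ADD COLUMN resid DOUBLE")
con.execute("UPDATE train SET resid = y - (SELECT AVG(y) FROM train)")

# Semi-ring style statistics (counts and sums of residuals) per candidate
# threshold; ranking by S_L^2/n_L + S_R^2/n_R orders splits by variance reduction.
best = con.execute("""
    WITH stats AS (
        SELECT t.x AS threshold,
               SUM(CASE WHEN s.x <= t.x THEN 1 ELSE 0 END)       AS n_l,
               SUM(CASE WHEN s.x <= t.x THEN s.resid ELSE 0 END) AS s_l,
               SUM(CASE WHEN s.x >  t.x THEN 1 ELSE 0 END)       AS n_r,
               SUM(CASE WHEN s.x >  t.x THEN s.resid ELSE 0 END) AS s_r
        FROM (SELECT DISTINCT x FROM train) t, train s
        GROUP BY t.x
    )
    SELECT threshold, s_l * s_l / n_l + s_r * s_r / n_r AS gain
    FROM stats
    WHERE n_l > 0 AND n_r > 0
    ORDER BY gain DESC
    LIMIT 1
""").fetchone()
print("best split threshold and gain:", best)
```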
Citations: 3
Accelerating Aggregation Queries on Unstructured Streams of Data
Pub Date: 2023-07-01, DOI: 10.14778/3611479.3611496
Matthew Russo, Tatsunori B. Hashimoto, Daniel Kang, Yi Sun, M. Zaharia
Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams. In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.
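A toy version of the proxy-plus-oracle idea reads as follows: a cheap proxy decides how likely each stream element is to be sampled, scarce oracle calls are spent on the sampled elements, and an inverse-probability (Horvitz-Thompson) correction keeps the aggregate estimate unbiased. The proxy, oracle, stream, and sampling-rate constants below are placeholders; InQuest's stratification and dynamic budget allocation are not modeled.

```python
# Toy illustration of proxy-guided sampling for a streaming mean; not the
# InQuest algorithm itself (no stratification, no dynamic budget allocation).
import random

random.seed(0)

def oracle(item):
    """Stand-in for an expensive, accurate model (e.g., a large DNN)."""
    return item["true_count"]

def proxy(item):
    """Stand-in for a cheap, noisy approximation of the oracle."""
    return max(0.0, item["true_count"] + random.gauss(0.0, 1.0))

stream = [{"true_count": random.choice([0, 0, 0, 1, 2, 5])} for _ in range(10_000)]

base_rate = 0.05                      # rough target fraction of oracle calls
ht_sum, n_seen, n_oracle = 0.0, 0, 0

for item in stream:
    n_seen += 1
    # Spend oracle calls preferentially where the proxy sees activity, but keep
    # a floor so every element has a nonzero inclusion probability.
    p = min(1.0, base_rate * (1.0 + proxy(item)))
    if random.random() < p:
        n_oracle += 1
        ht_sum += oracle(item) / p    # Horvitz-Thompson (unbiased) weighting

print("estimated mean :", ht_sum / n_seen)
print("true mean      :", sum(oracle(x) for x in stream) / len(stream))
print("oracle calls   :", n_oracle, "of", n_seen)
```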
Citations: 0
Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples
Pub Date: 2023-07-01, DOI: 10.48550/arXiv.2307.14565
Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri
Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums. We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
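For intuition, the snippet below shows, written by hand in pandas, one of the restructuring transformations such a system has to synthesize automatically: unpivoting a "wide" spreadsheet with one column per month into a relational table. The data and column names are invented; Auto-Tables' actual operator set and synthesis algorithm are described in the paper.

```python
# A hand-written example of the kind of transformation the system synthesizes
# automatically: unpivoting a wide, non-relational spreadsheet layout.
# The data and column names are made up for illustration.
import pandas as pd

wide = pd.DataFrame({
    "store":   ["Berlin", "Paris"],
    "2023-01": [120, 95],
    "2023-02": [130, 101],
    "2023-03": [128, 99],
})

# Step 1: melt the per-month columns into (month, sales) rows.
relational = wide.melt(id_vars="store", var_name="month", value_name="sales")
# Step 2: normalize the month column to a proper date type.
relational["month"] = pd.to_datetime(relational["month"])

print(relational.sort_values(["store", "month"]).to_string(index=False))
# The result has one row per (store, month) and can be queried with SQL directly.
```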
Citations: 1
Saibot: A Differentially Private Data Search Platform
Pub Date: 2023-07-01, DOI: 10.48550/arXiv.2307.00432
Zezhou Huang, Jiaxiang Liu, Daniel Alabi, R. Fernandez, Eugene Wu
Recent data search platforms use ML task-based utility measures rather than metadata-based keywords to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50--90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
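The "privatize sufficient statistics once, reuse them freely" idea can be sketched generically: for linear regression the statistics are X^T X and X^T y, and adding noise to them once allows any number of subsequent model fits without further privacy cost. The snippet below is a simplified stand-in with placeholder noise scale and clipping bounds; it does not reproduce FPM's calibration, its factorization over joins, or the unbiased many-to-many-join estimator.

```python
# Generic sufficient-statistics perturbation for linear regression; this is a
# simplified stand-in for FPM, not its actual mechanism or noise calibration.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset; features and target are clipped to known bounds, which is what
# keeps the sensitivity of the statistics finite in a real DP analysis.
n, d = 5_000, 3
X = np.clip(rng.normal(size=(n, d)), -1.0, 1.0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
y = np.clip(y, -4.0, 4.0)

# Sufficient statistics for least squares: X^T X and X^T y.
xtx = X.T @ X
xty = X.T @ y

# Privatize the statistics once (placeholder Gaussian noise scale); they can
# then be reused for any number of model fits or augmentation searches.
noise_scale = 5.0
xtx_priv = xtx + rng.normal(scale=noise_scale, size=xtx.shape)
xty_priv = xty + rng.normal(scale=noise_scale, size=xty.shape)
xtx_priv = (xtx_priv + xtx_priv.T) / 2.0      # keep the noisy matrix symmetric

# A small ridge term keeps the noisy linear system well-conditioned.
theta = np.linalg.solve(xtx_priv + 1e-2 * n * np.eye(d), xty_priv)
print("coefficients recovered from privatized statistics:", np.round(theta, 3))
```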
Citations: 1
ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning
Pub Date: 2023-07-01, DOI: 10.48550/arXiv.2307.16540
Junxiong Wang, Immanuel Trummer, A. Kara, Dan Olteanu
The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run-time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.
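The episode structure can be pictured with a small bandit-style loop: each episode runs one attribute order, its observed processing rate is the reward, and the selection rule trades off exploring untried orders against exploiting fast ones. The UCB1 rule, the synthetic cost model, and the reward scaling below are arbitrary illustrative choices, not ADOPT's actual learning component or join operator.

```python
# Toy episode loop: pick an attribute order per episode with UCB1, using the
# observed per-episode processing rate as the reward. Purely illustrative;
# ADOPT's operator, reward signal, and shared state are not modeled here.
import itertools
import math
import random

random.seed(1)
attributes = ["a", "b", "c"]
orders = list(itertools.permutations(attributes))

# Hypothetical cost model: tuples processed per episode under a given order.
hidden_rate = {o: random.uniform(1_000, 20_000) for o in orders}

def run_episode(order):
    return hidden_rate[order] * random.uniform(0.8, 1.2)  # noisy feedback

counts = {o: 0 for o in orders}
totals = {o: 0.0 for o in orders}

for episode in range(1, 201):
    # UCB1: exploit high observed rates, but keep exploring rarely-tried orders.
    def ucb(o):
        if counts[o] == 0:
            return float("inf")
        mean = totals[o] / counts[o]
        return mean + 5_000 * math.sqrt(2 * math.log(episode) / counts[o])

    chosen = max(orders, key=ucb)
    reward = run_episode(chosen)
    counts[chosen] += 1
    totals[chosen] += reward

best = max(orders, key=lambda o: totals[o] / counts[o] if counts[o] else 0.0)
print("converged attribute order:", best, "tried", counts[best], "times")
```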
Citations: 1
Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph
Pub Date: 2023-06-01, DOI: 10.48550/arXiv.2306.12515
Yiming Lin, Yeye He, S. Chaudhuri
Business Intelligence (BI) is crucial in modern enterprises and a billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less technical end-users to build BI-models themselves. We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called k-Min-Cost-Arborescence (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called arborescence. While we prove k-MCA is intractable and inapproximable in general, we develop novel algorithms that can solve k-MCA optimally, which are shown to be efficient in practice with sub-second latency and scale to the largest BI-models we encounter (with close to 100 tables). Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.
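The graph view behind the formulation can be sketched as follows: tables become nodes, predicted joins become weighted directed edges (weight = cost of accepting the join), and a minimum-cost arborescence picks the cheapest set of joins that connects every table into a snowflake-shaped model. The single-root networkx routine below (Edmonds' algorithm) is only a stand-in for the paper's k-MCA algorithms, which handle multiple roots and further constraints; the tables and scores are invented.

```python
# Illustrative only: a single-root minimum-cost arborescence over a predicted
# join graph, as a stand-in for the k-MCA formulation (which generalizes this).
import networkx as nx

# Edge weight = cost of accepting a predicted join (e.g., 1 - join confidence).
predicted_joins = [
    ("sales", "customers", 0.05),
    ("sales", "products",  0.10),
    ("sales", "stores",    0.20),
    ("products", "categories", 0.15),
    ("customers", "categories", 0.60),   # a spurious, low-confidence candidate
]

g = nx.DiGraph()
for src, dst, cost in predicted_joins:
    g.add_edge(src, dst, weight=cost)

# Edmonds' algorithm: the cheapest set of join edges that reaches every table
# exactly once from a single root, i.e., a snowflake-shaped BI model.
model = nx.minimum_spanning_arborescence(g)
for src, dst, data in model.edges(data=True):
    print(f"join {src} -> {dst} (cost {data['weight']:.2f})")
```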
Citations: 3
Adaptive Indexing of Objects with Spatial Extent
Pub Date: 2023-05-01, DOI: 10.14778/3598581.3598596
Fatemeh Zardbani, N. Mamoulis, Stratos Idreos, Panagiotis Karras
Can we quickly explore large multidimensional data in main memory? Adaptive indexing responds to this need by building an index incrementally, in response to queries; in its default form, it indexes a single attribute or, in the presence of several attributes, one attribute per index level. Unfortunately, this approach falters when indexing spatial data objects, encountered in data exploration tasks involving multidimensional range queries. In this paper, we introduce the Adaptive Incremental R-tree (AIR-tree): the first method for the adaptive indexing of non-point spatial objects; the AIR-tree incrementally and progressively constructs an in-memory spatial index over a static array, in response to incoming queries, using a suite of heuristics for creating and splitting nodes. Our thorough experimental study on synthetic and real data and workloads shows that the AIR-tree consistently outperforms prior adaptive indexing methods focusing on multidimensional points and a pre-built static R-tree in cumulative time over at least the first thousand queries.
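To make the "index only what queries touch" idea concrete, here is a drastically simplified one-dimensional analogue in the style of database cracking: objects are intervals, partitions carry a bounding interval for pruning, and every range query refines the partitions it touches. This is not the AIR-tree (no R-tree nodes, no creation or split heuristics, one dimension only); it only illustrates query-driven refinement over objects with extent.

```python
# A 1-D toy of query-driven ("cracking"-style) refinement over objects with
# extent; a rough analogue of the adaptive idea above, not the AIR-tree itself.
import random

random.seed(7)

# Each object is an interval (lo, hi) on one dimension; the array is static.
objects = [
    (x, x + random.uniform(0.1, 5.0))
    for x in (random.uniform(0.0, 100.0) for _ in range(10_000))
]

def bounds(objs):
    return (min(o[0] for o in objs), max(o[1] for o in objs))

# A partition = (objects, bounding interval). We start with one big partition.
partitions = [(list(objects), bounds(objects))]

def range_query(qlo, qhi):
    """Return intervals intersecting [qlo, qhi]; crack touched partitions as a side effect."""
    global partitions
    result, new_parts = [], []
    for objs, (blo, bhi) in partitions:
        if bhi < qlo or blo > qhi:            # bounding interval misses the query
            new_parts.append((objs, (blo, bhi)))
            continue
        result.extend(o for o in objs if o[1] >= qlo and o[0] <= qhi)
        # Refine the touched partition: split its objects on the query's low bound.
        left = [o for o in objs if o[0] < qlo]
        right = [o for o in objs if o[0] >= qlo]
        for part in (left, right):
            if part:
                new_parts.append((part, bounds(part)))
    partitions = new_parts
    return result

print(len(range_query(20, 25)), "matches; partitions now:", len(partitions))
print(len(range_query(60, 61)), "matches; partitions now:", len(partitions))
```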
Citations: 1
Cracking-Like Join for Trusted Execution Environments
Pub Date: 2023-05-01, DOI: 10.14778/3598581.3598602
K. Maliszewski, Jorge-Arnulfo Quiané-Ruiz, V. Markl
Data processing on non-trusted infrastructures, such as the public cloud, has become increasingly popular, despite posing risks to data privacy. However, the existing cloud DBMSs either lack sufficient privacy guarantees or underperform. In this paper, we address both challenges (privacy and efficiency) by proposing CrkJoin, a join algorithm that leverages Trusted Execution Environments (TEEs). We adapted CrkJoin to the architecture of TEEs to achieve significant improvements in latency of up to three orders of magnitude over baselines in a multi-tenant scenario. Moreover, CrkJoin offers at least 2.9x higher throughput than the state-of-the-art algorithms. Our research is unique in that it focuses on both privacy and efficiency concerns, which have not been adequately addressed in previous studies. Our findings suggest that CrkJoin makes joining in TEEs practical, and it lays a foundation towards a truly private and efficient cloud DBMS.
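For orientation, the sketch below shows only the generic partition-then-join skeleton that a cracking-like join builds on: both inputs are hash-partitioned so that each co-partition pair can be joined independently in a small buffer. Everything TEE-specific in CrkJoin (enclave memory limits, paging behavior, oblivious access considerations) is deliberately absent, and the data is made up.

```python
# Generic radix-partition-then-join sketch; illustrates only the high-level
# partition-wise structure, none of the TEE-specific machinery of CrkJoin.
import random

random.seed(3)
NUM_PARTITIONS = 8

def partition(rows, key_idx):
    """Split rows into hash partitions so each partition fits in a small buffer."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[hash(row[key_idx]) % NUM_PARTITIONS].append(row)
    return parts

def partitioned_join(r, s):
    """Join r and s on their first column, one co-partition pair at a time."""
    out = []
    for r_part, s_part in zip(partition(r, 0), partition(s, 0)):
        lookup = {}
        for key, payload in r_part:                  # build side
            lookup.setdefault(key, []).append(payload)
        for key, payload in s_part:                  # probe side
            for r_payload in lookup.get(key, []):
                out.append((key, r_payload, payload))
    return out

r = [(random.randrange(1_000), f"r{i}") for i in range(5_000)]
s = [(random.randrange(1_000), f"s{i}") for i in range(5_000)]
print("join produced", len(partitioned_join(r, s)), "rows")
```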
Citations: 0