Pub Date : 2023-09-01DOI: 10.14778/3625054.3625066
Immanuel Trummer
Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
{"title":"Can Large Language Models Predict Data Correlations from Column Names?","authors":"Immanuel Trummer","doi":"10.14778/3625054.3625066","DOIUrl":"https://doi.org/10.14778/3625054.3625066","url":null,"abstract":"Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"42 1","pages":"4310-4323"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139343991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01DOI: 10.14778/3611540.3611578
Wolfgang Gatterbauer
Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries , whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries. This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state-of-the art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping their used visual alphabets to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).
{"title":"A Tutorial on Visual Representations of Relational Queries","authors":"Wolfgang Gatterbauer","doi":"10.14778/3611540.3611578","DOIUrl":"https://doi.org/10.14778/3611540.3611578","url":null,"abstract":"\u0000 Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the \"right\" query? More generally, given that relational queries can become pretty complicated,\u0000 how can we help users understand existing relational queries\u0000 , whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries.\u0000 \u0000 \u0000 This lecture-style tutorial surveys the key\u0000 visual metaphors developed for visual representations of relational expressions.\u0000 We will survey the history and state-of-the art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping their used visual alphabets to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"39 1","pages":"3890-3893"},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76468076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-01DOI: 10.48550/arXiv.2307.00422
Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving , the key property of variance semi-ring to support rmse the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
{"title":"JoinBoost: Grow Trees Over Normalized Data Using Only SQL","authors":"Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu","doi":"10.48550/arXiv.2307.00422","DOIUrl":"https://doi.org/10.48550/arXiv.2307.00422","url":null,"abstract":"Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?\u0000 \u0000 We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the\u0000 Y\u0000 variable to the residual in the\u0000 non-materialized join result.\u0000 Although this view update problem is generally ambiguous, we identify\u0000 addition-to-multiplication preserving\u0000 , the key property of variance semi-ring to support\u0000 rmse\u0000 the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"34 1","pages":"3071-3084"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80451406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-01DOI: 10.14778/3611479.3611496
Matthew Russo, Tatsunori B. Hashimoto, Daniel Kang, Yi Sun, M. Zaharia
Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams. In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real-time. We theoretically analyzed InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluated our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.
{"title":"Accelerating Aggregation Queries on Unstructured Streams of Data","authors":"Matthew Russo, Tatsunori B. Hashimoto, Daniel Kang, Yi Sun, M. Zaharia","doi":"10.14778/3611479.3611496","DOIUrl":"https://doi.org/10.14778/3611479.3611496","url":null,"abstract":"Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams.\u0000 In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models (\"proxies\") and sampling techniques to limit the execution of an expensive high-precision model (an \"oracle\") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real-time. We theoretically analyzed InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluated our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"46 1","pages":"2897-2910"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84249854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-01DOI: 10.48550/arXiv.2307.14565
Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chauduri
Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums. We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
{"title":"Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples","authors":"Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chauduri","doi":"10.48550/arXiv.2307.14565","DOIUrl":"https://doi.org/10.48550/arXiv.2307.14565","url":null,"abstract":"Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables \"in the wild\". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums.\u0000 We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"27 1","pages":"3391-3403"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81896981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-01DOI: 10.48550/arXiv.2307.00432
Zezhou Huang, Jiaxiang Liu, Daniel Alabi, R. Fernandez, Eugene Wu
Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations ---join or union-compatible datasets---that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot , a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50--90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
{"title":"Saibot: A Differentially Private Data Search Platform","authors":"Zezhou Huang, Jiaxiang Liu, Daniel Alabi, R. Fernandez, Eugene Wu","doi":"10.48550/arXiv.2307.00432","DOIUrl":"https://doi.org/10.48550/arXiv.2307.00432","url":null,"abstract":"\u0000 Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for\u0000 augmentations\u0000 ---join or union-compatible datasets---that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets.\u0000 \u0000 \u0000 We present\u0000 Saibot\u0000 , a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that\u0000 Saibot\u0000 can return augmentations that achieve model accuracy within 50--90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"27 1","pages":"3057-3070"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87844403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-01DOI: 10.48550/arXiv.2307.16540
Junxiong Wang, Immanuel Trummer, A. Kara, Dan Olteanu
The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.
{"title":"ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning","authors":"Junxiong Wang, Immanuel Trummer, A. Kara, Dan Olteanu","doi":"10.48550/arXiv.2307.16540","DOIUrl":"https://doi.org/10.48550/arXiv.2307.16540","url":null,"abstract":"The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"19 1","pages":"2805-2817"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84271307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-01DOI: 10.48550/arXiv.2306.12515
Yiming Lin, Yeye He, S. Chaudhuri
Business Intelligence (BI) is crucial in modern enterprises and billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less technical end-users to build BI-models themselves. We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called k-Min-Cost-Arborescence (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called arborescence. While we prove k-MCA is intractable and inapproximate in general, we develop novel algorithms that can solve k-MCA optimally, which is shown to be efficient in practice with sub-second latency and can scale to the largest BI-models we encounter (with close to 100 tables). Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.
{"title":"Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph","authors":"Yiming Lin, Yeye He, S. Chaudhuri","doi":"10.48550/arXiv.2306.12515","DOIUrl":"https://doi.org/10.48550/arXiv.2306.12515","url":null,"abstract":"Business Intelligence (BI) is crucial in modern enterprises and billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less technical end-users to build BI-models themselves.\u0000 \u0000 We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called\u0000 k-Min-Cost-Arborescence\u0000 (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called\u0000 arborescence.\u0000 While we prove k-MCA is intractable and inapproximate in general, we develop novel algorithms that can solve k-MCA optimally, which is shown to be efficient in practice with sub-second latency and can scale to the largest BI-models we encounter (with close to 100 tables).\u0000 \u0000 Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"5 1","pages":"2578-2590"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86606512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01DOI: 10.14778/3598581.3598596
Fatemeh Zardbani, N. Mamoulis, Stratos Idreos, Panagiotis Karras
Can we quickly explore large multidimensional data in main memory? Adaptive indexing responds to this need by building an index incrementally, in response to queries; in its default form, it indexes a single attribute or, in the presence of several attributes, one attribute per index level. Unfortunately, this approach falters when indexing spatial data objects, encountered in data exploration tasks involving multidimensional range queries. In this paper, we introduce the Adaptive Incremental R-tree (AIR-tree): the first method for the adaptive indexing of non-point spatial objects; the AIR-tree incrementally and progressively constructs an in-memory spatial index over a static array, in response to incoming queries, using a suite of heuristics for creating and splitting nodes. Our thorough experimental study on synthetic and real data and workloads shows that the AIR-tree consistently outperforms prior adaptive indexing methods focusing on multidimensional points and a pre-built static R-tree in cumulative time over at least the first thousand queries.
{"title":"Adaptive Indexing of Objects with Spatial Extent","authors":"Fatemeh Zardbani, N. Mamoulis, Stratos Idreos, Panagiotis Karras","doi":"10.14778/3598581.3598596","DOIUrl":"https://doi.org/10.14778/3598581.3598596","url":null,"abstract":"\u0000 Can we quickly explore large multidimensional data in main memory?\u0000 Adaptive indexing\u0000 responds to this need by building an index incrementally, in response to queries; in its default form, it indexes a single attribute or, in the presence of several attributes, one attribute per index level. Unfortunately, this approach falters when indexing spatial data objects, encountered in data exploration tasks involving multidimensional range queries. In this paper, we introduce the Adaptive Incremental R-tree (AIR-tree): the first method for the adaptive indexing of non-point spatial objects; the AIR-tree incrementally and progressively constructs an in-memory spatial index over a static array, in response to incoming queries, using a suite of heuristics for creating and splitting nodes. Our thorough experimental study on synthetic and real data and workloads shows that the AIR-tree consistently outperforms prior adaptive indexing methods focusing on multidimensional points and a pre-built static R-tree in cumulative time over at least the first thousand queries.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"4 1","pages":"2248-2260"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73174620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01DOI: 10.14778/3598581.3598602
K. Maliszewski, Jorge-Arnulfo Quiané-Ruiz, V. Markl
Data processing on non-trusted infrastructures, such as the public cloud, has become increasingly popular, despite posing risks to data privacy. However, the existing cloud DBMSs either lack sufficient privacy guarantees or underperform. In this paper, we address both challenges (privacy and efficiency) by proposing CrkJoin, a join algorithm that leverages Trusted Execution Environments (TEE). We adapted CrkJoin to the architecture of TEEs to achieve significant improvements in latency of up to three orders of magnitude over baselines in a multi-tenant scenario. Moreover, CrkJoin offers at least 2.9x higher throughput than the state-of-the-art algorithms. Our research is unique in that it focuses on both privacy and efficiency concerns, which has not been adequately addressed in previous studies. Our findings suggest that CrkJoin makes joining in TEEs practical, and it lays a foundation towards a truly private and efficient cloud DBMS.
{"title":"Cracking-Like Join for Trusted Execution Environments","authors":"K. Maliszewski, Jorge-Arnulfo Quiané-Ruiz, V. Markl","doi":"10.14778/3598581.3598602","DOIUrl":"https://doi.org/10.14778/3598581.3598602","url":null,"abstract":"Data processing on non-trusted infrastructures, such as the public cloud, has become increasingly popular, despite posing risks to data privacy. However, the existing cloud DBMSs either lack sufficient privacy guarantees or underperform. In this paper, we address both challenges (privacy and efficiency) by proposing CrkJoin, a join algorithm that leverages Trusted Execution Environments (TEE). We adapted CrkJoin to the architecture of TEEs to achieve significant improvements in latency of up to three orders of magnitude over baselines in a multi-tenant scenario. Moreover, CrkJoin offers at least 2.9x higher throughput than the state-of-the-art algorithms. Our research is unique in that it focuses on both privacy and efficiency concerns, which has not been adequately addressed in previous studies. Our findings suggest that CrkJoin makes joining in TEEs practical, and it lays a foundation towards a truly private and efficient cloud DBMS.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"45 1","pages":"2330-2343"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90351226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}