Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri
Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums.
Auto-Tables: Relationalize Tables without Using Examples. Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665269
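As a rough illustration of the kind of table-restructuring transformation the Auto-Tables abstract refers to, the sketch below unpivots a "wide" table, where years appear as column headers, into a relational one with one row per (product, year) observation. The function name `unpivot`, its parameters, and the sample data are hypothetical, and the transformation is written by hand here; the abstract does not say how Auto-Tables chooses or represents its transformations.

```python
def unpivot(rows, id_cols, var_name, value_name):
    """Turn every non-id column of a wide table into its own row.

    rows       : list of dicts, one dict per input row
    id_cols    : columns that identify the entity and are copied unchanged
    var_name   : name of the output column holding the old column header
    value_name : name of the output column holding the old cell value
    """
    out = []
    for row in rows:
        for col, val in row.items():
            if col in id_cols:
                continue
            rec = {k: row[k] for k in id_cols}
            rec[var_name] = col
            rec[value_name] = val
            out.append(rec)
    return out


# Hypothetical non-relational input: each year is a separate column.
wide = [
    {"product": "A", "2021": 10, "2022": 12},
    {"product": "B", "2021": 7, "2022": 9},
]

# Relational output: one row per (product, year) pair.
long_rows = unpivot(wide, id_cols=["product"], var_name="year", value_name="sales")
for rec in long_rows:
    print(rec)
```

Once the data is in this shape, a query such as "total sales per year" becomes a plain GROUP BY on the `year` column instead of arithmetic over a varying set of column names.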
Over the last decade, worst-case optimal join (WCOJ) algorithms have emerged as a new paradigm for one of the most fundamental challenges in query processing: computing joins efficiently. Such algorithms can be asymptotically faster than traditional binary joins, all the while remaining simple to understand and implement. However, they have been found to be less efficient than the old paradigm, traditional binary join plans, on the typical acyclic queries found in practice. In an effort to unify and generalize the two paradigms, we proposed a new framework, called Free Join, in our SIGMOD 2023 paper. Not only does Free Join unite the worlds of traditional and worst-case optimal join algorithms, it uncovers optimizations and evaluation strategies that outperform both. In this article, we approach Free Join from the traditional perspective of binary joins, and re-derive the more general framework via a series of gradual transformations. We hope this perspective from the past can help practitioners better understand the Free Join framework, and find ways to incorporate some of the ideas into their own systems.
From Binary Join to Free Join. Y. Wang, Max Willsey, Dan Suciu. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665259
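To make the contrast between the two paradigms concrete, here is a minimal sketch on the triangle query Q(a, b, c) = R(a, b), S(b, c), T(a, c), a standard cyclic example. The first function follows a binary join plan (join R with S, then filter by T); the second binds one variable at a time by intersecting candidate sets, in the style of a generic worst-case optimal join. This is not the Free Join algorithm itself, and the function names and sample relations are assumptions made for illustration.

```python
def index(pairs):
    """Build a hash index: first component -> set of second components."""
    idx = {}
    for x, y in pairs:
        idx.setdefault(x, set()).add(y)
    return idx


def binary_join_triangles(R, S, T):
    """Binary plan: (R join S on b), then keep tuples whose (a, c) is in T."""
    s_by_b = index(S)
    t_set = set(T)
    # The intermediate result can be much larger than the final output.
    intermediate = [(a, b, c) for a, b in R for c in s_by_b.get(b, ())]
    result = sorted(t for t in intermediate if (t[0], t[2]) in t_set)
    return result, len(intermediate)


def generic_join_triangles(R, S, T):
    """Variable-at-a-time plan: bind a, then b, then c via set intersections."""
    r_idx = index(R)  # a -> {b}
    s_idx = index(S)  # b -> {c}
    t_idx = index(T)  # a -> {c}
    out = []
    for a in r_idx.keys() & t_idx.keys():
        for b in s_idx.keys() & r_idx[a]:
            for c in s_idx[b] & t_idx[a]:
                out.append((a, b, c))
    return sorted(out)


# Hypothetical input relations.
R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 1), (3, 4)]
T = [(1, 3), (2, 1), (1, 4)]

binary_result, intermediate_size = binary_join_triangles(R, S, T)
generic_result = generic_join_triangles(R, S, T)
print(binary_result, intermediate_size)
print(generic_result)
```

Both plans return the same triangles. The binary plan materializes five intermediate tuples to produce three results, while the variable-at-a-time plan never builds tuples that fail the check against T.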
When interactively working with data, query latency is very important. In particular when ad-hoc queries are written in an explorative manner, it is essential to quickly get feedback in order to refine and correct the query based upon result values. This interactive use case is difficult to support if the underlying data is large, as analyzing large volumes of data is inherently expensive.
Technical Perspective: Efficient and Reusable Lazy Sampling. Thomas Neumann. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665260
Viktor Sanca, Periklis Chrysogelos, Anastasia Ailamaki
Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability increases. While offline AQP relies on predictable workloads to a priori create samples that match the queries, as soon as workload predictability diminishes, returning to existing online AQP methods that create query-specific samples with little reuse across queries results in significantly smaller gains in response times. As a result, existing approaches cannot fully exploit the benefits of sampling under increased unpredictability. We propose LAQy, a framework for building, expanding, and merging samples to adapt to the changes in workload predicates. We propose lazy sampling to overcome the unpredictability issues that cause fast-but-specialized samples to be query-specific and design it for a scale-up analytical engine to show the adaptivity and practicality of our framework in a modern system. LAQy speeds up online sampling processing as a function of data access and computation reuse, making sampler placement after expensive operators more practical.
Efficient and Reusable Lazy Sampling. Viktor Sanca, Periklis Chrysogelos, Anastasia Ailamaki. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665261
Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, V. Tannen
We describe DBSP, a framework for incremental computation. Incremental computations repeatedly evaluate a function on some input values that are "changing". The goal of an efficient implementation is to "reuse" previously computed results. Ideally, when presented with a new change to the input, an incremental computation should only perform work proportional to the size of the changes of the input, rather than to the size of the entire dataset.
DBSP: Incremental Computation on Streams and Its Applications to Databases. Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, V. Tannen. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665271
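The goal stated in the DBSP abstract, doing work proportional to the size of the change rather than the size of the dataset, can be illustrated with a toy incrementally maintained group count. This is only a sketch of the general idea and not DBSP's design or API: the class `IncrementalGroupCount` is hypothetical, and representing a change as a list of (key, weight) pairs, with +1 for an insertion and -1 for a deletion, is an assumption made here.

```python
from collections import Counter


class IncrementalGroupCount:
    """Maintains COUNT(*) GROUP BY key under insertions and deletions."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, delta):
        """Apply a change and return the new counts of the touched keys only.

        delta: iterable of (key, weight); weight +1 inserts, -1 deletes.
        The loop visits only the entries of delta, never the full dataset.
        """
        changed = {}
        for key, weight in delta:
            self.counts[key] += weight
            if self.counts[key] == 0:
                del self.counts[key]  # drop groups that became empty
            changed[key] = self.counts.get(key, 0)
        return changed


def recompute(rows):
    """Non-incremental baseline: scan every row again."""
    return Counter(rows)


view = IncrementalGroupCount()
first = view.apply([("a", 1), ("b", 1), ("a", 1)])  # three insertions
second = view.apply([("a", -1), ("b", -1)])         # two deletions
print(first, second, dict(view.counts))
```

After both changes the maintained view matches what `recompute` returns on the surviving rows, but the second call touched only two entries.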
Synthetic data is a vital substitute for real sensitive personal data in supporting social science research and policy studies. Extensive prior research has delved into various models for generating synthetic data, from traditional statistical approaches to cutting-edge deep-learning methods. However, selecting the most suitable one for unforeseen applications poses a significant challenge due to the varying strengths and weaknesses, dependent on factors such as the application domain, data distribution, analytical requirements, and privacy considerations.
Technical Perspective: Synthetic Data Needs a Reproducibility Benchmark. Xi He. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665266
The topics of private data analysis and streaming data management have each been the focus of much study within the data management community for many years. More recently, however, there have been studies which bring these two previously isolated topics together.
Technical Perspective on 'Better Differentially Private Approximate Histograms and Heavy Hitters using the Misra-Gries Sketch'. Graham Cormode. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665254
Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, Nan Tang
Data matching, which decides whether two data elements (e.g., string, tuple, column, or knowledge graph entity) are the "same" (a.k.a. a match), is a key concept in data integration. The widely used practice is to build task-specific or even dataset-specific solutions, which are hard to generalize and forgo the opportunity to share knowledge learned from different datasets and multiple tasks. In this paper, we propose Unicorn, a unified model for generally supporting common data matching tasks. Building such a unified model is challenging due to heterogeneous formats of input data elements and various matching semantics of multiple tasks. To address the challenges, Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, which is a binary classifier, to decide whether a matches b. To align matching semantics of multiple tasks, Unicorn adopts a mixture-of-experts model that enhances the learned representation into a better representation. We conduct extensive experiments using 20 datasets on 7 well-studied data matching tasks, and find that our unified model can achieve better performance on most tasks and on average, compared with the state-of-the-art specific models trained for ad-hoc tasks and datasets separately. Moreover, Unicorn can also serve new matching tasks well with zero-shot learning.
Unicorn: A Unified Multi-Tasking Matching Model. Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, Nan Tang. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665263
Nearly all of the world's population now uses online services that request personal information, covering almost every aspect of our lives. The abundance of personal data in digital form has brought incredible benefits to end users, enabling them to access personalized and advanced services based on the analysis of the data collected. This capability has dramatically improved the user experience in various application domains, ranging from healthcare to e-commerce, finance, logistics, and entertainment, to name a few. Numerous technological advancements in the field of big data have enabled this massive processing of personal data, and recent advances in AI data processing capabilities will expand the ways in which service providers will use personal data in the coming years. Machine learning algorithms, powered by AI, will be used to make increasingly accurate predictions about user behavior by uncovering hidden correlations within massive data sets. There is therefore a tension between the desire to fully exploit personal data in such ecosystems and the need to provide strong privacy and transparency guarantees to the individuals whose data is being exploited. Privacy protection is further complicated because data processing is typically not performed in isolation but through pipelines of different services, with each step making inferences about the personal data consumed by the services in subsequent steps.
Technical Perspective: Graph Theory for Data Privacy: A New Approach for Complex Data Flows. Elena Ferrari. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665264
By now, it is widely accepted folk wisdom that "half of the time in any data analysis project is spent wrangling the data". Analytic algorithms and tools, built on mathematical foundations of matrices and relations, require their data to be lined up in particular rows and columns. In the relational model (known in data science circles as "tidy data"), each row is an independent observation, and each column is a distinct attribute of the phenomenon described by the data. While there are many thorny aspects to data wrangling, perhaps none is more basic than the challenge of getting data reorganized, positionally, into the right form for analysis.
Learning to Restructure Tables Automatically. J. M. Hellerstein. ACM SIGMOD Record, 2024-05-14. https://doi.org/10.1145/3665252.3665268