ACM SIGMOD Record: Latest Articles

Auto-Tables: Relationalize Tables without Using Examples

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665269

Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri

Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums.
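
As a rough illustration of the kind of restructuring the abstract refers to, the sketch below unpivots a "wide" table (one column per year) into a relational one with one row per entity and attribute value. The sample data and the `unpivot` helper are hypothetical and not taken from the paper; they only show one transformation that a user would otherwise have to program by hand.

```python
def unpivot(rows, id_cols, var_name, value_name):
    """Turn every non-id column into its own (ids..., variable, value) row."""
    out = []
    for row in rows:
        ids = {c: row[c] for c in id_cols}
        for col, val in row.items():
            if col in id_cols:
                continue
            out.append({**ids, var_name: col, value_name: val})
    return out


# Hypothetical non-relational input: years appear as column headers.
wide = [
    {"country": "A", "2021": 10, "2022": 12},
    {"country": "B", "2021": 7, "2022": 9},
]

# Relational output: one row per (country, year) observation.
tidy = unpivot(wide, id_cols=["country"], var_name="year", value_name="sales")
for row in tidy:
    print(row)
```

Once the table is in this shape, a filter such as "sales in 2022" becomes an ordinary predicate on the `year` column instead of a reference to a column name.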

Citations: 0

From Binary Join to Free Join

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665259

Y. Wang, Max Willsey, Dan Suciu

Over the last decade, worst-case optimal join (WCOJ) algorithms have emerged as a new paradigm for one of the most fundamental challenges in query processing: computing joins efficiently. Such algorithms can be asymptotically faster than traditional binary joins, while remaining simple to understand and implement. However, they have been found to be less efficient than the old paradigm, traditional binary join plans, on the typical acyclic queries found in practice. In an effort to unify and generalize the two paradigms, we proposed a new framework, called Free Join, in our SIGMOD 2023 paper. Not only does Free Join unite the worlds of traditional and worst-case optimal join algorithms, it also uncovers optimizations and evaluation strategies that outperform both. In this article, we approach Free Join from the traditional perspective of binary joins, and re-derive the more general framework via a series of gradual transformations. We hope this perspective from the past can help practitioners better understand the Free Join framework, and find ways to incorporate some of the ideas into their own systems.
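
To make the two paradigms concrete, here is a minimal sketch of the triangle query Q(a, b, c) = R(a, b), S(b, c), T(c, a) evaluated both ways. The first function is a binary hash-join plan; the second binds one variable at a time in the style of Generic Join, a well-known WCOJ algorithm. Neither is Free Join itself, and the relations and function names are assumptions made for illustration.

```python
from collections import defaultdict


def binary_join_triangles(R, S, T):
    """Binary plan: join R with S on b, then probe T with each intermediate tuple."""
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    t_set = set(T)
    out = set()
    for a, b in R:
        for c in s_by_b[b]:  # the intermediate result (R join S) can be large
            if (c, a) in t_set:
                out.add((a, b, c))
    return out


def generic_join_triangles(R, S, T):
    """Variable-at-a-time plan: intersect the candidates for a, then b, then c."""
    r_idx = defaultdict(set)  # a -> {b}
    s_idx = defaultdict(set)  # b -> {c}
    t_idx = defaultdict(set)  # a -> {c}
    for a, b in R:
        r_idx[a].add(b)
    for b, c in S:
        s_idx[b].add(c)
    for c, a in T:
        t_idx[a].add(c)
    out = set()
    for a in r_idx.keys() & t_idx.keys():
        for b in [b for b in r_idx[a] if b in s_idx]:
            for c in s_idx[b] & t_idx[a]:
                out.add((a, b, c))
    return out


# Hypothetical edge relations.
R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 1)]
T = [(3, 1), (1, 2)]

print(sorted(binary_join_triangles(R, S, T)))
print(sorted(generic_join_triangles(R, S, T)))
```

Both plans return the same tuples; they differ in how much intermediate work they do, which is the gap a unifying framework has to bridge.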

Citations: 0

Technical Perspective: Efficient and Reusable Lazy Sampling

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665260

Thomas Neumann

When interactively working with data, query latency is very important. In particular when ad-hoc queries are written in an explorative manner, it is essential to quickly get feedback in order to refine and correct the query based upon result values. This interactive use case is difficult to support if the underlying data is large, as analyzing large volumes of data is inherently expensive.

Citations: 0

Efficient and Reusable Lazy Sampling

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665261

Viktor Sanca, Periklis Chrysogelos, Anastasia Ailamaki

Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability increases. While offline AQP relies on predictable workloads to a priori create samples that match the queries, as soon as workload predictability diminishes, returning to existing online AQP methods that create query-specific samples with little reuse across queries results in significantly smaller gains in response times. As a result, existing approaches cannot fully exploit the benefits of sampling under increased unpredictability. We propose LAQy, a framework for building, expanding, and merging samples to adapt to the changes in workload predicates. We propose lazy sampling to overcome the unpredictability issues that cause fast-but-specialized samples to be query-specific and design it for a scale-up analytical engine to show the adaptivity and practicality of our framework in a modern system. LAQy speeds up online sampling processing as a function of data access and computation reuse, making sampler placement after expensive operators more practical.
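
The idea of expanding and merging samples as predicates change can be sketched as follows. This is not LAQy's algorithm: it is a toy cache, under the assumption that queries filter on a half-open key range and that a fixed-rate Bernoulli sample is acceptable. Because Bernoulli samples of disjoint partitions can be unioned into a Bernoulli sample of the whole, a new query only needs to sample the part of its range that earlier queries did not cover. The class and attribute names are hypothetical.

```python
import random


class LazySampleCache:
    """Keeps one Bernoulli sample over a covered key range [lo, hi) and extends it on demand."""

    def __init__(self, table, rate, seed=0):
        self.table = table        # list of (key, value) rows
        self.rate = rate          # probability of keeping each row
        self.rng = random.Random(seed)
        self.lo = self.hi = None  # key range covered by the cached sample
        self.sample = []
        self.rows_examined = 0    # rows that were passed through the sampler

    def _sample_range(self, lo, hi):
        picked = []
        for key, value in self.table:
            if lo <= key < hi:
                self.rows_examined += 1
                if self.rng.random() < self.rate:
                    picked.append((key, value))
        return picked

    def query(self, lo, hi):
        """Return a sample of rows with lo <= key < hi, sampling only uncovered keys."""
        if self.lo is None:
            self.sample = self._sample_range(lo, hi)
            self.lo, self.hi = lo, hi
        else:
            if lo < self.lo:  # extend coverage to the left
                self.sample += self._sample_range(lo, self.lo)
                self.lo = lo
            if hi > self.hi:  # extend coverage to the right
                self.sample += self._sample_range(self.hi, hi)
                self.hi = hi
        return [(k, v) for k, v in self.sample if lo <= k < hi]


table = [(k, k * 10) for k in range(100)]
cache = LazySampleCache(table, rate=0.5, seed=1)
first = cache.query(0, 50)    # samples keys 0..49
second = cache.query(25, 75)  # reuses 25..49, samples only 50..74
print(cache.rows_examined)
```

In this toy, the second query passes 25 new rows through the sampler instead of 50, which is the kind of reuse across queries that the abstract describes.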

Citations: 0

DBSP: Incremental Computation on Streams and Its Applications to Databases

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665271

Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, V. Tannen

We describe DBSP, a framework for incremental computation. Incremental computations repeatedly evaluate a function on some input values that are "changing". The goal of an efficient implementation is to "reuse" previously computed results. Ideally, when presented with a new change to the input, an incremental computation should only perform work proportional to the size of the changes of the input, rather than to the size of the entire dataset.
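
A minimal sketch of the goal stated above, assuming changes arrive as (row, weight) pairs where +1 is an insertion and -1 a deletion. That change representation, the class name, and the sample rows are assumptions of this example, not an excerpt of the DBSP framework. The maintained view is a per-key count, and each update touches only the keys mentioned in the change.

```python
from collections import Counter


class IncrementalGroupCount:
    """Maintains COUNT(*) GROUP BY key, doing work proportional to each change."""

    def __init__(self, key):
        self.key = key
        self.counts = Counter()  # current output of the view

    def apply(self, delta):
        """delta: iterable of (row, weight). Returns the change to the output."""
        out_delta = Counter()
        for row, weight in delta:
            out_delta[self.key(row)] += weight
        for k, w in out_delta.items():
            self.counts[k] += w
            if self.counts[k] == 0:
                del self.counts[k]  # drop groups that became empty
        return {k: w for k, w in out_delta.items() if w != 0}


view = IncrementalGroupCount(key=lambda r: r["city"])
d1 = view.apply([({"city": "Paris"}, +1), ({"city": "Rome"}, +1), ({"city": "Paris"}, +1)])
d2 = view.apply([({"city": "Rome"}, -1), ({"city": "Oslo"}, +1)])
print(d2, dict(view.counts))
```

The second call never looks at the Paris group: previously computed results are reused, and only the groups affected by the change are updated.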

Citations: 0

Technical Perspective: Synthetic Data Needs a Reproducibility Benchmark

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665266

Xi He

Synthetic data is a vital substitute for real sensitive personal data in supporting social science research and policy studies. Extensive prior research has delved into various models for generating synthetic data, from traditional statistical approaches to cutting-edge deep-learning methods. However, selecting the most suitable one for unforeseen applications poses a significant challenge due to the varying strengths and weaknesses, dependent on factors such as the application domain, data distribution, analytical requirements, and privacy considerations.

Citations: 0

Technical Perspective on 'Better Differentially Private Approximate Histograms and Heavy Hitters using the Misra-Gries Sketch'

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665254

Graham Cormode

The topics of private data analysis and streaming data management have both been separately the focus of much study within the data management community for many years. However, more recently there have been studies which bring these two previously isolated topics together.

Citations: 0

Unicorn: A Unified Multi-Tasking Matching Model

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665263

Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, Nan Tang

Data matching, which decides whether two data elements (e.g., string, tuple, column, or knowledge graph entity) are the "same" (a.k.a. a match), is a key concept in data integration. The widely used practice is to build task-specific or even dataset-specific solutions, which are hard to generalize and preclude sharing knowledge that can be learned from different datasets and multiple tasks. In this paper, we propose Unicorn, a unified model for generally supporting common data matching tasks. Building such a unified model is challenging due to heterogeneous formats of input data elements and various matching semantics of multiple tasks. To address the challenges, Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, which is a binary classifier, to decide whether a matches b. To align matching semantics of multiple tasks, Unicorn adopts a mixture-of-experts model that enhances the learned representation into a better representation. We conduct extensive experiments using 20 datasets on 7 well-studied data matching tasks, and find that our unified model can achieve better performance on most tasks and on average, compared with the state-of-the-art specific models trained for ad-hoc tasks and datasets separately. Moreover, Unicorn can also serve new matching tasks well with zero-shot learning.
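
The Encoder, mixture-of-experts, and Matcher pipeline described above can be sketched as a single forward pass. Everything here is a toy stand-in: the bigram-bucket `encode` function replaces a learned encoder, the weights are random and untrained, and the class and method names are hypothetical. The output is therefore not a meaningful match score; the sketch only shows how a gate mixes expert outputs before a binary classifier.

```python
import math
import random


def encode(a, b, dim=8):
    """Toy Encoder: bucket character bigrams of the serialized pair into a vector."""
    text = f"{a} [SEP] {b}".lower()
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        vec[(ord(text[i]) * 31 + ord(text[i + 1])) % dim] += 1.0
    return vec


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyMatcher:
    def __init__(self, dim=8, n_experts=3, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.dim = dim
        self.gate = rand(n_experts, dim)                       # gating network
        self.experts = [rand(dim, dim) for _ in range(n_experts)]
        self.classifier = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def gate_weights(self, x):
        """Softmax over experts: how much each expert contributes for input x."""
        return softmax(matvec(self.gate, x))

    def predict_proba(self, a, b):
        x = encode(a, b, self.dim)
        weights = self.gate_weights(x)
        expert_outs = [matvec(E, x) for E in self.experts]
        # Mixture-of-experts: gate-weighted sum of the expert representations.
        mixed = [sum(w * out[i] for w, out in zip(weights, expert_outs))
                 for i in range(self.dim)]
        # Matcher: logistic binary classifier on the mixed representation.
        logit = sum(c * h for c, h in zip(self.classifier, mixed))
        return 1.0 / (1.0 + math.exp(-logit))


model = ToyMatcher()
print(model.predict_proba("IBM Corp.", "International Business Machines"))
```

Because every input is first serialized into one pair representation, the same model can in principle be applied to strings, tuples, or columns, which is what makes a single shared matcher possible.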

Citations: 0

Technical Perspective: Graph Theory for Data Privacy: A New Approach for Complex Data Flows

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665264

Elena Ferrari

Nearly all of the world's population now uses online services that request personal information, covering almost every aspect of our lives. The abundance of personal data in digital form has brought incredible benefits to end users, enabling them to access personalized and advanced services based on the analysis of the data collected. This capability has dramatically improved the user experience in various application domains, ranging from healthcare to e-commerce, finance, logistics, and entertainment, to name a few. Numerous technological advancements in the field of big data have enabled this massive processing of personal data, and recent advances in AI data processing capabilities will expand the ways in which service providers will use personal data in the coming years. Machine learning algorithms, powered by AI, will be used to make increasingly accurate predictions about user behavior by uncovering hidden correlations within massive data sets. There is therefore a tension between the desire to fully exploit personal data in such ecosystems and the need to provide strong privacy and transparency guarantees to the individuals whose data is being exploited. Privacy protection is further complicated because data processing is typically not performed in isolation but through pipelines of different services, with each step making inferences about the personal data consumed by the services in subsequent steps.

Citations: 0

Learning to Restructure Tables Automatically

ACM SIGMOD Record

Pub Date : 2024-05-14 DOI: 10.1145/3665252.3665268

J. M. Hellerstein

By now, it is widely accepted folk wisdom that "half of the time in any data analysis project is spent wrangling the data". Analytic algorithms and tools, built on mathematical foundations of matrices and relations, require their data to be lined up in particular rows and columns. In the relational model (known in data science circles as "tidy data"), each row is an independent observation, and each column is a distinct attribute of the phenomenon described by the data. While there are many thorny aspects to data wrangling, perhaps none is more basic than the challenge of getting data reorganized, positionally, into the right form for analysis.

Citations: 0
