ACM SIGMOD Record: Latest Articles

DFI: The Data Flow Interface for High-Speed Networks
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542705
Lasse Thostrup, Jan Skrzypczak, Matthias Jasny, Tobias Ziegler, Carsten Binnig
In this paper, we propose the Data Flow Interface (DFI) as a way to make it easier for data processing systems to exploit high-speed networks without the need to deal with the complexity of RDMA. By lifting the level of abstraction, DFI factors out much of the complexity of network communication and makes it easier for developers to declaratively express how data should be efficiently routed to accomplish a given distributed data processing task. As we show in our experiments, DFI is able to support a wide variety of data-centric applications with high performance at a low complexity for the applications.
Citations: 2
Technical Perspective
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542712
N. Mamoulis
The optimal assignment problem is a classic combinatorial optimization problem. Given a set A of n agents, a set T of m tasks, and an n×m cost matrix C, the objective is to find the matching between A and T that minimizes or maximizes an aggregate cost of the assigned agent-task pairs. In its standard definition, n = m and we are looking for the 1-to-1 matching with the minimum total cost. From a graph theory perspective, this is a weighted bipartite graph matching problem. A classic algorithm for solving the assignment problem is the Hungarian algorithm (a.k.a. Kuhn-Munkres algorithm) [3], which has an O(n^3) computational cost (assuming that n = m); this is the best run-time of any strongly polynomial algorithm for this problem. There are many variants of the assignment problem, which differ in the optimization objective (e.g., minimize/maximize an aggregate cost, achieve a stable matching, maximize the number of agents matched with their top preferences) and in whether there are constraints on the number of matches for each agent or task.
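To make the standard definition concrete (n = m, 1-to-1 matching, minimum total cost), the following is a minimal Python sketch. It is a brute-force enumeration, not the Hungarian algorithm, and it runs in O(n!) time, so it only illustrates what an optimal assignment is on tiny inputs. The function name and the 3×3 cost matrix are hypothetical and chosen for illustration.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Return (min_total_cost, assignment) for a square cost matrix.

    assignment[i] is the task index given to agent i. Every one of the n!
    one-to-one matchings is enumerated, so this is only usable for small n.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[agent][task] for agent, task in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical cost matrix: rows are agents, columns are tasks.
C = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = brute_force_assignment(C)
print(total, assignment)
```

A polynomial-time solver such as the Hungarian algorithm would return a matching with the same total cost on this input; the enumeration above is useful mainly as a correctness reference for small test cases.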
Citations: 0
Technical Perspective
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542714
Bill Howe
There is a tension between an imperative style for control flow that has been shown to be easier to use, especially for novices, and a functional style for control flow that better exposes optimization opportunities, thereby making the optimizers more capable. The authors of "Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance" propose Mitos, a program rewriting framework that achieves the best of both worlds by borrowing program analysis concepts from compilers and lifting them to the distributed dataflow regime. Dataflow systems require significant data movement during processing, which can be highly redundant and wasteful in the context of iteration: naive execution plans can reprocess the same massive dataset on each iteration, and iteration i+1 must wait until iteration i is finished. The authors design a mechanism for labeling each intermediate result with its execution path, allowing the system to simultaneously manage complex branching situations while also implementing efficient processing via loop pipelining, all by reasoning about and comparing execution paths.
Citations: 0
Structure and Complexity of Bag Consistency
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542719
Albert Atserias, Phokion G. Kolaitis
Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for relations over that schema holds, which means that every collection of pairwise consistent relations over the schema is globally consistent. Even though real-life databases consist of bags (multisets), there has not been a study of the interplay between local consistency and global consistency for bags. We embark on such a study here and we first show that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for bags over that schema holds. After this, we explore algorithmic aspects of global consistency for bags by analyzing the computational complexity of the global consistency problem for bags: given a collection of bags, are these bags globally consistent? We show that this problem is in NP, even when the schema is part of the input. We then establish the following dichotomy theorem for fixed schemas: if the schema is acyclic, then the global consistency problem for bags is solvable in polynomial time, while if the schema is cyclic, then the global consistency problem for bags is NP-complete. The latter result contrasts sharply with the state of affairs for relations, where, for each fixed schema, the global consistency problem for relations is solvable in polynomial time.
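As a small illustration of what consistency means for bags, the Python sketch below represents a bag as a `Counter` from tuples to multiplicities and compares the marginals of two bags on their shared attributes. If some bag over the union of the attributes had both bags as its marginals, the two would have to agree on the shared attributes, so the check tests a necessary condition for two bags to be consistent. The representation and the names `marginal` and `marginals_agree` are assumptions made for this example, not notation from the paper.

```python
from collections import Counter


def marginal(bag, schema, attrs):
    """Project a bag onto attrs, adding up multiplicities of merged tuples."""
    idx = [schema.index(a) for a in attrs]
    out = Counter()
    for row, count in bag.items():
        out[tuple(row[i] for i in idx)] += count
    return out


def marginals_agree(bag_r, schema_r, bag_s, schema_s):
    """True if both bags have the same marginal on their common attributes."""
    common = [a for a in schema_r if a in schema_s]
    return marginal(bag_r, schema_r, common) == marginal(bag_s, schema_s, common)


# Hypothetical bags R(A, B) and S(B, C); values map tuples to multiplicities.
R = Counter({(1, 2): 2, (3, 2): 1})
S_ok = Counter({(2, 5): 3})   # B = 2 occurs 3 times, as in R
S_bad = Counter({(2, 5): 2})  # B = 2 occurs only twice

print(marginals_agree(R, ("A", "B"), S_ok, ("B", "C")))
print(marginals_agree(R, ("A", "B"), S_bad, ("B", "C")))
```

Unlike the set-based case, multiplicities matter here: `S_bad` contains the same distinct tuple as `S_ok`, yet its marginal on B differs from that of `R`.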
Citations: 2
FoundationDB
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542707
Jingyu Zhou, Meng Xu, A. Shraer, B. Namasivayam, A. Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, J. Leach, D. Rosenthal, Xin Dong, Willie B. Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xi-sheng Su, Vishesh Yadav
FoundationDB is an open-source transactional key-value store created more than ten years ago. It is one of the first systems to combine the flexibility and scalability of NoSQL architectures with the power of ACID transactions. FoundationDB adopts an unbundled architecture that decouples an in-memory transaction management system, a distributed storage system, and a built-in distributed configuration system. Each sub-system can be independently provisioned and configured to achieve scalability, high availability, and fault tolerance. FoundationDB includes a deterministic simulation framework, used to test every new feature under a myriad of possible faults. FoundationDB offers a minimal and carefully chosen feature set, which has enabled a range of disparate systems to be built as layers on top. FoundationDB is the underpinning of cloud infrastructure at Apple, Snowflake and other companies.
Citations: 0
No PANE, No Gain
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542711
Renchi Yang, Jieming Shi, X. Xiao, Yin Yang, S. Bhowmick, Juncheng Liu
Given a graph G where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node v ∈ G to a compact vector X_v, which can be used in downstream machine learning tasks in a variety of applications. Existing ANE solutions do not scale to massive graphs due to prohibitive computation costs or generation of low-quality embeddings. This paper proposes PANE, an effective and scalable approach to ANE computation for massive graphs in a single server that achieves state-of-the-art result quality on multiple benchmark datasets for two common prediction tasks: link prediction and node classification. Under the hood, PANE takes inspiration from well-established data management techniques to scale up ANE in a single server. Specifically, it exploits a carefully formulated problem based on a novel random walk model, a highly efficient solver, and non-trivial parallelization by utilizing modern multi-core CPUs. Extensive experiments demonstrate that PANE consistently outperforms all existing methods in terms of result quality, while being orders of magnitude faster.
Citations: 2
Technical Perspective
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542716
R. Pagh
The paper "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselý studies a fundamental question in data stream processing, namely how to maintain information about the distribution of data in the form of quantiles. More precisely, given a stream S of elements from some ordered universe U, we wish to maintain a compact summary data structure that allows us to estimate the number of elements in the stream that are smaller than a given query element y ∈ U, i.e., estimate the rank of y. Solutions to this problem have numerous applications in large-scale data analysis and can potentially be used for range query selectivity estimation in database engines.
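The quantity being estimated can be pinned down with a short Python sketch. `exact_rank` stores and sorts the whole stream, which is exactly what a compact summary is meant to avoid; it serves only as the ground truth an estimate would be compared against. The helper `within_relative_error` encodes one common reading of a relative-error guarantee, |estimate − rank| ≤ ε · rank, which is an assumption here since the abstract does not state the guarantee. All names are hypothetical.

```python
from bisect import bisect_left


def exact_rank(stream, y):
    """Number of elements in stream that are strictly smaller than y."""
    return bisect_left(sorted(stream), y)


def within_relative_error(estimate, true_rank, eps):
    """Assumed guarantee: the error is at most an eps fraction of the rank."""
    return abs(estimate - true_rank) <= eps * true_rank


stream = [5, 1, 9, 3, 7, 3]
print(exact_rank(stream, 4))
print(within_relative_error(98, 100, 0.05))
```

Under this reading, queries with a small true rank tolerate only a small absolute error, which is what distinguishes a relative guarantee from an additive one.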
Citations: 0
Technical Perspective of TURL
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542708
Paolo Papotti
Several efforts aim at representing tabular data with neural models for supporting target applications at the intersection of natural language processing (NLP) and databases (DB) [1-3]. The goal is to extend to structured data the recent neural architectures, which achieve state-of-the-art results in NLP applications. Language models (LMs) are usually pre-trained with unsupervised tasks on a large text corpus. The output LM is then fine-tuned on a variety of downstream tasks with a small set of specific examples. This process has many advantages, because the LM contains information about textual structure and content, which is used by the target application without manually defining features.
Citations: 0
Bipartite Matching
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542713
Tenindra Abeywickrama, Victor Liang, K. Tan
The Kuhn-Munkres (KM) algorithm is a classical combinatorial optimization algorithm that is widely used for minimum cost bipartite matching in many real-world applications, such as transportation. For example, a ride-hailing service may use it to find the optimal assignment of drivers to passengers to minimize the overall wait time. Typically, given two bipartite sets, this process involves computing the edge costs between all bipartite pairs and finding an optimal matching. However, existing works overlook the impact of edge cost computation on the overall running time. In reality, edge computation often significantly outweighs the computation of the optimal assignment itself, as in the case of assigning drivers to passengers, which involves computation of expensive graph shortest paths. Following on from this, we also observe that common real-world settings exhibit a useful property that allows us to incrementally compute edge costs only as required using an inexpensive lower-bound heuristic. This technique significantly reduces the overall cost of assignment compared to the original KM algorithm, as we demonstrate experimentally on multiple real-world data sets and workloads. Moreover, our algorithm is not limited to this domain and is potentially applicable in other settings where lower-bounding heuristics are available.
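The idea of computing expensive edge costs only when a cheap lower bound cannot rule them out can be sketched in Python for the simplest case: finding the single cheapest candidate for one agent. This is not the authors' modified KM algorithm; it only illustrates the lower-bound pruning principle. The function names are hypothetical, and the example uses straight-line distance as the lower bound and Manhattan distance as a stand-in for an expensive shortest-path computation.

```python
import heapq
import math


def cheapest_with_lower_bounds(candidates, lower_bound, exact_cost):
    """Return (best_candidate, best_cost, num_exact_evaluations).

    Requires lower_bound(c) <= exact_cost(c) for every candidate c.
    Candidates are visited in increasing lower-bound order, and the search
    stops once the next lower bound cannot beat the best exact cost so far.
    """
    heap = [(lower_bound(c), i, c) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    best, best_cost, evaluated = None, float("inf"), 0
    while heap:
        lb, _, c = heapq.heappop(heap)
        if lb >= best_cost:
            break  # every remaining candidate costs at least lb
        cost = exact_cost(c)
        evaluated += 1
        if cost < best_cost:
            best, best_cost = c, cost
    return best, best_cost, evaluated


driver = (0, 0)
passengers = [(1, 1), (5, 5), (0, 3), (6, 0)]


def euclid(p):
    # Cheap lower bound: straight-line distance from the driver.
    return math.hypot(p[0] - driver[0], p[1] - driver[1])


def manhattan(p):
    # Stand-in for an expensive exact cost such as a road-network distance.
    return abs(p[0] - driver[0]) + abs(p[1] - driver[1])


best, cost, evaluated = cheapest_with_lower_bounds(passengers, euclid, manhattan)
print(best, cost, evaluated)
```

In this example only one of the four exact costs is ever computed, because the second-smallest lower bound already exceeds the exact cost of the first candidate.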
Citations: 1
Imperative or Functional Control Flow Handling
Pub Date : 2022-05-31 DOI: 10.1145/3542700.3542715
G. Gévay, T. Rabl, S. Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, V. Markl
Modern data analysis tasks often involve control flow statements, such as the iterations in PageRank and K-means. To achieve scalability, developers usually implement these tasks in distributed dataflow systems, such as Spark and Flink. Designers of such systems have to choose between providing imperative or functional control flow constructs to users. Imperative constructs are easier to use, but functional constructs are easier to compile to an efficient dataflow job. We propose Mitos, a system where control flow is both easy to use and efficient. Mitos relies on an intermediate representation based on the static single assignment form. This allows us to abstract away from specific control flow constructs and treat any imperative control flow uniformly both when building the dataflow job and when coordinating the distributed execution.
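The contrast between the two styles of control flow can be shown with a small Python sketch of the same fixed-point iteration written both ways. This is plain single-machine Python, not the API of Mitos, Spark, or Flink; the function names and the Newton step for the square root of 2 are chosen purely for illustration.

```python
def imperative_converge(x, step, tol):
    # Imperative style: an ordinary loop that updates mutable state.
    while True:
        nxt = step(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt


def iterate_until(state, step, done):
    # Functional style: the loop is a higher-order construct, and the body
    # and the exit condition are passed in as functions.
    nxt = step(state)
    return nxt if done(state, nxt) else iterate_until(nxt, step, done)


def newton_sqrt2(x):
    # One Newton step towards the square root of 2.
    return (x + 2.0 / x) / 2.0


a = imperative_converge(1.0, newton_sqrt2, 1e-12)
b = iterate_until(1.0, newton_sqrt2, lambda old, new: abs(new - old) < 1e-12)
print(a, b)
```

In the functional version the loop structure is visible to the framework as data (a body function and a condition function), whereas in the imperative version it is hidden inside ordinary statements; this is the trade-off between ease of use and ease of compilation that the abstract describes.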
Citations: 0