(This article is an imagined conversation with the students in my undergraduate algorithms class at the University at Buffalo.)
{"title":"Technical Perspective: (Pre-) Semirings Come to the Recursion Party","authors":"A. Rudra","doi":"10.1145/3604437.3604453","DOIUrl":"https://doi.org/10.1145/3604437.3604453","url":null,"abstract":"(This article is an imagined conversation with my U. at Buffalo UG algorithms class students.)","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124937015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A data management platform provides many capabilities to assist the data owner, application coder, or end user. For example, it should support an expressive query language, schema definition, and sophisticated access control. Another way many platforms add value is through a transaction mechanism, which allows the application programmer to indicate that a stretch of code, including multiple accesses to data, represents a single real-world activity, so all of its steps should happen as if they were a single step, even though they are really interleaved with other programs or perhaps cancelled after partial execution. If the platform perfectly hides the interleaving of different activities, the execution is called serializable, and this is a great aid to protecting data quality. Any integrity constraint over the data (whether explicitly declared in the schema or not) that is preserved by each transaction running alone is also valid at the end of any serializable execution of several transactions.
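The claim about integrity constraints can be made concrete with a small simulation. The sketch below is an illustration under assumptions that are not taken from the article: a two-account database, the constraint x + y >= 0, and the hypothetical helpers `withdraw`, `run_serial`, and `run_interleaved`. Each transaction preserves the constraint when it runs alone; the serial schedule keeps it, while a schedule in which both transactions read before either writes breaks it.

```python
def withdraw(account, amount):
    """Model a transaction as two steps: read the total, then conditionally write."""

    def read(db):
        # Step 1: read both balances.
        return db["x"] + db["y"]

    def write(db, total_seen):
        # Step 2: withdraw only if the total that was read stays non-negative.
        if total_seen - amount >= 0:
            db[account] -= amount

    return read, write


def run_serial(db, txns):
    """Run each transaction to completion before starting the next one."""
    for read, write in txns:
        write(db, read(db))
    return db


def run_interleaved(db, txns):
    """A non-serializable schedule: every read happens before any write."""
    snapshots = [read(db) for read, _ in txns]
    for (_, write), seen in zip(txns, snapshots):
        write(db, seen)
    return db


def constraint_holds(db):
    """The illustrative integrity constraint: the combined balance is non-negative."""
    return db["x"] + db["y"] >= 0


serial = run_serial({"x": 50, "y": 50}, [withdraw("x", 80), withdraw("y", 80)])
mixed = run_interleaved({"x": 50, "y": 50}, [withdraw("x", 80), withdraw("y", 80)])
print(serial, constraint_holds(serial))  # {'x': -30, 'y': 50} True
print(mixed, constraint_holds(mixed))    # {'x': -30, 'y': -30} False
```

In the serial run the second withdrawal sees the reduced total and declines to write, which is exactly the protection that a serializable execution provides.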
{"title":"Technical Perspective: When is it safe to run a transactional workload under Read Committed?","authors":"A. Fekete","doi":"10.1145/3604437.3604445","DOIUrl":"https://doi.org/10.1145/3604437.3604445","url":null,"abstract":"A data management platform provides many capabilities to assist the data owner, application coder, or end-user. For example, it should support an expressive query language, schema definition, and sophisticated access control. Another way many platforms add value is through a transaction mechanism, which allows the application programmer to indicate that a stretch of code, including multiple accesses to data, represents a single real-world activity and so all these steps should happen as if a single step, despite really being interleaved with other programs, or perhaps cancelled after partial execution. If the platform perfectly hides interleaving of different activities, the execution is called serializable, and this is a great aid to protecting data quality. Any integrity constraint over the data (whether explicitly declared in schema or not) which is preserved by each transaction running alone, is also valid at the end of any serializable execution of several transactions.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128985257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shangqi Lu, W. Martens, Matthias Niewerth, Yufei Tao
Partial order multiway search (POMS) is an important problem that finds use in crowdsourcing, distributed file systems, software testing, etc. In this problem, a game is played between an algorithm A and an oracle, based on a directed acyclic graph G known to both parties. First, the oracle picks a vertex t in G called the target; then, A aims to figure out which vertex is t by probing reachability. In each probe, A selects a set Q of vertices in G whose size is bounded by a pre-agreed value k, and the oracle then reveals, for each vertex $q \in Q$, whether q can reach the target in G. The objective of A is to minimize the number of probes. This article presents an algorithm to solve POMS in $O(\log_{1+k} n + \frac{d}{k} \log_{1+d} n)$ probes, where n is the number of vertices in G, and d is the largest out-degree of the vertices in G. The probing complexity is asymptotically optimal.
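To make the rules of the probing game concrete, here is a minimal Python sketch of the oracle together with a deliberately naive search strategy. It is not the article's algorithm and makes no attempt to match its probe bound. The names `Oracle`, `naive_search`, and `reachable_from`, the adjacency-dict representation, and the assumption that a single root reaches every vertex are all illustrative choices.

```python
from collections import deque


def reachable_from(graph, source):
    """Return the set of vertices reachable from source, including source itself."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


class Oracle:
    """Holds the hidden target and answers reachability probes of size at most k."""

    def __init__(self, graph, target, k):
        self.graph, self.target, self.k = graph, target, k
        self.probes = 0

    def probe(self, queries):
        if len(queries) > self.k:
            raise ValueError("probe exceeds the agreed size k")
        self.probes += 1
        return {q: self.target in reachable_from(self.graph, q) for q in queries}


def naive_search(graph, root, oracle):
    """Walk down from the root, probing out-neighbours in batches of at most k.

    Invariant: the current vertex can reach the target. If no out-neighbour
    can reach the target, the current vertex must be the target itself.
    """
    current = root
    while True:
        children = graph.get(current, [])
        next_vertex = None
        for i in range(0, len(children), oracle.k):
            answers = oracle.probe(children[i:i + oracle.k])
            hits = [c for c, yes in answers.items() if yes]
            if hits:
                next_vertex = hits[0]
                break
        if next_vertex is None:
            return current
        current = next_vertex


# A small DAG given as an adjacency dict; "a" reaches every vertex.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
oracle = Oracle(dag, target="e", k=2)
print(naive_search(dag, "a", oracle), oracle.probes)  # e 2
```

This strategy can spend roughly d/k probes at every level of a long path, which is the kind of cost a more careful algorithm has to avoid.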
{"title":"An Optimal Algorithm for Partial Order Multiway Search","authors":"Shangqi Lu, W. Martens, Matthias Niewerth, Yufei Tao","doi":"10.1145/3604437.3604456","DOIUrl":"https://doi.org/10.1145/3604437.3604456","url":null,"abstract":"Partial order multiway search (POMS) is an important problem that finds use in crowdsourcing, distributed file systems, software testing, etc. In this problem, a game is played between an algorithm A and an oracle, based on a directed acyclic graph G known to both parties. First, the oracle picks a vertex t in G called the target; then, A aims to figure out which vertex is t by probing reachability. In each probe, A selects a set Q of vertices in G whose size is bounded by a pre-agreed value k, and the oracle then reveals, for each vertex q ∈ Q, whether q can reach the target in G. The objective of A is to minimize the number of probes. This article presents an algorithm to solve POMS in O(log_{1+k} n + (d/k) log_{1+d} n) probes, where n is the number of vertices in G, and d is the largest out-degree of the vertices in G. The probing complexity is asymptotically optimal.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115197016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph processing is becoming ubiquitous due to the proliferation of interconnected data in several domains, including life sciences, social networks, cybersecurity, finance and logistics, to name a few. In parallel with the growth of the underlying graph data sources, a plethora of graph workloads have appeared, ranging from graph analytics to graph traversals and graph pattern matching. Graph systems executing both complex and simple graph workloads need to leverage adequate data structures for efficiently processing heterogeneous graph data. While the underlying graph data structures have been extensively studied for the static case, they are less understood for the dynamic case, with the data undergoing several updates per second. Moreover, the existing solutions suffer from a lack of generality, as they focus on one specific requirement and workload type at a time. Designing a universal data structure that adapts to several kinds of graph workloads in a dynamic setting and achieves significant efficiency on all of them is far from trivial.
{"title":"Technical Perspective: Sortledton: a Universal Graph Data Structure","authors":"A. Bonifati","doi":"10.1145/3604437.3604441","DOIUrl":"https://doi.org/10.1145/3604437.3604441","url":null,"abstract":"Graph processing is becoming ubiquitous due to the proliferation of interconnected data in several domains, including life sciences, social networks, cybersecurity, finance and logistics, to name a few. In parallel with the growth of the underlying graph data sources, a plethora of graph workloads have appeared, ranging from graph analytics to graph traversals and graph pattern matching. Graph systems executing both complex and simple graph workloads need to leverage adequate data structures for efficiently processing heterogeneous graph data. While the underlying graph data structures have been extensively studied for the static case, they are less understood for the dynamic case, with the data undergoing several updates per second. Moreover, the existing solutions suffer lack of generality, as they focus on one specific requirement and workload type at a time. Designing a universal data structure that adapts to several kinds of graph workloads in a dynamic setting and achieves significant efficiency on all of them is far from being trivial.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125067800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Separation of compute and storage has become the de facto standard for cloud database systems. First proposed in 2007 for database systems [2], it is now widely adopted by all major cloud providers such as Amazon Redshift, Google BigQuery, and Snowflake. Separation of compute and storage adds enormous value for the customer. Users can scale storage independently of compute, which enables them to pay only for what they really use. Consider a scenario in which data grows linearly over time, but most queries only access the last month of data, which remains relatively stable. Without the separation of compute and storage, the user would gradually be forced to significantly increase the database cluster capacity. In contrast, modern cloud database systems allow scaling the storage separately from compute; the compute cluster stays the same over time, whereas the data is stored on cheap cloud storage services, like Amazon S3.
{"title":"Technical Perspective for Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory","authors":"Tim Kraska","doi":"10.1145/3604437.3604447","DOIUrl":"https://doi.org/10.1145/3604437.3604447","url":null,"abstract":"Separation of compute and storage has become the de facto standard for cloud database systems. First proposed in 2007 for database systems [2], it is now widely adopted by all major cloud providers such as Amazon Redshift, Google BigQuery, and Snowflake. Separation of compute and storage adds enormous value for the customer. Users can scale storage independently of compute, which enables them to pay only for what they really use. Consider a scenario in which data grows linearly over time, but most queries only access the last month of data, which remains relatively stable. Without the separation of compute and storage, the user would gradually be forced to significantly increase the database cluster capacity. In contrast, modern cloud database systems allow scaling the storage separately from compute; the compute cluster stays the same over time, whereas the data is stored on cheap cloud storage services, like Amazon S3.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129742433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We often write queries using LIMIT k, indicating that only k answers are to be returned. This feature is present in most query languages, for different data models: SQL, SPARQL, Cypher etc. For example, in a repository of about 250M SPARQL queries, about 15M queries are of this form. Not surprisingly of course, the database research community studied such queries extensively. The dominant setting is this: there is an ordering on tuples that can be returned by a query. Then the answer is limited to the first k tuples in this ordering.
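The dominant setting, an ordering on tuples with the answer cut off after the first k, can be sketched in a few lines of Python. This is only an illustration of the semantics, not of any technique from the article; the function names and the sample rows are made up. The second variant uses the standard-library `heapq.nsmallest`, which avoids sorting the entire input when k is small.

```python
import heapq


def limit_k_sorted(tuples, k, key):
    """Order all tuples, then keep the first k (the ORDER BY ... LIMIT k semantics)."""
    return sorted(tuples, key=key)[:k]


def limit_k_heap(tuples, k, key):
    """Same answer, computed without fully sorting the input."""
    return heapq.nsmallest(k, tuples, key=key)


# Illustrative rows of (name, age); ordering is by age.
rows = [("alice", 30), ("bob", 25), ("carol", 35), ("dave", 28)]
print(limit_k_sorted(rows, 2, key=lambda r: r[1]))  # [('bob', 25), ('dave', 28)]
print(limit_k_heap(rows, 2, key=lambda r: r[1]))    # [('bob', 25), ('dave', 28)]
```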
{"title":"Technical Perspective: Query Answers - Fewer is Faster","authors":"L. Libkin","doi":"10.1145/3604437.3604451","DOIUrl":"https://doi.org/10.1145/3604437.3604451","url":null,"abstract":"We often write queries using LIMIT k, indicating that only k answers are to be returned. This feature is present in most query languages, for different data models: SQL, SPARQL, Cypher etc. For example, in a repository of about 250M SPARQL queries, about 15M queries are of this form. Not surprisingly of course, the database research community studied such queries extensively. The dominant setting is this: there is an ordering on tuples that can be returned by a query. Then the answer is limited to the first k tuples in this ordering.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134313150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conjunctive queries with predicates in the form of comparisons that span multiple relations have regained interest recently, due to their relevance in OLAP queries, spatiotemporal databases, and machine learning over relational data. The standard technique, predicate pushdown, has limited efficacy on such comparisons. A technique by Willard can be used to process short comparisons that are adjacent in the join tree in time linear in the input size plus output size. In this paper, we describe a new algorithm for evaluating conjunctive queries with both short and long comparisons, and identify an acyclic condition under which linear time can be achieved. We have also implemented the new algorithm on top of Spark, and our experimental results demonstrate order-of-magnitude speedups over SparkSQL on a variety of graph patterns and analytical queries.
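As a concrete instance of a comparison that spans two relations, consider the query Q(a, b, c) :- R(a, b), S(b, c), a < c. The sketch below evaluates it the straightforward way, with a hash join followed by a filter. It is a baseline for illustration only, not the algorithm described in the paper, and the relation contents are invented. The predicate a < c mentions attributes of both R and S, so it cannot be applied to either relation on its own before the join.

```python
from collections import defaultdict


def join_with_comparison(R, S):
    """Evaluate Q(a, b, c) :- R(a, b), S(b, c), a < c by hash join, then filter."""
    # Index S on the join attribute b.
    by_b = defaultdict(list)
    for b, c in S:
        by_b[b].append(c)
    # Join each R tuple with its matching S tuples and apply the comparison.
    return [(a, b, c) for a, b in R for c in by_b.get(b, []) if a < c]


R = [(1, "u"), (5, "u"), (2, "v")]
S = [("u", 3), ("u", 7), ("v", 1)]
print(join_with_comparison(R, S))  # [(1, 'u', 3), (1, 'u', 7), (5, 'u', 7)]
```

The join itself produces five tuples here and the filter discards two of them; on larger inputs the intermediate result of such a plan can be far bigger than the final output.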
{"title":"Conjunctive Queries with Comparisons","authors":"Qichen Wang, K. Yi","doi":"10.1145/3604437.3604450","DOIUrl":"https://doi.org/10.1145/3604437.3604450","url":null,"abstract":"Conjunctive queries with predicates in the form of comparisons that span multiple relations have regained interest recently, due to their relevance in OLAP queries, spatiotemporal databases, and machine learning over relational data. The standard technique, predicate pushdown, has limited efficacy on such comparisons. A technique by Willard can be used to process short comparisons that are adjacent in the join tree in time linear in the input size plus output size. In this paper, we describe a new algorithm for evaluating conjunctive queries with both short and long comparisons, and identify an acyclic condition under which linear time can be achieved. We have also implemented the new algorithm on top of Spark, and our experimental results demonstrate order-of-magnitude speedups over SparkSQL on a variety of graph patterns and analytical queries.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129836803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianqiu Zhang, Kaisong Huang, Tianzheng Wang, King Lv
Database systems are becoming increasingly multi-engine. In particular, a main-memory engine may coexist with a traditional storage-centric engine in a system to support various applications. It is desirable to allow applications to access data in both engines using cross-engine transactions. But existing systems are either only designed for single-engine accesses, or impose many restrictions by limiting cross-engine transactions to certain isolation levels and operations. The result is inadequate cross-engine support in terms of correctness, performance and programmability.
{"title":"Efficiently Making Cross-Engine Transactions Consistent","authors":"Jianqiu Zhang, Kaisong Huang, Tianzheng Wang, King Lv","doi":"10.1145/3604437.3604444","DOIUrl":"https://doi.org/10.1145/3604437.3604444","url":null,"abstract":"Database systems are becoming increasingly multi-engine. In particular, a main-memory engine may coexist with a traditional storage-centric engine in a system to support various applications. It is desirable to allow applications to access data in both engines using cross-engine transactions. But existing systems are either only designed for single-engine accesses, or impose many restrictions by limiting cross-engine transactions to certain isolation levels and operations. The result is inadequate cross-engine support in terms of correctness, performance and programmability.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116942549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query engines are really good at choosing an efficient query plan. Users don't need to worry about how they write their query, since the optimizer makes all the right choices for executing the query, while taking into account all aspects of the data, such as its size, the characteristics of the storage device, the distribution pattern, the availability of indexes, and so on. The query optimizer always makes the best choice, no matter how complex the query is or how contrived its formulation. Or, this is what we expect today from a modern query optimizer. Unfortunately, reality is not as nice.
{"title":"Technical Perspective: Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs","authors":"Dan Suciu","doi":"10.1145/3604437.3604457","DOIUrl":"https://doi.org/10.1145/3604437.3604457","url":null,"abstract":"Query engines are really good at choosing an efficient query plan. Users don't need to worry about how they write their query, since the optimizer makes all the right choices for executing the query, while taking into account all aspects of data, such as its size, the characteristics of the storage device, the distribution pattern, the availability of indexes, and so on. The query optimizer always makes the best choice, no matter how complex the query is, or how contrived it was written. Or, this is what we expect today from a modern query optimizer. Unfortunately, reality is not as nice.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133538883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a list of comparable items $A = \{a_1, \ldots, a_n\}$ sorted so that $a_1 < a_2 < \cdots < a_n$, a canonical problem is locating a target item q within A if it exists. The canonical algorithm for this problem, of course, is binary search, which locates q using at most O(log n) comparisons between q and elements of A. Binary search is an indispensable tool for totally ordered datasets. However, many naturally occurring datasets are only partially ordered (posets), meaning that not all pairs of elements are comparable. Every such poset can be expressed as a directed acyclic graph (DAG), with edges (x,y) representing the relation x < y.
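Both halves of this paragraph can be illustrated briefly in Python. The sketch below first runs binary search on a sorted list while counting probes, then encodes a small poset as a DAG and tests whether two elements are comparable. The divisibility example, the adjacency-dict encoding, and the function names are assumptions chosen for illustration.

```python
def binary_search(items, q):
    """Return (index of q or None, number of elements of items that were probed)."""
    lo, hi = 0, len(items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == q:
            return mid, probes
        if items[mid] < q:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


def less_than(edges, x, y):
    """True if x < y in the poset, i.e. y is reachable from x by at least one edge."""
    stack, seen = [x], set()
    while stack:
        u = stack.pop()
        for v in edges.get(u, []):
            if v == y:
                return True
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def comparable(edges, x, y):
    """Two elements are comparable if they are equal or one is below the other."""
    return x == y or less_than(edges, x, y) or less_than(edges, y, x)


print(binary_search([1, 3, 5, 7, 9, 11, 13], 11))  # (5, 2)
print(binary_search([1, 3, 5, 7, 9, 11, 13], 4))   # (None, 3)

# Divisors of 12 ordered by divisibility; an edge (x, y) means x < y.
divides = {1: [2, 3], 2: [4, 6], 3: [6], 4: [12], 6: [12], 12: []}
print(comparable(divides, 2, 12))  # True
print(comparable(divides, 2, 3))   # False
```

The pair (2, 3) shows why binary search does not carry over directly: neither element divides the other, so a single comparison against one of them does not split the remaining candidates the way it does in a total order.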
{"title":"Technical Perspective: Optimal Algorithms for Multiway Search on Partial Orders","authors":"Rajesh Jayaram","doi":"10.1145/3604437.3604455","DOIUrl":"https://doi.org/10.1145/3604437.3604455","url":null,"abstract":"Given a list of comparable items A = {a_1, ..., a_n} sorted so that a_1 < a_2 < ... < a_n, a canonical problem is locating a target item q within A if it exists. The canonical algorithm for this problem, of course, is binary search, which locates q using at most O(log n) comparisons between q and elements of A. Binary search is an indispensable tool for totally ordered datasets. However, many naturally occurring datasets are only partially ordered (posets), meaning that not all pairs of elements are comparable. Every such poset can be expressed as a directed acyclic graph (DAG), with edges (x,y) representing the relation x < y.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125088601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}