Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626306
Aaditya Naik, Aalok Thakkar, Adam Stein, R. Alur, Mayur Naik
We study the problem of synthesizing a core fragment of relational queries called select-project-join (SPJ) queries from input-output examples. Search-based synthesis techniques are suited to synthesizing projections and joins by navigating the network of relational tables but require additional supervision for synthesizing comparison predicates. On the other hand, decision tree learning techniques are suited to synthesizing comparison predicates when the input database can be summarized as a single labelled relational table. In this paper, we adapt and interleave methods from the domains of relational query synthesis and decision tree learning, and present an end-to-end framework for synthesizing relational queries with categorical and numerical comparison predicates. Our technique guarantees the completeness of the synthesis procedure and strongly encourages minimality of the synthesized program. We present Libra, an implementation of this technique and evaluate it on a benchmark suite of 1,475 instances of queries over 159 databases with multiple tables. Libra solves 1,361 of these instances in an average of 59 seconds per instance. It outperforms state-of-the-art program synthesis tools Scythe and PatSQL in terms of both the running time and the quality of the synthesized programs.
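The claim that decision tree learning suits comparison predicates over a single labelled table can be illustrated with a depth-one tree (a decision stump). The following is a minimal sketch under stated assumptions, not Libra's algorithm: the table, the `learn_threshold` helper, and the restriction to predicates of the form `value >= t` are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# (attribute value, True if the row appears in the desired query output)
Row = Tuple[float, bool]

def learn_threshold(rows: List[Row]) -> Optional[float]:
    """Return t such that `value >= t` holds exactly for the positive rows.

    Returns None when no single threshold separates the labels.
    """
    values = sorted({v for v, _ in rows})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in candidates:
        if all((v >= t) == label for v, label in rows):
            return t
    return None

# Hypothetical labelled table: salaries and whether each row is a desired output.
table = [(30.0, False), (45.0, False), (60.0, True), (80.0, True)]
threshold = learn_threshold(table)  # 52.5, i.e. the predicate "salary >= 52.5"
```

A full synthesizer would also have to choose which tables to join and which columns to project before a labelled table like this one exists, which is the part the abstract assigns to search-based techniques.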
{"title":"Relational Query Synthesis ⋈ Decision Tree Learning","authors":"Aaditya Naik, Aalok Thakkar, Adam Stein, R. Alur, Mayur Naik","doi":"10.14778/3626292.3626306","DOIUrl":"https://doi.org/10.14778/3626292.3626306","journal":{"name":"Proc. VLDB Endow.","volume":"28 1","pages":"250-263"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626295
Jinyang Li, Y. Moskovitch, Julia Stoyanovich, H. V. Jagadish
Diversity, group representation, and similar needs often apply to query results, which in turn require constraints on the sizes of various subgroups in the result set. Traditional relational queries only specify conditions as part of the query predicate(s), and do not support such restrictions on the output. In this paper, we study the problem of modifying queries to have the result satisfy constraints on the sizes of multiple subgroups in it. This problem, in the worst case, cannot be solved in polynomial time. Yet, with the help of provenance annotation, we are able to develop a query refinement method that works quite efficiently, as we demonstrate through extensive experiments.
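To make the notion of subgroup-size constraints on a query result concrete, here is a minimal sketch. It is a brute-force illustration, not the provenance-based method the abstract describes; the applicant data, the single `score >= t` predicate, and the `refine` helper are assumptions introduced for this example.

```python
from collections import Counter

# Hypothetical applicants: (name, group, score)
rows = [("a", "F", 91), ("b", "M", 88), ("c", "M", 85),
        ("d", "F", 79), ("e", "M", 75), ("f", "F", 70)]

def run_query(min_score):
    """The original query: select rows with score >= min_score."""
    return [r for r in rows if r[2] >= min_score]

def satisfies(result, lower_bounds):
    """Check that each group has at least the required number of rows."""
    counts = Counter(group for _, group, _ in result)
    return all(counts[g] >= k for g, k in lower_bounds.items())

def refine(min_score, lower_bounds):
    """Relax the threshold as little as possible until the constraints hold."""
    lower = sorted({s for _, _, s in rows if s <= min_score}, reverse=True)
    for t in [min_score] + lower:
        if satisfies(run_query(t), lower_bounds):
            return t
    return None  # no relaxation of this predicate can satisfy the constraints

refined = refine(85, {"F": 2, "M": 2})  # 79: the closest threshold with two rows per group
```

With several predicates and both lower and upper bounds, the space of candidate refinements grows combinatorially, which is consistent with the abstract's remark that the problem cannot be solved in polynomial time in the worst case.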
{"title":"Query Refinement for Diversity Constraint Satisfaction","authors":"Jinyang Li, Y. Moskovitch, Julia Stoyanovich, H. V. Jagadish","doi":"10.14778/3626292.3626295","DOIUrl":"https://doi.org/10.14778/3626292.3626295","journal":{"name":"Proc. VLDB Endow.","volume":"14 1","pages":"106-118"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626305
Sijing Duan, Feng Lyu, Xin Zhu, Yi Ding, Haotian Wang, Desheng Zhang, Xue Liu, Yaoxue Zhang, Ju Ren
For a nationwide logistics transportation system, it is critical to make vehicle loading plans (i.e., given many packages, deciding vehicle types and numbers) at each sorting and distribution center. In many logistics companies, this task is currently completed by dispatchers at each center and imposes a heavy workload on them. Existing works formulate the task as a cargo loading problem and solve it with combinatorial optimization methods. However, this approach does not work in some real-world nationwide applications because accurate cargo volume information is lacking and because effective models are hard to design under complicated impact factors and temporal correlation. In this paper, we explore a new opportunity to utilize large-scale route and human behavior data (i.e., dispatchers' decision process on planning vehicles) to generate vehicle loading plans. Specifically, we collect a five-month nationwide operational dataset from JD Logistics in China and comprehensively analyze human behaviors. Based on the insights from this data-driven analysis, we design a Vehicle Loading Plan learning model, named VeLP, which consists of a pattern mining module and a deep temporal cross neural network to learn human behaviors on regular and irregular routes, respectively. Extensive experiments demonstrate the superiority of VeLP, which improves performance over the baselines by 35.8% on trunk routes and 50% on branch routes. In addition, we deployed VeLP in JDL and applied it to about 400 routes, reducing the time needed to create plans by approximately 20%. It saves significant human workload and improves operational efficiency for the logistics company.
{"title":"VeLP: Vehicle Loading Plan Learning from Human Behavior in Nationwide Logistics System","authors":"Sijing Duan, Feng Lyu, Xin Zhu, Yi Ding, Haotian Wang, Desheng Zhang, Xue Liu, Yaoxue Zhang, Ju Ren","doi":"10.14778/3626292.3626305","DOIUrl":"https://doi.org/10.14778/3626292.3626305","journal":{"name":"Proc. VLDB Endow.","volume":"29 1","pages":"241-249"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626293
Ruidi Wei, F. Kerschbaum
Private record linkage (PRL) is the problem of identifying pairs of records that approximately match across datasets in a secure, privacy-preserving manner. Two-party PRL specifically allows each party to obtain records from the other party only if each such record matches one of its own. The privacy goal is that no information about the datasets other than the matching records should be released. A fundamental challenge is to avoid leaking information while also avoiding a comparison of all pairs of records. In plaintext record linkage, this is done using a blocking strategy, e.g., locality-sensitive hashing. One recent approach proposed by He et al. (ACM CCS 2017) uses locality-sensitive hashing and then releases a provably differentially private representation of the hash bins. However, differential privacy still leaks some information, albeit a provably bounded amount, and does not protect against attacks such as property inference attacks. Another recent approach by Khurram and Kerschbaum (IEEE ICDE 2020) uses locality-preserving hashing and provides cryptographic security, i.e., it releases no information except the output. However, locality-preserving hash functions are much harder to construct than locality-sensitive hash functions, and hence the accuracy of this approach is limited, particularly on larger datasets. In this paper, we address the open problem of providing cryptographic security for PRL while using locality-sensitive hash functions. Using recent results in oblivious algorithms, we design a new cryptographically secure PRL with locality-sensitive hash functions. Our prototype implementation can match 40,000 records in the British National Library/Toronto Public Library and the North Carolina Voter Registry datasets with 99.3% and 99.9% accuracy, respectively, in less than an hour, which is more than an order of magnitude faster than Khurram and Kerschbaum's work, and at a higher accuracy.
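The plaintext blocking step mentioned above can be sketched with MinHash-based locality-sensitive hashing. This is only the non-private baseline, not the cryptographically secure protocol of the paper; the shingle size, band layout, hash construction, and sample records are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=2):
    """Character k-grams of a lowercased string (text must have at least k characters)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes):
    """MinHash signature: the minimum of each seeded hash over the shingle set."""
    return tuple(
        min(int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    )

def candidate_pairs(records, bands=4, rows_per_band=2):
    """Records that share at least one signature band land in the same bucket."""
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash(shingles(text), bands * rows_per_band)
        for b in range(bands):
            band = sig[b * rows_per_band:(b + 1) * rows_per_band]
            buckets[(b, band)].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records; only pairs sharing a bucket are compared in detail later.
records = {"a": "john smith", "b": "John Smith", "c": "zzzz qqqq"}
pairs = candidate_pairs(records)
```

Only the pairs returned here would be passed to an exact similarity comparison, which is how blocking avoids comparing all pairs. In the private setting, the bucket contents themselves reveal information, which is the leakage the abstract sets out to eliminate.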
{"title":"Cryptographically Secure Private Record Linkage Using Locality-Sensitive Hashing","authors":"Ruidi Wei, F. Kerschbaum","doi":"10.14778/3626292.3626293","DOIUrl":"https://doi.org/10.14778/3626292.3626293","journal":{"name":"Proc. VLDB Endow.","volume":"84 1","pages":"79-91"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626300
Xueyi Wu, Yuanyuan Xu, Wenjie Zhang, Ying Zhang
Bipartite graph embedding (BGE), a fundamental task in bipartite network analysis, maps each node to a compact low-dimensional vector that preserves intrinsic properties. Existing solutions for BGE fall into two groups: metric-based methods and graph neural network-based (GNN-based) methods. The latter typically generate higher-quality embeddings than the former due to the strong representation ability of deep learning. Nevertheless, none of the existing GNN-based methods can handle billion-scale bipartite graphs, due to expensive message passing or complex modelling choices. Hence, existing solutions struggle to achieve both embedding quality and model scalability. Motivated by this, we propose a novel graph neural network named AnchorGNN, based on a global-local learning framework, which can generate high-quality BGE and scale to billion-scale bipartite graphs. Concretely, AnchorGNN leverages a novel anchor-based message passing schema for global learning, which enables global knowledge to be incorporated when generating node embeddings. Meanwhile, AnchorGNN offers efficient one-hop local structure modelling using maximum likelihood estimation for bipartite graphs with rational analysis, avoiding the construction of a large adjacency matrix. Both global information and local structure are integrated to generate distinguishable node embeddings. Extensive experiments demonstrate that AnchorGNN outperforms the best competitor by up to 36% in accuracy and achieves up to a 28-fold speed-up over the only metric-based baseline on billion-scale bipartite graphs.
{"title":"Billion-Scale Bipartite Graph Embedding: A Global-Local Induced Approach","authors":"Xueyi Wu, Yuanyuan Xu, Wenjie Zhang, Ying Zhang","doi":"10.14778/3626292.3626300","DOIUrl":"https://doi.org/10.14778/3626292.3626300","journal":{"name":"Proc. VLDB Endow.","volume":"25 1","pages":"175-183"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-10-01 DOI: 10.14778/3626292.3626301
Wangze Ni, Pengze Chen, Lei Chen, Peng Cheng, Chen Zhang, Xuemin Lin
The payment channel network (PCN) is a promising solution to increase the throughput of blockchains. However, unidirectional transactions can deplete a user's deposits in a payment channel (PC), reducing the success ratio of transactions (SRoT). To address this depletion issue, rebalance protocols are used to shift tokens from well-deposited PCs to under-deposited PCs. To improve SRoT, it is beneficial to increase the balance of a PC with a lower balance and a higher weight (i.e., more transaction executions rely on the PC). In this paper, we define the utility of a transaction and the utility-aware rebalance (UAR) problem. The utility of a transaction is proportional to the weight of the PC and the amount of the transaction, and inversely proportional to the balance of the receiver. To maximize the effect of improving SRoT, UAR aims to find a set of transactions with maximized utilities, satisfying the budget and conservation constraints. The budget constraint limits the number of tokens shifted in a PC. The conservation constraint requires that the number of tokens each user sends equals the number of tokens received. We prove that UAR is NP-hard and cannot be approximately solved with a constant ratio. Thus, we propose two heuristic algorithms, namely Circuit Greedy and UAR_DC. Extensive experiments show that our approaches outperform the existing approach by at least 3.16 times in terms of utilities.
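The utility definition and the conservation constraint can be written out as a short sketch. The abstract only states proportionality, so the exact form `weight * amount / receiver_balance` (with proportionality constant 1), the `Transfer` record, and the sample cycle are assumptions made for illustration; receiver balances are assumed to be positive.

```python
from collections import defaultdict
from typing import List, NamedTuple

class Transfer(NamedTuple):
    sender: str
    receiver: str
    amount: float            # tokens shifted through the payment channel
    weight: float            # weight of the payment channel
    receiver_balance: float  # receiver's balance in the channel before the transfer

def utility(t: Transfer) -> float:
    """Proportional to weight and amount, inversely proportional to receiver balance."""
    return t.weight * t.amount / t.receiver_balance

def conserves(transfers: List[Transfer]) -> bool:
    """Conservation constraint: every user sends exactly as many tokens as it receives."""
    net = defaultdict(float)
    for t in transfers:
        net[t.sender] -= t.amount
        net[t.receiver] += t.amount
    return all(abs(v) < 1e-9 for v in net.values())

# Hypothetical rebalancing cycle A -> B -> C -> A, shifting 10 tokens on each hop.
cycle = [
    Transfer("A", "B", 10, weight=3, receiver_balance=5),
    Transfer("B", "C", 10, weight=1, receiver_balance=20),
    Transfer("C", "A", 10, weight=2, receiver_balance=10),
]
total = sum(utility(t) for t in cycle)  # 6.0 + 0.5 + 2.0 = 8.5
```

A cycle satisfies conservation by construction, which suggests why a heuristic named Circuit Greedy would work with circuits, although the abstract does not describe the algorithm itself. The budget constraint would additionally cap `amount` per channel.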
{"title":"Utility-aware Payment Channel Network Rebalance","authors":"Wangze Ni, Pengze Chen, Lei Chen, Peng Cheng, Chen Zhang, Xuemin Lin","doi":"10.14778/3626292.3626301","DOIUrl":"https://doi.org/10.14778/3626292.3626301","journal":{"name":"Proc. VLDB Endow.","volume":"8 1","pages":"184-196"},"publicationDate":"2023-10-01","publicationTypes":"Journal Article"}
Pub Date: 2023-09-01 DOI: 10.14778/3625054.3625067
Javad Ghareh Chamani, I. Demertzis, Dimitrios Papadopoulos, Charalampos Papamanthou, R. Jalili
We propose GraphOS, a system that allows a client that owns a graph database to outsource it to an untrusted server for storage and querying. It relies on doubly-oblivious primitives and trusted hardware to achieve a very strong privacy and efficiency notion which we call oblivious graph processing: the server learns nothing besides the number of graph vertices and edges and, for each query, its type and response size. At a technical level, GraphOS stores the graph on a doubly-oblivious data structure, so that all vertex/edge accesses are indistinguishable. For this purpose, we propose Omix++, a novel doubly-oblivious map that outperforms the previous state of the art by up to 34× and may be of independent interest. Moreover, to avoid any leakage from CPU instruction-fetching during query evaluation, we propose algorithms for four fundamental graph queries (BFS/DFS traversal, minimum spanning tree, and single-source shortest paths) that have a fixed execution trace, i.e., the sequence of executed operations is independent of the input. By combining these techniques, we eliminate all information that a hardware adversary observing the memory access pattern within the protected enclave can infer. We benchmarked GraphOS against the best existing solution, which is based on an oblivious relational DBMS (translating graph queries to relational operators). GraphOS is not only significantly more performant (by up to two orders of magnitude for our tested graphs) but also eliminates leakage related to the graph topology, which is practically inherent when a relational DBMS is used unless all operations are "padded" to the worst case.
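The idea of a fixed execution trace can be illustrated with a lookup that touches every slot and selects the match arithmetically rather than by branching. This is a minimal sketch of the general principle only: it is a linear scan, not Omix++ or any GraphOS algorithm, and Python itself offers no real side-channel guarantees. The `oblivious_lookup` helper and the `trace` list are hypothetical names used for illustration.

```python
def oblivious_lookup(table, key, trace):
    """Scan every slot; pick the matching value without a data-dependent branch."""
    result = 0
    for i, (k, v) in enumerate(table):
        trace.append(i)           # the access pattern an observer would see
        match = int(k == key)     # 1 if this slot holds the key, otherwise 0
        result = match * v + (1 - match) * result
    return result

# Hypothetical vertex-to-value map stored as a flat list of (key, value) slots.
adjacency = [(1, 10), (2, 20), (3, 30)]
trace_a, trace_b = [], []
a = oblivious_lookup(adjacency, 2, trace_a)  # 20
b = oblivious_lookup(adjacency, 3, trace_b)  # 30
```

Both calls record the same trace `[0, 1, 2]`, so the sequence of accesses reveals nothing about which key was requested. The cost is a full scan per lookup, which is the kind of overhead that dedicated oblivious data structures aim to reduce.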
{"title":"GraphOS: Towards Oblivious Graph Processing","authors":"Javad Ghareh Chamani, I. Demertzis, Dimitrios Papadopoulos, Charalampos Papamanthou, R. Jalili","doi":"10.14778/3625054.3625067","DOIUrl":"https://doi.org/10.14778/3625054.3625067","journal":{"name":"Proc. VLDB Endow.","volume":"33 1","pages":"4324-4338"},"publicationDate":"2023-09-01","publicationTypes":"Journal Article"}
Pub Date: 2023-09-01 DOI: 10.14778/3625054.3625055
Lina Qiu, Georgios Kellaris, N. Mamoulis, Kobbi Nissim, G. Kollios
Most cloud service providers offer limited data privacy guarantees, discouraging clients from using them to manage sensitive data. Cloud providers may use servers with Trusted Execution Environments (TEEs) to protect outsourced data while supporting remote querying. However, TEEs may leak access patterns and allow communication volume attacks, enabling an honest-but-curious cloud provider to learn sensitive information. Oblivious algorithms can be used to completely hide data access patterns, but their high overhead could render them impractical. To alleviate this overhead, the notion of Differential Obliviousness (DO) has recently been proposed. DO applies differential privacy (DP) to access patterns while hiding the communication volume of intermediate and final results; it does so by trading some level of privacy for efficiency. We present Doquet: Differentially Oblivious Range and Join Queries with Private Data Structures, a framework for DO outsourced database systems. Doquet is the first approach that supports private data structures, indices, selection, foreign key join, many-to-many join, and their composition select-join in a realistic TEE setting, even when accesses to the private memory can be eavesdropped on by the adversary. We prove that the algorithms in Doquet satisfy differential obliviousness. Furthermore, we implemented Doquet and tested it on a machine with second-generation Intel SGX (TEE); the results show that Doquet offers up to an order of magnitude speedup in comparison with other fully oblivious and differentially oblivious approaches.
{"title":"Doquet: Differentially Oblivious Range and Join Queries with Private Data Structures","authors":"Lina Qiu, Georgios Kellaris, N. Mamoulis, Kobbi Nissim, G. Kollios","doi":"10.14778/3625054.3625055","DOIUrl":"https://doi.org/10.14778/3625054.3625055","journal":{"name":"Proc. VLDB Endow.","volume":"58 1","pages":"4160-4173"},"publicationDate":"2023-09-01","publicationTypes":"Journal Article"}
Pub Date: 2023-09-01 DOI: 10.14778/3625054.3625065
D. Melissourgos, Haibo Wang, Shigang Chen, Chaoyi Ma, Shiping Chen
Per-flow size measurement is key to many streaming applications and management systems, particularly in high-speed networks. Performing such measurement on the data plane of a network device at line rate requires on-chip memory and computing resources that are shared with other key network functions. This creates the need for very compact and fast data structures, called sketches, which trade off accuracy against space. The same need arises in other application contexts that involve extremely large data sets. The goal of sketch design is two-fold: to measure flow size as accurately as possible and to do so as efficiently as possible (for low overhead and thus high processing throughput). Existing sketches can be broadly categorized into multi-update sketches and single-update sketches. The former are more accurate but carry larger overhead; the latter incur small overhead but have poor accuracy. This paper proposes a Single update Sketch with a Variable counter Structure (SSVS), a new sketch design that is several times faster than existing multi-update sketches with comparable accuracy, and several times more accurate than existing single-update sketches with comparable overhead. The new design embodies several technical contributions that integrate the enabling properties of both multi-update and single-update sketches in a novel structure that effectively controls the measurement error with minimum processing overhead.
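To make the multi-update versus single-update distinction concrete, the following is a minimal Python sketch. It is not SSVS, whose variable counter structure the abstract does not detail. `CountMin` is the textbook Count-Min sketch, used here as a representative multi-update design; `SingleUpdate` is a deliberately simple one-counter-per-packet array chosen for illustration. The class names, the `_hash` helper, and the parameters `depth` and `width` are all assumptions.

```python
import hashlib


def _hash(key: str, seed: int, width: int) -> int:
    """Map a flow key to a counter index; the seed selects an independent hash."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % width


class CountMin:
    """Multi-update sketch: each packet updates one counter in every row."""

    def __init__(self, depth: int, width: int):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def update(self, flow: str, size: int = 1) -> None:
        for seed, row in enumerate(self.rows):
            row[_hash(flow, seed, self.width)] += size

    def query(self, flow: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(
            row[_hash(flow, seed, self.width)]
            for seed, row in enumerate(self.rows)
        )


class SingleUpdate:
    """Single-update sketch (illustrative): each packet updates exactly one counter."""

    def __init__(self, width: int):
        self.width = width
        self.counters = [0] * width

    def update(self, flow: str, size: int = 1) -> None:
        self.counters[_hash(flow, 0, self.width)] += size

    def query(self, flow: str) -> int:
        return self.counters[_hash(flow, 0, self.width)]


cm, su = CountMin(depth=4, width=64), SingleUpdate(width=64)
for flow in ["a"] * 5 + ["b"] * 3:
    cm.update(flow)
    su.update(flow)
```

Here each of the 8 packets costs `CountMin` four counter writes and `SingleUpdate` one, which is the overhead gap the abstract describes. Both structures can only overestimate a flow's size; taking the minimum across rows is what lets the multi-update design limit the error caused by hash collisions.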
{"title":"Single Update Sketch with Variable Counter Structure","authors":"D. Melissourgos, Haibo Wang, Shigang Chen, Chaoyi Ma, Shiping Chen","doi":"10.14778/3625054.3625065","DOIUrl":"https://doi.org/10.14778/3625054.3625065","url":null,"abstract":"Per-flow size measurement is key to many streaming applications and management systems, particularly in high-speed networks. Performing such measurement on the data plane of a network device at the line rate requires on-chip memory and computing resources that are shared by other key network functions. It leads to the need for very compact and fast data structures, called sketches, which trade off space for accuracy. Such a need also arises in other application context for extremely large data sets. The goal of sketch design is two-fold: to measure flow size as accurately as possible and to do so as efficiently as possible (for low overhead and thus high processing throughput). The existing sketches can be broadly categorized to multi-update sketches and single update sketches. The former are more accurate but carry larger overhead. The latter incur small overhead but their accuracy is poor. This paper proposes a Single update Sketch with a Variable counter Structure (SSVS), a new sketch design which is several times faster than the existing multi-update sketches with comparable accuracy, and is several times more accurate than the existing single update sketches with comparable overhead. The new sketch design embodies several technical contributions that integrate the enabling properties from both multi-update sketches and single update sketches in a novel structure that effectively controls the measurement error with minimum processing overhead.","PeriodicalId":20467,"journal":{"name":"Proc. 
VLDB Endow.","volume":"70 1","pages":"4296-4309"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139345716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-09-01 DOI: 10.14778/3625054.3625059
Rong Gu, Han Li, Haipeng Dai, Wenjie Huang, Jie Xue, Meng Li, Jiaqi Zheng, Haoran Cai, Yihua Huang, Guihai Chen
Approximate query processing (AQP) is one of the key techniques for coping with the big data querying problem because it obtains approximate answers efficiently. To address the non-trivial sample selection and heavy sampling cost issues in AQP, we propose ShadowAQP, an efficient and accurate approach based on attribute-oriented sample size allocation and data generation. We select samples according to group-by and join attributes, and determine the sample size for each group of unique value combinations to improve query accuracy. We design a conditional variational autoencoder model with automatic table data encoding and model update strategies. To further improve accuracy and efficiency, we propose a set of extensions, including parallel multi-round sampling aggregation, data outlier-aware sampling, and dimension reduction optimization. Evaluation results on diversified datasets show that, compared with state-of-the-art (SOTA) approaches, ShadowAQP achieves a 5.8× query speedup on average (up to 12.8×) while reducing query error by 74% on average (up to 95%).
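The idea of fixing a sample size per group and then scaling the sampled aggregate can be illustrated with plain stratified sampling. The sketch below is not ShadowAQP: it draws samples from the stored rows rather than from a conditional variational autoencoder, and the proportional allocation rule with a per-group floor is an assumption made for illustration, as are the names `allocate` and `approx_group_sum`.

```python
import random
from collections import defaultdict


def allocate(group_sizes: dict, budget: int, floor: int = 1) -> dict:
    """Split a sample budget across groups in proportion to group size.

    The floor keeps small groups represented (hypothetical rule, not the paper's).
    """
    total = sum(group_sizes.values())
    return {
        g: min(n, max(floor, round(budget * n / total)))
        for g, n in group_sizes.items()
    }


def approx_group_sum(rows, budget: int, seed: int = 0) -> dict:
    """Estimate SUM(value) GROUP BY key from a per-group sample."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    alloc = allocate({g: len(v) for g, v in groups.items()}, budget)
    # Scale each sampled sum by (group size / sample size).
    return {
        g: sum(rng.sample(vals, alloc[g])) * len(vals) / alloc[g]
        for g, vals in groups.items()
    }


rows = [("a", 2.0)] * 100 + [("b", 5.0)] * 10
estimate = approx_group_sum(rows, budget=22)
```

With a budget of 22 rows, group `a` receives 20 samples and group `b` receives 2, so the rare group is still covered. Because the values within each group are constant in this toy input, the scaled estimates equal the true sums; on real data the error depends on within-group variance.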
{"title":"ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-oriented Sample Size Allocation and Data Generation","authors":"Rong Gu, Han Li, Haipeng Dai, Wenjie Huang, Jie Xue, Meng Li, Jiaqi Zheng, Haoran Cai, Yihua Huang, Guihai Chen","doi":"10.14778/3625054.3625059","DOIUrl":"https://doi.org/10.14778/3625054.3625059","url":null,"abstract":"Approximate query processing (AQP) is one of the key techniques to cope with big data querying problem on account that it obtains approximate answers efficiently. To address non-trivial sample selection and heavy sampling cost issues in AQP, we propose ShadowAQP, an efficient and accurate approach based on attribute-oriented sample size allocation and data generation. We select samples according to group-by and join attributes, and determine the sample size for each group of unique value combinations to improve query accuracy. We design a conditional variational autoencoder model with automatic table data encoding and model update strategies. To further improve accuracy and efficiency, we propose a set of extensions, including parallel multi-round sampling aggregation, data outlier-aware sampling, and dimension reduction optimization. Evaluation results on diversified datasets show that, compared with SOTA approaches, ShadowAQP achieves 5.8× query speed performance improvement on average (up to 12.8×), while reducing query error by 74% on average (up to 95%) at the same time.","PeriodicalId":20467,"journal":{"name":"Proc. 
VLDB Endow.","volume":"4 1","pages":"4216-4229"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139346226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}