Managing Non-Volatile Memory in Database Systems
Alexander van Renen, Viktor Leis, A. Kemper, Thomas Neumann, T. Hashida, Kazuichi Oe, Y. Doi, L. Harada, Mitsuru Sato
https://doi.org/10.1145/3183713.3196897
Non-volatile memory (NVM) is a new storage technology that combines the performance and byte addressability of DRAM with the persistence of traditional storage devices like flash (SSD). While these properties make NVM highly promising, it is not yet clear how to best integrate NVM into the storage layer of modern database systems. Two system designs have been proposed. The first is to use NVM exclusively, i.e., to store all data and index structures on it. However, because NVM has a higher latency than DRAM, this design can be less efficient than main-memory database systems. For this reason, the second approach uses a page-based DRAM cache in front of NVM. This approach, however, does not utilize the byte addressability of NVM and, as a result, accessing an uncached tuple on NVM requires retrieving an entire page. In this work, we evaluate these two approaches and compare them with in-memory databases as well as more traditional buffer managers that use main memory as a cache in front of SSDs. This allows us to determine how much performance gain can be expected from NVM. We also propose a lightweight storage manager that simultaneously supports DRAM, NVM, and flash. Our design utilizes the byte addressability of NVM and uses it as an additional caching layer that improves performance without losing the benefits of the even faster DRAM or the large capacity of SSDs.
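To make the tiering concrete, here is a minimal sketch (not the authors' storage manager) of the read path such a design implies: serve from the DRAM cache if possible, otherwise exploit NVM's byte addressability to fetch only the cache lines covering the requested bytes, and fall back to a full page read from SSD. The names, sizes, and dictionary-backed tiers are illustrative assumptions.

```python
PAGE_SIZE = 16 * 1024
CACHE_LINE = 64

class TieredBuffer:
    def __init__(self, ssd):
        self.dram = {}   # page_id -> bytearray (full cached pages)
        self.nvm = {}    # page_id -> bytearray (byte-addressable middle tier)
        self.ssd = ssd   # page_id -> bytes (block-oriented, slowest)

    def read(self, page_id, offset, length):
        # Fast path: page already cached in DRAM.
        if page_id in self.dram:
            return bytes(self.dram[page_id][offset:offset + length])
        # NVM is byte-addressable: touch only the cache lines that cover
        # the requested range instead of transferring the whole page.
        if page_id in self.nvm:
            start = (offset // CACHE_LINE) * CACHE_LINE
            lines = self.nvm[page_id][start:offset + length]
            return bytes(lines[offset - start:])
        # SSD is block-oriented: an uncached tuple costs a full page read,
        # which we then admit to DRAM (eviction policy omitted).
        page = bytearray(self.ssd[page_id])
        self.dram[page_id] = page
        return bytes(page[offset:offset + length])

# Toy usage: one 16 KiB page on "SSD", read 8 bytes at offset 100.
buf = TieredBuffer(ssd={0: bytes(range(256)) * 64})
print(buf.read(0, 100, 8))
```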
{"title":"Managing Non-Volatile Memory in Database Systems","authors":"Alexander van Renen, Viktor Leis, A. Kemper, Thomas Neumann, T. Hashida, Kazuichi Oe, Y. Doi, L. Harada, Mitsuru Sato","doi":"10.1145/3183713.3196897","DOIUrl":"https://doi.org/10.1145/3183713.3196897","url":null,"abstract":"Non-volatile memory (NVM) is a new storage technology that combines the performance and byte addressability of DRAM with the persistence of traditional storage devices like flash (SSD). While these properties make NVM highly promising, it is not yet clear how to best integrate NVM into the storage layer of modern database systems. Two system designs have been proposed. The first is to use NVM exclusively, i.e., to store all data and index structures on it. However, because NVM has a higher latency than DRAM, this design can be less efficient than main-memory database systems. For this reason, the second approach uses a page-based DRAM cache in front of NVM. This approach, however, does not utilize the byte addressability of NVM and, as a result, accessing an uncached tuple on NVM requires retrieving an entire page. In this work, we evaluate these two approaches and compare them with in-memory databases as well as more traditional buffer managers that use main memory as a cache in front of SSDs. This allows us to determine how much performance gain can be expected from NVM. We also propose a lightweight storage manager that simultaneously supports DRAM, NVM, and flash. Our design utilizes the byte addressability of NVM and uses it as an additional caching layer that improves performance without losing the benefits from the even faster DRAM and the large capacities of SSDs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88098087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A General and Efficient Querying Method for Learning to Hash
Jinfeng Li, Xiao Yan, Jian Zhang, An Xu, James Cheng, Jie Liu, K. K. Ng, Ti-Chung Cheng
https://doi.org/10.1145/3183713.3183750
As an effective solution to the approximate nearest neighbors (ANN) search problem, learning to hash (L2H) is able to learn similarity-preserving hash functions tailored for a given dataset. However, existing L2H research mainly focuses on improving query performance by learning good hash functions, while Hamming ranking (HR) is used as the default querying method. We show by analysis and experiments that Hamming distance, the similarity indicator used in HR, is too coarse-grained and thus limits the performance of query processing. We propose a new fine-grained similarity indicator, quantization distance (QD), which provides more information about the similarity between a query and the items in a bucket. We then develop two efficient querying methods based on QD, which achieve significantly better query performance than HR. Our methods are general and can work with various L2H algorithms. Our experiments demonstrate that a simple and elegant querying method can produce performance gain equivalent to advanced and complicated learning algorithms.
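As a rough illustration of why QD is finer-grained than Hamming distance, consider sign-based hashing, where each bit tests the sign of a random projection. A plausible reading of QD (an assumption here, not necessarily the paper's exact definition) charges each disagreeing bit by how far the query's projection sits from the quantization threshold, rather than a flat 1:

```python
import numpy as np

def codes_and_projections(W, q):
    proj = W @ q                         # real-valued projections, shape (m,)
    return (proj > 0).astype(int), proj

def hamming_distance(code_a, code_b):
    return int(np.sum(code_a != code_b))

def quantization_distance(bucket_code, query_code, query_proj):
    # Each disagreeing bit costs the magnitude of the query's projection,
    # i.e., how much the query would have to move to land in that bucket.
    disagree = bucket_code != query_code
    return float(np.sum(np.abs(query_proj[disagree])))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))             # 8 hash bits over 32-dim vectors
q = rng.normal(size=32)
q_code, q_proj = codes_and_projections(W, q)
bucket = 1 - q_code                      # a bucket disagreeing on every bit
print(hamming_distance(bucket, q_code))  # 8: coarse, integer-valued
print(quantization_distance(bucket, q_code, q_proj))
```

Two buckets at the same Hamming distance can then be ranked differently, because bits whose projections lie near zero are cheap to "flip".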
{"title":"A General and Efficient Querying Method for Learning to Hash","authors":"Jinfeng Li, Xiao Yan, Jian Zhang, An Xu, James Cheng, Jie Liu, K. K. Ng, Ti-Chung Cheng","doi":"10.1145/3183713.3183750","DOIUrl":"https://doi.org/10.1145/3183713.3183750","url":null,"abstract":"As an effective solution to the approximate nearest neighbors (ANN) search problem, learning to hash (L2H) is able to learn similarity-preserving hash functions tailored for a given dataset. However, existing L2H research mainly focuses on improving query performance by learning good hash functions, while Hamming ranking (HR) is used as the default querying method. We show by analysis and experiments that Hamming distance, the similarity indicator used in HR, is too coarse-grained and thus limits the performance of query processing. We propose a new fine-grained similarity indicator, quantization distance (QD), which provides more information about the similarity between a query and the items in a bucket. We then develop two efficient querying methods based on QD, which achieve significantly better query performance than HR. Our methods are general and can work with various L2H algorithms. Our experiments demonstrate that a simple and elegant querying method can produce performance gain equivalent to advanced and complicated learning algorithms.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"57 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88236276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo
https://doi.org/10.1145/3183713.3199671
Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware, both of which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions within seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.
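A minimal sketch of the learned-cost-model idea, under assumptions not in the paper (a linear model and made-up profiling data): fit per-primitive unit costs from measured runs, then price a candidate design by the primitives its operations would issue.

```python
import numpy as np

# Hypothetical access primitives a design's operations are decomposed into.
primitives = ["random_access", "sequential_scan", "binary_search_step"]

# Pretend profiling on the target hardware produced these
# (primitive-count vector, measured latency in microseconds) pairs.
X = np.array([[100, 0, 0], [0, 1000, 0], [0, 0, 50], [50, 500, 25]], float)
y = np.array([25.0, 12.0, 4.0, 20.5])

# Least-squares fit of per-primitive unit costs (the "learned" model).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_cost(counts):
    """Predicted latency of an operation issuing `counts` primitives."""
    return float(np.array(counts, float) @ coef)

# What-if: a point lookup doing 3 binary-search steps then 1 random access,
# priced without implementing the structure or touching the hardware.
print(predict_cost([1, 0, 3]))
```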
{"title":"The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models","authors":"Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo","doi":"10.1145/3183713.3199671","DOIUrl":"https://doi.org/10.1145/3183713.3199671","url":null,"abstract":"Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88357907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Data Interaction Game
Ben McCamish, Vahid Ghadakchi, Arash Termehchy, B. Touri, Liang Huang
https://doi.org/10.1145/3183713.3196899
As many users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. A database management system (DBMS) may interact with users and leverage their feedback on the returned results to learn the information needs behind users' queries. Current query interfaces assume that users follow a fixed strategy for expressing their information needs, that is, the likelihood with which a user submits a query to express an information need remains unchanged during her interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how they express their information needs during their interactions with the DBMS. We also show that users' learning is accurately modeled by a well-known reinforcement learning mechanism. Because current data interaction systems assume that users do not modify their strategies, they cannot effectively discover the information needs behind users' queries. We model the interaction between users and the DBMS as a game of identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries, adapts to changes in users' strategies, and provably improves the effectiveness of answering queries in a stochastic sense. We analyze the challenges of implementing this method efficiently and propose two efficient adaptations of the algorithm for large-scale relational databases. Our extensive empirical studies over real-world query workloads and large-scale relational databases indicate that our algorithms are efficient. Our empirical results also show that our proposed learning mechanism is more effective than the state-of-the-art query answering method.
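For intuition, here is a minimal sketch of a Roth-Erev-style propensity update, one well-known reinforcement learning mechanism of the kind the paper refers to; the action set, rewards, and recency parameter are illustrative assumptions, not the paper's experimental setup.

```python
import random

class RothErevAgent:
    """Chooses actions proportionally to accumulated propensities."""

    def __init__(self, actions, recency=0.1):
        self.propensity = {a: 1.0 for a in actions}
        self.recency = recency                     # forgetting parameter

    def choose(self):
        total = sum(self.propensity.values())
        r, acc = random.uniform(0, total), 0.0
        for a, p in self.propensity.items():
            acc += p
            if r <= acc:
                return a

    def reinforce(self, action, reward):
        for a in self.propensity:                  # decay all propensities,
            self.propensity[a] *= (1 - self.recency)
        self.propensity[action] += reward          # then credit the chosen one

# A "user" learning which phrasing of her information need gets results.
agent = RothErevAgent(["query_phrasing_1", "query_phrasing_2"])
for _ in range(100):
    a = agent.choose()
    agent.reinforce(a, reward=1.0 if a == "query_phrasing_2" else 0.1)
print(max(agent.propensity, key=agent.propensity.get))
```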
{"title":"The Data Interaction Game","authors":"Ben McCamish, Vahid Ghadakchi, Arash Termehchy, B. Touri, Liang Huang","doi":"10.1145/3183713.3196899","DOIUrl":"https://doi.org/10.1145/3183713.3196899","url":null,"abstract":"As many users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management systems (DBMS) may interact with users and leverage their feedback on the returned results to learn the information needs behind users' queries. Current query interfaces assume that users follow a fixed strategy of expressing their information needs, that is, the likelihood by which a user submits a query to express an information need remains unchanged during her interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS. We also show that users' learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users' strategies and prove that it improves the effectiveness of answering queries stochastically speaking. We analyze the challenges of efficient implementation of this method over large-scale relational databases and propose two efficient adaptations of this algorithm over large-scale relational databases. Our extensive empirical studies over real-world query workloads and large-scale relational databases indicate that our algorithms are efficient. Our empirical results also show that our proposed learning mechanism is more effective than the state-of-the-art query answering method.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86726385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research 15: Databases for Emerging Hardware","authors":"P. Pietzuch","doi":"10.1145/3258023","DOIUrl":"https://doi.org/10.1145/3258023","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"57 6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85435036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DITA: Distributed In-Memory Trajectory Analytics
Zeyuan Shang, Guoliang Li, Z. Bao
https://doi.org/10.1145/3183713.3183743
Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimization. Existing algorithms focus on optimizing this problem on a single machine. However, the volume of trajectory data exceeds the storage and processing capability of a single machine, which calls for large-scale trajectory analytics in distributed environments. Distributed trajectory analytics faces the challenges of data-locality-aware partitioning, load balancing, an easy-to-use interface, and versatility in supporting various trajectory similarity functions. To address these challenges, we propose DITA, a distributed in-memory trajectory analytics system. We propose an effective partitioning method, with a global index and a local index, to address the data locality problem. We devise cost-based techniques to balance the workload. We develop a filter-verification framework to improve performance. Moreover, DITA supports most existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL and make it support both SQL and DataFrame API interfaces. We have conducted extensive experiments on real-world datasets, and the results show that DITA significantly outperforms existing distributed trajectory similarity search and join approaches.
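As a minimal illustration of the filter-verification pattern (not DITA's actual pivot-based filters or indexes), the sketch below prunes candidates with a cheap lower bound before running an exact computation; the choice of DTW as the similarity function and the endpoint-based bound are stand-in assumptions.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dtw(t1, t2):
    """Exact dynamic time warping distance between two 2-D trajectories."""
    n, m = len(t1), len(t2)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = dist(t1[i - 1], t2[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def lower_bound(t1, t2):
    # Every DTW warping path must match the two start points and the two
    # end points, so their distances always contribute to the final cost
    # (valid for trajectories of length >= 2).
    return dist(t1[0], t2[0]) + dist(t1[-1], t2[-1])

def similarity_search(query, trajectories, threshold):
    results = []
    for t in trajectories:
        if lower_bound(query, t) > threshold:   # filter: cheap prune
            continue
        if dtw(query, t) <= threshold:          # verify: exact distance
            results.append(t)
    return results
```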
{"title":"DITA: Distributed In-Memory Trajectory Analytics","authors":"Zeyuan Shang, Guoliang Li, Z. Bao","doi":"10.1145/3183713.3183743","DOIUrl":"https://doi.org/10.1145/3183713.3183743","url":null,"abstract":"Trajectory analytics can benefit many real-world applications, e.g., frequent trajectory based navigation systems, road planning, car pooling, and transportation optimizations. Existing algorithms focus on optimizing this problem in a single machine. However, the amount of trajectories exceeds the storage and processing capability of a single machine, and it calls for large-scale trajectory analytics in distributed environments. The distributed trajectory analytics faces challenges of data locality aware partitioning, load balance, easy-to-use interface, and versatility to support various trajectory similarity functions. To address these challenges, we propose a distributed in-memory trajectory analytics system DITA. We propose an effective partitioning method, global index and local index, to address the data locality problem. We devise cost-based techniques to balance the workload. We develop a filter-verification framework to improve the performance. Moreover, DITA can support most of existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL, and make it support SQL and DataFrame API interfaces. We have conducted extensive experiments on real world datasets, and experimental results show that DITA outperforms existing distributed trajectory similarity search and join approaches significantly.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84202353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IMPROVE-QA: An Interactive Mechanism for RDF Question/Answering Systems
Xinbo Zhang, Lei Zou
https://doi.org/10.1145/3183713.3193555
RDF Question/Answering (Q/A) systems interpret a user's natural-language question N as a SPARQL query Q and return the answer set Q(D) over an RDF repository D to the user. However, because it is difficult to link natural-language phrases to specific RDF items (e.g., entities and predicates), understanding users' questions precisely remains hard; hence Q(D) may not meet users' expectations, offering wrong answers and omitting some correct ones. In this demo, we present IMPROVE-QA (an Interactive Mechanism aiming for PROmotion Via feedback to Q/A systems), a platform that makes existing Q/A systems return more precise answers, denoted Q'(D), to users. Based on the user's feedback on Q(D), IMPROVE-QA automatically refines the original query Q into a new query graph Q' with minimum modifications, where Q'(D) provides more precise answers. We will also demonstrate how IMPROVE-QA applies the "lesson" learned from the user in each query to improve the precision of Q/A systems on subsequent natural-language questions.
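A heavily simplified sketch of the refinement idea, with an assumed edit space and scoring rule rather than IMPROVE-QA's actual query-graph algorithm: among candidate variants of Q, prefer those whose answer sets agree with the user's feedback, breaking ties by the number of modifications.

```python
def refine(candidates, approved, rejected):
    """candidates: list of (num_edits, answer_set) pairs for variants of Q.

    Pick the variant whose answers keep what the user approved and drop
    what she rejected, with as few modifications to Q as possible.
    """
    def score(item):
        edits, answers = item
        missed = len(approved - answers)       # approved answers lost
        kept_bad = len(answers & rejected)     # rejected answers kept
        return (missed + kept_bad, edits)      # fewer errors, then fewer edits
    return min(candidates, key=score)

# Toy usage: three hypothetical variants of a query with their answer sets.
candidates = [(0, {"a", "b", "x"}), (1, {"a", "b"}), (2, {"a"})]
print(refine(candidates, approved={"a", "b"}, rejected={"x"}))
# -> (1, {'a', 'b'}): one edit removes the rejected answer, keeps both approved.
```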
{"title":"IMPROVE-QA: An Interactive Mechanism for RDF Question/Answering Systems","authors":"Xinbo Zhang, Lei Zou","doi":"10.1145/3183713.3193555","DOIUrl":"https://doi.org/10.1145/3183713.3193555","url":null,"abstract":"RDF Question/Answering(Q/A) systems can interpret user's question N as SPARQL query Q and return answer set $Q(D)$ over RDF repository D to the user. However, due to the complexity of linking natural phrases with specific RDF items (e.g., entities and predicates), it remains difficult to understand users' questions precisely, hence $Q(D)$ may not meet users' expectation, offering wrong answers and dismissing some correct answers. In this demo, we design an I Interactive Mechanism aiming for PRO motion V ia feedback to Q/A systems (IMPROVE-QA), a whole platform to make existing Q/A systems return more precise answers (denoted as $mathcal Q^prime (D)$) to users. Based on user's feedback over $Q(D)$, IMPROVE-QA automatically refines the original query Q into a new query graph $mathcal Q^prime $ with minimum modifications, where $mathcal Q^prime (D)$ provides more precise answers. We will also demonstrate how IMPROVE-QA can apply the \"lesson'' learned from the user in each query to improve the precision of Q/A systems on subsequent natural language questions.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90717749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EKTELO: A Framework for Defining Differentially-Private Computations
Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, G. Miklau
https://doi.org/10.1145/3183713.3196921
The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, that it allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.
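For context, the textbook baseline such frameworks build on for linear counting queries is the vector Laplace mechanism: represent the data as a histogram x, the workload as a matrix W, and add noise calibrated to W's L1 sensitivity. The sketch below is that baseline under illustrative data, not Ektelo's operator API.

```python
import numpy as np

def laplace_linear_queries(W, x, epsilon, seed=None):
    rng = np.random.default_rng(seed)
    # For x -> Wx over neighboring histograms (one count changed by 1),
    # the L1 sensitivity is the largest column L1 norm of W.
    sensitivity = np.abs(W).sum(axis=0).max()
    noise = rng.laplace(scale=sensitivity / epsilon, size=W.shape[0])
    return W @ x + noise

x = np.array([30.0, 12.0, 5.0, 53.0])    # histogram over 4 bins
W = np.array([[1, 1, 0, 0],              # two range-count queries
              [0, 0, 1, 1]], float)
print(laplace_linear_queries(W, x, epsilon=1.0, seed=0))
```

Much of the design space the paper explores (measurement selection, inference, partitioning) amounts to composing operators around this core so the same privacy budget yields more accurate answers.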
{"title":"EKTELO: A Framework for Defining Differentially-Private Computations","authors":"Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, G. Miklau","doi":"10.1145/3183713.3196921","DOIUrl":"https://doi.org/10.1145/3183713.3196921","url":null,"abstract":"The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, that it allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90054547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality
Min Xie, R. C. Wong, J. Li, Cheng Long, Ashwin Lall
https://doi.org/10.1145/3183713.3196903
Extracting interesting tuples from a large database is an important problem in multi-criteria decision making. Two representative queries have been proposed in the literature: top-k queries and skyline queries. A top-k query requires users to specify their utility functions beforehand and then returns k tuples to the users. A skyline query does not require any utility function from users, but it puts no control on the number of tuples returned to users. Recently, the k-regret query was proposed and received attention from the community because it does not require any utility function from users and its output size is controllable, thus avoiding those deficiencies of top-k queries and skyline queries. Specifically, it returns the k tuples that minimize a criterion called the maximum regret ratio. In this paper, we present a lower bound on the maximum regret ratio for the k-regret query. In addition, we propose a novel algorithm, called SPHERE, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature. We conducted extensive experiments showing that SPHERE performs better than the state-of-the-art methods for the k-regret query.
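To make the criterion concrete: for a linear utility u, the regret ratio of a subset S of database D is 1 - max_{t in S} u·t / max_{t in D} u·t, and the maximum regret ratio is its worst case over utilities. The sketch below only estimates it by sampling nonnegative utility directions; the sampling is an illustrative shortcut, not how exact bounds are derived.

```python
import numpy as np

def regret_ratio(D, S, u):
    # How much utility (relatively) a user with utility vector u loses by
    # seeing only S instead of the full database D.
    return 1.0 - max(u @ t for t in S) / max(u @ t for t in D)

def max_regret_ratio(D, S, n_samples=10000, seed=0):
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        u = np.abs(rng.normal(size=D.shape[1]))   # nonnegative linear utility
        worst = max(worst, regret_ratio(D, S, u))
    return worst

D = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])
S = D[[2]]                                        # return only the middle point
print(max_regret_ratio(D, S))                     # approaches 0.4 (worst at axes)
```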
{"title":"Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality","authors":"Min Xie, R. C. Wong, J. Li, Cheng Long, Ashwin Lall","doi":"10.1145/3183713.3196903","DOIUrl":"https://doi.org/10.1145/3183713.3196903","url":null,"abstract":"Extracting interesting tuples from a large database is an important problem in multi-criteria decision making. Two representative queries were proposed in the literature: top- k queries and skyline queries. A top- k query requires users to specify their utility functions beforehand and then returns k tuples to the users. A skyline query does not require any utility function from users but it puts no control on the number of tuples returned to users. Recently, a k-regret query was proposed and received attention from the community because it does not require any utility function from users and the output size is controllable, and thus it avoids those deficiencies of top- k queries and skyline queries. Specifically, it returns k tuples that minimize a criterion called the maximum regret ratio . In this paper, we present the lower bound of the maximum regret ratio for the k -regret query. Besides, we propose a novel algorithm, called SPHERE, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature. We conducted extensive experiments to show that SPHERE performs better than the state-of-the-art methods for the k -regret query.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78730928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subjective Knowledge Base Construction Powered By Crowdsourcing and Knowledge Base
Hao Xin, Rui Meng, Lei Chen
https://doi.org/10.1145/3183713.3183732
Knowledge base construction (KBC) has become a timely topic with the increasing application needs of large-scale knowledge bases (KBs), such as semantic search, QA systems, the Google Knowledge Graph, and the IBM Watson QA system. Existing KBs mainly focus on encoding facts about the world, e.g., city areas and company products, which are regarded as objective knowledge, whereas subjective knowledge, which is frequently mentioned in Web queries, has been neglected. Subjective knowledge has no documented ground truth; instead, the truth relies on people's dominant opinion, which can be solicited from online crowd workers. In our work, we propose a KBC framework for subjective knowledge base construction that takes advantage of knowledge from both the crowd and existing KBs. The framework has two stages: core subjective KB construction and subjective KB enrichment. First, we build a core subjective KB mined from existing KBs, in which every instance has rich objective properties. Then, we populate the core subjective KB with instances extracted from existing KBs, leveraging the crowd to annotate the subjective properties of these instances. To optimize the crowd annotation process, we formulate subjective KB enrichment as a cost-aware instance annotation problem and propose two instance annotation algorithms: an adaptive algorithm and a batch-mode algorithm. We evaluate our framework on real knowledge bases and a real crowdsourcing platform; the experimental results show that our framework derives high-quality subjective knowledge facts from existing KBs and crowdsourcing techniques.
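As a toy illustration of cost-aware instance annotation (the paper's adaptive and batch-mode algorithms are more involved; the stopping rule, margin, and budget here are assumptions), one can keep buying votes for an instance only until a label leads by a margin or the budget is exhausted.

```python
import random
from collections import Counter

def annotate(ask_worker, budget=7, margin=2):
    """Collect crowd votes for one instance, stopping early when confident."""
    votes = Counter()
    for cost in range(1, budget + 1):
        votes[ask_worker()] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= margin:
            return ranked[0][0], cost          # confident early stop
    return votes.most_common(1)[0][0], budget  # budget spent: take majority

# Simulated crowd: 80% of workers report the dominant opinion.
truth = "kid-friendly"
worker = lambda: truth if random.random() < 0.8 else "not-kid-friendly"
print(annotate(worker))  # -> (label, number of votes purchased)
```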
{"title":"Subjective Knowledge Base Construction Powered By Crowdsourcing and Knowledge Base","authors":"Hao Xin, Rui Meng, Lei Chen","doi":"10.1145/3183713.3183732","DOIUrl":"https://doi.org/10.1145/3183713.3183732","url":null,"abstract":"Knowledge base construction (KBC) has become a hot and in-time topic recently with the increasing application need of large-scale knowledge bases (KBs), such as semantic search, QA systems, the Google Knowledge Graph and IBM Watson QA System. Existing KBs mainly focus on encoding the factual facts of the world, e.g., city area and company product, which are regarded as the objective knowledge, whereas the subjective knowledge, which is frequently mentioned in Web queries, has been neglected. The subjective knowledge has no documented ground truth, instead, the truth relies on people's dominant opinion, which can be solicited from online crowd workers. In our work, we propose a KBC framework for subjective knowledge base construction taking advantage of the knowledge from the crowd and existing KBs. We develop a two-staged framework for subjective KB construction which consists of core subjective KB construction and subjective KB enrichment. Firstly, we try to build a core subjective KB mined from existing KBs, where every instance has rich objective properties. Then, we populate the core subjective KB with instances extracted from existing KBs, in which the crowd is leverage to annotate the subjective property of the instances. In order to optimize the crowd annotation process, we formulate the problem of subjective KB enrichment procedure as a cost-aware instance annotation problem and propose two instance annotation algorithms, i.e., adaptive instance annotation and batch-mode instance annotation algorithms. We develop a two-stage system for subjective KB construction which consists of core subjective KB construction and subjective knowledge enrichment. We evaluate our framework on real knowledge bases and a real crowdsourcing platform, the experimental results show that we can derive high quality subjective knowledge facts from existing KBs and crowdsourcing techniques through our proposed framework.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77969532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}