Proceedings of the 2018 International Conference on Management of Data: Latest Articles

The Data Interaction Game
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196899
Ben McCamish, Vahid Ghadakchi, Arash Termehchy, B. Touri, Liang Huang
As many users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. A database management system (DBMS) may interact with users and leverage their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users follow a fixed strategy of expressing their information needs, that is, the likelihood with which a user submits a query to express an information need remains unchanged during her interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how they express their information needs during their interactions with the DBMS. We also show that users' learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and the DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users' strategies, and prove that it improves the effectiveness of answering queries stochastically speaking. We analyze the challenges of implementing this method efficiently over large-scale relational databases and propose two efficient adaptations of the algorithm. Our extensive empirical studies over real-world query workloads and large-scale relational databases indicate that our algorithms are efficient. Our empirical results also show that our proposed learning mechanism is more effective than the state-of-the-art query answering method.
Citations: 18
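The abstract of The Data Interaction Game does not name the reinforcement learning mechanism it refers to. As a rough illustration only, the sketch below uses a cumulative-reinforcement (Roth-Erev style) learner, which is one well-known mechanism of this kind; treating it as the model in the paper is an assumption, and the class name, query strings, and reward values are hypothetical.

```python
import random


class RothErevLearner:
    """Cumulative-reinforcement learner over a fixed set of actions.

    Here an "action" stands for a query the user may submit to express
    one information need (an illustrative assumption).
    """

    def __init__(self, actions, initial_propensity=1.0):
        # Every action starts with the same propensity.
        self.propensity = {a: initial_propensity for a in actions}

    def probabilities(self):
        # Choice probability is proportional to accumulated propensity.
        total = sum(self.propensity.values())
        return {a: p / total for a, p in self.propensity.items()}

    def choose(self, rng=random):
        actions = list(self.propensity)
        weights = [self.propensity[a] for a in actions]
        return rng.choices(actions, weights=weights, k=1)[0]

    def reinforce(self, action, reward):
        # A successful interaction adds its reward to the action's propensity.
        self.propensity[action] += reward


# Hypothetical usage: two ways of phrasing the same information need.
user = RothErevLearner(["smith", "smith sigmod 2018"])
user.reinforce("smith sigmod 2018", 2.0)  # the longer query returned useful results
print(user.probabilities())  # {'smith': 0.25, 'smith sigmod 2018': 0.75}
```

The point of the sketch is that the user's query distribution drifts with feedback, which is the behavior a fixed-strategy query interface cannot capture.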
A General and Efficient Querying Method for Learning to Hash
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183750
Jinfeng Li, Xiao Yan, Jian Zhang, An Xu, James Cheng, Jie Liu, K. K. Ng, Ti-Chung Cheng
As an effective solution to the approximate nearest neighbors (ANN) search problem, learning to hash (L2H) is able to learn similarity-preserving hash functions tailored for a given dataset. However, existing L2H research mainly focuses on improving query performance by learning good hash functions, while Hamming ranking (HR) is used as the default querying method. We show by analysis and experiments that Hamming distance, the similarity indicator used in HR, is too coarse-grained and thus limits the performance of query processing. We propose a new fine-grained similarity indicator, quantization distance (QD), which provides more information about the similarity between a query and the items in a bucket. We then develop two efficient querying methods based on QD, which achieve significantly better query performance than HR. Our methods are general and can work with various L2H algorithms. Our experiments demonstrate that a simple and elegant querying method can produce performance gain equivalent to advanced and complicated learning algorithms.
Citations: 11
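To see why Hamming distance can be too coarse for the learning-to-hash setting described above, the sketch below compares it with a quantization-distance-style indicator. It assumes sign-based hashing of a real-valued projection and assumes that the indicator sums the absolute projection values of the bits on which the query and the bucket disagree; the abstract does not give the exact definition, so both the formula and the function names are illustrative.

```python
def binarize(projection):
    """Sign-based hashing: one bit per projected dimension."""
    return [1 if v > 0 else 0 for v in projection]


def hamming_distance(code_a, code_b):
    """Number of bit positions on which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def quantization_distance(query_projection, bucket_code):
    """Assumed QD: total |projection| over the bits that differ.

    A differing bit whose projection lies close to the threshold (0)
    costs little, since a tiny perturbation would have flipped it.
    """
    query_code = binarize(query_projection)
    return sum(
        abs(v)
        for v, q, b in zip(query_projection, query_code, bucket_code)
        if q != b
    )


query_projection = [0.9, -0.05, 0.4]
query_code = binarize(query_projection)  # [1, 0, 1]

bucket_a = [0, 0, 1]  # differs on bit 0, where the projection is far from 0
bucket_b = [1, 1, 1]  # differs on bit 1, where the projection is near 0

# Hamming distance cannot separate the two buckets ...
print(hamming_distance(query_code, bucket_a), hamming_distance(query_code, bucket_b))  # 1 1
# ... while the finer-grained indicator ranks bucket_b first.
print(quantization_distance(query_projection, bucket_a))  # 0.9
print(quantization_distance(query_projection, bucket_b))  # 0.05
```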
Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183738
Tao Guo, Kaiyu Feng, G. Cong, Z. Bao
With the proliferation of mobile devices, large collections of geospatial data are becoming available, such as geo-tagged photos. Map rendering systems play an important role in presenting such large geospatial datasets to end users. We propose that such systems should support the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints are fundamental challenges to a map exploration system, which aims to efficiently select a small set of representative objects from the user's current region of interest, such that no two selected objects are too close to each other for users to distinguish in the limited space of a screen. We formalize it as the Spatial Object Selection (SOS) problem, prove that it is an NP-hard problem, and develop a novel approximation algorithm with performance guarantees. To further support interactive exploration of geospatial data on maps, we propose the Interactive SOS (ISOS) problem, in which we enrich the SOS problem with the zooming consistency and panning consistency constraints. The objective of ISOS is to provide a seamless experience for end users to interactively explore the data by navigating the map. We extend our algorithm for the SOS problem to solve the ISOS problem, and propose a new strategy based on pre-fetching to significantly enhance the efficiency. Finally, we conducted extensive experiments to show the efficiency and scalability of our approach.
Citations: 46
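The interplay of representativeness and the visibility constraint in the Spatial Object Selection entry above can be illustrated with a simple greedy heuristic. This is not the approximation algorithm from the paper: it assumes each object already carries a precomputed representativeness score, and the function name, tuple layout, and sample data are hypothetical.

```python
import math


def select_visible(objects, k, min_dist):
    """Greedily pick up to k high-scoring objects that are pairwise
    at least min_dist apart.

    objects: list of (name, x, y, score) tuples in screen coordinates.
    """
    selected = []
    # Visit objects from the highest to the lowest score.
    for obj in sorted(objects, key=lambda o: o[3], reverse=True):
        if len(selected) == k:
            break
        # Visibility constraint: skip objects that would overlap a chosen one.
        if all(math.dist(obj[1:3], s[1:3]) >= min_dist for s in selected):
            selected.append(obj)
    return selected


photos = [
    ("a", 0.0, 0.0, 0.9),
    ("b", 0.1, 0.0, 0.8),  # too close to "a" to be shown alongside it
    ("c", 5.0, 5.0, 0.7),
    ("d", 9.0, 9.0, 0.1),
]
print([p[0] for p in select_visible(photos, k=2, min_dist=1.0)])  # ['a', 'c']
```

A greedy pass like this carries no quality guarantee in general, which is part of why the problem statement above calls for an approximation algorithm with proven bounds.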
EKTELO: A Framework for Defining Differentially-Private Computations
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196921
Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, G. Miklau
The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, that it allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.
Citations: 54
Speeding Up Set Intersections in Graph Algorithms using SIMD Instructions
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196924
Shuo Han, Lei Zou, J. Yu
In this paper, we focus on accelerating a widely employed computing pattern, set intersection, to boost a group of graph algorithms. A graph's adjacency lists can be naturally considered as node sets, so set intersection is a primitive operation in many graph algorithms. We propose QFilter, a set intersection algorithm using SIMD instructions. QFilter adopts a merge-based framework and compares two blocks of elements iteratively using SIMD instructions. The key insight behind our improvement is that we quickly filter out most unnecessary comparisons in one byte-checking step. We also present a binary representation called BSR that encodes sets in a compact layout. By combining QFilter and BSR, we achieve data parallelism at two levels: inter-chunk and intra-chunk parallelism. Moreover, we find that node ordering impacts the performance of intersection by affecting the compactness of BSR. We formulate the graph reordering problem as an optimization of the compactness of BSR, and prove its strong NP-completeness. We therefore propose an approximate algorithm that can find a better ordering to enhance the intra-chunk parallelism. We conduct extensive experiments to confirm that our approach can significantly improve the performance of set intersection in graph algorithms.
Citations: 49
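The idea of a compact chunked set encoding combined with a merge-based intersection, as in the set intersection entry above, can be sketched in scalar Python. The layout is an assumption: each element is split into a chunk base (high bits) and a position in a 32-bit bitmap (low bits). The abstract does not specify the exact BSR format, and this sketch uses no SIMD instructions, so it only shows where the two levels of parallelism would apply.

```python
CHUNK_BITS = 32  # assumed bitmap width per chunk


def encode(sorted_ids):
    """Encode sorted non-negative ints as [base, bitmap] chunks."""
    chunks = []
    for x in sorted_ids:
        base, bit = divmod(x, CHUNK_BITS)
        if chunks and chunks[-1][0] == base:
            chunks[-1][1] |= 1 << bit  # same chunk: set one more bit
        else:
            chunks.append([base, 1 << bit])
    return chunks


def intersect(a, b):
    """Merge on chunk bases; AND the bitmaps when the bases match."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            state = a[i][1] & b[j][1]  # intra-chunk work: one bitwise AND
            if state:
                out.append([a[i][0], state])
            i += 1
            j += 1
    return out


def decode(chunks):
    """Expand [base, bitmap] chunks back into sorted ints."""
    return [
        base * CHUNK_BITS + bit
        for base, state in chunks
        for bit in range(CHUNK_BITS)
        if state >> bit & 1
    ]


neighbors_u = encode([1, 3, 33, 64, 70])
neighbors_v = encode([3, 33, 34, 70, 200])
print(decode(intersect(neighbors_u, neighbors_v)))  # [3, 33, 70]
```

The denser the bitmaps, the more elements each AND handles at once, which is why a node ordering that packs neighbors into the same chunks helps.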
Kubernetes and the New Cloud
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183725
E. Brewer
We are in the midst of shifting the notion of “Cloud” to a higher level of abstraction than virtual machines — one based on services, processes and APIs. Kubernetes epitomizes this shift and has rapidly become the de facto way to manage this new era of container-based applications. It aims to simplify the deployment and management of services, including the construction of applications as sets of interacting but independent services. We explain some of the key concepts in Kubernetes and Istio and show how they work together to simplify evolution, scaling and operations.
Citations: 7
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196934
Manasi Vartak, Joana M. F. da Trindade, S. Madden, M. Zaharia
Model diagnosis is the process of analyzing machine learning (ML) model performance to identify where the model works well and where it doesn't. It is a key part of the modeling process and helps ML developers iteratively improve model accuracy. Often, model diagnosis is performed by analyzing different datasets or intermediates associated with the model, such as the input data and hidden representations learned by the model (e.g., [4, 24, 39]). The bottleneck in fast model diagnosis is the creation and storage of model intermediates. Storing these intermediates requires tens to hundreds of GB of storage, whereas re-running the model for each diagnostic query slows down model diagnosis. To address this bottleneck, we propose a system called MISTIQUE that can work with traditional ML pipelines as well as deep neural networks to efficiently capture, store, and query model intermediates for diagnosis. For each diagnostic query, MISTIQUE intelligently chooses whether to re-run the model or read a previously stored intermediate. For intermediates that are stored in MISTIQUE, we propose a range of optimizations to reduce storage footprint, including quantization, summarization, and data de-duplication. We evaluate our techniques on a range of real-world ML models in scikit-learn and TensorFlow. We demonstrate that our optimizations reduce storage by up to 110X for traditional ML pipelines and up to 6X for deep neural networks. Furthermore, by using MISTIQUE, we can speed up diagnostic queries by up to 390X on traditional ML pipelines and up to 210X on deep neural networks.
Citations: 57
A Rating-Ranking Method for Crowdsourced Top-k Computation
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183762
Kaiyu Li, Xiaohang Zhang, Guoliang Li
Crowdsourced top-k computation aims to utilize human ability to identify the top-k objects from a given set of objects. Most existing studies employ a pairwise-comparison-based method, which first asks workers to compare each pair of objects and then infers the top-k results from the pairwise comparison results. Comparing every object pair requires a quadratic number of questions, so these methods involve huge monetary cost, especially for large datasets. To address this problem, we propose a rating-ranking-based approach, which contains two types of questions to ask the crowd. The first is a rating question, which asks the crowd to give a score for an object. The second is a ranking question, which asks the crowd to rank several (e.g., 3) objects. Rating questions are coarse-grained and can roughly get a score for each object, which can be used to prune the objects whose scores are much smaller than those of the top-k objects. Ranking questions are fine-grained and can be used to refine the scores. We propose a unified model for the rating and ranking questions, and seamlessly combine them to compute the top-k results. We also study how to judiciously select appropriate rating or ranking questions and assign them to an incoming worker. Experimental results on real datasets show that our method significantly outperforms existing approaches.
Citations: 18
Improving Join Reorderability with Compensation Operators
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183731
Taining Wang, C. Chan
A critical task in query optimization is the join reordering problem: finding an efficient evaluation order for the join operators in a query plan. While the join reordering problem is well studied for queries with only inner joins, it becomes considerably harder when outerjoins or antijoins are involved, because such operators are generally not associative. Existing solutions to this problem do not enumerate the complete space of join orderings, owing to various restrictions on the query rewriting rules they consider. In this paper, we present a novel approach to this problem for the class of queries involving inner joins, single-sided outerjoins, and/or antijoins. Our approach supports complete join reorderability for this class of queries, which supersedes the state-of-the-art approaches.
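The abstract's remark that outerjoins are generally not associative can be illustrated with a minimal sketch. The toy relations `R`, `S`, `T` and the helper functions `inner_join` and `left_outer_join` below are assumptions made for illustration only; they are not taken from the paper and do not show its compensation operators.

```python
# Toy relations as lists of tuples (hypothetical data, one column each).
R = [(1,)]
S = [(1,)]
T = []  # empty relation


def inner_join(left, right, pred):
    """Concatenate every pair of rows that satisfies the predicate."""
    return [l + r for l in left for r in right if pred(l, r)]


def left_outer_join(left, right, pred, right_width):
    """Like inner_join, but keep unmatched left rows padded with None."""
    out = []
    for l in left:
        matches = [l + r for r in right if pred(l, r)]
        out.extend(matches if matches else [l + (None,) * right_width])
    return out


# Plan A: R LEFT JOIN (S JOIN T), with R.a = S.a and S.a = T.a
s_join_t = inner_join(S, T, lambda s, t: s[0] == t[0])
plan_a = left_outer_join(R, s_join_t, lambda r, st: r[0] == st[0], right_width=2)

# Plan B: (R LEFT JOIN S) JOIN T, same predicates, different grouping
r_left_s = left_outer_join(R, S, lambda r, s: r[0] == s[0], right_width=1)
plan_b = inner_join(r_left_s, T, lambda rs, t: rs[1] == t[0])

print(plan_a)  # the R row survives, padded with None
print(plan_b)  # the inner join with empty T removes every row
```

The two groupings return different results on the same input, so an optimizer cannot freely regroup these operators the way it can with inner joins alone.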
Citations: 2
RDSQ: Reliable Queue Protocol over Shared Logs
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183718
Haolin Yu
Citations: 0