Proceedings of the 2018 International Conference on Management of Data: Latest Papers

Skyline Community Search in Multi-valued Networks
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183736
Ronghua Li, Lu Qin, Fanghua Ye, J. Yu, Xiaokui Xiao, Nong Xiao, Zibin Zheng
Given a scientific collaboration network, how can we find a group of collaborators with a high research indicator (e.g., h-index) and diverse research interests? Given a social network, how can we identify the communities that have high influence (e.g., PageRank) and also have similar interests to a specified user? In such settings, the network can be modeled as a multi-valued network where each node has d ($d \ge 1$) numerical attributes (e.g., h-index, diversity, PageRank, similarity score). In the multi-valued network, we want to find communities that are not dominated by the other communities in terms of the d numerical attributes. Most existing community search algorithms either completely ignore the numerical attributes or only consider one numerical attribute of the nodes. To capture d numerical attributes, we propose a novel community model, called skyline community, based on the concepts of k-core and skyline. A skyline community is a maximal connected k-core that cannot be dominated by the other connected k-cores in the d-dimensional attribute space. We develop an elegant space-partition algorithm to efficiently compute the skyline communities. Two striking advantages of our algorithm are that (1) its time complexity relies mainly on the size of the answer s (i.e., the number of skyline communities), so it is very efficient if s is small; and (2) it can progressively output the skyline communities, which is very useful for applications that only require part of the skyline communities. Extensive experiments on both synthetic and real-world networks demonstrate the efficiency, scalability, and effectiveness of the proposed algorithm.
Citations: 72
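The dominance idea in the skyline community abstract can be illustrated with a small sketch. It assumes each candidate community has already been summarized by a d-dimensional value vector where larger is better; how that vector is derived from node attributes is not stated in the abstract, so the vectors here are simply given. The names `dominates` and `skyline` are hypothetical, and this brute-force filter is not the space-partition algorithm the paper proposes.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Vector]) -> List[str]:
    """Names of candidates that no other candidate dominates (brute force, O(n^2))."""
    return sorted(
        name
        for name, vec in candidates.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in candidates.items()
            if other != name
        )
    )


# Assumed toy data: four communities scored on two attributes.
communities = {"A": (5, 1), "B": (3, 3), "C": (2, 2), "D": (1, 5)}
print(skyline(communities))
```

In this toy input, community C is dominated by B, so only A, B, and D remain on the skyline.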
Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183738
Tao Guo, Kaiyu Feng, G. Cong, Z. Bao
With the proliferation of mobile devices, large collections of geospatial data are becoming available, such as geo-tagged photos. Map rendering systems play an important role in presenting such large geospatial datasets to end users. We propose that such systems should support the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints are fundamental challenges to a map exploration system, which aims to efficiently select a small set of representative objects from the user's current region of interest, such that no two selected objects are too close to each other for users to distinguish in the limited space of a screen. We formalize it as the Spatial Object Selection (SOS) problem, prove that it is an NP-hard problem, and develop a novel approximation algorithm with performance guarantees. To further support interactive exploration of geospatial data on maps, we propose the Interactive SOS (ISOS) problem, in which we enrich the SOS problem with the zooming consistency and panning consistency constraints. The objective of ISOS is to provide a seamless experience for end users to interactively explore the data by navigating the map. We extend our algorithm for the SOS problem to solve the ISOS problem, and propose a new strategy based on pre-fetching to significantly enhance the efficiency. Finally, we conducted extensive experiments to show the efficiency and scalability of our approach.
Citations: 46
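The visibility constraint in the SOS abstract (no two selected objects too close together) can be sketched with a simple greedy heuristic. This is only an illustration under assumptions: each object carries a hypothetical representativeness `score`, distance is Euclidean in screen coordinates, and `select_visible`, `k`, and `min_dist` are made-up names. It is not the paper's approximation algorithm and carries none of its guarantees.

```python
import math
from typing import List, Tuple

# (x, y, score): screen position plus an assumed representativeness score.
MapObject = Tuple[float, float, float]


def select_visible(objects: List[MapObject], k: int, min_dist: float) -> List[MapObject]:
    """Greedily pick up to k high-score objects that are pairwise at least min_dist apart."""
    selected: List[MapObject] = []
    for obj in sorted(objects, key=lambda o: o[2], reverse=True):
        if len(selected) == k:
            break
        # Keep the object only if it is far enough from everything already chosen.
        if all(math.hypot(obj[0] - s[0], obj[1] - s[1]) >= min_dist for s in selected):
            selected.append(obj)
    return selected


photos = [(0, 0, 0.9), (0.1, 0, 0.8), (5, 5, 0.7), (5, 5.2, 0.6), (10, 0, 0.5)]
print(select_visible(photos, k=3, min_dist=1.0))
```

Here the second and fourth photos are skipped because each sits within `min_dist` of a higher-scoring photo that was already selected.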
Session details: Industry 3: DB Systems in the Cloud and Open Source
Mohammad Sadoghi
Citations: 0
Interactive Demonstration of Probabilistic Predicates
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3193542
Yao Lu, Srikanth Kandula, S. Chaudhuri
We will demonstrate a prototype query processing engine that uses probabilistic predicates (PPs) to speed up machine learning inference jobs. In current analytic engines, machine learning functions are modeled as user-defined functions (UDFs), which are both time and resource intensive. These UDFs prevent predicate pushdown: predicates that use the outputs of these UDFs cannot be pushed ahead of the UDFs. Hence, considerable time and resources are wasted in applying the UDFs to inputs that will be rejected by the subsequent predicate. We use PPs, which are lightweight classifiers applied directly to the raw input, to filter out data blobs that disagree with the query predicate. By reducing the input to be processed by the UDFs, PPs substantially improve query processing. We will show that PPs are broadly applicable by constructing PPs for many inference tasks, including image recognition, document classification, and video analysis. We will also demonstrate query optimization methods that extend PPs to complex query predicates and support different accuracy requirements.
Citations: 2
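The filtering idea in the probabilistic predicates abstract can be sketched as a cheap score check placed in front of an expensive UDF. Everything named here is an assumption for illustration: `filter_with_pp`, the `hint` field standing in for a lightweight classifier score, and the threshold value are hypothetical, and the real engine's interfaces are not described in the abstract.

```python
from typing import Any, Callable, Dict, List

Blob = Dict[str, Any]


def filter_with_pp(
    blobs: List[Blob],
    pp_score: Callable[[Blob], float],
    threshold: float,
    udf: Callable[[Blob], Any],
    predicate: Callable[[Any], bool],
) -> List[Blob]:
    """Run the expensive UDF only on blobs the cheap PP does not reject."""
    results = []
    for blob in blobs:
        if pp_score(blob) < threshold:
            continue  # Rejected early; the UDF is never invoked for this blob.
        if predicate(udf(blob)):
            results.append(blob)
    return results


udf_calls: List[int] = []  # Records which blobs reached the expensive UDF.


def expensive_udf(blob: Blob) -> str:
    udf_calls.append(blob["id"])
    return blob["label"]  # Stand-in for a costly model inference.


blobs = [
    {"id": 1, "hint": 0.9, "label": "car"},
    {"id": 2, "hint": 0.1, "label": "tree"},
    {"id": 3, "hint": 0.8, "label": "tree"},
    {"id": 4, "hint": 0.05, "label": "car"},
]
matches = filter_with_pp(blobs, lambda b: b["hint"], 0.5, expensive_udf, lambda out: out == "car")
print([b["id"] for b in matches], len(udf_calls))
```

Blob 4 is a true match that the cheap filter drops, which shows the accuracy trade-off: a lower threshold recovers it at the cost of more UDF calls.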
Speeding Up Set Intersections in Graph Algorithms using SIMD Instructions
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3196924
Shuo Han, Lei Zou, J. Yu
In this paper, we focus on accelerating a widely employed computing pattern, set intersection, to boost a group of graph algorithms. A graph's adjacency lists can be naturally considered as node sets, so set intersection is a primitive operation in many graph algorithms. We propose QFilter, a set intersection algorithm using SIMD instructions. QFilter adopts a merge-based framework and compares two blocks of elements iteratively using SIMD instructions. The key insight for our improvement is that we quickly filter out most unnecessary comparisons in one byte-checking step. We also present a binary representation called BSR that encodes sets in a compact layout. By combining QFilter and BSR, we achieve data parallelism at two levels: inter-chunk and intra-chunk parallelism. Moreover, we find that node ordering impacts the performance of intersection by affecting the compactness of BSR. We formulate the graph reordering problem as an optimization of the compactness of BSR, and prove its strong NP-completeness. Thus we propose an approximate algorithm that can find a better ordering to enhance the intra-chunk parallelism. We conduct extensive experiments to confirm that our approach can improve the performance of set intersection in graph algorithms significantly.
Citations: 49
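The merge-based intersection and compact set encoding described in the SIMD abstract can be approximated in plain Python. The (base, bitmap) chunk layout, the `CHUNK_BITS` width, and the function names below are assumptions chosen for illustration; the abstract does not give the exact BSR layout, and Python has no SIMD, so a single bitwise AND stands in for the intra-chunk step. This is not QFilter itself.

```python
from typing import Dict, Iterable, List, Tuple

CHUNK_BITS = 32  # Hypothetical chunk width; one bitmap covers 32 consecutive IDs.
Chunk = Tuple[int, int]  # (base, bitmap)


def encode(values: Iterable[int]) -> List[Chunk]:
    """Group non-negative IDs into sorted (base, bitmap) chunks."""
    chunks: Dict[int, int] = {}
    for v in values:
        base, offset = divmod(v, CHUNK_BITS)
        chunks[base] = chunks.get(base, 0) | (1 << offset)
    return sorted(chunks.items())


def intersect(a: List[Chunk], b: List[Chunk]) -> List[Chunk]:
    """Merge on bases; AND the bitmaps when two bases match."""
    i = j = 0
    out: List[Chunk] = []
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            common = a[i][1] & b[j][1]  # Compares up to CHUNK_BITS elements at once.
            if common:
                out.append((a[i][0], common))
            i += 1
            j += 1
    return out


def decode(chunks: List[Chunk]) -> List[int]:
    """Expand chunks back into a sorted list of IDs."""
    return [
        base * CHUNK_BITS + off
        for base, bits in chunks
        for off in range(CHUNK_BITS)
        if (bits >> off) & 1
    ]


neighbors_u = encode([1, 3, 33, 64, 100])
neighbors_v = encode([3, 4, 33, 65, 100, 200])
print(decode(intersect(neighbors_u, neighbors_v)))
```

The sketch also hints at why node ordering matters: `encode([0, 1, 2, 3])` fits in one chunk, while `encode([0, 32, 64, 96])` needs four, so clustered IDs give a more compact encoding and more work per AND.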
Kubernetes and the New Cloud
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183725
E. Brewer
We are in the midst of shifting the notion of “Cloud” to a higher level of abstraction than virtual machines — one based on services, processes and APIs. Kubernetes epitomizes this shift and has rapidly become the de facto way to manage this new era of container-based applications. It aims to simplify the deployment and management of services, including the construction of applications as sets of interacting but independent services. We explain some of the key concepts in Kubernetes and Istio and show how they work together to simplify evolution, scaling and operations.
Citations: 7
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3196934
Manasi Vartak, Joana M. F. da Trindade, S. Madden, M. Zaharia
Model diagnosis is the process of analyzing machine learning (ML) model performance to identify where the model works well and where it doesn't. It is a key part of the modeling process and helps ML developers iteratively improve model accuracy. Often, model diagnosis is performed by analyzing different datasets or intermediates associated with the model, such as the input data and hidden representations learned by the model (e.g., [4, 24, 39]). The bottleneck in fast model diagnosis is the creation and storage of model intermediates. Storing these intermediates requires tens to hundreds of GB of storage, whereas re-running the model for each diagnostic query slows down model diagnosis. To address this bottleneck, we propose a system called MISTIQUE that can work with traditional ML pipelines as well as deep neural networks to efficiently capture, store, and query model intermediates for diagnosis. For each diagnostic query, MISTIQUE intelligently chooses whether to re-run the model or read a previously stored intermediate. For intermediates that are stored in MISTIQUE, we propose a range of optimizations to reduce storage footprint, including quantization, summarization, and data de-duplication. We evaluate our techniques on a range of real-world ML models in scikit-learn and TensorFlow. We demonstrate that our optimizations reduce storage by up to 110X for traditional ML pipelines and up to 6X for deep neural networks. Furthermore, by using MISTIQUE, we can speed up diagnostic queries by up to 390X on traditional ML pipelines and up to 210X on deep neural networks.
Citations: 57
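Two of the storage optimizations named in the MISTIQUE abstract, quantization and de-duplication, can be sketched generically. This is an assumption-laden illustration rather than MISTIQUE's implementation: the 8-bit linear quantizer, the content-hash store, and the names `quantize`, `dequantize`, and `DedupStore` are all hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def quantize(values: List[float]) -> Tuple[bytes, float, float]:
    """Map floats linearly onto 8-bit codes; returns (codes, offset, scale)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # Fall back to 1.0 when all values are equal.
    codes = bytes(round((v - lo) / scale) for v in values)
    return codes, lo, scale


def dequantize(codes: bytes, lo: float, scale: float) -> List[float]:
    """Approximate reconstruction; the error is at most about scale / 2 per value."""
    return [lo + c * scale for c in codes]


class DedupStore:
    """Stores each distinct payload once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.index: Dict[str, str] = {}

    def put(self, name: str, payload: bytes) -> None:
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)  # Identical payloads share one copy.
        self.index[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.index[name]]


activations = [0.0, 0.25, 0.5, 1.0]
codes, lo, scale = quantize(activations)
store = DedupStore()
store.put("model_v1/layer3", codes)
store.put("model_v2/layer3", codes)  # Unchanged layer: no extra storage used.
print(len(codes), len(store.blobs))
```

Each float shrinks to one byte here, and the two model versions share a single stored payload because their intermediates are byte-identical.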
A Rating-Ranking Method for Crowdsourced Top-k Computation
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183762
Kaiyu Li, Xiaohang Zhang, Guoliang Li
Crowdsourced top-k computation aims to utilize the human ability to identify the top-k objects from a given set of objects. Most existing studies employ a pairwise-comparison-based method, which first asks workers to compare each pair of objects and then infers the top-k results based on the pairwise comparison results. Obviously, comparing every object pair is quadratic, and these methods involve huge monetary cost, especially for large datasets. To address this problem, we propose a rating-ranking-based approach, which contains two types of questions to ask the crowd. The first is a rating question, which asks the crowd to give a score for an object. The second is a ranking question, which asks the crowd to rank several (e.g., 3) objects. Rating questions are coarse grained and can roughly get a score for each object, which can be used to prune the objects whose scores are much smaller than those of the top-k objects. Ranking questions are fine grained and can be used to refine the scores. We propose a unified model for the rating and ranking questions, and seamlessly combine them to compute the top-k results. We also study how to judiciously select appropriate rating or ranking questions and assign them to an incoming worker. Experimental results on real datasets show that our method significantly outperforms existing approaches.
Citations: 18
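The pruning role of rating questions in the top-k abstract can be sketched as follows. The rule used here (drop any object whose average rating falls more than a fixed `margin` below the k-th best average) is an assumption made for illustration, as are the names `prune_by_rating` and `pairwise_questions`; the paper's unified model is not reproduced.

```python
from statistics import mean
from typing import Dict, List


def pairwise_questions(n: int) -> int:
    """Number of questions needed to compare every pair among n objects."""
    return n * (n - 1) // 2


def prune_by_rating(ratings: Dict[str, List[float]], k: int, margin: float) -> List[str]:
    """Keep objects whose mean rating is within margin of the k-th best mean (k >= 1)."""
    avg = {obj: mean(scores) for obj, scores in ratings.items()}
    ranked = sorted(avg.values(), reverse=True)
    kth_best = ranked[min(k, len(ranked)) - 1]
    return sorted(obj for obj, score in avg.items() if score >= kth_best - margin)


crowd_ratings = {
    "a": [5, 4],
    "b": [4, 4],
    "c": [3.5, 4],
    "d": [1, 2],
    "e": [2, 1],
}
survivors = prune_by_rating(crowd_ratings, k=2, margin=0.5)
print(survivors, pairwise_questions(len(crowd_ratings)), pairwise_questions(len(survivors)))
```

With these assumed ratings, coarse scores cut the candidate set from five objects to three, so the finer-grained ranking questions only need to cover the survivors.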
Improving Join Reorderability with Compensation Operators
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183731
Taining Wang, C. Chan
A critical task in query optimization is the join reordering problem, which is to find an efficient evaluation order for the join operators in a query plan. While the join reordering problem is well studied for queries with only inner-joins, the problem becomes considerably harder when outerjoins/antijoins are involved, as such operators are generally not associative. The existing solutions for this problem do not enumerate the complete space of join orderings due to various restrictions on the query rewriting rules considered. In this paper, we present a novel approach to this problem for the class of queries involving inner-joins, single-sided outerjoins, and/or antijoins. Our work is able to support complete join reorderability for this class of queries, which supersedes the state-of-the-art approaches.
Citations: 2
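The claim in the join reordering abstract that outerjoins are generally not associative can be checked on a tiny example. The sketch below uses hypothetical helper functions over tuples, with `None` playing the role of SQL NULL (which never matches in an equality join). It only demonstrates the problem; it does not implement the paper's compensation operators.

```python
from typing import List, Tuple

Row = Tuple
Relation = List[Row]


def inner_join(left: Relation, right: Relation, li: int, ri: int) -> Relation:
    """Equi-join on left[li] == right[ri]; None never matches anything."""
    return [l + r for l in left for r in right if l[li] is not None and l[li] == r[ri]]


def left_outer_join(left: Relation, right: Relation, li: int, ri: int, right_width: int) -> Relation:
    """Like inner_join, but unmatched left rows are kept and padded with None."""
    out: Relation = []
    for l in left:
        matches = [l + r for r in right if l[li] is not None and l[li] == r[ri]]
        out.extend(matches or [l + (None,) * right_width])
    return out


R = [(1,)]  # R(a)
S = [(2,)]  # S(b)
T = [(2,)]  # T(c)

# Plan A: (R left-outer-join S on a = b) inner-join T on b = c
plan_a = inner_join(left_outer_join(R, S, 0, 0, 1), T, 1, 0)

# Plan B: R left-outer-join (S inner-join T on b = c) on a = b
plan_b = left_outer_join(R, inner_join(S, T, 0, 0), 0, 0, 2)

print(plan_a, plan_b)
```

Plan A loses the row from R because its padded `b` is NULL when the inner join runs, while Plan B keeps it, so the two groupings are not interchangeable without some form of compensation.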
RDSQ: Reliable Queue Protocol over Shared Logs
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183718
Haolin Yu
Citations: 0