We address the problem of sorting the top-k elements of a set, given a predefined partial order over the set elements. Our means to obtain missing order information is via a comparison operator that interacts with a crowd of domain experts to determine the order between two unordered items. The practical motivation for studying this problem is the common scenario where elements cannot be easily compared by machines and thus human experts are harnessed for this task. As some initial partial order is given, our goal is to optimally exploit it in order to minimize the domain experts work. The problem lies at the intersection of two well-studied problems in the theory and crowdsourcing communities:full sorting under partial order information and top-k sorting with no prior partial order information. As we show, resorting to one of the existing state-of-the-art algorithms in these two problems turns out to be extravagant in terms of the number of comparisons performed by the users. In light of this, we present a dedicated algorithm for top-k sorting that aims to minimize the number of comparisons by thoroughly leveraging the partial order information. We examine two possible interpretations of the comparison operator, taken from the theory and crowdsourcing communities, and demonstrate the efficiency and effectiveness of our algorithm for both of them. We further demonstrate the utility of our algorithm, beyond identifying the top-k elements in a dataset, as a vehicle to improve the performance of Learning-to-Rank algorithms in machine learning context. We conduct a comprehensive experimental evaluation in both synthetic and real-world settings.
{"title":"Top-k Sorting Under Partial Order Information","authors":"Eyal Dushkin, T. Milo","doi":"10.1145/3183713.3199672","DOIUrl":"https://doi.org/10.1145/3183713.3199672","url":null,"abstract":"We address the problem of sorting the top-k elements of a set, given a predefined partial order over the set elements. Our means to obtain missing order information is via a comparison operator that interacts with a crowd of domain experts to determine the order between two unordered items. The practical motivation for studying this problem is the common scenario where elements cannot be easily compared by machines and thus human experts are harnessed for this task. As some initial partial order is given, our goal is to optimally exploit it in order to minimize the domain experts work. The problem lies at the intersection of two well-studied problems in the theory and crowdsourcing communities:full sorting under partial order information and top-k sorting with no prior partial order information. As we show, resorting to one of the existing state-of-the-art algorithms in these two problems turns out to be extravagant in terms of the number of comparisons performed by the users. In light of this, we present a dedicated algorithm for top-k sorting that aims to minimize the number of comparisons by thoroughly leveraging the partial order information. We examine two possible interpretations of the comparison operator, taken from the theory and crowdsourcing communities, and demonstrate the efficiency and effectiveness of our algorithm for both of them. We further demonstrate the utility of our algorithm, beyond identifying the top-k elements in a dataset, as a vehicle to improve the performance of Learning-to-Rank algorithms in machine learning context. We conduct a comprehensive experimental evaluation in both synthetic and real-world settings.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74801859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A common operation in many data analytics workloads is to find the top-k items, i.e., the largest or smallest operations according to some sort order (implemented via LIMIT or ORDER BY expressions in SQL). A naive implementation of top-k is to sort all of the items and then return the first k, but this does much more work than needed. Although efficient implementations for top-k have been explored on traditional multi-core processors, there has been no prior systematic study of top-k implementations on GPUs, despite open requests for such implementations in GPU-based frameworks like TensorFlow and ArrayFire. In this work, we present several top-k algorithms for GPUs, including a new algorithm based on bitonic sort called bitonic top-k. The bitonic top-k algorithm is up to a factor of new15x faster than sort and 4x faster than a variety of other possible implementations for values of k up to 256. We also develop a cost model to predict the performance of several of our algorithms, and show that it accurately predicts actual performance on modern GPUs.
{"title":"Efficient Top-K Query Processing on Massively Parallel Hardware","authors":"Anil Shanbhag, H. Pirk, S. Madden","doi":"10.1145/3183713.3183735","DOIUrl":"https://doi.org/10.1145/3183713.3183735","url":null,"abstract":"A common operation in many data analytics workloads is to find the top-k items, i.e., the largest or smallest operations according to some sort order (implemented via LIMIT or ORDER BY expressions in SQL). A naive implementation of top-k is to sort all of the items and then return the first k, but this does much more work than needed. Although efficient implementations for top-k have been explored on traditional multi-core processors, there has been no prior systematic study of top-k implementations on GPUs, despite open requests for such implementations in GPU-based frameworks like TensorFlow and ArrayFire. In this work, we present several top-k algorithms for GPUs, including a new algorithm based on bitonic sort called bitonic top-k. The bitonic top-k algorithm is up to a factor of new15x faster than sort and 4x faster than a variety of other possible implementations for values of k up to 256. We also develop a cost model to predict the performance of several of our algorithms, and show that it accurately predicts actual performance on modern GPUs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74818824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote1","authors":"P. Bernstein","doi":"10.1145/3258003","DOIUrl":"https://doi.org/10.1145/3258003","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72863947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research 6: Storage & Indexing","authors":"K. A. Ross","doi":"10.1145/3258010","DOIUrl":"https://doi.org/10.1145/3258010","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79311329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research 13: Machine Learning & Knowledge-base Construction","authors":"Guoliang Li","doi":"10.1145/3258020","DOIUrl":"https://doi.org/10.1145/3258020","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84238589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Ma, Dana Van Aken, Ahmed S. Hefny, Gustavo Mezerhane, Andrew Pavlo, Geoffrey J. Gordon
The first step towards an autonomous database management system (DBMS) is the ability to model the target application's workload. This is necessary to allow the system to anticipate future workload needs and select the proper optimizations in a timely manner. Previous forecasting techniques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless. We present a robust forecasting framework called QueryBot 5000 that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. To better support highly dynamic environments, our approach uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution. It provides multiple horizons (short- vs. long-term) with different aggregation intervals. We also present a clustering-based technique for reducing the total number of forecasting models to maintain. To evaluate our approach, we compare our forecasting models against other state-of-the-art models on three real-world database traces. We implemented our models in an external controller for PostgreSQL and MySQL and demonstrate their effectiveness in selecting indexes.
{"title":"Query-based Workload Forecasting for Self-Driving Database Management Systems","authors":"Lin Ma, Dana Van Aken, Ahmed S. Hefny, Gustavo Mezerhane, Andrew Pavlo, Geoffrey J. Gordon","doi":"10.1145/3183713.3196908","DOIUrl":"https://doi.org/10.1145/3183713.3196908","url":null,"abstract":"The first step towards an autonomous database management system (DBMS) is the ability to model the target application's workload. This is necessary to allow the system to anticipate future workload needs and select the proper optimizations in a timely manner. Previous forecasting techniques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless. We present a robust forecasting framework called QueryBot 5000 that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. To better support highly dynamic environments, our approach uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution. It provides multiple horizons (short- vs. long-term) with different aggregation intervals. We also present a clustering-based technique for reducing the total number of forecasting models to maintain. To evaluate our approach, we compare our forecasting models against other state-of-the-art models on three real-world database traces. We implemented our models in an external controller for PostgreSQL and MySQL and demonstrate their effectiveness in selecting indexes.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79657544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Industry 2: Real-time Analytics","authors":"Barzan Mozafari","doi":"10.1145/3258011","DOIUrl":"https://doi.org/10.1145/3258011","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78771786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets of the dimension up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask "Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
{"title":"DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions","authors":"Jiawei Jiang, B. Cui, Ce Zhang, Fangcheng Fu","doi":"10.1145/3183713.3196892","DOIUrl":"https://doi.org/10.1145/3183713.3196892","url":null,"abstract":"Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets of the dimension up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask \"Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?\" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78813421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Twitter's data engineering team is faced with the challenge of processing billions of events every day in batch and in real time, and we have built various tools to meet these demands. In this paper, we describe TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device, engagement type, etc. TSAR is built on top of Summingbird, an open-source framework for integrating batch and online MapReduce computations, and removes much of the tedium associated with building end-to-end aggregation pipelines---from the ingestion and processing of events to the publication of results in heterogeneous datastores. Clients are provided a query interface that powers dashboards and supports downstream ad hoc analytics.
{"title":"Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter","authors":"Peilin Yang, S. Thiagarajan, Jimmy J. Lin","doi":"10.1145/3183713.3190663","DOIUrl":"https://doi.org/10.1145/3183713.3190663","url":null,"abstract":"Twitter's data engineering team is faced with the challenge of processing billions of events every day in batch and in real time, and we have built various tools to meet these demands. In this paper, we describe TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device, engagement type, etc. TSAR is built on top of Summingbird, an open-source framework for integrating batch and online MapReduce computations, and removes much of the tedium associated with building end-to-end aggregation pipelines---from the ingestion and processing of events to the publication of results in heterogeneous datastores. Clients are provided a query interface that powers dashboards and supports downstream ad hoc analytics.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"120 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89637741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this demonstration we present POIsam, a visualization system supporting the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the current region of user's interest, and any two selected objects should not be too close to each other for users to distinguish in the limited space of a screen. One unique feature of POISam is that any similarity metrics can be plugged into POISam to meet the user's specific needs in different scenarios. The latter two consistencies are fundamental challenges to efficiently update the selection result w.r.t. user's zoom in, zoom out and panning operations when they interact with the map. POISam drops a common assumption from all previous work, i.e. the zoom levels and region cells are pre-defined and indexed, and objects are selected from such region cells at a particular zoom level rather than from user's current region of interest (which in most cases do not correspond to the pre-defined cells). It results in extra challenge as we need to do object selection via online computation. To our best knowledge, this is the first system that is able to meet all the four features to achieve an interactive visualization map exploration system.
{"title":"POIsam: a System for Efficient Selection of Large-scale Geospatial Data on Maps","authors":"Tao Guo, Mingzhao Li, Peishan Li, Z. Bao, G. Cong","doi":"10.1145/3183713.3193549","DOIUrl":"https://doi.org/10.1145/3183713.3193549","url":null,"abstract":"In this demonstration we present POIsam, a visualization system supporting the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the current region of user's interest, and any two selected objects should not be too close to each other for users to distinguish in the limited space of a screen. One unique feature of POISam is that any similarity metrics can be plugged into POISam to meet the user's specific needs in different scenarios. The latter two consistencies are fundamental challenges to efficiently update the selection result w.r.t. user's zoom in, zoom out and panning operations when they interact with the map. POISam drops a common assumption from all previous work, i.e. the zoom levels and region cells are pre-defined and indexed, and objects are selected from such region cells at a particular zoom level rather than from user's current region of interest (which in most cases do not correspond to the pre-defined cells). It results in extra challenge as we need to do object selection via online computation. To our best knowledge, this is the first system that is able to meet all the four features to achieve an interactive visualization map exploration system.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86689664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}