{"title":"Session details: Research 14: Approximate Query Processing","authors":"Stratos Idreos","doi":"10.1145/3258022","DOIUrl":"https://doi.org/10.1145/3258022","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"114 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89398102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Cascading Analysts Algorithm","authors":"M. Ruhl, Mukund Sundararajan, Qiqi Yan","doi":"10.1145/3183713.3183745","DOIUrl":"https://doi.org/10.1145/3183713.3183745","url":null,"abstract":"We study changes in metrics that are defined on a cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for a majority of the change. An organization interested in improving the metric can then focus their attention on these data segments. Our key contribution is an algorithm that naturally mimics the operation of a hierarchical organization of analysts. The algorithm has been successfully applied within Google's ad platform (AdWords) to help Google's advertisers triage the performance of their advertising campaigns, and within Google Analytics to help website developers understand their traffic. We empirically analyze the runtime and quality of the algorithm by comparing it against benchmarks for a census dataset. We prove theoretical, worst-case bounds on the performance of the algorithm. For instance, we show that the algorithm is optimal for two dimensions, and has an approximation ratio log d-2 (n+1) for d ≥ 3 dimensions, where n is the number of input data segments. For the advertising application, we can show that our algorithm is a 2-approximation. To characterize the hardness of the problem, we study data patterns called conflicts These allow us to construct hard instances of the problem, and derive a lower bound of 1.144 d-2 (again d ≥3) for our algorithm, and to show that the problem is NP-hard; this justifies are focus on approximation.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77060506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning","authors":"Hwanjun Song, Jae-Gil Lee","doi":"10.1145/3183713.3196887","DOIUrl":"https://doi.org/10.1145/3183713.3196887","url":null,"abstract":"In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning , that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary , to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"147 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79590868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research 9: Similarity Queries & Estimation","authors":"Abolfazl Asudeh","doi":"10.1145/3258016","DOIUrl":"https://doi.org/10.1145/3258016","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88954938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time","authors":"Wei Cao, Yusong Gao, Bingchen Lin, Xiaojie Feng, Yu Xie, Xiao Lou, Peng Wang","doi":"10.1145/3183713.3190659","DOIUrl":"https://doi.org/10.1145/3183713.3190659","url":null,"abstract":"Smooth end-to-end performance of mission-critical database system is essential to the stability of applications deployed on the cloud. It's a challenge for cloud database vendors to detect any performance degradation in real-time and locate the root cause quickly in sophisticated network environment. Cloud databases vendors tend to favor a multi-tier distributed architecture to achieve multi-tenant management, scalability and high-availability, which may further complicate the problem. This paper presents TcpRT, the instrument and diagnosis infrastructure in Alibaba Cloud RDS that achieves real-time anomaly detection. We wrote a Linux kernel module to collect trace data of each SQL query, designed to be efficient with minimal overhead, it adds tracepoints in callbacks of TCP congestion control kernel module, that is totally transparent to database processes. In order to reduce the amount of data significantly before sending it to backend, raw trace data is aggregated. Aggregated trace data is then processed, grouped and analyzed in a distributed streaming computing platform. By utilizing a self-adjustable Cauchy distribution statistical model from historical performance data for each DB instance, anomalous events can be automatically detected in databases, which eliminates manually configuring thresholds by experience. A fault or hiccup occurred in any network component that is shared among multiple DB instances (e.g. hosted on the same physical machine or uplinked to the same pair of TOR switches) may cause large-scale service quality degradations. The ratio of anomalous DB instances vs networks components is being calculated, which helps pinpoint the faulty component. TcpRT has been deployed in production at Alibaba Cloud for the past 3 years, collects over 20 million raw traces per second, and processes over 10 billion locally aggregated results in the backend per day, and managed to have within 1% performance impact on DB system. We present case studies of typical scenarios where TcpRT helps to solve various problems occurred in production system.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91180198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Demonstration of Probabilistic Predicates","authors":"Yao Lu, Srikanth Kandula, S. Chaudhuri","doi":"10.1145/3183713.3193542","DOIUrl":"https://doi.org/10.1145/3183713.3193542","url":null,"abstract":"We will demonstrate a prototype query processing engine that uses probabilistic predicates (PPs) to speed up machine learning inference jobs. In current analytic engines, machine learning functions are modeled as user-defined functions (UDFs) which are both time and resource intensive. These UDFs prevent predicate pushdown; predicates that use the outputs of these UDFs cannot be pushed to before the UDFs. Hence, considerable time and resources are wasted in applying the UDFs on inputs that will be rejected by the subsequent predicate. We uses PPs that are lightweight classifiers applied directly on the raw input and filter data blobs that disagree with the query predicate. By reducing the input to be processed by the UDFs, PPs substantially improve query processing. We will show that PPs are broadly applicable by constructing PPs for many inference tasks including image recognition, document classification and video analyses. We will also demonstrate query optimization methods that extend PPs to complex query predicates and support different accuracy requirements.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77225142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skyline Community Search in Multi-valued Networks","authors":"Ronghua Li, Lu Qin, Fanghua Ye, J. Yu, Xiaokui Xiao, Nong Xiao, Zibin Zheng","doi":"10.1145/3183713.3183736","DOIUrl":"https://doi.org/10.1145/3183713.3183736","url":null,"abstract":"Given a scientific collaboration network, how can we find a group of collaborators with high research indicator (e.g., h-index) and diverse research interests? Given a social network, how can we identify the communities that have high influence (e.g., PageRank) and also have similar interests to a specified user? In such settings, the network can be modeled as a multi-valued network where each node has d ($d ge 1$) numerical attributes (i.e., h-index, diversity, PageRank, similarity score, etc.). In the multi-valued network, we want to find communities that are not dominated by the other communities in terms of d numerical attributes. Most existing community search algorithms either completely ignore the numerical attributes or only consider one numerical attribute of the nodes. To capture d numerical attributes, we propose a novel community model, called skyline community, based on the concepts of k-core and skyline. A skyline community is a maximal connected k-core that cannot be dominated by the other connected k-cores in the d-dimensional attribute space. We develop an elegant space-partition algorithm to efficiently compute the skyline communities. Two striking advantages of our algorithm are that (1) its time complexity relies mainly on the size of the answer s (i.e., the number of skyline communities), thus it is very efficient if s is small; and (2) it can progressively output the skyline communities, which is very useful for applications that only require part of the skyline communities. Extensive experiments on both synthetic and real-world networks demonstrate the efficiency, scalability, and effectiveness of the proposed algorithm.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78124952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subjective Knowledge Base Construction Powered By Crowdsourcing and Knowledge Base","authors":"Hao Xin, Rui Meng, Lei Chen","doi":"10.1145/3183713.3183732","DOIUrl":"https://doi.org/10.1145/3183713.3183732","url":null,"abstract":"Knowledge base construction (KBC) has become a hot and in-time topic recently with the increasing application need of large-scale knowledge bases (KBs), such as semantic search, QA systems, the Google Knowledge Graph and IBM Watson QA System. Existing KBs mainly focus on encoding the factual facts of the world, e.g., city area and company product, which are regarded as the objective knowledge, whereas the subjective knowledge, which is frequently mentioned in Web queries, has been neglected. The subjective knowledge has no documented ground truth, instead, the truth relies on people's dominant opinion, which can be solicited from online crowd workers. In our work, we propose a KBC framework for subjective knowledge base construction taking advantage of the knowledge from the crowd and existing KBs. We develop a two-staged framework for subjective KB construction which consists of core subjective KB construction and subjective KB enrichment. Firstly, we try to build a core subjective KB mined from existing KBs, where every instance has rich objective properties. Then, we populate the core subjective KB with instances extracted from existing KBs, in which the crowd is leverage to annotate the subjective property of the instances. In order to optimize the crowd annotation process, we formulate the problem of subjective KB enrichment procedure as a cost-aware instance annotation problem and propose two instance annotation algorithms, i.e., adaptive instance annotation and batch-mode instance annotation algorithms. We develop a two-stage system for subjective KB construction which consists of core subjective KB construction and subjective knowledge enrichment. We evaluate our framework on real knowledge bases and a real crowdsourcing platform, the experimental results show that we can derive high quality subjective knowledge facts from existing KBs and crowdsourcing techniques through our proposed framework.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77969532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Algorithms for Finding Approximate Heavy Hitters in Personalized PageRanks","authors":"Sibo Wang, Yufei Tao","doi":"10.1145/3183713.3196919","DOIUrl":"https://doi.org/10.1145/3183713.3196919","url":null,"abstract":"Given a directed graph G, a source node s, and a target node t, the personalized PageRank (PPR of t with respect to s is the probability that a random walk starting from s terminates at t. The average of the personalized PageRank score of t with respect to each source node υ∈ V is exactly the PageRank score π( t ) of node t , which denotes the overall importance of node t in the graph. A heavy hitter of node t is a node whose contribution to π( t ) is above a φ fraction, where φ is a value between 0 and 1. Finding heavy hitters has important applications in link spam detection, classification of web pages, and friend recommendations. In this paper, we propose BLOG, an efficient framework for three types of heavy hitter queries: the pairwise approximate heavy hitter (AHH), the reverse AHH, and the multi-source reverse AHH queries. For pairwise AHH queries, our algorithm combines the Monte-Carlo approach and the backward propagation approach to reduce the cost of both methods, and incorporates new techniques to deal with high in-degree nodes. For reverse AHH and multi-source reverse AHH queries, our algorithm extends the ideas behind the pairwise AHH algorithm with a new \"logarithmic bucketing'' technique to improve the query efficiency. Extensive experiments demonstrate that our BLOG is far more efficient than alternative solutions on the three queries.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81825550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Precision Interfaces for Different Modalities","authors":"Haoci Zhang, Viraj Raj, Thibault Sellam, Eugene Wu","doi":"10.1145/3183713.3193570","DOIUrl":"https://doi.org/10.1145/3183713.3193570","url":null,"abstract":"Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system to generate task-specific data analytics interfaces. Precision Interface can turn a log of executed programs into an interface, by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but we can generalize the approach to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into each other. Then, it finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each of Precision Interface's components, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for data-driven interactive interface mining from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple different interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural language analysis interfaces for commonly performed analyses.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88869549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}