
Proceedings of the 2018 International Conference on Management of Data: Latest Publications

Session details: Research 14: Approximate Query Processing
Stratos Idreos
{"title":"Session details: Research 14: Approximate Query Processing","authors":"Stratos Idreos","doi":"10.1145/3258022","DOIUrl":"https://doi.org/10.1145/3258022","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"114 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89398102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Cascading Analysts Algorithm
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183745
M. Ruhl, Mukund Sundararajan, Qiqi Yan
We study changes in metrics that are defined on a Cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc.). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for a majority of the change. An organization interested in improving the metric can then focus its attention on these data segments. Our key contribution is an algorithm that naturally mimics the operation of a hierarchical organization of analysts. The algorithm has been successfully applied within Google's ad platform (AdWords) to help Google's advertisers triage the performance of their advertising campaigns, and within Google Analytics to help website developers understand their traffic. We empirically analyze the runtime and quality of the algorithm by comparing it against benchmarks on a census dataset. We prove theoretical, worst-case bounds on the performance of the algorithm. For instance, we show that the algorithm is optimal for two dimensions, and has an approximation ratio of log^(d-2)(n+1) for d ≥ 3 dimensions, where n is the number of input data segments. For the advertising application, we can show that our algorithm is a 2-approximation. To characterize the hardness of the problem, we study data patterns called conflicts. These allow us to construct hard instances of the problem, to derive a lower bound of 1.144^(d-2) (again for d ≥ 3) for our algorithm, and to show that the problem is NP-hard; this justifies our focus on approximation.
Citations: 8
RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196887
Hwanjun Song, Jae-Gil Lee
In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing, to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems, including load imbalance between data partitions, especially on a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement the random partitions. We then develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (RP-DBSCAN for short), that uses pseudo random partitioning together with the two-level cell dictionary. The algorithm finds the local clusters in each data partition in parallel and then merges these local clusters to obtain the global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very lightweight, and clustering on each split is not held up by any particular straggling worker. As a result, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms, by up to 180 times.
Citations: 50
Session details: Research 9: Similarity Queries & Estimation
Abolfazl Asudeh
{"title":"Session details: Research 9: Similarity Queries & Estimation","authors":"Abolfazl Asudeh","doi":"10.1145/3258016","DOIUrl":"https://doi.org/10.1145/3258016","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88954938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190659
Wei Cao, Yusong Gao, Bingchen Lin, Xiaojie Feng, Yu Xie, Xiao Lou, Peng Wang
Smooth end-to-end performance of mission-critical database systems is essential to the stability of applications deployed on the cloud. It is a challenge for cloud database vendors to detect performance degradation in real time and to locate its root cause quickly in a sophisticated network environment. Cloud database vendors tend to favor a multi-tier distributed architecture to achieve multi-tenant management, scalability, and high availability, which may further complicate the problem. This paper presents TcpRT, the instrumentation and diagnosis infrastructure in Alibaba Cloud RDS that achieves real-time anomaly detection. We wrote a Linux kernel module to collect trace data for each SQL query; designed to be efficient with minimal overhead, it adds tracepoints in the callbacks of the TCP congestion control kernel module and is therefore completely transparent to database processes. To significantly reduce the amount of data before sending it to the backend, raw trace data is aggregated. The aggregated trace data is then processed, grouped, and analyzed in a distributed streaming computing platform. By fitting a self-adjusting Cauchy-distribution statistical model to historical performance data for each DB instance, anomalous events can be detected automatically, which eliminates manually configuring thresholds by experience. A fault or hiccup in any network component that is shared among multiple DB instances (e.g., hosted on the same physical machine or uplinked to the same pair of TOR switches) may cause large-scale service-quality degradation. We therefore compute the ratio of anomalous DB instances per network component, which helps pinpoint the faulty component. TcpRT has been deployed in production at Alibaba Cloud for the past 3 years; it collects over 20 million raw traces per second, processes over 10 billion locally aggregated results in the backend per day, and keeps its performance impact on the DB system within 1%. We present case studies of typical scenarios in which TcpRT helped solve various problems that occurred in production systems.
Citations: 13
Precision Interfaces for Different Modalities
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193570
Haoci Zhang, Viraj Raj, Thibault Sellam, Eugene Wu
Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system that generates task-specific data analytics interfaces. Precision Interfaces can turn a log of executed programs into an interface by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but the approach generalizes to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into one another; it then finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each of Precision Interfaces' components, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for data-driven interactive interface mining from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural-language analysis interfaces for commonly performed analyses.
Citations: 8
Efficient Algorithms for Finding Approximate Heavy Hitters in Personalized PageRanks
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196919
Sibo Wang, Yufei Tao
Given a directed graph G, a source node s, and a target node t, the personalized PageRank (PPR) of t with respect to s is the probability that a random walk starting from s terminates at t. The average of the personalized PageRank scores of t with respect to each source node υ ∈ V is exactly the PageRank score π(t) of node t, which denotes the overall importance of node t in the graph. A heavy hitter of node t is a node whose contribution to π(t) is above a φ fraction, where φ is a value between 0 and 1. Finding heavy hitters has important applications in link-spam detection, classification of web pages, and friend recommendation. In this paper, we propose BLOG, an efficient framework for three types of heavy hitter queries: pairwise approximate heavy hitter (AHH), reverse AHH, and multi-source reverse AHH queries. For pairwise AHH queries, our algorithm combines the Monte-Carlo approach and the backward propagation approach to reduce the cost of both methods, and incorporates new techniques to deal with high in-degree nodes. For reverse AHH and multi-source reverse AHH queries, our algorithm extends the ideas behind the pairwise AHH algorithm with a new "logarithmic bucketing" technique to improve query efficiency. Extensive experiments demonstrate that BLOG is far more efficient than alternative solutions on all three queries.
Citations: 19
Splaying Log-Structured Merge-Trees
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183723
Thomas Lively, Luca Schroeder, Carlos Mendizábal
Modern persistent key-value stores typically use a log-structured merge-tree (LSM-tree) design, which allows for high write throughput. Our observation is that the LSM-tree, however, has suboptimal performance during read-intensive workload windows with non-uniform key access distributions. To address this shortcoming, we propose and analyze a simple decision scheme that can be added to any LSM-based key-value store and dramatically reduces the number of disk I/Os for these classes of workloads. The key insight is that copying a frequently accessed key to the top of an LSM-tree ("splaying") allows cheaper reads on that key in the near future.
Citations: 3
Session details: Research 11: Data Mining
L. Lakshmanan
{"title":"Session details: Research 11: Data Mining","authors":"L. Lakshmanan","doi":"10.1145/3258018","DOIUrl":"https://doi.org/10.1145/3258018","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90504153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193537
Fotis Psallidas, Eugene Wu
Data lineage is a fundamental type of information that describes the relationships between the input and output data items of a workflow. As such, an immense number of data-intensive applications whose logic operates over input-output relationships can be expressed declaratively in lineage terms. Unfortunately, many applications resort to hand-tuned implementations, either because lineage systems are not fast enough to meet their requirements or because their developers are unaware of the available lineage capabilities. Recently, we introduced a set of implementation design principles and associated techniques to optimize lineage-enabled database engines, and realized them in our prototype database engine, Smoke. In this demonstration, we showcase lineage as the building block across a variety of data-intensive applications, including tooltips and details-on-demand, crossfilter, and data profiling. In addition, we show how Smoke outperforms alternative lineage systems to meet or improve on existing hand-tuned implementations of these applications.
Citations: 4