
Proceedings of the 2018 International Conference on Management of Data: Latest Publications

Session details: Research 14: Approximate Query Processing
Stratos Idreos
{"title":"Session details: Research 14: Approximate Query Processing","authors":"Stratos Idreos","doi":"10.1145/3258022","DOIUrl":"https://doi.org/10.1145/3258022","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"114 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89398102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Cascading Analysts Algorithm
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183745
M. Ruhl, Mukund Sundararajan, Qiqi Yan
We study changes in metrics that are defined on a Cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc.). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for a majority of the change. An organization interested in improving the metric can then focus its attention on these data segments. Our key contribution is an algorithm that naturally mimics the operation of a hierarchical organization of analysts. The algorithm has been successfully applied within Google's ad platform (AdWords) to help Google's advertisers triage the performance of their advertising campaigns, and within Google Analytics to help website developers understand their traffic. We empirically analyze the runtime and quality of the algorithm by comparing it against benchmarks on a census dataset. We prove theoretical, worst-case bounds on the performance of the algorithm. For instance, we show that the algorithm is optimal for two dimensions, and has an approximation ratio of log^(d-2)(n+1) for d ≥ 3 dimensions, where n is the number of input data segments. For the advertising application, we can show that our algorithm is a 2-approximation. To characterize the hardness of the problem, we study data patterns called conflicts. These allow us to construct hard instances of the problem, to derive a lower bound of 1.144^(d-2) (again for d ≥ 3) for our algorithm, and to show that the problem is NP-hard; this justifies our focus on approximation.
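To make the "hierarchical organization of analysts" intuition concrete, here is a toy sketch of greedy drill-down along a single hierarchical dimension. It is not the paper's algorithm: the 0.5 diffuseness threshold and the budget parameter are invented for illustration.

```python
def explain_change(node, children, delta, budget):
    """Greedily pick at most `budget` non-overlapping descendant segments
    that account for most of the change at `node`; report the node itself
    when the change is diffuse across its children."""
    kids = children.get(node, [])
    if not kids:
        return [(node, delta[node])]
    ranked = sorted(kids, key=lambda c: abs(delta[c]), reverse=True)
    picked = ranked[:budget]
    covered = sum(abs(delta[c]) for c in picked)
    if covered < 0.5 * abs(delta[node]):   # change is diffuse: stop here
        return [(node, delta[node])]
    out = []
    for c in picked:                       # delegate to "sub-analysts"
        out.extend(explain_change(c, children, delta, budget))
    return out[:budget]

# Example: a revenue drop in the US is concentrated in California.
children = {"US": ["US/CA", "US/NY"], "US/CA": [], "US/NY": []}
delta = {"US": -100.0, "US/CA": -90.0, "US/NY": -10.0}
print(explain_change("US", children, delta, budget=2))
```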
Citations: 8
RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3196887
Hwanjun Song, Jae-Gil Lee
In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems, including load imbalance between data partitions, especially on skewed data sets. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, which randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement the random partitions. Building on these, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (RP-DBSCAN for short), that uses pseudo random partitioning together with the two-level cell dictionary. The algorithm finds the local clusters of each data partition in parallel and then merges these local clusters to obtain the global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very lightweight, and clustering on each split is not held up by any single straggling worker. As a result, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms, by up to 180 times.
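A minimal sketch of the pseudo random partitioning idea, assuming 2-D points; the hash-based cell placement and names are illustrative, not the paper's Spark implementation. Points are grouped into small grid cells, and the cells, rather than the points, are spread across workers, so a skewed region does not land on a single partition.

```python
import math
from collections import defaultdict

def cell_of(point, eps):
    # Side length eps/sqrt(2) guarantees any two points in one 2-D cell
    # are within eps of each other (generally eps/sqrt(d) for d dims).
    side = eps / math.sqrt(2)
    return (int(point[0] // side), int(point[1] // side))

def pseudo_random_partition(points, eps, num_workers):
    cells = defaultdict(list)
    for p in points:                         # group points into small cells
        cells[cell_of(p, eps)].append(p)
    partitions = defaultdict(dict)
    for cell, members in cells.items():
        worker = hash(cell) % num_workers    # place whole cells, not points
        partitions[worker][cell] = members
    return partitions

pts = [(0.1, 0.1), (0.12, 0.11), (5.0, 5.0), (5.05, 5.02)]
for w, cells in pseudo_random_partition(pts, eps=0.25, num_workers=2).items():
    print(w, cells)
```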
Citations: 50
Session details: Research 9: Similarity Queries & Estimation
Abolfazl Asudeh
{"title":"Session details: Research 9: Similarity Queries & Estimation","authors":"Abolfazl Asudeh","doi":"10.1145/3258016","DOIUrl":"https://doi.org/10.1145/3258016","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88954938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3190659
Wei Cao, Yusong Gao, Bingchen Lin, Xiaojie Feng, Yu Xie, Xiao Lou, Peng Wang
Smooth end-to-end performance of mission-critical database systems is essential to the stability of applications deployed on the cloud. It is a challenge for cloud database vendors to detect performance degradation in real time and to locate the root cause quickly in a sophisticated network environment. Cloud database vendors tend to favor a multi-tier distributed architecture to achieve multi-tenant management, scalability, and high availability, which may further complicate the problem. This paper presents TcpRT, the instrumentation and diagnosis infrastructure in Alibaba Cloud RDS that achieves real-time anomaly detection. We wrote a Linux kernel module to collect trace data for each SQL query; designed to be efficient with minimal overhead, it adds tracepoints in the callbacks of the TCP congestion control kernel module and is completely transparent to database processes. To significantly reduce the amount of data before sending it to the backend, raw trace data is aggregated locally. The aggregated trace data is then processed, grouped, and analyzed on a distributed streaming computing platform. By fitting a self-adjusting Cauchy distribution statistical model to the historical performance data of each DB instance, anomalous events can be detected automatically, which eliminates the need to manually configure thresholds from experience. A fault or hiccup in any network component shared among multiple DB instances (e.g., instances hosted on the same physical machine or uplinked to the same pair of TOR switches) may cause large-scale service quality degradation. We compute the ratio of anomalous DB instances per network component, which helps pinpoint the faulty component. TcpRT has been deployed in production at Alibaba Cloud for the past 3 years; it collects over 20 million raw traces per second, processes over 10 billion locally aggregated results in the backend per day, and keeps its performance impact on the DB system within 1%. We present case studies of typical scenarios where TcpRT helped solve various problems that occurred in the production system.
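A hedged sketch of the statistical idea only, under crude assumptions: the Cauchy location is fit with the median and the scale with half the interquartile range (both robust to heavy-tailed latencies), and the tail-probability threshold is invented for illustration; the deployed self-adjusting model is more elaborate.

```python
import math

def fit_cauchy(samples):
    s = sorted(samples)
    n = len(s)
    loc = s[n // 2]                              # median ~ location x0
    scale = (s[(3 * n) // 4] - s[n // 4]) / 2.0  # half IQR ~ scale gamma
    return loc, max(scale, 1e-9)

def is_anomalous(x, loc, scale, p=0.995):
    # Cauchy CDF: F(x) = 1/2 + arctan((x - loc) / scale) / pi
    cdf = 0.5 + math.atan((x - loc) / scale) / math.pi
    return cdf > p                               # flag extreme-tail samples

history = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 1.0]  # past latencies (ms)
loc, scale = fit_cauchy(history)
print(is_anomalous(1.1, loc, scale))    # False: typical latency
print(is_anomalous(50.0, loc, scale))   # True: extreme tail event
```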
Citations: 13
Interactive Demonstration of Probabilistic Predicates
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3193542
Yao Lu, Srikanth Kandula, S. Chaudhuri
We will demonstrate a prototype query processing engine that uses probabilistic predicates (PPs) to speed up machine learning inference jobs. In current analytic engines, machine learning functions are modeled as user-defined functions (UDFs), which are both time and resource intensive. These UDFs prevent predicate pushdown: predicates that use the outputs of these UDFs cannot be evaluated before the UDFs run. Hence, considerable time and resources are wasted in applying the UDFs to inputs that will be rejected by the subsequent predicate. We use PPs, lightweight classifiers applied directly to the raw input, to filter out data blobs that disagree with the query predicate. By reducing the input to be processed by the UDFs, PPs substantially improve query processing. We will show that PPs are broadly applicable by constructing PPs for many inference tasks, including image recognition, document classification, and video analysis. We will also demonstrate query optimization methods that extend PPs to complex query predicates and support different accuracy requirements.
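A minimal sketch of how a PP short-circuits an expensive UDF; every name here is a toy stand-in, and in the real system the classifier is learned and the drop threshold is tuned against the query's accuracy requirement.

```python
def run_query(blobs, pp_score, expensive_udf, predicate, drop_below=0.1):
    results = []
    for blob in blobs:
        if pp_score(blob) < drop_below:   # PP says "almost surely fails"
            continue                      # skip the costly inference call
        label = expensive_udf(blob)       # e.g., a deep image classifier
        if predicate(label):
            results.append((blob, label))
    return results

# Toy stand-ins: blobs are integers, the "UDF" labels evens as "dog".
blobs = range(10)
pp = lambda b: 0.9 if b % 2 == 0 else 0.05   # cheap proxy for the UDF
udf = lambda b: "dog" if b % 2 == 0 else "cat"
print(run_query(blobs, pp, udf, lambda label: label == "dog"))
```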
Citations: 2
Skyline Community Search in Multi-valued Networks
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183736
Ronghua Li, Lu Qin, Fanghua Ye, J. Yu, Xiaokui Xiao, Nong Xiao, Zibin Zheng
Given a scientific collaboration network, how can we find a group of collaborators with high research indicators (e.g., h-index) and diverse research interests? Given a social network, how can we identify the communities that have high influence (e.g., PageRank) and also have interests similar to those of a specified user? In such settings, the network can be modeled as a multi-valued network where each node has d (d ≥ 1) numerical attributes (i.e., h-index, diversity, PageRank, similarity score, etc.). In the multi-valued network, we want to find communities that are not dominated by the other communities in terms of the d numerical attributes. Most existing community search algorithms either completely ignore the numerical attributes or consider only one numerical attribute of the nodes. To capture all d numerical attributes, we propose a novel community model, called the skyline community, based on the concepts of k-core and skyline. A skyline community is a maximal connected k-core that cannot be dominated by any other connected k-core in the d-dimensional attribute space. We develop an elegant space-partition algorithm to efficiently compute the skyline communities. Two striking advantages of our algorithm are that (1) its time complexity relies mainly on the size of the answer s (i.e., the number of skyline communities), so it is very efficient when s is small; and (2) it can output the skyline communities progressively, which is very useful for applications that require only part of the skyline communities. Extensive experiments on both synthetic and real-world networks demonstrate the efficiency, scalability, and effectiveness of the proposed algorithm.
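A small sketch of the dominance test that underlies the model: community A dominates community B if A is at least as good on all d attributes and strictly better on at least one. The candidate cores below are invented, and the hard part the paper solves, efficiently enumerating maximal connected k-cores, is not attempted here.

```python
def dominates(a, b):
    """True if attribute vector a dominates b in the skyline sense."""
    return all(x >= y for x, y in zip(a, b)) and \
           any(x > y for x, y in zip(a, b))

def skyline(vectors):
    """Keep only the non-dominated vectors."""
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]

# Each tuple: (avg h-index, diversity, PageRank) of a candidate k-core.
cores = [(30, 0.8, 0.9), (25, 0.9, 0.7), (20, 0.5, 0.6)]
print(skyline(cores))   # the third core is dominated by the first
```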
Citations: 72
Subjective Knowledge Base Construction Powered By Crowdsourcing and Knowledge Base
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3183732
Hao Xin, Rui Meng, Lei Chen
Knowledge base construction (KBC) has become a hot and timely topic recently with the increasing application needs of large-scale knowledge bases (KBs), such as semantic search, QA systems, the Google Knowledge Graph, and the IBM Watson QA System. Existing KBs mainly focus on encoding objective facts about the world, e.g., city area and company products, which are regarded as objective knowledge, whereas subjective knowledge, which is frequently mentioned in Web queries, has been neglected. Subjective knowledge has no documented ground truth; instead, the truth reflects people's dominant opinion, which can be solicited from online crowd workers. In our work, we propose a KBC framework for subjective knowledge base construction that takes advantage of knowledge from both the crowd and existing KBs. The framework has two stages: core subjective KB construction and subjective KB enrichment. First, we build a core subjective KB mined from existing KBs, in which every instance has rich objective properties. Then, we populate the core subjective KB with instances extracted from existing KBs, leveraging the crowd to annotate the subjective properties of the instances. To optimize the crowd annotation process, we formulate subjective KB enrichment as a cost-aware instance annotation problem and propose two instance annotation algorithms: an adaptive instance annotation algorithm and a batch-mode instance annotation algorithm. We evaluate our framework on real knowledge bases and a real crowdsourcing platform; the experimental results show that, through our proposed framework, we can derive high-quality subjective knowledge facts from existing KBs and crowdsourcing techniques.
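As a toy illustration of the cost-aware intuition behind adaptive instance annotation (the stopping rule and parameters are invented, not the paper's algorithm): purchase crowd votes one at a time and stop as soon as a dominant opinion emerges.

```python
import random
from collections import Counter

def adaptive_annotate(ask_worker, max_votes=9, margin=3):
    """Collect votes until one answer leads by `margin` or the budget runs out."""
    votes = Counter()
    for i in range(max_votes):
        votes[ask_worker()] += 1
        top = votes.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead >= margin:              # dominant opinion emerged: stop paying
            return top[0][0], i + 1
    return votes.most_common(1)[0][0], max_votes

random.seed(0)
# Hypothetical workers judging a subjective property of one instance.
worker = lambda: random.choices(["scenic", "not scenic"], weights=[0.8, 0.2])[0]
print(adaptive_annotate(worker))        # (label, number of votes purchased)
```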
Citations: 8
Efficient Algorithms for Finding Approximate Heavy Hitters in Personalized PageRanks
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3196919
Sibo Wang, Yufei Tao
Given a directed graph G, a source node s, and a target node t, the personalized PageRank (PPR) of t with respect to s is the probability that a random walk starting from s terminates at t. The average of the personalized PageRank scores of t with respect to every source node υ ∈ V is exactly the PageRank score π(t) of node t, which denotes the overall importance of node t in the graph. A heavy hitter of node t is a node whose contribution to π(t) is above a φ fraction, where φ is a value between 0 and 1. Finding heavy hitters has important applications in link spam detection, classification of web pages, and friend recommendation. In this paper, we propose BLOG, an efficient framework for three types of heavy hitter queries: pairwise approximate heavy hitter (AHH) queries, reverse AHH queries, and multi-source reverse AHH queries. For pairwise AHH queries, our algorithm combines the Monte-Carlo approach and the backward propagation approach to reduce the cost of both methods, and incorporates new techniques to deal with high in-degree nodes. For reverse AHH and multi-source reverse AHH queries, our algorithm extends the ideas behind the pairwise AHH algorithm with a new "logarithmic bucketing" technique to improve query efficiency. Extensive experiments demonstrate that BLOG is far more efficient than alternative solutions on all three query types.
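A minimal sketch of the Monte-Carlo building block that BLOG combines with backward propagation; the graph and parameters are toy assumptions. Each walk stops with probability α at every step, and the PPR of each node with respect to s is estimated as the fraction of walks terminating there.

```python
import random
random.seed(42)

def monte_carlo_ppr(graph, s, num_walks=20000, alpha=0.2):
    """Estimate PPR(s, v) for all v as termination frequencies of
    alpha-decayed random walks starting from s."""
    hits = {}
    for _ in range(num_walks):
        v = s
        while random.random() > alpha and graph.get(v):  # stop w.p. alpha
            v = random.choice(graph[v])                  # follow a random out-edge
        hits[v] = hits.get(v, 0) + 1
    return {v: c / num_walks for v, c in hits.items()}

# Toy directed graph as an adjacency list.
graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["s"]}
ppr = monte_carlo_ppr(graph, "s")
print(sorted(ppr.items(), key=lambda kv: -kv[1]))
```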
Citations: 19
Precision Interfaces for Different Modalities
Pub Date: 2018-05-27 DOI: 10.1145/3183713.3193570
Haoci Zhang, Viraj Raj, Thibault Sellam, Eugene Wu
Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system that generates task-specific data analytics interfaces. Precision Interfaces can turn a log of executed programs into an interface by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but the approach generalizes to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into each other. Then, it finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each component of Precision Interfaces, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for mining data-driven interactive interfaces from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural language analysis interfaces for commonly performed analyses.
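A hedged sketch of the micro-variation idea (not PILang or the actual system): diff consecutive logged queries token by token, and treat a single-token numeric change as a candidate to map to a slider component.

```python
import difflib

def micro_variation(q1, q2):
    """Report a widget candidate if the two queries differ in exactly one token."""
    t1, t2 = q1.split(), q2.split()
    diff = [(op, t1[i1:i2], t2[j1:j2])
            for op, i1, i2, j1, j2
            in difflib.SequenceMatcher(None, t1, t2).get_opcodes()
            if op != "equal"]
    if len(diff) == 1 and diff[0][0] == "replace":
        _, old, new = diff[0]
        if len(old) == len(new) == 1 and new[0].replace(".", "").isdigit():
            return f"slider candidate: {old[0]} -> {new[0]}"
    return "no single-widget mapping"

log = ["SELECT * FROM taxis WHERE fare > 10",
       "SELECT * FROM taxis WHERE fare > 25"]
print(micro_variation(log[0], log[1]))   # slider candidate: 10 -> 25
```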
Citations: 8