
Latest publications from the 2014 IEEE 30th International Conference on Data Engineering

Rethinking main memory OLTP recovery
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816685
Nirmesh Malviya, Ariel Weisberg, S. Madden, M. Stonebraker
Fine-grained, record-oriented write-ahead logging, as exemplified by systems like ARIES, has been the gold standard for relational database recovery. In this paper, we show that in modern high-throughput transaction processing systems, this is no longer the optimal way to recover a database system. In particular, as transaction throughputs get higher, ARIES-style logging starts to represent a non-trivial fraction of the overall transaction execution time. We propose a lighter-weight, coarse-grained command logging technique which only records the transactions that were executed on the database. It then does recovery by starting from a transactionally consistent checkpoint and replaying the commands in the log as if they were new transactions. By avoiding the overhead of fine-grained logging of before and after images (both the CPU complexity and the substantial associated I/O), command logging can yield significantly higher throughput at run-time. Recovery times for command logging are higher compared to an ARIES-style physiological logging approach, but with the advent of high-availability techniques that can mask the outage of a recovering node, recovery speeds have become secondary in importance to run-time performance for most applications. We evaluated our approach on an implementation of TPC-C in a main memory database system (VoltDB), and found that command logging can offer 1.5x higher throughput than a main-memory optimized implementation of ARIES-style physiological logging.
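For intuition, here is a minimal sketch of the command-logging idea: the log records only which stored procedure ran with which arguments, and recovery replays those commands on top of the last transactionally consistent checkpoint. This is a toy in-memory store written for illustration, not VoltDB's implementation; the class and procedure names are made up.

```python
import json

class CommandLoggedStore:
    """Toy in-memory store with coarse-grained command logging (illustrative only)."""

    def __init__(self):
        self.data = {}          # in-memory state
        self.command_log = []   # one entry per executed transaction, not per record

    # transactions are registered procedures, so replay is deterministic
    def transfer(self, src, dst, amount):
        self.data[src] = self.data.get(src, 0) - amount
        self.data[dst] = self.data.get(dst, 0) + amount

    def execute(self, proc_name, *args):
        getattr(self, proc_name)(*args)                  # run the transaction
        self.command_log.append(json.dumps({"proc": proc_name, "args": args}))

    def checkpoint(self):
        snapshot = dict(self.data)                       # transactionally consistent copy
        self.command_log = []                            # earlier log entries are no longer needed
        return snapshot

    def recover(self, snapshot, log):
        self.data = dict(snapshot)                       # start from the checkpoint ...
        for entry in log:                                # ... and replay commands as new transactions
            cmd = json.loads(entry)
            getattr(self, cmd["proc"])(*cmd["args"])

store = CommandLoggedStore()
store.execute("transfer", "a", "b", 10)
snap = store.checkpoint()
store.execute("transfer", "b", "c", 5)
crash_log = list(store.command_log)

replica = CommandLoggedStore()
replica.recover(snap, crash_log)
print(replica.data)   # {'a': -10, 'b': 5, 'c': 5}
```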
Citations: 135
Contract & Expand: I/O Efficient SCCs Computing
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816652
Zhiwei Zhang, Lu Qin, J. Yu
As an important branch of big data processing, big graph processing has become increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, by assuming that all nodes of a graph G can fit in the main memory. However, many real graphs are so large that even their nodes cannot reside entirely in the main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in the main memory. To overcome the deficiency of the existing external graph contraction based approach, which usually cannot stop in finite iterations, and of the external DFS based approach, which generates a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction based approach does, we only contract the nodes of a graph, which is much more selective. The contraction phase stops when all nodes of the graph can fit in the main memory, such that the semi-external algorithm can be used in SCC computation. In the graph expansion phase, the graph is expanded in the reverse order in which it was contracted, and the SCCs of all nodes in the graph are computed. Both the graph contraction phase and the graph expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. We further optimize our approach by reducing the size of nodes and edges of the contracted graph in each iteration. We conduct extensive experimental studies using both real and synthetic web-scale graphs to confirm the I/O efficiency of our approaches.
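To make the SCC-to-DAG contraction concrete, here is a small in-memory sketch using Kosaraju's two-pass algorithm; the paper's contribution is an external, I/O-efficient contraction-expansion algorithm, which this sketch does not attempt to reproduce, and all names below are invented for illustration.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's two-pass SCC algorithm plus contraction into a DAG (in-memory sketch)."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    seen = set()

    def dfs(graph, start, out):
        # iterative depth-first search; appends nodes to `out` in finish order
        stack = [(start, iter(graph[start]))]
        seen.add(start)
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)

    # pass 1: finish order on the original graph
    order = []
    for n in nodes:
        if n not in seen:
            dfs(adj, n, order)

    # pass 2: DFS on the reversed graph in decreasing finish order; each tree is one SCC
    seen = set()
    comp_of, sccs = {}, []
    for start in reversed(order):
        if start in seen:
            continue
        members = []
        dfs(radj, start, members)
        comp_id = len(sccs)
        for m in members:
            comp_of[m] = comp_id
        sccs.append(members)

    # contraction: every SCC becomes a single node of the resulting DAG
    dag_edges = {(comp_of[u], comp_of[v]) for u, v in edges if comp_of[u] != comp_of[v]}
    return sccs, dag_edges

# toy usage: {1, 2} and {3, 4} are the two SCCs, so the contracted DAG has one edge
sccs, dag = strongly_connected_components([1, 2, 3, 4],
                                          [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)])
print(sccs, dag)   # [[2, 1], [4, 3]] {(0, 1)}
```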
Citations: 5
Query optimization of distributed pattern matching
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816640
Jiewen Huang, K. Venkatraman, D. Abadi
Greedy algorithms for subgraph pattern matching operations are often sufficient when the graph data set can be held in memory on a single machine. However, as graph data sets increasingly expand and require external storage and partitioning across a cluster of machines, more sophisticated query optimization techniques become critical to avoid explosions in query latency. In this paper, we introduce several query optimization techniques for distributed graph pattern matching. These techniques include (1) a System-R style dynamic programming-based optimization algorithm that considers both linear and bushy plans, (2) a cycle detection-based algorithm that leverages cycles to reduce intermediate result set sizes, and (3) a computation reusing technique that eliminates redundant query execution and data transfer over the network. Experimental results show that these algorithms can lead to an order of magnitude improvement in query performance.
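As a rough illustration of the System-R style dynamic programming mentioned above, the sketch below enumerates plans over subsets of relations and allows arbitrary (bushy) splits of each subset. The cost model, relation names, and selectivities are made up for illustration; this is not the paper's distributed optimizer.

```python
from itertools import combinations

def optimal_bushy_plan(cardinalities, join_selectivity):
    """Dynamic programming over subsets of relations (System-R style, toy cost model).

    cardinalities: {relation_name: row_count}
    join_selectivity(left_set, right_set): estimated selectivity of joining the two sets.
    Assumed cost model: each join costs the size of its output, and a plan's cost is the
    sum of all intermediate result sizes.
    """
    rels = sorted(cardinalities)
    best = {}   # frozenset of relations -> (cost, output_size, plan_tree)
    for r in rels:
        best[frozenset([r])] = (0, cardinalities[r], r)

    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for left_size in range(1, size):                     # bushy: any split, not only linear
                for left in map(frozenset, combinations(subset, left_size)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    lcost, lrows, lplan = best[left]
                    rcost, rrows, rplan = best[right]
                    rows = lrows * rrows * join_selectivity(left, right)
                    cost = lcost + rcost + rows
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, rows, (lplan, rplan))
    return best[frozenset(rels)]

# toy usage: three relations, every join keeps 1% of the cross product
cost, rows, plan = optimal_bushy_plan({"R": 1000, "S": 100, "T": 10}, lambda l, r: 0.01)
print(cost, plan)   # roughly 110.0, with the bushy-capable plan ('R', ('S', 'T'))
```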
Citations: 38
Leveraging metadata for identifying local, robust multi-variate temporal (RMT) features
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816667
Xiaolan Wang, K. Candan, M. Sapino
Many applications generate and/or consume multi-variate temporal data, yet experts often lack the means to adequately and systematically search for and interpret multi-variate observations. In this paper, we first observe that multi-variate time series often carry localized multi-variate temporal features that are robust against noise. We then argue that these multi-variate temporal features can be extracted by simultaneously considering, at multiple scales, temporal characteristics of the time-series along with external knowledge, including variate relationships, known a priori. Relying on these observations, we develop algorithms to detect robust multi-variate temporal (RMT) features which can be indexed for efficient and accurate retrieval and can be used for supporting analysis tasks, such as classification. Experiments confirm that the proposed RMT algorithm is highly effective and efficient in identifying robust multi-scale temporal features of multi-variate time series.
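As a toy illustration of one ingredient of the approach, namely extracting localized descriptors of a multi-variate series at several temporal scales, the sketch below computes per-variate sliding-window statistics for several window lengths. It is not the RMT algorithm and ignores the external variate-relationship metadata the authors leverage; all names are assumptions.

```python
import numpy as np

def multiscale_window_features(series, scales):
    """Localized descriptors of a multi-variate series at several temporal scales.

    A toy illustration only (not the RMT algorithm): for every scale w and every window
    start t, the descriptor is the per-variate mean and standard deviation over
    series[t:t+w]. `series` is an (n_steps, n_variates) array.
    """
    series = np.asarray(series, dtype=float)
    features = []
    for w in scales:
        for t in range(series.shape[0] - w + 1):
            window = series[t:t + w]
            descriptor = np.concatenate([window.mean(axis=0), window.std(axis=0)])
            features.append({"scale": w, "start": t, "descriptor": descriptor})
    return features

# toy usage: a 2-variate series, descriptors at scales 4 and 8
rng = np.random.default_rng(0)
data = rng.normal(size=(32, 2))
feats = multiscale_window_features(data, scales=[4, 8])
print(len(feats), feats[0]["descriptor"].shape)   # 54 windows in total, 4-dimensional descriptors
```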
Citations: 11
DBDesigner: A customizable physical design tool for Vertica Analytic Database
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816725
R. Varadarajan, V. Bharathan, A. Cary, J. Dave, Sreenath Bodagala
In this paper, we present Vertica's customizable physical design tool, called the DBDesigner (DBD), that produces designs optimized for various scenarios and applications. For a given workload and space budget, DBD automatically recommends a physical design that optimizes query performance, storage footprint, fault tolerance and recovery to meet different customer requirements. Vertica is a distributed, massively parallel columnar database that physically organizes data into projections. Projections are attribute subsets from one or more tables, with tuples sorted by one or more attributes, that are replicated or segmented (distributed) across cluster nodes. The key challenges involved in projection design are picking appropriate column sets, sort orders, cluster data distributions and column encodings. To achieve the desired trade-off between query performance and storage footprint, DBD operates under three different design policies: (a) load-optimized, (b) query-optimized or (c) balanced. These policies indirectly control the number of projections proposed and queries optimized to achieve the desired balance. To cater to query workloads that evolve over time, DBD also operates in a comprehensive and incremental design mode. In addition, DBD lets users override specific features of projection design based on their intimate knowledge of the data and query workloads. We present the complete physical design algorithm, describing in detail how projection candidates are efficiently explored and evaluated using the optimizer's cost and benefit model. Our experimental results show that DBD produces good physical designs that satisfy a variety of customer use cases.
Citations: 20
SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816698
Shiyu Yang, M. A. Cheema, Xuemin Lin, Ying Zhang
Given a set of facilities and a set of users, a reverse k nearest neighbors (RkNN) query q returns every user for which the query facility is one of the k closest facilities. Due to its importance, the RkNN query has received significant research attention in the past few years. Almost all of the existing techniques adopt a pruning-and-verification framework. Regions-based pruning and half-space pruning are the two most notable pruning strategies. The half-space based approach prunes a larger area and is generally believed to be superior. Influenced by this perception, almost all existing RkNN algorithms utilize and improve the half-space pruning strategy. We observe the weaknesses and strengths of both strategies and discover that regions-based pruning has certain strengths that have not been exploited in the past. Motivated by this, we present a new RkNN algorithm called SLICE that utilizes the strength of regions-based pruning and overcomes its limitations. Our extensive experimental study on synthetic and real data sets demonstrates that SLICE is significantly more efficient than the existing algorithms. We also provide a detailed theoretical analysis of various aspects of our algorithm, such as the I/O cost, the unpruned area, and the cost of its verification phase. The experimental study validates our theoretical analysis.
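The RkNN definition above translates directly into a brute-force baseline: for each user, rank the facilities by distance and report the user if the query facility falls within the top k. The sketch below is that naive baseline (handy for verifying results on small inputs), not the SLICE pruning algorithm; the point coordinates are made up.

```python
from math import dist   # Python 3.8+

def reverse_knn(query_facility, facilities, users, k):
    """Brute-force RkNN: users for which query_facility is one of their k closest facilities."""
    result = []
    for user in users:
        # rank all facilities (including the query facility) by distance to this user
        ranked = sorted(facilities + [query_facility], key=lambda f: dist(f, user))
        if query_facility in ranked[:k]:
            result.append(user)
    return result

# toy usage: 2-D points
facilities = [(0, 0), (10, 0), (0, 10)]
users = [(1, 1), (9, 1), (5, 5)]
print(reverse_knn((1, 1), facilities, users, k=1))   # [(1, 1), (5, 5)]
```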
Citations: 37
Keyword-based correlated network computation over large social media
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816657
Jianxin Li, Chengfei Liu, Md. Saiful Islam
Recent years have witnessed an unprecedented proliferation of social media, e.g., millions of blog posts, micro-blog posts, and social networks on the Internet. This kind of social media data can be modeled as a large graph in which nodes represent entities and edges represent relationships between entities. Discovering keyword-based correlated networks in such large graphs is an important primitive in data analysis, as it lets users focus on the information they care about within the large graph. In this paper, we propose and define the problem of keyword-based correlated network computation over a massive graph. To this end, we first present a novel tree data structure that maintains only the shortest path between any two graph nodes, by which the massive graph can be equivalently transformed into a tree data structure for addressing the proposed problem. We then design efficient algorithms to build the transformed tree data structure from a graph offline, and to compute the γ-bounded keyword-matched subgraphs on the fly based on the pre-built tree data structure. To further improve efficiency, we propose weighted shingle-based approximation approaches to measure the correlation among a large number of γ-bounded keyword-matched subgraphs. Finally, we develop a merge-sort based approach to efficiently generate the correlated networks. Our extensive experiments demonstrate the efficiency of our algorithms in reducing time and space costs. The experimental results also confirm the effectiveness of our method in discovering correlated networks from three real datasets.
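One standard reading of "shingle-based approximation" is to summarize each subgraph by a min-hash signature of its shingles (here, its edge set) so that pairwise Jaccard similarity can be estimated without comparing whole subgraphs. The sketch below shows that generic, unweighted idea only; the paper's weighted variant and its use inside correlation computation are not reproduced, and the node names are invented.

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """Min-hash signature of a set; Jaccard similarity is estimated by signature agreement."""
    signature = []
    for i in range(num_hashes):
        # each salted SHA-1 acts as one hash function; keep the minimum over the set
        best = min(int(hashlib.sha1(f"{i}|{item}".encode()).hexdigest(), 16) for item in items)
        signature.append(best)
    return signature

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# toy usage: two keyword-matched subgraphs summarized by their edge sets
edges_a = {("u1", "u2"), ("u2", "u3"), ("u3", "u4")}
edges_b = {("u1", "u2"), ("u2", "u3"), ("u3", "u5")}
sig_a, sig_b = minhash_signature(edges_a), minhash_signature(edges_b)
print(estimated_jaccard(sig_a, sig_b))   # close to the true Jaccard of 2/4 = 0.5
```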
Citations: 16
Finding common ground among experts' opinions on data clustering: With applications in malware analysis
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816636
Guanhua Yan
Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource-demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects to be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects, if they are selected, are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major anti-virus (AV) products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between clustering quality and the number of data objects chosen to be clustered.
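An independent set in a hypergraph is a vertex set that contains no hyperedge entirely; in the setting above, each 3-vertex hyperedge would encode a conflict among experts' opinions. The sketch below is only a generic greedy heuristic for that notion, with hypothetical object names; the paper's own lightweight technique is different.

```python
def greedy_hypergraph_independent_set(vertices, hyperedges):
    """Greedy independent set in a hypergraph: the selected set may not contain any whole hyperedge.

    `hyperedges` is an iterable of vertex triples (the 3-uniform case); the heuristic adds
    vertices one by one, skipping any vertex that would complete a conflicting hyperedge.
    This is a generic baseline, not the paper's algorithm.
    """
    selected = set()
    edges = [frozenset(e) for e in hyperedges]
    # process low-degree vertices first: they appear in fewer conflicts
    degree = {v: sum(v in e for e in edges) for v in vertices}
    for v in sorted(vertices, key=lambda v: degree[v]):
        if all(not e.issubset(selected | {v}) for e in edges):
            selected.add(v)
    return selected

# toy usage: objects {a, b, c, d}; grouping all of (a, b, c) together is a known conflict
print(greedy_hypergraph_independent_set(["a", "b", "c", "d"], [("a", "b", "c")]))
# e.g. {'a', 'b', 'd'}: 'c' is dropped so the conflicting triple is never fully selected
```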
Citations: 5
Pay-as-you-go reconciliation in schema matching networks
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816653
Nguyen Quoc Viet Hung, T. Nguyen, Z. Miklós, K. Aberer, A. Gal, M. Weidlich
Schema matching is the process of establishing correspondences between the attributes of database schemas for data integration purposes. Although several automatic schema matching tools have been developed, their results are often incomplete or erroneous. To obtain a correct set of correspondences, a human expert is usually required to validate the generated correspondences. We analyze this reconciliation process in a setting where a number of schemas need to be matched, in the presence of consistency expectations about the network of attribute correspondences. We develop a probabilistic model that helps to identify the most uncertain correspondences, thus allowing us to guide the expert's work and collect their input about the most problematic cases. As the availability of such experts is often limited, we develop techniques that can construct a set of good-quality correspondences with high probability, even if the expert does not validate all the necessary correspondences. We demonstrate the efficiency of our techniques through extensive experimentation using real-world datasets.
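A common way to "identify the most uncertain correspondences" is uncertainty sampling: score each candidate correspondence by the entropy of its estimated probability of being correct and hand the highest-entropy ones to the expert first. The sketch below shows that generic idea with made-up attribute names and probabilities; it does not model the network-level consistency constraints the paper exploits.

```python
from math import log2

def most_uncertain(correspondences, budget):
    """Rank candidate correspondences by binary entropy of their correctness probability.

    correspondences: {("schema1.attr", "schema2.attr"): probability_of_being_correct}
    Returns the `budget` correspondences whose probabilities are closest to 0.5, i.e.,
    the ones where an expert's answer is most informative.
    """
    def entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * log2(p) + (1 - p) * log2(1 - p))

    ranked = sorted(correspondences, key=lambda c: entropy(correspondences[c]), reverse=True)
    return ranked[:budget]

# toy usage: three candidate attribute correspondences with estimated probabilities
candidates = {
    ("orders.cust_id", "clients.id"): 0.95,
    ("orders.date", "clients.joined"): 0.50,
    ("orders.total", "clients.balance"): 0.65,
}
print(most_uncertain(candidates, budget=2))
# [('orders.date', 'clients.joined'), ('orders.total', 'clients.balance')]
```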
Citations: 48
ADaPT: Automatic Data Personalization based on contextual preferences
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816749
A. Miele, E. Quintarelli, Emanuele Rabosio, L. Tanca
This demo presents a framework for personalizing data access on the basis of the users' context and the preferences they show while in that context. The system is composed of (i) a server application, which “tailors” a view over the available data on the basis of the user's contextual preferences, previously inferred from log data, and (ii) a client application running on the user's mobile device, which lets the user query the data view and collects the activity log for later mining. At each change of context detected by the system, the corresponding tailored view is loaded on the client device; accordingly, the most relevant data remains available to the user even when the connection is unstable or absent. The demo features a movie database, where users can browse data in different contexts and see how the data views are personalized according to the inferred contextual preferences.
Citations: 6