SUSIE: Search using services and information extraction
N. Preda, Fabian M. Suchanek, Wenjun Yuan, G. Weikum
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544827
The API of a Web service restricts the types of queries that the service can answer. For example, a Web service might provide a method that returns the songs of a given singer, but not one that returns the singers of a given song. If the user asks for the singer of a specific song, the Web service cannot be called, even though the underlying database might hold the desired piece of information. This asymmetry is particularly problematic if the service is used in a Web service orchestration system. In this paper, we propose to use on-the-fly information extraction to collect values that can serve as parameter bindings for the Web service. We show how this idea can be integrated into a Web service orchestration system. Our approach is fully implemented in a prototype called SUSIE. We present experiments with real-life data and services that demonstrate the practical viability and good performance of our approach.
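The core of this idea can be sketched in a few lines. Everything named here is a placeholder: `extract_candidates` stands in for an on-the-fly information extractor over web text, and `get_songs` for the forward-only service method from the abstract's example; neither name comes from the paper.

```python
def answer_inverse_query(song, extract_candidates, get_songs):
    """Sketch of inverting an asymmetric service API.

    The service only offers get_songs(singer), but the user asks for the
    singer of `song`. Extract candidate singers from web text, then use
    each candidate as a parameter binding for the forward service call
    and keep only the candidates the service confirms.
    """
    answers = []
    for singer in extract_candidates(song):
        if song in get_songs(singer):  # validate the extracted binding
            answers.append(singer)
    return answers
```

The validation step matters: extraction alone may return noisy candidates, but the service call filters them against the curated database.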
Time travel in a scientific array database
Emad Soroush, M. Balazinska
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544817
In this paper, we present TimeArr, a new storage manager for an array database. TimeArr supports the creation of a sequence of versions of each stored array and their exploration through two types of time-travel operations: selection of a specific version of a (sub)array, and a more general extraction of a (sub)array history in the form of a series of (sub)array versions. TimeArr contributes a combination of array-specific storage techniques to efficiently support these operations. To speed up array exploration, TimeArr further introduces two additional techniques. The first is the notion of approximate time travel, with two types of operations: approximate version selection and approximate history. For these operations, users can tune the degree of approximation tolerable and thus trade off accuracy and performance in a principled manner. The second is to lazily create short connections, called skip links, between the same (sub)arrays at different versions with similar data patterns, to speed up the selection of a specific version. We implement TimeArr within the SciDB array processing engine and demonstrate its performance through experiments on two real datasets from the astronomy and earth sciences domains.
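As a rough illustration of version selection (not TimeArr's actual storage format), the following toy keeps the newest version in full and older versions as backward deltas; `select` replays deltas from the head, which is exactly the walk that skip links are designed to shorten.

```python
class VersionedArray:
    """Toy backward-delta version store for a sparse array.

    The latest version is stored fully; each older version is kept as a
    delta recording only the cells that changed. Selecting version v
    replays deltas backwards from the head.
    """

    def __init__(self, initial):
        self.head = dict(initial)  # full copy of the newest version
        self.deltas = []           # deltas[i] undoes version i+1 -> i

    def commit(self, updates):
        # record the old values of the cells we are about to overwrite
        delta = {cell: self.head.get(cell) for cell in updates}
        self.deltas.append(delta)
        self.head.update(updates)

    def select(self, version):
        """Materialize a past version (0 = initial, len(deltas) = newest)."""
        arr = dict(self.head)
        for delta in reversed(self.deltas[version:]):
            for cell, old in delta.items():
                if old is None:
                    arr.pop(cell, None)  # cell did not exist back then
                else:
                    arr[cell] = old
        return arr
```

Selecting an old version costs one delta replay per intervening commit, which is why a skip link between similar distant versions can pay off.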
Engineering Generalized Shortest Path queries
Michael N. Rice, V. Tsotras
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544888
Generalized Shortest Path (GSP) queries represent a variant of constrained shortest path queries in which a solution path of minimum total cost must visit at least one location from each of a set of specified location categories (e.g., gas stations, grocery stores) in a specified order. This problem type has many practical applications in logistics and personalized location-based services, and is closely related to the NP-hard Generalized Traveling Salesman Path Problem (GTSPP). In this work, we present a new dynamic programming formulation to highlight the structure of this problem. Using this formulation as our foundation, we progressively engineer a fast and scalable GSP query algorithm for use on large, real-world road networks. Our approach incorporates concepts from Contraction Hierarchies, a well-known graph indexing technique for static shortest path queries. To demonstrate the practicality of our algorithm, we experimented on the North American road network (with over 50 million edges), where we achieved up to several orders of magnitude speed improvement over the previous-best algorithm, depending on the relative sizes of the location categories.
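A minimal sketch of the dynamic-programming structure, assuming a plain adjacency-list graph and categories given as node sets (both illustrative): the state is (node, stage), where stage counts how many of the ordered categories the path has satisfied so far. This shows only the layered DP, not the Contraction-Hierarchy engineering the paper builds on top of it.

```python
import heapq

def generalized_shortest_path(graph, source, target, categories):
    """Layered Dijkstra over states (node, stage).

    graph: {u: [(v, weight), ...]}; categories: ordered list of node
    sets that must be visited in sequence. Returns the minimum cost of
    a source-target path satisfying all categories, or None.
    """
    k = len(categories)
    start_stage = 1 if source in categories[0] else 0
    dist = {}
    pq = [(0, source, start_stage)]
    while pq:
        d, u, stage = heapq.heappop(pq)
        if (u, stage) in dist:
            continue
        dist[(u, stage)] = d
        if u == target and stage == k:
            return d
        for v, w in graph.get(u, []):
            nxt = stage
            if nxt < k and v in categories[nxt]:
                nxt += 1  # v satisfies the next required category
            if (v, nxt) not in dist:
                heapq.heappush(pq, (d + w, v, nxt))
    return None
```

The state space is |V| × (k + 1), so the DP stays polynomial for a fixed category sequence even though the general GTSPP is NP-hard.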
Efficient search algorithm for SimRank
Y. Fujiwara, M. Nakatsuji, Hiroaki Shiokawa, Makoto Onizuka
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544858
Graphs are a fundamental data structure and have been employed to model objects as well as their relationships. The similarity of objects on the web (e.g., webpages, photos, music, micro-blogs, and social networking service users) is the key to identifying relevant objects in many recent applications. SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many applications such as web spam detection, collaborative tagging analysis, and link prediction. SimRank computes similarities iteratively, and it needs O(N^4 T) time and O(N^2) space for similarity computation, where N and T are the number of nodes and iterations, respectively. Unfortunately, this iterative approach is computationally expensive. The goal of this work is to process top-k search and range search efficiently for a given node. Our solution, SimMat, is based on two ideas: (1) it computes the approximate similarity of a selected node pair efficiently in a non-iterative style based on the Sylvester equation, and (2) it prunes unnecessary approximate similarity computations when searching for the high-similarity nodes by exploiting estimations based on the Cauchy-Schwarz inequality. These two ideas reduce the time and space complexities of the proposed approach to O(Nn), where n is the target rank of the low-rank approximation (n ≪ N in practice). Our experiments show that our approach is faster, by several orders of magnitude, than previous approaches in finding the high-similarity nodes.
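For reference, the naive iteration that the O(N^4 T) time and O(N^2) space figures refer to can be written directly from the SimRank definition:

```python
def simrank(adj, C=0.8, iterations=5):
    """Naive iterative SimRank (Jeh & Widom) on a directed graph.

    adj[i][j] is True iff there is an edge i -> j. Each iteration
    visits every node pair and, for each pair, every pair of their
    in-neighbors, giving O(N^4) time per iteration and O(N^2) space
    for the similarity matrix.
    """
    n = len(adj)
    in_nbrs = [[i for i in range(n) if adj[i][j]] for j in range(n)]
    S = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iterations):
        S_new = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue
                total = sum(S[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                S_new[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = S_new
    return S
```

The quartic cost of these nested loops is precisely what SimMat's non-iterative, low-rank approximation avoids.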
Crowd-answering system via microblogging
Xianke Zhou, Ke Chen, Sai Wu, Bingbing Zhang
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544925
Most crowdsourcing systems leverage public platforms, such as Amazon Mechanical Turk (AMT), to publish their jobs and collect the results. They are charged for using the platform's service, and they are also required to pay the workers for each successful job. Although the average wage of an online human worker is not high, for a 24×7 running service the crowdsourcing system becomes very expensive to maintain. We observe that there are, in fact, many sources that can provide free online human volunteers, and microblogging systems are among the most promising. In this paper, we present our CrowdAnswer system, which is built on top of Weibo, the largest microblogging system in China. CrowdAnswer is a question-answering system that distributes various questions to different groups of microblogging users adaptively. The answers are then collected from those users' tweets and visualized for the question originator. CrowdAnswer maintains a virtual credit system: users need credits to publish questions and can gain credits by answering questions. A novel algorithm is proposed to route the questions to interested users, aiming to maximize the probability that a question is successfully answered.
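The abstract does not spell out the routing algorithm, so the following is only a deliberately naive stand-in for the idea of routing to interested users: user profiles with hypothetical `interests` and `answer_rate` fields are scored by topical overlap weighted by responsiveness.

```python
def route_question(question_tags, users, k=3):
    """Pick the k users most likely to answer a tagged question.

    Scores each user by the overlap between the question's tags and the
    user's interest set, weighted by how often that user answers. All
    field names are illustrative placeholders, not CrowdAnswer's schema.
    """
    def score(user):
        return len(question_tags & user["interests"]) * user["answer_rate"]

    return [u["name"] for u in sorted(users, key=score, reverse=True)[:k]]
```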
T-Music: A melody composer based on frequent pattern mining
Cheng Long, R. C. Wong, R. W. Sze
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544937
There is a substantial body of studies on algorithms for composing the melody of a song automatically, which is known as algorithmic composition. To the best of our knowledge, none of them takes the lyrics into consideration for melody composition. However, according to some recent studies, within a song there usually exists a certain degree of correlation between its melody and its lyrics. In this demonstration, we propose to utilize this type of correlation information for melody composition. Based on this idea, we design a new melody composition algorithm and develop a melody composer called T-Music that employs this algorithm.
Top-k graph pattern matching over large graphs
Jiefeng Cheng, Xianggang Zeng, J. Yu
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544895
There exist many graph-based applications, including bioinformatics, social science, link analysis, citation analysis, and collaborative work, and all need to deal with a large data graph. Given a large data graph, in this paper we study finding top-k answers for a graph pattern query (kGPM); in particular, we focus on top-k cyclic graph queries, where a graph query is cyclic and can be complex. The capability of supporting kGPM provides much more flexibility for a user to search graphs, and the problem itself is challenging. In this paper, we propose a new framework for processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kGPM query. This leads to a query optimization approach for kGPM processing, and a top-k algorithm to process kGPM with the optimal query plan. We conducted extensive performance studies using a synthetic dataset and a real dataset, and we confirm the efficiency of our proposed approach.
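The lazy consumption of ranked lists can be sketched as a k-way merge that stops after k answers. This captures only the stop-early idea behind the framework, not the paper's cost model or its multidimensional representation of the lists.

```python
import heapq
import itertools

def topk_answers(ranked_lists, k):
    """Merge several ranked answer lists lazily and emit the k cheapest.

    Each list yields (cost, answer) pairs in non-decreasing cost order,
    so heapq.merge can pop globally cheapest answers without exhausting
    any single list.
    """
    merged = heapq.merge(*ranked_lists)
    return [answer for _, answer in itertools.islice(merged, k)]
```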
Materialization strategies in the Vertica analytic database: Lessons learned
Lakshmikant Shrinivas, Sreenath Bodagala, R. Varadarajan, A. Cary, V. Bharathan, Chuck Bear
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544909
Column-store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution to get the best of both worlds. We demonstrate how early materialization combined with sideways information passing allows us to get the benefits of late materialization without the bookkeeping complexity or worse performance for spilling joins. It also provides some other benefits to query processing in Vertica due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).
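On a single table, the difference between the two strategies can be shown in toy form (the hard cases in the paper involve spilling joins, which this sketch does not cover): early materialization stitches whole rows before filtering, while late materialization filters one column into row positions and fetches payload columns only for the survivors.

```python
def select_early(rows, filter_idx, pred, project_idxs):
    """Early materialization: full rows are built first, then filtered."""
    return [tuple(r[i] for i in project_idxs)
            for r in rows if pred(r[filter_idx])]

def select_late(columns, filter_col, pred, project_cols):
    """Late materialization: the filter scans a single column and yields
    row positions; other columns are touched only for surviving rows."""
    positions = [i for i, v in enumerate(columns[filter_col]) if pred(v)]
    return [tuple(columns[c][i] for c in project_cols) for i in positions]
```

With a selective predicate, the late variant reads far less payload data, which is the usual advantage the abstract refers to; the position lists are also what sideways information passing ships between operators.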
ExpFinder: Finding experts by graph pattern matching
W. Fan, Xin Wang, Yinghui Wu
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544933
We present ExpFinder, a system for finding experts in social networks based on graph pattern matching. We demonstrate (1) how ExpFinder identifies top-K experts in a social network by supporting bounded simulation of graph patterns and by ranking the matches based on a metric for social impact; (2) how it copes with the sheer size of real-life social graphs by supporting incremental query evaluation and query-preserving graph compression; and (3) how the GUI of ExpFinder interacts with users to help them construct queries and inspect matches.
On shortest unique substring queries
J. Pei, W. Wu, Mi-Yen Yeh
2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544887
In this paper, we tackle a novel type of query: shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest unique substring queries have many potential applications, such as information retrieval, bioinformatics, and event context analysis. We develop efficient algorithms for online query answering. First, we present an algorithm to answer a shortest unique substring query in O(n) time using a suffix tree index, where n is the length of string S. Second, we show that, using O(n·h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is theoretically in O(n) but on real data sets is often much smaller than n and can be treated as a constant. Once the shortest unique substrings are pre-computed, shortest unique substring queries can be answered online in constant time. In addition to these solid algorithmic results, we empirically demonstrate the effectiveness and efficiency of shortest unique substring queries on real data sets.
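To make the query semantics concrete, here is a brute-force sketch: try window lengths in increasing order over windows covering position q, so the first unique hit is a shortest answer. This is roughly O(n^3) and exists only as a reference semantics; the paper answers the same query in O(n) with a suffix tree index.

```python
def count_occurrences(s, sub):
    """Count overlapping occurrences of sub in s."""
    count, start = 0, 0
    while (i := s.find(sub, start)) != -1:
        count += 1
        start = i + 1
    return count

def shortest_unique_substring(s, q):
    """Return a shortest substring of s that covers index q and occurs
    exactly once in s (the query from the paper, answered naively)."""
    n = len(s)
    for length in range(1, n + 1):
        # every window of this length whose span includes index q
        for start in range(max(0, q - length + 1), min(q, n - length) + 1):
            sub = s[start:start + length]
            if count_occurrences(s, sub) == 1:
                return sub
    return None
```

Note that occurrences must be counted with overlaps (hence the helper), since `str.count` would miss overlapping repeats such as "aa" in "aaa".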