
Latest publications from the 2012 IEEE 28th International Conference on Data Engineering

Parametric Plan Caching Using Density-Based Clustering
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.57
Günes Aluç, David DeHaan, Ivan T. Bowman
Query plan caching eliminates the need for repeated query optimization and hence has strong practical implications for relational database management systems (RDBMSs). Unfortunately, existing approaches consider only the query plan generated at the expected values of the parameters that characterize the query, the data, and the current state of the system, whereas these parameters may take different values during the lifetime of a cached plan. A better alternative is to harvest the optimizer's plan choices for different parameter values, populate the cache with promising query plans, and select a cached plan based upon the current parameter values. To address this challenge, we propose a parametric plan caching (PPC) framework that uses an online plan-space clustering algorithm. The clustering algorithm is density-based, and it exploits locality-sensitive hashing as a pre-processing step so that clusters in the plan spaces can be efficiently stored in database histograms and queried in constant time. We experimentally validate that our approach is precise, space- and time-efficient, and adaptive, requiring no eager exploration of the optimizer's plan spaces.
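The locality-sensitive hashing pre-processing step can be illustrated with random-hyperplane LSH, which tends to map nearby parameter-space points into the same bucket so that a density-based clustering pass only has to look within each bucket. This is a minimal sketch under assumed details (hash family, dimensionality, bucket layout), not the paper's implementation:

```python
import random
from collections import defaultdict

def lsh_signature(point, hyperplanes):
    # Random-hyperplane LSH: one sign bit per hyperplane.
    return tuple(1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
                 for plane in hyperplanes)

def bucketize(points, dim, num_planes=4, seed=0):
    # Group parameter-space points by LSH signature; points sharing a bucket
    # become candidates for the same plan-space cluster.
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_planes)]
    buckets = defaultdict(list)
    for p in points:
        buckets[lsh_signature(p, planes)].append(p)
    return buckets
```

Each bucket key is a compact bit signature, which is what makes it feasible to store cluster membership in histogram-like structures and probe it in constant time.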
Citations: 8
Attribute-Based Subsequence Matching and Mining
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.81
Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu
Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's rental record, containing an ordered list of movies, is a sequence. Most studies on sequence analysis focus on subsequence matching, which finds all sequences stored in the database such that a given query sequence is a subsequence of each of them. In many applications, elements are associated with properties or attributes. For example, each movie is associated with attributes like "Director" and "Actors". Unfortunately, to the best of our knowledge, no existing study on sequence analysis considers the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, and taking the attributes of elements into account, find all sequences that are matched by this query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications of the traditional subsequence matching problem also apply to our new problem, provided that we are given the attributes of elements. We propose an efficient algorithm for the ASM problem. The key idea behind its efficiency is to compress each whole sequence, with its potentially many associated attributes, into just a triplet of numbers. By operating on these highly compressed representations, we greatly speed up attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequences. We adapt an existing efficient algorithm for this second problem, showing that the algorithm developed for the first problem can be reused. Empirical studies show that our algorithms are scalable on large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases. This work can benefit a number of existing data mining problems that are fundamentally based on subsequence matching, such as sequence classification, frequent sequence mining, motif detection, and sequence matching in bioinformatics.
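To fix ideas about what attribute-based matching means, here is a naive sketch under an assumed matching rule: a query element matches a sequence element when the names agree and the query element's attributes are a subset of the sequence element's attributes. This illustrates the problem only; it does not reproduce the paper's triplet-compression algorithm, and the exact matching semantics may differ:

```python
def matches(query, seq):
    # Greedy left-to-right scan: advance through the query whenever the
    # current sequence element matches the next query element (same name,
    # query attributes contained in the element's attributes).
    i = 0
    for name, attrs in seq:
        if i < len(query) and query[i][0] == name and query[i][1] <= attrs:
            i += 1
    return i == len(query)
```

For subsequence matching (as opposed to substring matching) the greedy scan suffices, since matched elements need not be contiguous.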
Citations: 6
Trust and Share: Trusted Information Sharing in Online Social Networks
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.127
B. Carminati, E. Ferrari, Jacopo Girardi
At the beginning of the Web 2.0 era, Online Social Networks (OSNs) appeared as just another phenomenon among wikis, blogs, video sharing, and so on. However, they soon became one of the biggest revolutions of the Internet era. Statistics confirm the continuing rise in the importance of social networking sites in terms of the number of users (e.g., Facebook reaches 750 million users, Twitter 200 million, LinkedIn 100 million), the time spent on social networking sites, and the amount of data flowing (e.g., Facebook users interact with about 900 million pieces of data in the form of pages, groups, events, and community pages). This successful trend makes OSNs one of the most promising paradigms for information sharing on the Web.
Citations: 22
Provenance-based Indexing Support in Micro-blog Platforms
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.36
Junjie Yao, B. Cui, Zijun Xue, Qi Liu
Recently, many micro-blog message sharing applications have emerged on the web. Users can publish short messages freely and get notified of their subscriptions instantly. Prominent examples include Twitter, Facebook's statuses, and Sina Weibo in China. The micro-blog platform has become a useful service for real-time information creation and propagation. However, the short length and dynamic character of these messages pose great challenges for effective content understanding. Additionally, noise and fragmentation make it difficult to discover the temporal propagation trail along which micro-blog messages develop. In this paper, we propose a provenance model to capture the connections between micro-blog messages. Provenance refers to data origin identification and transformation logging, which has demonstrated great value in recent database and workflow systems. To cope with the real-time deluge of micro-messages, we utilize a novel message grouping approach to encode and maintain the provenance information. Furthermore, we adopt a summary index and several adaptive pruning strategies to implement efficient provenance updating. Based on this index, our provenance solution can support rich query retrieval and intuitive message tracking for effective message organization. Experiments conducted on real datasets verify the effectiveness and efficiency of our approach.
Citations: 31
Accelerating Range Queries for Brain Simulations
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.56
F. Tauheed, Laurynas Biveinis, T. Heinis, F. Schürmann, H. Markram, A. Ailamaki
Neuroscientists increasingly use computational tools to build and simulate models of the brain. The amounts of data involved in these simulations are immense, and managing this data efficiently is key. One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well even on today's small models, which represent a small fraction of the brain and contain only a few million densely packed spatial elements. The problem with current approaches is that as the level of detail in the models increases, so does the overlap in the tree structure, ultimately slowing down query execution. The neuroscientists' need to work with bigger and more detailed (denser) models thus motivates us to develop a new indexing approach. To this end we develop FLAT, a scalable indexing approach for dense data sets. We base the development of FLAT on the key observation that current approaches suffer from overlap on dense data sets. We hence design FLAT as an approach with two phases, each independent of density. In the first phase it uses a traditional spatial index to retrieve an initial object efficiently. In the second phase it traverses the initial object's neighborhood to retrieve the remaining query results. Our experimental results show that FLAT not only outperforms R-Tree variants by a factor of two to eight, but also achieves independence from data set size and density.
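The second, neighborhood-traversal phase can be sketched as a breadth-first crawl over precomputed neighbor links, starting from the seed object that the first-phase spatial index returned. The link structure and axis-aligned box representation below are assumptions for illustration, not FLAT's actual data layout:

```python
from collections import deque

def range_query(seed, query_box, neighbors, bbox):
    # Crawl neighbor links from the seed; keep every object whose bounding
    # box intersects the query box, and only expand from objects inside the
    # range, so the traversal never leaves the query region.
    def intersects(a, b):
        return all(a[d][0] <= b[d][1] and b[d][0] <= a[d][1]
                   for d in range(len(a)))
    result, seen, q = [], {seed}, deque([seed])
    while q:
        obj = q.popleft()
        if intersects(bbox[obj], query_box):
            result.append(obj)
            for n in neighbors[obj]:
                if n not in seen:
                    seen.add(n)
                    q.append(n)
    return result
```

Because the crawl expands only from objects that intersect the query box, its cost scales with the size of the result rather than with overlap in a tree structure, which is the density-independence property the abstract describes.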
Citations: 49
Efficient Top-k Keyword Search in Graphs with Polynomial Delay
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.124
M. Kargar, Aijun An
A system for efficient keyword search in graphs is demonstrated. The system has two components: a search, restricted to the nodes containing the input keywords, for a set of nodes that are close to each other and together cover the input keywords; and an exploration phase that finds how these nodes are related to each other. The system generates all answers, or the top-k answers, with polynomial delay. Answers are presented to the user according to a ranking criterion, so that answers whose nodes are closer to each other are presented before those whose nodes are farther apart. In addition, the set of answers produced by our system is duplication-free. The system uses two methods for presenting the final answer to the user. These presentation methods reveal the relationships among the nodes in an answer through either a tree or a multi-center graph. We show that each method has its own advantages and disadvantages. The system is demonstrated on two challenging datasets: the very large DBLP and the highly cyclic Mondial. The challenges and difficulties of implementing an efficient keyword search system are also demonstrated.
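A brute-force version of the covering-and-ranking step might look like the following: enumerate one node per keyword and rank each cover by the sum of pairwise shortest-path distances. This exhaustive sketch only fixes ideas about the proximity criterion; the actual system enumerates answers with polynomial delay and its ranking function may differ:

```python
from collections import deque
from itertools import product

def bfs_dist(graph, src):
    # Unweighted shortest-path distances from src via breadth-first search.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def best_cover(graph, keyword_nodes):
    # keyword_nodes[i] lists the nodes containing keyword i; pick one node
    # per keyword minimizing total pairwise distance (closer covers rank
    # first, matching the presentation order described above).
    dists = {u: bfs_dist(graph, u) for nodes in keyword_nodes for u in nodes}
    def score(combo):
        return sum(dists[a].get(b, float("inf"))
                   for i, a in enumerate(combo) for b in combo[i + 1:])
    return min(product(*keyword_nodes), key=score)
```

Note that only keyword-containing nodes are enumerated; how the chosen nodes connect (tree or multi-center graph) is left to the separate exploration phase.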
Citations: 14
On Text Clustering with Side Information
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.111
C. Aggarwal, Yuchen Zhao, Philip S. Yu
Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online settings such as the web, social networks, and other information networks. In most cases, the data is not available purely in text form; a lot of side information accompanies the text documents. Such side information may be of different kinds: the links in a document, user-access behavior from web logs, or other non-textual attributes embedded in the text document. These attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side information may be difficult to estimate, especially when some of it is noisy. In such cases, incorporating side information into the clustering process can be risky, because it can either improve the quality of the representation for clustering or add noise to the process. Therefore, we need a principled way to perform the clustering, so as to maximize the advantage of using this side information. In this paper, we design an algorithm that combines classical partitioning algorithms with probabilistic models to create an effective clustering approach. We present experimental results on a number of real data sets to illustrate the advantages of this approach.
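One simple way to picture combining text with side information is a blended similarity: text cosine plus Jaccard overlap of side-attribute sets, with a weight standing in for how trustworthy the side information is estimated to be. This is only an illustrative blend; the paper learns the trade-off inside a probabilistic model, which this sketch does not reproduce:

```python
import math

def cosine(u, v):
    # Cosine similarity between sparse term-frequency dicts.
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    # Overlap of side-attribute sets (links, access patterns, tags, ...).
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(doc1, doc2, weight=0.3):
    # Each doc is (term_freqs, side_attrs); `weight` models the estimated
    # reliability of the side information (0 ignores it entirely).
    text1, side1 = doc1
    text2, side2 = doc2
    return (1 - weight) * cosine(text1, text2) + weight * jaccard(side1, side2)
```

Setting the weight per attribute type, rather than globally, is one way such a blend could down-weight the noisy side information the abstract warns about.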
Citations: 48
Privacy in Social Networks: How Risky is Your Social Graph?
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.99
C. Akcora, B. Carminati, E. Ferrari
Several efforts have been made to make Online Social Networks (OSNs) more privacy-aware and to protect personal data against various privacy threats. However, despite the relevance of these proposals, we believe there is still a lack of a conceptual model on top of which privacy tools should be designed. Central to this model should be the concept of risk. Therefore, in this paper, we propose a risk measure for OSNs. The aim is to associate a risk level with each social network user, in order to provide other users with a measure of how risky, in terms of disclosure of private information, interactions with that user might be. We compute risk levels based on similarity and benefit measures, while also taking users' risk attitudes into account. In particular, we adopt an active learning approach for risk estimation, in which a user's risk attitude is learned from a few required user interactions. The risk estimation process discussed in this paper has been developed into a Facebook application and tested on real data. The experiments show the effectiveness of our proposal.
Citations: 63
Intuitive Interaction with Encrypted Query Execution in DataStorm
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.140
Kenneth P. Smith, A. Kini, William Wang, Chris Wolf, M. Allen, Andrew Sillers
The encrypted execution of database queries promises powerful security protections; however, users are currently unlikely to benefit from it without significant expertise. In this demonstration, we illustrate a simple workflow enabling users to design secure executions of their queries. The demonstrated DataStorm system simplifies both the design and execution of encrypted execution plans, and represents progress toward the challenge of developing a general planner for encrypted query execution.
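One standard building block behind encrypted query execution is deterministic tokenization, which lets an untrusted server evaluate equality predicates without decrypting anything. A sketch using keyed HMAC digests — this is a generic technique and illustrative code, not DataStorm's actual design:

```python
import hashlib
import hmac

def det_token(key, value):
    # Deterministic keyed digest: equal plaintexts map to equal tokens,
    # so equality predicates can be evaluated over tokens alone.
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def server_select(rows, column, token):
    # Runs on the untrusted server: it compares tokens, never plaintext.
    return [row for row in rows if row[column] == token]

# Client side: tokenize data before upload, and tokenize query constants.
key = b'client-secret-key'
rows = [{'name': det_token(key, 'alice'), 'id': 1},
        {'name': det_token(key, 'bob'), 'id': 2}]
result = server_select(rows, 'name', det_token(key, 'alice'))
```

Deterministic tokens leak equality patterns to the server — exactly the kind of security/performance trade-off a plan-design workflow like the one demonstrated must surface to its users.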
Citations: 2
A Self-Configuring Schema Matching System
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.21
E. Peukert, Julian Eberius, E. Rahm
Mapping complex metadata structures is crucial in a number of domains, such as data integration, ontology alignment, and model management. To speed up the generation of such mappings, automatic matching systems have been developed to compute mapping suggestions that can then be corrected by a user. However, constructing and tuning match strategies still requires substantial manual effort from matching experts, as well as correct mappings against which generated mappings can be evaluated. We therefore propose a self-configuring schema matching system that automatically adapts to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology, and model management domains. The evaluation shows that our system robustly returns good-quality mappings across different mapping problems and domains.
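The paper's self-configuring process assembles whole matching workflows from rules; the snippet below sketches only the most basic ingredient, a name-similarity matcher with a tunable acceptance threshold. The function names and the 0.8 default are illustrative, not the system's actual components.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    # Normalized edit-based similarity between two element names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schemas(source, target, threshold=0.8):
    # Keep, for each source element, its best-scoring target element,
    # but only if the score clears the acceptance threshold.
    mapping = {}
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            mapping[s] = best
    return mapping
```

A self-configuring system would analyze the input schemas (name lengths, token vocabularies, datatypes) and intermediate match results to choose matchers and thresholds automatically, instead of fixing them up front as done here.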
Citations: 49
Journal
2012 IEEE 28th International Conference on Data Engineering