2012 IEEE 28th International Conference on Data Engineering: Latest Publications

Effective Data Density Estimation in Ring-Based P2P Networks
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.19
Minqi Zhou, Heng Tao Shen, Xiaofang Zhou, Weining Qian, Aoying Zhou
Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue that has yet to be well addressed. It can benefit many P2P applications, such as load-balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model, distribution-free data density estimation, for dynamic ring-based P2P networks; it achieves high estimation accuracy at low estimation cost regardless of the distribution model of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea of distribution-free estimation is to sample a small subset of peers in order to estimate the global data distribution over the data domain. We introduce algorithms for computing and sampling the global cumulative distribution function, from which the global data distribution is estimated, together with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
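Inverse-transform sampling, the inversion method this abstract builds on, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the paper's estimator: the sorted array stands in for the global cumulative distribution function that a peer would assemble by sampling other peers, and all names are ours.

```python
import random

def inverse_cdf_sample(sorted_values, k, rng=random):
    """Draw k samples whose distribution follows the empirical CDF."""
    n = len(sorted_values)
    samples = []
    for _ in range(k):
        u = rng.random()                 # u ~ Uniform(0, 1)
        idx = min(int(u * n), n - 1)     # invert the step-function empirical CDF
        samples.append(sorted_values[idx])
    return samples

data = [random.gauss(50, 10) for _ in range(10_000)]   # stand-in "global" data
cdf_support = sorted(data)                             # empirical CDF support
print(inverse_cdf_sample(cdf_support, 5))
```

Because the uniform draws hit every quantile with equal probability, the returned samples follow the underlying distribution whatever its shape, which is the bias-freeness the abstract points to.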
Citations: 4
Iterative Graph Feature Mining for Graph Indexing
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.11
Dayu Yuan, P. Mitra, Huiwen Yu, C. Lee Giles
Subgraph search is a popular query scenario on graph databases. Given a query graph q, a subgraph search algorithm returns all database graphs having q as a subgraph. To implement subgraph search efficiently, subgraph features are mined in order to index the graph database. Many subgraph feature mining approaches have been proposed. They are all "mine-at-once" algorithms in which the whole feature set is mined in one run before building a stable graph index. However, due to changes in the environment (such as an update of the graph database or an increase in available memory), the index needs to be updated to accommodate such changes. Most "mine-at-once" algorithms involve frequent subgraph or subtree mining over the whole graph database. Also, constructing and deploying a new index involves expensive disk operations, so it is inefficient to re-mine the features and rebuild the index from scratch. We observe that, in most cases, it is sufficient to update a small part of the graph index. Here we propose an "iterative subgraph mining" algorithm which iteratively finds one feature to insert into (or remove from) the index. Since the majority of indexing features and the index structure are not changed, the algorithm can be invoked frequently. We define an objective function that guides the feature mining. Next, we propose a basic branch-and-bound algorithm to mine the features. Finally, we design an advanced search algorithm, which quickly finds a near-optimum subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than the popular graph indexing algorithm gIndex, and that features mined by our iterative algorithm have a better filtering rate for the subgraph search problem.
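The iterative flavor of the approach can be caricatured as a greedy loop that changes the index by one feature at a time while an objective improves. The objective below is an invented stand-in for the paper's gain function, and this sketch handles insertions only.

```python
def iterative_feature_update(index_features, candidates, objective):
    """Add one feature per iteration while the objective keeps improving."""
    improved = True
    while improved:
        improved = False
        best_gain, best_f = 0.0, None
        for f in candidates - index_features:
            gain = objective(index_features | {f}) - objective(index_features)
            if gain > best_gain:
                best_gain, best_f = gain, f
        if best_f is not None:
            index_features.add(best_f)   # the index changes by one feature only
            improved = True
    return index_features

# Toy objective: number of sample queries that share at least one indexed feature.
queries = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
objective = lambda feats: sum(1 for q in queries if feats & q)
print(iterative_feature_update(set(), {"a", "b", "c", "d"}, objective))
```

Since each invocation touches one feature, most of the index structure survives unchanged, which is why such an update can be run frequently as the environment drifts.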
Citations: 20
Exploiting Common Subexpressions for Cloud Query Processing
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.106
Yasin N. Silva, P. Larson, Jingren Zhou
Many companies now routinely run massive data analysis jobs -- expressed in some scripting language -- on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts produces plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another partitioned on a column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.
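The first step such an optimizer needs, recognizing that several consumers share an intermediate result, can be sketched by hashing plan subtrees canonically and counting consumers. The plan encoding here is our own invention; the paper's contribution is the subsequent cost-based reconciliation of the consumers' physical requirements, which this sketch does not show.

```python
from collections import Counter

def subtree_key(node):
    """Canonical key for a plan subtree: op(child_keys...)."""
    op, children = node
    return op + "(" + ",".join(subtree_key(c) for c in children) + ")"

def shared_subexpressions(root):
    counts = Counter()
    def walk(node):
        counts[subtree_key(node)] += 1
        for child in node[1]:
            walk(child)
    walk(root)
    return {key: n for key, n in counts.items() if n > 1}

scan = ("scan_logs", [])
shared = ("join_users", [scan])        # consumed by two aggregations
plan = ("union", [("agg_by_A", [shared]), ("agg_by_B", [shared])])
print(shared_subexpressions(plan))     # the join (and its scan) appear twice
```

A conventional plan would execute the join once per consumer; detecting the repeated key is what lets the extended optimizer consider materializing it once, in whichever partitioning the cost model prefers.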
Citations: 40
Privacy in Social Networks: How Risky is Your Social Graph?
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.99
C. Akcora, B. Carminati, E. Ferrari
Several efforts have been made to make Online Social Networks (OSNs) more privacy aware and to protect personal data against various privacy threats. However, despite the relevance of these proposals, we believe there is still a lack of a conceptual model on top of which privacy tools can be designed. Central to this model should be the concept of risk. Therefore, in this paper, we propose a risk measure for OSNs. The aim is to associate a risk level with each social network user in order to give other users a measure of how risky, in terms of disclosure of private information, interacting with that user might be. We compute risk levels based on similarity and benefit measures, also taking user risk attitudes into account. In particular, we adopt an active learning approach for risk estimation, where a user's risk attitude is learned from a few required user interactions. The risk estimation process discussed in this paper has been developed into a Facebook application and tested on real data. The experiments show the effectiveness of our proposal.
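A toy version of a similarity-and-benefit risk score might look as follows. The profile features, the Jaccard similarity, and the single risk-attitude coefficient are assumptions made for this sketch, not the authors' model (which learns the risk attitude actively from user interactions).

```python
def jaccard(a, b):
    """Set similarity in [0, 1]."""
    return len(a & b) / len(a | b) if a or b else 0.0

def risk_score(my_profile, their_profile, benefit, risk_attitude=0.5):
    """Lower similarity and lower benefit -> riskier interaction partner.
    risk_attitude in [0, 1]: 0 weighs similarity most (risk-averse),
    1 weighs benefit most (risk-tolerant)."""
    similarity = jaccard(my_profile["interests"], their_profile["interests"])
    safety = (1 - risk_attitude) * similarity + risk_attitude * benefit
    return 1.0 - safety

me = {"interests": {"databases", "privacy", "running"}}
stranger = {"interests": {"poker", "crypto"}}
colleague = {"interests": {"databases", "privacy"}}
print(risk_score(me, stranger, benefit=0.2))    # high risk
print(risk_score(me, colleague, benefit=0.6))   # lower risk
```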
Citations: 63
Accelerating Range Queries for Brain Simulations
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.56
F. Tauheed, Laurynas Biveinis, T. Heinis, F. Schürmann, H. Markram, A. Ailamaki
Neuroscientists increasingly use computational tools to build and simulate models of the brain. The amounts of data involved in these simulations are immense, and managing this data efficiently is key. One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well even on today's small models, which represent only a small fraction of the brain and contain just a few million densely packed spatial elements. The problem with current approaches is that as the level of detail in the models increases, so does the overlap in the tree structure, ultimately slowing down query execution. The neuroscientists' need to work with bigger and more detailed (denser) models thus motivates us to develop a new indexing approach. To this end we develop FLAT, a scalable indexing approach for dense data sets. We base the development of FLAT on the key observation that current approaches suffer from overlap on dense data sets. We hence design FLAT as an approach with two phases, each independent of density. In the first phase it uses a traditional spatial index to retrieve an initial object efficiently. In the second phase it traverses the initial object's neighborhood to retrieve the remaining query results. Our experimental results show that FLAT not only outperforms R-Tree variants by a factor of two to eight, but also achieves independence from data set size and density.
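The two-phase shape of FLAT can be sketched as: probe any spatial index for one seed object intersecting the query, then crawl outward over precomputed neighbor links. Everything below is simplified to one dimension with a linear scan standing in for the index probe and hand-made neighbor lists; it only illustrates why the second phase is insensitive to index overlap.

```python
from collections import deque

def range_query(objects, neighbors, boxes, query):
    """Phase 1: find one seed object in the range; phase 2: crawl neighbors."""
    qlo, qhi = query
    def hits(obj):
        lo, hi = boxes[obj]
        return lo <= qhi and qlo <= hi            # 1-D interval overlap test
    seed = next((o for o in objects if hits(o)), None)   # stand-in for index probe
    if seed is None:
        return set()
    result, frontier = {seed}, deque([seed])      # BFS over qualifying neighbors
    while frontier:
        for nb in neighbors[frontier.popleft()]:
            if nb not in result and hits(nb):
                result.add(nb)
                frontier.append(nb)
    return result

boxes = {"a": (0, 2), "b": (1, 3), "c": (2, 5), "d": (8, 9)}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(range_query(list(boxes), neighbors, boxes, (1, 4)))   # -> {'a', 'b', 'c'}
```

Once the seed is found, the cost of the crawl depends only on the size of the result neighborhood, not on how badly index regions overlap, which matches the density independence the abstract claims for the second phase.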
Citations: 49
On Text Clustering with Side Information
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.111
C. Aggarwal, Yuchen Zhao, Philip S. Yu
Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. In most cases, the data is not purely available in text form. A lot of side-information is available along with the text documents. Such side-information may be of different kinds, such as the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the clustering process, because it can either improve the quality of the representation for clustering, or can add noise to the process. Therefore, we need a principled way to perform the clustering process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
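One simple way to act on this idea is a clustering distance that mixes text similarity with side-attribute similarity, with a coefficient expressing how much the side information is trusted. The mixing rule below is an assumption for illustration only; the paper combines partitioning algorithms with probabilistic models rather than a fixed weight.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors given as dicts."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_distance(doc, centroid, lam):
    """lam in [0, 1] is the trust placed in the side information."""
    d_text = 1.0 - cosine(doc["terms"], centroid["terms"])
    d_side = 1.0 - cosine(doc["side"], centroid["side"])
    return (1 - lam) * d_text + lam * d_side

doc = {"terms": {"db": 2, "index": 1}, "side": {"link:acm": 1}}
c1 = {"terms": {"db": 1, "query": 1}, "side": {"link:acm": 1}}
c2 = {"terms": {"soccer": 3}, "side": {"link:espn": 1}}
for lam in (0.0, 0.5):
    print(lam, combined_distance(doc, c1, lam), combined_distance(doc, c2, lam))
```

Setting lam to zero ignores the side information entirely; raising it only pays off when the side signal is clean, which is exactly the trade-off the abstract says must be handled in a principled way.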
Citations: 48
Adaptive Windows for Duplicate Detection
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.20
Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might be so large that a pairwise comparison of all records is infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
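A compact sketch of this idea, assuming a toy trigram similarity and illustrative parameter values: sort by key, compare each record within a window of successors, and grow the window while it keeps yielding duplicates above a rate phi. The rule for growing the window is a simplification of DCS, not the exact DCS++ algorithm.

```python
def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def looks_same(a, b):
    """Toy similarity: Jaccard overlap of character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / max(len(ta | tb), 1) > 0.4

def adaptive_snm(records, sort_key, is_dup, w0=3, phi=0.3):
    recs = sorted(records, key=sort_key)
    pairs = []
    for i in range(len(recs)):
        window_end = min(i + w0, len(recs))
        dups = comparisons = 0
        j = i + 1
        while j < window_end:
            comparisons += 1
            if is_dup(recs[i], recs[j]):
                dups += 1
                pairs.append((recs[i], recs[j]))
            j += 1
            # Grow the window while its duplicate rate stays above phi.
            if j == window_end and dups / comparisons > phi:
                window_end = min(window_end + 1, len(recs))
    return pairs

people = ["jon smith", "john smith", "johhn smyth", "mary jones", "marry jones"]
print(adaptive_snm(people, sort_key=str, is_dup=looks_same))
```

In duplicate-poor regions the window stays at its initial size, so few comparisons are wasted; in duplicate-rich regions it keeps expanding, which is the intuition the abstract states.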
Citations: 113
Trust and Share: Trusted Information Sharing in Online Social Networks
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.127
B. Carminati, E. Ferrari, Jacopo Girardi
At the beginning of the Web 2.0 era, Online Social Networks (OSNs) appeared to be just another phenomenon among wikis, blogs, video sharing, and so on. However, they soon became one of the biggest revolutions of the Internet era. Statistics confirm the continuing rise in the importance of social networking sites in terms of number of users (e.g., Facebook reaches 750 million users, Twitter 200 million, LinkedIn 100 million), time spent on social networking sites, and the amount of data flowing (e.g., Facebook users interact with about 900 million pieces of data in terms of pages, groups, events, and community pages). This successful trend makes OSNs one of the most promising paradigms for information sharing on the Web.
Citations: 22
Efficient Top-k Keyword Search in Graphs with Polynomial Delay
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.124
M. Kargar, Aijun An
We demonstrate a system for efficient keyword search in graphs. The system has two components: a search, restricted to the nodes containing the input keywords, for sets of nodes that are close to each other and together cover the input keywords; and an exploration that finds how the nodes in such a set are related to each other. The system generates all answers, or the top-k answers, with polynomial delay. Answers are presented to the user according to a ranking criterion, so that answers whose nodes are closer to each other are presented before those whose nodes are farther apart. In addition, the set of answers produced by our system is duplication-free. The system uses two methods for presenting the final answer to the user. The presentation methods reveal the relationships among the nodes in an answer through a tree or a multi-center graph. We show that each method has its own advantages and disadvantages. The system is demonstrated on two challenging datasets, the very large DBLP and the highly cyclic Mondial. Challenges and difficulties in implementing an efficient keyword search system are also demonstrated.
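The ranking criterion can be illustrated by brute force: choose one node per keyword, score the candidate answer by the sum of pairwise shortest-path distances, and keep the k best. Unlike the demonstrated system, this exhaustive sketch offers no polynomial-delay guarantee; it only shows what is being ranked. Graph and keyword assignments are made up.

```python
from collections import deque
from itertools import combinations, product

def bfs_distances(graph, src):
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def top_k_covers(graph, keyword_nodes, k):
    dist = {u: bfs_distances(graph, u) for u in graph}
    seen, answers = set(), []
    for combo in product(*keyword_nodes.values()):
        nodes = frozenset(combo)            # one node may cover several keywords
        if nodes in seen:                   # keep the answer set duplication-free
            continue
        seen.add(nodes)
        score = sum(dist[a].get(b, float("inf"))
                    for a, b in combinations(nodes, 2))
        answers.append((score, sorted(nodes)))
    answers.sort()
    return answers[:k]

graph = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}   # toy graph
keyword_nodes = {"db": [1, 4], "index": [2, 5]}             # keyword -> matching nodes
print(top_k_covers(graph, keyword_nodes, k=2))
```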
Citations: 14
A Self-Configuring Schema Matching System
Pub Date: 2012-04-01 | DOI: 10.1109/ICDE.2012.21
E. Peukert, Julian Eberius, E. Rahm
Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment, and model management. To speed up the generation of such mappings, automatic matching systems were developed that compute mapping suggestions which a user can then correct. However, constructing and tuning match strategies still requires substantial manual effort by matching experts, as well as correct mappings against which generated mappings can be evaluated. We therefore propose a self-configuring schema matching system that automatically adapts to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology, and model management domains. The evaluation shows that our system robustly returns good-quality mappings across different mapping problems and domains.
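The self-configuring loop can be caricatured as: compute cheap features of the input schemas, then apply rules that decide which matchers to run. Feature names, thresholds, and matcher names below are invented for this sketch; the actual system also adapts the matching process to intermediate match results.

```python
def analyze(schema_a, schema_b):
    """Cheap features of the two input schemas (illustrative choices)."""
    names = schema_a + schema_b
    return {
        "avg_name_len": sum(map(len, names)) / len(names),
        "size_ratio": min(len(schema_a), len(schema_b))
                      / max(len(schema_a), len(schema_b)),
    }

def configure(features):
    """Rules mapping schema features to a matcher lineup."""
    matchers = ["name_equality"]
    if features["avg_name_len"] > 6:      # long names: token matching pays off
        matchers.append("name_tokens")
    if features["size_ratio"] < 0.5:      # very different sizes: add structure
        matchers.append("structure")
    return matchers

schema_a = ["customer_id", "customer_name", "order_date"]
schema_b = ["CustID", "CustomerName", "OrderDate", "ShipAddr"]
print(configure(analyze(schema_a, schema_b)))
```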
Citations: 49