2011 IEEE 27th International Conference on Data Engineering: Latest Papers

ATOM: Automatic target-driven ontology merging

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767871

Salvatore Raunich, E. Rahm

The proliferation of ontologies and taxonomies in many domains increasingly demands the integration of multiple such ontologies to provide a unified view on them. We demonstrate a new automatic approach to merge large taxonomies such as product catalogs or web directories. Our approach is based on an equivalence matching between a source and target taxonomy to merge them. It is target-driven, i.e. it preserves the structure of the target taxonomy as much as possible. Further, we show how the approach can utilize additional relationships between source and target concepts to semantically improve the merge result.

Citations: 76

Preference queries over sets

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767866

Xi Zhang, J. Chomicki

We propose a “logic + SQL” framework for set preferences. Candidate best sets are represented using profiles consisting of scalar features. This reduces set preferences to tuple preferences over set profiles. We propose two optimization techniques: superpreference and M-relation. Superpreference targets dominated profiles. It reduces the input size by filtering out tuples not belonging to any best k-subset. M-relation targets repeated profiles. It consolidates tuples that are exchangeable with regard to the given set preference, and therefore avoids redundant computation of the same profile. We show the results of an experimental study that demonstrates the efficacy of the optimizations.

Citations: 26

Efficient SPectrAl Neighborhood blocking for entity resolution

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767835

Liangcai Shu, Aiyou Chen, Ming Xiong, W. Meng

In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when no unique identifier exists across multiple data sources to represent a real-world entity. Entity resolution aims to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy, while remaining efficient and scalable on large data sets.
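
The abstract does not say how SPAN applies its modularity-based stopping criterion, so the following is only a minimal sketch of the standard Newman-Girvan modularity Q for an undirected, unweighted graph. The function names and the rule "accept a bipartition only if Q is positive" are illustrative assumptions, not the authors' implementation, and the fast spectral step itself is not shown.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    communities: dict mapping every node that appears in an edge to a label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # community -> number of edges with both endpoints inside it
    degree_sum = {}  # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = communities[u], communities[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def should_split(edges, left, right):
    """Hypothetical stopping rule (an assumption): keep a bipartition only if Q > 0."""
    labels = {node: 0 for node in left}
    labels.update({node: 1 for node in right})
    return modularity(edges, labels) > 0


# Two triangles joined by a single bridge edge.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_split = should_split(two_triangles, [0, 1, 2], [3, 4, 5])
bad_split = should_split(two_triangles, [0], [1, 2, 3, 4, 5])
```

On this toy graph, cutting at the bridge gives positive modularity, while peeling off a single node gives negative modularity, so only the first split would be kept under the assumed rule.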

Citations: 42

DBridge: A program rewrite tool for set-oriented query execution

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767949

Mahendra Chavan, Ravindra Guravannavar, Karthik Ramachandra, Sundararajarao Sudarshan

We present DBridge, a novel static analysis and program transformation tool to optimize database access. Traditionally, the rewriting of queries and programs is done independently, by the database query optimizer and the language compiler respectively, leaving out many optimization opportunities. Our tool aims to bridge this gap by performing holistic transformations, which include both program and query rewrite.
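
To illustrate what a combined program and query rewrite might look like, here is a minimal sketch of the general idea of set-oriented execution: a loop that issues one query per iteration is replaced by a single query over the whole set of parameters. The `orders` table, the function names, and the use of Python with `sqlite3` are assumptions made for illustration; this is not DBridge's output or its transformation rules.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5), (3, 1.0)],
    )
    return conn


def totals_iterative(conn, customer_ids):
    """Original program shape: one query per loop iteration."""
    totals = {}
    for cid in customer_ids:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
            (cid,),
        ).fetchone()
        totals[cid] = row[0]
    return totals


def totals_set_oriented(conn, customer_ids):
    """Rewritten shape: a single query evaluated over the whole parameter set."""
    customer_ids = list(customer_ids)
    if not customer_ids:
        return {}
    totals = {cid: 0 for cid in customer_ids}  # customers without orders stay at 0
    placeholders = ",".join("?" for _ in customer_ids)
    query = (
        "SELECT customer_id, SUM(amount) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id"
    )
    for cid, total in conn.execute(query, customer_ids):
        totals[cid] = total
    return totals


conn = make_db()
per_row = totals_iterative(conn, [1, 2, 4])
batched = totals_set_oriented(conn, [1, 2, 4])
```

Both functions return the same totals, but the second sends one statement to the database instead of one per customer, which is the kind of opportunity that neither a query optimizer nor a compiler sees when working alone.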

Citations: 26

A unified model for data and constraint repair

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767833

Fei Chiang, Renée J. Miller

Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data, by finding minimal or lowest cost changes to the data that make it consistent with the constraints. Such techniques are appropriate for the old world where data changes, but schemas and their constraints remain fixed. In many modern applications however, constraints may evolve over time as application or business rules change, as data is integrated with new data sources, or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear if there is an error in the data (and the data should be repaired), or if the constraints have evolved (and the constraints should be repaired). In this work, we present a novel unified cost model that allows data and constraint repairs to be compared on an equal footing. We consider repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs). FDs are the most common type of constraint, and are known to play an important role in maintaining data quality. We evaluate the quality and scalability of our repair algorithms over synthetic data and present a qualitative case study using a well-known real dataset. The results show that our repair algorithms not only scale well for large datasets, but are able to accurately capture and correct inconsistencies, and accurately decide when a data repair versus a constraint repair is best.
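
As a small illustration of what it means for data to be inconsistent with a functional dependency, the sketch below groups rows by the left-hand side of an FD and reports groups that map to more than one right-hand-side value. The function name and the sample `zip`/`city` data are hypothetical, and the sketch only detects violations; it does not implement the paper's cost model or repair algorithms.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the LHS values that violate the FD lhs -> rhs.

    rows: list of dicts, one per tuple.
    lhs, rhs: lists of attribute names.
    Result: dict mapping each violating LHS value to its set of distinct RHS values.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[a] for a in rhs))
    # An FD holds only if every LHS value maps to exactly one RHS value.
    return {key: values for key, values in groups.items() if len(values) > 1}


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
violations = fd_violations(rows, ["zip"], ["city"])
```

Here the FD zip -> city is violated for zip 10001. A data repair would change one of the city values, whereas a constraint repair would modify the FD itself (for example, by adding an attribute to its left-hand side); choosing between the two is the decision the unified cost model is meant to support.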

Citations: 107

Partitioning techniques for fine-grained indexing

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767830

Eugene Wu, S. Madden

Many data-intensive websites use databases that grow much faster than the rate at which users access the data. Such growing datasets lead to ever-increasing space and performance overheads for maintaining and accessing indexes. Furthermore, access is often considerably skewed, with popular users and recent data accessed much more frequently. These observations led us to design Shinobi, a system that uses horizontal partitioning both to improve query performance, by clustering the physical data, and to increase insert performance, by indexing only data that is frequently accessed. We present database design algorithms that optimally partition tables, drop indexes from partitions that are infrequently queried, and maintain these partitions as workloads change. We show a 60× performance improvement over traditionally indexed tables using a real-world query workload derived from a traffic monitoring application.

Citations: 37

Query optimizer plan diagrams: Production, reduction and applications

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767959

J. Haritsa

The automated optimization of declarative SQL queries is a classical problem that has been diligently addressed by the database community over several decades. However, due to its inherent complexities and challenges, the topic has largely remained a “black art”, and the quality of the query optimizer continues to be a key differentiator between competing database products, with large technical teams involved in their design and implementation.

Citations: 6

Outlier detection on uncertain data: Objects, instances, and inferences

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767850

B. Jiang, J. Pei

This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model considering both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances which are modeled by a probability density distribution. We detect outliers at both the instance level and the object level. To detect outlier instances, it is a prerequisite to know normal instances. By assuming that uncertain objects with similar properties tend to have similar instances, we learn the normal instances for each uncertain object using the instances of objects with similar properties. Consequently, outlier instances can be detected by comparing against normal ones. Furthermore, we can detect outlier objects most of whose instances are outliers. Technically, we use a Bayesian inference algorithm to solve the problem, and develop an approximation algorithm and a filtering algorithm to speed up the computation. An extensive empirical study on both real data and synthetic data verifies the effectiveness and efficiency of our algorithms.

Citations: 28

Optimal location queries in road network databases

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767845

Xiaokui Xiao, Bin Yao, Feifei Li

Optimal location (OL) queries are a type of spatial query particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their Lp distance. Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.
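
The gap between Lp distance and road-network distance can be seen on a toy example. The sketch below compares the Euclidean (L2) distance between two points with their shortest-path distance computed by Dijkstra's algorithm. The coordinates, road layout, and function name are made up for illustration, and this is not one of the paper's query processing algorithms.

```python
import heapq
import math


def network_distance(graph, source, target):
    """Shortest-path length by Dijkstra. graph: node -> list of (neighbor, length)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, length in graph.get(node, []):
            candidate = dist + length
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return math.inf  # target is unreachable


# Hypothetical layout: A and B are close, but the only road detours through C and D.
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0), "D": (1.0, 1.0)}
roads = [("A", "C"), ("C", "D"), ("D", "B")]

graph = {}
for u, v in roads:
    length = math.dist(coords[u], coords[v])
    graph.setdefault(u, []).append((v, length))
    graph.setdefault(v, []).append((u, length))

euclidean = math.dist(coords["A"], coords["B"])  # straight-line (L2) distance
network = network_distance(graph, "A", "B")      # distance along the roads
```

In this layout the straight-line distance from A to B is 1.0 while the road distance is 3.0, so a facility location chosen by Lp distance could be far from optimal once travel is restricted to the network.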

Citations: 104

Finding top-k profitable products

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767895

Qian Wan, R. C. Wong, Yu Peng

The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous studies focus on how to help customers find a set of “best” possible products from a pool of given products. In this paper, we identify an interesting problem, finding top-k profitable products, which has not been studied before. Given a set of products in the existing market, we want to find a set of k “best” possible products such that these new products are not dominated by the products in the existing market. In this problem, we need to set the prices of these products such that the total profit is maximized. We refer to such products as top-k profitable products. A straightforward solution is to enumerate all possible subsets of size k and find the subset which gives the greatest profit. However, there is an exponential number of possible subsets. In this paper, we propose solutions to find the top-k profitable products efficiently. An extensive performance study using both synthetic and real datasets is reported to verify its effectiveness and efficiency.
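
The dominance condition mentioned above can be sketched with the usual skyline definition: one product dominates another if it is at least as good on every attribute and strictly better on at least one. The sketch assumes numeric attributes where smaller is better and uses made-up (weight, price) tuples. It only filters candidates against the existing market and does not perform the price setting or profit maximization that the paper addresses.

```python
def dominates(p, q):
    """True if p is no worse than q on every attribute and strictly better on one.

    Assumption: attributes are numeric and smaller values are better.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def undominated(candidates, market):
    """Keep the candidates that no existing market product dominates."""
    return [c for c in candidates
            if not any(dominates(m, c) for m in market)]


# Hypothetical (weight, price) tuples.
market = [(2, 100), (3, 80)]
candidates = [(2, 90), (3, 120), (4, 70)]
survivors = undominated(candidates, market)
```

The candidate (3, 120) is dropped because the market product (2, 100) is both lighter and cheaper; the other two candidates each beat every market product on at least one attribute.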

Citations: 44
