2011 IEEE 27th International Conference on Data Engineering: Latest Publications

NORMS: An automatic tool to perform schema label normalization

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767952

S. Sorrentino, S. Bergamaschi, M. Gawinecki

Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and structure). Schema matching systems usually exploit lexical and semantic information provided by lexical databases/thesauri to discover intra/inter semantic relationships among schema elements. However, most of them perform poorly in real-world scenarios because of the significant presence of "non-dictionary words". Non-dictionary words include compound nouns, abbreviations and acronyms. In this paper, we present NORMS (NORMalizer of Schemata), a tool that performs schema label normalization to increase the number of comparable labels extracted from schemata.
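
To make the idea of label normalization concrete, the following is a minimal sketch, not the NORMS implementation. It assumes a tiny hand-written abbreviation table (`ABBREVIATIONS`, hypothetical) and a simple tokenization rule; the function name `normalize_label` is likewise illustrative.

```python
import re
from typing import List

# Hypothetical abbreviation table; a real tool would rely on lexical resources.
ABBREVIATIONS = {"qty": "quantity", "cust": "customer", "addr": "address", "no": "number"}

def normalize_label(label: str) -> List[str]:
    """Split a schema label into lowercase tokens and expand known abbreviations."""
    # Insert a space at lower-to-upper camelCase boundaries, e.g. custAddr -> cust Addr.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Split on whitespace, underscores and hyphens, dropping empty pieces.
    tokens = [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]
    # Replace each token that appears in the table with its expansion.
    return [ABBREVIATIONS.get(t, t) for t in tokens]

print(normalize_label("custAddr"))
print(normalize_label("ORDER_QTY"))
```

After normalization, labels such as `custAddr` and `customer_address` yield the same token list, so a dictionary-based matcher can compare them.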

Citations: 11

Decomposing DAGs into spanning trees: A new way to compress transitive closures

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767832

Yangjun Chen, Yibin Chen

Let G(V, E) be a digraph (directed graph) with n nodes and e edges. The digraph G* = (V, E*) is the reflexive, transitive closure of G if (v, u) ∈ E* iff there is a path from v to u in G. Efficient storage of G* is important for supporting reachability queries, which are not only common in graph databases but also serve as fundamental operations in many graph algorithms. Many strategies based on graph labeling have been suggested, in which each node is assigned labels such that reachability between any two nodes can be determined from their labels alone. Among them are interval labeling, chain decomposition, and 2-hop labeling. However, because many real-world graphs are very large, the computational cost and label size of existing methods are too high to be practical. In this paper, we propose a new approach that decomposes a graph into a series of spanning trees, which may share common edges, in order to transform a reachability query over a graph into a set of queries over trees. We demonstrate both analytically and empirically the efficiency and effectiveness of our method.
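
As an illustration of why queries over trees are cheap, here is a minimal sketch of interval labeling on a single rooted tree. It is not the paper's decomposition algorithm; the function names `label_tree` and `reaches` and the dictionary-of-children input format are assumptions made for this example.

```python
def label_tree(children, root):
    """Assign a (start, end) interval to every node with an iterative DFS.

    start is the node's preorder number; end is the largest preorder
    number found in the node's subtree.
    """
    labels = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants are numbered; close the interval.
            labels[node] = (labels[node][0], counter)
            continue
        counter += 1
        labels[node] = (counter, None)
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return labels

def reaches(labels, u, v):
    """u reaches v in the tree iff v's interval is contained in u's interval."""
    start_u, end_u = labels[u]
    start_v, end_v = labels[v]
    return start_u <= start_v and end_v <= end_u

tree = {"a": ["b", "c"], "b": ["d"]}
labels = label_tree(tree, "a")
print(reaches(labels, "a", "d"), reaches(labels, "b", "c"))
```

Each reachability test on a tree is two integer comparisons, which is what makes a reduction from graph queries to tree queries attractive.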

Citations: 38

Efficient maintenance of common keys in archives of continuous query results from deep websites

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767891

Fajar Ardian, S. Bhowmick

In many real-world applications, it is important to create a local archive containing versions of structured results of continuous queries (queries that are evaluated periodically) submitted to autonomous database-driven Web sites (e.g., deep Web). Such a history of digital information is a potential gold mine for all kinds of scientific, media and business analysts. An important task in this context is to maintain the set of common keys of the underlying archived results, as they play a pivotal role in data modeling and analysis, query processing, and entity tracking. A set of attributes in structured data is a common key iff it is a key for all versions of the data in the archive. Because key discovery from the archive is data-driven, the common keys, unlike traditional keys, are not temporally invariant. That is, keys identified in one version may differ from those in another version. Hence, in this paper, we propose a novel technique to maintain common keys in an archive containing a sequence of versions of evolutionary continuous query results. Given the current common key set of existing versions and a new snapshot, we propose an algorithm called COKE (COmmon KEy maintenancE) which incrementally maintains the common key set without undertaking an expensive minimal-key computation on the new snapshot. Furthermore, it exploits certain interesting evolutionary features of real-world data to further reduce the computation cost. Our exhaustive empirical study demonstrates that COKE has excellent performance and is orders of magnitude faster than a baseline approach for maintenance of common keys.
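
The definition of a common key can be illustrated with a naive sketch. This is not COKE: it only filters an explicit list of candidate attribute sets against a new snapshot and does not derive new minimal keys when a candidate stops holding. The names `is_key` and `update_common_keys` and the list-of-dicts row format are assumptions for this example.

```python
def is_key(rows, attrs):
    """Return True if no two rows agree on all attributes in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def update_common_keys(common_keys, new_version):
    """Keep only the candidate keys that are still keys in the new snapshot."""
    return [attrs for attrs in common_keys if is_key(new_version, attrs)]

snapshot = [
    {"id": 1, "name": "a", "city": "x"},
    {"id": 2, "name": "a", "city": "y"},
]
candidates = [("id",), ("name",), ("name", "city")]
print(update_common_keys(candidates, snapshot))
```

Here `("name",)` is dropped because two rows share the same name in the new snapshot, which shows how a key valid in earlier versions can stop being a common key.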

Citations: 6

AMC - A framework for modelling and comparing matching systems as matching processes

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767940

E. Peukert, Julian Eberius, E. Rahm

We present the Auto Mapping Core (AMC), a new framework that supports fast construction and tuning of schema matching approaches for specific domains such as ontology alignment, model matching or database-schema matching. Distinctive features of our framework are new visualisation techniques for modelling matching processes, stepwise tuning of parameters, intermediate result analysis and performance-oriented rewrites. Furthermore, existing matchers can be plugged into the framework to comparatively evaluate them in a common environment. This allows deeper analysis of behaviour and shortcomings in existing complex matching systems.

Citations: 66

Selectivity estimation of twig queries on cyclic graphs

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767893

Yun Peng, Byron Choi, Jianliang Xu

Recent applications including the Semantic Web, Web ontology and XML have sparked a renewed interest in graph-structured databases. Among others, twig queries have been a popular tool for retrieving subgraphs from graph-structured databases. To optimize twig queries, selectivity estimation has been a crucial and classical step. However, most existing work on selectivity estimation focuses on relational and tree data. In this paper, we investigate selectivity estimation of twig queries on possibly cyclic graph data. To facilitate selectivity estimation on cyclic graphs, we propose a matrix representation of graphs derived from prime labeling, a scheme for reachability queries on directed acyclic graphs. With this representation, we exploit the consecutive ones property (C1P) of matrices. As a consequence, a node is mapped to a point in a two-dimensional space whereas a query is mapped to multiple points. We adopt histograms for scalable selectivity estimation. We perform an extensive experimental evaluation of the proposed technique and show that it keeps the estimation error under 1.3% on XMARK and DBLP, which is more accurate than previous techniques. On TREEBANK, we produce RMSE and NRMSE 6.8 times smaller than previous techniques.

Citations: 10

On data dependencies in dataspaces

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767857

Shaoxu Song, Lei Chen, Philip S. Yu

To study data dependencies over heterogeneous data in dataspaces, we define a general dependency form, namely comparable dependencies (CDs), which specifies constraints on comparable attributes. It covers the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in practical dataspace applications, e.g., semantic query optimization. Because the data in dataspaces are heterogeneous, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a given error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximation computation, including greedy and randomized approaches with an approximation bound on the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
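
To illustrate what it means for a dependency to "almost" hold, the sketch below covers only the simplest special case mentioned above, an ordinary functional dependency with exact equality. It computes the fraction of tuples that would have to be removed for the FD to hold. Comparable dependencies use similarity between comparable attributes, which this example does not model; the function name `fd_error` and the row format are assumptions.

```python
from collections import Counter, defaultdict

def fd_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly."""
    # For every distinct lhs value, count how often each rhs value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        groups[left][right] += 1
    # Within each group, keep the rows carrying the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows) if rows else 0.0

rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "SF"},
]
print(fd_error(rows, ["zip"], ["city"]))
```

In this data the dependency zip -> city is violated only by the spelling variant "New York", which is the kind of heterogeneity that motivates comparing values by similarity rather than by equality.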

Citations: 23

How schema independent are schema free query interfaces?

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767880

Arash Termehchy, M. Winslett, Yodsawalai Chodpathumwan

Real-world databases often have extremely complex schemas. With thousands of entity types and relationships, each with a hundred or so attributes, it is extremely difficult for new users to explore the data and formulate queries. Schema free query interfaces (SFQIs) address this problem by allowing users with no knowledge of the schema to submit queries. We postulate that SFQIs should deliver the same answers when given alternative but equivalent schemas for the same underlying information. In this paper, we introduce and formally define design independence, which captures this property for SFQIs. We establish a theoretical framework to measure the amount of design independence provided by an SFQI. We show that most current SFQIs provide a very limited degree of design independence. We also show that SFQIs based on the statistical properties of data can provide design independence when the changes in the schema do not introduce or remove redundancy in the data. We propose a novel XML SFQI called Duplication Aware Coherency Ranking (DA-CR) based on information-theoretic relationships among the data items in the database, and prove that DA-CR is design independent. Our extensive empirical study using three real-world data sets shows that the average case design independence of current SFQIs is considerably lower than that of DA-CR. We also show that the ranking quality of DA-CR is better than or equal to that of current SFQI methods.

Citations: 19

Generating test data for killing SQL mutants: A constraint-based approach

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767876

Shetal Shah, Sundararajarao Sudarshan, Suhas Kajbaje, S. Patidar, B. P. Gupta, Devang Vira

Complex SQL queries are widely used today, but it is rather difficult to check if a complex query has been written correctly. Formal verification based on comparing a specification with an implementation is not applicable, since SQL queries are essentially a specification without any implementation. Queries are usually checked by running them on sample datasets and checking that the correct result is returned; there is no guarantee that all possible errors are detected. In this paper, we address the problem of test data generation for checking correctness of SQL queries, based on the query mutation approach for modeling errors. Our presentation focuses in particular on a class of join/outer-join mutations, comparison operator mutations, and aggregation operation mutations, which are a common cause of error. To minimize human effort in testing, our techniques generate a test suite containing small and intuitive test datasets. The number of datasets generated is linear in the size of the query, although the number of mutations in the class we consider is exponential. Under certain assumptions on constraints and query constructs, the test suite we generate is complete for a subclass of mutations that we define, i.e., it kills all non-equivalent mutations in this subclass.
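
The notion of a dataset "killing" a mutant can be shown with a small sketch using Python's standard `sqlite3` module. The tables `student` and `enrol`, the two queries, and the helper names `run` and `kills` are hypothetical; the datasets are written by hand here, whereas the paper generates them from constraints.

```python
import sqlite3

# The intended query and a join-type mutant (inner join replaced by left outer join).
ORIGINAL = "SELECT s.name, e.course FROM student s JOIN enrol e ON s.id = e.sid"
MUTANT = "SELECT s.name, e.course FROM student s LEFT JOIN enrol e ON s.id = e.sid"

def run(query, students, enrolments):
    """Load the dataset into a fresh in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE enrol (sid INTEGER, course TEXT)")
    conn.executemany("INSERT INTO student VALUES (?, ?)", students)
    conn.executemany("INSERT INTO enrol VALUES (?, ?)", enrolments)
    # Sort by repr so rows containing NULL (None) can be ordered deterministically.
    rows = sorted(conn.execute(query).fetchall(), key=repr)
    conn.close()
    return rows

def kills(students, enrolments):
    """A dataset kills the mutant if the two queries return different results."""
    return run(ORIGINAL, students, enrolments) != run(MUTANT, students, enrolments)

# Every student is enrolled: both queries agree, so the mutant survives.
print(kills([(1, "Ann")], [(1, "DB")]))
# One student has no enrolment: the outer join adds a NULL-padded row.
print(kills([(1, "Ann"), (2, "Bob")], [(1, "DB")]))
```

The second dataset is small and easy to inspect by hand, which matches the stated goal of keeping test datasets small and intuitive.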

Citations: 40

Large scale Hamming distance query processing

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767831

A. Liu, Ke Shen, E. Torng

Hamming distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, we have a static Hamming distance range query problem. If k is part of the input, we have a dynamic Hamming distance range query problem. For the static problem, the prior art uses lots of memory due to its aggressive replication of the database. For the dynamic range query problem, as far as we know, there is no space and time efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngines, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngined, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngines uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngined processes queries 46 times faster than linear scan while using only 1.7 times more space.
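
For reference, the Hamming distance range query defined above can be answered by a linear scan, the baseline that the abstract compares against. The sketch below is that baseline only, not HEngines or HEngined; it assumes strings are fixed-length bit strings stored as Python integers, and the names `hamming` and `range_query` are illustrative.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def range_query(database, query, k):
    """Return every string in the database within Hamming distance k of the query."""
    return [s for s in database if hamming(s, query) <= k]

database = [0b1010, 0b1000, 0b0101, 0b1111]
print(range_query(database, 0b1010, 1))
print(range_query(database, 0b1010, 2))
```

The scan touches every string for every query, so its cost grows linearly with the database size; index-based methods trade extra space for avoiding that full pass.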

Citations: 35

Real-time quantification and classification of consistency anomalies in multi-tier architectures

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767927

Kamal Zellag, Bettina Kemme

While online transaction processing applications heavily rely on the transactional properties provided by the underlying infrastructure, they often choose not to use the highest isolation level, i.e., serializability, because of the potential performance implications of costly strict two-phase locking concurrency control. Instead, modern transaction systems, consisting of an application server tier and a database tier, offer several levels of isolation providing a trade-off between performance and consistency. While it is fairly well known how to identify the anomalies that are possible under a certain level of isolation, it is much more difficult to quantify the number of anomalies that occur during run-time of a given application. In this paper, we address this issue and present a new approach to detect, in real time, consistency anomalies for arbitrary multi-tier applications. As the application is running, our tool detects anomalies online, indicating exactly the transactions and data items involved. Furthermore, we classify the detected anomalies into patterns showing the business methods involved as well as their occurrence frequency. We use the RUBiS benchmark to show how the introduction of a new transaction type can have a dramatic effect on the number of anomalies for certain isolation levels, and how our tool can quickly detect such problem transactions. Therefore, our system can help designers either to choose an isolation level where the anomalies do not occur or to change the transaction design to avoid the anomalies.

Citations: 13
