
Latest publications: 2012 IEEE 28th International Conference on Data Engineering

LotusX: A Position-Aware XML Graphical Search System with Auto-Completion
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.123
Chunbin Lin, Jiaheng Lu, T. Ling, Bogdan Cautis
The existing query languages for XML (e.g., XQuery) require professional programming skills to formulate; moreover, such complex query languages burden query processing. In addition, when issuing an XML query, users must be familiar with the content (including the structural and textual information) of the hierarchical XML document, which is difficult for common users. Designing user-friendly interfaces that reduce the burden of query formulation is fundamental to the spread of XML. We present a twig-based XML graphical search system, called LotusX, that provides a graphical interface to simplify query processing without requiring knowledge of a query language, the data schema, or the content of the XML document. The basic idea is that LotusX offers "position-aware" and "auto-completion" features to help users create tree-modeled queries (twig patterns) by providing possible candidates on the fly. In addition, complex twig queries (including order-sensitive queries) are supported in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are implemented to rank and rewrite queries effectively. We provide an online demo of the LotusX system: http://datasearch.ruc.edu.cn:8080/LotusX.
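The auto-completion feature described above can be illustrated with a minimal sketch (not the actual LotusX implementation; the tag list and function name are hypothetical): given the element names known so far, suggest completions for a typed prefix.

```python
import bisect

def suggest(prefix, names):
    """Return the sorted names beginning with `prefix` (binary search, then scan)."""
    names = sorted(names)
    i = bisect.bisect_left(names, prefix)
    out = []
    while i < len(names) and names[i].startswith(prefix):
        out.append(names[i])
        i += 1
    return out

# e.g., typing "auth" while building a twig pattern over bibliographic data
print(suggest("auth", ["article", "author", "authors", "book", "title"]))
```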
Citations: 13
Correlation Support for Risk Evaluation in Databases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.30
K. Eisenreich, J. Adamek, Philipp J. Rösch, V. Markl, Gregor Hackenbroich
Investigating potential dependencies in data and their effect on future business developments can help experts prevent misestimations of risks and chances. This makes correlation a highly important factor in risk analysis tasks. Previous research on correlation in uncertain data management addressed foremost the handling of dependencies between discrete rather than continuous distributions. Also, none of the existing approaches provides a clear method for extracting correlation structures from data and introducing assumptions about correlation to independently represented data. To enable risk analysis under correlation assumptions, we use an approximation technique based on copula functions. This technique enables analysts to introduce arbitrary correlation structures between arbitrary distributions and calculate relevant measures over the correlated data. The correlation information can either be extracted at runtime from historic data or be accessed from a parametrically precomputed structure. We discuss the construction, application, and querying of approximate correlation representations for different analysis tasks. Our experiments demonstrate the efficiency and accuracy of the proposed approach and point out several possibilities for optimization.
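The copula idea can be sketched in a few lines: a Gaussian copula couples two arbitrary marginals (here exponentials, chosen purely for illustration) through a single correlation parameter. This is the generic textbook construction, not the paper's approximation technique; all names are illustrative.

```python
import math
import random

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correlated_exponentials(rho, lam1, lam2, n, seed=0):
    """Draw n pairs of exponential variates coupled by a Gaussian copula.

    rho correlates the underlying normals; lam1/lam2 are the rates of the
    two exponential marginals (illustrative choices only)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1, u2 = normal_cdf(z1), normal_cdf(z2)  # correlated uniforms
        x1 = -math.log(1.0 - u1) / lam1          # inverse-CDF transform
        x2 = -math.log(1.0 - u2) / lam2
        pairs.append((x1, x2))
    return pairs
```

Risk measures (e.g., the probability that both quantities exceed a threshold) can then be estimated over the coupled samples.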
Citations: 1
Multidimensional Analysis of Atypical Events in Cyber-Physical Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.32
L. Tang, Xiao Yu, Sangkyum Kim, Jiawei Han, Wen-Chih Peng, Yizhou Sun, Hector Gonzalez, Sebastian Seith
A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational) components to form a situation-integrated analytical system that may respond intelligently to dynamic changes in real-world situations. CPS has many promising applications, such as traffic observation, battlefield surveillance, and sensor-network-based monitoring. One important research topic in CPS is atypical event analysis, i.e., retrieving events from large amounts of data and analyzing them with spatial, temporal, and other multidimensional information. Many traditional approaches are not feasible for such analysis since they use numeric measures and cannot describe complex atypical events. In this study, we propose a new model of atypical clusters to effectively represent those events and efficiently retrieve them from massive data. The micro-cluster is designed to summarize individual events, and the macro-cluster is used to integrate the information from multiple events. To facilitate scalable, flexible, and online analysis, the concept of a significant cluster is defined, and a guided clustering algorithm is proposed to retrieve significant clusters in an efficient manner. We conduct experiments on real datasets of more than 50 GB in size; the results show that the proposed method provides more accurate information at only 15% to 20% of the time cost of the baselines.
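The micro/macro-cluster split rests on additive summaries: if a micro-cluster keeps only per-dimension counts, linear sums, and squared sums, two summaries merge by component-wise addition, so macro-clusters can be assembled without revisiting raw events. A generic sketch of such a summary (the class name and fields are illustrative, not the paper's data structure):

```python
class MicroCluster:
    """Additive event summary: count plus per-dimension linear and squared sums."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sums
        self.ss = [0.0] * dim   # squared sums

    def add(self, point):
        # Absorb one event (a point with `dim` coordinates).
        self.n += 1
        for i, v in enumerate(point):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        # Macro-cluster construction: component-wise addition of summaries.
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]
```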
Citations: 13
Lookup Tables: Fine-Grained Partitioning for Distributed Databases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.26
Aubrey Tatarowicz, C. Curino, E. Jones, S. Madden
The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally partition data across several nodes. Ideally, this partitioning will result in each query being executed at just one node, to avoid the overheads of distributed transactions and allow nodes to be added without increasing the amount of required coordination. For some applications, simple strategies, such as hashing on the primary key, provide this property. Unfortunately, for many applications, including social networking and order fulfillment, many-to-many relationships cause simple strategies to result in a large fraction of distributed queries. Instead, what is needed is a fine-grained partitioning, where related individual tuples (e.g., cliques of friends) are co-located in the same partition. Maintaining such a fine-grained partitioning requires the database to store a large amount of metadata about which partition each tuple resides in. We call such metadata a lookup table, and present the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates. We show that such tables can provide scalability for several difficult-to-partition database workloads, including Wikipedia, Twitter, and TPC-E. Our implementation provides 40% to 300% better performance on these workloads than either simple range or hash partitioning and shows greater potential for further scale-out.
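The core mechanism can be sketched as a routing layer that consults a per-tuple lookup table and falls back to hash partitioning for ids without an entry (a simplification of the data distribution layer described above; class and method names are illustrative):

```python
class LookupRouter:
    """Route tuple ids to partitions via a fine-grained lookup table,
    with hash partitioning as the fallback for unmapped ids."""

    def __init__(self, n_partitions):
        self.n = n_partitions
        self.table = {}  # tuple id -> partition (the lookup table)

    def colocate(self, ids, partition):
        # Pin a group of related tuples (e.g., a clique of friends) together,
        # so queries over the group touch a single node.
        for t in ids:
            self.table[t] = partition

    def route(self, t):
        return self.table.get(t, hash(t) % self.n)
```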
Citations: 48
Entity Search Strategies for Mashup Applications
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.84
Stefan Endrullis, Andreas Thor, E. Rahm
Programmatic data integration approaches such as mashups have become a viable way to dynamically integrate web data at runtime. Key data sources for mashups include entity search engines and hidden databases that need to be queried via source-specific search interfaces or web forms. Current mashups are typically restricted to simple query approaches such as keyword search. Such approaches may require a large number of queries if many objects have to be found. Furthermore, the effectiveness of the queries may be limited, i.e., they may miss relevant results. We therefore propose more advanced search strategies that aim at finding a set of entities with high efficiency and effectiveness. Our strategies use different kinds of queries that are determined by source-specific query generators. Furthermore, the queries are selected based on the characteristics of input entities. We introduce a flexible model for entity search strategies that includes a ranking of candidate queries determined by different query generators. We describe different query generators and outline their use within four entity search strategies. These strategies apply different query ranking and selection approaches to optimize efficiency and effectiveness. We evaluate our search strategies in detail for two domains: product search and publication search. The comparison with a standard keyword search shows that the proposed search strategies provide significant improvements in both domains.
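A generic strategy skeleton, assuming query generators emit (score, query) pairs and a search interface returns sets of entity ids (all names are hypothetical, not the paper's API): candidate queries from all generators are issued in rank order until enough distinct entities are found, trading query cost against coverage.

```python
def run_strategy(generators, search, needed):
    """Issue ranked candidate queries until `needed` distinct entities are found.

    generators: iterables of (score, query) pairs from source-specific
    query generators; search(query) returns a set of entity ids."""
    # Merge candidates from all generators, best score first.
    candidates = sorted((c for g in generators for c in g), reverse=True)
    found, cost = set(), 0
    for _, query in candidates:
        if len(found) >= needed:
            break
        found |= search(query)
        cost += 1
    return found, cost
```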
Citations: 11
Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.61
Dan Olteanu, Hongkai Wen
In many applications of probabilistic databases, the probabilities are mere degrees of uncertainty in the data and are not otherwise meaningful to the user. Often, users care only about the ranking of answers in decreasing order of their probabilities or about a few most likely answers. In this paper, we investigate the problem of ranking query answers in probabilistic databases. We give a dichotomy for ranking in the case of conjunctive queries without repeating relation symbols: it is either in polynomial time or NP-hard. Surprisingly, our syntactic characterisation of tractable queries is not the same as for probability computation. The key observation is that there are queries for which probability computation is #P-hard, yet ranking can be computed in polynomial time. This is possible whenever probability computation for distinct answers has a common factor that is hard to compute but irrelevant for ranking. We complement this tractability analysis with an effective ranking technique for conjunctive queries. Given a query, we construct a share plan, which exposes subqueries whose probability computation can be shared or ignored across query answers. Our technique combines share plans with incremental approximate probability computation of subqueries. We implemented our technique in the SPROUT query engine and report performance gains of orders of magnitude over Monte Carlo simulation using FPRAS and over exact probability computation based on knowledge compilation.
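The incremental-approximation idea can be sketched with probability intervals: an answer is provably top-ranked once its lower bound exceeds every rival's upper bound, so exact probabilities never need to be computed. A toy sketch (not the SPROUT implementation; `refine` is an assumed callback that tightens an answer's interval on each call and must converge for the loop to terminate):

```python
def top1(bounds, refine):
    """Pick the most probable answer using only probability intervals.

    bounds[a] = (lo, hi) encloses a's true probability; refine(a) returns
    a tighter interval for a."""
    while True:
        best = max(bounds, key=lambda a: bounds[a][0])
        rivals = [a for a in bounds if a != best]
        if not rivals:
            return best
        rival = max(rivals, key=lambda a: bounds[a][1])
        if bounds[best][0] >= bounds[rival][1]:
            return best            # leader's lower bound beats all upper bounds
        bounds[rival] = refine(rival)  # tighten the two contenders and retry
        bounds[best] = refine(best)
```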
Citations: 21
Scalable and Numerically Stable Descriptive Statistics in SystemML
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.12
Yuanyuan Tian, S. Tatikonda, B. Reinwald
With the exponential growth in the amount of data being generated in recent years, there is a pressing need to apply machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large-scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome two major challenges: (1) achieving numerical stability while operating on large data sets in the distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.
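The numerical-stability challenge is exemplified by variance: the textbook sum-of-squares formula E[x²] − E[x]² cancels catastrophically on large-magnitude data, while Welford's single-pass update does not. A sequential sketch of the stable update (SystemML's actual implementation distributes such computations over MapReduce; this is only the scalar core):

```python
def describe(stream):
    """One-pass, numerically stable count/mean/sample-variance (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the *updated* mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance
```

On data like `[1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]`, this recovers the sample variance of 30 accurately, whereas the naive formula loses most significant digits.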
Citations: 21
The Future of Scientific Data Bases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.151
M. Stonebraker, A. Ailamaki, J. Kepner, A. Szalay
For many decades, users in scientific fields (domain scientists) have resorted to either home-grown tools or legacy software for the management of their data. Technological advancements nowadays call for many of the properties found in the roadmap of DBMS technology, such as data independence, scalability, and functionality. DBMS products, however, are not yet ready to address scientific applications and user needs. Recent efforts toward building a science DBMS indicate that there is a long way ahead of us, paved by a research agenda that is rich in interesting and challenging problems.
Citations: 5
Efficient Exact Similarity Searches Using Multiple Token Orderings
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.79
Jongik Kim, Hongrae Lee
Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.
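The prefix filtering principle mentioned in the abstract can be illustrated with a small sketch (a minimal Jaccard-similarity version with a single document-frequency ordering, not the paper's multi-ordering algorithm; all names are hypothetical): sort each record's tokens by ascending document frequency, and note that two records can only reach threshold t if they share a token within the first |r| - ceil(t*|r|) + 1 tokens of this ordering.

```python
from collections import Counter
from math import ceil

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_pairs(records, threshold):
    # Global token ordering: rare tokens first (ascending document
    # frequency), so prefixes are as selective as possible.
    df = Counter(tok for rec in records for tok in set(rec))
    order = {tok: (df[tok], tok) for tok in df}

    index = {}        # inverted index over prefix tokens
    candidates = set()
    for rid, rec in enumerate(records):
        toks = sorted(set(rec), key=order.__getitem__)
        # Prefix filtering principle: for Jaccard >= threshold, a match
        # must share a token in the first |r| - ceil(t*|r|) + 1 tokens.
        plen = len(toks) - ceil(threshold * len(toks)) + 1
        for tok in toks[:plen]:
            for other in index.get(tok, []):
                candidates.add((other, rid))
            index.setdefault(tok, []).append(rid)

    # Verification stage: compute the exact similarity on candidates only.
    return [(i, j) for i, j in sorted(candidates)
            if jaccard(records[i], records[j]) >= threshold]
```

Records that share no prefix token are never verified, which is exactly the candidate-size reduction the paper seeks to improve by using several token orderings instead of one.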
{"title":"Efficient Exact Similarity Searches Using Multiple Token Orderings","authors":"Jongik Kim, Hongrae Lee","doi":"10.1109/ICDE.2012.79","DOIUrl":"https://doi.org/10.1109/ICDE.2012.79","url":null,"abstract":"Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124498692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Recomputing Materialized Instances after Changes to Mappings and Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.107
Todd J. Green, Z. Ives
A major challenge faced by today's information systems is that of evolution as data usage evolves or new data resources become available. Modern organizations sometimes exchange data with one another via declarative mappings among their databases, as in data exchange and collaborative data sharing systems. Such mappings are frequently revised and refined as new data becomes available, new cross-reference tables are created, and corrections are made. A fundamental question is how to handle changes to these mapping definitions, when the organizations each materialize the results of applying the mappings to the available data. We consider how to incrementally recompute these database instances in this setting, reusing (if possible) previously computed instances to speed up computation. We develop a principled solution that performs cost-based exploration of recomputation versus reuse, and simultaneously handles updates to source data and mapping definitions through a single, unified mechanism. Our solution also takes advantage of provenance information, when present, to speed up computation even further. We present an implementation that takes advantage of an off-the-shelf DBMS's query processing system, and we show experimentally that our approach provides substantial performance benefits.
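The incremental recomputation the abstract describes builds on the classical delta rule for maintaining a materialized join view: d(R ⋈ S) = (dR ⋈ S) ∪ ((R ∪ dR) ⋈ dS). A toy sketch (sets of tuples joined on their first attribute; this illustrates the delta rule only, not the paper's cost-based or provenance-aware machinery):

```python
def join(R, S):
    # Natural join on the first attribute: (k, x) with (k, y) -> (k, x, y).
    return {(k, x, y) for (k, x) in R for (k2, y) in S if k == k2}

def maintain(R, S, view, dR, dS):
    # Delta rule: only the changed portions are re-joined, so the
    # previously materialized view is reused rather than recomputed.
    view |= join(dR, S)
    view |= join(R | dR, dS)
    R |= dR
    S |= dS
    return R, S, view
```

Applying updates this way touches O(|dR| + |dS|) work per delta instead of re-running the full join, which is the reuse-versus-recomputation trade-off the paper explores with a cost model.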
{"title":"Recomputing Materialized Instances after Changes to Mappings and Data","authors":"Todd J. Green, Z. Ives","doi":"10.1109/ICDE.2012.107","DOIUrl":"https://doi.org/10.1109/ICDE.2012.107","url":null,"abstract":"A major challenge faced by today's information systems is that of evolution as data usage evolves or new data resources become available. Modern organizations sometimes exchange data with one another via declarative mappings among their databases, as in data exchange and collaborative data sharing systems. Such mappings are frequently revised and refined as new data becomes available, new cross-reference tables are created, and corrections are made. A fundamental question is how to handle changes to these mapping definitions, when the organizations each materialize the results of applying the mappings to the available data. We consider how to incrementally recompute these database instances in this setting, reusing (if possible) previously computed instances to speed up computation. We develop a principled solution that performs cost-based exploration of recomputation versus reuse, and simultaneously handles updates to source data and mapping definitions through a single, unified mechanism. Our solution also takes advantage of provenance information, when present, to speed up computation even further. 
We present an implementation that takes advantage of an off-the-shelf DBMS's query processing system, and we show experimentally that our approach provides substantial performance benefits.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128635544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5