Although wrapper generation work has been reported in the literature, there seem to be no standard ways to evaluate the performance of such systems. We conducted a series of experiments to evaluate the usability, correctness and efficiency of SG-WRAP. The usability tests asked a number of users to generate wrappers with the system. The results indicated that, with only a minimal introduction to the system, to DTD definitions and to the structure of HTML pages, even naive users could quickly generate wrappers without much difficulty. For correctness, we adapted the precision and recall metrics from information retrieval to data extraction. The results show that, with the refining process, the system can generate wrappers with very high accuracy. Finally, the efficiency tests indicated that the wrapper generation process is fast enough even for large Web pages.
{"title":"SG-WRAP: a schema-guided wrapper generator","authors":"Xiaofeng Meng, Hongjun Lu, Haiyan Wang, Mingzhe Gu","doi":"10.1109/ICDE.2002.994743","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994743","url":null,"abstract":"Although wrapper generation work has been reported in the literature, there seem no standard ways to evaluate the performance of such systems. We conducted a series of experiments to evaluate the usability, correctness and efficiency of SG-WRAP. The usability tests selected a number of users to use the system. The results indicated that, with minimal introduction of the system, DTD definition and structure of HTML pages, even naive users could quickly generate wrappers without much difficulty. For correctness, we adapted the precision and recall metrics in information retrieval to data extraction. The results show that, with the refining process, the system can generate wrappers with very high accuracy. Finally, the efficiency tests indicated that the wrapper generation process is fast enough even with large size Web pages.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117342658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Declarative composition and peer-to-peer provisioning of dynamic Web services
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994738
B. Benatallah, Quan Z. Sheng, A. Ngu, M. Dumas
The development of new services through the integration of existing ones has gained considerable momentum as a means to create and streamline business-to-business collaborations. Unfortunately, as Web services are often autonomous and heterogeneous entities, connecting and coordinating them in order to build integrated services is a delicate and time-consuming task. In this paper, we describe the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite services can be executed following a peer-to-peer paradigm, within a dynamic environment. This system provides tools for specifying composite services through statecharts, data conversion rules, and provider-selection policies. These specifications are then translated into XML documents that can be interpreted by peer-to-peer interconnected software components, in order to provision the composite service without requiring a central authority.
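The abstract does not show what the generated XML looks like; the sketch below is a purely hypothetical illustration of serializing a declarative statechart description of a composite service into an XML document. All element and attribute names are invented, not the system's actual format.

```python
# Hypothetical sketch: serialize a composite-service statechart to XML.
# Element and attribute names are invented for illustration only.
import xml.etree.ElementTree as ET

statechart = {
    "name": "TravelBooking",
    "states": [
        {"id": "FlightBooking", "service": "AirlineWS"},
        {"id": "HotelBooking", "service": "HotelWS"},
    ],
    "transitions": [
        {"from": "FlightBooking", "to": "HotelBooking", "condition": "flightConfirmed"},
    ],
}

root = ET.Element("compositeService", name=statechart["name"])
for s in statechart["states"]:
    ET.SubElement(root, "state", id=s["id"], service=s["service"])
for t in statechart["transitions"]:
    ET.SubElement(root, "transition",
                  attrib={"from": t["from"], "to": t["to"], "condition": t["condition"]})

# The resulting document could be shipped to and interpreted by the peers
# hosting the component services.
print(ET.tostring(root, encoding="unicode"))
```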
{"title":"Declarative composition and peer-to-peer provisioning of dynamic Web services","authors":"B. Benatallah, Quan Z. Sheng, A. Ngu, M. Dumas","doi":"10.1109/ICDE.2002.994738","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994738","url":null,"abstract":"The development of new services through the integration of existing ones has gained a considerable momentum as a means to create and streamline business-to-business collaborations. Unfortunately, as Web services are often autonomous and heterogeneous entities, connecting and coordinating them in order to build integrated services is a delicate and time-consuming task. In this paper, we describe the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite services can be executed following a peer-to-peer paradigm, within a dynamic environment. This system provides tools for specifying composite services through. statecharts, data conversion rules, and provider selection, policies. These specifications are then translated into XML documents that can be interpreted by peer-to-peer inter-connected software components, in order to provision the composite service without requiring a central authority.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114260338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query estimation by adaptive sampling
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994781
Yi-Leh Wu, D. Agrawal, A. E. Abbadi
The ability to provide accurate and efficient estimates of user query results is very important for the query optimizer in database systems. In this paper, we show that traditional estimation techniques, which take a data-reduction point of view, do not produce satisfactory estimates when the query patterns change dynamically. We further show that, to reduce query estimation error, it is more effective to capture the user query patterns than to accurately capture the data distribution. We propose query estimation techniques that adapt to user query patterns to give more accurate estimates of the size of selection or range queries over databases.
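A minimal sketch of the general idea of letting query feedback, rather than a static summary of the data, drive selectivity estimates. This is a generic illustration; the bucket layout and update rule are assumptions, not the authors' algorithm.

```python
# Generic sketch: refine range-selectivity estimates from query feedback
# instead of a static data summary. The update rule is an assumption.

class FeedbackEstimator:
    def __init__(self, lo, hi, buckets=10):
        self.edges = [lo + i * (hi - lo) / buckets for i in range(buckets + 1)]
        self.density = [1.0 / buckets] * buckets   # uniform prior

    def estimate(self, a, b):
        """Estimated fraction of tuples with value in [a, b]."""
        total = 0.0
        for i in range(len(self.density)):
            left, right = self.edges[i], self.edges[i + 1]
            overlap = max(0.0, min(b, right) - max(a, left))
            if right > left:
                total += self.density[i] * overlap / (right - left)
        return total

    def feedback(self, a, b, actual_fraction, rate=0.5):
        """After executing the query, nudge the touched buckets toward the truth."""
        est = self.estimate(a, b)
        for i in range(len(self.density)):
            left, right = self.edges[i], self.edges[i + 1]
            if min(b, right) > max(a, left):        # bucket overlaps the query
                self.density[i] += rate * (actual_fraction - est) / len(self.density)

est = FeedbackEstimator(0, 100)
print(est.estimate(0, 50))      # 0.5 under the uniform prior
est.feedback(0, 50, actual_fraction=0.8)
print(est.estimate(0, 50))      # moves toward 0.8 as feedback accumulates
```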
{"title":"Query estimation by adaptive sampling","authors":"Yi-Leh Wu, D. Agrawal, A. E. Abbadi","doi":"10.1109/ICDE.2002.994781","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994781","url":null,"abstract":"The ability to provide accurate and efficient result estimations of user queries is very important for the query optimizer in database systems. In this paper, we show that the traditional estimation techniques with data reduction points of view do not produce satisfiable estimation results if the query patterns are dynamically changing. We further show that to reduce query estimation error, instead of accurately capturing the data distribution, it is more effective to capture the user query patterns. In this paper, we propose query estimation techniques that can adapt to user query patterns for more accurate estimates of the size of selection or range queries over databases.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128229702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From XML schema to relations: a cost-based approach to XML storage
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994698
P. Bohannon, J. Freire, Prasan Roy, Jérôme Siméon
As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. Due to the mismatch between the complexity of XML's tree structure and the simplicity of flat relational tables, there are many ways to store the same document in an RDBMS, and a number of heuristic techniques have been proposed. These techniques typically define fixed mappings and do not take application characteristics into account. However, a fixed mapping is unlikely to work well for all possible applications. In contrast, LegoDB is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies: (1) it models the target application with an XML Schema, XML data statistics, and an XQuery workload; (2) the space of configurations is generated through XML-Schema rewritings; and (3) the best among the derived configurations is selected using cost estimates obtained through a standard relational optimizer. We describe the LegoDB storage engine and provide experimental results that demonstrate the effectiveness of this approach.
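The core loop of a cost-based mapping search can be sketched as follows. Candidate generation and costing are stubbed out with invented placeholders; LegoDB's actual schema rewritings and its use of a relational optimizer are far richer.

```python
# Minimal sketch of cost-based selection among candidate XML-to-relational
# mappings. Candidates and costs below are invented placeholders.

def enumerate_candidates(xml_schema):
    # Placeholder: LegoDB derives candidates by rewriting the XML Schema
    # (e.g. inlining or outlining element definitions).
    return [
        {"name": "fully-inlined", "tables": ["book"]},
        {"name": "outlined-authors", "tables": ["book", "author"]},
    ]

def estimated_cost(candidate, workload, statistics):
    # Placeholder: LegoDB obtains this from a relational optimizer's cost
    # estimate for the translated workload over the candidate schema.
    fake_costs = {"fully-inlined": 120.0, "outlined-authors": 85.0}
    return fake_costs[candidate["name"]]

def pick_best_mapping(xml_schema, workload, statistics):
    candidates = enumerate_candidates(xml_schema)
    return min(candidates, key=lambda c: estimated_cost(c, workload, statistics))

print(pick_best_mapping(xml_schema=None, workload=[], statistics={}))
```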
{"title":"From XML schema to relations: a cost-based approach to XML storage","authors":"P. Bohannon, J. Freire, Prasan Roy, Jérôme Siméon","doi":"10.1109/ICDE.2002.994698","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994698","url":null,"abstract":"As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. Due to the mismatch between the complexity of XML's tree structure and the simplicity of flat relational tables, there are many ways to store the same document in an RDBMS, and a number of heuristic techniques have been proposed. These techniques typically define fixed mappings and do not take application characteristics into account. However, a fixed mapping is unlikely to work well for all possible applications. In contrast, LegoDB is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies: (1) it models the target application with an XML Schema, XML data statistics, and an XQuery workload; (2) the space of configurations is generated through XML-Schema rewritings; and (3) the best among the derived configurations is selected using cost estimates obtained through a standard relational optimizer. We describe the LegoDB storage engine and provide experimental results that demonstrate the effectiveness of this approach.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128466405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost models for overlapping and multi-version B-trees
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994709
Yufei Tao, D. Papadias, Jun Zhang
Overlapping and multi-version techniques are two popular frameworks that transform an ephemeral index into a multiple logical-tree structure in order to support versioning databases. Although both frameworks have produced numerous efficient indexing methods, their performance analysis is rather limited; as a result, there is no clear understanding about the behavior of the alternative structures and the choice of the best one, given the data and query characteristics. Furthermore, query optimization based on these methods is currently impossible. These are serious problems due to the incorporation of overlapping and multi-version techniques in several traditional (e.g. banking) and emerging (e.g. spatio-temporal) applications. In this paper, we propose frameworks for reducing the performance analysis of overlapping and multi-version structures to that of the corresponding ephemeral structures, thus simplifying the problem significantly. The frameworks lead to accurate cost models that predict the sizes of the trees, the node accesses and query selectivity. Although we focus on B-tree-based structures, the proposed models can be employed with a variety of indexes.
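For a flavour of what such a cost model computes, the sketch below gives a textbook-style node-access estimate for a single ephemeral B-tree; the paper's contribution is reducing the overlapping and multi-version cases to estimates of this kind. The formula here is the standard generic one, not the paper's model.

```python
# Generic B-tree cost sketch (not the paper's model): estimate node accesses
# for a range query given entry count, average fanout, and query selectivity.
import math

def btree_cost(num_entries, fanout, selectivity):
    """Rough node-access estimate for a range query on a single B-tree."""
    leaves = math.ceil(num_entries / fanout)
    if leaves <= 1:
        height = 1
    else:
        height = 1 + math.ceil(math.log(leaves, fanout))
    leaf_accesses = max(1, math.ceil(selectivity * leaves))
    # One root-to-leaf descent (height - 1 internal nodes) plus a scan
    # across the qualifying leaves.
    return (height - 1) + leaf_accesses

print(btree_cost(num_entries=1_000_000, fanout=100, selectivity=0.001))
```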
{"title":"Cost models for overlapping and multi-version B-trees","authors":"Yufei Tao, D. Papadias, Jun Zhang","doi":"10.1109/ICDE.2002.994709","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994709","url":null,"abstract":"Overlapping and multi-version techniques are two popular frameworks that transform an ephemeral index into a multiple logical-tree structure in order to support versioning databases. Although both frameworks have produced numerous efficient indexing methods, their performance analysis is rather limited; as a result, there is no clear understanding about the behavior of the alternative structures and the choice of the best one, given the data and query characteristics. Furthermore, query optimization based on these methods is currently impossible. These are serious problems due to the incorporation of overlapping and multi-version techniques in several traditional (e.g. banking) and emerging (e.g. spatio-temporal) applications. In this paper, we propose frameworks for reducing the performance analysis of overlapping and multi-version structures to that of the corresponding ephemeral structures, thus simplifying the problem significantly. The frameworks lead to accurate cost models that predict the sizes of the trees, the node accesses and query selectivity. Although we focus on B-tree-based structures, the proposed models can be employed with a variety of indexes.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128723598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FAST: a new sampling-based algorithm for discovering association rules
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994717
Bin Chen, P. Haas, P. Scheuermann
We present FAST (finding associations from sampled transactions), a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In phase I a large sample is collected to quickly and accurately estimate the support of each item in the database. In phase II, a small final sample is obtained by excluding "outlier" transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of the item in the entire database. We propose two approaches to obtaining the final sample in phase II: trimming and growing. The trimming procedure starts from the large initial sample and removes outlier transactions until a specified stopping criterion is satisfied. In contrast, the growing procedure selects representative transactions from the initial sample and adds them to an initially empty data set.
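A heavily simplified sketch of the two-phase idea, using the trimming variant. The distance measure and the greedy removal rule below are assumptions made for illustration; the paper defines these precisely.

```python
# Simplified sketch of FAST-style two-phase sampling with trimming.
# The distance measure and greedy removal rule are assumptions.
from collections import Counter

def item_supports(transactions):
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def distance(supports_a, supports_b):
    items = set(supports_a) | set(supports_b)
    return sum(abs(supports_a.get(i, 0.0) - supports_b.get(i, 0.0)) for i in items)

def trim(initial_sample, target_size):
    # Phase I: estimate item supports from the large initial sample.
    estimated = item_supports(initial_sample)
    sample = list(initial_sample)
    # Phase II (trimming): greedily drop the transaction whose removal brings
    # the sample's supports closest to the estimated supports. This loop is
    # quadratic and meant only to illustrate the idea.
    while len(sample) > target_size:
        best_idx, best_dist = None, None
        for i in range(len(sample)):
            candidate = sample[:i] + sample[i + 1:]
            d = distance(item_supports(candidate), estimated)
            if best_dist is None or d < best_dist:
                best_idx, best_dist = i, d
        sample.pop(best_idx)
    return sample

initial = [["a", "b"], ["a"], ["a", "b", "c"], ["b"], ["a", "c"], ["z"]]
print(trim(initial, target_size=4))
```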
{"title":"FAST: a new sampling-based algorithm for discovering association rules","authors":"Bin Chen, P. Haas, P. Scheuermann","doi":"10.1109/ICDE.2002.994717","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994717","url":null,"abstract":"We present FAST (finding associations from sampled transactions), a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In phase I a large sample is collected to quickly and accurately estimate the support of each item in the database. In phase II, a small final sample is obtained by excluding \"outlier\" transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of the item in the entire database. We propose two approaches to obtaining the final sample in phase II: trimming and growing. The trimming procedure starts from the large initial sample and removes outlier transactions until a specified stopping criterion is satisfied. In contrast, the growing procedure selects representative transactions from the initial sample and adds them to an initially empty data set.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126670014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multivariate time series prediction via temporal classification
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994722
B. Liu, Jing Liu
In this paper, we study a special form of time-series prediction, viz. the prediction of a dependent variable that takes discrete values. Although in a real application this variable may take numeric values, users are usually interested only in its value ranges, e.g. normal or abnormal, not its actual values. In this work, we extend two traditional classification techniques, namely the naive Bayesian classifier and decision trees, to suit temporal prediction. This results in two new techniques: a temporal naive Bayesian (T-NB) model and a temporal decision tree (T-DT). T-NB and T-DT have been tested on seven real-life data sets from an oil refinery. Experimental results show that they produce very accurate predictions.
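The abstract does not detail how the classifiers are "temporalized". One common and simple reading, used purely as an illustrative assumption here, is to classify the current state from lagged (windowed) values of the input variables with an otherwise ordinary naive Bayes model.

```python
# Illustrative sketch only: naive Bayes over lagged (windowed) readings of the
# input variables. Whether T-NB works this way is an assumption.
from collections import defaultdict
import math

def make_windows(series, labels, lag):
    """Turn a multivariate series into (window, label) training pairs."""
    data = []
    for t in range(lag, len(series)):
        window = tuple(v for row in series[t - lag:t] for v in row)
        data.append((window, labels[t]))
    return data

class NaiveBayes:
    def fit(self, data):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for features, label in data:
            self.class_counts[label] += 1
            for i, v in enumerate(features):
                self.feature_counts[label][(i, v)] += 1
        self.n = len(data)
        return self

    def predict(self, features):
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.n)
            for i, v in enumerate(features):
                # Laplace smoothing over a nominal two-value domain.
                score += math.log((self.feature_counts[label][(i, v)] + 1) / (count + 2))
            if score > best_score:
                best, best_score = label, score
        return best

# Two discretized sensor variables; the label marks abnormal operating states.
series = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 1), (1, 1), (1, 1)]
labels = ["norm", "norm", "norm", "abn", "norm", "norm", "abn"]
model = NaiveBayes().fit(make_windows(series, labels, lag=1))
print(model.predict((1, 1)))   # "abn" on this toy data
```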
{"title":"Multivariate time series prediction via temporal classification","authors":"B. Liu, Jing Liu","doi":"10.1109/ICDE.2002.994722","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994722","url":null,"abstract":"In this paper, we study a special form of time-series prediction, viz. the prediction of a dependent variable taking discrete values. Although in a real application this variable may take numeric values, the users are usually only interested in its value ranges, e.g. normal or abnormal, not its actual values. In this work, we extended two traditional classification techniques, namely the naive Bayesian classifier and decision trees, to suit temporal prediction. This results in two new techniques: a temporal naive Bayesian (T-NB) model and a temporal decision tree (T-DT). T-NB and T-DT have been tested on seven real-life data sets from an oil refinery. Experimental results show that they perform very accurate predictions.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114353069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recovery guarantees for general multi-tier applications
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994773
R. Barga, D. Lomet, G. Weikum
Database recovery does not mask failures to applications and users. Recovery is needed that considers data, messages and application components. Special cases have been studied, but clear principles for recovery guarantees in general multi-tier applications such as Web-based e-services are missing. We develop a framework for recovery guarantees that masks almost all failures. The main concept is an interaction contract between two components, a pledge as to message and state persistence, and contract release. Contracts are composed into system-wide agreements so that a set of components is provably recoverable with exactly-once message delivery and execution, except perhaps for crash-interrupted user input or output. Our implementation techniques reduce the data logging cost, allow effective log truncation, and provide independent recovery for critical server components. Interaction contracts form the basis for our Phoenix/COM project on persistent components. Our framework's utility is demonstrated with a case study of a web-based e-service.
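As a generic illustration of one ingredient behind exactly-once delivery (not the paper's interaction-contract protocol), a sender can log each message under a unique id before sending so it can be re-sent after a crash, while the receiver discards duplicates by id so a resend does not cause re-execution.

```python
# Generic exactly-once-delivery ingredient, for illustration only: logged
# sends plus duplicate suppression by message id. Not the paper's protocol.
import uuid

class Sender:
    def __init__(self):
        self.log = {}                    # persisted before the send in a real system

    def send(self, receiver, payload):
        msg_id = str(uuid.uuid4())
        self.log[msg_id] = payload       # pledge: the message can be re-sent after a crash
        receiver.deliver(msg_id, payload)
        return msg_id

class Receiver:
    def __init__(self):
        self.seen = set()                # persisted as part of the component state

    def deliver(self, msg_id, payload):
        if msg_id in self.seen:
            return                       # duplicate after a resend: ignore
        self.seen.add(msg_id)
        print("processing", payload)

r, s = Receiver(), Sender()
mid = s.send(r, {"op": "debit", "amount": 10})
r.deliver(mid, {"op": "debit", "amount": 10})   # simulated resend is ignored
```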
{"title":"Recovery guarantees for general multi-tier applications","authors":"R. Barga, D. Lomet, G. Weikum","doi":"10.1109/ICDE.2002.994773","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994773","url":null,"abstract":"Database recovery does not mask failures to applications and users. Recovery is needed that considers data, messages and application components. Special cases have been studied, but clear principles for recovery guarantees in general multi-tier applications such as Web-based e-services are missing. We develop a framework for recovery guarantees that masks almost all failures. The main concept is an interaction contract between two components, a pledge as to message and state persistence, and contract release. Contracts are composed into system-wide agreements so that a set of components is provably recoverable with exactly-once message delivery and execution, except perhaps for crash-interrupted user input or output. Our implementation techniques reduce the data logging cost, allow effective log truncation, and provide independent recovery for critical server components. Interaction contracts form the basis for our Phoenix/COM project on persistent components. Our framework's utility is demonstrated with a case study of a web-based e-service.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134490232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attribute classification using feature analysis
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994725
Felix Naumann, C. T. H. Ho, Xuqing Tian, L. Haas, N. Megiddo
The basis of many systems that integrate data from multiple sources is a set of correspondences between source schemata and a target schema. Correspondences express a relationship between sets of source attributes, possibly from multiple sources, and a set of target attributes. Clio is an integration tool that assists users in defining value correspondences between attributes. In real-life scenarios there may be many sources, and the source relations may have many attributes. Users can get lost and may miss or be unable to find some correspondences. Also, in many real-life schemata the attribute names reveal little or nothing about the semantics of the data values; only the data values in the attribute columns convey the semantic meaning of the attribute. Our work relieves users of the problems of too many attributes and meaningless attribute names by automatically suggesting correspondences between source and target attributes. For each attribute, we analyze the data values and derive a set of features.
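A minimal sketch of value-based feature analysis: the concrete features and the nearest-match rule below are invented for illustration; the paper derives a richer feature set and uses it to classify and suggest correspondences.

```python
# Illustration only: derive simple features from an attribute's data values and
# match each source attribute to the most similar target attribute.
import math

def features(values):
    strs = [str(v) for v in values]
    return {
        "avg_len": sum(len(s) for s in strs) / len(strs),
        "frac_numeric": sum(s.replace(".", "", 1).isdigit() for s in strs) / len(strs),
        "frac_with_at": sum("@" in s for s in strs) / len(strs),
    }

def dist(f, g):
    return math.sqrt(sum((f[k] - g[k]) ** 2 for k in f))

def suggest(source_columns, target_columns):
    """Suggest, for each source attribute, the closest target attribute."""
    target_feats = {name: features(vals) for name, vals in target_columns.items()}
    suggestions = {}
    for name, vals in source_columns.items():
        f = features(vals)
        suggestions[name] = min(target_feats, key=lambda t: dist(f, target_feats[t]))
    return suggestions

source = {"col1": ["jo@x.com", "ann@y.org"], "col2": ["23.5", "17.0", "99.9"]}
target = {"email": ["bob@z.net", "eve@w.io"], "price": ["10.0", "250.75"]}
print(suggest(source, target))   # {'col1': 'email', 'col2': 'price'}
```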
{"title":"Attribute classification using feature analysis","authors":"Felix Naumann, C. T. H. Ho, Xuqing Tian, L. Haas, N. Megiddo","doi":"10.1109/ICDE.2002.994725","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994725","url":null,"abstract":"The basis of many systems that integrate data from multiple sources is a set of correspondences between source schemata and a target schema. Correspondences express a relationship between sets of source attributes, possibly from multiple sources, and a set of target attributes. Clio is an integration tool that assists users in defining value correspondences between attributes. In real life scenarios there may be many sources and the source relations may have many attributes. Users can get lost and might miss or be unable to find some correspondences. Also, in many real life schemata the attribute names reveal little or nothing about the semantics of the data values. Only the data values in the attribute columns can convey the semantic meaning of the attribute. Our work relieves users of the problems of too many attributes and meaningless attribute names, by automatically suggesting correspondences between source and target attributes. For each attribute, we analyze the data values and derive a set of features.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"278 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131662327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An intuitive framework for understanding changes in evolving data streams
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994715
C. Aggarwal
Many organizations today store large streams of transactional data in real time. This data can often show important changes in trends over time. In many commercial applications, it may be valuable to provide the user with an understanding of the nature of changes occurring over time in the data stream. In this paper, we discuss the process of analysing the significant changes and trends in data streams in a way which is understandable, intuitive and user-friendly.
{"title":"An intuitive framework for understanding changes in evolving data streams","authors":"C. Aggarwal","doi":"10.1109/ICDE.2002.994715","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994715","url":null,"abstract":"Many organizations today store large streams of transactional data in real time. This data can often show important changes in trends over time. In many commercial applications, it may be valuable to provide the user with an understanding of the nature of changes occuring over time in the data stream. In this paper, we discuss the process of analysing the significant changes and trends in data streams in a way which is understandable, intuitive and user-friendly.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115399766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}