Proceedings 17th International Conference on Data Engineering: Latest Articles


Differential logging: a commutative and associative logging scheme for highly parallel main memory database

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914826

Juchang Lee, Kihong Kim, S. Cha

With a GByte of memory priced at less than $2000, main-memory DBMSs (MMDBMSs) are emerging as an economically viable alternative to disk-resident DBMSs (DRDBMSs) in many problem domains. The MMDBMS can show significantly higher performance than the DRDBMS by reducing disk accesses to sequential log writing and occasional checkpointing. Upon a system crash, the recovery process begins by accessing the disk-resident log and checkpoint data to restore a consistent state. With increasing CPU speed, however, such disk access remains the dominant bottleneck in MMDBMSs. To overcome this bottleneck, this paper explores alternatives for parallel logging and recovery. The major contribution of this paper is the so-called differential logging scheme, which permits unrestricted parallelism in logging and recovery. The scheme uses the bit-wise XOR operation both to compute the differential log between the before and after images and to recover the consistent database state, which leaves room for significant performance improvement in the MMDBMS. First, because only the difference is logged, the log volume is reduced to almost half that of conventional physical logging. Second, the commutativity and associativity of XOR enable log records to be processed in an arbitrary order. This means that log records can be freely distributed to multiple disks to improve logging performance. During recovery, a parallel restart can be done independently for each log disk. This paper shows the superior performance of differential logging compared to physical logging in a shared-memory multiprocessor environment.
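The order-independence claimed in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 4-byte slot, and the two-"disk" split are all hypothetical, and it assumes recovery starts from an image that reflects none of the logged updates.

```python
import random

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equally sized byte strings."""
    if len(a) != len(b):
        raise ValueError("images must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

def differential_log(before: bytes, after: bytes) -> bytes:
    """One log record: the XOR of the before and after images."""
    return xor_bytes(before, after)

def redo(image: bytes, log_records) -> bytes:
    """Apply log records by XOR; the order of the records does not matter."""
    for record in log_records:
        image = xor_bytes(image, record)
    return image

# Three successive updates to one 4-byte slot (hypothetical data).
versions = [
    b"\x00\x00\x00\x00",
    b"\x12\x34\x00\x00",
    b"\x12\x34\x56\x78",
    b"\xff\x00\x56\x78",
]
logs = [differential_log(versions[i], versions[i + 1]) for i in range(3)]

# Replay in the original order and in a shuffled order.
shuffled = logs[:]
random.Random(0).shuffle(shuffled)
recovered_in_order = redo(versions[0], logs)
recovered_shuffled = redo(versions[0], shuffled)

# Replay per "log disk": fold each disk's records separately, then combine.
disk_a, disk_b = logs[0::2], logs[1::2]
partial_a = redo(bytes(4), disk_a)
partial_b = redo(bytes(4), disk_b)
recovered_parallel = xor_bytes(xor_bytes(versions[0], partial_a), partial_b)
```

All three recovered images equal the last version, and XOR-ing the same records into the last version yields the first one again, so in this sketch the same record serves for both redo and undo.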


Citations: 47

Prefetching based on the type-level access pattern in object-relational DBMSs

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914880

Wook-Shin Han, Yang-Sae Moon, K. Whang, I. Song

Prefetching is an effective method for minimizing the number of round trips between the client and the server in database management systems. We propose new notions of type-level access locality and the type-level access pattern. We also formally define the notions of capturing and prefetching to help understand the underlying mechanisms. We then develop an efficient prefetching policy based on these notions and the framework. Type-level access locality is the phenomenon that repetitive patterns exist in the attributes referenced. The type-level access pattern is a pattern of attributes that are referenced in accessing the objects. Existing prefetching methods are based on object-level or page-level access patterns, which consist of the object-ids or page-ids of the objects accessed. The drawback of these methods is that they work only when exactly the same objects or pages are accessed repeatedly. In contrast, even when the same objects are not accessed repeatedly, our technique effectively prefetches objects if the same attributes are referenced repeatedly, i.e., if there is type-level access locality. Many navigational applications in object-relational database management systems (ORDBMSs) have type-level access locality. Therefore, our technique can be employed in ORDBMSs to effectively reduce the number of round trips, thereby significantly enhancing performance.


Citations: 9

Cache-aware query routing in a cluster of databases

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914879

Uwe Röhm, Klemens Böhm, H. Schek

We investigate query routing techniques in a cluster of databases for a query-dominant environment. The objective is to decrease query response time. Each component of the cluster runs an off-the-shelf DBMS and holds a copy of the whole database. The cluster has a coordinator that routes each query to an appropriate component. Considering queries of realistic complexity, e.g., TPC-R, this article addresses the following questions: Can routing benefit from caching effects due to previous queries? Since our components are black-boxes, how can we approximate their cache content? How to route a query, given such cache approximations? To answer these questions, we have developed a cache-aware query router that is based on signature approximations of queries. We report on experimental evaluations with the TPC-R benchmark using our PowerDB database cluster prototype. Our main result is that our approach of cache approximation routing is better than state-of-the-art strategies by a factor of two with regard to mean response time.
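The routing idea in this abstract can be sketched as follows. This is an illustration under strong assumptions, not the PowerDB router: a query "signature" is simplified here to the set of table names the query touches, each node's cache content is approximated by the union of the signatures of its most recent queries, and the `Router` class, its `history` parameter, and the tie-break by load are all hypothetical.

```python
from collections import deque

class Router:
    """Route each query to the node whose approximated cache overlaps it most."""

    def __init__(self, n_nodes: int, history: int = 4):
        # Per node: signatures of the last `history` queries routed to it.
        self.recent = [deque(maxlen=history) for _ in range(n_nodes)]
        self.load = [0] * n_nodes

    def cache_signature(self, node: int) -> set:
        """Approximate a black-box node's cache by its recent query signatures."""
        approx = set()
        for sig in self.recent[node]:
            approx |= sig
        return approx

    def route(self, query_signature: set) -> int:
        def score(node: int):
            overlap = len(query_signature & self.cache_signature(node))
            # Prefer larger overlap; on ties prefer the less loaded node.
            return (overlap, -self.load[node])

        best = max(range(len(self.recent)), key=score)
        self.recent[best].append(frozenset(query_signature))
        self.load[best] += 1
        return best

router = Router(n_nodes=2)
decisions = [
    router.route({"lineitem", "orders"}),
    router.route({"customer", "nation"}),
    router.route({"lineitem", "part"}),
    router.route({"nation"}),
]
```

With two nodes, the third and fourth queries go back to the nodes that previously served overlapping tables, which is the caching effect the router tries to exploit.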


Citations: 37

Querying XML documents made easy: nearest concept queries

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914844

A. Schmidt, M. Kersten, Menzo Windhouwer

Due to the ubiquity and popularity of XML, users often are in the following situation: they want to query XML documents which contain potentially interesting information, but they are unaware of the mark-up structure that is used. For example, it is easy to guess the contents of an XML bibliography file, whereas the mark-up depends on the methodological, cultural and personal background of the author(s). Nonetheless, it is this hierarchical structure that forms the basis of XML query languages. We exploit the tree structure of XML documents to equip users with a powerful tool, the meet operator, which lets them query databases with whose content they are familiar, without requiring knowledge of tags and hierarchies. Our approach is based on computing the lowest common ancestor of nodes in the XML syntax tree: e.g., given two strings, we are looking for nodes whose offspring contain these two strings. The novelty of this approach is that the result type is unknown at query formulation time and depends on the database instance. If the two strings are an author's name and a year, mainly publications of the author in this year are returned. If the two strings are numbers, the result mostly consists of publications that have the numbers as year or page numbers. Because the result type of a query is not specified by the user, we refer to the lowest common ancestor as the nearest concept. We also present a running example taken from the bibliography domain, and demonstrate that the operator can be implemented efficiently.
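The lowest-common-ancestor idea can be sketched with Python's standard `xml.etree.ElementTree`. This is a simplified variant, not the paper's meet operator: it returns the lowest elements whose subtree text contains both search strings, and the function name `nearest_concepts` and the sample bibliography are hypothetical.

```python
import xml.etree.ElementTree as ET

def nearest_concepts(root, term_a: str, term_b: str):
    """Return the lowest elements whose subtree text contains both terms."""
    results = []

    def visit(elem):
        # Returns (has_a, has_b) for the subtree rooted at elem.
        text = elem.text or ""
        has_a = term_a in text
        has_b = term_b in text
        child_has_both = False
        for child in elem:
            child_a, child_b = visit(child)
            if child_a and child_b:
                child_has_both = True
            tail = child.tail or ""  # text after a child belongs to elem
            has_a = has_a or child_a or term_a in tail
            has_b = has_b or child_b or term_b in tail
        # Report elem only if no descendant already covers both terms.
        if has_a and has_b and not child_has_both:
            results.append(elem)
        return has_a, has_b

    visit(root)
    return results

BIB = """
<bib>
  <article><author>Ben Bit</author><title>How to Hack</title><year>1999</year></article>
  <book><editor>Ben Bit</editor><title>Hacking Basics</title><year>2000</year></book>
  <article><author>Bob Byte</author><title>Bytes</title><year>1999</year></article>
</bib>
"""
root = ET.fromstring(BIB)
hits_1999 = nearest_concepts(root, "Bit", "1999")
hits_2000 = nearest_concepts(root, "Bit", "2000")
```

The two queries have the same shape, yet the first returns an `article` element and the second a `book` element, which illustrates why the result type depends on the database instance rather than on the query.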


Citations: 126

Infrastructure for Web-based application integration

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914860

D. Gawlick

Over the last couple of years, application integration has taken a central position in the business world. Application integration deals with integrating computing environments within and between companies, and depends on connectivity provided by the intranet and the Internet, respectively. Application integration is typically referred to as EAI (e-business application integration). The article first sketches the evolution of business computing and EAI. The discussion then focuses on the major elements of a modern EAI technology, with special attention to Web-based application integration. Finally, the article points to some interesting research topics.


Citations: 8

PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914830

J. Pei, Jiawei Han, B. Mortazavi-Asl, Helen Pinto, Qiming Chen, U. Dayal, M. Hsu

Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori, which may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when sequential patterns to be mined are numerous and/or long. In this paper, we propose a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatly reduces the effort of candidate subsequence generation. Moreover, prefix-projection substantially reduces the size of projected databases and leads to efficient processing. Our performance study shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
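Prefix projection can be sketched in a few lines for the simplified case where every sequence element is a single item rather than an itemset. This is an illustrative reduction, not the paper's full algorithm; the function name `prefixspan`, the pointer-based projection, and the toy database are assumptions made for the example.

```python
def prefixspan(sequences, min_support):
    """Mine frequent sequential patterns by growing prefixes.

    Each projected database is a list of (sequence index, start position)
    pointers, so suffixes are never copied.
    """
    results = []

    def mine(prefix, projected):
        # For each item, record where its first occurrence leaves each suffix.
        occurrences = {}
        for seq_id, start in projected:
            seen = set()
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    seen.add(item)
                    occurrences.setdefault(item, []).append((seq_id, pos + 1))
        for item in sorted(occurrences):
            pointers = occurrences[item]
            # One pointer per supporting sequence, so the count is the support.
            if len(pointers) >= min_support:
                pattern = prefix + [item]
                results.append((pattern, len(pointers)))
                mine(pattern, pointers)  # recurse on the projected database

    mine([], [(i, 0) for i in range(len(sequences))])
    return results

# Toy sequence database (hypothetical data).
db = [list("abcb"), list("acb"), list("ab"), list("bc")]
patterns = prefixspan(db, min_support=3)
# patterns == [(['a'], 3), (['a', 'b'], 3), (['b'], 4), (['c'], 3)]
```

No candidate sequences are generated and tested against the whole database; each recursive call only scans the suffixes that follow the current prefix, which is the source of the savings the abstract describes.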


Citations: 2158

A graph-based approach for extracting terminological properties of elements of XML documents

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914845

L. Palopoli, G. Terracina, D. Ursino

XML is rapidly becoming a standard for information exchange over the Web. Web providers and applications that use XML for representing and exchanging their data make their information available in such a way that interoperability can be easily reached. However, in order to guarantee both the exchange of XML documents and the interoperability between information providers, it is often necessary to single out semantic similarity properties relating concepts of different XML documents. This paper contributes to this framework by proposing a technique for extracting synonymies and homonymies. The derivation technique is based on a rich conceptual model (called SDR-Network), which is used to represent concepts expressed in XML documents as well as the semantic relationships holding among them.


Citations: 37

Block oriented processing of relational database operations in modern computer architectures

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914871

S. Padmanabhan, Timothy Malkemus, R. Agarwal, A. Jhingran

Database systems are not well-tuned to take advantage of modern superscalar processor architectures. In particular, the clocks per instruction (CPI) for rather simple database queries are quite poor compared to scientific kernels or SPEC benchmarks. The lack of performance of database systems has been attributed to poor utilization of caches and processor function units as well as higher branching penalties. In this paper, we argue that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance. We have implemented the block-oriented processing technique for aggregation, expression evaluation, and sorting operations as a feature in the DB2 Universal Database (UDB) system. We present results from representative queries on a 30-GB TPC-H (Transaction Processing Performance Council Benchmark H) database to show the value of this technique.


Citations: 101

An automated change-detection algorithm for HTML documents based on semantic hierarchies

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914842

S. Lim, Yiu-Kai Ng

The data at many Web sites is changing rapidly, and a significant amount of this data is presented in HTML documents that consist of markups and data contents. Although XML is becoming more popular for data exchange, the presentation of data contained in XML documents is given, by and large, in the HTML format using XSL(T). Since HTML was designed to "display" data from the human perspective, it is not trivial for a machine to detect (hierarchical) changes of data in an HTML document. In this paper, we propose a heuristic algorithm, called SCD (Semantic Change Detection), to detect semantic changes to the hierarchical data contents in any two HTML documents automatically. Semantic changes differ from syntactic changes since the latter refer to changes of data contents with respect to markup structures according to the HTML grammar. SCD does not require pre-processing, nor any knowledge of the internal structure of the source documents beforehand. The time complexity of SCD is $O((|X| \times |Y|) \log(|X| \times |Y|))$, where $|X|$ and $|Y|$ are the number of unique branches in the syntactic hierarchies of any two given HTML documents, respectively.


Citations: 51

Selectivity estimation for spatial joins

Proceedings 17th International Conference on Data Engineering

Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914849

N. An, Zhen-Yu Yang, A. Sivasubramaniam

Spatial joins are important and time-consuming operations in spatial database management systems. It is crucial to be able to accurately estimate the performance of these operations so that one can derive efficient query execution plans, and even develop/refine data structures to improve their performance. While estimation techniques for analyzing the performance of other operations on spatial data, such as range queries, have come under scrutiny, the problem of estimating selectivity for spatial joins has been little explored. The limited forays into this area have used parametric techniques, which restrict the datasets they can be used for, since they tend to make simplifying assumptions about the nature of the datasets to be joined. Sampling and histogram-based techniques, on the other hand, are much less restrictive. However, there has been no prior attempt at understanding the accuracy of sampling techniques, or at developing histogram-based techniques to estimate the selectivity of spatial joins. Apart from extensively evaluating the accuracy of sampling techniques for the very first time, this paper presents two novel histogram-based solutions for spatial join estimation. Using a wide spectrum of both real and synthetic datasets, it is shown that one of our proposed schemes, called Geometric Histograms (GH), can accurately quantify the selectivity of spatial joins.
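As a rough illustration of the sampling approach mentioned in this abstract, the sketch below estimates the selectivity of an intersection join between two sets of axis-aligned rectangles. It is a plain uniform-sampling baseline under assumed definitions (selectivity as the fraction of intersecting pairs; rectangles as `(xmin, ymin, xmax, ymax)` tuples); it is not the Geometric Histograms scheme, and all names and data are hypothetical.

```python
import random

def intersects(r, s) -> bool:
    """True if two rectangles (xmin, ymin, xmax, ymax) overlap or touch."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def exact_selectivity(rs, ss) -> float:
    """Fraction of (r, s) pairs that intersect, by exhaustive comparison."""
    hits = sum(1 for r in rs for s in ss if intersects(r, s))
    return hits / (len(rs) * len(ss))

def sampled_selectivity(rs, ss, k: int, seed: int = 0) -> float:
    """Estimate the selectivity from uniform samples of at most k rectangles each."""
    rng = random.Random(seed)
    sample_r = rng.sample(rs, min(k, len(rs)))
    sample_s = rng.sample(ss, min(k, len(ss)))
    return exact_selectivity(sample_r, sample_s)

# Tiny hand-made datasets (hypothetical): only the first pair intersects.
R_DEMO = [(0, 0, 2, 2), (5, 5, 6, 6)]
S_DEMO = [(1, 1, 3, 3), (10, 10, 11, 11)]
exact = exact_selectivity(R_DEMO, S_DEMO)          # 1 of 4 pairs -> 0.25
full_sample = sampled_selectivity(R_DEMO, S_DEMO, k=10)
```

When `k` covers both datasets the estimate equals the exact value; for smaller samples the estimate varies with the seed, and that variance is the kind of accuracy question the abstract says the paper evaluates.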


Citations: 52
