Proceedings 18th International Conference on Data Engineering: latest papers


Geometric-similarity retrieval in large image bases

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994757

I. Fudos, Leonidas Palios, E. Pitoura

We propose a novel approach to shape-based image retrieval that builds upon a similarity criterion based on the average point set distance. Compared to traditional techniques, such as dimensionality reduction, our method exhibits better behavior in that it maintains the average topology of shapes independently of the number of points used to represent them and is more resilient to noise. An efficient algorithm is presented based on an incremental "fattening" of the query shape until the best match is discovered. The algorithm uses simplex range search techniques and fractional cascading to provide an average polylogarithmic time complexity in the total number of shape vertices. The algorithm is extended to perform additional fast approximate matching when there is no image sufficiently similar to the query image. We present techniques for the efficient external storage of the shape base and of the auxiliary geometric data structures used by the algorithm. Finally, we show how our approach can be used for processing queries containing pairwise relations of object boundaries such as contain, tangent, and overlap. Such queries are either extracted from a user-drafted sketch or defined explicitly by the user. Alternative methods are presented for forming query execution plans.

Citations: 5
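
The similarity criterion named in the abstract above, an average point set distance, can be illustrated with a small sketch. The definition used here (the mean, over query points, of the distance to the nearest point of the candidate shape) is an assumption made for illustration, and the paper's exact definition may differ. The brute-force search below is also only a stand-in: it does not implement the "fattening" algorithm, simplex range search, or fractional cascading. The names `avg_point_set_distance` and `best_match` are hypothetical.

```python
import math


def avg_point_set_distance(query, shape):
    """Mean, over query points, of the distance to the nearest shape point.

    Assumed definition for illustration only; shapes are lists of (x, y).
    """
    if not query or not shape:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for qx, qy in query:
        total += min(math.hypot(qx - sx, qy - sy) for sx, sy in shape)
    return total / len(query)


def best_match(query, shape_base):
    """Brute-force scan: name of the stored shape closest to the query."""
    return min(
        shape_base,
        key=lambda name: avg_point_set_distance(query, shape_base[name]),
    )


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shape_base = {
    "shifted_square": [(0.1, 0), (1.1, 0), (1.1, 1), (0.1, 1)],
    "far_triangle": [(5, 5), (6, 5), (5.5, 6)],
}
print(best_match(square, shape_base))
```

Because the measure averages over points instead of taking a maximum, a single noisy vertex shifts the score only slightly, which is consistent with the noise resilience the abstract claims.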

Approximating a data stream for querying and estimation: algorithms and performance evaluation

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994775

S. Guha, Nick Koudas

Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by relaxing the requirement of a finite data set and/or a known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provably good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.

Citations: 109
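
To make the fixed-window setting in the abstract above concrete, here is a minimal sketch of a histogram maintained incrementally over the most recent values of a stream. It uses plain equi-width buckets over a known value range, which is an assumption chosen for simplicity: it carries none of the quality guarantees the paper claims for its own algorithms. The class name `WindowHistogram` and its parameters are hypothetical.

```python
from collections import deque


class WindowHistogram:
    """Equi-width histogram over the last `window` stream values (toy sketch)."""

    def __init__(self, window, lo, hi, buckets):
        self.window = window
        self.lo, self.hi = lo, hi
        self.counts = [0] * buckets
        self.items = deque()  # values currently inside the window

    def _bucket(self, x):
        # Clamp to the assumed value range, then map to a bucket index.
        x = min(max(x, self.lo), self.hi)
        idx = int((x - self.lo) / (self.hi - self.lo) * len(self.counts))
        return min(idx, len(self.counts) - 1)

    def add(self, x):
        """Consume one stream value; evict the oldest if the window is full."""
        self.items.append(x)
        self.counts[self._bucket(x)] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()
            self.counts[self._bucket(old)] -= 1

    def estimate_count(self, a, b):
        """Estimate how many windowed values lie in [a, b).

        Assumes values are spread uniformly inside each bucket.
        """
        width = (self.hi - self.lo) / len(self.counts)
        total = 0.0
        for i, count in enumerate(self.counts):
            b_lo = self.lo + i * width
            b_hi = b_lo + width
            overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
            total += count * overlap / width
        return total


hist = WindowHistogram(window=4, lo=0.0, hi=10.0, buckets=5)
for value in [1, 3, 5, 7, 9, 9]:
    hist.add(value)
print(hist.counts, hist.estimate_count(8, 10))
```

Each arrival costs constant time because only two bucket counters change, which is the kind of incremental maintenance the abstract refers to. Choosing bucket boundaries adaptively, as a quality-guaranteed method would, is the part this sketch leaves out.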

YFilter: efficient and scalable filtering of XML documents

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994748

Y. Diao, Peter M. Fischer, M. Franklin, Raymond To

Much of the data exchanged over the Internet will soon be encoded in XML, allowing for sophisticated filtering and content-based routing. We have built a filtering engine called YFilter, which filters streaming XML documents according to XQuery or XPath queries that involve both path expressions and predicates. Unlike previous work, YFilter uses a novel NFA-based execution model. We present the structures and algorithms underlying YFilter, and show its efficiency and scalability under various workloads.

Citations: 293
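
The abstract above mentions an NFA-based execution model without giving details. The following toy sketch shows one way a nondeterministic automaton can match many simple path expressions against a stream of element events in a single pass. Everything here is an assumption made for illustration: the class `PathNFA`, the supported syntax (`/`, `//`, element names, and `*`), the sharing of common prefixes between queries, and the stack of active state sets. Predicates are omitted, and this should not be read as YFilter's actual data structures.

```python
import re


class PathNFA:
    """One automaton for many path queries such as /a/b, /a/*/c, //c."""

    def __init__(self):
        self.trans = [{}]      # trans[state][label] -> next state
        self.eps = [None]      # eps[state] -> self-looping "//" state, if any
        self.loops = [False]   # loops[state]: stays active on any element
        self.accept = [set()]  # accept[state]: ids of queries matched here

    def _new_state(self, loops=False):
        self.trans.append({})
        self.eps.append(None)
        self.loops.append(loops)
        self.accept.append(set())
        return len(self.trans) - 1

    def add_query(self, qid, path):
        state = 0
        # Split "/a//b/*" into (axis, label) steps.
        for axis, label in re.findall(r"(//|/)([^/]+)", path):
            if axis == "//":
                if self.eps[state] is None:
                    self.eps[state] = self._new_state(loops=True)
                state = self.eps[state]
            if label not in self.trans[state]:
                self.trans[state][label] = self._new_state()
            state = self.trans[state][label]
        self.accept[state].add(qid)

    def _closure(self, states):
        # Follow the (single-level) epsilon edges introduced by "//".
        out = set(states)
        for s in states:
            if self.eps[s] is not None:
                out.add(self.eps[s])
        return out

    def match(self, events):
        """events: ("start", name) / ("end", name) pairs in document order."""
        matched = set()
        stack = [self._closure({0})]  # one active-state set per open element
        for kind, name in events:
            if kind == "start":
                nxt = set()
                for s in stack[-1]:
                    if self.loops[s]:
                        nxt.add(s)
                    for label in (name, "*"):
                        t = self.trans[s].get(label)
                        if t is not None:
                            nxt.add(t)
                nxt = self._closure(nxt)
                for s in nxt:
                    matched |= self.accept[s]
                stack.append(nxt)
            else:
                stack.pop()
        return matched


nfa = PathNFA()
for qid, path in [("q1", "/a/b"), ("q2", "/a/*/c"), ("q3", "//c"), ("q4", "/a/d")]:
    nfa.add_query(qid, path)

# Events for the document <a><b><c/></b></a>
doc = [("start", "a"), ("start", "b"), ("start", "c"),
       ("end", "c"), ("end", "b"), ("end", "a")]
print(sorted(nfa.match(doc)))
```

In this sketch the queries `/a/b`, `/a/*/c`, and `/a/d` share the state reached on `a`, so adding more queries with common prefixes grows the automaton slowly. Popping the stack on end events restores the active states of the parent element without any backtracking.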

Fjording the stream: an architecture for queries over streaming sensor data

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994774

S. Madden, M. Franklin

If industry visionaries are correct, our lives will soon be full of sensors, connected together in loose conglomerations via wireless networks, each monitoring and collecting data about the environment at large. These sensors behave very differently from traditional database sources: they have intermittent connectivity, are limited by severe power constraints, and typically sample periodically and push immediately, keeping no record of historical information. These limitations make traditional database systems inappropriate for queries over sensors. We present the Fjords architecture for managing multiple queries over many sensors, and show how it can be used to limit sensor resource demands while maintaining high query throughput. We evaluate our architecture using traces from a network of traffic sensors deployed on Interstate 80 near Berkeley and present performance results that show how query throughput, communication costs and power consumption are necessarily coupled in sensor environments.

Citations: 602

Techniques for storing XML

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994740

M. Fernández, S. Amer-Yahia

XML is the de facto standard for data exchange between applications on the Web. Applications, such as electronic markets, will produce and consume large volumes of data and therefore will require efficient and reliable storage and retrieval of XML data. Many techniques for XML storage have been proposed, including flat files, relational database management systems, object-oriented database systems, LDAP directories, and native XML database systems. To better understand the requirements of XML storage systems, we first review various classes of XML documents including highly structured data as stored in relational databases, "mixed" content from document-processing applications, and "streams-oriented" data from e-commerce and transactional applications. We also consider the types of queries typically applied to these classes of documents. In the second part, we present features of the XQuery and XPath data model that must be supported by an XML storage system and then we describe in detail a variety of storage alternatives from industry and research. We focus on techniques that use relational storage. Typically, these techniques produce a logical relational schema for the XML data and treat the storage system as a "black box". In the last part of the tutorial, we consider new techniques that open the storage system's "black box" so that we can take advantage of physical-layout features.

Citations: 10

Efficient temporal join processing using indices

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994701

Donghui Zhang, V. Tsotras, B. Seeger

We examine the problem of processing temporal joins in the presence of indexing schemes. Previous work on temporal joins has concentrated on non-indexed relations which were fully scanned. Given the large data volumes created by the ever-increasing time dimension, sequential scanning is prohibitive. This is especially true when the temporal join involves only parts of the joining relations (e.g., a given time interval instead of the whole timeline). Utilizing an index then becomes beneficial as it directs the join to the data of interest. We consider temporal join algorithms for three representative indexing schemes, namely a B+-tree, an R*-tree and a temporal index, the Multiversion B+-tree (MVBT). Both the B+-tree and R*-tree result in simple but not efficient join algorithms because neither index achieves good temporal data clustering. Better clustering is maintained by the MVBT through record copying. Nevertheless, copies can greatly affect the correctness and effectiveness of the join algorithms. We identify these problems and propose efficient solutions and optimizations. An extensive comparison of all index-based temporal joins, using a variety of datasets and query characteristics, shows that the MVBT-based join algorithms are consistently faster. In particular, the link-based algorithm has the most robust behavior. In our experiments it showed a tenfold improvement over the R*-tree joins while it was between six and thirty times faster than the B+-tree joins.

Citations: 70
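
The abstract above does not spell out the join semantics, so the sketch below assumes a common reading: two tuples join when their keys are equal and their validity intervals overlap, and the join is restricted to a query time window. Intervals are taken to be half-open `[start, end)`. A hash table on the key stands in for an index; none of the B+-tree, R*-tree, or MVBT algorithms from the paper is implemented. The names `overlap` and `temporal_join` are hypothetical.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Intersection of two half-open intervals [start, end), or None."""
    lo, hi = max(a_start, b_start), min(a_end, b_end)
    return (lo, hi) if lo < hi else None


def temporal_join(r, s, window):
    """Join tuples (key, start, end, payload) on key and interval overlap.

    Only the part of each interval inside `window` is considered.
    """
    w_start, w_end = window

    # Build side: keep only s-tuples that are alive during the window.
    by_key = defaultdict(list)
    for key, start, end, payload in s:
        if overlap(start, end, w_start, w_end) is not None:
            by_key[key].append((start, end, payload))

    # Probe side: clip each r-tuple to the window, then intersect.
    out = []
    for key, start, end, payload in r:
        clipped = overlap(start, end, w_start, w_end)
        if clipped is None:
            continue
        for s_start, s_end, s_payload in by_key.get(key, []):
            common = overlap(clipped[0], clipped[1], s_start, s_end)
            if common is not None:
                out.append((key, common, payload, s_payload))
    return out


r = [("k1", 0, 10, "r1"), ("k2", 5, 8, "r2")]
s = [("k1", 8, 20, "s1"), ("k1", 30, 40, "s2"), ("k2", 0, 5, "s3")]
print(temporal_join(r, s, window=(0, 15)))
```

The window filter is where an index pays off: tuples such as `s2`, which lie entirely outside the queried interval, are skipped here by a scan, whereas a temporal index would avoid reading them at all.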

DBXplorer: a system for keyword-based search over relational databases

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994693

S. Agrawal, S. Chaudhuri, Gautam Das

Internet search engines have popularized the keyword-based search paradigm. While traditional database management systems offer powerful query languages, they do not allow keyword-based search. In this paper, we discuss DBXplorer, a system that enables keyword-based searches in relational databases. DBXplorer has been implemented using a commercial relational database and Web server and allows users to interact via a browser front-end. We outline the challenges and discuss the implementation of our system, including results of extensive experimental evaluation.

Citations: 879

Mixing querying and navigation in MIX

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994714

Pratik Mukhopadhyay, Y. Papakonstantinou

Web-based information systems give their users the ability to interleave querying and browsing during their information discovery efforts. The MIX system provides an API called QDOM (Querible Document Object Model) that supports the interleaved querying and browsing of virtual XML views, specified in an XQuery-like language. QDOM is based on the DOM standard. It allows client applications to navigate into the view using standard DOM navigation commands. The application can then use any visited node as the root for a query that creates a new view. The query/navigation processing algorithms of MIX perform decontextualization, i.e., they translate a query that has been issued from within the context of other queries and navigations into efficient queries that are understood by the source outside of the context of previous operations. In addition, MIX provides a navigation-driven query evaluation model, where source data are retrieved only as needed by the subsequent navigations. This paper presents how MIX supports QDOM on views of relational databases.

Citations: 20
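
The navigation-driven evaluation described in the abstract above, where source data are retrieved only when a navigation needs them, can be sketched with a lazily expanded tree. This is an illustration under stated assumptions only: `LazyNode`, `fake_source`, and the table names are hypothetical, the "source" is an in-memory dictionary, and the sketch does not cover QDOM's actual API or the decontextualization step.

```python
class LazyNode:
    """Tree node whose children are fetched on first navigation only."""

    def __init__(self, label, fetch_children):
        self.label = label
        self._fetch = fetch_children  # callable -> list of (label, fetch)
        self._children = None         # not fetched yet

    def children(self):
        if self._children is None:
            self._children = [LazyNode(lbl, f) for lbl, f in self._fetch()]
        return self._children


fetch_log = []  # records which source "queries" were actually issued


def fake_source(table):
    """Hypothetical stand-in for a query sent to a relational source."""
    def fetch():
        fetch_log.append(table)
        rows = {"customers": ["alice", "bob"], "orders": ["o1"]}[table]
        return [(row, lambda: []) for row in rows]
    return fetch


root = LazyNode("view", lambda: [
    ("customers", fake_source("customers")),
    ("orders", fake_source("orders")),
])

# Navigate into the first child only; "orders" is never requested.
customers = root.children()[0]
names = [child.label for child in customers.children()]
print(names, fetch_log)
```

Caching the fetched children in `_children` means that revisiting a node does not reissue the source query, so the cost of browsing is tied to the parts of the view the client actually visits.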

Streaming-data algorithms for high-quality clustering

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994785

Liadan O'Callaghan, A. Meyerson, R. Motwani, Nina Mishra, S. Guha

Streaming data analysis has recently attracted attention in numerous applications including telephone records, Web documents and click streams. For such analysis, single-pass algorithms that consume a small amount of memory are critical. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.

Citations: 681

Design and implementation of a high-performance distributed Web crawler

Proceedings 18th International Conference on Data Engineering

Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994750

Vladislav Shkapenyuk, Torsten Suel

Broad Web search engines as well as many more specialized search tools rely on Web crawlers to acquire large collections of pages for indexing and analysis. Such a Web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed Web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.

Citations: 410
