
Proceedings 17th International Conference on Data Engineering: Latest Publications

XML data and object databases: the perfect couple?
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914822
Andreas Renner
XML is increasingly gaining acceptance as a medium for exchanging data between applications. Given its text-based structure, XML can easily be distributed across any type of communication channel, including the Internet. This article provides an overview of an efficient way to store XML data inside an object-oriented database management system (OODBMS). It first discusses the difference between XML data and XML documents, and then introduces an approach to integrate XML data into the Java™ programming language and programming model. This integration is combined with the transparent persistence of Java objects defined by the ODMG.
Citations: 25
B-tree indexes and CPU caches
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914847
G. Graefe, P. Larson
Many existing techniques for exploiting CPU caches in the implementation of B-tree indexes have not been discussed in the literature; this paper surveys most of them. Rather than providing a detailed performance evaluation of one or two of them on some specific contemporary hardware, the purpose is to survey and make widely available this heretofore-folkloric knowledge in order to enable, structure, and hopefully stimulate future research.
Citations: 116
Efficient bulk deletes in relational databases
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914827
Andreas Gärtner, A. Kemper, Donald Kossmann, Bernhard Zeller
Many applications require that large amounts of data be deleted from a database - typically, such bulk deletes are carried out periodically and involve old or out-of-date data. If the data is not partitioned in such a way that bulk deletes can be carried out by simply deleting whole partitions, then most current database products execute such bulk delete operations very poorly. The reason is that every record is deleted from each index individually. This paper proposes and evaluates a new class of techniques to support bulk delete operations more efficiently. These techniques outperform the "record-at-a-time" approach implemented in many database products by about an order of magnitude.
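The contrast the abstract draws can be sketched in a few lines. This is a hypothetical in-memory illustration (dict-based table and secondary indexes, all names invented), not the paper's techniques: the record-at-a-time variant performs one index lookup per deleted record per index, while the bulk variant replaces those lookups with a single scan per index.

```python
# Hypothetical sketch: two deletion strategies over a dict table and
# dict-of-sets secondary indexes. Illustrative only.

def delete_record_at_a_time(table, indexes, doomed):
    """Remove each doomed rid from every index individually."""
    for rid in doomed:
        row = table.pop(rid)
        for col, index in indexes.items():
            index[row[col]].remove(rid)   # one random lookup per record per index
            if not index[row[col]]:
                del index[row[col]]

def delete_bulk(table, indexes, doomed):
    """Drop doomed rids from the table, then rebuild each index in one scan."""
    for rid in doomed:
        table.pop(rid)
    for col, index in indexes.items():
        index.clear()                     # one sequential pass per index
        for rid, row in table.items():
            index.setdefault(row[col], set()).add(rid)
```

Both variants leave the table and indexes in the same state; the bulk variant trades many random index lookups for one sequential rebuild per index, which is the flavor of trade-off bulk-delete techniques exploit.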
Citations: 28
Quality-aware and load sensitive planning of image similarity queries
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914853
Klemens Böhm, M. Mlivoncic, R. Weber
Evaluating similarity queries over image collections effectively and efficiently is an important but difficult issue. In many settings, a system does not deal with individual queries in isolation but rather with a stream of queries. Researchers have proposed a number of query-evaluation alternatives and generalizations, in particular parallel methods over several components, and methods that yield approximate results. Choosing a plan for a given query is subject to more criteria than in conventional settings, notably result quality next to response time and resource consumption. We have designed and implemented a query planner that incorporates these concepts. We describe our space of possible plans and how we search this space. The usefulness of such a planner depends on a number of criteria, e.g., increase of throughput, adaptivity to different workloads, query planning overhead, or the influence of the scoring function in quantitative terms. This article describes the corresponding evaluations and shows that the benefit of our particular approach is significant.
Citations: 13
Cache-on-demand: recycling with certainty
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914878
K. Tan, S. Goh, B. Ooi
Queries posed to a database usually access some common relations, or share some common sub-expressions. We examine the issue of caching using a novel framework, called cache-on-demand (CoD). CoD views intermediate/final answers of existing running queries as virtual caches that an incoming query can exploit. Those caches that are beneficial may then be materialized for the incoming query. Such an approach is essentially nonspeculative: the exact cost of investment and the return on investment are known, and the cache is certain to be reused. We address several issues for CoD to be realized. We also propose two optimizing strategies, Conform-CoD and Scramble-CoD, and evaluate their performance. Our results show that CoD-based schemes can provide substantial performance improvement.
Citations: 22
Approximate nearest neighbor searching in multimedia databases
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914864
H. Ferhatosmanoğlu, E. Tuncel, D. Agrawal, A. E. Abbadi
This paper develops a general framework for approximate nearest-neighbor queries. We categorize the current approaches for nearest-neighbor query processing based on either their ability to reduce the data set that needs to be examined, or their ability to reduce the representation size of each data object. We first propose modifications to well-known techniques to support the progressive processing of approximate nearest-neighbor queries. A user may therefore stop the retrieval process once enough information has been returned. We then develop a new technique based on clustering that merges the benefits of the two general classes of approaches. Our cluster-based approach allows a user to progressively explore the approximate results with increasing accuracy. We propose a new metric for evaluation of approximate nearest-neighbor searching techniques. Using both the proposed and the traditional metrics, we analyze and compare several techniques with a detailed performance evaluation. We demonstrate the feasibility and efficiency of approximate nearest-neighbor searching. We perform experiments on several real data sets and establish the superiority of the proposed cluster-based technique over the existing techniques for approximate nearest-neighbor searching.
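The progressive, cluster-based idea can be caricatured in a few lines. This is a loose sketch under invented assumptions (cluster centers are given; clusters are flat lists), not the paper's technique: the query visits clusters nearest to it first, and visiting more clusters trades time for accuracy, which is what lets a user stop early.

```python
import math

def build_clusters(points, centers):
    """Assign each point to its nearest center (one coarse clustering pass)."""
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[i].append(p)
    return clusters

def approx_nn(query, centers, clusters, clusters_to_visit=1):
    """Visit the clusters nearest the query first; raising clusters_to_visit
    progressively improves the answer at the cost of more distance calls."""
    order = sorted(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    best, best_d = None, float("inf")
    for i in order[:clusters_to_visit]:
        for p in clusters[i]:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

With `clusters_to_visit` equal to the number of clusters this degenerates to an exact scan; the approximation comes entirely from cutting the visit short.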
Citations: 132
The MD-join: an operator for complex OLAP
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914866
Damianos Chatziantoniou, M. Akinde, T. Johnson, Samuel Kim
OLAP queries (i.e. group-by or cube-by queries with aggregation) have proven to be valuable for data analysis and exploration. Many decision support applications need very complex OLAP queries, requiring a fine degree of control over both the group definition and the aggregates that are computed. For example, suppose that the user has access to a data cube whose measure attribute is Sum(Sales). Then the user might wish to compute the sum of sales in New York and the sum of sales in California for those data cube entries in which Sum(Sales)>$1,000,000. This type of complex OLAP query is often difficult to express and difficult to optimize using standard relational operators (including standard aggregation operators). In this paper, we propose the MD-join operator for complex OLAP queries. The MD-join provides a clean separation between group definition and aggregate computation, allowing great flexibility in the expression of OLAP queries. In addition, the MD-join has a simple and easily optimizable implementation, while the equivalent relational algebra expression is often complex and difficult to optimize. We present several algebraic transformations that allow relational algebra queries that include MD-joins to be optimized.
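The separation of group definition from aggregate computation can be illustrated with a small sketch. This is a hypothetical rendering of the operator's spirit (all names invented), not the paper's algebra or implementation: `theta` decides which detail rows belong to a base row's group, while each aggregate carries its own initial value and step function, so per-aggregate conditions like "sales in NY" need no extra grouping machinery.

```python
def md_join(base, detail, theta, aggregates):
    """For every row b of the base (group) table, fold in each detail row r
    with theta(b, r) true. aggregates maps a name to (init, step)."""
    out = []
    for b in base:
        accs = {name: init for name, (init, step) in aggregates.items()}
        for r in detail:
            if theta(b, r):
                for name, (init, step) in aggregates.items():
                    accs[name] = step(accs[name], r)
        out.append({**b, **accs})
    return out
```

The NY/CA example from the abstract then becomes two aggregates over the same group definition, each with its own filtering step function, rather than two separate group-by queries.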
Citations: 73
A cost model and index architecture for the similarity join
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914854
C. Böhm, H. Kriegel
The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. The authors propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort in the experimental evaluation, and a substantial improvement over competitive techniques is shown.
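The primitive the cost model targets can be shown in miniature. This naive two-dimensional sketch (all names invented) is not the paper's index architecture: it buckets one point set into a grid of cell width ε, so each probe point only inspects the 3×3 neighborhood of cells rather than the whole set.

```python
import math
from collections import defaultdict

def similarity_join(A, B, eps):
    """All pairs (a, b) with dist(a, b) <= eps, for 2-D points with
    non-negative coordinates. Grid cells of width eps bound the search."""
    grid = defaultdict(list)
    for b in B:
        grid[(int(b[0] // eps), int(b[1] // eps))].append(b)
    result = []
    for a in A:
        cx, cy = int(a[0] // eps), int(a[1] // eps)
        for dx in (-1, 0, 1):            # any b within eps of a must sit in
            for dy in (-1, 0, 1):        # one of the 9 neighboring cells
                for b in grid[(cx + dx, cy + dy)]:
                    if math.dist(a, b) <= eps:
                        result.append((a, b))
    return result
```

A real similarity-join index must additionally balance node granularity against page size, which is exactly the CPU-versus-I/O conflict the abstract describes.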
Citations: 67
Similarity search without tears: the OMNI-family of all-purpose access methods
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914877
R. S. Filho, A. Traina, C. Traina, C. Faloutsos
Designing a new access method inside a commercial DBMS is cumbersome and expensive. We propose a family of metric access methods that are fast and easy to implement on top of existing access methods, such as sequential scan, R-trees and Slim-trees. The idea is to elect a set of objects as foci, and gauge all other objects by their distances from this set. We show how to define the foci set cardinality, how to choose appropriate foci, and how to perform range and nearest-neighbor queries using them, without false dismissals. The foci increase the pruning of distance calculations during query processing. Furthermore, we index the distances from each object to the foci to reduce even the triangle-inequality comparisons. Experiments on real and synthetic datasets show that our methods match or outperform existing methods. They are up to 10 times faster, and perform up to 10 times fewer distance calculations and disk accesses. In addition, they scale up well, exhibiting sub-linear performance with growing database size.
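The foci idea rests on the triangle inequality: |d(q, f) − d(x, f)| ≤ d(q, x) for any focus f, so if that lower bound already exceeds the query radius for some focus, x cannot qualify. A minimal sketch under invented assumptions (2-D points, Euclidean distance, plain lists instead of an underlying access method):

```python
import math

def build_omni(points, foci):
    """Precompute each object's distance to every focus."""
    return [[math.dist(p, f) for f in foci] for p in points]

def range_query(q, eps, points, foci, omni):
    """Answer a range query without false dismissals: a point is pruned only
    when a focus proves, via the triangle inequality, that it is too far."""
    q_coords = [math.dist(q, f) for f in foci]
    hits = []
    for p, coords in zip(points, omni):
        if any(abs(qc - pc) > eps for qc, pc in zip(q_coords, coords)):
            continue  # lower bound on d(q, p) already exceeds eps: prune
        if math.dist(q, p) <= eps:  # survivors still need the real distance
            hits.append(p)
    return hits
```

Pruned points never trigger a real distance computation, which is where the savings come from when the distance function is expensive.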
Citations: 154
Mining partially periodic event patterns with unknown periods
Pub Date : 2001-04-02 DOI: 10.1109/ICDE.2001.914829
Sheng Ma, J. Hellerstein
Periodic behavior is common in real-world applications. However in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise. Further we develop two algorithms for mining p-patterns based on the order in which the aforementioned sub-tasks are performed: the period-first algorithm and the association-first algorithm. Our results show that the association-first algorithm has a higher tolerance to noise; the period-first algorithm is more computationally efficient and provides flexibility as to the specification of support levels. In addition, we apply the period-first algorithm to mining data collected from two production computer networks, a process that led to several actionable insights.
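The flavor of chi-squared-based period discovery can be sketched loosely. This is not the paper's statistic or algorithm; it is a hypothetical simplification (all names invented) in which inter-arrival counts are tested against a uniform null model, with 3.84 as the familiar chi-squared critical value for one degree of freedom at p = 0.05.

```python
from collections import Counter

def candidate_periods(timestamps, max_period, threshold=3.84):
    """Loose sketch: flag inter-arrival lengths whose observed count is
    significantly above a uniform null. Assumes integer timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    counts = Counter(g for g in gaps if 1 <= g <= max_period)
    n = sum(counts.values())
    if n == 0:
        return []
    expected = n / max_period  # null model: every gap length equally likely
    periods = []
    for period, observed in counts.items():
        chi2 = (observed - expected) ** 2 / expected
        if observed > expected and chi2 > threshold:
            periods.append(period)
    return sorted(periods)
```

Even this toy version shows why no period needs to be known in advance: every inter-arrival length up to `max_period` is a candidate, and the test promotes only the statistically surprising ones.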
Citations: 279
Journal
Proceedings 17th International Conference on Data Engineering