
Latest publications from the 21st International Conference on Data Engineering (ICDE'05)

Bloom filter-based XML packets filtering for millions of path queries
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.26
Xueqing Gong, Ying Yan, Weining Qian, Aoying Zhou
The filtering of XML data is the basis of many complex applications. Many algorithms have been proposed to solve this problem. One important challenge is that the number of path queries is huge, so an efficient data structure for representing path queries is necessary. Another challenge is that these path queries usually vary with time; the maintenance of path queries determines the flexibility and capacity of a filtering system. In this paper, we introduce a novel approximate method for XML data filtering that uses Bloom filters to represent path queries. In this method, millions of path queries can be stored efficiently, and it is easy to handle changes to these path queries. To improve filtering performance, we introduce a new data structure, Prefix Filters, to decrease the number of candidate paths. Experiments show that our Bloom filter-based method takes less time to build the routing table than an automaton-based method, and that it performs well, with an acceptable false-positive rate, when filtering XML packets of relatively small depth against millions of path queries.
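A generic illustration of the core mechanism, not the paper's exact structure (the Prefix Filters are omitted): hash each registered path query into a shared bit array, then probe candidate paths from an incoming packet for membership. All names and parameters below are illustrative.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # May report false positives, but never false negatives.
        return all(self.bits >> p & 1 for p in self._positions(item))

# Register path queries, then test candidate paths extracted from an XML packet.
bf = BloomFilter()
for path in ["/book/title", "/book/author/name", "/catalog/item/price"]:
    bf.add(path)

assert bf.might_contain("/book/title")  # a registered path always matches
```

Because the filter is just a bit array, adding or merging millions of path queries stays cheap, which is the property the abstract leans on.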
Cited by: 62
Personalized queries under a generalized preference model
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.106
G. Koutrika, Y. Ioannidis
Query personalization is the process of dynamically enhancing a query with related user preferences stored in a user profile, with the aim of providing personalized answers. The underlying idea is that different users may find different things relevant to a search due to different preferences. Essential ingredients of query personalization are: (a) a model for representing and storing preferences in user profiles, and (b) algorithms for the generation of personalized answers using stored preferences. Modeling the plethora of preference types is a challenge. In this paper, we present a preference model that combines expressivity and concision. In addition, we provide efficient algorithms for the selection of preferences related to a query, and an algorithm for the progressive generation of personalized results, which are ranked based on user interest. Several classes of ranking functions are provided for this purpose. We present results of experiments, both synthetic and with real users, that (a) demonstrate the efficiency of our algorithms, (b) show the benefits of query personalization, and (c) provide insight into the appropriateness of the proposed ranking functions.
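A toy sketch of ingredient (b), under an assumed profile format (attribute-value preferences with degrees of interest); the paper's actual preference model and ranking functions are richer than this.

```python
# Hypothetical profile: attribute-value preferences with degrees of interest in [0, 1].
profile = {("genre", "comedy"): 0.9, ("year", "1990s"): 0.6, ("genre", "horror"): 0.1}

movies = [
    {"title": "A", "genre": "comedy", "year": "1990s"},
    {"title": "B", "genre": "horror", "year": "1990s"},
    {"title": "C", "genre": "drama",  "year": "2000s"},
]

def interest(movie):
    """Score a result tuple by summing the degrees of interest of the
    profile preferences it satisfies (one possible ranking function)."""
    return sum(d for (attr, val), d in profile.items() if movie.get(attr) == val)

# Personalized answer: the same result set, reordered per this user's profile.
ranked = sorted(movies, key=interest, reverse=True)
assert [m["title"] for m in ranked] == ["A", "B", "C"]
```

A different profile would reorder the same results, which is the point of personalization.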
Cited by: 133
IMAX: incremental maintenance of schema-based XML statistics
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.75
Maya Ramanath, L. Zhang, J. Freire, J. Haritsa
Current approaches for estimating the cardinality of XML queries are applicable to a static scenario wherein the underlying XML data does not change subsequent to the collection of statistics on the repository. However, in practice, many XML-based applications are dynamic and involve frequent updates to the data. In this paper, we investigate efficient strategies for incrementally maintaining statistical summaries as and when updates are applied to the data. Specifically, we propose algorithms that handle both the addition of new documents as well as random insertions in the existing document trees. We also show, through a detailed performance evaluation, that our incremental techniques are significantly faster than the naive recomputation approach; and that estimation accuracy can be maintained even with a fixed memory budget.
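The idea of incremental summary maintenance can be illustrated with a simple path-count summary; the paper's schema-based estimator is more sophisticated, and the structure below is purely illustrative.

```python
from collections import Counter

# Path-count summary: maps each root-to-node path to its occurrence count.
stats = Counter()

def add_element(path):
    """Incrementally maintain the summary when an element is inserted,
    whether from a new document or a random insertion in an existing tree."""
    stats[path] += 1

# Initial document load.
for p in ["/site/person", "/site/person", "/site/item"]:
    add_element(p)
assert stats["/site/person"] == 2

# A later insertion touches only the affected counter -- no full recomputation.
add_element("/site/item")
assert stats["/site/item"] == 2
```

Contrast this with the naive approach, which would rescan the whole repository after every update.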
Cited by: 10
Schema matching using duplicates
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.126
Alexander Bilke, Felix Naumann
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
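A minimal sketch of the two-phase idea, assuming flat records and exact value equality (the paper's duplicate detection is far more robust than this):

```python
def find_duplicate(rows_a, rows_b):
    """Phase 1 (simplified): pick the pair of rows, one per table,
    sharing the most field values -- the most likely duplicate."""
    return max(((a, b) for a in rows_a for b in rows_b),
               key=lambda p: len(set(p[0].values()) & set(p[1].values())))

def match_attributes(a, b):
    """Phase 2: given a duplicate pair, align columns whose values coincide,
    even though the column names themselves are opaque."""
    return {ka: kb for ka, va in a.items() for kb, vb in b.items() if va == vb}

table_a = [{"name": "Alice", "city": "Berlin"}, {"name": "Bob", "city": "Paris"}]
table_b = [{"c1": "Carol", "c2": "Rome"}, {"c1": "Alice", "c2": "Berlin"}]

a, b = find_duplicate(table_a, table_b)
assert match_attributes(a, b) == {"name": "c1", "city": "c2"}
```

The opaque names `c1`/`c2` get mapped purely from the duplicate's data values, which is the paper's central trick.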
Cited by: 241
GPIVOT: efficient incremental maintenance of complex ROLAP views
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.71
Songting Chen, Elke A. Rundensteiner
Data warehousing and on-line analytical processing (OLAP) are essential for decision support applications. Common OLAP operations include, for example, drill-down, roll-up, pivot and unpivot. Typically, such queries are fairly complex and are often executed over huge volumes of data. The solution in practice is to use materialized views to reduce the query cost. Utilizing materialized views that incorporate not just traditional simple SELECT-PROJECT-JOIN operators but also complex OLAP operators such as pivot and unpivot is crucial to improving OLAP query performance, but is an as yet unexplored topic. In this work, we demonstrate that the efficient maintenance of views with pivot and unpivot operators requires the definition of more generalized operators, which we call GPIVOT and GUNPIVOT. We propose rewriting rules, combination rules and propagation rules for such operators. We also design a novel view maintenance framework for applying these rules to obtain an efficient maintenance plan. Our query transformation rules thus serve a dual purpose: view maintenance and query optimization. This paves the way for the inclusion of GPIVOT and GUNPIVOT in any DBMS engine.
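Plain pivot and unpivot, the operators that GPIVOT and GUNPIVOT generalize, can be sketched over dictionaries; this illustrates the operators themselves, not the paper's maintenance rules.

```python
def pivot(rows, key, col, val):
    """Roll (key, col, val) rows into one row per key,
    with one column per distinct col value."""
    out = {}
    for r in rows:
        out.setdefault(r[key], {})[r[col]] = r[val]
    return out

def unpivot(table, key, col, val):
    """Inverse of pivot: flatten the wide table back to (key, col, val) rows."""
    return [{key: k, col: c, val: v}
            for k, cols in table.items() for c, v in cols.items()]

sales = [
    {"store": "S1", "quarter": "Q1", "amount": 10},
    {"store": "S1", "quarter": "Q2", "amount": 20},
    {"store": "S2", "quarter": "Q1", "amount": 5},
]
wide = pivot(sales, "store", "quarter", "amount")
assert wide == {"S1": {"Q1": 10, "Q2": 20}, "S2": {"Q1": 5}}
assert sorted(unpivot(wide, "store", "quarter", "amount"),
              key=lambda r: (r["store"], r["quarter"])) == sales
```

Maintaining a materialized `wide` view incrementally under inserts to `sales` is exactly the problem the paper's propagation rules address.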
Cited by: 28
Privacy and ownership preserving of outsourced medical data
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.111
E. Bertino, B. Ooi, Yanjiang Yang, R. Deng
The demand for the secondary use of medical data is increasing steadily to allow for the provision of better quality health care. Two important issues pertaining to this sharing of data have to be addressed: one is the privacy protection for individuals referred to in the data; the other is copyright protection over the data. In this paper, we present a unified framework that seamlessly combines techniques of binning and digital watermarking to attain the dual goals of privacy and copyright protection. Our binning method is built upon an earlier approach of generalization and suppression by allowing a broader concept of generalization. To ensure data usefulness, we propose constraining binning by usage metrics that define maximal allowable information loss, and the metrics can be enforced off-line. Our watermarking algorithm watermarks the binned data in a hierarchical manner by leveraging the very nature of the data. The method is resilient to the generalization attack that is specific to binned data, as well as to other attacks intended to destroy the inserted mark. We prove that watermarking cannot adversely interfere with binning, and we have implemented the framework. Experiments were conducted, and the results show the robustness of the proposed framework.
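Binning by generalization can be illustrated on a toy record set; the bin width and the ZIP-masking rule below are assumptions for illustration, not the paper's usage metrics.

```python
def generalize_age(age, width=10):
    """Generalize an exact age into a width-year bin, e.g. 34 -> '30-39'.
    Coarser bins lose more information but protect privacy better."""
    lo = age // width * width
    return f"{lo}-{lo + width - 1}"

records = [{"age": 34, "zip": "10011"}, {"age": 38, "zip": "10013"}]

# Suppress the low-order ZIP digits and bin the ages (hypothetical policy).
binned = [{"age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**"}
          for r in records]
assert binned[0] == {"age": "30-39", "zip": "100**"}
```

After binning, the two records are indistinguishable on these attributes; the paper's usage metrics would bound how much such generalization is allowed to cost.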
Cited by: 159
Effective computation of biased quantiles over data streams
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.55
Graham Cormode, Flip Korn, S. Muthukrishnan, D. Srivastava
Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the "high-biased" and the "targeted" quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over highspeed data streams.
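The contrast between the two guarantees can be made concrete. One parameterization consistent with the abstract's example numbers lets the allowed rank error shrink as the rank grows; the formula and eps value below are assumptions chosen to reproduce the quoted 5%, 1% and 0.1% errors, not the paper's exact definition.

```python
def allowed_rank_error(phi, n, eps=0.1):
    """Allowed absolute rank error when reporting the phi-quantile of n items
    under a high-biased guarantee of the form eps * (1 - phi) * n.
    With eps = 0.1 this matches the abstract's 5%, 1% and 0.1% examples
    at the 50th, 90th and 99th percentiles."""
    return eps * (1 - phi) * n

def uniform_rank_error(phi, n, eps=0.1):
    """A uniform guarantee allows the same eps * n error at every quantile."""
    return eps * n

n = 1_000_000
assert abs(allowed_rank_error(0.50, n) - 50_000) < 1e-6   # 5% of n
assert abs(allowed_rank_error(0.90, n) - 10_000) < 1e-6   # 1% of n
assert abs(allowed_rank_error(0.99, n) - 1_000) < 1e-6    # 0.1% of n
```

The point of the paper is that a summary meeting the biased guarantee can be maintained in far less space than running a uniform-error summary at the finest (0.1%) error everywhere.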
Cited by: 50
Efficient processing of skyline queries with partially-ordered domains
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.60
C. Chan, P. Eng, K. Tan
Many decision support applications are characterized by several features: (1) the query is typically based on multiple criteria; (2) there is no single optimal answer (or answer set); (3) because of (2), users typically look for satisfying answers; (4) for the same query, different users, dictated by their personal preferences, may find different answers meeting their needs. As such, it is important for the DBMS to present all interesting answers that may fulfill a user's need. In this article, we focus on the set of interesting answers called the skyline. Given a set of points, the skyline comprises the points that are not dominated by other points. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. We address the novel and important problem of evaluating skyline queries involving partially-ordered attribute domains.
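The dominance test and the skyline itself are easy to state for totally ordered numeric dimensions; the paper's contribution, partially-ordered domains, is the harder case this sketch deliberately avoids.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (here, smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """The skyline: all points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (price, distance) -- minimize both.
hotels = [(50, 8), (60, 5), (70, 4), (80, 6), (55, 8)]
assert skyline(hotels) == [(50, 8), (60, 5), (70, 4)]
```

Each skyline point represents a different trade-off, so users with different preferences can each find a satisfying answer among them.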
Cited by: 30
On the signature trees and balanced signature trees
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.99
Yangjun Chen
Advanced database application areas, such as computer aided design, office automation, digital libraries, data-mining as well as hypertext and multimedia systems need to handle complex data structures with set-valued attributes, which can be represented as bit strings, called signatures. A set of signatures can be stored in a file, called a signature file. In this paper, we propose a new method to organize a signature file into a tree structure, called a signature tree, to speed up the signature file scanning and query evaluation.
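A signature in this sense is a superimposed coding of a set's elements. A sketch of signature construction and the inclusion test follows (the tree organization itself is omitted; the hash choice and signature width are illustrative).

```python
import zlib

def signature(words, m=16):
    """Superimposed coding: OR each element's hash bit into an m-bit signature."""
    sig = 0
    for w in words:
        sig |= 1 << (zlib.crc32(w.encode()) % m)
    return sig

def may_match(query_sig, record_sig):
    """A record can contain the query set only if its signature covers the
    query's bits. False positives (drops) are possible; false negatives are not."""
    return query_sig & record_sig == query_sig

records = [{"cad", "design"}, {"office", "automation"}, {"digital", "library"}]
sigs = [signature(r) for r in records]

q = signature({"office"})
# Every true match survives the filter; non-matches are usually pruned.
assert may_match(q, sigs[1])
```

A signature tree organizes such signatures hierarchically so that whole subtrees can be pruned by one covering test, instead of scanning the signature file sequentially.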
Cited by: 11
Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.146
Yong Shi, A. Zhang
Many data mining algorithms nowadays focus on clustering methods, and many other approaches are designed for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable from each other, especially for data sets with noise. Thus, it is necessary to treat clusters and outliers as concepts of equal importance in data analysis. In this paper, we present a cluster-outlier iterative detection algorithm that detects clusters and outliers from a different perspective for noisy data sets. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The adjustment and modification of the clusters and outliers are performed iteratively until a certain termination condition is reached. This data processing algorithm can be applied in many fields such as pattern recognition, data clustering and signal processing. Experimental results demonstrate the advantages of our approach.
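A minimal sketch of the alternate-and-iterate idea, with a single centroid and a hypothetical distance threshold; the paper's algorithm handles multiple clusters and a richer notion of the cluster-outlier inter-relationship.

```python
def mean(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def iterative_cluster_outlier(points, threshold=4.0, rounds=3):
    """Alternate: fit a centroid on the current inliers, then re-label every
    point far from it as an outlier, and repeat.  The outliers found in one
    round change the cluster fitted in the next, and vice versa."""
    inliers, outliers = list(points), []
    for _ in range(rounds):
        c = mean(inliers)
        inliers = [p for p in points if dist2(p, c) <= threshold]
        outliers = [p for p in points if dist2(p, c) > threshold]
    return inliers, outliers

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
inliers, outliers = iterative_cluster_outlier(pts)
assert outliers == [(10, 10)]
```

Note how the first round's centroid is dragged toward (10, 10); only after that point is labeled an outlier does the cluster settle on the dense group, illustrating why the two detections benefit from being interleaved.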
{"title":"Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis","authors":"Yong Shi, A. Zhang","doi":"10.1109/ICDE.2005.146","DOIUrl":"https://doi.org/10.1109/ICDE.2005.146","url":null,"abstract":"Nowadays many data mining algorithms focus on clustering methods. There are also a lot of approaches designed for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable to each other, especially for those data sets with noise. Thus, it is necessary to treat clusters and outliers as concepts of the same importance in data analysis. In this paper, we present a cluster-outlier iterative detection algorithm, tending to detect the clusters and outliers in another perspective for noisy data sets. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The adjustment and modification of the clusters and outliers are performed iteratively until a certain termination condition is reached. This data processing algorithm can be applied in many fields such as pattern recognition, data clustering and signal processing. Experimental results demonstrate the advantages of our approach.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128443485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
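The abstract leaves the concrete intra- and inter-relationship measures to the paper itself. As a toy illustration of the alternating idea, here is a sketch in which k-means and a minimum-cluster-size rule stand in for the paper's definitions (both stand-ins, plus all names and thresholds, are our own assumptions):

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly take the
    point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain k-means on 2-D tuples; returns the final cluster labels."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

def iterative_cluster_outlier(points, k, min_size=2, rounds=10):
    """Alternate: cluster the current inliers, move members of undersized
    clusters to the outlier set, and re-cluster until nothing changes."""
    outliers = set()
    for _ in range(rounds):
        inliers = [p for p in points if p not in outliers]
        if len(inliers) < k:
            break
        labels = kmeans(inliers, k)
        sizes = [labels.count(j) for j in range(k)]
        new = {p for p, l in zip(inliers, labels) if sizes[l] < min_size}
        if not new:
            break
        outliers |= new
    return outliers

data = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
        (10, 10), (10.1, 10), (10, 10.1), (10.1, 10.1),
        (50, 50)]
print(iterative_cluster_outlier(data, k=2))  # → {(50, 50)}
```

Removing the flagged outliers before re-clustering is what lets the two remaining clusters settle cleanly in the second round, mirroring the mutual adjustment of clusters and outliers the abstract describes.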
Citations: 10
Journal: 21st International Conference on Data Engineering (ICDE'05)