21st International Conference on Data Engineering (ICDE'05): Latest Publications

Online mining of data streams: applications, techniques and progress
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.101
Haixun Wang, J. Pei, Philip S. Yu
In this paper, we focus on the differences between mining static large data sets and mining data streams. Over the years, the database and data mining communities have learned valuable lessons from mining static large data sets, and have developed many useful algorithms and tools for this purpose. The paper aims to provide a shortcut to the current frontier of stream mining research. We emphasize the research problems, the inherent technical challenges and the latest results. In particular, the paper highlights new challenges and potential research interests. The research community has been interested in the integration of data mining tasks with database management systems.
Citations: 10
THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.140
J. Hammer, M. Stonebraker, Oguzhan Topsakal
We introduce our new, publicly available testbed and benchmark called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) for testing and evaluating integration technologies. THALIA provides researchers with a collection of 40 downloadable data sources representing university course catalogs from computer science departments worldwide. In addition, THALIA currently provides a set of twelve challenge queries as well as a scoring function for ranking the performance of an integration system. A second contribution is a systematic classification of the types of syntactic and semantic heterogeneities, which directly leads to the twelve challenge queries. We have chosen course information as our domain of discourse because it is well known and easy to understand. Furthermore, there is an abundance of publicly available data sources, which allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities that we have identified.
Citations: 58
A multiresolution symbolic representation of time series
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.10
V. Megalooikonomou, Qiang Wang, Guo Li, C. Faloutsos
Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the multiresolution vector quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic, employing key subsequences, and potentially allows the application of text-based retrieval techniques to the similarity analysis of time series. The proposed method is fast and scales linearly with the size of the database and the dimensionality. In contrast to the vast majority of the literature, which uses the Euclidean distance, MVQ uses a multi-resolution/hierarchical distance function. We performed experiments with real and synthetic data. The proposed distance function consistently outperforms all the major competitors (Euclidean, dynamic time warping, piecewise aggregate approximation), achieving up to 20% better precision/recall and clustering accuracy on the tested datasets.
Citations: 109
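As a rough illustration of what "processing a time series at multiple resolutions" can look like, the sketch below computes a piecewise aggregate approximation (PAA), one of the baselines named in the abstract above, at several segment counts. This is not the MVQ representation itself, which relies on vector quantization over key subsequences; the function names `paa` and `multiresolution_paa` and the sample series are assumptions made up for this example.

```python
def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def multiresolution_paa(series, resolutions):
    """Summarize the same series at several resolutions (hypothetical helper)."""
    return {r: paa(series, r) for r in resolutions}


# Made-up series: a low plateau followed by a high plateau.
series = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 11.0, 13.0]
levels = multiresolution_paa(series, [1, 2, 4])

print(levels[1])  # coarsest view: the global mean
print(levels[2])  # two halves
print(levels[4])  # four frames, finer local detail
```

Coarse levels capture global shape and fine levels capture local detail, which is the intuition behind keeping both in a hierarchy.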
Configurable security protocols for multi-party data analysis with malicious participants
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.37
B. Malin, E. Airoldi, Samuel Edoho-Eket, Yiheng Li
Standard multi-party computation models assume semi-honest behavior, where the majority of participants implement protocols according to specification, an assumption not always plausible. In this paper we introduce a multi-party protocol for collaborative data analysis when participants are malicious and fail to follow specification. The protocol incorporates a semi-trusted third party, which analyzes encrypted data and provides honest responses that only intended recipients can successfully decrypt. The protocol incorporates data confidentiality by enabling participants to receive encrypted responses tailored to their own encrypted data submissions without revealing plaintext to other participants, including the third party. As opposed to previous models, trust need only be placed on a single participant with no data at stake. Additionally, the proposed protocol is configurable in a way that security features are controlled by independent subprotocols. Various combinations of subprotocols allow for a flexible security system, appropriate for a number of distributed data applications, such as secure list comparison.
Citations: 18
Adaptive caching for continuous queries
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.15
S. Babu, Kamesh Munagala, J. Widom, R. Motwani
We address the problem of executing continuous multiway join queries in unpredictable and volatile environments. Our query class captures windowed join queries in data stream systems as well as conventional maintenance of materialized join views. Our adaptive approach handles streams of updates whose rates and data characteristics may change over time, as well as changes in system conditions such as memory availability. In this paper we focus specifically on the problem of adaptive placement and removal of caches to optimize join performance. Our approach automatically considers conventional tree-shaped join plans with materialized subresults at every intermediate node, subresult-free MJoins, and the entire spectrum between them. We provide algorithms for selecting caches, monitoring their costs and benefits in current conditions, allocating memory to caches, and adapting as conditions change. All of our algorithms are implemented in the STREAM prototype data stream management system and a thorough experimental evaluation is included.
Citations: 104
On discovery of extremely low-dimensional clusters using semi-supervised projected clustering
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.96
Kevin Y. Yip, D. Cheung, M. Ng
Recent studies suggest that projected clusters with extremely low dimensionality exist in many real datasets. A number of projected clustering algorithms have been proposed in the past several years, but few can identify clusters with dimensionality lower than 10% of the total number of dimensions, which are commonly found in some real datasets such as gene expression profiles. In this paper we propose a new algorithm that can accurately identify projected clusters with relevant dimensions as few as 5% of the total number of dimensions. It makes use of a robust objective function that combines object clustering and dimension selection into a single optimization problem. The algorithm can also utilize domain knowledge in the form of labeled objects and labeled dimensions to improve its clustering accuracy. We believe this is the first semi-supervised projected clustering algorithm. Both theoretical analysis and experimental results show that by using a small amount of input knowledge, possibly covering only a portion of the underlying classes, the new algorithm can be further improved to accurately detect clusters with only 1% of the dimensions being relevant. The algorithm is also useful in getting a target set of clusters when there are multiple possible groupings of the objects.
Citations: 79
A relationally complete visual query language for heterogeneous data sources and pervasive querying
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.12
S. Polyviou, G. Samaras, P. Evripidou
In this paper we introduce and formally define Query by Browsing (QBB), a scalable, relationally complete visual query language based on the desktop user interface paradigm and tuple relational calculus that allows the formulation of complex queries over relational, entity-relationship, object-oriented and XML data sources on a variety of handheld and desktop platforms. It is to our knowledge the first visual query language to combine the important characteristics of usability, scalability, expressive power and flexibility. We support these claims by demonstrating the similarity of the QBB paradigm to the popular desktop user interface paradigm, by relating it to relational calculus and relational algebra and by describing Chiromancer II, a Web-based implementation of the QBB paradigm for handheld devices. We also discuss ways in which non-relational sources can be represented and queried and compare QBB to related work in the area of visual query languages for a variety of data models. We finally offer conclusions and thoughts for future work.
Citations: 23
Finding (recently) frequent items in distributed data streams
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.68
A. Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston
We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naive methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naive approaches while providing the same error guarantees on answers.
Citations: 216
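To make the idea of combining approximate frequency counts from several nodes concrete, here is a minimal sketch that uses Misra-Gries summaries, a standard technique swapped in for illustration and not the algorithm from the paper above. It uses a single fixed budget `k` at every node and does not implement the precision gradient. The function names and the two sample streams are assumptions for this example.

```python
from collections import Counter


def misra_gries(stream, k):
    """Keep at most k-1 counters; each count undercounts by at most len(stream)/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def merge(a, b, k):
    """Add two summaries, then prune back to at most k-1 counters."""
    merged = Counter(a) + Counter(b)
    if len(merged) >= k:
        # Subtract the k-th largest count and keep only positive remainders.
        kth = sorted(merged.values(), reverse=True)[k - 1]
        merged = Counter({key: c - kth for key, c in merged.items() if c > kth})
    return dict(merged)


# Two made-up local streams observed at two leaf nodes.
s1 = misra_gries(["a"] * 5 + ["b"] * 2 + ["c"], k=3)
s2 = misra_gries(["a"] * 3 + ["d"] * 4, k=3)
merged = merge(s1, s2, k=3)
print(s1, s2, merged)
```

Pruning after each merge keeps the summary sent to a parent node small, at the cost of additional undercounting. How much error to tolerate at each level of a hierarchy is the kind of trade-off the abstract describes.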
Towards building a MetaQuerier: extracting and matching Web query interfaces
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.145
Bin He, Zhen Zhang, K. Chang
We witness the rapid growth and thus the prevalence of databases on the Web. Our recent study in April 2004 estimated 450,000 online databases. On this deep Web, myriad databases provide dynamic query-based data access through their query interfaces, instead of static URL links. It is thus essential to integrate these query interfaces in order to integrate the deep Web. The overall goal of the MetaQuerier project is to open up the deep Web to users by building a system that helps users explore and integrate deep Web sources. In particular, to start with, we focus on the integration of deep Web sources in the same domain, which is itself an important integration task. To automate this integration scenario, we need to solve two critical problems: extracting query interfaces and matching query interfaces. To solve the interface extraction problem, we introduce a parsing paradigm by hypothesizing the existence of a hidden syntax that describes the layout and semantics of Web interfaces. Also, unlike traditional pairwise schema matching, we propose a holistic matching approach, which matches all schemas at the same time under the hypothesis of a hidden schema model. Therefore, our techniques explore, in essence, "data mining for information integration." That is, we mine the observable information to discover the underlying semantics.
Citations: 16
AutoLag: automatic discovery of lag correlations in stream data
Pub Date : 2005-04-05 DOI: 10.1109/ICDE.2005.24
Yasushi Sakurai, S. Papadimitriou, C. Faloutsos
We have introduced the problem of automatic lag correlation detection on streaming data and proposed AutoLag to address this problem by using careful approximations and smoothing. Our experiments on real and realistic data show that AutoLag works as expected, estimating the unknown lags with excellent accuracy and significant speed-up. In our experiments on real and realistic data, AutoLag was up to about 42,000 times faster than the naive implementation, with at most 1% relative error.
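For orientation, the lag correlation problem named in the abstract can be illustrated with a brute-force search. The sketch below is not AutoLag and is not taken from the paper: it is one plausible reading of a naive baseline, which recomputes the Pearson correlation for every candidate lag. The function names `pearson` and `naive_best_lag`, and the assumption that both sequences have equal length, are illustrative choices only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def naive_best_lag(x, y, max_lag):
    """Try every lag in 0..max_lag and return (lag, correlation) with the highest correlation.

    A lag of k means y trails x by k steps, so x[t] is compared with y[t + k].
    Each candidate lag costs O(n), so the whole search costs O(n * max_lag).
    """
    best_lag, best_corr = 0, pearson(x, y)
    for lag in range(1, max_lag + 1):
        corr = pearson(x[: len(x) - lag], y[lag:])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr


# Synthetic example: y is a copy of x delayed by 7 steps.
base = [math.sin(0.3 * t) + 0.5 * math.sin(0.07 * t) for t in range(207)]
x = base[7:]     # x[t] = base[t + 7]
y = base[:200]   # y[t] = base[t], hence y[t + 7] = x[t]
lag, corr = naive_best_lag(x, y, max_lag=20)
```

On a stream, this search would have to be repeated over the full history at every time tick, which is the kind of cost the abstract's approximations and smoothing are meant to avoid.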
Citations: 8