
Latest Publications from the 22nd International Conference on Data Engineering (ICDE'06)

Network-Aware Operator Placement for Stream-Processing Systems
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.105
P. Pietzuch, J. Ledlie, Jeffrey Shneidman, M. Roussopoulos, M. Welsh, M. Seltzer
To use their pool of resources efficiently, distributed stream-processing systems push query operators to nodes within the network. Currently, these operators, ranging from simple filters to custom business logic, are placed manually at intermediate nodes along the transmission path to meet application-specific performance goals. Determining placement locations is challenging because network and node conditions change over time and because streams may interact with each other, opening avenues for reuse and repositioning of operators. This paper describes a stream-based overlay network (SBON), a layer between a stream-processing system and the physical network that manages operator placement for stream-processing systems. Our design is based on a cost space, an abstract representation of the network and ongoing streams, which permits decentralized, large-scale multi-query optimization decisions. We present an evaluation of the SBON approach through simulation, experiments on PlanetLab, and an integration with Borealis, an existing stream-processing engine. Our results show that an SBON consistently improves network utilization, provides low stream latency, and enables dynamic optimization at low engineering cost.
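The cost-space idea from the abstract can be illustrated with a minimal sketch (a toy example of ours, not the authors' system): nodes are embedded in a latency cost space, and an operator is placed at the candidate node that minimizes the data-rate-weighted sum of distances to its producers and consumers, which approximates the network usage of the stream.

```python
import math

def dist(a, b):
    """Euclidean distance between two points in the cost space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def place_operator(candidates, endpoints):
    """candidates: {node: coords in the cost space};
    endpoints: [(coords, data_rate), ...] for the operator's producers/consumers."""
    def network_usage(coords):
        # Approximate network usage as rate-weighted latency-space distance.
        return sum(rate * dist(coords, p) for p, rate in endpoints)
    return min(candidates, key=lambda n: network_usage(candidates[n]))

# Toy example: a producer streaming at rate 10 and a consumer receiving at rate 1.
nodes = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (10.0, 0.0)}
best = place_operator(nodes, [((0.0, 0.0), 10.0), ((10.0, 0.0), 1.0)])
print(best)  # "A": the operator moves toward the high-rate input
```

In the paper the placement is decentralized and adaptive; this centralized one-shot minimization only shows why a cost-space embedding makes the optimization simple to state.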
Citations: 474
Mining Actionable Patterns by Role Models
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.96
Ke Wang, Yuelong Jiang, A. Tuzhilin
Data mining promises to discover valid and potentially useful patterns in data. Often, discovered patterns are not useful to the user. "Actionability" addresses this problem in that a pattern is deemed actionable if the user can act upon it in her favor. We introduce the notion of "action" as a domain-independent way to model the domain knowledge. Given a data set about actionable features and a utility measure, a pattern is actionable if it summarizes a population that can be acted upon towards a more promising population observed with a higher utility. We present several pruning strategies that take the actionability requirement into account to reduce the search space, and algorithms for mining all actionable patterns as well as the top-k actionable patterns. We evaluate the usefulness of patterns and the focus of search on a real-world application domain.
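The core notion can be sketched in a few lines (our own illustration; the feature and utility names are hypothetical, and the paper's pruning strategies are omitted): given records with an actionable feature and an observed utility, a switch of the feature's value is "actionable" when the target population carries higher expected utility.

```python
from statistics import mean

def mean_utility(records, feature, value):
    """Mean utility of the sub-population with the given feature value."""
    vals = [r["utility"] for r in records if r[feature] == value]
    return mean(vals) if vals else float("-inf")

def best_action(records, feature, current_value):
    """Most promising switch of an actionable feature, with its utility gain."""
    base = mean_utility(records, feature, current_value)
    others = {r[feature] for r in records} - {current_value}
    target = max(others, key=lambda v: mean_utility(records, feature, v))
    return target, mean_utility(records, feature, target) - base

data = [
    {"plan": "basic", "utility": 10}, {"plan": "basic", "utility": 12},
    {"plan": "premium", "utility": 30}, {"plan": "premium", "utility": 34},
]
print(best_action(data, "plan", "basic"))  # the "premium" population has higher mean utility
```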
Citations: 61
Foundations of Automated Database Tuning
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.72
Surajit Chaudhuri, G. Weikum
Our society is more dependent on information systems than ever before. However, managing the information systems infrastructure in a cost-effective manner is a growing challenge. The total cost of ownership (TCO) of information technology is increasingly dominated by people costs. In fact, mistakes in the operation and administration of information systems are the single largest cause of system outages and unacceptable performance. For information systems to provide value to their customers, we must reduce the complexity associated with their deployment and usage.
Citations: 2
Segmentation of Publication Records of Authors from the Web
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.137
Wei Zhang, Clement T. Yu, N. Smalheiser, Vetle I. Torvik
Publication records are often found in authors’ personal home pages. If such a record is partitioned into a list of semantic fields (authors, title, date, etc.), the unstructured text can be converted into structured data for use in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel "Split and Merge" strategy: a publication record is split into segments; multiple statistical classifiers compute each segment’s likelihood of belonging to different fields; finally, adjacent segments are merged if they belong to the same field. PEPURS introduces punctuation marks and their neighboring texts as a new feature to distinguish the different roles the marks play. PEPURS yields high accuracy scores in experiments.
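The "Split and Merge" pipeline can be sketched as follows (a toy version of ours: the hand-written `label` rules stand in for PEPURS's statistical classifiers, and the split heuristic is illustrative only):

```python
import re

def label(segment):
    """Toy field classifier standing in for PEPURS's statistical classifiers."""
    if re.search(r"\b(19|20)\d{2}\b", segment):
        return "date"
    if re.fullmatch(r"[A-Z]\.\s*[A-Z][a-z]+(,\s*[A-Z]\.\s*[A-Z][a-z]+)*",
                    segment.strip()):
        return "authors"
    return "title"

def segment_record(record):
    # Split: break the record at periods that follow a lowercase letter.
    parts = [p.strip() for p in re.split(r"(?<=[a-z])\.\s+", record) if p.strip()]
    labeled = [(label(p), p) for p in parts]
    # Merge: fuse adjacent segments assigned to the same field.
    merged = []
    for lab, text in labeled:
        if merged and merged[-1][0] == lab:
            merged[-1] = (lab, merged[-1][1] + " " + text)
        else:
            merged.append((lab, text))
    return merged

rec = "W. Zhang, C. Yu. Segmentation of Publication Records. 2006."
print(segment_record(rec))
# [('authors', 'W. Zhang, C. Yu'), ('title', 'Segmentation of Publication Records'), ('date', '2006.')]
```

The merge step is what lets an over-eager split recover: a title broken at an internal period yields two adjacent "title" segments that get fused back together.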
Citations: 4
Every Click You Make, I Will Be Fetching It: Efficient XML Query Processing in RDMS Using GUI-driven Prefetching
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.64
S. Bhowmick, Sandeep Prakash
formulation and efficient processing of the formulated query. However, due to the nature of XML data, formulating an XML query using an XML query language such as XQuery requires considerable effort. A user must be completely familiar with the syntax of the query language, and must be able to express his/her needs accurately in a syntactically correct form. In many real-life applications it is not realistic to assume that users are proficient in expressing such textual queries. Hence, there is a need for user-friendly visual querying schemes to replace the data retrieval aspects of XQuery. In this paper, we address the problem of efficient processing of XQueries in the relational environment, where the queries are formulated using a user-friendly GUI. We take a novel and non-traditional approach to improving query performance by prefetching data during the formulation of a query in a single-user environment. The latency offered by GUI-based query formulation is utilized to prefetch portions of the query results. The basic idea we employ for prefetching is that we prefetch constituent path expressions, store the intermediary results, and reuse them when a connective is added or "Run" is pressed.
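The prefetching idea can be sketched as a cache keyed by path expression (class and method names are ours, not the system's API): each GUI click fetches and caches one path expression in the background, and "Run" combines cached partial results instead of evaluating the whole query from scratch.

```python
class PrefetchingQueryBuilder:
    def __init__(self, evaluate):
        self.evaluate = evaluate   # backend evaluator for a single path expression
        self.cache = {}

    def on_click_add_path(self, path):
        # Prefetch during user "think time" in the GUI.
        if path not in self.cache:
            self.cache[path] = self.evaluate(path)

    def on_run(self, paths):
        # "Run" reuses cached partial results; only cache misses hit the backend.
        return [self.cache[p] if p in self.cache else self.evaluate(p)
                for p in paths]

calls = []
def fake_eval(path):
    calls.append(path)
    return {path: "result"}

qb = PrefetchingQueryBuilder(fake_eval)
qb.on_click_add_path("//book/title")
qb.on_click_add_path("//book/author")
qb.on_run(["//book/title", "//book/author"])
print(len(calls))  # 2: both expressions were evaluated during formulation, none at "Run"
```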
Citations: 8
What’s Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.173
Graham Cormode, S. Muthukrishnan, W. Zhuang
Emerging applications in sensor systems and network-wide IP traffic analysis present many technical challenges. They need distributed monitoring and continuous tracking of events. They have severe resource constraints not only at each site, in terms of per-update processing time and archival space for high-speed streams of observations, but also, crucially, communication constraints for collaborating on the monitoring task. These elements have been addressed in a series of recent works. A fundamental issue that arises is that one cannot make the "uniqueness" assumption on observed events that is present in previous works, since wide-scale monitoring invariably encounters the same events at different points. For example, within the network of an Internet Service Provider, packets of the same flow will be observed in different routers; similarly, the same individual will be observed by multiple mobile sensors when monitoring wild animals. Aggregates of interest in such distributed environments must be resilient to duplicate observations. We study such duplicate-resilient aggregates that measure the extent of the duplication (how many unique observations there are, and how many observations are unique), as well as standard holistic aggregates such as quantiles and heavy hitters over the unique items. We present accuracy-guaranteed, highly communication-efficient algorithms for these aggregates that work within the time and space constraints of high-speed streams. We also present results of a detailed experimental study on both real-life and synthetic data.
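The key property of a duplicate-resilient summary can be shown with a simplification in the Flajolet-Martin spirit (not the paper's exact algorithms): each site keeps a bitmap of lowest-set-bit positions of item hashes, so re-inserting a duplicate changes nothing, and a bitwise OR merges sites.

```python
import hashlib

def _rank(item):
    """Index of the lowest set bit of the item's hash value."""
    h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
    r = 0
    while h & 1 == 0:
        h >>= 1
        r += 1
    return r

class FMSketch:
    """Duplicate-insensitive, mergeable distinct-count summary."""
    def __init__(self):
        self.bitmap = 0

    def add(self, item):
        self.bitmap |= 1 << _rank(item)          # idempotent: duplicates are no-ops

    def merge(self, other):
        out = FMSketch()
        out.bitmap = self.bitmap | other.bitmap  # OR = union of observations
        return out

    def estimate(self):
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return int(2 ** r / 0.77351)             # classic Flajolet-Martin correction

s1, s2 = FMSketch(), FMSketch()
for x in range(100):
    s1.add(x)
for x in range(50, 150):      # items 50..99 duplicate s1's observations
    s2.add(x)
union = s1.merge(s2)          # same summary as if one site had seen all 150 items
```

Because `add` is idempotent and `merge` is a plain OR, a coordinator combining site summaries counts each unique observation once, regardless of how many routers or sensors saw it.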
Citations: 74
Dual Labeling: Answering Graph Reachability Queries in Constant Time
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.53
Haixun Wang, Hao He, Jun Yang, Philip S. Yu, J. Yu
Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability queries. Several reachability labeling methods have been proposed for this purpose. They assign labels to the vertices, such that the reachability between any two vertices may be decided using their labels only. For sparse graphs, 2-hop based reachability labeling schemes answer reachability queries efficiently using relatively small label space. However, the labeling process itself is often too time-consuming to be practical for large graphs. In this paper, we propose a novel labeling scheme for sparse graphs. Our scheme ensures that graph reachability queries can be answered in constant time. Furthermore, for sparse graphs, the complexity of the labeling process is almost linear, which makes our algorithm applicable to massive datasets. Analytical and experimental results show that our approach is much more efficient than state-of-the-art approaches. Furthermore, our labeling method also provides an alternative scheme to trade off query time for label space, which further benefits applications that use tree-like graphs.
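The tree half of such a scheme is easy to sketch (dual labeling additionally handles non-tree edges; this toy version of ours covers only tree reachability): a DFS gives each node an interval covering exactly its subtree, and a reachability query is then a single comparison.

```python
def label_tree(children, root):
    """DFS-number each node; a node's interval covers exactly its subtree."""
    labels = {}
    counter = [0]
    def dfs(u):
        start = counter[0]
        counter[0] += 1
        for c in children.get(u, []):
            dfs(c)
        labels[u] = (start, counter[0])
    dfs(root)
    return labels

def reaches(labels, u, v):
    """u reaches v over tree edges iff v's start falls inside u's interval."""
    su, eu = labels[u]
    sv, _ = labels[v]
    return su <= sv < eu   # one comparison: O(1) per query

tree = {"a": ["b", "c"], "b": ["d"]}
L = label_tree(tree, "a")
print(reaches(L, "a", "d"), reaches(L, "b", "c"))  # True False
```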
Citations: 265
SIPPER: Selecting Informative Peers in Structured P2P Environment for Content-Based Retrieval
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.139
Shuigeng Zhou, Zhengjie Zhang, Weining Qian, Aoying Zhou
In this demonstration, we present a prototype system called SIPPER, short for Selecting Informative Peers in a Structured P2P Environment for Content-Based Retrieval. SIPPER distinguishes itself from existing P2P-IR systems by two features. First, to improve retrieval efficiency, SIPPER employs a novel peer selection method to direct a query to a small fraction of relevant peers in the network for searching globally relevant documents. Second, to reduce the bandwidth cost of metadata publishing, SIPPER uses a new publishing mechanism, the term-node publishing mechanism, which differs from the traditional term-document model [2].
Citations: 11
Efficient Discovery of Emerging Frequent Patterns in Arbitrary Windows on Data Streams
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.57
Xiaoming Jin, Xinqiang Zuo, K. Lam, Jianmin Wang, Jiaguang Sun
This paper proposes an effective data mining technique for finding useful patterns in streaming sequences. At present, typical approaches to this problem search for patterns in a fixed-size window sliding through the stream of data being collected. The practical value of such approaches is limited in that, in typical application scenarios, the patterns are emerging and it is difficult, if not impossible, to determine a priori a suitable window size within which useful patterns may exist. It is therefore desirable to devise techniques that can identify useful patterns with arbitrary window sizes. Attempts at this problem are challenging, however, because they require highly efficient searching in a substantially bigger solution space. This paper presents a new method which includes firstly a pruning strategy to reduce the search space and secondly a mining strategy that adopts a dynamic index structure to allow efficient discovery of emerging patterns in a streaming sequence. Experimental results on real data and synthetic data show that the proposed method outperforms other existing schemes both in computational efficiency and effectiveness in finding useful patterns.
Citations: 2
Automatic Sales Lead Generation from Web Data
Pub Date : 2006-04-03 DOI: 10.1109/ICDE.2006.28
Ganesh Ramakrishnan, Sachindra Joshi, Sumit Negi, R. Krishnapuram, S. Balakrishnan
Speed to market is critical to companies that are driven by sales in a competitive market. The earlier a potential customer can be approached in the decision-making process of a purchase, the higher the chances of converting that prospect into a customer. Traditional methods to identify sales leads, such as company surveys and direct marketing, are manual, expensive, and not scalable. Over the past decade the World Wide Web has grown into an information mesh, with most important facts being reported through Web sites. Several newspapers, press releases, trade journals, business magazines, and other related sources are online. These sources could be used to identify prospective buyers automatically. In this paper, we present a system called ETAP (Electronic Trigger Alert Program) that extracts trigger events from Web data that help in identifying prospective buyers. Trigger events are events of corporate relevance and are indicative of the propensity of companies to purchase new products associated with these events. Examples of trigger events are change in management, revenue growth, and mergers & acquisitions. The unstructured nature of the information makes the extraction of trigger events difficult. We pose trigger event extraction as a classification problem and develop methods for learning trigger event classifiers using existing classification methods. We present methods to automatically generate the training data required to learn the classifiers. We also propose a method of feature abstraction that uses named entity recognition to solve the problem of data sparsity. We score and rank the trigger events extracted by ETAP for easy browsing. Our experiments show the effectiveness of the method and thus establish the feasibility of automatic sales lead generation using Web data.
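The feature-abstraction step the abstract describes can be illustrated with a minimal sketch. This is not the ETAP implementation: the tiny hardcoded lexicon below is a hypothetical stand-in for a real named-entity recognizer. The idea is that replacing entity mentions with their entity types lets sentences about different companies or people map to the same abstract feature pattern, reducing data sparsity for the trigger-event classifier.

```python
# Minimal illustration of feature abstraction via named-entity
# recognition. Entity mentions are replaced by their entity type,
# so surface forms about different companies collapse into one
# abstract pattern. ENTITY_LEXICON is a toy stand-in for real NER.
ENTITY_LEXICON = {
    "ibm": "ORG", "oracle": "ORG",
    "john": "PER", "smith": "PER", "jane": "PER", "doe": "PER",
}

def abstract_features(sentence):
    """Tokenize a sentence and replace recognized entity tokens
    with their entity-type tag; other tokens pass through."""
    tokens = sentence.lower().replace(",", " ").split()
    return [ENTITY_LEXICON.get(t, t) for t in tokens]

s1 = abstract_features("IBM appoints John Smith as CEO")
s2 = abstract_features("Oracle appoints Jane Doe as CEO")
print(s1)        # ['ORG', 'appoints', 'PER', 'PER', 'as', 'ceo']
print(s1 == s2)  # True: both sentences share one abstract pattern
```

Under this abstraction, two management-change reports that would otherwise produce disjoint bag-of-words features now yield identical feature vectors, which is what makes a trigger-event classifier learnable from limited training data.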
Ganesh Ramakrishnan, Sachindra Joshi, Sumit Negi, R. Krishnapuram, S. Balakrishnan, "Automatic Sales Lead Generation from Web Data," 22nd International Conference on Data Engineering (ICDE'06), 2006, p. 101, DOI: 10.1109/ICDE.2006.28.
Citations: 3
Journal: 22nd International Conference on Data Engineering (ICDE'06)