
Latest publications from the 2012 IEEE 28th International Conference on Data Engineering

DESKS: Direction-Aware Spatial Keyword Search
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.93
Guoliang Li, Jianhua Feng, Jing Xu
Location-based services (LBS) have been widely adopted by mobile users. Many LBS users have a direction-aware search requirement: answers must lie in the search direction. However, to the best of our knowledge, no existing research investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning the candidates that violate the direction constraint. However, this method is rather expensive, as it involves much useless computation on many unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method that inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to answer a query incrementally. Experimental results on real datasets show that our method achieves high performance and significantly outperforms existing methods.
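To make the direction constraint concrete, here is a minimal, hypothetical sketch of the naive baseline the abstract criticizes: scan all objects, keep those whose bearing from the query point falls inside the query sector and that contain the keyword, and return the k nearest. The function names and the sector representation are illustrative assumptions, not the paper's indexes or algorithms.

```python
import math

def bearing(q, p):
    # angle of point p as seen from query location q, normalized to [0, 2*pi)
    return math.atan2(p[1] - q[1], p[0] - q[0]) % (2 * math.pi)

def direction_aware_search(q, theta, width, keyword, k, objects):
    # Naive baseline: scan all objects, keep those whose bearing from q
    # falls inside the sector [theta - width/2, theta + width/2] and whose
    # keyword set contains the query keyword; return the k nearest.
    # (The paper's direction-aware indexes exist to avoid this full scan.)
    hits = []
    for loc, kws in objects:
        # signed angular difference in (-pi, pi]
        diff = (bearing(q, loc) - theta + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) <= width / 2 and keyword in kws:
            hits.append((math.dist(q, loc), loc))
    return [loc for _, loc in sorted(hits)[:k]]
```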
Citations: 109
Micro-Specialization in DBMSes
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.110
Rui Zhang, R. Snodgrass, S. Debray
Relational database management systems are general in the sense that they can handle arbitrary schemas, queries, and modifications; this generality is implemented using runtime metadata lookups and tests that ensure that control is channelled to the appropriate code in all cases. Unfortunately, these lookups and tests are carried out even when information is available that renders some of these operations superfluous, leading to unnecessary runtime overheads. This paper introduces micro-specialization, an approach that uses relation- and query-specific information to specialize the DBMS code at runtime and thereby eliminate some of these overheads. We develop a taxonomy of approaches and specialization times and propose a general architecture that isolates most of the creation and execution of the specialized code sequences in a separate DBMS-independent module. Through three illustrative types of micro-specialization applied to PostgreSQL, we show that this approach requires minimal changes to a DBMS and can simultaneously improve performance across a wide range of queries, modifications, and bulk loading, in terms of storage, CPU usage, and I/O time on the TPC-H and TPC-C benchmarks.
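The kind of overhead the abstract describes can be illustrated with a toy sketch (names and schema invented for illustration; this is not the paper's code): a generic row decoder re-checks field types from metadata on every row, while a "specialized" decoder resolves those metadata branches once, at specialization time.

```python
# Hypothetical schema metadata: field name and declared type.
SCHEMA = [("id", "int"), ("name", "str"), ("balance", "float")]

def decode_generic(row):
    # Generic path: a metadata lookup and type dispatch on every field
    # of every row -- the runtime overhead micro-specialization targets.
    out = {}
    for (name, typ), raw in zip(SCHEMA, row):
        if typ == "int":
            out[name] = int(raw)
        elif typ == "float":
            out[name] = float(raw)
        else:
            out[name] = raw
    return out

def specialize(schema):
    # "Micro-specialize": resolve the metadata branches once and return
    # a decoder with the dispatch already baked in.
    casts = {"int": int, "float": float, "str": str}
    fns = [(name, casts[typ]) for name, typ in schema]
    def decode_specialized(row):
        return {name: fn(raw) for (name, fn), raw in zip(fns, row)}
    return decode_specialized
```

The real system specializes compiled DBMS code paths, not Python closures, but the principle, trading a per-row test for a one-time specialization step, is the same.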
Citations: 16
Processing and Notifying Range Top-k Subscriptions
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.67
Albert Yu, P. Agarwal, Jun Yang
We consider how to support a large number of users over a wide-area network whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify users whose top-k results are affected. Simple solutions include using a content-driven network to notify all users whose interest ranges contain the update (ignoring top-k), or using a server to compute only the affected queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. We present a geometric framework for the problem that allows us to describe the set of affected queries succinctly with messages that can be efficiently disseminated using content-driven networks. We give fast algorithms to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all user interests. We also present extensions to our solution, including an approximate algorithm that trades off between the cost of server-side reformulation and that of user-side post-processing, as well as efficient techniques for batch updates.
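As a point of reference for the "server computes only the affected queries" baseline mentioned above, here is a minimal, hypothetical sketch: apply the update and recompute every subscription's top-k to see whose result changed. All data shapes are invented for illustration; the paper's geometric framework avoids exactly this per-subscription recomputation.

```python
def topk(objects, lo, hi, k):
    # Top-k (score, id) pairs among objects whose position lies in [lo, hi].
    inside = [(o["score"], o["id"]) for o in objects.values() if lo <= o["x"] <= hi]
    return sorted(inside, reverse=True)[:k]

def affected_subscriptions(objects, subs, update):
    # Naive server-side baseline: snapshot every subscription's top-k,
    # apply the object update, recompute, and report the subscriptions
    # whose top-k result changed (these are the users to notify).
    before = {s_id: topk(objects, *s) for s_id, s in subs.items()}
    objects[update["id"]] = update
    return [s_id for s_id, s in subs.items() if topk(objects, *s) != before[s_id]]
```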
Citations: 2
Learning Stochastic Models of Information Flow
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.103
Luke Dickens, Ian Molloy, Jorge Lobo, P. Cheng, A. Russo
An understanding of information flow has many applications, including maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods both for learning models of information flow in networks from data, based on the Independent Cascade Model, and for predicting probabilities of unseen flow from these models. Our approach is based on a principled probabilistic construction, and results compare favourably with existing methods in terms of prediction accuracy and scalable evaluation; in addition, we are able to evaluate a broader range of queries than previously shown, including the probability of joint and/or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges, and naive sampling can also be expensive, so we propose sampling in an efficient Markov-chain Monte Carlo fashion using the Metropolis-Hastings algorithm, as detailed in the paper. We identify two types of data: those where the paths of past flows are known (attributed data), and those where only the endpoints are known (unattributed data). Both data types are addressed in this paper, including training methods, example real-world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter microblogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hash tags (metadata tags) and URLs for the unattributed case.
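For readers unfamiliar with the Independent Cascade Model, here is a minimal sketch of one cascade draw and the naive Monte Carlo flow-probability estimate that the abstract calls expensive (the paper replaces it with Metropolis-Hastings sampling). The graph encoding and function names are illustrative assumptions.

```python
import random

def simulate_cascade(edges, source, rng):
    # One draw from the Independent Cascade Model: each edge (u, v) with
    # probability p independently forwards the item from an activated u to v.
    active, frontier = {source}, [source]
    while frontier:
        u = frontier.pop()
        for v, p in edges.get(u, []):
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active

def flow_probability(edges, source, target, trials=20000, seed=0):
    # Naive Monte Carlo estimate of P(target is reached from source).
    rng = random.Random(seed)
    hits = sum(target in simulate_cascade(edges, source, rng) for _ in range(trials))
    return hits / trials
```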
Citations: 26
Efficient Support of XQuery Update Facility in XML Enabled RDBMS
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.17
Z. Liu, Hui J. Chang, Balasubramanyam Sthanikam
The XQuery Update Facility (XQUF), which provides a declarative way of updating XML, has become a W3C Recommendation. The SQL/XML standard, on the other hand, defines XMLType as a column data type in the RDBMS environment and defines standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying XMLType columns in an RDBMS. Based on this SQL/XML standard, XML-enabled RDBMSes have become industrial-strength platforms that host XML applications in a standards-compliant way by providing XML storage and query capability. However, XML update support remained proprietary in RDBMSes until XQUF became the recommendation. XQUF is agnostic of how XML is stored, so propagation of actual updates to any persistent XML store is beyond the scope of XQUF. In this paper, we show how XQUF can be incorporated into XMLQuery() to effectively update XML stored in XMLType columns in an XML-enabled RDBMS, such as Oracle XMLDB. We present various compile-time and run-time optimisation techniques to show how XQUF can be efficiently implemented to declaratively update XML stored in an RDBMS. We present our approaches to optimising XQUF for two common physical XML storage models: native binary XML storage and relational decomposition of XML. Although our study is done using Oracle XMLDB, all of the presented optimisation techniques are generic to XML stores that need to support updates of a persistent XML store, and are not specific to the Oracle XMLDB implementation.
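As a rough illustration of the kind of in-place update XQUF standardizes (an expression such as `insert node <phone>555-0100</phone> into /person`), here is the equivalent effect emulated with Python's `xml.etree` on a materialized document. This is explicitly not the paper's mechanism: an XML-enabled RDBMS aims to avoid this full parse/mutate/re-serialize round trip and update the stored form directly.

```python
import xml.etree.ElementTree as ET

def apply_insert(doc_xml, parent_path, new_xml):
    # Emulate the effect of an XQUF "insert node ... into <parent>" update
    # by materializing the whole document, mutating it, and re-serializing.
    root = ET.fromstring(doc_xml)
    parent = root if parent_path in (".", "/") else root.find(parent_path)
    parent.append(ET.fromstring(new_xml))
    return ET.tostring(root, encoding="unicode")
```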
Citations: 8
GeoFeed: A Location Aware News Feed System
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.97
Jie Bao, M. Mokbel, Chi-Yin Chow
This paper presents GeoFeed, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. GeoFeed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. GeoFeed is equipped with three different approaches for delivering the news feed to its users, namely, spatial pull, spatial push, and shared push. The main challenge in GeoFeed is then to decide when to use each of these three approaches, and for which users. GeoFeed employs a smart decision model that chooses among these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that GeoFeed outperforms existing news feed systems in terms of response time and maintenance cost.
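The push-versus-pull choice can be sketched with a toy, hypothetical cost comparison (not GeoFeed's actual decision model, which also covers shared push and response-time guarantees): materializing a friend's messages eagerly costs roughly one unit per friend update, while pulling them costs one unit per user login, so push pays off for friends who post less often than the user logs in.

```python
def plan_delivery(friend_update_rates, user_login_rate):
    # friend_update_rates: {friend_id: updates per day}.
    # user_login_rate: this user's logins per day.
    # Push a friend's messages eagerly only when that costs less than
    # re-pulling that friend's messages on every login (illustrative rule).
    return {f: ("push" if rate < user_login_rate else "pull")
            for f, rate in friend_update_rates.items()}
```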
Citations: 63
Physically Independent Stream Merging
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.25
B. Chandramouli, D. Maier, J. Goldstein
A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the "same" logical stream may present itself in multiple physical forms, for example, due to disorder arising in transmission, from combining multiple sources, or from modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes orders of magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.
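The core intuition can be sketched in a few lines, under simplifying assumptions of my own (events are plain (id, payload) pairs with no lifetimes or revisions): given one time-ordered arrival log fed by several physically different but logically equivalent inputs, emit each logical event the first time any input delivers it and suppress the redundant copies.

```python
def lmerge(arrivals):
    # arrivals: a single time-ordered log of (input_name, event_id, payload)
    # tuples as delivered by several logically equivalent input streams.
    # Emit each logical event exactly once, from whichever input wins the
    # race; later copies from redundant inputs are suppressed.
    # (The real LMerge operator also handles event lifetimes, revisions,
    # and dynamic attach/detach of inputs.)
    seen, out = set(), []
    for _src, eid, payload in arrivals:
        if eid not in seen:
            seen.add(eid)
            out.append((eid, payload))
    return out
```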
Citations: 1
The Credit Suisse Meta-data Warehouse
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.41
Claudio Jossen, Lukas Blunschi, M. Mori, Donald Kossmann, Kurt Stockinger
This paper describes the meta-data warehouse of Credit Suisse, which has been in production since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regard to changes such as the development of a new business process, a new business-analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that can be used to support such search services. Instead, the meta-data structures themselves need to be flexible and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots is given in order to illustrate the concepts used and the lessons learned, based on feedback from real business and IT users within Credit Suisse.
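The lineage service mentioned above amounts to a transitive walk over a dependency graph of information items. Here is a minimal sketch under a simplified, hypothetical meta-data model of my own (a flat "derived from" mapping; the paper's actual model is far richer and evolves over time):

```python
def lineage(derives_from, item):
    # derives_from maps each information item to the upstream items it was
    # computed from. Returns the transitive set of sources feeding `item`,
    # answering "where does this report column ultimately come from?"
    out, stack = set(), [item]
    while stack:
        cur = stack.pop()
        for up in derives_from.get(cur, []):
            if up not in out:
                out.add(up)
                stack.append(up)
    return out
```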
Citations: 6
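The lineage-extraction service described in the abstract above amounts to upstream traversal over a dependency graph of information items. A minimal illustrative sketch of that idea follows; all item names and the dictionary-based meta-data model are hypothetical, not Credit Suisse's actual schema:

```python
from collections import deque

# Hypothetical meta-data: each information item maps to the items it is
# derived from. An illustrative stand-in, not the paper's meta-data model.
DERIVED_FROM = {
    "report.credit_risk":  ["dwh.exposures", "dwh.ratings"],
    "dwh.exposures":       ["staging.trades"],
    "dwh.ratings":         ["staging.agency_feed"],
    "staging.trades":      [],
    "staging.agency_feed": [],
}

def lineage(item, graph=DERIVED_FROM):
    """Return every upstream item the given item depends on, transitively (BFS)."""
    seen, queue = set(), deque(graph.get(item, []))
    while queue:
        upstream = queue.popleft()
        if upstream not in seen:
            seen.add(upstream)
            queue.extend(graph.get(upstream, []))
    return seen

# lineage("report.credit_risk") collects both warehouse tables and both
# staging feeds as upstream sources of the report.
```

A search service in the same spirit would index the node names; the lineage query is then just this traversal started from a search hit.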
Temporal Analytics on Big Data for Web Advertising
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.55
B. Chandramouli, J. Goldstein, S. Duan
"Big Data" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
{"title":"Temporal Analytics on Big Data for Web Advertising","authors":"B. Chandramouli, J. Goldstein, S. Duan","doi":"10.1109/ICDE.2012.55","DOIUrl":"https://doi.org/10.1109/ICDE.2012.55","url":null,"abstract":"\"Big Data\" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. 
Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123415442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 101
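The abstract's central claim is that a temporal query written once can run unmodified over offline data and over live streams. A minimal sketch of that idea, using a tumbling-window count of page views per user, is below; this is an illustrative fold over timestamped events, not TiMR's actual API:

```python
from collections import defaultdict

def views_per_window(events, window=60):
    """events: iterable of (timestamp, user), ordered by timestamp.
    Yields (window_start, user, count) for each tumbling window.
    Works the same whether `events` is an offline list or a live iterator."""
    counts = defaultdict(int)
    current = None
    for ts, user in events:
        start = ts - ts % window  # align timestamp to its window boundary
        if current is not None and start != current:
            # window closed: emit its aggregates, then reset
            for u, c in sorted(counts.items()):
                yield current, u, c
            counts.clear()
        current = start
        counts[user] += 1
    if current is not None:  # flush the final window
        for u, c in sorted(counts.items()):
            yield current, u, c

# Three events for alice and one for bob, spread over two 60s windows:
events = [(5, "alice"), (30, "alice"), (61, "bob"), (90, "alice")]
# list(views_per_window(events)) ==
#     [(0, "alice", 2), (60, "alice", 1), (60, "bob", 1)]
```

Because the function only consumes the iterator forward and keeps per-window state, the same code handles a bounded log file (the offline M-R case) and an unbounded feed (the streaming case), which is the property the paper exploits.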
Aggregate Query Answering on Possibilistic Data with Cardinality Constraints
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.15
Graham Cormode, D. Srivastava, E. Shen, Ting Yu
Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples encoding, e.g. that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies, by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.
{"title":"Aggregate Query Answering on Possibilistic Data with Cardinality Constraints","authors":"Graham Cormode, D. Srivastava, E. Shen, Ting Yu","doi":"10.1109/ICDE.2012.15","DOIUrl":"https://doi.org/10.1109/ICDE.2012.15","url":null,"abstract":"Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples encoding, e.g. that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies, by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. 
Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124144077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
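The abstract above describes computing exact bounds for aggregate queries under cardinality constraints such as "at most k of these tuples are actually correct." The paper encodes this as a linear integer program; for a handful of tuples the same exact bounds can be obtained by enumerating possible worlds, which the following stdlib-only sketch does (the data and the brute-force strategy are illustrative, not the paper's LICM encoding):

```python
from itertools import combinations

def sum_bounds(values, at_most):
    """Exact [min, max] bounds on SUM over possibilistic tuples, where any
    subset of at most `at_most` tuples may be the true world."""
    world_sums = []
    for k in range(at_most + 1):          # worlds with 0, 1, ..., at_most tuples
        for world in combinations(values, k):
            world_sums.append(sum(world))
    return min(world_sums), max(world_sums)

# Five candidate amounts, at most two of which are real:
lo, hi = sum_bounds([10, -4, 7, 3, 12], at_most=2)
# lo == -4  (world where only the -4 tuple is correct)
# hi == 22  (world containing the 10 and 12 tuples)
```

An ILP formulation replaces the enumeration with one 0/1 variable per tuple, the objective SUM(value_i * x_i), and the constraint sum(x_i) <= k, which is what lets an off-the-shelf solver handle instance sizes where enumeration is infeasible.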