
Latest publications from the 2012 IEEE 28th International Conference on Data Engineering

DESKS: Direction-Aware Spatial Keyword Search
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.93
Guoliang Li, Jianhua Feng, Jing Xu
Location-based services (LBS) have been widely adopted by mobile users. Many LBS users have a direction-aware search requirement: answers must lie in the search direction. However, to the best of our knowledge, no existing research investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning the candidates that violate the direction constraint. However, this method is rather expensive, as it involves much useless computation on many unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method that inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to answer a query incrementally. Experimental results on real datasets show that our method achieves high performance and significantly outperforms existing methods.
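The direction constraint the abstract describes can be illustrated as an angular-sector test: an object qualifies only if its bearing from the user falls inside the query's direction cone. The sketch below is a minimal illustration of the naive filter-after-retrieval baseline, not the paper's indexing structures; all names (`in_search_direction`, `half_angle`) are our own.

```python
import math

def in_search_direction(user, obj, direction, half_angle):
    """True if obj lies in the angular sector centred on `direction`
    (radians) with half-width `half_angle`, as seen from `user`.
    Illustrative helper, not the paper's actual API."""
    dx, dy = obj[0] - user[0], obj[1] - user[1]
    bearing = math.atan2(dy, dx)
    # smallest signed angular difference, normalised into (-pi, pi]
    diff = (bearing - direction + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= half_angle

def filter_by_direction(user, candidates, direction, half_angle):
    """Naive baseline: fetch keyword candidates first, then prune those
    violating the direction constraint (the expensive strategy the paper
    improves on with direction-aware indexes)."""
    return [c for c in candidates
            if in_search_direction(user, c, direction, half_angle)]
```

The paper's contribution is avoiding this post-filtering by pruning directions inside the index itself.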
Citations: 109
Micro-Specialization in DBMSes
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.110
Rui Zhang, R. Snodgrass, S. Debray
Relational database management systems are general in the sense that they can handle arbitrary schemas, queries, and modifications. This generality is implemented using runtime metadata lookups and tests that ensure that control is channelled to the appropriate code in all cases. Unfortunately, these lookups and tests are carried out even when information is available that renders some of these operations superfluous, leading to unnecessary runtime overheads. This paper introduces micro-specialization, an approach that uses relation- and query-specific information to specialize the DBMS code at runtime and thereby eliminate some of these overheads. We develop a taxonomy of approaches and specialization times and propose a general architecture that isolates most of the creation and execution of the specialized code sequences in a separate DBMS-independent module. Through three illustrative types of micro-specialization applied to PostgreSQL, we show that this approach requires minimal changes to a DBMS and can improve performance simultaneously across a wide range of queries, modifications, and bulk-loading, in terms of storage, CPU usage, and I/O time on the TPC-H and TPC-C benchmarks.
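The core idea, resolving metadata once and generating code with the result baked in, can be sketched in miniature. This is a hypothetical analogy in Python (the paper specializes C code inside PostgreSQL); all function names here are ours.

```python
def generic_filter(rows, schema, col_name, value):
    """Generic interpreted path: the column index is looked up from
    schema metadata on every call."""
    idx = schema.index(col_name)  # per-call metadata lookup
    return [r for r in rows if r[idx] == value]

def specialize_filter(schema, col_name, value):
    """'Micro-specialized' path: resolve the column index once, then
    generate a function with the index and constant baked in, so the
    per-call lookup disappears."""
    idx = schema.index(col_name)  # resolved at specialization time
    src = (f"def _spec(rows):\n"
           f"    return [r for r in rows if r[{idx}] == {value!r}]")
    ns = {}
    exec(src, ns)
    return ns["_spec"]
```

Both paths return identical results; the specialized one simply skips work that the generic one repeats on every invocation, which mirrors the overheads the paper targets.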
Citations: 16
Processing and Notifying Range Top-k Subscriptions
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.67
Albert Yu, P. Agarwal, Jun Yang
We consider how to support, over a wide-area network, a large number of users whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify the users whose top-k results are affected. Simple solutions include using a content-driven network to notify all users whose interest ranges contain the update (ignoring top-k), or using a server to compute only the affected queries and notify them individually. The former solution generates too much network traffic, while the latter overwhelms the server. We present a geometric framework for the problem that allows us to describe the set of affected queries succinctly with messages that can be efficiently disseminated using content-driven networks. We give fast algorithms to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all user interests. We also present extensions to our solution, including an approximate algorithm that trades off between the cost of server-side reformulation and that of user-side post-processing, as well as efficient techniques for batch updates.
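The server-centric baseline the abstract criticises can be made concrete: on each update, recompute every subscription's top-k and notify only those whose result changed. The sketch below is that naive baseline under assumed one-dimensional ranges, not the paper's geometric framework; all names are illustrative.

```python
def top_k(objects, lo, hi, k):
    """Top-k objects (by score, descending) whose position lies in [lo, hi]."""
    inside = [o for o in objects if lo <= o["pos"] <= hi]
    return sorted(inside, key=lambda o: -o["score"])[:k]

def affected_queries(objects, queries, update):
    """Server-side baseline: recompute each subscription's top-k before and
    after applying `update`, and report the ids of queries whose result
    changed. This per-query work is what overwhelms the server at scale."""
    before = {q["id"]: top_k(objects, q["lo"], q["hi"], q["k"]) for q in queries}
    updated = [update if o["id"] == update["id"] else o for o in objects]
    return [q["id"] for q in queries
            if top_k(updated, q["lo"], q["hi"], q["k"]) != before[q["id"]]]
```

The paper's contribution is describing this affected set succinctly so that a content-driven network can disseminate it without per-query recomputation.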
Citations: 2
Learning Stochastic Models of Information Flow
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.103
Luke Dickens, Ian Molloy, Jorge Lobo, P. Cheng, A. Russo
An understanding of information flow has many applications, including maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods both for learning models of information flow in networks from data, based on the Independent Cascade Model, and for predicting probabilities of unseen flow from these models. Our approach is based on a principled probabilistic construction, and results compare favourably with existing methods in terms of accuracy of prediction and scalable evaluation, with the addition that we are able to evaluate a broader range of queries than previously shown, including the probability of joint and/or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges, and naive sampling can also be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo fashion using the Metropolis-Hastings algorithm; details are described in the paper. We identify two types of data: those where the paths of past flows are known (attributed data), and those where only the endpoints are known (unattributed data). Both data types are addressed in this paper, including training methods, example real-world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter microblogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hash tags (metadata tags) and URLs for the unattributed case.
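The Independent Cascade Model and the naive sampling baseline mentioned above can be sketched directly: each newly activated node gets one chance to activate each neighbour with the edge's probability, and flow probability is estimated over many runs. This is the naive Monte-Carlo estimator the paper improves on with Metropolis-Hastings sampling; function names are ours.

```python
import random

def simulate_cascade(edges, seeds, rng):
    """One run of the Independent Cascade Model. `edges` maps each node u
    to a list of (v, p) pairs: when u activates, it activates v with
    probability p, exactly once. Returns the set of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in edges.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def flow_probability(edges, seeds, target, runs=2000, seed=0):
    """Naive sampling estimate of P(flow reaches target); expensive for
    large graphs, which motivates the paper's MCMC approach."""
    rng = random.Random(seed)
    hits = sum(target in simulate_cascade(edges, seeds, rng) for _ in range(runs))
    return hits / runs
```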
Citations: 26
Efficient Support of XQuery Update Facility in XML Enabled RDBMS
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.17
Z. Liu, Hui J. Chang, Balasubramanyam Sthanikam
XQuery Update Facility (XQUF), which provides a declarative way of updating XML, has become a W3C Recommendation. The SQL/XML standard, on the other hand, defines XMLType as a column data type in the RDBMS environment and defines standard SQL/XML operators, such as XML Query(), to embed XQuery for querying XMLType columns in an RDBMS. Based on this SQL/XML standard, XML enabled RDBMSs have become industrial-strength platforms for hosting XML applications in a standards-compliant way by providing XML store and query capability. However, XML update support remains proprietary in RDBMSs until XQUF becomes the recommendation. XQUF is agnostic of how XML is stored, so propagation of actual updates to any persistent XML store is beyond the scope of XQUF. In this paper, we show how XQUF can be incorporated into XML Query() to effectively update XML stored in XMLType columns in an XML enabled RDBMS, such as Oracle XMLDB. We present various compile-time and run-time optimisation techniques to show how XQUF can be efficiently implemented to declaratively update XML stored in an RDBMS. We present our approaches to optimising XQUF for two common physical XML storage models: native binary XML storage and relational decomposition of XML. Although our study is done using Oracle XMLDB, all of the presented optimisation techniques are generic to XML stores that need to support updates of persistent XML and are not specific to the Oracle XMLDB implementation.
Citations: 8
GeoFeed: A Location Aware News Feed System
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.97
Jie Bao, M. Mokbel, Chi-Yin Chow
This paper presents the GeoFeed system, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. GeoFeed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. GeoFeed is equipped with three different approaches for delivering the news feed to its users, namely, spatial pull, spatial push, and shared push. The main challenge of GeoFeed is then deciding when to use each of these three approaches for which users. GeoFeed is equipped with a smart decision model that chooses among these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that GeoFeed outperforms existing news feed systems in terms of response time and maintenance cost.
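The push-versus-pull decision the abstract alludes to is, at its core, a cost comparison: push pays per message update, pull pays per user login. The sketch below is a deliberately simplified illustration of that trade-off under assumed unit costs, not GeoFeed's actual decision model.

```python
def choose_delivery(update_rate, login_rate, push_cost=1.0, pull_cost=1.0):
    """Illustrative cost comparison (not the paper's exact model):
    spatial push incurs work on every relevant message update, while
    spatial pull incurs work on every user login. Pick the cheaper side."""
    push_total = update_rate * push_cost   # work done eagerly, per update
    pull_total = login_rate * pull_cost    # work done lazily, per login
    return "push" if push_total <= pull_total else "pull"
```

A rarely-posting friend followed by a frequently-logging-in user favours push; the reverse favours pull.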
Citations: 63
Optimizing Statistical Information Extraction Programs over Evolving Text
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.60
Fei Chen, Xixuan Feng, C. Ré, Min Wang
Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, IE programs must be applied repeatedly to consecutive corpus snapshots to keep extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may change very little, but, unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE by recycling previous IE results to enable incremental update. As a first step, CRFlex focuses on statistical IE programs that use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. We then show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent on reading and writing intermediate results and the CPU cost we can save by recycling those intermediate results. We therefore present a cost-based solution to determine the most efficient recycling approach for any given CRF-based IE program and an evolving corpus. We conduct extensive experiments with CRF-based IE programs for three IE tasks over a real-world data set to demonstrate the utility of our approach.
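The I/O-versus-CPU tradeoff described above lends itself to a small numeric sketch. The model below is our own simplification of a cost-based recycling decision, assuming intermediates are written once and read on each reuse; it is not CRFlex's actual cost formula.

```python
def choose_recycling(io_read_cost, io_write_cost, cpu_recompute_cost, reuse_count):
    """Illustrative cost-based choice: materialising intermediate CRF
    inference results pays one write plus one read per reuse, while
    recomputing from scratch pays CPU on every snapshot."""
    recycle_cost = io_write_cost + reuse_count * io_read_cost
    recompute_cost = reuse_count * cpu_recompute_cost
    return "recycle" if recycle_cost < recompute_cost else "recompute"
```

Cheap reads with expensive inference favour recycling; cheap inference with slow storage favours recomputation, which is exactly the tradeoff the paper's cost model navigates.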
Citations: 23
Accuracy-Aware Uncertain Stream Databases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.96
Tingjian Ge, Fujun Liu
Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A missing element, however, is an account of how accurate these probability distributions are. This indeed has a profound impact on the accuracy of query results presented to end users. While some previous work studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the distributions learned from raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain the accuracy of query results. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.
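A hypothesis-testing predicate of the kind the abstract proposes can be illustrated with a one-sided z-test on the sample mean: "the true mean exceeds the threshold" is accepted only when the samples support it at significance level alpha. This is a generic textbook sketch using a normal approximation, not the paper's actual predicate definitions; all names are ours.

```python
import math

def p_value_greater(samples, threshold):
    """One-sided p-value for H0: true mean <= threshold, using the
    normal approximation to the sampling distribution of the mean."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                                # standard error
    z = (mean - threshold) / se
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))      # 1 - Phi(z)

def mean_exceeds(samples, threshold, alpha=0.05):
    """Accuracy-aware predicate sketch: true only when the data reject H0
    at level alpha, rather than naively comparing the point estimate."""
    return p_value_greater(samples, threshold) < alpha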
以前的工作已经引入了概率分布作为不确定流数据库系统的一级组件。缺少的一个因素是这些概率分布有多精确。这确实对呈现给最终用户的查询结果的准确性有深远的影响。虽然之前有一些工作研究了元组不确定性模型中不可靠的中间查询结果,但据我们所知,我们是第一个考虑不确定流数据库的人,在这种数据库中,从基于原始数据样本的学习分布到查询结果都要考虑准确性。本文对一个具有精度感知的不确定流数据库系统的各个组成部分进行了初步研究,包括精度信息的表示和如何获得查询结果的精度。此外,我们提出了基于假设检验的新谓词,用于使用有限精度的数据进行决策。我们用一套全面的实验评估来加强我们的研究。
{"title":"Accuracy-Aware Uncertain Stream Databases","authors":"Tingjian Ge, Fujun Liu","doi":"10.1109/ICDE.2012.96","DOIUrl":"https://doi.org/10.1109/ICDE.2012.96","url":null,"abstract":"Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A lacking element is the fact of how accurate these probability distributions are. This indeed has a profound impact on the accuracy of query results presented to end users. While there is some previous work that studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the learned distributions based on raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain query results' accuracy. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114073853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
The Credit Suisse Meta-data Warehouse
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.41
Claudio Jossen, Lukas Blunschi, M. Mori, Donald Kossmann, Kurt Stockinger
This paper describes the meta-data warehouse of Credit Suisse that has been in production since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regard to changes such as the development of a new business process, a new business analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that can be used to support such search services. Instead, the meta-data structures need to be flexible themselves and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots are given in order to illustrate the concepts used and the lessons learned based on feedback from real business and IT users within Credit Suisse.
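The lineage-extraction service mentioned in the abstract can be sketched as a transitive traversal of a meta-data dependency graph. This is a minimal illustration with invented item names, not Credit Suisse's actual data structures:

```python
from collections import deque

# Hypothetical meta-data graph: each information item maps to the
# sources it is directly derived from (edge = "is derived from").
derived_from = {
    "risk_report":    ["positions_mart", "fx_rates"],
    "positions_mart": ["trades_dwh"],
    "fx_rates":       ["market_feed"],
    "trades_dwh":     ["trading_system"],
}

def lineage(item):
    """Return every upstream source that `item` transitively depends on,
    found by breadth-first traversal of the dependency graph."""
    seen, queue = set(), deque([item])
    while queue:
        node = queue.popleft()
        for src in derived_from.get(node, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen
```

For example, `lineage("risk_report")` would surface every system and table the report ultimately depends on — the kind of answer an impact analysis for a regulatory change needs.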
Citations: 6
Aggregate Query Answering on Possibilistic Data with Cardinality Constraints
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.15
Graham Cormode, D. Srivastava, E. Shen, Ting Yu
Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information, or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples, encoding, e.g., that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.
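To make the possible-worlds semantics concrete, here is a brute-force sketch of exact SUM bounds under an "at most k of these uncertain tuples are present" cardinality constraint. This enumerates subsets directly as a stand-in for the paper's ILP formulation, and the inputs are hypothetical:

```python
from itertools import combinations

def sum_bounds(values, at_most):
    """Exact (min, max) of SUM(value) over all possible worlds in which
    at most `at_most` of the uncertain tuples are actually present.
    Brute force over subsets -- only feasible for small inputs; an ILP
    solver scales this to realistic sizes."""
    sums = [sum(c)
            for k in range(at_most + 1)
            for c in combinations(values, k)]
    return min(sums), max(sums)
```

With values `[5, -2, 7]` and at most two tuples present, the SUM can range from -2 (only the second tuple is real) to 12 (the first and third are real) — exactly the kind of bound the ILP-based approach returns without enumeration.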
Citations: 9
Journal: 2012 IEEE 28th International Conference on Data Engineering