
Latest publications from the 2012 IEEE 28th International Conference on Data Engineering

Cross Domain Search by Exploiting Wikipedia
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.13
Chen Liu, Sai Wu, Shouxu Jiang, A. Tung
The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich the user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modality and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are an intuitive medium for linking resources of different modalities together. However, tagging is by nature an ad hoc activity: tags often contain noise and are affected by the subjective inclinations of taggers. Consequently, linking resources simply by tags is not reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to raw tags, these concepts are of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.
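The tag-to-concept linking can be pictured with a small sketch (the lookup table, function names, and scoring are hypothetical; the paper derives its concepts from Wikipedia rather than a hand-built map):

```python
# Sketch (not the paper's algorithm): map noisy tags to Wikipedia-style
# concepts via a lookup table, then relate resources across modalities
# by how many concepts they share.

# Hypothetical tag -> concept mapping; a real system would derive this
# from Wikipedia redirects, anchor texts, and disambiguation pages.
TAG_TO_CONCEPT = {
    "nyc": "New York City",
    "new york": "New York City",
    "big apple": "New York City",
    "jazz": "Jazz",
    "sax": "Saxophone",
}

def concepts(tags):
    """Map a resource's raw tags to the set of concepts they denote."""
    return {TAG_TO_CONCEPT[t] for t in tags if t in TAG_TO_CONCEPT}

def cross_modal_score(query_tags, candidate_tags):
    """Score a candidate resource by concept overlap with the query."""
    return len(concepts(query_tags) & concepts(candidate_tags))

photo = ["nyc", "skyline"]     # image resource
song = ["big apple", "jazz"]   # audio resource: shares a concept with
                               # the photo though no raw tag matches
```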
Citations: 12
Temporal Analytics on Big Data for Web Advertising
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.55
B. Chandramouli, J. Goldstein, S. Duan
"Big Data" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
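The kind of temporal query TiMR targets can be illustrated with a toy example (plain Python, not TiMR's actual API; the function and event layout are assumed for illustration):

```python
# Sketch of a BT-style temporal feature: count each user's events per
# tumbling window -- the kind of query that is awkward to express in
# set-oriented M-R front-ends but natural in a temporal language.
from collections import defaultdict

def tumbling_counts(events, width):
    """events: iterable of (timestamp, user_id) pairs.
    Returns {(window_start, user_id): count} for tumbling windows
    of the given width."""
    counts = defaultdict(int)
    for ts, user in events:
        window_start = (ts // width) * width  # align to window grid
        counts[(window_start, user)] += 1
    return dict(counts)

# With width=10: u1 has 2 events in window [0,10) and 1 in [10,20);
# u2 has 1 event in window [0,10).
events = [(1, "u1"), (3, "u1"), (7, "u2"), (12, "u1")]
```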
Citations: 101
Viewing the Web as a Distributed Knowledge Base
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.150
S. Abiteboul, Émilien Antoine, Julia Stoyanovich
This paper addresses the challenges faced by everyday Web users, who interact with inherently heterogeneous and distributed information. Managing such data is currently beyond the skills of casual users. We describe ongoing work that has as its goal the development of foundations for declarative distributed data management. In this approach, we see the Web as a knowledge base consisting of distributed logical facts and rules. Our objective is to enable automated reasoning over this knowledge base, ultimately improving the quality of service and of data. For this, we use Webdamlog, a Datalog-style language with rule delegation. We outline ongoing efforts on the WebdamExchange platform that combines Webdamlog evaluation with communication and security protocols.
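The flavor of such distributed logical facts and rules can be sketched as follows (a naive Python evaluation of one Datalog-style rule; Webdamlog's actual syntax, peers, and delegation semantics are considerably richer):

```python
# Illustrative sketch: derive new facts from base facts with one rule,
#   reachable(X, Z) :- link(X, Y), link(Y, Z).

def derive_reachable(links):
    """links: set of (src, dst) facts; returns the two-hop facts."""
    derived = set()
    for (x, y1) in links:
        for (y2, z) in links:
            if y1 == y2:  # the shared variable Y must unify
                derived.add((x, z))
    return derived

# Facts could live on different peers; here they are simply merged.
peer_a = {("alice", "bob")}
peer_b = {("bob", "carol")}
facts = peer_a | peer_b
```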
Citations: 17
Processing of Rank Joins in Highly Distributed Systems
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.108
C. Doulkeridis, Akrivi Vlachou, K. Nørvåg, Y. Kotidis, N. Polyzotis
In this paper, we study the efficient processing of rank joins in highly distributed systems, where servers store fragments of relations in an autonomous manner. Existing rank-join algorithms exhibit poor performance in this setting due to excessive communication costs or high latency. We propose a novel distributed rank-join framework that employs data statistics, maintained as histograms, to determine the subset of each relational fragment that needs to be fetched to generate the top-k join results. At the heart of our framework lies a distributed score-bound estimation algorithm that produces, for each relation, score bounds sufficient to guarantee the correctness of the rank-join result set when the histograms are accurate. Furthermore, we propose a generalization of our framework that supports approximate statistics for the case where exact statistical information is unavailable. An extensive experimental study validates the efficiency of our framework and demonstrates its advantages over existing methods.
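The score-bound idea can be sketched as follows (a deliberately simplified model assuming a sum scoring function, a cross-product join, and a single bound shared by both relations; the paper's estimation algorithm is distributed and far more refined):

```python
# Sketch: use per-relation score histograms to pick a score bound such
# that the tuples above it can already form at least k join pairs;
# only those tuples then need to be fetched from the fragments.

def tuples_above(histogram, bound):
    """histogram: {bucket_lower_score: count}; count tuples whose
    bucket score is >= bound."""
    return sum(c for score, c in histogram.items() if score >= bound)

def sufficient_bound(hist_r, hist_s, k, candidate_bounds):
    """Return the highest bound (applied to both relations) such that
    the surviving tuples can form at least k join pairs, else None."""
    for b in sorted(candidate_bounds, reverse=True):
        if tuples_above(hist_r, b) * tuples_above(hist_s, b) >= k:
            return b
    return None

hist_r = {90: 1, 70: 3, 50: 10}  # e.g. one tuple scoring in [90, 100)
hist_s = {90: 2, 70: 2, 50: 5}
```

For k = 10 the bound 90 keeps only 1 x 2 = 2 pairs, too few, so the bound drops to 70, where (1 + 3) x (2 + 2) = 16 pairs suffice.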
Citations: 15
Physically Independent Stream Merging
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.25
B. Chandramouli, D. Maier, J. Goldstein
A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the "same" logical stream may present itself in multiple physical forms, for example, due to disorder arising in transmission or from combining multiple sources, and modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes orders-of-magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.
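The core contract of such an operator can be sketched as follows (a much-simplified model keyed on logical event ids; the real LMerge operator also reconciles event lifetimes and revisions of earlier events):

```python
# Sketch: pull from several logically equivalent inputs and emit each
# logical event exactly once, whichever input delivers it first.

def logical_merge(*streams):
    """streams: iterables of (event_id, payload); yields each logical
    event once, taking the earliest arrival across all inputs."""
    seen = set()
    iters = [iter(s) for s in streams]
    # Round-robin over inputs to simulate interleaved arrival.
    while iters:
        for it in list(iters):
            try:
                event_id, payload = next(it)
            except StopIteration:
                iters.remove(it)  # input detached / exhausted
                continue
            if event_id not in seen:
                seen.add(event_id)
                yield (event_id, payload)

# Two physically different presentations of the same logical stream:
fast = [(1, "a"), (2, "b"), (3, "c")]
slow = [(2, "b"), (1, "a"), (3, "c")]  # delivered out of order
```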
Citations: 1
Parametric Plan Caching Using Density-Based Clustering
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.57
Günes Aluç, David DeHaan, Ivan T. Bowman
Query plan caching eliminates the need for repeated query optimization; hence, it has strong practical implications for relational database management systems (RDBMSs). Unfortunately, existing approaches consider only the query plan generated at the expected values of the parameters that characterize the query, the data, and the current state of the system, while these parameters may take different values during the lifetime of a cached plan. A better alternative is to harvest the optimizer's plan choices for different parameter values, populate the cache with promising query plans, and select a cached plan based upon the current parameter values. To address this challenge, we propose a parametric plan caching (PPC) framework that uses an online plan-space clustering algorithm. The clustering algorithm is density-based, and it exploits locality-sensitive hashing as a pre-processing step so that clusters in the plan space can be efficiently stored in database histograms and queried in constant time. We experimentally validate that our approach is precise, efficient in space and time, and adaptive, requiring no eager exploration of the plan spaces of the optimizer.
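The cache-lookup side can be sketched as follows (assumed details: a random-hyperplane LSH over a numeric parameter vector mapping similar parameter values to the same bucket; the paper additionally clusters the hashed plan space by density and stores the clusters in histograms):

```python
# Sketch: hash the query's parameter vector so that nearby vectors
# tend to land in the same bucket, then reuse that bucket's plan.
import random

random.seed(7)  # fixed planes for reproducibility of the sketch
DIM, N_PLANES = 3, 4
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def lsh_key(params):
    """Hash a parameter vector to a bucket key: one bit per hyperplane,
    set according to which side of the plane the vector falls on."""
    bits = 0
    for plane in PLANES:
        dot = sum(p * w for p, w in zip(params, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

plan_cache = {}  # bucket key -> cached plan

def get_or_optimize(params, optimize):
    """Reuse a cached plan for this bucket, optimizing only on a miss."""
    key = lsh_key(params)
    if key not in plan_cache:
        plan_cache[key] = optimize(params)
    return plan_cache[key]
```

A positively scaled vector keeps the same sign against every hyperplane, so it hashes to the same bucket and hits the cached plan without invoking the optimizer again.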
Citations: 8
Earlybird: Real-Time Search at Twitter
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.149
Michael Busch, Krishna Gade, B. Larson, Patrick Lok, Samuel B. Luckenbill, Jimmy J. Lin
The web today is increasingly characterized by social and real-time signals, which we believe represent two frontiers in information retrieval. In this paper, we present Earlybird, the core retrieval engine that powers Twitter's real-time search service. Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval engines, its index structures differ from those built to support traditional web search. We describe these differences and present the rationale behind our design. A key requirement of real-time search is the ability to ingest content rapidly and make it searchable immediately, while concurrently supporting low-latency, high-throughput query evaluation. These demands are met with a single-writer, multiple-reader concurrency model and the targeted use of memory barriers. Earlybird represents a point in the design space of real-time search engines that has worked well for Twitter's needs. By sharing our experiences, we hope to spur additional interest and innovation in this exciting space.
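The publish-then-read discipline behind the single-writer, multiple-reader model can be sketched as follows (illustrative only; Earlybird itself is Java, where a volatile/atomic store of the published count supplies the memory barrier that plain Python does not model):

```python
# Sketch: the single writer appends into preallocated storage and only
# then bumps the published count; readers never look past that count,
# so they never observe a partially written posting.

class PostingList:
    def __init__(self, capacity):
        self.postings = [None] * capacity  # preallocated, append-only
        self.published = 0                 # readers stop here

    def add(self, doc_id):
        """Called only by the single writer thread."""
        self.postings[self.published] = doc_id
        # The write above must become visible before the count is
        # bumped; in Java this ordering comes from a volatile store.
        self.published += 1

    def search(self):
        """Safe for concurrent readers: sees only published postings."""
        return self.postings[:self.published]
```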
Citations: 173
Efficiently Monitoring Top-k Pairs over Sliding Windows
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.89
Zhitao Shen, M. A. Cheema, Xuemin Lin, W. Zhang, Haixun Wang
Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well-studied special cases of top-k pairs queries. In this paper, we present the first approach to answering a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N in the window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower-bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower-bound cost.
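As a baseline, the query itself can be stated in a few lines (a naive recompute-per-window sketch that enumerates all O(N^2) pairs; the paper's contribution is maintaining the much smaller K-skyband incrementally instead):

```python
# Naive baseline: recompute the top-k pairs over the current window
# contents from scratch on every slide.
import heapq
from itertools import combinations

def topk_pairs(window, k, score):
    """window: list of objects; score: function over a pair.
    Returns the k best-scoring pairs, highest score first."""
    return heapq.nlargest(k, combinations(window, 2),
                          key=lambda p: score(*p))

# Example: a k-closest-pairs query on a 1-D window, expressed by
# negating the distance so that "larger score" means "closer".
window = [1.0, 5.0, 6.0, 9.0]
closest = topk_pairs(window, 2, lambda a, b: -abs(a - b))
```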
Citations: 28
Attribute-Based Subsequence Matching and Mining
Pub Date: 2012-04-01 DOI: 10.1109/ICDE.2012.81
Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu
Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's rental record, containing an ordered list of movies, is a sequence. Most studies of sequence analysis focus on subsequence matching, which finds all sequences stored in the database such that a given query sequence is a subsequence of each of them. In many applications, elements are associated with properties or attributes. For example, each movie is associated with attributes like "Director" and "Actors". Unfortunately, to the best of our knowledge, no existing study of sequence analysis considers the attributes of elements. In this paper, we propose two problems. The first is: given a query sequence and a set of sequences, and taking the attributes of elements into account, find all sequences matched by the query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications of the traditional subsequence matching problem also apply to our new problem, provided that the attributes of elements are given. We propose an efficient algorithm for problem ASM. The key to its efficiency is to compress each whole sequence, with its potentially many associated attributes, into just a triplet of numbers. By working on these highly compressed representations, we greatly speed up attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequences. We adapt an existing efficient algorithm to this second problem to show that the algorithm developed for the first problem can be reused. Empirical studies show that our algorithms are scalable on large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases.
This work can stimulate a number of existing data mining problems which are fundamentally based on subsequence matching such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.
序列分析在我们的日常生活中非常重要。通常,每个序列都与一个有序的元素列表相关联。例如,在电影租赁应用程序中,客户的电影租赁记录包含一个有序的电影列表,这是一个序列示例。序列分析的研究大多集中在子序列匹配上,即找到数据库中存储的所有序列,使给定的查询序列是这些序列中的每一个序列的子序列。在许多应用程序中,元素与属性或属性相关联。例如,每部电影都与一些属性相关联,如“导演”和“演员”。不幸的是,据我们所知,所有现有的序列分析研究都没有考虑元素的属性。在本文中,我们提出两个问题。第一个问题是:给定一个查询序列和一组序列,考虑到元素的属性,我们希望找到与该查询序列匹配的所有序列。这个问题被称为基于属性的子序列匹配(ASM)。所有传统子序列匹配问题的现有应用都可以应用于我们的新问题,只要我们给定元素的属性。提出了一种求解ASM问题的有效算法。该算法效率的关键思想是将每个具有潜在许多相关属性的整个序列压缩成一个数字三元组。通过处理这些非常压缩的表示,我们大大加快了基于属性的子序列匹配。第二个问题是找到所有频繁的基于属性的子序列。我们还对第二个问题采用了一个现有的高效算法,以表明我们可以使用为第一个问题开发的算法。实证研究表明,我们的算法在大型数据集中是可扩展的。特别是,在大多数情况下,我们的算法运行速度至少比直接方法快一个数量级。这项工作可以激发生物信息学中基于子序列匹配的序列分类、频繁序列挖掘、基序检测和序列匹配等现有数据挖掘问题。
{"title":"Attribute-Based Subsequence Matching and Mining","authors":"Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu","doi":"10.1109/ICDE.2012.81","DOIUrl":"https://doi.org/10.1109/ICDE.2012.81","abstract":"Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's movie rental record containing an ordered list of movies is a sequence example. Most studies about sequence analysis focus on subsequence matching which finds all sequences stored in the database such that a given query sequence is a subsequence of each of these sequences. In many applications, elements are associated with properties or attributes. For example, each movie is associated with some attributes like \"Director\" and \"Actors\". Unfortunately, to the best of our knowledge, all existing studies about sequence analysis do not consider the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, considering the attributes of elements, we want to find all sequences which are matched by this query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications for the traditional subsequence matching problem can also be applied to our new problem provided that we are given the attributes of elements. We propose an efficient algorithm for problem ASM. The key idea to the efficiency of this algorithm is to compress each whole sequence with potentially many associated attributes into just a triplet of numbers. By dealing with these very compressed representations, we greatly speed up the attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequence. We also adapt an existing efficient algorithm for this second problem to show we can use the algorithm developed for the first problem. Empirical studies show that our algorithms are scalable in large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases. This work can stimulate a number of existing data mining problems which are fundamentally based on subsequence matching such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131725192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
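The abstract above describes attribute-based subsequence matching: a query element matches a data element when the query's attribute values all appear among the data element's attributes, and matched elements must occur in order. A minimal sketch of that matching semantics (this is an illustration only, not the paper's compressed-triplet ASM algorithm; the movie data and function name are hypothetical):

```python
def asm_match(query, sequence):
    """Return True if `query` is an attribute-based subsequence of `sequence`.

    Both arguments are lists of attribute sets, in order. A query element
    matches a data element when its attributes are a subset of the data
    element's attributes; matches must respect the sequence order.
    """
    i = 0  # index of the next query element still to be matched
    for elem in sequence:
        if i < len(query) and query[i] <= elem:  # subset test on attribute sets
            i += 1
    return i == len(query)  # matched every query element in order


# A customer's movie rental record, each movie tagged with attributes.
movies = [
    {"director:Nolan", "actor:Bale"},
    {"director:Cameron", "actor:Worthington"},
    {"director:Nolan", "actor:DiCaprio"},
]

# Query: two Nolan-directed movies rented in order.
query = [{"director:Nolan"}, {"director:Nolan"}]
print(asm_match(query, movies))  # True: elements 1 and 3 match in order
```

The greedy left-to-right scan is correct here because matching each query element at its earliest possible position never rules out a later match; the paper's contribution is doing this far faster via compressed sequence representations.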
Querying Uncertain Spatio-Temporal Data 查询不确定时空数据
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.94
Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Andreas Züfle
The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point in time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First, it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. And third, all queries on this model can be reduced to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatio-temporal queries. In an experimental evaluation we show that our approaches are several orders of magnitude faster than state-of-the-art competitors.
由于在空间、时间、多媒体和传感器数据库中的广泛应用,不确定数据的建模和管理问题引起了人们的极大兴趣。在只考虑一个时间点的静态(快照)情况下,存在广泛的涵盖空间不确定性的工作。相比之下,不确定时空数据的建模和查询问题仅被视为空间情况的简单扩展,而忽略了连续时间戳之间的时间依赖性。在这项工作中,我们提出了一个有效建模和查询不确定时空数据的框架。我们方法的关键思想是通过随机过程来模拟可能的物体轨迹。与以前的工作相比,这种方法有三个主要优点。首先,它允许根据可能世界模型回答查询。其次,考虑连续时间点上目标位置之间的依赖关系。第三,可以将该模型上的所有查询简化为简单的矩阵乘法。基于这些概念,我们提出了不同概率时空查询的有效解决方案。在实验评估中,我们表明我们的方法比最先进的竞争对手快几个数量级。
{"title":"Querying Uncertain Spatio-Temporal Data","authors":"Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Andreas Züfle","doi":"10.1109/ICDE.2012.94","DOIUrl":"https://doi.org/10.1109/ICDE.2012.94","abstract":"The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point of time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. And third it is possible to reduce all queries on this model to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatio-temporal queries. In an experimental evaluation we show that our approaches are several order of magnitudes faster than state-of-the-art competitors.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133992827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
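The abstract above notes that when object motion is modeled as a stochastic process, queries reduce to simple matrix multiplications. A minimal sketch of that idea, assuming a toy Markov chain over three discretized locations (the transition matrix, region, and horizon are invented for illustration and are not from the paper):

```python
import numpy as np

# Row i holds the transition probabilities out of location i per time step.
M = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# The object is known to start at location 0.
p0 = np.array([1.0, 0.0, 0.0])

# Location distribution at time t = 3: one matrix power, then a
# vector-matrix product -- the "simple matrix multiplications" the
# abstract refers to.
p3 = p0 @ np.linalg.matrix_power(M, 3)

# Probabilistic range query: probability that the object lies inside
# the region {location 0, location 1} at time 3.
region = [0, 1]
print(round(float(p3[region].sum()), 4))  # 0.952
```

Because consecutive timestamps are linked through the transition matrix, this yields possible-worlds-consistent answers, unlike treating each snapshot's uncertainty independently.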