2013 IEEE 29th International Conference on Data Engineering (ICDE): latest papers, page 6

Memory-efficient algorithms for spatial network queries

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544863

Sarana Nutanong, H. Samet

Incrementally finding the k nearest neighbors (kNN) in a spatial network is an important problem in location-based services. One method (INE) simply applies Dijkstra's algorithm. Another method (IER) computes the k nearest neighbors using Euclidean distance, then computes their corresponding network distances, and then incrementally retrieves the next nearest neighbors in order of increasing Euclidean distance until it finds one whose Euclidean distance is greater than the network distance of the current kth nearest neighbor. The LBC method improves on INE by using a Euclidean heuristic estimator to avoid visiting nodes that cannot possibly lead to the k nearest neighbors. It improves on IER by performing multiple instances of heuristic search, again with a Euclidean heuristic estimator, on candidate objects around the query point, which avoids repeated visits to nodes in the spatial network that appear on the shortest paths to different members of the k nearest neighbors. LBC's drawback is that maintaining multiple instances of heuristic search (called wavefronts) requires k priority queues, and the queue operations required to maintain them incur a high in-memory processing cost. A method (SWH) is proposed that uses a novel heuristic function which considers the objects surrounding the query point together as a single unit, instead of one destination at a time as in LBC, thereby eliminating the need for multiple wavefronts and requiring just one priority queue. This results in a significant reduction in the in-memory processing cost components while retaining the same reduced cost of access to the spatial network as LBC. SWH is also extended to support the incremental distance semi-join (IDSJ) query, which is a multiple-query-point generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, thereby enabling it to be applied to non-spatial networks/graphs such as social networks. Experimental comparisons for kNN queries show that SWH is 2.5 times faster than INE, the best single-wavefront method, and 3.5 times faster than LBC, the best existing heuristic search method. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ and 4 times faster than LBC-IDSJ.
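
As a point of reference for the baseline described above, the following is a minimal sketch of INE-style incremental expansion: plain Dijkstra from the query node, reporting objects in order of network distance. It is not the SWH method, and it makes simplifying assumptions that are not in the abstract: objects are assumed to sit exactly on network nodes, and the names `ine_knn`, `graph`, and `objects` are hypothetical.

```python
import heapq

def ine_knn(graph, objects, source, k):
    """Dijkstra expansion from `source`; returns up to k (node, distance)
    pairs for object-bearing nodes, in order of network distance.
    Assumption: objects are located exactly on nodes."""
    dist = {source: 0.0}
    heap = [(0.0, source)]          # the single priority queue (one wavefront)
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue                # stale queue entry
        settled.add(u)
        if u in objects:
            result.append((u, d))   # next nearest neighbor found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical toy network: adjacency lists of (neighbor, edge length).
graph = {
    "q": [("a", 2.0), ("b", 5.0)],
    "a": [("q", 2.0), ("b", 1.0), ("c", 4.0)],
    "b": [("q", 5.0), ("a", 1.0), ("d", 2.0)],
    "c": [("a", 4.0)],
    "d": [("b", 2.0)],
}
objects = {"b", "c", "d"}

print(ine_knn(graph, objects, "q", 2))
```

Because the expansion has no heuristic, it settles every node closer than the kth neighbor, which is the cost that the heuristic methods in the abstract aim to reduce.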

Citations: 25

Πgora: An Integration System for Probabilistic Data

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544935

Dan Olteanu, Lampros Papageorgiou, Sebastiaan J. van Schaik

Πgora is an integration system for probabilistic data modelled using different formalisms such as pc-tables, Bayesian networks, and stochastic automata. User queries are expressed over a global relational layer and are evaluated by Πgora using a range of strategies, including data conversion into one probabilistic formalism followed by evaluation using a formalism-specific engine, and hybrid plans, where subqueries are evaluated using engines for different formalisms. This demonstration allows users to experience Πgora on real-world heterogeneous data sources from the medical domain.

Citations: 5

Focused matrix factorization for audience selection in display advertising

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544841

Bhargav Kanagal, Amr Ahmed, Sandeep Pandey, V. Josifovski, Lluis Garcia Pueyo, Jeffrey Yuan

Audience selection is a key problem in display advertising systems: we need to select a list of users who are interested in (i.e., most likely to buy from) an advertising campaign. The users' past feedback on this campaign can be leveraged to construct such a list using collaborative filtering techniques such as matrix factorization. However, the user-campaign interaction is typically extremely sparse, so conventional matrix factorization does not perform well. Moreover, simply combining the users' feedback from all campaigns does not address this, since it dilutes the focus on the target campaign under consideration. To resolve these issues, we propose a novel focused matrix factorization model (FMF) which learns users' preferences towards the specific campaign products, while also exploiting the information about related products. We exploit the product taxonomy to discover related campaigns, and design models to discriminate between the users' interest in campaign products and non-campaign products. We develop a parallel multi-core implementation of the FMF model and evaluate its performance over a real-world advertising dataset spanning more than a million products. Our experiments demonstrate the benefits of using our models over existing approaches.

Citations: 28

Peeking into the optimization of data flow programs with MapReduce-style UDFs

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544927

Fabian Hueske, Mathias Peters, Aljoscha Krettek, M. Ringwald, K. Tzoumas, V. Markl, J. Freytag

Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs, which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.
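
To make the idea of reasoning about reorderability concrete, here is a minimal sketch of one conservative check that could be run on information extracted from UDFs. The abstract does not specify what the static analysis extracts; the read/write field sets, the `Op` type, and the `can_swap` rule below are illustrative assumptions, not the authors' actual conditions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """Hypothetical summary of a record-at-a-time UDF operator."""
    name: str
    reads: frozenset    # fields the UDF reads (assumed output of code analysis)
    writes: frozenset   # fields the UDF creates or modifies

def can_swap(a: Op, b: Op) -> bool:
    """Conservative test: adjacent operators may be swapped only if there is
    no read-write or write-write conflict between their field sets."""
    return not (a.writes & b.reads or b.writes & a.reads or a.writes & b.writes)

# Hypothetical operators in a small data flow.
enrich = Op("enrich", frozenset({"url"}), frozenset({"domain"}))
filter_lang = Op("filter_lang", frozenset({"lang"}), frozenset())
filter_domain = Op("filter_domain", frozenset({"domain"}), frozenset())

print(can_swap(enrich, filter_lang))    # filter does not depend on enrich
print(can_swap(enrich, filter_domain))  # filter reads what enrich writes
```

Under this rule, `filter_lang` could be pushed below `enrich` (a filter push-down), while `filter_domain` must stay after it because it reads the field that `enrich` produces.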

Citations: 32

Stratification driven placement of complex data: A framework for distributed data analytics

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544868

Ye Wang, S. Parthasarathy, P. Sadayappan

With the increasing popularity of XML data stores, social networks, and Web 2.0 and 3.0 applications, complex data formats such as trees and graphs are becoming ubiquitous. Managing and processing such large and complex data stores on modern computational ecosystems, so that actionable information can be realized efficiently, is an important challenge. A critical element at the heart of this challenge is the placement, storage, and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently, strata are partitioned within this ecosystem according to the needs of the application to maximize locality, balance load, or minimize data skew. Results on several real-world applications validate the efficacy and efficiency of our approach.

Citations: 19

LinkProbe: Probabilistic inference on large-scale social networks

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544833

Haiquan Chen, Wei-Shinn Ku, Haixun Wang, L. Tang, Min-Te Sun

As one of the most important Semantic Web applications, social network analysis has attracted more and more interest from researchers due to the rapidly increasing availability of massive social network data. A desirable solution for social network analysis should address the following issues. First, in many real-world applications, inference rules are only partially correct. An ideal solution should be able to handle partially correct rules. Second, applications in practice often involve large amounts of data. The inference mechanism should scale to large-scale data. Third, inference methods should take into account probabilistic evidence data because these are domains abounding with uncertainty. Various solutions for social network analysis have existed for quite a few years; however, none of them supports all the aforementioned features. In this paper, we design and implement LinkProbe, a prototype powered by Markov Logic Networks (MLNs) that quantitatively predicts the existence of links among nodes in large-scale social networks. MLNs have been shown to be an effective inference model which can handle complex dependencies and partially correct rules. More importantly, although MLNs have shown acceptable performance in prior work, they are also reported to be impractical for handling large-scale data because of their high demands on inference time and memory consumption. In order to overcome these limitations, LinkProbe retrieves the k-backbone graphs and conducts the MLN inference on both the most globally influential nodes and the most locally related nodes. Our extensive experiments show that LinkProbe manages to provide a tunable balance between MLN inference accuracy and inference efficiency.

Citations: 16

Ficklebase: Looking into the future to erase the past

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544816

Sumeet Bajaj, R. Sion

It has become apparent that in the digital world, data once stored is never truly deleted, even when such an expunction is desired either as a normal system function or for regulatory compliance purposes. Forensic analysis techniques on systems are often successful at recovering information said to have been deleted in the past. Efforts aimed at thwarting such forensic analysis of systems have focused either on (i) identifying the system components where deleted data lingers and performing a secure delete operation over these remnants, or (ii) designing history-independent data structures that hide information about the past operations which resulted in the current system state. Yet new data is constantly derived by processing existing (input) data, which makes it increasingly difficult to remove all traces of this existing data, i.e., for regulatory compliance purposes. Even after deletion, significant information can linger in, and be recoverable from, the side effects that the deleted data records left on the currently available state. In this paper we address this aspect in the context of a relational database, such that when combined with (i) and (ii), complete erasure of data and its effects can be achieved (“un-traceable deletion”). We introduce Ficklebase, a relational database in which, once a tuple has “expired”, any and all of its side effects are removed, thereby eliminating all its traces, rendering it unrecoverable, and also guaranteeing that the deletion itself is undetectable. We present the design and evaluation of Ficklebase, and then discuss several of the fundamental functional implications of un-traceable deletion.

Citations: 18

Top-K oracle: A new way to present top-k tuples for uncertain data

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544821

Chunyao Song, Zheng Li, Tingjian Ge

Managing noisy and uncertain data is needed in a great number of modern applications. A major difficulty in managing such data is the sheer number of query result tuples with diverse probabilities. In many cases, users have a preference over the tuples in a deterministic world, determined by a scoring function. Yet returning the top-k tuples for uncertain data has been a challenging problem. Various semantics have been proposed, and they have been shown to give wildly different tuple rankings. In this paper, we propose a completely different approach. Instead of returning k tuples to users, which are merely one point in the complex distribution of top-k tuple vectors, we provide a so-called top-k oracle that users can query arbitrarily. Intuitively, an oracle is a black box that, whenever given an SQL query, returns its result. Any information we give is based on faithful, best-effort estimates of the ground-truth top-k tuples. This is especially critical in emergency response applications and in monitoring top-k applications. Furthermore, we are the first to provide the nested query capability with the uncertain top-k result being a subquery. We devise various query processing algorithms for top-k oracles, and verify their efficiency and accuracy through a systematic evaluation over real-world and synthetic datasets.
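
The phrase "distribution of top-k tuple vectors" can be illustrated with a brute-force sketch that enumerates possible worlds. This is not the paper's algorithm (it is exponential in the number of tuples), and it assumes a simple uncertainty model that the abstract does not specify: each tuple exists independently with a given probability. The function name `topk_distribution` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def topk_distribution(tuples, k):
    """tuples: list of (tuple_id, score, existence_probability).
    Assumption: tuples exist independently of one another.
    Returns {top-k vector of ids: probability} over all possible worlds."""
    dist = defaultdict(float)
    for present in product([False, True], repeat=len(tuples)):
        p = 1.0
        world = []
        for (tid, score, prob), here in zip(tuples, present):
            p *= prob if here else 1.0 - prob
            if here:
                world.append((score, tid))
        if p == 0.0:
            continue                      # impossible world
        # Rank the tuples that exist in this world by score, highest first.
        top = tuple(tid for _, tid in sorted(world, reverse=True)[:k])
        dist[top] += p
    return dict(dist)

# Hypothetical uncertain relation: (id, score, probability of existing).
uncertain = [("t1", 90, 0.5), ("t2", 80, 1.0), ("t3", 70, 0.5)]

for vector, prob in sorted(topk_distribution(uncertain, 2).items()):
    print(vector, prob)
```

Even in this three-tuple example, three different top-2 vectors have non-zero probability, so any single returned answer hides part of the distribution.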

Citations: 9

Extracting interesting related context-dependent concepts from social media streams using temporal distributions

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544931

C. Sayers, M. Hsu

To enable the interactive exploration of large social media datasets, we exploit the temporal distributions of word n-grams within the message stream to discover “interesting” concepts, determine “relatedness” between concepts, and find representative examples for display. We present a new algorithm for context-dependent “interestingness” using the coefficient of variation of the temporal distribution, apply the well-known technique of Pearson's correlation to tweets using equi-height histogramming to determine correlation, and employ an asymmetric variant for computing “relatedness” to encourage exploration. We further introduce techniques using interestingness, correlation, and relatedness to automatically discover concepts and select preferred word n-grams for display. These techniques are demonstrated on an 800,000-tweet dataset from the Academy Awards.
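
The two statistics named above can be sketched on per-bucket n-gram counts. This is only an illustration of the standard formulas, under assumptions that differ from the abstract: it uses fixed time buckets rather than equi-height histogramming, it omits the asymmetric relatedness variant, and the counts and names below are hypothetical.

```python
import math

def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean of the counts.
    Returns 0.0 for an all-zero series."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(var) / mean

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

# Hypothetical counts of three n-grams in four consecutive time buckets.
flat = [5, 5, 5, 5]          # steady background term
bursty = [0, 0, 20, 0]       # term that spikes in one bucket
co_bursty = [1, 1, 41, 1]    # term that spikes at the same time

print(coefficient_of_variation(flat))
print(coefficient_of_variation(bursty))
print(pearson(bursty, co_bursty))
```

A term with a flat temporal distribution scores zero, a term concentrated in one bucket scores high, and two terms that burst together are strongly correlated.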

Citations: 0

On answering why-not questions in reverse skyline queries

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544890

Md. Saiful Islam, Rui Zhou, Chengfei Liu

This paper aims at answering the so-called why-not questions in reverse skyline queries. A reverse skyline query retrieves all data points whose dynamic skylines contain the query point. We outline the benefit and the semantics of answering why-not questions in reverse skyline queries. In connection with this, we show how to modify the why-not point and the query point to include the why-not point in the reverse skyline of the query point. We then show how a query point can be positioned safely anywhere within a region (called the safe region) without losing any of the existing reverse skyline points. We also show how to answer why-not questions considering the safe region of the query point. Our approach efficiently combines both query point and data point modification techniques to produce meaningful answers. Experimental results also demonstrate that our approach can produce high-quality explanations for why-not questions in reverse skyline queries.
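
The definition of a reverse skyline query given above can be written down directly as a brute-force check. This sketch uses the usual notion of dynamic dominance (closer or equal in every dimension to the reference point, strictly closer in at least one) and is quadratic in the number of points; it is not the paper's method, and the function names and sample points are hypothetical.

```python
def dyn_dominates(a, b, origin):
    """True if `a` dynamically dominates `b` with respect to `origin`:
    a is at least as close to origin in every dimension and strictly
    closer in at least one."""
    da = [abs(x - o) for x, o in zip(a, origin)]
    db = [abs(x - o) for x, o in zip(b, origin)]
    return all(i <= j for i, j in zip(da, db)) and any(i < j for i, j in zip(da, db))

def reverse_skyline(points, q):
    """Data points p whose dynamic skyline contains q, i.e. no other data
    point dynamically dominates q with respect to p."""
    result = []
    for i, p in enumerate(points):
        others = (x for j, x in enumerate(points) if j != i)
        if not any(dyn_dominates(x, q, p) for x in others):
            result.append(p)
    return result

# Hypothetical 2-D data set and query point.
points = [(1, 1), (2, 2), (6, 6)]
q = (3, 3)

print(reverse_skyline(points, q))
```

In this example (1, 1) is missing from the result because (2, 2) is closer to it than q in both dimensions, which is the kind of situation a why-not question asks about.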

Citations: 74