
Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: Latest Publications

Integrating non-spatial preferences into spatial location queries
Qiang Qu, Siyuan Liu, B. Yang, Christian S. Jensen
Increasing volumes of geo-referenced data are becoming available. This data includes so-called points of interest that describe businesses, tourist attractions, etc. by means of a geo-location and properties such as a textual description or ratings. We propose and study the efficient implementation of a new kind of query on points of interest that takes into account both the locations and properties of the points of interest. The query takes a result cardinality, a spatial range, and property-related preferences as parameters, and it returns a compact set of points of interest with the given cardinality and in the given range that satisfies the preferences. Specifically, the points of interest in the result set cover so-called allying preferences and are located far from points of interest that possess so-called alienating preferences. A unified result rating function integrates the two kinds of preferences with spatial distance to achieve this functionality. We provide efficient exact algorithms for this kind of query. To enable queries on large datasets, we also provide an approximate algorithm that utilizes a nearest-neighbor property to achieve scalable performance. We develop and apply lower and upper bounds that enable search-space pruning and thus improve performance. Finally, we provide a generalization of the above query and also extend the algorithms to support the generalization. We report on an experimental evaluation of the proposed algorithms using real point of interest data from Google Places for Business that offers insight into the performance of the proposed solutions.
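The abstract does not specify the unified result rating function; the following is a toy sketch of how allying coverage, distance from alienating points of interest, and spatial distance might be combined into one score. All field names, the weight `w`, and the distance cap are hypothetical, not the paper's actual function.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) locations."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def rate(result, allying, alienating, query_loc, w=0.5):
    """Toy unified rating for a candidate result set of POIs:
    reward coverage of allying preferences, reward distance from
    alienating POIs, and penalise distance from the query location."""
    # Fraction of allying preferences covered by the result set (higher is better).
    covered = set().union(*(poi["tags"] for poi in result))
    coverage = len(covered & allying) / len(allying)
    # Mean distance from the query location (lower is better).
    spatial = sum(dist(poi["loc"], query_loc) for poi in result) / len(result)
    # Distance to the nearest alienating POI (higher is better, capped at 10).
    repel = min((dist(p["loc"], a["loc"]) for p in result for a in alienating),
                default=math.inf)
    return coverage + w * min(repel, 10.0) / 10.0 - w * spatial
```

Under this sketch, a result set that covers more preferences while staying close to the query point and away from alienating POIs receives a higher rating, which is the behaviour the query's result selection needs.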
DOI: 10.1145/2618243.2618247. Proceedings of SSDBM 2014, pages 8:1-8:12. Published 2014-06-30.
Citations: 26
Helping scientists reconnect their datasets
Abdussalam Alawini, D. Maier, K. Tufte, Bill Howe
It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing. We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
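The containment relationship used as the running example above can be illustrated with a minimal row-set check. This is only a sketch of the relationship space ReConnect explores; the paper's algorithms handle discovery efficiently rather than by brute-force set comparison.

```python
def containment(a_rows, b_rows):
    """Classify the relationship between two datasets given as lists of rows:
    'equal', 'A-in-B', 'B-in-A', 'overlap', or 'disjoint'."""
    a, b = set(map(tuple, a_rows)), set(map(tuple, b_rows))
    if a == b:
        return "equal"
    if a <= b:          # every row of A appears in B
        return "A-in-B"
    if b <= a:
        return "B-in-A"
    return "overlap" if a & b else "disjoint"
```

If `containment(A, B)` returns `"A-in-B"`, B is a candidate for being the more recent version and the one to prefer when archiving or publishing.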
DOI: 10.1145/2618243.2618263. Proceedings of SSDBM 2014, pages 29:1-29:12. Published 2014-06-30.
Citations: 13
Efficient temporal shortest path queries on evolving social graphs
Wenyu Huo, V. Tsotras
Graph-like data appears in many applications, such as social networks, internet hyperlinks, roadmaps, etc. and in most cases, graphs are dynamic, evolving through time. In this work, we study the problem of efficient shortest-path query evaluation on evolving social graphs. Our shortest-path queries are "temporal": they can refer to any time-point or time-interval in the graph's evolution, and corresponding valid answers should be returned. To efficiently support this type of temporal query, we extend the traditional Dijkstra's algorithm to compute shortest-path distance(s) for a time-point or a time-interval. To speed up query processing, we explore preprocessing index techniques such as Contraction Hierarchies (CH). Moreover, we examine how to maintain the evolving graph along with the index by utilizing temporal partition strategies. Experimental evaluations on real world datasets and large synthetic datasets demonstrate the feasibility and scalability of our proposed efficient techniques and optimizations.
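A time-point query of the kind described above can be sketched by restricting Dijkstra's algorithm to the edge versions valid at the query time. The edge representation (validity intervals per edge) is an assumption for illustration; the paper's Contraction Hierarchies indexing and time-interval queries are not shown.

```python
import heapq

def temporal_dijkstra(edges, source, target, t):
    """Shortest-path distance at time point t over an evolving graph.
    edges: {(u, v): [(start, end, weight), ...]} with half-open
    validity intervals [start, end)."""
    # Build the snapshot adjacency list valid at time t.
    adj = {}
    for (u, v), versions in edges.items():
        for start, end, w in versions:
            if start <= t < end:
                adj.setdefault(u, []).append((v, w))
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return float("inf")
```

Because the snapshot changes with `t`, the same source/target pair can have different valid answers at different time points, which is exactly what a temporal query must return.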
DOI: 10.1145/2618243.2618282. Proceedings of SSDBM 2014, pages 38:1-38:4. Published 2014-06-30.
Citations: 38
Inverse predictions on continuous models in scientific databases
A. M. Zimmer, Philip Driessen, P. Kranen, T. Seidl
Using continuous models in scientific databases has received increased attention in recent years. It allows for more efficient and accurate querying, as well as predictions of the outputs even where no measurements were performed. The most common queries ask what the output looks like for a given input setting. In this paper we study inverse model-based queries on continuous models, where one specifies a desired output and searches for the appropriate input setting, which falls into the reverse engineering category. We propose two possible approaches. The first one is an extension of the inverse regression paradigm. But simply switching the roles of input and output variables poses new challenges, which we overcome by using partial least squares. The second approach formulates the inverse prediction queries as linear optimization problems. We show that even though these two approaches seem completely different, they are closely related, and that the latter is more general. It facilitates the formulation of a wide range of queries, with specifications of fixed values and ranges in both input and output space, enabling the intuitive exploration of the experimental data and understanding of the underlying process.
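For intuition, the inverse-prediction idea reduces in the simplest case to inverting a known linear model: given a desired output y, find the input x with Wx = y. The explicit 2x2 solve below is only a stand-in for the paper's general linear-optimization formulation (which also supports ranges and non-invertible models).

```python
def invert_linear_model(W, y):
    """Exact inverse prediction for a 2x2 linear model W @ x = y,
    solved via Cramer's rule. Raises if the model is singular."""
    (a, b), (c, d) = W
    det = a * d - b * c
    if det == 0:
        raise ValueError("model not invertible; a least-squares "
                         "formulation would be needed instead")
    return [(d * y[0] - b * y[1]) / det,
            (a * y[1] - c * y[0]) / det]
```

In the general setting the desired output may be a range rather than a point, which is why the paper casts the query as a linear optimization problem instead of a direct solve.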
DOI: 10.1145/2618243.2618249. Proceedings of SSDBM 2014, pages 26:1-26:12. Published 2014-06-30.
Citations: 2
Offline cleaning of RFID trajectory data
Bettina Fazzinga, S. Flesca, F. Furfaro, F. Parisi
An offline cleaning technique is proposed for translating the readings generated by RFID-tracked moving objects into positions over a map. It consists of a grid-based two-way filtering scheme embedding a sampling strategy for addressing missing detections. The readings are first processed in time order: at each time point t, the positions (i.e., cells of a grid assumed over the map) compatible with the reading at t are filtered according to their reachability from the positions that survived the filtering for the previous time point. Then, the positions that survived the first filtering are re-filtered, applying the same scheme in inverse order. As the two phases proceed, a probability is progressively evaluated for each candidate position at each time point t: at the end, this probability assembles the three probabilities of being the actual position given the past and future positions, and given the reading at t. A sampling procedure is employed at certain steps of the first filtering phase to intelligently reduce the number of cells to be considered as candidate positions at the next steps, as their number can grow dramatically in the presence of consecutive missing detections. The proposed approach is experimentally validated and shown to be efficient and effective in accomplishing its task.
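The two-way filtering scheme can be sketched as a forward pass followed by a backward pass over per-time-point candidate cells. This omits the probability evaluation and the sampling strategy; `reachable` maps a cell to the set of cells an object can move to in one time step, a simplification of the grid model.

```python
def two_pass_filter(candidates, reachable):
    """Two-way filtering of candidate grid cells.
    candidates: list of cell sets, one per time point.
    reachable:  {cell: set of cells reachable in one time step}."""
    # Forward pass: keep cells reachable from a survivor at the previous step.
    fwd = [set(candidates[0])]
    for cells in candidates[1:]:
        prev = fwd[-1]
        fwd.append({c for c in cells
                    if any(c in reachable.get(p, ()) for p in prev)})
    # Backward pass: keep cells from which a survivor at the next step is reachable.
    bwd = [set(fwd[-1])]
    for cells in reversed(fwd[:-1]):
        nxt = bwd[-1]
        bwd.append({c for c in cells
                    if any(n in reachable.get(c, ()) for n in nxt)})
    return list(reversed(bwd))
```

Cells that survive both passes are consistent with the readings before and after them, which is the precondition for the probability assembly the paper then performs.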
DOI: 10.1145/2618243.2618271. Proceedings of SSDBM 2014, pages 5:1-5:12. Published 2014-06-30.
Citations: 16
Proactive adaptations in sensor network query processing
A. B. Stokes, N. Paton, A. Fernandes
Wireless sensor networks (WSNs) are used by many applications for event and environmental monitoring. Due to the resource-limited nodes in WSNs, there has been much research into extending the functional lifetime of the network through energy-saving techniques. Sensor Network Query Processing (SNQP) is one such technique. SNQP uses information about a query and the WSN over which it is to be run to generate an energy-efficient Query Execution Plan (QEP) that distributes processing in the form of QEP fragments to the nodes in the WSN. However, any QEP is likely to drain the batteries of the nodes unevenly, and, as a result, nodes used in a QEP may run out of energy when there are significant energy stocks still available in the WSN. An adaptive query processor could react to energy depletion, for example, by generating a revised plan that refrains from using the drained nodes. However, adapting only when a node has been depleted may provide few opportunities for the creation of effective new QEPs. In this paper, we introduce an approach that determines, at query compilation time, a sequence of QEPs with switch times for transitioning between successive plans, with a view to extending the overall lifetime of the query. We describe how this approach has been implemented as an extension to an existing SNQP and present experimental results indicating that it can significantly increase QEP lifetimes.
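The benefit of precomputed switch times can be illustrated with a deliberately simplified energy model: each plan drains a fixed amount of battery per epoch from the nodes it uses, and the compiler searches for the switch time between two candidate plans that maximizes total lifetime. The model and search are assumptions for illustration, not the paper's actual cost model.

```python
def plan_lifetime(batteries, drain):
    """Epochs until the first node a plan uses is exhausted."""
    return min(batteries[n] / d for n, d in drain.items() if d > 0)

def best_switch(batteries, drain1, drain2):
    """Brute-force search over integer switch times for a two-plan
    sequence: run plan 1 for s epochs, then plan 2 on what remains.
    Returns (switch_time, total_lifetime)."""
    best = (0, plan_lifetime(batteries, drain2))
    limit = int(plan_lifetime(batteries, drain1))
    for s in range(limit + 1):
        left = {n: batteries[n] - s * drain1.get(n, 0) for n in batteries}
        total = s + plan_lifetime(left, drain2)
        if total > best[1]:
            best = (s, total)
    return best
```

When the two plans drain disjoint nodes, switching lets the query outlive either plan run alone, which is the intuition behind compiling a sequence of QEPs rather than a single one.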
DOI: 10.1145/2618243.2618267. Proceedings of SSDBM 2014, pages 23:1-23:12. Published 2014-06-30.
Citations: 3
PStore: an efficient storage framework for managing scientific data
Souvik Bhattacherjee, A. Deshpande, A. Sussman
In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data generated by scientific simulations. PStore consists of two modules, a data ingestion module and a query processing module, that respectively address two of the key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. On the other hand, the query processing module is in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on dicing (also called range) queries. PStore provides a suite of compression schemes that leverage, and in some cases extend, existing techniques to provide support for diverse scientific simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of compression, and hides expensive disk latencies for long running range queries by exploiting chunk prefetching. In addition, we also parallelize the query processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model CWRF [5]. In this paper, we use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
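The ingestion/query split can be sketched in miniature: chunk the array, compress each chunk independently, and answer a range query by decompressing only the overlapping chunks. The `zlib`+JSON encoding is a stand-in for PStore's suite of compression schemes, and the two-level chunking and prefetching are not modelled.

```python
import json
import zlib

def ingest(array, chunk_size):
    """Chunk a 1-D array and compress each chunk independently."""
    chunks = []
    for i in range(0, len(array), chunk_size):
        raw = json.dumps(array[i:i + chunk_size]).encode()
        chunks.append(zlib.compress(raw))
    return chunks

def range_query(chunks, chunk_size, lo, hi):
    """Return array[lo:hi], decompressing only overlapping chunks."""
    out = []
    for idx in range(lo // chunk_size, (hi - 1) // chunk_size + 1):
        data = json.loads(zlib.decompress(chunks[idx]))
        start = idx * chunk_size
        out.extend(data[max(lo - start, 0):hi - start])
    return out
```

Because each chunk decompresses on its own, a dicing query over a small range touches a small, predictable set of chunks, which is what makes per-chunk compression compatible with efficient range queries.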
DOI: 10.1145/2618243.2618268. Proceedings of SSDBM 2014, pages 25:1-25:12. Published 2014-06-30.
Citations: 9
SAGA: array storage as a DB with support for structural aggregations
Yi Wang, Arnab Nandi, G. Agrawal
In recent years, many Array DBMSs, including SciDB and RasDaMan, have emerged to meet the needs of data management applications where the natural structures are arrays. These systems, like their relational counterparts, involve an expensive data ingestion phase. The paradigm of using native storage as a DB and providing database-like support (e.g., the NoDB approach) has recently been shown to be an effective approach for dealing with infrequently queried data, where data ingestion costs cannot be justified, though only in the context of relational data. Applications that generate massive arrays, such as scientific simulations, often store the data in one of a small number of array storage formats, like NetCDF or HDF5. Thus, a natural question is, "can database-like functionality be supported over native array storage?". In this paper, we present algorithms, different partitioning strategies, and an analytical model for supporting structural (grid, sliding, hierarchical, and circular) aggregations over native array storage, and describe the implementation of this approach in a system we refer to as Structural AGgregations over Array storage (SAGA). We show how the relative performance of different partitioning strategies changes with varying amounts of computation in the aggregation function and different levels of data skew, and our model is effective in choosing the best partitioning strategy. Performance comparison with SciDB shows that despite working on native array storage, the aggregation costs with our system are lower. Finally, we also show that our structural aggregation implementations achieve high parallel efficiency.
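Of the structural aggregations named above, the grid aggregation is the simplest to sketch: partition a 2-D array into fixed-size tiles and apply the aggregate to each tile. This in-memory toy ignores SAGA's partitioning strategies and its operation over native storage formats.

```python
def grid_aggregate(array, tile, agg=sum):
    """Grid aggregation over a 2-D array (list of equal-length rows):
    partition into tile x tile blocks and aggregate each block."""
    rows, cols = len(array), len(array[0])
    out = []
    for r in range(0, rows, tile):
        out.append([
            agg(array[i][j]
                for i in range(r, min(r + tile, rows))
                for j in range(c, min(c + tile, cols)))
            for c in range(0, cols, tile)
        ])
    return out
```

Swapping `agg` for `max` or a mean turns the same traversal into a different grid aggregate, which is why the cost of the aggregation function itself matters when choosing a partitioning strategy.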
Yi Wang, Arnab Nandi, G. Agrawal. "SAGA: array storage as a DB with support for structural aggregations." Proceedings of SSDBM 2014, pages 9:1-9:12, published 2014-06-30. DOI: 10.1145/2618243.2618270.
Citations: 54
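The grid variant of the structural aggregations above can be sketched in a few lines: the array is cut into non-overlapping tiles and the aggregation function is applied to each tile independently, which is what makes the operation easy to partition across workers. This is only a minimal NumPy illustration of the idea; the `grid_aggregate` helper, its tile parameter, and the even-divisibility assumption are ours, not SAGA's actual API.

```python
import numpy as np

def grid_aggregate(arr, tile, agg=np.sum):
    """Aggregate a 2-D array over non-overlapping grid tiles.

    A toy analogue of grid aggregation over array storage: the array is
    cut into tile-shaped chunks and `agg` is applied to each chunk
    independently, so chunks can be processed in parallel without
    communication.
    """
    rows, cols = arr.shape
    tr, tc = tile
    assert rows % tr == 0 and cols % tc == 0, "tiles must evenly divide the array"
    # Reshape so each tile becomes a contiguous block, then reduce over
    # the within-tile axes (1 and 3), leaving one value per tile.
    blocks = arr.reshape(rows // tr, tr, cols // tc, tc)
    return agg(blocks, axis=(1, 3))

data = np.arange(16, dtype=float).reshape(4, 4)
print(grid_aggregate(data, (2, 2)))  # one sum per 2x2 tile
```

Sliding, hierarchical, and circular aggregations differ in how tiles overlap or nest, but the same tile-then-reduce structure underlies them.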
Distributed data placement to minimize communication costs via graph partitioning
Lukasz Golab, Marios Hadjieleftheriou, H. Karloff, B. Saha
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, data-intensive scientific workflows and join-intensive queries cannot always be evaluated efficiently using MapReduce-style processing without extensive data migrations, which cause network congestion and reduced query throughput. In this paper, we study the problem of computing data placement strategies that minimize the data communication costs incurred by such workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-Hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives. We study several practical extensions of the problem: with load balancing, with replication, and with complex workflows consisting of multiple steps that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using a workload based on TPC-DS, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
Lukasz Golab, Marios Hadjieleftheriou, H. Karloff, B. Saha. "Distributed data placement to minimize communication costs via graph partitioning." Proceedings of SSDBM 2014, pages 20:1-20:12, published 2014-06-30. DOI: 10.1145/2618243.2618258.
Citations: 49
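The reduction above can be illustrated on a toy instance: model each data item as a graph vertex, weight each edge by how often a workload co-accesses the two endpoints, and note that the communication cost of a placement is exactly the weight of the cut edges. The brute-force bipartition below is a stand-in for the real partitioning step (the paper uses solvers such as METIS at scale); the item names, weights, and exact-half balance constraint are illustrative assumptions.

```python
from itertools import combinations

def communication_cost(edges, assignment):
    """Cut weight: total co-access weight between items on different servers."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])

def best_balanced_bipartition(items, edges):
    """Exhaustively find the min-cut placement of items onto 2 servers,
    with each server holding exactly half the items (load balancing).

    Feasible only for tiny instances; it exists to make the cut-cost
    objective concrete, not to replace a graph-partitioning library.
    """
    best, best_cost = None, float("inf")
    half = len(items) // 2
    for group in combinations(items, half):
        assignment = {x: (0 if x in group else 1) for x in items}
        cost = communication_cost(edges, assignment)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

# Toy workload: edge weight = how often two tables are joined together.
items = ["A", "B", "C", "D"]
edges = {("A", "B"): 10, ("C", "D"): 8, ("A", "C"): 1, ("B", "D"): 2}
placement, cost = best_balanced_bipartition(items, edges)
print(placement, cost)  # heavy pairs (A,B) and (C,D) end up co-located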
DivIDE: efficient diversification for interactive data exploration
Hina A. Khan, M. Sharaf, Abdullah M. Albarrak
Today, Interactive Data Exploration (IDE) has become a main constituent of many discovery-oriented applications, in which users repeatedly submit exploratory queries to identify interesting subspaces in large data sets. Returning relevant yet diverse results to such queries provides users with quick insights into a rather large data space. Meanwhile, search results diversification adds cost to an already computationally expensive exploration process. To address this challenge, in this paper we propose a novel diversification scheme called DivIDE, which targets the problem of efficiently diversifying the results of queries posed during data exploration sessions. In particular, our scheme exploits the properties of data diversification functions while leveraging the natural overlap between the results of different queries, so as to provide significant reductions in processing costs. Our extensive experimental evaluation on both synthetic and real data sets shows the significant benefits provided by our scheme compared to existing methods.
Hina A. Khan, M. Sharaf, Abdullah M. Albarrak. "DivIDE: efficient diversification for interactive data exploration." Proceedings of SSDBM 2014, pages 15:1-15:12, published 2014-06-30. DOI: 10.1145/2618243.2618253.
Citations: 28
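A standard way to diversify a result set, and the piece that DivIDE's overlap reuse can accelerate, is greedy max-min selection: repeatedly pick the candidate farthest from everything selected so far. The sketch below adds a `seed` parameter to mimic reusing still-valid items from the previous query's diversified answer, so only the remaining slots are recomputed; the function name, the 1-D distance, and the seeding mechanism are our illustrative assumptions, not DivIDE's actual algorithm.

```python
def greedy_diversify(candidates, k, dist, seed=()):
    """Greedy max-min diversification: repeatedly add the candidate that
    is farthest from the set selected so far.

    `seed` lets a new query reuse items from a previous query's
    diversified answer (the overlap idea), so only k - len(seed)
    slots need to be filled from scratch.
    """
    selected = [s for s in seed if s in candidates][:k]
    remaining = [c for c in candidates if c not in selected]
    if not selected and remaining:  # cold start: pick an arbitrary anchor
        selected.append(remaining.pop(0))
    while len(selected) < k and remaining:
        # A candidate's diversity score is its distance to the closest
        # already-selected item; take the candidate maximizing it.
        best = max(remaining, key=lambda c: min(dist(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

points = [0.0, 0.1, 0.5, 0.9, 1.0]
print(greedy_diversify(points, 3, lambda a, b: abs(a - b)))
```

On the toy 1-D input, the greedy pass spreads the picks across the range rather than returning three near-duplicates, which is the behavior a diversification function is meant to guarantee.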