Computing shortest distances is a central task in many graph applications. Since it is impractical to recompute shortest distances from scratch every time the graph changes, many algorithms have been proposed to incrementally maintain shortest distances after edge deletions or insertions. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs and propose novel, efficient incremental algorithms that work both in main memory and on disk. We prove their correctness and provide complexity analyses. Experimental results on real-world datasets show that current main-memory algorithms soon become impractical, that disk-based algorithms are needed for larger graphs, and that our approach significantly outperforms state-of-the-art algorithms.
{"title":"Efficient Maintenance of All-Pairs Shortest Distances","authors":"S. Greco, Cristian Molinaro, Chiara Pulice","doi":"10.1145/2949689.2949713","DOIUrl":"https://doi.org/10.1145/2949689.2949713","url":null,"abstract":"Computing shortest distances is a central task in many graph applications. Since it is impractical to recompute shortest distances from scratch every time the graph changes, many algorithms have been proposed to incrementally maintain shortest distances after edge deletions or insertions. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs and propose novel efficient incremental algorithms, working both in main memory and on disk. We prove their correctness and provide complexity analyses. Experimental results on real-world datasets show that current main-memory algorithms become soon impractical, disk-based ones are needed for larger graphs, and our approach significantly outperforms state-of-the-art algorithms.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117158716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations, and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable with respect to both volume and velocity. Through efficient normalization, we also address the problems that differences in social media adoption across cultures, languages, and countries pose for event detection. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system, which identifies unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with only a fixed amount of memory, compare the values to statistics from earlier data, and immediately report significant deviations with minimal delay. The algorithm is thus capable of reporting "Breaking News" in real time. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels simultaneously. The usefulness of the approach is demonstrated on several real-world use cases with Twitter data.
{"title":"SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams","authors":"Erich Schubert, Michael Weiler, H. Kriegel","doi":"10.1145/2949689.2949699","DOIUrl":"https://doi.org/10.1145/2949689.2949699","url":null,"abstract":"The analysis of social media data poses several challenges: first of all, the data sets are very large, secondly they change constantly, and third they are heterogeneous, consisting of text, images, geographic locations and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable both with respect to volume and velocity. We also address the problems arising from differences in adoption of social media across cultures, languages, and countries in our event detection by efficient normalization. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system, which is able to identify unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track: In contrast to earlier work, we are able to monitor every word at every location with just a fixed amount of memory, compare the values to statistics from earlier data and immediately report significant deviations with minimal delay. Thus, this algorithm is capable of reporting \"Breaking News\" in real-time. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time. The usefulness of the approach is demonstrated using several real-world example use cases using Twitter data.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127376516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-match is a central operation in astronomic databases for integrating multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large numbers of astronomic catalogs are generated and require fast cross-matching with existing databases. In this paper, we propose to adopt a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match because (1) cross-matching records from astronomic catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match algorithm (MASJ-CM) is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communication; the other runs on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA one; nevertheless, it is widely applicable and its source code is much simpler.
{"title":"Multi-Assignment Single Joins for Parallel Cross-Match of Astronomic Catalogs on Heterogeneous Clusters","authors":"Xiaoying Jia, Qiong Luo","doi":"10.1145/2949689.2949705","DOIUrl":"https://doi.org/10.1145/2949689.2949705","url":null,"abstract":"Cross-match is a central operation in astronomic databases to integrate multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large amounts of astronomic catalogs are generated and require fast cross-match with existing databases. In this paper, we propose to adopt a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match, because (1) cross-matching records from astronomic catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match, or MASJ-CM algorithm is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communications; the other is on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA; nevertheless, it is widely applicable and its source code much simpler.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116619436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises those who are not worse off in all qualities in comparison to any other man (or woman). We prove that if a man in the skyline set is matched with a woman in the skyline set, the resulting pair is stable. Based on this observation, we design our algorithm, SMS, which runs the matching algorithm in phases considering only the skyline sets. In addition to being efficient, SMS provides two important properties. The first is progressiveness: stable pairs are output without waiting for the entire algorithm to finish. The second is balance in quality between men and women, since the proposers are switched automatically between the sets. Empirical results show that SMS runs orders of magnitude faster than the original Gale-Shapley algorithm and produces better-quality matchings.
{"title":"SMS: Stable Matching Algorithm using Skylines","authors":"Rohit Anurag, Arnab Bhattacharya","doi":"10.1145/2949689.2949712","DOIUrl":"https://doi.org/10.1145/2949689.2949712","url":null,"abstract":"In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises of those who are not worse off in all the qualities in comparison to another man (or woman). We prove that if a man in the skyline set is matched with a woman in the skyline set, the resulting pair is stable. We design our algorithm, SMS, based on the above observation by running the matching algorithm in phases considering only the skyline sets. In addition to being efficient, SMS provides two important additional properties. The first is progressiveness where stable pairs are output without waiting for the entire algorithm to finish. The second is balance in quality between men versus women since the proposers are switched automatically between the sets. Empirical results show that SMS runs orders of magnitude faster than the original Gale-Shapley algorithm and produces better quality matchings.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114478817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The big data paradigm is currently the leading paradigm for data production and management. Indeed, new information is generated at high rates in specialized fields (e.g., the cybersecurity scenario). As a consequence, the events to be studied may occur at rates too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. A viable solution to this problem is the use of data compression to reduce the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, which provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system, which has been used in a real-life scenario.
{"title":"Privacy or Security?: Take A Look And Then Decide","authors":"Bettina Fazzinga, F. Furfaro, E. Masciari, G. Mazzeo","doi":"10.1145/2949689.2949706","DOIUrl":"https://doi.org/10.1145/2949689.2949706","url":null,"abstract":"Big data paradigm is currently the leading paradigm for data production and management. As a matter of fact, new information are generated at high rates in specialized fields (e.g., cybersecurity scenario). This may cause that the events to be studied occur at rates that are too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. To ameliorate this problem, a viable solution is the use of data compression for reducing the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, that provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system that has been used in a real life scenario.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122482624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compact data structures combine, in a single data structure, a compressed representation of the data and the structures needed to access it. The goal is to manage data directly in compressed form and, in this way, keep data compressed at all times, even in main memory. This yields two benefits: we can manage larger datasets in main memory, and we make better use of the memory hierarchy. In this work, we present a compact data structure to represent raster data, which is commonly used in geographical information systems to represent attributes of the space (e.g., temperatures or elevation measures). The proposed data structure not only represents a dataset in compressed form and accesses an individual datum without decompressing the dataset from the beginning, but also indexes its content, and is thus capable of speeding up queries. There have been previous attempts to represent raster data using compact data structures, which work well when the raster dataset has few distinct values. However, when the range of possible values increases, performance degrades in both space and time. Our new method competes with previous approaches in the first scenario, but scales much better when the number of distinct values and the size of the dataset increase, which is critical when applied to real datasets.
{"title":"Compact and queryable representation of raster datasets","authors":"Susana Ladra, J. Paramá, Fernando Silva-Coira","doi":"10.1145/2949689.2949710","DOIUrl":"https://doi.org/10.1145/2949689.2949710","url":null,"abstract":"Compact data structures combine in a unique data structure a compressed representation of the data and the structures to access such data. The target is to be able to manage data directly in compressed form, and in this way, to keep data always compressed, even in main memory. With this, we obtain two benefits: we can manage larger datasets in main memory and we take advantage of a better usage of the memory hierarchy. In this work, we present a compact data structure to represent raster data, which is commonly used in geographical information systems to represent attributes of the space (i.e., temperatures, elevation measures, etc.). The proposed data structure is not only able to represent a dataset in compressed form and access to an individual datum without decompressing the dataset from the beginning, but it also indexes its content, and thus it is capable of speeding up queries. There have been previous attempts to represent raster data using compact data structures, which work well when the raster dataset has few different values. However, when the range of possible values increases, performance in both space and time degrades. Our new method competes with previous approaches in that first scenario, but scales much better when the number of different values and the size of the dataset increase, which is critical when applying over real datasets.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"155 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116626593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile participatory sensing could be used in many applications, such as vehicular traffic monitoring, pollution tracking, or even health surveying. However, its success depends on finding a solution for querying large numbers of users that protects user location privacy and works in real time. This paper presents PAMPAS, a privacy-aware mobile distributed system for efficient data aggregation in mobile participatory sensing. In PAMPAS, mobile devices enhanced with secure hardware, called secure probes (SPs), perform distributed query processing while preventing users from accessing other users' data. A supporting server infrastructure (SSI) coordinates the inter-SP communication and the computation tasks executed on SPs. PAMPAS ensures that the SSI cannot link the locations reported by SPs to user identities, even if the SSI has additional background information. Beyond its novel system architecture, PAMPAS also contributes two new protocols, for privacy-aware location-based aggregation and for adaptive spatial partitioning of SPs, that work efficiently on resource-constrained SPs. Our experimental results and security analysis demonstrate that these protocols can collect the data, aggregate them, and share statistics or derived models in real time, without any location privacy leakage.
{"title":"PAMPAS: Privacy-Aware Mobile Participatory Sensing Using Secure Probes","authors":"Dai Hai Ton That, I. S. Popa, K. Zeitouni, C. Borcea","doi":"10.1145/2949689.2949704","DOIUrl":"https://doi.org/10.1145/2949689.2949704","url":null,"abstract":"Mobile participatory sensing could be used in many applications such as vehicular traffic monitoring, pollution tracking, or even health surveying. However, its success depends on finding a solution for querying large numbers of users which protects user location privacy and works in real-time. This paper presents PAMPAS, a privacy-aware mobile distributed system for efficient data aggregation in mobile participatory sensing. In PAMPAS, mobile devices enhanced with secure hardware, called secure probes (SPs), perform distributed query processing, while preventing users from accessing other users' data. A supporting server infrastructure (SSI) coordinates the inter-SP communication and the computation tasks executed on SPs. PAMPAS ensures that SSI cannot link the location reported by SPs to the user identities even if SSI has additional background information. In addition to its novel system architecture, PAMPAS also proposes two new protocols for privacy-aware location-based aggregation and adaptive spatial partitioning of SPs that work efficiently on resource-constrained SPs. Our experimental results and security analysis demonstrate that these protocols are able to collect the data, aggregate them, and share statistics or derived models in real-time, without any location privacy leakage.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134186366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, there has been increasing interest in analyzing scientific data generated by observations and experiments. To manage these data efficiently, SciDB, a multi-dimensional array-based DBMS, has been proposed. When SciDB processes a query with WHERE predicates, it uses the filter operator internally to produce a result array that matches the predicates. Most queries for scientific data analysis utilize spatial information. However, SciDB's filter operator reads all data, taking into account neither the features of array-based DBMSs nor spatial information. In this demo, we present an efficient query processing scheme that exploits the characteristics of array-based data, implemented by employing coordinates. It uses a selective scan that retrieves only the data in ranges satisfying the given conditions. In our experiments, the selective scan is up to 30x faster than the original scan. We demonstrate that our implementation of the filter operator reduces the processing time of a selection query significantly and enables SciDB to handle massive amounts of scientific data in a more scalable manner.
{"title":"Selective Scan for Filter Operator of SciDB","authors":"Sangchul Kim, Seoung Gook Sohn, Taehoon Kim, Jinseon Yu, Bogyeong Kim, Bongki Moon","doi":"10.1145/2949689.2949707","DOIUrl":"https://doi.org/10.1145/2949689.2949707","url":null,"abstract":"Recently there has been an increasing interest in analyzing scientific data generated by observations and scientific experiments. For managing these data efficiently, SciDB, a multi-dimensional array-based DBMS, is suggested. When SciDB processes a query with where predicates, it uses filter operator internally to produce a result array that matches the predicates. Most queries for scientific data analysis utilize spatial information. However, filter operator of SciDB reads all data without considering features of array-based DBMSs and spatial information. In this demo, we present an efficient query processing scheme utilizing characteristics of array-based data, implemented by employing coordinates. It uses a selective scan that retrieves data corresponding to a range that satisfies specific conditions. In our experiments, the selective scan is up to 30x faster than the original scan. We demonstrate that our implementation of the filter operator will reduce the processing time of a selection query significantly and enable SciDB to handle a massive amount of scientific data in more scalable manner.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126734672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data scientists rely on vector-based scripting languages such as R, Python, and MATLAB to perform ad-hoc data analysis on potentially large data sets. These languages are only efficient when data is processed using vectorized or bulk operations. At the same time, the overwhelming volume and variety of data, as well as parsing overhead, suggest that using specialized analytical data management systems would be beneficial. Data might also already be stored in a database. Efficiently executing data analysis programs, such as data mining, directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated into the processing model of operator-at-a-time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.
{"title":"Vectorized UDFs in Column-Stores","authors":"Mark Raasveldt, H. Mühleisen","doi":"10.1145/2949689.2949703","DOIUrl":"https://doi.org/10.1145/2949689.2949703","url":null,"abstract":"Data Scientists rely on vector-based scripting languages such as R, Python and MATLAB to perform ad-hoc data analysis on potentially large data sets. When facing large data sets, they are only efficient when data is processed using vectorized or bulk operations. At the same time, overwhelming volume and variety of data as well as parsing overhead suggests that the use of specialized analytical data management systems would be beneficial. Data might also already be stored in a database. Efficient execution of data analysis programs such as data mining directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated in the processing model of operator--at--a--time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126500811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data exchange is one of the oldest database problems, of both practical and theoretical interest. Given the pace at which heterogeneous data are published on the web, thanks to initiatives such as Linked Data and Open Science, the scalability of data exchange becomes crucial. Pivotal to data exchange is the chase, a fixpoint algorithm that evaluates both source-to-target constraints and target constraints in the data exchange process. In this paper, we investigate how new programming models such as MapReduce can be used to implement the chase on large-scale data sources. To the best of our knowledge, how to exchange data at scale has not been investigated so far. We present an initial solution for chasing source-to-target tuple-generating dependencies and target tuple-generating dependencies, and discuss open issues that need to be addressed to leverage MapReduce for the data exchange problem.
{"title":"Data Exchange with MapReduce: A First Cut","authors":"Khalid Belhajjame, A. Bonifati","doi":"10.1145/2949689.2949702","DOIUrl":"https://doi.org/10.1145/2949689.2949702","url":null,"abstract":"Data exchange is one of the oldest database problems, being of both practical and theoretical interest. Given the pace at which heterogeneous data are published on the web, thanks to initiatives such as Linked Data and Open Science, scalability of data exchange becomes crucial. Pivotal to data exchange is the chase algorithm, which is a fixpoint algorithm to evaluate both source-to-target constraints and target constraints in the data exchange process. In this paper, we investigate how new programming models such as MapReduce can be used to implement the chase on large-scale data sources. To the best of our knowledge, how to exchange data at scale has not been investigated so far. We present an initial solution for chasing source-to-target tuple generating dependencies and target tuple-generating dependencies, and discuss open issues that need to be addressed to leverage MapReduce for the data exchange problem.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133339056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}