33rd International Conference on Scientific and Statistical Database Management: Latest Publications

What is special about spatial data science and Geo-AI?

33rd International Conference on Scientific and Statistical Database Management

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3472263

S. Shekhar

The importance of spatial data science and Geo-AI is growing with the rise of spatial and spatiotemporal big data (e.g., trajectories, remote-sensing images, census data, and geo-social media) [1-2]. Societal use cases include agriculture (global crop monitoring, precision agriculture), location-based services (e.g., navigation, ride-sharing), public health (e.g., monitoring disease spread), environment and climate (change detection, land-cover classification), and smart cities (e.g., mapping buildings) [1-2]. Classical data science and AI methods (e.g., machine learning) often perform poorly on spatial data sets for several reasons [1-5]. First, spatial data is embedded in a continuous space, and classical statistics (e.g., correlation) are not robust to the modifiable areal unit problem. Second, spatial data items have extended footprints (e.g., line strings, polygons) and implicit relationships (e.g., distance, touch). Third, the high cost of spurious patterns requires guardrails (e.g., statistical significance tests) to reduce false positives. Furthermore, spatial autocorrelation and variability violate the classical assumption that data samples are generated independently from identical distributions, which risks producing models that are either inaccurate or inconsistent with the data. Thus, new methods are needed to analyze spatial data [1-5]. This talk surveys common and emerging methods for spatial classification and prediction (e.g., spatial autoregression, spatial decision trees [6], spatial-variability-aware neural networks [7]), as well as techniques for discovering interesting, useful, and non-trivial patterns such as hotspots (e.g., circular, linear, arbitrary shapes [8]), interactions (e.g., co-locations [9], tele-connections), spatial outliers [10], and their spatio-temporal counterparts [3].
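The violated independence assumption can be made concrete with Moran's I, a standard measure of spatial autocorrelation. The following is a minimal sketch, assuming a small regular grid, rook (edge-sharing) neighbours, and binary weights; these choices and the two toy grids are illustrative and are not taken from the talk.

```python
def morans_i(grid):
    """Moran's I for a 2-D grid with rook neighbours and binary weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    cross = 0.0   # sum over neighbour pairs of (x_i - mean) * (x_j - mean)
    weights = 0   # total number of (directed) neighbour pairs
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weights += 1
    variance_sum = sum((v - mean) ** 2 for v in values)
    return (n / weights) * (cross / variance_sum)


# Left half ones, right half zeros: similar values sit next to each other.
clustered = [[1, 1, 0, 0] for _ in range(4)]
# Alternating values: every neighbour differs.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(morans_i(clustered))     # positive: neighbours resemble each other
print(morans_i(checkerboard))  # negative: neighbours are dissimilar
```

A value near zero would be expected for independently generated samples; values far from zero indicate that neighbouring observations carry information about each other.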

Citations: 1

Caching Support for Range Query Processing on Bitmap Indices

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468800

Sarah McClain, Manya Mutschler-Aldine, C. Monaghan, David Chiu, Jason Sawin, Patrick Jarvis

Bitmaps are commonly used for indexing read-mostly data sets. The range of an attribute is split into bins into which its values are placed: b_ij = 1 denotes that the value of the i-th tuple falls in the j-th bin, and b_ij = 0 otherwise. A number of query types can be decomposed into the systematic application of Boolean operators over sets of bins. However, when bitmaps are high-dimensional, overall query-processing performance can deteriorate because more bins participate in each query. We propose a caching framework that organizes, manages, and integrates cached partial results to accelerate query processing on high-dimensional bitmaps. We begin by showing that, for general complex disjunctive and conjunctive queries, selecting an optimal set of partial bitmap results is NP-complete. Restricting the problem to consecutive bin sequences (characteristic of common range and point queries) allows us to solve it efficiently. We present an evaluation of our caching system over several workloads on the TPC-H benchmark and a real network-intrusion data set.
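The binning scheme and the role of cached partial results can be sketched as follows. This is an illustration only: it uses Python integers as bit vectors and an exact-match cache keyed by a consecutive bin range, which is far simpler than the framework the abstract describes. The function names, the bin edges, and the sample data are all assumptions made for the example.

```python
def build_bitmap_index(values, bin_edges):
    """One bitmap per bin; bit i of bitmap j is 1 iff tuple i falls in bin j.

    Bin j covers the half-open interval [bin_edges[j], bin_edges[j + 1]).
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for j in range(len(bitmaps)):
            if bin_edges[j] <= v < bin_edges[j + 1]:
                bitmaps[j] |= 1 << i
                break
    return bitmaps


def range_query(bitmaps, first_bin, last_bin, cache):
    """OR the consecutive bins first_bin..last_bin, reusing a cached result."""
    key = (first_bin, last_bin)
    if key not in cache:
        result = 0
        for j in range(first_bin, last_bin + 1):
            result |= bitmaps[j]
        cache[key] = result
    return cache[key]


def matching_rows(bitmap):
    """Row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


ages = [5, 17, 23, 42, 38, 61, 29]
edges = [0, 20, 40, 60, 80]
index = build_bitmap_index(ages, edges)
cache = {}
hits = range_query(index, 1, 2, cache)  # bins covering 20 <= age < 60
print(matching_rows(hits))
```

A repeated query over the same bin range is answered from `cache` without touching the individual bin bitmaps again.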

Citations: 0

MISE: An Array-Based Integrated System for Atmospheric Scanning LiDAR

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468829

Kyoseung Koo, Juhun Kim, Bongki Moon

Researchers face two problems when building a data processing pipeline for atmospheric scanning LiDAR. First, they must build an entire system that handles signal collection, data processing, and visualization of the results. Second, they must support fast data processing in order to expand and deploy their system. In this paper, we introduce MISE, a fast integrated system that handles atmospheric scanning LiDAR data. MISE provides end-to-end processing, configuration options, and predefined signal-processing methods. In addition, the system uses an efficient chunking approach for fast processing with an array database. Using MISE, we demonstrate the construction and operation of a fine-dust particle monitoring system based on a real-world scenario. The demonstration shows the usability and fast performance of MISE.

Citations: 1

MAMBO - Indexing Dead Space to Accelerate Spatial Queries

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468804

Giannis Evagorou, T. Heinis

With the increasing size and prevalence of spatial data across applications, indexing it efficiently becomes key. Minimum bounding boxes (MBBs), i.e., axis-aligned rectangles that minimally enclose an object, are used as approximations of complex geometric objects and have become crucial for spatial indexes. MBBs succinctly summarize complex spatial objects and thus allow for an efficient filtering stage thanks to faster intersection tests. However, they introduce dead space, i.e., space that is indexed but contains no spatial objects. Querying dead space yields no result but still reads data from disk, thus slowing down query execution unnecessarily. In this paper, we propose MaMBo (Meshed MBb), a grid-based data structure that indexes dead space in addition to an index of the spatial objects. We augment the intersection operations of established indexes to consult our data structure while executing queries, thereby avoiding retrieval of unnecessary data from disk, i.e., data that contains only dead space. As our experiments show, using MaMBo with an R-Tree reduces I/O, the major overhead for disk-resident datasets, by over 50%.
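The notion of dead space can be illustrated on a toy grid. The sketch below assumes an object given as a set of occupied grid cells and a query given as an inclusive cell rectangle; it is not the MaMBo data structure, and all function names are hypothetical. It shows how a query can intersect an object's MBB without touching the object itself.

```python
def mbb(cells):
    """Minimum bounding box (x0, y0, x1, y1) of a set of grid cells."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return min(xs), min(ys), max(xs), max(ys)


def dead_cells(cells):
    """Cells inside the MBB that the object does not occupy."""
    x0, y0, x1, y1 = mbb(cells)
    box = {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}
    return box - set(cells)


def query_hits_only_dead_space(cells, query):
    """True if the query overlaps the MBB but touches no occupied cell."""
    x0, y0, x1, y1 = mbb(cells)
    qx0, qy0, qx1, qy1 = query
    ix0, iy0 = max(x0, qx0), max(y0, qy0)
    ix1, iy1 = min(x1, qx1), min(y1, qy1)
    if ix0 > ix1 or iy0 > iy1:
        return False  # the query does not intersect the MBB at all
    occupied = set(cells)
    return all(
        (x, y) not in occupied
        for x in range(ix0, ix1 + 1)
        for y in range(iy0, iy1 + 1)
    )


# A diagonal line segment: its MBB is a 5 x 5 square, mostly empty.
diagonal = [(i, i) for i in range(5)]
print(len(dead_cells(diagonal)))                               # dead cells in the MBB
print(query_hits_only_dead_space(diagonal, (3, 0, 4, 1)))      # corner of the MBB
print(query_hits_only_dead_space(diagonal, (1, 1, 2, 2)))      # crosses the line
```

An MBB-only filter would pass the first query on to the disk-reading refinement step even though it cannot produce a result; a dead-space check lets that read be skipped.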

Citations: 0

ArrayQL for Linear Algebra within Umbra

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468838

Maximilian E. Schüle, T. Götz, A. Kemper, Thomas Neumann

Array database systems offer a declarative language for array-based access to multidimensional data. This study explains the integration of ArrayQL into a relational database system, where it is either addressable through a separate query interface or integrated into SQL as user-defined functions. With a relational database system as the target, we inherit benefits such as query optimisation and multi-version concurrency control by design. Having a second query language besides SQL allows the data to be processed without extracting or transforming it out of its relational form. This is possible because we work on a relational array representation, for which we translate each ArrayQL operator into relational algebra. In our evaluation, ArrayQL within Umbra computes matrix operations faster than state-of-the-art database extensions.
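The idea of a relational array representation can be sketched with ordinary SQL: a matrix is stored as (row index, column index, value) tuples, and matrix multiplication becomes a join followed by a grouped sum. The example uses SQLite through Python's standard `sqlite3` module purely for illustration; it shows neither ArrayQL syntax nor Umbra, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (i INTEGER, j INTEGER, v REAL);
    CREATE TABLE b (i INTEGER, j INTEGER, v REAL);
""")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] in coordinate form.
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])

# C[i][j] = sum over k of A[i][k] * B[k][j]:
# join on the shared index, then aggregate per output cell.
product = conn.execute("""
    SELECT a.i, b.j, SUM(a.v * b.v) AS v
    FROM a JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
""").fetchall()

for i, j, v in product:
    print(i, j, v)
```

Because the operation is expressed in relational algebra (join, projection, aggregation), it is subject to the same query optimisation as any other query, which is the benefit the abstract points to.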

Citations: 6

Distributed Enumeration of Four Node Graphlets at Quadrillion-Scale

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468805

Xiaozhou Liu, Yudi Santoso, Venkatesh Srinivasan, Alex Thomo

Graphlet enumeration is a basic task in graph analysis with many applications. Thus it is important to be able to perform this task within a reasonable amount of time. However, this objective is challenging when the input graph is very large, with millions of nodes and edges. Known solutions are limited in terms of scalability. Distributed computing is often proposed as a solution to improve scalability. However, it has to be done carefully to reduce the overhead cost and to really benefit from the distributed solution. We study the enumeration of four-node graphlets in undirected graphs using a distributed platform. We propose an efficient distributed solution which significantly surpasses the existing solutions. With this method we are able to process larger graphs that have never been processed before and enumerate quadrillions of graphlets using a modest cluster of machines. We show the scalability of our solution through experimental results. Finally, we also extend our algorithm to enumerate graphlets in probabilistic graphs and demonstrate its suitability for this case.
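To make the task concrete, the sketch below counts the six connected four-node graphlets of a small undirected graph by brute force, classifying each four-node subset by the sorted degree sequence of its induced subgraph. This single-machine, O(n^4) approach is only a definition-level illustration and is not the distributed algorithm the abstract refers to; the function name and the sample graph are assumptions.

```python
from itertools import combinations

# Sorted degree sequence of the induced subgraph -> graphlet name.
# For four nodes, each connected shape has a unique degree sequence,
# and no disconnected subgraph shares any of these sequences.
GRAPHLETS = {
    (1, 1, 2, 2): "path",
    (1, 1, 1, 3): "star",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed triangle",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def count_four_node_graphlets(edges):
    """Count induced connected four-node subgraphs by type."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {name: 0 for name in GRAPHLETS.values()}
    for nodes in combinations(sorted(adj), 4):
        degrees = tuple(sorted(
            sum(1 for m in nodes if m in adj[n]) for n in nodes
        ))
        name = GRAPHLETS.get(degrees)  # None for disconnected subsets
        if name is not None:
            counts[name] += 1
    return counts


# A four-cycle 0-1-2-3 with chord 0-2, plus a pendant node 4 attached to 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (0, 4)]
counts = count_four_node_graphlets(edges)
print(counts)
```

The number of four-node subsets grows with the fourth power of the node count, which is why naive enumeration is infeasible for graphs with millions of nodes.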

Citations: 1

Online Landmark-Based Batch Processing of Shortest Path Queries

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468844

Manuel Hotz, Theodoros Chondrogiannis, Leonard Wörteler, Michael Grossniklaus

Processing shortest path queries is a basic operation in many graph problems. Both preprocessing-based and batch processing techniques have been proposed to speed up the computation of a single shortest path by amortizing its costs. However, both approaches suffer from limitations. The former are prohibitively expensive when the precomputed information must be updated frequently due to changes in the graph, while the latter require coordinates and cannot be used on non-spatial graphs. In this paper, we address both limitations and propose novel techniques for batch processing shortest path queries using landmarks. We show how preprocessing can be avoided entirely by integrating the computation of landmark distances into query processing. Our experimental results demonstrate that our techniques outperform the state of the art on both spatial and non-spatial graphs, with a maximum speedup of 3.61× in online scenarios.
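Landmarks replace coordinates as a source of distance estimates: on an undirected graph, the triangle inequality gives |d(L, u) - d(L, v)| <= d(u, v) for any landmark L, so one distance computation from L yields lower bounds for every query in a batch. The following is a minimal sketch of that bound, assuming a small hand-made graph and a single landmark; how the paper interleaves landmark computation with query processing is not shown here.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(neighbour, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def landmark_lower_bound(landmark_dists, u, v):
    """Best lower bound on d(u, v) over all landmarks (undirected graph)."""
    return max(abs(d[u] - d[v]) for d in landmark_dists)


graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("a", 2), ("c", 1), ("d", 4)],
    "c": [("a", 5), ("b", 1), ("d", 1)],
    "d": [("b", 4), ("c", 1)],
}

# Distances from landmark "a" are computed once and shared by the batch.
landmark_dists = [dijkstra(graph, "a")]
batch = [("b", "d"), ("c", "d")]
bounds = [landmark_lower_bound(landmark_dists, s, t) for s, t in batch]
print(bounds)
```

Such bounds can serve as the heuristic of a goal-directed search, which is why no node coordinates are needed.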

Citations: 0

DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468806

R. S. Pereira, Y. M. Souto, A. Silva, Rocio Zorilla, Brian Tsan, Florin Rusu, Eduardo S. Ogasawara, A. Ziviani, F. Porto

Consider a set of black-box models – each of them independently trained on a different dataset – answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production, learning data patterns from different datasets and facing independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation strategy that maps the models to the query region. In this paper we present DJEnsemble, a cost-based strategy for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate DJEnsemble and highlight its efficiency, selecting model ensembles that are almost as efficient as the optimal solution. When compared against the traditional ensemble approach, DJEnsemble achieves up to 4X improvement in execution time and almost 9X improvement in prediction accuracy.
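What a disjoint allocation means can be sketched in a few lines: the query region is split into tiles, and each tile is served by exactly one model, chosen by a cost estimate. Everything in the example is hypothetical, including the tile names, the per-tile error estimates, the run times, and the weighted-sum cost. The abstract does not specify the cost function DJEnsemble actually uses, so this is not a reproduction of it.

```python
def allocate_disjoint(tiles, models, time_weight=0.1):
    """Assign exactly one model to each tile.

    The (assumed) cost of running a model on a tile is its estimated error
    on that tile plus time_weight times its run time.
    """
    allocation = {}
    for tile in tiles:
        allocation[tile] = min(
            models,
            key=lambda m: models[m]["est_error"][tile]
            + time_weight * models[m]["run_time"],
        )
    return allocation


tiles = ["north", "south"]
models = {
    "model_a": {"run_time": 1.0, "est_error": {"north": 0.2, "south": 0.9}},
    "model_b": {"run_time": 3.0, "est_error": {"north": 0.5, "south": 0.3}},
}

allocation = allocate_disjoint(tiles, models)
print(allocation)
```

In contrast to a traditional ensemble, which would run every model on the whole region and combine the outputs, each tile here triggers only one model invocation.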

Citations: 6

In-Database Machine Learning with SQL on GPUs

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468840

Maximilian E. Schüle, Harald Lang, M. Springer, A. Kemper, Thomas Neumann, Stephan Günnemann

In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But retrieving the latest data from a database requires time-consuming extraction, because database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline consisting of data preprocessing, model training, and validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra with an operator for automatic differentiation for use within recursive tables: with the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at the hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.
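Gradient descent in a recursive table can be sketched with a standard recursive common table expression. The example below fits the single weight of the model y = w * x by minimising the mean squared error, with the derivative 2 * (w * avg(x*x) - avg(x*y)) written out by hand. It runs on SQLite via Python's `sqlite3` module, so it shows neither Umbra's automatic differentiation operator nor GPU execution; the table name, learning rate, and step count are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
# Training data generated by y = 2 * x.
conn.executemany("INSERT INTO points VALUES (?, ?)", [(1, 2), (2, 4), (3, 6)])

(w,) = conn.execute("""
    WITH RECURSIVE
    -- Sufficient statistics of the data, aggregated once.
    stats(sxx, sxy) AS (
        SELECT AVG(x * x), AVG(x * y) FROM points
    ),
    -- One row per gradient-descent step: w <- w - lr * dLoss/dw.
    gd(step, w) AS (
        SELECT 0, 0.0
        UNION ALL
        SELECT step + 1, w - 0.05 * 2.0 * (w * sxx - sxy)
        FROM gd, stats
        WHERE step < 100
    )
    SELECT w FROM gd ORDER BY step DESC LIMIT 1
""").fetchone()

print(w)
```

The hand-written derivative is the part that an automatic differentiation operator would generate from the loss expression.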

Citations: 18

Automatic Selection of Analytic Platforms with ASAP-DM

Pub Date : 2021-07-06 DOI: 10.1145/3468791.3468802

M. Fritz, Gang Shao, H. Schwarz

The plethora of available analytic platforms makes it increasingly difficult to select the most appropriate platform for a given data mining task and for datasets with varying characteristics. Novice analysts in particular have difficulty keeping up with the latest technical developments. In this demo, we present the ASAP-DM framework. ASAP-DM automatically selects a well-performing analytic platform for a given data mining task via an intuitive web interface, thus especially supporting novice analysts. The takeaways for demo attendees are: (1) a good understanding of the challenges posed by various data mining workloads and dataset characteristics, and of their effects on the selection of analytic platforms, (2) useful insights into how ASAP-DM works internally, and (3) how to benefit from ASAP-DM for exploratory data analysis.

Citations: 0
