Advances in database technology : proceedings. International Conference on Extending Database Technology最新文献

英文中文

GraphTempo: An aggregation framework for evolving graphs GraphTempo:用于演化图的聚合框架

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.18

Evangelia Tsoukanara, Georgia Koloniari, E. Pitoura

Graphs offer a generic abstraction for modeling entities and the interactions and relationships between them. Since most real-world graphs evolve over time, there is a need for models to explore the evolution of graphs over time. We introduce the GraphTempo model that allows aggregation both at the attribute and at the time dimension. We also propose an exploration strategy for navigating through the evolution of the graph based on identifying time intervals of significant growth, shrinkage or stability. This exploration strategy would be useful for example for identifying time periods of multiple collaborations between specific groups in a cooperation network, or of declining contacts between specific groups in a disease propagation network. We evaluate the performance and effectiveness of our strategy using two real graphs.

图为实体建模以及它们之间的交互和关系提供了一个通用的抽象。由于大多数现实世界的图都随着时间的推移而发展，因此需要模型来探索图随时间的发展。我们引入GraphTempo模型，该模型允许在属性维度和时间维度上进行聚合。我们还提出了一种探索策略，通过识别显著增长、收缩或稳定的时间间隔来导航图的演变。这种探索策略对于确定合作网络中特定群体之间多重合作的时间段，或者疾病传播网络中特定群体之间联系减少的时间段，都是有用的。我们使用两个真实的图来评估我们策略的性能和有效性。

引用次数: 3

Spatial Structure-Aware Road Network Embedding via Graph Contrastive Learning 基于图对比学习的空间结构感知道路网络嵌入

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.12

Yanchuan Chang, E. Tanin, Xin Cao, Jianzhong Qi

Road networks are widely used as a fundamental structure in urban transportation studies. In recent years, with more research leveraging deep learning to solve conventional transportation problems, how to obtain robust road network representations (i.e., embeddings) applicable for a wide range of applications became a fundamental need. Existing studies mainly adopt graph embedding methods. Such methods, however, foremost learn the topological correlations of road networks but ignore the spatial structure (i.e., spatial correlations) which are also important in applications such as querying similar trajectories. Besides, most studies learn task-specic embeddings in a supervised manner such that the embeddings are sub-optimal when being used for new tasks. It is inecient to store or learn dedicated embeddings for every dierent task in a large transportation system. To tackle these issues, we propose a model named SARN to learn generic and task-agnostic road network embeddings based on self-supervised contrastive learning. We present (i) a spatial similarity matrix to help learn the spatial correlations of the roads, (ii) a sampling strategy based on the spatial structure of a road network to form self-supervised training samples, and (iii) a two-level loss function to guide SARN to learn embeddings based on both local and global contrasts of similar and dissimilar road segments. Experimental results on three downstream tasks over real-world road networks show that SARN outperforms state-of-the-art self-supervised models consistently and achieves comparable (or even better) performance to supervised models.

道路网作为城市交通研究的基本结构被广泛应用。近年来，随着越来越多的研究利用深度学习来解决传统的交通问题，如何获得适用于广泛应用的鲁棒道路网络表示(即嵌入)成为一个基本需求。现有研究主要采用图嵌入方法。然而，这些方法首先学习了道路网络的拓扑相关性，但忽略了空间结构(即空间相关性)，这在查询类似轨迹等应用中也很重要。此外，大多数研究以监督的方式学习特定任务的嵌入，使得嵌入在用于新任务时是次优的。在大型运输系统中，为每个不同的任务存储或学习专用的嵌入是不方便的。为了解决这些问题，我们提出了一个基于自监督对比学习的SARN模型来学习通用的和任务不可知的道路网络嵌入。我们提出了(i)一个空间相似性矩阵来帮助学习道路的空间相关性，(ii)一个基于道路网络空间结构的采样策略来形成自监督训练样本，以及(iii)一个两级损失函数来指导SARN学习基于相似和不相似道路段的局部和全局对比的嵌入。在现实世界道路网络的三个下游任务上的实验结果表明，SARN始终优于最先进的自监督模型，并达到与监督模型相当(甚至更好)的性能。

{"title":"Spatial Structure-Aware Road Network Embedding via Graph Contrastive Learning","authors":"Yanchuan Chang, E. Tanin, Xin Cao, Jianzhong Qi","doi":"10.48786/edbt.2023.12","DOIUrl":"https://doi.org/10.48786/edbt.2023.12","url":null,"abstract":"Road networks are widely used as a fundamental structure in urban transportation studies. In recent years, with more research leveraging deep learning to solve conventional transportation problems, how to obtain robust road network representations (i.e., embeddings) applicable for a wide range of applications became a fundamental need. Existing studies mainly adopt graph embedding methods. Such methods, however, foremost learn the topological correlations of road networks but ignore the spatial structure (i.e., spatial correlations) which are also important in applications such as querying similar trajectories. Besides, most studies learn task-specic embeddings in a supervised manner such that the embeddings are sub-optimal when being used for new tasks. It is inecient to store or learn dedicated embeddings for every dierent task in a large transportation system. To tackle these issues, we propose a model named SARN to learn generic and task-agnostic road network embeddings based on self-supervised contrastive learning. We present (i) a spatial similarity matrix to help learn the spatial correlations of the roads, (ii) a sampling strategy based on the spatial structure of a road network to form self-supervised training samples, and (iii) a two-level loss function to guide SARN to learn embeddings based on both local and global contrasts of similar and dissimilar road segments. Experimental results on three downstream tasks over real-world road networks show that SARN outperforms state-of-the-art self-supervised models consistently and achieves comparable (or even better) performance to supervised models.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"46 1","pages":"144-156"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84432765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

MM-quecat: A Tool for Unified Querying of Multi-Model Data MM-quecat:多模型数据统一查询工具

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.76

P. Koupil, Daniel Crha, I. Holubová

The dawn of multi-model data has brought many challenges to most aspects of data management. In addition, no standards exist focusing on how the models should be combined and man-aged. This paper focuses on the problems related to multi-model querying. We introduce MM-quecat , a tool that enables one to query multi-model data regardless of the underlying multi-model database or polystore. Using category theory, we provide a unified abstract representation of multi-model data, which can be viewed as a graph and, thus, queried using a SPARQL-based query language. Moreover, the support for cross-model redundancy enables the choice of the optimal multi-model query strategy.

多模型数据的出现给数据管理的大多数方面带来了许多挑战。此外，没有标准关注模型应该如何组合和管理。本文主要研究了多模型查询的相关问题。我们介绍MM-quecat，这是一种工具，可以查询多模型数据，而不管底层的多模型数据库或多存储。使用范畴论，我们提供了多模型数据的统一抽象表示，可以将其视为一个图，因此可以使用基于sparql的查询语言进行查询。此外，支持跨模型冗余可以选择最优的多模型查询策略。

引用次数: 0

Exploration of Approaches for In-Database ML 数据库内ML方法的探索

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.25

Steffen Kläbe, Stefan Hagedorn, K. Sattler

Database systems are no longer used only for the storage of plain structured data and basic analyses. An increasing role is also played by the integration of ML models, e.g., neural networks with specialized frameworks, and their use for classification or prediction. However, using such models on data stored in a database system might require downloading the data and performing the computations outside. In this paper, we evaluate approaches for integrating the ML inference step as a special query operator - the ModelJoin. We explore several options for this integration on different abstraction levels: relational representation of the models as well as SQL queries for inference, the use of UDFs, the use of APIs to existing ML runtimes and a native implementation of the ModelJoin as a query operator supporting both CPU and GPU execution. Our evaluation results show that integrating ML runtimes over APIs perform similarly to a native operator while being generic to support arbitrary model types. The solution of relational representation and SQL queries is most portable and works well for smaller inputs without any changes needed in the database engine.

数据库系统不再仅仅用于存储简单的结构化数据和进行基本分析。机器学习模型的集成也发挥着越来越重要的作用，例如，具有专门框架的神经网络，以及它们用于分类或预测。但是，对存储在数据库系统中的数据使用这种模型可能需要下载数据并在外部执行计算。在本文中，我们评估了将ML推理步骤集成为一个特殊查询操作符- ModelJoin的方法。我们在不同的抽象层次上探索了这种集成的几个选项:模型的关系表示以及用于推理的SQL查询、udf的使用、对现有ML运行时使用api以及将ModelJoin作为支持CPU和GPU执行的查询操作符的本地实现。我们的评估结果表明，在api上集成ML运行时的性能与本机操作符相似，同时具有泛型以支持任意模型类型。关系表示和SQL查询的解决方案是最可移植的，并且可以很好地用于较小的输入，而无需在数据库引擎中进行任何更改。

{"title":"Exploration of Approaches for In-Database ML","authors":"Steffen Kläbe, Stefan Hagedorn, K. Sattler","doi":"10.48786/edbt.2023.25","DOIUrl":"https://doi.org/10.48786/edbt.2023.25","url":null,"abstract":"Database systems are no longer used only for the storage of plain structured data and basic analyses. An increasing role is also played by the integration of ML models, e.g., neural networks with specialized frameworks, and their use for classification or prediction. However, using such models on data stored in a database system might require downloading the data and performing the computations outside. In this paper, we evaluate approaches for integrating the ML inference step as a special query operator - the ModelJoin. We explore several options for this integration on different abstraction levels: relational representation of the models as well as SQL queries for inference, the use of UDFs, the use of APIs to existing ML runtimes and a native implementation of the ModelJoin as a query operator supporting both CPU and GPU execution. Our evaluation results show that integrating ML runtimes over APIs perform similarly to a native operator while being generic to support arbitrary model types. The solution of relational representation and SQL queries is most portable and works well for smaller inputs without any changes needed in the database engine.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"17 1","pages":"311-323"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85281129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Workload-Aware Query Recommendation Using Deep Learning 基于深度学习的工作负载感知查询推荐

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.05

E. Y. Lai, Zainab Zolaktaf, Mostafa Milani, Omar AlOmeir, Jianhao Cao, R. Pottinger

Users interact with databases by writing sequences of SQL queries that are are often stored in query workloads. Current SQL query recommendation approaches make little use of query workloads. Our work presents a novel workload-aware approach to query recommendation. We use deep learning prediction models trained on query pairs extracted from large-scale query workloads to build our approach. Our algorithms suggest contextual (query fragments) and structural (query templates) information to aid users in formulating their next query. We evaluate our algorithms on two real-world datasets: the Sloan Digital Sky Survey (SDSS) and SQLShare. We perform a thorough analysis of the workloads and then empirically show that our workload-aware, deep-learning approach vastly outperforms known collaborative filtering approaches.

用户通过编写通常存储在查询工作负载中的SQL查询序列与数据库进行交互。当前的SQL查询推荐方法很少使用查询工作负载。我们的工作提出了一种新颖的工作负载感知查询推荐方法。我们使用从大规模查询工作负载中提取的查询对训练的深度学习预测模型来构建我们的方法。我们的算法建议上下文(查询片段)和结构(查询模板)信息，以帮助用户制定下一个查询。我们在两个真实世界的数据集上评估了我们的算法:斯隆数字巡天(SDSS)和SQLShare。我们对工作负载进行了彻底的分析，然后通过经验表明，我们的工作负载感知、深度学习方法大大优于已知的协同过滤方法。

引用次数: 0

New Trends in Time Series Anomaly Detection 时间序列异常检测的新趋势

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.80

Paul Boniol, John Paparizzos, Themis Palpanas

Anomaly detection is an important problem in data analytics with applications in many domains. In recent years, there has been an increasing interest in anomaly detection tasks applied to time series. In this tutorial, we take a holistic view on anomaly detection in time series, starting from the core definitions and taxonomies related to time series and anomaly types, to an extensive description of the anomaly detection methods proposed by different communities in the literature. Then, we discuss shortcomings in traditional evaluation measures. Finally, we present new solutions to assess the quality of anomaly detection approaches and new benchmarks capturing diverse domains and applications.

异常检测是数据分析中的一个重要问题，在许多领域都有应用。近年来，人们对应用于时间序列的异常检测任务越来越感兴趣。在本教程中，我们从时间序列和异常类型相关的核心定义和分类法开始，对不同团体在文献中提出的异常检测方法进行了广泛的描述，全面了解了时间序列中的异常检测。然后，讨论了传统评价方法的不足。最后，我们提出了新的解决方案来评估异常检测方法的质量，以及捕获不同领域和应用的新基准。

引用次数: 2

Adaptive Real-time Virtualization of Legacy ETL Pipelines in Cloud Data Warehouses 云数据仓库中遗留ETL管道的自适应实时虚拟化

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.64

Ehab Abdelhamid, Nikos Tsikoudis, M. Duller, Marc B. Sugiyama, Nicholas E. Marino, F. Waas

Extract, Transform, and Load (ETL) pipelines are widely used to ingest data into Enterprise Data Warehouse (EDW) systems. These pipelines can be very complex and often tightly coupled to a given EDW, making it challenging to upgrade from a legacy EDW to a Cloud Data Warehouse (CDW). This paper presents a novel solution for a transparent and fully-automated porting of legacy ETL pipelines to CDW environments.

提取、转换和加载(ETL)管道广泛用于将数据摄取到企业数据仓库(EDW)系统中。这些管道可能非常复杂，并且通常与给定的EDW紧密耦合，这使得从传统EDW升级到云数据仓库(CDW)具有挑战性。本文提出了一种新颖的解决方案，用于将遗留ETL管道透明且全自动地移植到CDW环境中。

引用次数: 0

COVIDKG.ORG - a Web-scale COVID-19 Interactive, Trustworthy Knowledge Graph, Constructed and Interrogated for Bias using Deep-Learning COVIDKG。ORG -一个网络规模的COVID-19互动，可信赖的知识图谱，使用深度学习构建和询问偏见

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.63

Bhimesh Kandibedala, A. Pyayt, Nick Piraino, Chris Caballero, M. Gubanov

We describe a Web-scale interactive Knowledge Graph (KG) , populated with trustworthy information from the latest published medical findings on COVID-19. Currently existing, socially maintained KGs, such as YAGO or DBPedia or more specialized medical ontologies, such as NCBI, Virus-, and COVID-19-related are getting stale very quickly, lack any latest COVID-19 medical findings - most importantly lack any scalable mechanism to keep them up to date. Here we describe COVIDKG.ORG - an online, interactive, trust-worthy COVID-19 Web-scale Knowledge Graph and several advanced search-engines. Its content is extracted and updated from the latest medical research. Because of that it does not suffer from any bias or misinformation, often dominating public information sources.

我们描述了一个网络规模的交互式知识图(KG)，其中填充了来自最新发表的COVID-19医学发现的可靠信息。目前现有的、由社会维护的知识库，如YAGO或DBPedia，或更专业的医学本体，如NCBI、病毒和COVID-19相关的知识库，正在迅速过时，缺乏任何最新的COVID-19医学发现——最重要的是缺乏任何可扩展的机制来保持它们的最新状态。这里我们来描述一下covid - kg。ORG——一个在线的、互动的、值得信赖的COVID-19网络规模知识图谱和几个先进的搜索引擎。它的内容是从最新的医学研究中提取和更新的。正因为如此，它不会受到任何偏见或错误信息的影响，经常占据公共信息来源的主导地位。

引用次数: 0

Multi-Task Processing in Vertex-Centric Graph Systems: Evaluations and Insights 以顶点为中心的图系统中的多任务处理:评价和见解

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.20

Siqiang Luo, Zichen Zhu, Xiaokui Xiao, Y. Yang, Chunbo Li, B. Kao

Vertex-centric (VC) graph systems are at the core of large-scale distributed graph processing. For such systems, a common usage pattern is the concurrent processing of multiple tasks ( multi-processing for short), which aims to execute a large number of unit tasks in parallel. In this paper, we point out that multi-processing has not been sufficiently studied or evaluated in previous work; hence, we fill this critical gap with three major contributions. First, we examine the tradeoff between two important measures in VC-systems: the number of communication rounds and message congestion . We show that this tradeoff is crucial to system performance; yet, existing approaches fail to achieve an optimal tradeoff, leading to poor performance. Second, based on exten-sive experimental evaluations on mainstream VC systems (e.g., Giraph, Pregel+, GraphD) and benchmark multi-processing tasks (e.g., Batch Personalized PageRanks, Multiple Source Shortest Paths), we present several important insights on the correlation between system performance and configurations, which is valu-able to practitioners in optimizing system performance. Third, based on the insights drawn from our experimental evaluations, we present a cost-based tuning framework that optimizes the performance of a representative VC-system. This demonstrates the usefulness of the insights.

以顶点为中心(VC)的图系统是大规模分布式图处理的核心。对于这样的系统，常见的使用模式是并发处理多个任务(简称为多处理)，其目的是并行执行大量单元任务。在本文中，我们指出，多处理在以前的工作中没有得到充分的研究和评价;因此，我们用三个主要贡献来填补这一关键空白。首先，我们研究了风险投资系统中两个重要指标之间的权衡:通信轮数和消息拥塞。我们表明，这种权衡对系统性能至关重要;然而，现有的方法无法实现最佳权衡，导致性能不佳。其次，基于对主流VC系统(例如，Giraph, Pregel+， GraphD)和基准多处理任务(例如，Batch Personalized pagerank, Multiple Source最短路径)的广泛实验评估，我们提出了关于系统性能与配置之间相关性的几个重要见解，这对从业者优化系统性能有价值。第三，基于从实验评估中得出的见解，我们提出了一个基于成本的优化框架，该框架可优化具有代表性的风险投资系统的性能。这证明了这些见解的有用性。

{"title":"Multi-Task Processing in Vertex-Centric Graph Systems: Evaluations and Insights","authors":"Siqiang Luo, Zichen Zhu, Xiaokui Xiao, Y. Yang, Chunbo Li, B. Kao","doi":"10.48786/edbt.2023.20","DOIUrl":"https://doi.org/10.48786/edbt.2023.20","url":null,"abstract":"Vertex-centric (VC) graph systems are at the core of large-scale distributed graph processing. For such systems, a common usage pattern is the concurrent processing of multiple tasks ( multi-processing for short), which aims to execute a large number of unit tasks in parallel. In this paper, we point out that multi-processing has not been sufficiently studied or evaluated in previous work; hence, we fill this critical gap with three major contributions. First, we examine the tradeoff between two important measures in VC-systems: the number of communication rounds and message congestion . We show that this tradeoff is crucial to system performance; yet, existing approaches fail to achieve an optimal tradeoff, leading to poor performance. Second, based on exten-sive experimental evaluations on mainstream VC systems (e.g., Giraph, Pregel+, GraphD) and benchmark multi-processing tasks (e.g., Batch Personalized PageRanks, Multiple Source Shortest Paths), we present several important insights on the correlation between system performance and configurations, which is valu-able to practitioners in optimizing system performance. Third, based on the insights drawn from our experimental evaluations, we present a cost-based tuning framework that optimizes the performance of a representative VC-system. This demonstrates the usefulness of the insights.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"4 1","pages":"247-259"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79759570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

PyFroid: Scaling Data Analysis on a Commodity Workstation PyFroid:在商品工作站上扩展数据分析

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2024.06

Venkatesh Emani, A. Floratou, C. Curino

Almost every organization today is promoting data-driven decision making leveraging advances in data science. According to various surveys, data scientists spend up to 80% of their time cleaning and transforming data. Although data management systems have been carefully optimized for such tasks over several decades, they are seldom leveraged by data scientists who prefer to use libraries such as Pandas, sacrificing performance and scalability in favor of familiarity and ease of use. As a result, data scientists are not able to fully leverage the hardware capabilities of commodity workstations and either end up working on a small sample of their data locally or migrate to more heavyweight frameworks in a cluster environment. In this paper, we present PyFroid, a framework that leverages lightweight relational databases to improve the performance and scalability of Pandas, allowing data scientists to operate on much larger datasets on a commodity workstation. PyFroid has zero learning curve as it maintains all the Pandas APIs and is fully compatible with the tools that data scientists use (e.g., Python notebooks). We experimentally demonstrate that, compared to Pandas, PyFroid is able to analyze up to 20X more data on the same machine, provide comparable or better performance for small datasets as well as near-memory data sizes, and consume much less resources.

如今，几乎每个组织都在利用数据科学的进步推动数据驱动的决策制定。根据各种调查，数据科学家花费高达80%的时间来清理和转换数据。尽管数据管理系统在过去几十年中已经针对这些任务进行了精心优化，但数据科学家很少利用它们，他们更喜欢使用Pandas等库，牺牲性能和可伸缩性，以换取熟悉性和易用性。因此，数据科学家无法充分利用商品工作站的硬件功能，最终只能在本地处理一小部分数据样本，或者迁移到集群环境中更重量级的框架。在本文中，我们介绍了PyFroid，这是一个利用轻量级关系数据库来提高Pandas的性能和可伸缩性的框架，允许数据科学家在普通工作站上操作更大的数据集。PyFroid的学习曲线为零，因为它维护了所有Pandas api，并且与数据科学家使用的工具(例如Python笔记本)完全兼容。我们通过实验证明，与Pandas相比，PyFroid能够在同一台机器上分析多达20倍的数据，对于小型数据集以及近内存数据大小提供相当或更好的性能，并且消耗更少的资源。

{"title":"PyFroid: Scaling Data Analysis on a Commodity Workstation","authors":"Venkatesh Emani, A. Floratou, C. Curino","doi":"10.48786/edbt.2024.06","DOIUrl":"https://doi.org/10.48786/edbt.2024.06","url":null,"abstract":"Almost every organization today is promoting data-driven decision making leveraging advances in data science. According to various surveys, data scientists spend up to 80% of their time cleaning and transforming data. Although data management systems have been carefully optimized for such tasks over several decades, they are seldom leveraged by data scientists who prefer to use libraries such as Pandas, sacrificing performance and scalability in favor of familiarity and ease of use. As a result, data scientists are not able to fully leverage the hardware capabilities of commodity workstations and either end up working on a small sample of their data locally or migrate to more heavyweight frameworks in a cluster environment. In this paper, we present PyFroid, a framework that leverages lightweight relational databases to improve the performance and scalability of Pandas, allowing data scientists to operate on much larger datasets on a commodity workstation. PyFroid has zero learning curve as it maintains all the Pandas APIs and is fully compatible with the tools that data scientists use (e.g., Python notebooks). We experimentally demonstrate that, compared to Pandas, PyFroid is able to analyze up to 20X more data on the same machine, provide comparable or better performance for small datasets as well as near-memory data sizes, and consume much less resources.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"61-67"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89326433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Advances in database technology : proceedings. International Conference on Extending Database Technology

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀