
Proceedings of the VLDB Endowment: Latest Articles

Kora: A Cloud-Native Event Streaming Platform for Kafka
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611567
Anna Povzner, Prince Mahajan, Jason Gustafson, Jun Rao, Ismael Juma, Feng Min, Shriram Sridharan, Nikhil Bhatia, Gopi Attaluri, Adithya Chandra, Stanislav Kozlovski, Rajini Sivaram, Lucas Bradstreet, Bob Barrett, Dhruvil Shah, David Jacot, David Arthur, Ron Dagostino, Colin McCabe, Manikumar Reddy Obili, Kowshik Prakasam, Jose Garcia Sancio, Vikas Singh, Alok Nikhil, Kamal Gupta
Event streaming is an increasingly critical infrastructure service used in many industries, and there is growing demand for cloud-native solutions. Confluent Cloud provides a massive-scale event streaming platform built on top of Apache Kafka, with tens of thousands of clusters running in 70+ regions across AWS, Google Cloud, and Azure. This paper introduces Kora, the cloud-native platform for Apache Kafka at the core of Confluent Cloud. We describe the design that enables Kora to meet its cloud-native goals, such as reliability, elasticity, and cost efficiency. We discuss Kora's abstractions, which allow users to think in terms of their workload requirements rather than the underlying infrastructure, and how Kora is designed to provide consistent, predictable performance across cloud environments with diverse capabilities.
Citations: 0
Demonstration of OpenDBML, a Framework for Democratizing In-Database Machine Learning
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611598
Mahdi Ghorbani, Amir Shaikhha
Machine learning over relational data has been used in several applications. The traditional approach of joining relations first and then training a model on the joined table is time-consuming and requires a significant amount of memory. Recent research has focused on in-database machine learning (in-DB ML) to address this issue; these methods train the models over relations without joining, resulting in a more efficient process. However, such systems have ad-hoc user interfaces and specific data formats, making them challenging to use. To address this problem, this paper presents OpenDBML, a framework for democratizing in-DB ML. OpenDBML offers a Python interface for multiple in-DB ML systems, a set of commonly used datasets, and the ability to add new datasets and in-DB ML systems via both Python and web interfaces. The paper also presents comprehensive demonstration scenarios to illustrate how to use OpenDBML effectively.
Citations: 0
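The OpenDBML abstract contrasts joining first and then training with training "over relations without joining". The sketch below is not OpenDBML's interface. It is a minimal illustration, under the assumption that the in-DB ML system pushes aggregates below the join, of how a statistic needed for linear regression (the sum of x*y over the join of two relations) can be computed without building the joined table. The relation layout and function names are hypothetical.

```python
from collections import defaultdict


def sum_xy_materialized(R, S):
    """Baseline: materialize the join on the key, then aggregate."""
    joined = [(x, y) for (kr, x) in R for (ks, y) in S if kr == ks]
    return sum(x * y for x, y in joined)


def sum_xy_factorized(R, S):
    """Aggregate each relation per key first; the joined table is never built."""
    sx = defaultdict(float)  # per-key sum of x in R
    sy = defaultdict(float)  # per-key sum of y in S
    for k, x in R:
        sx[k] += x
    for k, y in S:
        sy[k] += y
    # For each key, sum over all pairs (x, y) of x*y equals (sum x) * (sum y).
    return sum(sx[k] * sy[k] for k in sx if k in sy)


# Hypothetical toy relations: R(key, x) and S(key, y).
R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (2, 5.0), (2, 1.0), (3, 7.0)]

print(sum_xy_materialized(R, S), sum_xy_factorized(R, S))
```

Both functions return the same value, but the second makes one pass over each relation instead of enumerating every joined row, which is where the time and memory savings come from.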
SimpleTS: An Efficient and Universal Model Selection Framework for Time Series Forecasting
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611561
Yuanyuan Yao, Dimeng Li, Hailiang Jie, Hailiang Jie, Tianyi Li, Jie Chen, Jiaqi Wang, Feifei Li, Yunjun Gao
Time series forecasting, which predicts events through a sequence of time, has received increasing attention in past decades. The diverse range of time series forecasting models makes it challenging to select the most suitable model for a given dataset. As such, the Alibaba Cloud database monitoring system must address the issue of selecting an optimal forecasting model for a single time series. While several model selection frameworks, including AutoAI-TS, have been developed for forecasting over a dataset, their effectiveness may be limited because they may not adapt well to all types of time series, resulting in reduced prediction accuracy. Alternatively, models such as AutoForecast, which train on individual data points, may offer better adaptability but are limited by the longer training time required. In this paper, we introduce SimpleTS, a versatile framework for time series forecasting that exhibits high efficiency and accuracy across all types of time series data. When performing an online prediction task, SimpleTS first classifies the input time series into one type, and then efficiently selects the most suitable prediction model for this type. To optimize performance, SimpleTS (i) clusters models with similar performance to improve the efficiency of classification, and (ii) uses soft labeling and weighted representation learning to achieve higher classification accuracy for different time series types. Extensive experiments on 3 private datasets and 52 public datasets show that SimpleTS outperforms state-of-the-art toolkits in terms of both training time and prediction accuracy.
Citations: 0
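The SimpleTS abstract describes a two-step online flow: classify the input series into a type, then select the prediction model associated with that type. The sketch below only illustrates that control flow. The two series types, the hand-written classification rule, and the two toy forecasters are assumptions made for illustration and stand in for the learned classifier and model clusters that the abstract mentions.

```python
from statistics import mean


def forecast_mean(series, horizon):
    """Toy model for 'flat' series: repeat the historical mean."""
    return [mean(series)] * horizon


def forecast_drift(series, horizon):
    """Toy model for 'trending' series: extend the average slope."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * (h + 1) for h in range(horizon)]


def classify(series):
    """Hypothetical rule: 'trending' if the net change covers most of the range."""
    spread = max(series) - min(series)
    if spread == 0:
        return "flat"
    net_change = abs(series[-1] - series[0])
    return "trending" if net_change / spread >= 0.8 else "flat"


# One model per series type (assumed mapping, for illustration only).
MODEL_FOR_TYPE = {"trending": forecast_drift, "flat": forecast_mean}


def predict(series, horizon):
    """Step 1: classify the series. Step 2: run the model chosen for its type."""
    series_type = classify(series)
    return series_type, MODEL_FOR_TYPE[series_type](series, horizon)


print(predict([1, 2, 3, 4, 5], 2))
print(predict([4, 5, 4, 5, 4], 2))
```

Because the model choice is a dictionary lookup after classification, no candidate model has to be trained or evaluated at prediction time, which is the efficiency argument the abstract makes.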
Solving Hard Variants of Database Schema Matching on Quantum Computers
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611603
Kristin Fritsch, Stefanie Scherzinger
With quantum computers now available as cloud services, there is a global quest for applications where a quantum advantage can be shown. Naturally, data management is a candidate domain. Workable solutions require the design of hybrid quantum algorithms, where a quantum processing unit (QPU) and classical computing (via CPUs) cooperate towards solving a problem. This demo illustrates such an end-to-end solution targeting NP-hard variants of database schema matching. Our demo is intended to be educational (and hopefully inspiring), allowing participants to explore the critical design decisions, such as the handover between phases of QPU- and CPU-based computation. It will also allow participants to experience hands-on, through playful interaction, how easily problem sizes exceed the limitations of today's QPUs.
Citations: 0
Join Order Selection with Deep Reinforcement Learning: Fundamentals, Techniques, and Challenges
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611576
Zhengtong Yan, Valter Uotila, Jiaheng Lu
Join Order Selection (JOS) is a fundamental challenge in query optimization, as it significantly affects query performance. However, finding an optimal join order is an NP-hard problem due to the exponentially large search space. Despite the decades-long effort, traditional methods still suffer from limitations. Deep Reinforcement Learning (DRL) approaches have recently gained growing interest and shown superior performance over traditional methods. These DRL-based methods could leverage prior experience through the trial-and-error strategy to automatically explore the optimal join order. This tutorial will focus on recent DRL-based approaches for join order selection by providing a comprehensive overview of the various approaches. We will start by briefly introducing the core concepts of join ordering and the traditional methods for JOS. Next, we will provide some preliminary knowledge about DRL and then delve into DRL-based join order selection approaches by offering detailed information on those methods, analyzing their relationships, and summarizing their weaknesses and strengths. To help the audience gain a deeper understanding of DRL approaches for JOS, we will present two open-source demonstrations and compare their differences. Finally, we will identify research challenges and open problems to provide insights into future research directions. This tutorial will provide valuable guidance for developing more practical DRL approaches for JOS.
Citations: 0
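The join order selection abstract refers to "traditional methods" and to an exponentially large search space. The sketch below shows one classical baseline, dynamic programming over subsets of relations restricted to left-deep plans. It is not a DRL method and is not taken from the tutorial. The cost model (sum of estimated intermediate result sizes under an independence assumption), the relation names, and the selectivities are all assumptions chosen to keep the example small.

```python
from itertools import combinations
from math import factorial


def subset_rows(subset, card, sel):
    """Estimated rows when all relations in `subset` are joined."""
    rows = 1.0
    for rel in subset:
        rows *= card[rel]
    # Apply the selectivity of every join predicate inside the subset.
    for a, b in combinations(sorted(subset), 2):
        rows *= sel.get((a, b), 1.0)  # 1.0 means cross product
    return rows


def best_left_deep(card, sel):
    """Return (cost, order) of the cheapest left-deep join order."""
    rels = sorted(card)
    best = {frozenset([r]): (0.0, (r,)) for r in rels}
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            s = frozenset(combo)
            out_rows = subset_rows(s, card, sel)
            candidates = []
            for last in combo:  # relation joined last in this subset
                prev_cost, prev_order = best[s - {last}]
                candidates.append((prev_cost + out_rows, prev_order + (last,)))
            best[s] = min(candidates)
    return best[frozenset(rels)]


def num_left_deep_orders(n):
    """Number of left-deep orders for n relations (no pruning)."""
    return factorial(n)


# Hypothetical statistics: cardinalities and pairwise join selectivities.
card = {"A": 1000, "B": 100, "C": 10}
sel = {("A", "B"): 0.01, ("B", "C"): 0.1}

print(best_left_deep(card, sel))
print(num_left_deep_orders(10))
```

The dynamic program visits every subset of relations, so its table grows as 2^n, while the number of left-deep orders it implicitly covers grows as n!. That growth is what motivates the learned approaches the tutorial surveys.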
KGNav: A Knowledge Graph Navigational Visual Query System
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611592
Xiang Wang, Xin Wang, Zhaozhuo Li, Dong Han
Visual query is a vital technique for comprehending and analyzing knowledge graphs, and it provides an effective way to lower the barrier to querying knowledge graphs for non-professional users. Nevertheless, the visual query techniques for knowledge graphs and ontologies that have emerged in recent years cannot bridge the gap between the global information provided by the knowledge graph schema and the underlying data of the knowledge graph. Thus, they cannot fully exploit the global information to guide users in querying knowledge graphs. This demonstration showcases KGNav, a Knowledge Graph Navigational visual query system. KGNav (1) redefines the minimal unit of operation to abstract the conceptual hierarchy of the domain, i.e., the Knowledge Graph Schema, from the original knowledge graph in an offline, semi-automatic way through the equivalence relations between these units; it also (2) provides a series of operators and an interactive GUI to capture user query intentions, guiding users to explore the Knowledge Graph Schema to achieve in-depth analysis of knowledge graphs. We will demonstrate the capability of KGNav in reducing tedious queries, enabling users to swiftly grasp the structure of the knowledge graph, and performing queries through several fundamental scenarios.
Citations: 0
Ganos Aero: A Cloud-Native System for Big Raster Data Management and Processing
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611597
Fei Xiao, Jiong Xie, Zhida Chen, Feifei Li, Zhen Chen, Jianwei Liu, Yinpei Liu
The development of Earth Observation technology contributes to the production of massive raster data. It is vital to manage and conduct analytical tasks on the raster data. Existing solutions employ separate dedicated systems for raster data management and for processing, incurring problems such as data redundancy, difficulty in updating, and expensive data transfer and transformation. To cope with these limitations, this demonstration presents Ganos Aero, a cloud-native system for big raster data management and processing. Ganos Aero proposes a unified raster data model for both data management and processing, which stores a single copy of the raster data without performing an expensive tiling procedure, and thus achieves a significant improvement in storage and updating efficiency. To enable efficient query and batch task processing, Ganos Aero implements an on-the-fly tile production mechanism and optimizes its performance using cloud features, including decoupling compute from storage and pushing costly operations closer to the storage layer. Since being deployed in Alibaba Cloud in 2022, Ganos Aero has been playing a critical role in many real applications, including modern agriculture, environment monitoring and protection, etc.
Citations: 0
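The Ganos Aero abstract mentions storing a single copy of the raster without a tiling pass and producing tiles on the fly. The sketch below is a generic illustration of that idea, not Ganos Aero's mechanism: a tile is cut from the stored raster only when it is requested, so no pre-tiled copy exists. The in-memory list-of-rows raster, the function names, and the tile addressing scheme are assumptions.

```python
def tile_window(tile_row, tile_col, tile_size, height, width):
    """Pixel bounds (r0, r1, c0, c1) of a tile, clipped to the raster edge."""
    if tile_row < 0 or tile_col < 0:
        raise IndexError("tile indices must be non-negative")
    r0, c0 = tile_row * tile_size, tile_col * tile_size
    if r0 >= height or c0 >= width:
        raise IndexError("tile lies outside the raster")
    return r0, min(r0 + tile_size, height), c0, min(c0 + tile_size, width)


def read_tile(raster, tile_row, tile_col, tile_size):
    """Cut one tile out of the single stored raster at request time."""
    height, width = len(raster), len(raster[0])
    r0, r1, c0, c1 = tile_window(tile_row, tile_col, tile_size, height, width)
    return [row[c0:c1] for row in raster[r0:r1]]


# Hypothetical 5x5 raster whose pixel value encodes its position (row*10 + col).
raster = [[r * 10 + c for c in range(5)] for r in range(5)]

print(read_tile(raster, 0, 1, 2))  # interior tile
print(read_tile(raster, 2, 2, 2))  # edge tile, clipped to the raster
```

An update to the raster touches only the single stored copy, and any tile size can be served from it afterwards, which is the storage and update benefit the abstract attributes to avoiding a precomputed tiling.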
Krypton: Real-Time Serving and Analytical SQL Engine at ByteDance
Zone 3, Computer Science | Q1, Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611545
Jianjun Chen, Rui Shi, Heng Chen, Li Zhang, Ruidong Li, Wei Ding, Liya Fan, Hao Wang, Mu Xiong, Yuxiang Chen, Benchao Dong, Kuankuan Guo, Yuanjin Lin, Xiao Liu, Haiyang Shi, Peipei Wang, Zikang Wang, Yemeng Yang, Junda Zhao, Dongyan Zhou, Zhikai Zuo, Yuming Liang
In recent years, at ByteDance, we have started seeing more and more business scenarios that require real-time data serving in addition to complex ad hoc analysis over large amounts of freshly imported data. The serving workload requires performing complex queries over massive newly added data items with minimal delay. These systems are often used in mission-critical scenarios, whereas traditional OLAP systems cannot handle such use cases. To work around the problem, ByteDance products often have to use multiple systems together in production, forcing the same data to be ETLed into multiple systems, causing data consistency problems, wasting resources, and increasing learning and maintenance costs. To solve the above problem, we built a single Hybrid Serving and Analytical Processing (HSAP) system to handle both workload types. HSAP is still in its early stage, and very few such systems are on the market yet. This paper demonstrates how to build Krypton, a competitive cloud-native HSAP system that provides both excellent elasticity and query performance by utilizing many previously known query processing techniques, a hierarchical cache with persistent memory, and a native columnar storage format. Krypton can support high data freshness, high data ingestion rates, and strong data consistency. We also discuss lessons and best practices we learned in developing and operating Krypton in production.
Citations: 0
ADOps: An Anomaly Detection Pipeline in Structured Logs
Zone 3 Computer Science Q1 Computer Science Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611618
Xintong Song, Yusen Zhu, Jianfei Wu, Bai Liu, Hongkang Wei
Anomaly detection has been extensively implemented in industry. The reality is that an application may have numerous scenarios where anomalies need to be monitored. However, the complete process of anomaly detection will take much time, including data acquisition, data processing, model training, and model deployment. In particular, some simple scenarios do not require building complex anomaly detection models. This results in a waste of resources. To solve these problems, we build an anomaly detection pipeline (ADOps) to modularize each step. For simple anomaly detection scenarios, no programming is required and new anomaly detection tasks can be created by simply modifying the configuration file. In addition, it can also improve the development efficiency of complex anomaly detection models. We show how users create anomaly detection tasks on the anomaly detection pipeline and how engineers use it to develop anomaly detection models.
Citations: 0
Portals: A Showcase of Multi-Dataflow Stateful Serverless
Zone 3 Computer Science Q1 Computer Science Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611619
Jonas Spenger, Chengyang Huang, Philipp Haller, Paris Carbone
Serverless applications spanning the cloud and edge require flexible programming frameworks for expressing compositions across the different levels of deployment. Another critical aspect for applications with state is failure resilience beyond the scope of a single dataflow graph that is the current standard in data streaming systems. This paper presents Portals, an interactive, stateful dataflow composition framework with strong end-to-end guarantees. Portals enables event-driven, resilient applications that span across dataflow graphs and serverless deployments. The demonstration exhibits three scenarios in our multi-dataflow streaming-based system: dynamically composing a stateful serverless application; an interactive cloud and edge serverless application; and a Portals browser playground.
Citations: 1