Krypton: Real-Time Serving and Analytical SQL Engine at ByteDance
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611545
Jianjun Chen, Rui Shi, Heng Chen, Li Zhang, Ruidong Li, Wei Ding, Liya Fan, Hao Wang, Mu Xiong, Yuxiang Chen, Benchao Dong, Kuankuan Guo, Yuanjin Lin, Xiao Liu, Haiyang Shi, Peipei Wang, Zikang Wang, Yemeng Yang, Junda Zhao, Dongyan Zhou, Zhikai Zuo, Yuming Liang
In recent years, at ByteDance, we have started seeing more and more business scenarios that require performing real-time data serving in addition to complex ad hoc analysis over large amounts of freshly imported data. The serving workload requires performing complex queries over massive newly added data items with minimal delay. These systems are often used in mission-critical scenarios that traditional OLAP systems cannot handle. To work around the problem, ByteDance products often have to use multiple systems together in production, forcing the same data to be ETLed into multiple systems, causing data consistency problems, wasting resources, and increasing learning and maintenance costs. To solve this problem, we built a single Hybrid Serving and Analytical Processing (HSAP) system to handle both workload types. HSAP is still in its early stage, and very few systems are yet on the market. This paper demonstrates how to build Krypton, a competitive cloud-native HSAP system that provides both excellent elasticity and query performance by utilizing many previously known query processing techniques, a hierarchical cache with persistent memory, and a native columnar storage format. Krypton supports high data freshness, high data ingestion rates, and strong data consistency. We also discuss lessons and best practices we learned in developing and operating Krypton in production.
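The sketch below illustrates the general lookup path of a hierarchical cache in front of remote columnar storage: a fast tier is consulted first, then a larger persistent-memory-style tier, then the backing store, with promotion on hits from the slower paths. Tier sizes, the LRU policy, and the fetch function are illustrative assumptions, not Krypton's implementation.

```python
# Illustrative sketch of a two-tier hierarchical cache (DRAM tier over a larger
# persistent-memory-style tier) in front of a slow backing store. Sizes, the LRU
# policy, and the fetch function are illustrative assumptions, not Krypton's design.
from collections import OrderedDict

class LRUTier:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)        # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used block

class HierarchicalCache:
    def __init__(self, fetch_from_store, dram_blocks=2, pmem_blocks=8):
        self.dram = LRUTier(dram_blocks)
        self.pmem = LRUTier(pmem_blocks)
        self.fetch = fetch_from_store         # e.g., read a column block from cloud storage

    def read(self, block_id):
        value = self.dram.get(block_id)
        if value is None:
            value = self.pmem.get(block_id)
            if value is None:
                value = self.fetch(block_id)  # slow path: remote columnar storage
                self.pmem.put(block_id, value)
            self.dram.put(block_id, value)    # promote hot blocks to the fastest tier
        return value

cache = HierarchicalCache(lambda b: f"column-block-{b}")
print(cache.read(1), cache.read(1))           # second read is served from the DRAM tier
```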
DeepVQL: Deep Video Queries on PostgreSQL
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611583
Dong June Lew, Kihyun Yoo, Kwang Woo Nam
The recent development of mobile and camera devices has led to the generation, sharing, and use of massive amounts of video data. As a result, deep learning has gained attention as an alternative for video recognition and situation judgment. Recently, new systems supporting SQL-like declarative query languages have emerged, each developing its own engine to support new deep-learning-based queries that existing systems do not support. The DeepVQL system proposed in this paper is instead implemented by extending PostgreSQL. DeepVQL supports video database functions and provides various user-defined functions for object detection, object tracking, and video analytics queries. The advantage of this system is its ability to use queries with specific spatial regions or temporal durations as conditions for analyzing moving objects in traffic videos.
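To make the idea of spatial and temporal query conditions concrete, the sketch below shows how such a video-analytics query might be issued against a PostgreSQL-based system from Python. The UDF names (vql_detect_objects, vql_in_region), table, and columns are hypothetical placeholders, not DeepVQL's actual API.

```python
# Illustrative only: a sketch of querying a DeepVQL-style installation from Python.
# The UDFs (vql_detect_objects, vql_in_region) and the schema are hypothetical
# placeholders, not the actual DeepVQL API.
import psycopg2

conn = psycopg2.connect(dbname="traffic", user="analyst")
with conn, conn.cursor() as cur:
    # Find cars detected inside a given spatial region during a time window.
    cur.execute(
        """
        SELECT v.frame_ts, obj.label, obj.bbox
        FROM traffic_video v,
             LATERAL vql_detect_objects(v.frame) AS obj   -- hypothetical detection UDF
        WHERE v.frame_ts BETWEEN %s AND %s                -- temporal condition
          AND vql_in_region(obj.bbox, %s)                 -- hypothetical spatial predicate
          AND obj.label = 'car'
        ORDER BY v.frame_ts;
        """,
        ("2023-08-01 08:00", "2023-08-01 09:00",
         "POLYGON((0 0, 100 0, 100 100, 0 100, 0 0))"),
    )
    for row in cur.fetchall():
        print(row)
```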
SHEVA: A Visual Analytics System for Statistical Hypothesis Exploration
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611631
Vicente Nejar de Almeida, Eduardo Ribeiro, Nassim Bouarour, João Luiz Dihl Comba, Sihem Amer-Yahia
We demonstrate SHEVA, a System for Hypothesis Exploration with Visual Analytics. SHEVA adopts an Exploratory Data Analysis (EDA) approach to discovering statistically sound insights from large datasets. The system addresses three longstanding challenges in multiple hypothesis testing: (i) the likelihood of rejecting the null hypothesis by chance, (ii) the pitfall of insights not being representative of the input data, and (iii) the ability to navigate among many data regions while preserving the user's train of thought. To address (i) and (ii), SHEVA implements significance adjustment methods that account for data-informed properties such as coverage and novelty. To address (iii), SHEVA guides users by recommending one-sample and two-sample hypotheses in a stepwise fashion following a data hierarchy. Users may choose from a collection of pre-trained hypothesis exploration policies and let SHEVA guide them through the most significant hypotheses in the data, or intervene to override suggested hypotheses. Furthermore, SHEVA relies on data-to-visual element mappings to convey hypothesis testing results in an interpretable fashion, and allows hypothesis pipelines to be stored and retrieved later to be tested on new datasets.
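As a point of reference for the significance-adjustment step, the sketch below applies the standard Benjamini-Hochberg correction to a batch of hypothesis p-values. SHEVA's own adjustment methods additionally account for data-informed properties such as coverage and novelty, which are not modeled here.

```python
# A minimal sketch of multiple-hypothesis significance adjustment using the
# standard Benjamini-Hochberg (FDR) procedure. This only illustrates the baseline
# correction step, not SHEVA's data-informed adjustments.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Sort p-values while remembering their original hypothesis indices.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k/m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

# Example: p-values from five one-sample tests over different data regions.
pvals = [0.001, 0.009, 0.04, 0.20, 0.76]
print(benjamini_hochberg(pvals, alpha=0.05))  # -> [0, 1] for these values
```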
Modernization of Databases in the Cloud Era: Building Databases that Run Like Legos
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611639
Feifei Li
Utilizing the cloud for common and critical computing infrastructure has already become the norm across the board. The rapid evolution of the underlying cloud infrastructure and the revolutionary development of AI present both challenges and opportunities for building new database architectures and systems. It is crucial to modernize database systems in the cloud era, so that next-generation cloud-native databases may run like Legos: adaptive, flexible, reliable, and smart in the face of dynamic workloads and varying requirements. To that end, we observe four critical trends and requirements for the modernization of cloud databases: embracing cloud-native architecture, full integration with cloud platforms and orchestration, co-design for data fabric, and moving towards being AI-augmented. Modernizing database systems by adopting these critical trends and addressing the key challenges associated with them provides ample opportunities for the data management community, in both academia and industry, to explore. We will provide an in-depth case study of how we modernize PolarDB with respect to these four trends in the cloud era. Our ultimate goal is to build databases that run just like playing with Legos, so that a database system fits rich and dynamic workloads and requirements in a self-adaptive, performant, easy- and intuitive-to-use, reliable, and intelligent manner.
FEBench: A Benchmark for Real-Time Relational Data Feature Extraction
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611550
Xuanhe Zhou, Cheng Chen, Kunyi Li, Bingsheng He, Mian Lu, Qiaosheng Liu, Wei Huang, Guoliang Li, Zhao Zheng, Yuqiang Chen
As the use of online AI inference services rapidly expands in various applications (e.g., fraud detection in banking, product recommendation in e-commerce), real-time feature extraction (RTFE) systems have been developed to compute the requested features from incoming data tuples with ultra-low latency. Similar to relational databases, these RTFE procedures can be expressed using SQL-like languages. However, there is a lack of research on the workload characteristics and specialized benchmarks for RTFE, especially in comparison with existing database workloads and benchmarks (e.g., concurrent transactions in TPC-C). In this paper, we study the RTFE workload characteristics using over one hundred real datasets from open repositories (e.g., Kaggle, Tianchi, UCI ML, KiltHub) and those from 4Paradigm. The study highlights the significant differences between RTFE workloads and existing database benchmarks in terms of application scenarios, operator distributions, and query structures. Based on these findings, we develop a real-time feature extraction benchmark named FEBench, following the four criteria for a domain-specific benchmark proposed by Jim Gray. FEBench consists of selected representative datasets, query templates, and an online request simulator. We use FEBench to evaluate the effectiveness of feature extraction systems including OpenMLDB and Flink and find that each system exhibits distinct advantages and limitations in terms of overall latency, tail latency, and concurrency performance.
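For readers unfamiliar with RTFE workloads, the sketch below computes a typical request-time feature: window aggregates over a key's most recent events. The schema and the 10-second window are made-up examples, not queries drawn from FEBench's templates.

```python
# Illustrative sketch of the kind of real-time feature an RTFE workload computes:
# per-request window aggregates over the most recent events for one key.
# The (user_id, amount, ts) schema and window length are made-up examples.
from collections import deque

class SlidingWindowFeatures:
    def __init__(self, window_seconds=10.0):
        self.window = window_seconds
        self.events = {}  # user_id -> deque of (ts, amount)

    def ingest(self, user_id, ts, amount):
        self.events.setdefault(user_id, deque()).append((ts, amount))

    def extract(self, user_id, now):
        """Return (count, sum, avg) of the user's events within the window ending at `now`."""
        q = self.events.get(user_id, deque())
        while q and q[0][0] < now - self.window:
            q.popleft()                       # evict expired events
        count = len(q)
        total = sum(a for _, a in q)
        return count, total, (total / count if count else 0.0)

rtfe = SlidingWindowFeatures()
rtfe.ingest("u1", ts=100.0, amount=25.0)
rtfe.ingest("u1", ts=105.0, amount=75.0)
print(rtfe.extract("u1", now=106.0))          # -> (2, 100.0, 50.0)
```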
PSFQ: A Blockchain-Based Privacy-Preserving and Verifiable Student Feedback Questionnaire Platform
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611585
Wangze Ni, Pengze Chen, Lei Chen
Recently, more and more higher education institutions have been using student feedback questionnaires (SFQs) to evaluate teaching. However, existing SFQ systems have two shortcomings. First, the respondent of an SFQ is not anonymous. Second, the statistical reports of SFQs can be manipulated. To tackle these two shortcomings, we develop a novel SFQ system, namely PSFQ. In PSFQ, the respondent of an SFQ is mixed with multiple users by a ring signature. PSFQ uses an advanced ring signature approach to minimize the size of the ring signature while still satisfying the anonymity requirement, overcoming the first shortcoming. Moreover, all answers are encrypted by homomorphic encryption and stored on the blockchain, enabling users to verify the correctness of the statistical reports. Our demonstration will showcase how PSFQ provides confidential SFQ responses while ensuring the correctness of statistical reports.
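The sketch below illustrates the general idea behind tallying homomorphically encrypted answers: ciphertexts can be combined so that decryption yields the sum of the plaintext ratings without exposing any individual response. It uses textbook Paillier with tiny hard-coded primes, is utterly insecure, and does not reproduce the paper's actual construction, ring signatures, or on-chain storage.

```python
# Toy additively homomorphic tally of questionnaire ratings using textbook Paillier
# with tiny hard-coded primes (insecure; illustration only, not PSFQ's construction).
import math, random

# Toy key generation (real deployments use large random primes).
p, q = 101, 113
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

# Each respondent encrypts a rating from 1 to 5; multiplying the ciphertexts
# corresponds to adding the plaintext ratings, so only the total is ever decrypted.
ratings = [5, 4, 3, 5, 2]
ciphertexts = [encrypt(r) for r in ratings]
aggregate = math.prod(ciphertexts) % n_sq
print(decrypt(aggregate), sum(ratings))  # both print 19
```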
BrewER: Entity Resolution On-Demand
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611612
Luca Zecchini, Giovanni Simonini, Sonia Bergamaschi, Felix Naumann
The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in a dataset and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet, the traditional approach to ER requires cleaning the entire dataset before consistent queries can be run on it; hence, users struggle in common scenarios with limited time or resources (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task). We previously introduced BrewER, a framework to evaluate SQL SP (selection-projection) queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.
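The sketch below conveys the on-demand, progressive flavor of this approach: candidate duplicate groups are resolved lazily in the user's priority order, and consolidated results are emitted as soon as they qualify. The grouping, merge rule, and query are illustrative, not BrewER's algorithm.

```python
# A minimal sketch of on-demand, progressive entity resolution: only the candidate
# duplicate groups a query actually reaches are resolved, and consolidated results
# are emitted in priority order. Merge rule and query are made-up illustrations.
def resolve(group):
    """Consolidate a group of dirty records into one clean record (toy merge rule)."""
    return {
        "name": max((r["name"] for r in group), key=len),   # keep the longest name variant
        "price": min(r["price"] for r in group),             # keep the lowest observed price
    }

def query_on_demand(groups, predicate, priority_key):
    """Emulate SELECT ... WHERE predicate ORDER BY priority over dirty data."""
    # Order candidate groups by the priority of their "best" member, then clean lazily.
    for group in sorted(groups, key=lambda g: min(priority_key(r) for r in g)):
        entity = resolve(group)            # ER happens only when the group is reached
        if predicate(entity):
            yield entity                   # progressive result, as if data were already clean

dirty = [
    [{"name": "iPhone 13", "price": 799}, {"name": "Apple iPhone 13", "price": 805}],
    [{"name": "Galaxy S22", "price": 699}, {"name": "Samsung Galaxy S22", "price": 710}],
]
for row in query_on_demand(dirty, lambda e: e["price"] < 800, lambda r: r["price"]):
    print(row)
```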
CEDA: Learned Cardinality Estimation with Domain Adaptation
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611589
Zilong Wang, Qixiong Zeng, Ning Wang, Haowen Lu, Yue Zhang
Cardinality estimation (CE) is a fundamental and critical problem in DBMS query optimization, and deep learning techniques have enabled significant breakthroughs in CE research. However, apart from requiring sufficiently large training data to cover all possible query regions for accurate estimation, current query-driven CE methods also suffer from workload drift. Retraining or fine-tuning needs cardinality labels as ground truth, and obtaining these labels from the DBMS is expensive. Therefore, we propose CEDA, a novel domain-adaptive CE system. CEDA achieves more accurate estimations by automatically generating workloads as training data according to the data distribution in the database, and by incorporating histogram information into an attention-based cardinality estimator. To solve the problem of workload drift in real-world environments, CEDA adopts a domain adaptation strategy, making the model more robust so that it performs well on an unlabeled workload whose feature distribution differs substantially from that of the training set.
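To give a concrete picture of a query-driven, attention-based estimator that consumes histogram-derived features, the sketch below builds a small Transformer over per-predicate tokens and regresses log-cardinality. The dimensions, encodings, and training step are illustrative assumptions, not CEDA's architecture.

```python
# A minimal sketch (not CEDA's architecture) of an attention-based cardinality
# estimator whose per-predicate tokens include a histogram-derived selectivity feature.
import torch
import torch.nn as nn

class AttnCardEstimator(nn.Module):
    def __init__(self, pred_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(pred_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, pred_tokens, pad_mask):
        # pred_tokens: (batch, n_preds, pred_dim); each row concatenates column and
        # operator encodings, the literal, and a histogram selectivity estimate for
        # that predicate. pad_mask: True where a position is padding.
        x = self.encoder(self.proj(pred_tokens), src_key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0).sum(1) / (~pad_mask).sum(1, keepdim=True)
        return self.head(x).squeeze(-1)        # predicted log-cardinality

model = AttnCardEstimator(pred_dim=10)
tokens = torch.randn(8, 5, 10)                 # 8 queries, up to 5 predicates each
mask = torch.zeros(8, 5, dtype=torch.bool)     # no padding in this toy batch
log_card = torch.log(torch.randint(1, 10_000, (8,)).float())  # fake training labels
loss = nn.functional.mse_loss(model(tokens, mask), log_card)  # regress log-cardinality
loss.backward()
```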
Approximate Queries over Concurrent Updates
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611602
Congying Wang, Nithin Sastry Tellapuri, Sphoorthi Keshannagari, Dylan Zinsley, Zhuoyue Zhao, Dong Xie
Approximate Query Processing (AQP) systems produce estimates of query answers using small random samples. They are attractive to users willing to trade accuracy for low query latency. On the other hand, real-world data are often subject to concurrent updates. If the user wants to perform real-time approximate data analysis, the AQP system must support concurrent updates and sampling. Toward that end, we recently developed a new concurrent index, the AB-tree, to support efficient sampling under updates. In this work, we demonstrate the feasibility of supporting real-time approximate data analysis in online transaction settings using index-assisted sampling.
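The sketch below illustrates why an index that maintains counts makes sampling under updates cheap: with a Fenwick tree over per-key counts, inserts, deletes, and uniform random sampling all take O(log n). It is a toy stand-in for the idea, not the AB-tree itself.

```python
# Toy illustration (not the AB-tree) of index-assisted sampling under updates:
# a Fenwick (binary indexed) tree over per-key counts supports O(log n) inserts,
# deletes, and uniform random sampling of a stored item.
import random

class CountIndex:
    def __init__(self, key_space):
        self.n = key_space
        self.tree = [0] * (key_space + 1)
        self.total = 0

    def update(self, key, delta):            # insert (delta=+1) or delete (delta=-1) one item
        self.total += delta
        i = key + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def sample(self):
        """Return a uniformly random key among all currently stored items."""
        target = random.randint(1, self.total)
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]      # skip the whole subtree of counts
                pos = nxt
            bit >>= 1
        return pos                            # 0-based key whose prefix count reaches target

idx = CountIndex(key_space=1000)
for _ in range(10_000):
    idx.update(random.randrange(1000), +1)    # stream of inserts
print([idx.sample() for _ in range(5)])       # random items for an approximate query
```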
EmbedX: A Versatile, Efficient and Scalable Platform to Embed Both Graphs and High-Dimensional Sparse Data
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611546
Yuanhang Zou, Zhihao Ding, Jieming Shi, Shuting Guo, Chunchen Su, Yafei Zhang
In modern online services, it is of growing importance to process web-scale graph data and high-dimensional sparse data together into embeddings for downstream tasks such as recommendation, advertisement, prediction, and classification. Learning methods and systems exist for either high-dimensional sparse data or graphs, but not both. There is an urgent need in industry for a system that efficiently processes both types of data for higher business value, which, however, is challenging to build. The data in Tencent contains billions of samples with sparse features in very high dimensions, and graphs with billions of nodes and edges. Moreover, learning models often perform expensive operations with high computational costs. It is difficult to store, manage, and retrieve massive sparse data and graph data together, since they exhibit different characteristics. We present EmbedX, an industrial distributed learning framework from Tencent, which is versatile and efficient in supporting embedding on both graphs and high-dimensional sparse data. EmbedX consists of distributed server layers for graph and sparse data management, and optimized parameter and graph operators, to efficiently support four categories of methods: deep learning models on high-dimensional sparse data, network embedding methods, graph neural networks, and in-house joint learning models on both types of data. Extensive experiments on massive Tencent data and public data demonstrate the superiority of EmbedX. For instance, on a Tencent dataset with 1.3 billion nodes, 35 billion edges, and 2.8 billion samples with sparse features in 1.6 billion dimensions, EmbedX trains an order of magnitude faster and our joint models achieve superior effectiveness. EmbedX is deployed in Tencent. A/B tests on real use cases further validate the power of EmbedX. EmbedX is implemented in C++ and open-sourced at https://github.com/Tencent/embedx.
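The sketch below shows, under illustrative assumptions, how sparse feature ids and graph structure can be embedded jointly: an EmbeddingBag lookup over a node's active sparse ids, followed by GraphSAGE-style neighbor mean aggregation. It is not EmbedX's operator set or training pipeline.

```python
# Minimal sketch (illustrative assumptions, not EmbedX's operators) of jointly
# embedding high-dimensional sparse features and graph structure.
import torch
import torch.nn as nn

class SparseGraphEmbed(nn.Module):
    def __init__(self, n_sparse_features, dim=32):
        super().__init__()
        # One row per sparse feature id; sums the embeddings of a node's active ids.
        self.feat = nn.EmbeddingBag(n_sparse_features, dim, mode="sum")
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, feat_ids, offsets, neighbors):
        h = self.feat(feat_ids, offsets)               # (n_nodes, dim) from sparse ids
        agg = torch.stack([h[nbrs].mean(0) for nbrs in neighbors])  # neighbor mean
        return torch.relu(self.combine(torch.cat([h, agg], dim=1)))

model = SparseGraphEmbed(n_sparse_features=1_000_000)
# Three nodes; node 0 has sparse ids [5, 42], node 1 has [7], node 2 has [42, 9, 13].
feat_ids = torch.tensor([5, 42, 7, 42, 9, 13])
offsets = torch.tensor([0, 2, 3])
neighbors = [torch.tensor([1, 2]), torch.tensor([0]), torch.tensor([0, 1])]
print(model(feat_ids, offsets, neighbors).shape)       # torch.Size([3, 32])
```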