Krypton: Real-Time Serving and Analytical SQL Engine at ByteDance
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611545
Jianjun Chen, Rui Shi, Heng Chen, Li Zhang, Ruidong Li, Wei Ding, Liya Fan, Hao Wang, Mu Xiong, Yuxiang Chen, Benchao Dong, Kuankuan Guo, Yuanjin Lin, Xiao Liu, Haiyang Shi, Peipei Wang, Zikang Wang, Yemeng Yang, Junda Zhao, Dongyan Zhou, Zhikai Zuo, Yuming Liang
In recent years, at ByteDance, we have started seeing more and more business scenarios that require performing real-time data serving in addition to complex ad hoc analysis over large amounts of freshly imported data. The serving workload requires performing complex queries over massive newly added data items with minimal delay. These systems are often used in mission-critical scenarios that traditional OLAP systems cannot handle. To work around the problem, ByteDance products often have to use multiple systems together in production, forcing the same data to be ETLed into multiple systems, causing data consistency problems, wasting resources, and increasing learning and maintenance costs. To solve this problem, we built a single Hybrid Serving and Analytical Processing (HSAP) system to handle both workload types. HSAP is still in its early stage, and very few systems are yet on the market. This paper demonstrates how to build Krypton, a competitive cloud-native HSAP system that provides both excellent elasticity and query performance by utilizing many previously known query processing techniques, a hierarchical cache with persistent memory, and a native columnar storage format. Krypton supports high data freshness, high data ingestion rates, and strong data consistency. We also discuss lessons and best practices we learned in developing and operating Krypton in production.
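The sketch below illustrates the general lookup path of a hierarchical cache in front of remote columnar storage: a fast tier is consulted first, then a larger persistent-memory-style tier, then the backing store, with promotion on hits from the slower paths. Tier sizes, the LRU policy, and the fetch function are illustrative assumptions, not Krypton's implementation.

```python
# Illustrative sketch of a two-tier hierarchical cache (DRAM tier over a larger
# persistent-memory-style tier) in front of a slow backing store. Sizes, the LRU
# policy, and the fetch function are illustrative assumptions, not Krypton's design.
from collections import OrderedDict

class LRUTier:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)        # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used block

class HierarchicalCache:
    def __init__(self, fetch_from_store, dram_blocks=2, pmem_blocks=8):
        self.dram = LRUTier(dram_blocks)
        self.pmem = LRUTier(pmem_blocks)
        self.fetch = fetch_from_store         # e.g., read a column block from cloud storage

    def read(self, block_id):
        value = self.dram.get(block_id)
        if value is None:
            value = self.pmem.get(block_id)
            if value is None:
                value = self.fetch(block_id)  # slow path: remote columnar storage
                self.pmem.put(block_id, value)
            self.dram.put(block_id, value)    # promote hot blocks to the fastest tier
        return value

cache = HierarchicalCache(lambda b: f"column-block-{b}")
print(cache.read(1), cache.read(1))           # second read is served from the DRAM tier
```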
DeepVQL: Deep Video Queries on PostgreSQL
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611583
Dong June Lew, Kihyun Yoo, Kwang Woo Nam
The recent development of mobile and camera devices has led to the generation, sharing, and use of massive amounts of video data. As a result, deep learning has gained attention as an alternative for video recognition and situation judgment. Recently, new systems supporting SQL-like declarative query languages have emerged, each developing its own engine to support new deep-learning-based queries that existing systems do not support. The DeepVQL system proposed in this paper is instead implemented by extending PostgreSQL. DeepVQL supports video database functions and provides various user-defined functions for object detection, object tracking, and video analytics queries. The advantage of this system is its ability to use queries with specific spatial regions or temporal durations as conditions for analyzing moving objects in traffic videos.
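To make the idea of spatial and temporal query conditions concrete, the sketch below shows how such a video-analytics query might be issued against a PostgreSQL-based system from Python. The UDF names (vql_detect_objects, vql_in_region), table, and columns are hypothetical placeholders, not DeepVQL's actual API.

```python
# Illustrative only: a sketch of querying a DeepVQL-style installation from Python.
# The UDFs (vql_detect_objects, vql_in_region) and the schema are hypothetical
# placeholders, not the actual DeepVQL API.
import psycopg2

conn = psycopg2.connect(dbname="traffic", user="analyst")
with conn, conn.cursor() as cur:
    # Find cars detected inside a given spatial region during a time window.
    cur.execute(
        """
        SELECT v.frame_ts, obj.label, obj.bbox
        FROM traffic_video v,
             LATERAL vql_detect_objects(v.frame) AS obj   -- hypothetical detection UDF
        WHERE v.frame_ts BETWEEN %s AND %s                -- temporal condition
          AND vql_in_region(obj.bbox, %s)                 -- hypothetical spatial predicate
          AND obj.label = 'car'
        ORDER BY v.frame_ts;
        """,
        ("2023-08-01 08:00", "2023-08-01 09:00",
         "POLYGON((0 0, 100 0, 100 100, 0 100, 0 0))"),
    )
    for row in cur.fetchall():
        print(row)
```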
SHEVA: A Visual Analytics System for Statistical Hypothesis Exploration
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611631
Vicente Nejar de Almeida, Eduardo Ribeiro, Nassim Bouarour, João Luiz Dihl Comba, Sihem Amer-Yahia
We demonstrate SHEVA, a System for Hypothesis Exploration with Visual Analytics. SHEVA adopts an Exploratory Data Analysis (EDA) approach to discovering statistically sound insights from large datasets. The system addresses three longstanding challenges in multiple hypothesis testing: (i) the likelihood of rejecting the null hypothesis by chance, (ii) the pitfall of insights not being representative of the input data, and (iii) the ability to navigate among many data regions while preserving the user's train of thought. To address (i) and (ii), SHEVA implements significance adjustment methods that account for data-informed properties such as coverage and novelty. To address (iii), SHEVA guides users by recommending one-sample and two-sample hypotheses in a stepwise fashion following a data hierarchy. Users may choose from a collection of pre-trained hypothesis exploration policies and let SHEVA guide them through the most significant hypotheses in the data, or intervene to override suggested hypotheses. Furthermore, SHEVA relies on data-to-visual element mappings to convey hypothesis testing results in an interpretable fashion, and allows hypothesis pipelines to be stored and retrieved later to be tested on new datasets.
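As a point of reference for the significance-adjustment step, the sketch below applies the standard Benjamini-Hochberg correction to a batch of hypothesis p-values. SHEVA's own adjustment methods additionally account for data-informed properties such as coverage and novelty, which are not modeled here.

```python
# A minimal sketch of multiple-hypothesis significance adjustment using the
# standard Benjamini-Hochberg (FDR) procedure. This only illustrates the baseline
# correction step, not SHEVA's data-informed adjustments.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Sort p-values while remembering their original hypothesis indices.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k/m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

# Example: p-values from five one-sample tests over different data regions.
pvals = [0.001, 0.009, 0.04, 0.20, 0.76]
print(benjamini_hochberg(pvals, alpha=0.05))  # -> [0, 1] for these values
```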
Modernization of Databases in the Cloud Era: Building Databases that Run Like Legos
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611639
Feifei Li
Utilizing the cloud for common and critical computing infrastructure has already become the norm across the board. The rapid evolution of the underlying cloud infrastructure and the revolutionary development of AI present both challenges and opportunities for building new database architectures and systems. It is crucial to modernize database systems in the cloud era, so that next-generation cloud-native databases may run like Legos: adaptive, flexible, reliable, and smart in the face of dynamic workloads and varying requirements. To that end, we observe four critical trends and requirements for the modernization of cloud databases: embracing cloud-native architecture, full integration with cloud platforms and orchestration, co-design for data fabric, and moving towards being AI-augmented. Modernizing database systems by adopting these critical trends and addressing the key challenges associated with them provides ample opportunities for the data management community, in both academia and industry, to explore. We will provide an in-depth case study of how we modernize PolarDB with respect to these four trends in the cloud era. Our ultimate goal is to build databases that run just like playing with Legos, so that a database system fits rich and dynamic workloads and requirements in a self-adaptive, performant, easy- and intuitive-to-use, reliable, and intelligent manner.
FEBench: A Benchmark for Real-Time Relational Data Feature Extraction
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611550
Xuanhe Zhou, Cheng Chen, Kunyi Li, Bingsheng He, Mian Lu, Qiaosheng Liu, Wei Huang, Guoliang Li, Zhao Zheng, Yuqiang Chen
As the use of online AI inference services rapidly expands in various applications (e.g., fraud detection in banking, product recommendation in e-commerce), real-time feature extraction (RTFE) systems have been developed to compute the requested features from incoming data tuples with ultra-low latency. Similar to relational databases, these RTFE procedures can be expressed using SQL-like languages. However, there is a lack of research on the workload characteristics and specialized benchmarks for RTFE, especially in comparison with existing database workloads and benchmarks (e.g., concurrent transactions in TPC-C). In this paper, we study the RTFE workload characteristics using over one hundred real datasets from open repositories (e.g., Kaggle, Tianchi, UCI ML, KiltHub) and those from 4Paradigm. The study highlights the significant differences between RTFE workloads and existing database benchmarks in terms of application scenarios, operator distributions, and query structures. Based on these findings, we develop a real-time feature extraction benchmark named FEBench, following the four criteria for a domain-specific benchmark proposed by Jim Gray. FEBench consists of selected representative datasets, query templates, and an online request simulator. We use FEBench to evaluate the effectiveness of feature extraction systems including OpenMLDB and Flink and find that each system exhibits distinct advantages and limitations in terms of overall latency, tail latency, and concurrency performance.
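For readers unfamiliar with RTFE workloads, the sketch below computes a typical request-time feature: window aggregates over a key's most recent events. The schema and the 10-second window are made-up examples, not queries drawn from FEBench's templates.

```python
# Illustrative sketch of the kind of real-time feature an RTFE workload computes:
# per-request window aggregates over the most recent events for one key.
# The (user_id, amount, ts) schema and window length are made-up examples.
from collections import deque

class SlidingWindowFeatures:
    def __init__(self, window_seconds=10.0):
        self.window = window_seconds
        self.events = {}  # user_id -> deque of (ts, amount)

    def ingest(self, user_id, ts, amount):
        self.events.setdefault(user_id, deque()).append((ts, amount))

    def extract(self, user_id, now):
        """Return (count, sum, avg) of the user's events within the window ending at `now`."""
        q = self.events.get(user_id, deque())
        while q and q[0][0] < now - self.window:
            q.popleft()                       # evict expired events
        count = len(q)
        total = sum(a for _, a in q)
        return count, total, (total / count if count else 0.0)

rtfe = SlidingWindowFeatures()
rtfe.ingest("u1", ts=100.0, amount=25.0)
rtfe.ingest("u1", ts=105.0, amount=75.0)
print(rtfe.extract("u1", now=106.0))          # -> (2, 100.0, 50.0)
```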
PSFQ: A Blockchain-Based Privacy-Preserving and Verifiable Student Feedback Questionnaire Platform
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611585
Wangze Ni, Pengze Chen, Lei Chen
Recently, more and more higher education institutions have been using student feedback questionnaires (SFQs) to evaluate teaching. However, existing SFQ systems have two shortcomings. First, the respondent of an SFQ is not anonymous. Second, the statistical reports of SFQs can be manipulated. To tackle these two shortcomings, we develop a novel SFQ system, namely PSFQ. In PSFQ, the respondent of an SFQ is mixed with multiple users by a ring signature. PSFQ uses an advanced ring signature approach to minimize the size of the ring signature while still satisfying the anonymity requirement, overcoming the first shortcoming. Moreover, all answers are encrypted by homomorphic encryption and stored on the blockchain, enabling users to verify the correctness of the statistical reports. Our demonstration will showcase how PSFQ provides confidential SFQ responses while ensuring the correctness of statistical reports.
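The sketch below illustrates the general idea behind tallying homomorphically encrypted answers: ciphertexts can be combined so that decryption yields the sum of the plaintext ratings without exposing any individual response. It uses textbook Paillier with tiny hard-coded primes, is utterly insecure, and does not reproduce the paper's actual construction, ring signatures, or on-chain storage.

```python
# Toy additively homomorphic tally of questionnaire ratings using textbook Paillier
# with tiny hard-coded primes (insecure; illustration only, not PSFQ's construction).
import math, random

# Toy key generation (real deployments use large random primes).
p, q = 101, 113
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

# Each respondent encrypts a rating from 1 to 5; multiplying the ciphertexts
# corresponds to adding the plaintext ratings, so only the total is ever decrypted.
ratings = [5, 4, 3, 5, 2]
ciphertexts = [encrypt(r) for r in ratings]
aggregate = math.prod(ciphertexts) % n_sq
print(decrypt(aggregate), sum(ratings))  # both print 19
```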
BrewER: Entity Resolution On-Demand
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611612
Luca Zecchini, Giovanni Simonini, Sonia Bergamaschi, Felix Naumann
The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in a dataset and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet, the traditional approach to ER requires cleaning the entire dataset before consistent queries can be run on it; hence, users struggle in common scenarios with limited time or resources (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task). We previously introduced BrewER, a framework to evaluate SQL SP (selection-projection) queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.
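The sketch below conveys the on-demand, progressive flavor of this approach: candidate duplicate groups are resolved lazily in the user's priority order, and consolidated results are emitted as soon as they qualify. The grouping, merge rule, and query are illustrative, not BrewER's algorithm.

```python
# A minimal sketch of on-demand, progressive entity resolution: only the candidate
# duplicate groups a query actually reaches are resolved, and consolidated results
# are emitted in priority order. Merge rule and query are made-up illustrations.
def resolve(group):
    """Consolidate a group of dirty records into one clean record (toy merge rule)."""
    return {
        "name": max((r["name"] for r in group), key=len),   # keep the longest name variant
        "price": min(r["price"] for r in group),             # keep the lowest observed price
    }

def query_on_demand(groups, predicate, priority_key):
    """Emulate SELECT ... WHERE predicate ORDER BY priority over dirty data."""
    # Order candidate groups by the priority of their "best" member, then clean lazily.
    for group in sorted(groups, key=lambda g: min(priority_key(r) for r in g)):
        entity = resolve(group)            # ER happens only when the group is reached
        if predicate(entity):
            yield entity                   # progressive result, as if data were already clean

dirty = [
    [{"name": "iPhone 13", "price": 799}, {"name": "Apple iPhone 13", "price": 805}],
    [{"name": "Galaxy S22", "price": 699}, {"name": "Samsung Galaxy S22", "price": 710}],
]
for row in query_on_demand(dirty, lambda e: e["price"] < 800, lambda r: r["price"]):
    print(row)
```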
CEDA: Learned Cardinality Estimation with Domain Adaptation
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611589
Zilong Wang, Qixiong Zeng, Ning Wang, Haowen Lu, Yue Zhang
Cardinality estimation (CE) is a fundamental and critical problem in DBMS query optimization, and deep learning techniques have enabled significant breakthroughs in CE research. However, apart from requiring sufficiently large training data to cover all possible query regions for accurate estimation, current query-driven CE methods also suffer from workload drift. Retraining or fine-tuning needs cardinality labels as ground truth, and obtaining these labels from the DBMS is expensive. Therefore, we propose CEDA, a novel domain-adaptive CE system. CEDA achieves more accurate estimations by automatically generating workloads as training data according to the data distribution in the database, and by incorporating histogram information into an attention-based cardinality estimator. To solve the problem of workload drift in real-world environments, CEDA adopts a domain adaptation strategy, making the model more robust so that it performs well on an unlabeled workload whose feature distribution differs substantially from that of the training set.
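To give a concrete picture of a query-driven, attention-based estimator that consumes histogram-derived features, the sketch below builds a small Transformer over per-predicate tokens and regresses log-cardinality. The dimensions, encodings, and training step are illustrative assumptions, not CEDA's architecture.

```python
# A minimal sketch (not CEDA's architecture) of an attention-based cardinality
# estimator whose per-predicate tokens include a histogram-derived selectivity feature.
import torch
import torch.nn as nn

class AttnCardEstimator(nn.Module):
    def __init__(self, pred_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(pred_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, pred_tokens, pad_mask):
        # pred_tokens: (batch, n_preds, pred_dim); each row concatenates column and
        # operator encodings, the literal, and a histogram selectivity estimate for
        # that predicate. pad_mask: True where a position is padding.
        x = self.encoder(self.proj(pred_tokens), src_key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0).sum(1) / (~pad_mask).sum(1, keepdim=True)
        return self.head(x).squeeze(-1)        # predicted log-cardinality

model = AttnCardEstimator(pred_dim=10)
tokens = torch.randn(8, 5, 10)                 # 8 queries, up to 5 predicates each
mask = torch.zeros(8, 5, dtype=torch.bool)     # no padding in this toy batch
log_card = torch.log(torch.randint(1, 10_000, (8,)).float())  # fake training labels
loss = nn.functional.mse_loss(model(tokens, mask), log_card)  # regress log-cardinality
loss.backward()
```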
Approximate Queries over Concurrent Updates
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611602
Congying Wang, Nithin Sastry Tellapuri, Sphoorthi Keshannagari, Dylan Zinsley, Zhuoyue Zhao, Dong Xie
Approximate Query Processing (AQP) systems produce estimates of query answers using small random samples. They are attractive to users willing to trade accuracy for low query latency. On the other hand, real-world data are often subject to concurrent updates. If the user wants to perform real-time approximate data analysis, the AQP system must support concurrent updates and sampling. Toward that end, we recently developed a new concurrent index, the AB-tree, to support efficient sampling under updates. In this work, we demonstrate the feasibility of supporting real-time approximate data analysis in online transaction settings using index-assisted sampling.
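The sketch below illustrates why an index that maintains counts makes sampling under updates cheap: with a Fenwick tree over per-key counts, inserts, deletes, and uniform random sampling all take O(log n). It is a toy stand-in for the idea, not the AB-tree itself.

```python
# Toy illustration (not the AB-tree) of index-assisted sampling under updates:
# a Fenwick (binary indexed) tree over per-key counts supports O(log n) inserts,
# deletes, and uniform random sampling of a stored item.
import random

class CountIndex:
    def __init__(self, key_space):
        self.n = key_space
        self.tree = [0] * (key_space + 1)
        self.total = 0

    def update(self, key, delta):            # insert (delta=+1) or delete (delta=-1) one item
        self.total += delta
        i = key + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def sample(self):
        """Return a uniformly random key among all currently stored items."""
        target = random.randint(1, self.total)
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]      # skip the whole subtree of counts
                pos = nxt
            bit >>= 1
        return pos                            # 0-based key whose prefix count reaches target

idx = CountIndex(key_space=1000)
for _ in range(10_000):
    idx.update(random.randrange(1000), +1)    # stream of inserts
print([idx.sample() for _ in range(5)])       # random items for an approximate query
```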
EmbedX: A Versatile, Efficient and Scalable Platform to Embed Both Graphs and High-Dimensional Sparse Data
Pub Date: 2023-08-01, DOI: 10.14778/3611540.3611546
Yuanhang Zou, Zhihao Ding, Jieming Shi, Shuting Guo, Chunchen Su, Yafei Zhang
In modern online services, it is of growing importance to process web-scale graph data and high-dimensional sparse data together into embeddings for downstream tasks such as recommendation, advertisement, prediction, and classification. Learning methods and systems exist for either high-dimensional sparse data or graphs, but not both. There is an urgent need in industry for a system that efficiently processes both types of data for higher business value, which, however, is challenging to build. The data in Tencent contains billions of samples with sparse features in very high dimensions, and graphs with billions of nodes and edges. Moreover, learning models often perform expensive operations with high computational costs. It is difficult to store, manage, and retrieve massive sparse data and graph data together, since they exhibit different characteristics. We present EmbedX, an industrial distributed learning framework from Tencent, which is versatile and efficient in supporting embedding on both graphs and high-dimensional sparse data. EmbedX consists of distributed server layers for graph and sparse data management, and optimized parameter and graph operators, to efficiently support four categories of methods: deep learning models on high-dimensional sparse data, network embedding methods, graph neural networks, and in-house joint learning models on both types of data. Extensive experiments on massive Tencent data and public data demonstrate the superiority of EmbedX. For instance, on a Tencent dataset with 1.3 billion nodes, 35 billion edges, and 2.8 billion samples with sparse features in 1.6 billion dimensions, EmbedX trains an order of magnitude faster and our joint models achieve superior effectiveness. EmbedX is deployed in Tencent. A/B tests on real use cases further validate the power of EmbedX. EmbedX is implemented in C++ and open-sourced at https://github.com/Tencent/embedx.
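The sketch below shows, under illustrative assumptions, how sparse feature ids and graph structure can be embedded jointly: an EmbeddingBag lookup over a node's active sparse ids, followed by GraphSAGE-style neighbor mean aggregation. It is not EmbedX's operator set or training pipeline.

```python
# Minimal sketch (illustrative assumptions, not EmbedX's operators) of jointly
# embedding high-dimensional sparse features and graph structure.
import torch
import torch.nn as nn

class SparseGraphEmbed(nn.Module):
    def __init__(self, n_sparse_features, dim=32):
        super().__init__()
        # One row per sparse feature id; sums the embeddings of a node's active ids.
        self.feat = nn.EmbeddingBag(n_sparse_features, dim, mode="sum")
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, feat_ids, offsets, neighbors):
        h = self.feat(feat_ids, offsets)               # (n_nodes, dim) from sparse ids
        agg = torch.stack([h[nbrs].mean(0) for nbrs in neighbors])  # neighbor mean
        return torch.relu(self.combine(torch.cat([h, agg], dim=1)))

model = SparseGraphEmbed(n_sparse_features=1_000_000)
# Three nodes; node 0 has sparse ids [5, 42], node 1 has [7], node 2 has [42, 9, 13].
feat_ids = torch.tensor([5, 42, 7, 42, 9, 13])
offsets = torch.tensor([0, 2, 3])
neighbors = [torch.tensor([1, 2]), torch.tensor([0]), torch.tensor([0, 1])]
print(model(feat_ids, offsets, neighbors).shape)       # torch.Size([3, 32])
```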