
Proceedings of the VLDB Endowment: Latest Articles

OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611560
Zhifeng Yang, Quanqing Xu, Shanyan Gao, Chuanhui Yang, Guoping Wang, Yuzhong Zhao, Fanyu Kong, Hao Liu, Wanhong Wang, Jinliang Xiao
As the OceanBase database system evolves, it is essential to improve its adaptability to small-scale enterprises. OceanBase has demonstrated its stability and effectiveness within Ant Group and other commercial organizations, as well as in the TPC-C and TPC-H benchmarks. In this paper, we design Paetica, an integrated stand-alone and distributed architecture that addresses the overhead caused by OceanBase's distributed components in stand-alone mode. Paetica enables adaptive configuration of the database, allowing OceanBase to support both serial and parallel execution in stand-alone and distributed scenarios, thus providing efficiency and economy. This design has been implemented in version 4.0 of OceanBase, and experiments show that Paetica exhibits notable scalability and outperforms alternative stand-alone or distributed databases. Furthermore, it enables OceanBase to move from primarily serving large enterprises to also serving small and medium enterprises, by employing a single OceanBase database across the successive stages of enterprise or business development, without the need for migration. Our experiments confirm that Paetica achieves linear scalability as the number of CPU cores increases in stand-alone mode. It also outperforms MySQL and Greenplum in the Sysbench and TPC-H evaluations.
Citations: 0
Taurus MM: Bringing Multi-Master to the Cloud
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611542
Alex Depoutovitch, Chong Chen, Per-Ake Larson, Jack Ng, Shu Lin, Guanzhu Xiong, Paul Lee, Emad Boctor, Samiao Ren, Lengdong Wu, Yuchen Zhang, Calvin Sun
A single-master database has limited update capacity because a single node handles all updates. A multi-master database potentially has higher update capacity because the load is spread across multiple nodes. However, the need to coordinate updates and ensure durability can generate high network traffic. Reducing network load is particularly important in a cloud environment where the network infrastructure is shared among thousands of tenants. In this paper, we present Taurus MM, a shared-storage multi-master database optimized for cloud environments. It implements two novel algorithms aimed at reducing network traffic plus a number of additional optimizations. The first algorithm is a new type of distributed clock that combines the small size of Lamport clocks with the effective support of distributed snapshots of vector clocks. The second algorithm is a new hybrid page and row locking protocol that significantly reduces the number of lock requests sent over the network. Experimental results on a cluster with up to eight masters demonstrate superior performance compared to Aurora multi-master and CockroachDB.
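For readers unfamiliar with the two clock types this abstract contrasts, the following is a minimal sketch of textbook Lamport and vector clocks. It is not Taurus MM's hybrid clock, whose design the abstract does not spell out; the class and function names are illustrative assumptions.

```python
from typing import List


class LamportClock:
    """Single-integer logical clock: constant size, but it cannot detect concurrency."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def merge(self, received: int) -> int:
        # Message receipt: jump past the sender's timestamp.
        self.time = max(self.time, received) + 1
        return self.time


class VectorClock:
    """One counter per node: size grows with the cluster, but causality is fully captured."""

    def __init__(self, node: int, num_nodes: int) -> None:
        self.node = node
        self.v = [0] * num_nodes

    def tick(self) -> List[int]:
        self.v[self.node] += 1
        return list(self.v)

    def merge(self, received: List[int]) -> List[int]:
        # Element-wise maximum, then count the receive as a local event.
        self.v = [max(a, b) for a, b in zip(self.v, received)]
        self.v[self.node] += 1
        return list(self.v)


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if vector timestamp a causally precedes b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: List[int], b: List[int]) -> bool:
    """True if neither timestamp causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)
```

With two nodes that each record one independent event, the vector timestamps `[1, 0]` and `[0, 1]` are reported as concurrent, whereas the two Lamport timestamps are both `1` and carry no such information. That size-versus-information trade-off is the one the abstract refers to.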
Citations: 0
Will LLMs Reshape, Supercharge, or Kill Data Science? (VLDB 2023 Panel)
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611634
Alon Halevy, Yejin Choi, Avrilia Floratou, Michael J. Franklin, Natasha Noy, Haixun Wang
Large language models (LLMs) have recently taken the world by storm, promising potentially game-changing opportunities in multiple fields. Naturally, there is significant promise in applying LLMs to the management of structured data, or more generally, to the processes involved in data science. At the very least, LLMs have the potential to provide substantial advancements in long-standing challenges that our community has been tackling for decades. On the other hand, they may introduce completely new capabilities that we have only dreamed of thus far. This panel will bring together a few leading experts who have been thinking about these opportunities from various perspectives and fielding them in research prototypes and even in commercial applications.
Citations: 1
CDSBen: Benchmarking the Performance of Storage Services in Cloud-Native Database System at ByteDance
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611549
Jiashu Zhang, Wen Jiang, Bo Tang, Haoxiang Ma, Lixun Cao, Zhongbin Jiang, Yuanyuan Nie, Fan Wang, Lei Zhang, Yuming Liang
In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate computation and storage in traditional monolithic OLTP databases. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance's cloud-native database veDB. We then elaborate on the limitations of using standard benchmarks such as TPC-C and YCSB to resemble these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDSBen. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.
Citations: 0
To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611610
Nico Schäfer, Damjan Gjurovski, Angjela Davitkova, Sebastian Michel
While existing data management solutions try to keep up with novel data formats and features, a myriad of valuable functionality is often only accessible via programming language libraries. Particularly for machine learning tasks, there is a wealth of pre-trained models and easy-to-use libraries that allow a wide audience to harness state-of-the-art machine learning. We propose the demonstration of a highly modularized data processor for semi-structured data that can be extended by means of plain Python scripts. Beyond the commonly supported user-defined functions, the deep decomposition allows augmenting the core engine with additional index structures, customized import and export routines, and custom aggregation functions. For several use cases, we detail how user-defined modules can be quickly realized, and we invite the audience to write and apply custom code and to tailor the code snippets we provide to their own preferences in order to solve data analytics tasks involving sentiment analysis of Twitter tweets.
Citations: 0
Fanglue: An Interactive System for Decision Rule Crafting
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611621
Chen Qian, Shiwei Liang, Zhaoyang Wang, Yin Lou
In many applications the training data do not always contain sufficient information to produce high-quality decision rules for standard (end-to-end) rule mining algorithms, and human experts have to incorporate domain knowledge during rule induction in order to get meaningful results. In this work we present Fanglue, a home-grown system inside Alipay, for interactive decision rule crafting. Fanglue is a distributed in-memory system and is highly responsive when processing large-scale datasets. In addition, Fanglue extends the standard representation of a decision rule by introducing disjunctive clauses. Having disjunctive clauses can improve the coverage and robustness of a decision rule, especially for fraud prevention in Fintech applications.
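To illustrate why disjunctive clauses can widen a rule's coverage, here is a minimal sketch that models a rule as a conjunction of clauses, each clause being a disjunction of predicates. This representation, the feature names, and the thresholds are assumptions made for illustration; the abstract does not specify Fanglue's actual rule format.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Predicate = Callable[[Record], bool]
Clause = List[Predicate]   # predicates inside a clause are OR-ed
Rule = List[Clause]        # clauses inside a rule are AND-ed


def rule_fires(rule: Rule, record: Record) -> bool:
    """A rule fires when every clause has at least one satisfied predicate."""
    return all(any(pred(record) for pred in clause) for clause in rule)


def coverage(rule: Rule, records: List[Record]) -> int:
    """Number of records on which the rule fires."""
    return sum(rule_fires(rule, r) for r in records)


# Hypothetical fraud-screening predicates.
high_amount: Predicate = lambda r: r["amount"] > 1000
many_failed_logins: Predicate = lambda r: r["failed_logins"] >= 3
new_account: Predicate = lambda r: r["account_age_days"] < 7

# Standard rule: a plain conjunction of single predicates.
conjunctive_rule: Rule = [[high_amount], [many_failed_logins]]
# Extended rule: the second clause accepts either risk signal.
disjunctive_rule: Rule = [[high_amount], [many_failed_logins, new_account]]

records: List[Record] = [
    {"amount": 5000, "failed_logins": 4, "account_age_days": 400},
    {"amount": 2000, "failed_logins": 0, "account_age_days": 2},
    {"amount": 50, "failed_logins": 5, "account_age_days": 1},
]

print(coverage(conjunctive_rule, records))  # 1
print(coverage(disjunctive_rule, records))  # 2
```

The second record has a high amount and a brand-new account but no failed logins, so only the rule with the disjunctive clause catches it.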
Citations: 0
MagicScaler: Uncertainty-Aware, Predictive Autoscaling
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611566
Zhicheng Pan, Yihang Wang, Yingying Zhang, Sean Bin Yang, Yunyao Cheng, Peng Chen, Chenjuan Guo, Qingsong Wen, Xiduo Tian, Yunliang Dou, Zhiqiang Zhou, Chengcheng Yang, Aoying Zhou, Bin Yang
Predictive autoscaling is a key enabler for optimizing cloud resource allocation in Alibaba Cloud's computing platforms, which dynamically adjust Elastic Compute Service (ECS) instances based on predicted user demands to ensure Quality of Service (QoS). However, user demands in the cloud are often highly complex, with high uncertainty and scale-sensitive temporal dependencies, thus posing great challenges for accurate prediction of future demands. These in turn make autoscaling challenging: autoscaling needs to properly account for demand uncertainty while maintaining a reasonable trade-off between two contradictory factors, i.e., low instance running costs vs. low QoS violation risks. To address the above challenges, we propose a novel predictive autoscaling framework, MagicScaler, consisting of a Multi-scale attentive Gaussian process based predictor and an uncertainty-aware scaler. First, the predictor carefully bridges the best of two successful prediction methodologies: multi-scale attention mechanisms, which are good at capturing complex, multi-scale features, and stochastic process regression, which can quantify prediction uncertainty, thus achieving accurate demand prediction with quantified uncertainty. Second, the scaler takes the quantified future demand uncertainty into a judiciously designed loss function with stochastic constraints, enabling flexible trade-off between running costs and QoS violation risks. Extensive experiments on three clusters of Alibaba Cloud in different Chinese cities demonstrate the effectiveness and efficiency of MagicScaler, which outperforms other commonly adopted scalers, thus justifying our design choices.
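The cost-versus-risk trade-off described above can be made concrete with a deliberately simplified sketch: given a Gaussian demand forecast (mean and standard deviation), provision enough instances to cover the demand quantile implied by a tolerated violation probability. This quantile rule is a stand-in chosen for illustration, not MagicScaler's loss function; the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def instances_needed(
    mean_demand: float,
    std_demand: float,
    capacity_per_instance: float,
    max_violation_prob: float,
) -> int:
    """Smallest instance count whose capacity covers demand with
    probability at least 1 - max_violation_prob, assuming Gaussian demand."""
    if not 0.0 < max_violation_prob < 1.0:
        raise ValueError("max_violation_prob must be strictly between 0 and 1")
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if std_demand == 0:
        # No uncertainty: plan for the point forecast.
        target = mean_demand
    else:
        # Demand level that is exceeded with probability max_violation_prob.
        target = NormalDist(mean_demand, std_demand).inv_cdf(1.0 - max_violation_prob)
    return max(0, math.ceil(target / capacity_per_instance))


# Forecast: mean 100 requests/s, std 10; each instance serves 10 requests/s.
print(instances_needed(100, 10, 10, 0.5))   # 10 (plan for the median only)
print(instances_needed(100, 10, 10, 0.05))  # 12 (pay for two extra instances to cut risk)
```

Lowering the tolerated violation probability raises the instance count, which is the running-cost side of the trade-off; a forecast with a larger standard deviation has the same effect.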
Citations: 1
Common Sense: The Dark Matter of Language and Intelligence (VLDB 2023 Keynote)
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611638
Yejin Choi
Scale appears to be the winning recipe in today's leaderboards. And yet, extreme-scale neural models are (un)surprisingly brittle and make errors that are often nonsensical and even counterintuitive. In this talk, I will argue for the importance of knowledge, especially commonsense knowledge, as well as inference-time reasoning algorithms, and demonstrate how smaller models developed in academia can still have an edge over larger industry-scale models, if powered with knowledge and/or reasoning algorithms.
Citations: 0
Microsoft Purview: A System for Central Governance of Data
Zone 3 (Computer Science); Q1 (Computer Science); Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611552
Shafi Ahmad, Dillidorai Arumugam, Srdan Bozovic, Elnata Degefa, Sailesh Duvvuri, Steven Gott, Nitish Gupta, Joachim Hammer, Nivedita Kaluskar, Raghav Kaushik, Rakesh Khanduja, Prasad Mujumdar, Gaurav Malhotra, Pankaj Naik, Nikolas Ogg, Krishna Kumar Parthasarthy, Raghu Ramakrishnan, Vlad Rodriguez, Rahul Sharma, Jakub Szymaszek, Andreas Wolter
Modern data estates span data located on premises, on the edge, and in one or more public clouds, across various sources such as multiple relational databases, file and storage systems, and NoSQL systems, both operational and analytic; this phenomenon is referred to as data sprawl. Data administrators who wish to enforce compliance across the entire organization have to inventory their data, identify what parts of it are sensitive, and govern the sensitive data appropriately across the entirety of their sprawling data estate. Today, governance of data is completely siloed; each of the data subsystems has its own (and varied) governance features. Policies on sensitive data are applied piecemeal by iterating over all the data sources in a custom language specific to each source. This makes data governance cumbersome, error-prone (because a given policy must be manually enforced across different subsystems, inconsistencies can easily arise), and expensive. This paper presents Microsoft Purview, a service for unified governance of the entire data estate of an organization from a single central pane of glass. The Purview service consists of three parts: (1) a Data Map or metadata catalog that is populated by automated scanning of data sources in the organization, (2) a system to store and manage sensitivity classification of data, and (3) a policy system that enables data security officers to author and implement policies that span the entire organization, e.g., a policy that says, "Non-full-time employees should be denied access to data classified as PII (Personally Identifiable Information)." Purview transforms data governance across a complex data estate by offering the ability to govern centrally and automating data discovery, classification and policy enforcement. While other commercial catalog systems also build a global catalog, Purview is unique in its support for policies. 
It is also distinguished by covering both structured and unstructured data, thanks to its deep integration with Office 365 and its governance framework; indeed, "Microsoft Purview" represents a new unified offering that combines Office 365 governance and what was formerly a service for governing structured data called "Azure Purview". By integrating with Office 365's Rights Management Service, Purview offers central governance over structured data stored in databases and stores, reports in systems such as Power BI, and document data stored in Office 365. The Purview vision is to make the metadata in the Data Map increasingly richer through further automation and curation support, and to use this 360-degree view of the data estate to support a wide range of governance policies, ranging from access control to lifecycle management (e.g., retention, deletion, restricting data movement). This paper covers the design and implementation challenges in building the Purview service for Attribute-Based Access Control (ABAC) policies, focusing specifically on a detailed description of its integration with Azure SQL Database. We demonstrate the power of unifying Office 365 governance with structured data governance through Purview policies that enforce consistent access control even across data flows between Office 365 and structured data engines such as Azure SQL Database. We also describe the results of our empirical evaluation of the performance overhead imposed by Purview.
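The example policy quoted in the abstract ("Non-full-time employees should be denied access to data classified as PII") can be illustrated with a minimal attribute-based access control check. This is only a sketch under stated assumptions: the `Principal`, `DataAsset`, and `DenyPolicy` types, the attribute strings, and the default-allow behavior are all hypothetical names and choices made for illustration, and they do not reflect Purview's actual policy language, data model, or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """Hypothetical user record; attributes are plain strings such as "full-time"."""
    name: str
    attributes: frozenset


@dataclass(frozen=True)
class DataAsset:
    """Hypothetical catalog entry; classifications are labels such as "PII"."""
    name: str
    classifications: frozenset


@dataclass(frozen=True)
class DenyPolicy:
    """Deny access to assets carrying `classification` unless the
    principal holds `required_attribute` (illustrative policy shape)."""
    required_attribute: str
    classification: str

    def denies(self, principal: Principal, asset: DataAsset) -> bool:
        return (
            self.classification in asset.classifications
            and self.required_attribute not in principal.attributes
        )


def is_allowed(principal: Principal, asset: DataAsset, policies) -> bool:
    # Default-allow is a simplification for this sketch; a real system
    # would typically combine deny rules with explicit grants.
    return not any(policy.denies(principal, asset) for policy in policies)


# One central policy, evaluated the same way for every asset.
pii_policy = DenyPolicy(required_attribute="full-time", classification="PII")

employee = Principal("alice", frozenset({"full-time"}))
contractor = Principal("bob", frozenset({"contractor"}))

customers = DataAsset("customers_table", frozenset({"PII"}))
telemetry = DataAsset("telemetry_logs", frozenset())
```

The point of the sketch is that the rule is written once against attributes and classifications, rather than once per data source, which is the contrast the abstract draws with siloed, per-system governance.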
Citations: 0
A Randomized Blocking Structure for Streaming Record Linkage
IF 2.5 Zone 3 (Computer Science) Q1 Computer Science Pub Date: 2023-07-01 DOI: 10.14778/3611479.3611487
Dimitrios Karapiperis, Christos Tjortjis, Vassilios S. Verykios
A huge amount of streaming data is collected nowadays from a variety of sources, such as sensors, mobile devices, or even raw log files. The unprecedented rate at which these data are generated and collected calls for novel record linkage methods to identify matching record pairs, i.e., pairs that refer to the same real-world entity. To this end, blocking methods are used to reduce the number of candidate record pairs while still maintaining high levels of accuracy. This paper introduces ExpBlock, a randomized record linkage structure, which guarantees that both the most frequently accessed and the most recently used blocks remain in main memory and, additionally, that the records within a block are renewed on a rolling basis. Specifically, the probability that inactive blocks and older records remain in main memory decays in order to make room for more promising blocks and fresher records, respectively. We implement these features using random choices instead of cumbersome sorting data structures, in order to favour simplicity of implementation and efficiency. We show, through the experimental evaluation, that ExpBlock scales efficiently to data streams by providing accurate results in a timely fashion.
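The idea of keeping active blocks and fresh records in memory through random choices, rather than sorted structures, can be sketched as follows. This is not the ExpBlock algorithm from the paper, whose details are not given in the abstract; it swaps in two generic techniques as stand-ins: sampled eviction (pick a few blocks at random and drop the least recently accessed of them) and random replacement inside a full block. The class name `RandomizedBlockStore` and all parameters are hypothetical.

```python
import random


class RandomizedBlockStore:
    """Bounded in-memory blocks for streaming record linkage (illustrative)."""

    def __init__(self, max_blocks, block_size, sample_size=2, seed=None):
        self.max_blocks = max_blocks
        self.block_size = block_size
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.blocks = {}    # blocking key -> list of records
        self.last_hit = {}  # blocking key -> logical time of last access
        self.clock = 0

    def _evict_one(self):
        # Sample a few blocks and evict the stalest one among them.
        # No heap or sorted index is maintained; list() is O(n) here
        # and is kept only for brevity.
        keys = self.rng.sample(
            list(self.blocks), min(self.sample_size, len(self.blocks))
        )
        victim = min(keys, key=self.last_hit.__getitem__)
        del self.blocks[victim]
        del self.last_hit[victim]

    def insert(self, key, record):
        self.clock += 1
        if key not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                self._evict_one()
            self.blocks[key] = []
        block = self.blocks[key]
        if len(block) < self.block_size:
            block.append(record)
        else:
            # Overwrite a uniformly random slot: each resident record
            # survives an insert with probability 1 - 1/block_size.
            block[self.rng.randrange(self.block_size)] = record
        self.last_hit[key] = self.clock

    def candidates(self, key):
        """Return the records sharing `key`, marking the block as active."""
        self.clock += 1
        if key in self.blocks:
            self.last_hit[key] = self.clock
            return list(self.blocks[key])
        return []
```

Under random replacement, a record in a full block of size b survives k further inserts with probability (1 - 1/b)^k, so older records fade out geometrically without any per-record timestamps or ordering, which is the general flavour of decay the abstract describes.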
Citations: 0