
Proceedings of the VLDB Endowment: Latest Articles

Sniffer: A Novel Model Type Detection System against Machine-Learning-as-a-Service Platforms
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611591
Zhuo Ma, Yilong Yang, Bin Xiao, Yang Liu, Xinjing Liu, Zhuoran Ma, Tong Yang
Recent works explore several attacks against Machine-Learning-as-a-Service (MLaaS) platforms (e.g., the model stealing attack), allegedly posing potential real-world threats beyond viability in laboratories. However, hampered by their model-type sensitivity, most of these attacks can hardly break mainstream real-world MLaaS platforms. That is, many MLaaS attacks are designed against only one certain type of model, such as tree models or neural networks. As the black-box MLaaS interface hides model type info, the attacker cannot choose a proper attack method with confidence, limiting the attack performance. In this paper, we demonstrate a system, named Sniffer, that is capable of making model-type-sensitive attacks "great again" in real-world applications. Specifically, Sniffer consists of four components: Generator, Querier, Probe, and Arsenal. The first two components prepare attack samples. Probe, the most characteristic component of Sniffer, implements a series of self-designed algorithms to determine the type of model hidden behind a black-box MLaaS interface. With the model type revealed, the most suitable method can be selected from Arsenal (which contains multiple attack methods) to carry out the attack. Our demonstration shows how the audience can interact with Sniffer through a web-based interface against five mainstream MLaaS platforms.
Citations: 0
Anser: Adaptive Information Sharing Framework of AnalyticDB
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611553
Liang Lin, Yuhan Li, Bin Wu, Huijun Mai, Renjie Lou, Jian Tan, Feifei Li
The surge in data analytics has fostered burgeoning demand for AnalyticDB on Alibaba Cloud, which serves thousands of customers from various business sectors. Its most notable feature is the diversity of the workloads it handles, including batch processing, real-time data analytics, and unstructured data analytics. To improve the overall performance for such diverse workloads, one of the major challenges is to optimize long-running complex queries without sacrificing the processing efficiency of short-running interactive queries. While existing methods attempt to utilize runtime dynamic statistics for adaptive query processing, they often focus on specific scenarios instead of providing a holistic solution. To address this challenge, we propose a new framework called Anser, which enhances the design of traditional distributed data warehouses by embedding a new information sharing mechanism. This allows for the efficient management of the production and consumption of various dynamic information across the system. Building on top of Anser, we introduce a novel scheduling policy that optimizes both data and information exchanges within the physical plan, enabling the acceleration of complex analytical queries without sacrificing the performance of short-running interactive queries. We conduct comprehensive experiments over public and in-house workloads to demonstrate the effectiveness and efficiency of our proposed information sharing framework.
Citations: 0
AQUA: Automatic Collaborative Query Processing in Analytical Database
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611607
Yuchen Peng, Ke Chen, Lidan Shou, Dawei Jiang, Gang Chen
Data analysts nowadays are keen to have analytical capabilities involving deep learning (DL). Collaborative queries, which employ relational operations to process structured data and DL models to process unstructured data, provide a powerful facility for DL-based in-database analysis. The classical approach to support collaborative queries in relational databases is to integrate DL models with user-defined functions (UDFs) in a general-purpose language (e.g., C++) to process unstructured data. This approach suffers from suboptimal performance as the opaque UDFs preclude the generation of an optimal query plan. A recent work, DL2SQL, addresses the problem of collaborative query optimization by first converting DL computations into SQL subqueries and then using a classical relational query optimizer to optimize the entire collaborative query. However, the DL2SQL approach compromises usability by requiring data analysts to manually manage DL-related data and tune query performance. To this end, this paper introduces AQUA, an analytical database designed for efficient collaborative query processing. Built on DL2SQL, AQUA automates translations from collaborative queries into SQL queries. To enhance usability, AQUA introduces two techniques: 1) a declarative scheme for DL-related data management, and 2) DL-specific optimizations for collaborative query processing, eliminating the burden of manual data management and performance tuning from the data analysts. We demonstrate the key contributions of AQUA via a web APP that allows the audience to perform collaborative queries on the CIFAR-10 dataset.
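The classical UDF-based approach described in this abstract can be sketched in a few lines. This is only an illustration, not AQUA or DL2SQL: it uses Python with SQLite instead of C++, and `classify_image`, the `images` table, and its columns are hypothetical names. The "model" is a trivial brightness threshold standing in for a DL classifier.

```python
import sqlite3


def classify_image(pixels: bytes) -> str:
    """Hypothetical stand-in for a DL model: label an image by mean brightness."""
    mean = sum(pixels) / len(pixels)
    return "bright" if mean >= 128 else "dark"


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL UDF.
conn.create_function("classify_image", 1, classify_image)

conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, owner TEXT, pixels BLOB)")
conn.executemany(
    "INSERT INTO images (id, owner, pixels) VALUES (?, ?, ?)",
    [
        (1, "alice", bytes([250, 240, 230])),
        (2, "bob", bytes([10, 20, 30])),
        (3, "alice", bytes([5, 15, 25])),
    ],
)

# A collaborative query: a relational predicate on structured data (owner)
# combined with a model prediction on unstructured data (pixels).
rows = conn.execute(
    "SELECT id FROM images "
    "WHERE owner = 'alice' AND classify_image(pixels) = 'dark' "
    "ORDER BY id"
).fetchall()
print(rows)
```

From the database's point of view the UDF is a black box: the engine cannot see its cost or selectivity, which is the opacity the abstract identifies as the obstacle to an optimal query plan.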
Citations: 0
Taurus MM: Bringing Multi-Master to the Cloud
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611542
Alex Depoutovitch, Chong Chen, Per-Ake Larson, Jack Ng, Shu Lin, Guanzhu Xiong, Paul Lee, Emad Boctor, Samiao Ren, Lengdong Wu, Yuchen Zhang, Calvin Sun
A single-master database has limited update capacity because a single node handles all updates. A multi-master database potentially has higher update capacity because the load is spread across multiple nodes. However, the need to coordinate updates and ensure durability can generate high network traffic. Reducing network load is particularly important in a cloud environment where the network infrastructure is shared among thousands of tenants. In this paper, we present Taurus MM, a shared-storage multi-master database optimized for cloud environments. It implements two novel algorithms aimed at reducing network traffic plus a number of additional optimizations. The first algorithm is a new type of distributed clock that combines the small size of Lamport clocks with the effective support of distributed snapshots of vector clocks. The second algorithm is a new hybrid page and row locking protocol that significantly reduces the number of lock requests sent over the network. Experimental results on a cluster with up to eight masters demonstrate superior performance compared to Aurora multi-master and CockroachDB.
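For readers unfamiliar with the two clock types the abstract contrasts, the sketch below shows the textbook versions only. It is not the Taurus MM clock, whose design the abstract does not detail; the class and function names are illustrative. A Lamport clock is a single integer, while a vector clock carries one counter per node and can therefore tell whether two events are causally ordered or concurrent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LamportClock:
    """Textbook Lamport clock: a single integer per node."""
    time: int = 0

    def tick(self) -> int:
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        self.time = max(self.time, remote_time) + 1
        return self.time


@dataclass
class VectorClock:
    """Textbook vector clock: one counter per node in the cluster."""
    node: int
    times: list[int]

    def tick(self) -> list[int]:
        self.times[self.node] += 1
        return list(self.times)

    def receive(self, remote: list[int]) -> list[int]:
        self.times = [max(a, b) for a, b in zip(self.times, remote)]
        self.times[self.node] += 1
        return list(self.times)


def happened_before(a: list[int], b: list[int]) -> bool:
    """True if the event stamped a causally precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


# Two masters each perform one independent local update.
l0, l1 = LamportClock(), LamportClock()
lamport_a, lamport_b = l0.tick(), l1.tick()

v0, v1 = VectorClock(0, [0, 0]), VectorClock(1, [0, 0])
vec_a, vec_b = v0.tick(), v1.tick()

# The vector timestamps show that neither update precedes the other.
concurrent = not happened_before(vec_a, vec_b) and not happened_before(vec_b, vec_a)

# After master 1 receives master 0's timestamp, causality is established.
vec_c = v1.receive(vec_a)
```

Here the two Lamport timestamps are equal integers and say nothing about concurrency, whereas the vector timestamps identify the updates as concurrent at the cost of one counter per node. That size-versus-information trade-off is what the abstract refers to.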
Citations: 0
DHive: Query Execution Performance Analysis via Dataflow in Apache Hive
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611605
Chaozu Zhang, Qiaomu Shen, Bo Tang
Apache Hive is now widely used for large-scale data analysis applications in many organizations. Various visual analytical tools have been developed to help Hive users quickly analyze the query execution process and identify the performance bottlenecks of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) and fail to provide enough evidence to analyze the root causes of slow execution progress. To tackle this problem, we develop a visual analytical system, DHive, to visualize and analyze query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels (query level, job level, and task level), which enables users to identify the key jobs/tasks and explain their time usage by linking them to auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive through two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.
Citations: 0
Will LLMs Reshape, Supercharge, or Kill Data Science? (VLDB 2023 Panel)
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611634
Alon Halevy, Yejin Choi, Avrilia Floratou, Michael J. Franklin, Natasha Noy, Haixun Wang
Large language models (LLMs) have recently taken the world by storm, promising potentially game-changing opportunities in multiple fields. Naturally, there is significant promise in applying LLMs to the management of structured data, or more generally, to the processes involved in data science. At the very least, LLMs have the potential to provide substantial advancements in long-standing challenges that our community has been tackling for decades. On the other hand, they may introduce completely new capabilities that we have only dreamed of thus far. This panel will bring together a few leading experts who have been thinking about these opportunities from various perspectives and fielding them in research prototypes and even in commercial applications.
Citations: 1
To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611610
Nico Schäfer, Damjan Gjurovski, Angjela Davitkova, Sebastian Michel
While existing data management solutions try to keep up with novel data formats and features, a myriad of valuable functionality is often only accessible via programming language libraries. Particularly for machine learning tasks, there is a wealth of pre-trained models and easy-to-use libraries that allow a wide audience to harness state-of-the-art machine learning. We propose the demonstration of a highly modularized data processor for semi-structured data that can be extended by means of plain Python scripts. In addition to commonly supported user-defined functions, the deep decomposition allows augmenting the core engine with additional index structures, customized import and export routines, and custom aggregation functions. For several use cases, we detail how user-defined modules can be quickly realized. We invite the audience to write and apply custom code, and to tailor the code snippets we provide to their own preferences, in order to solve data analytics tasks involving sentiment analysis of Twitter tweets.
Citations: 0
Fanglue: An Interactive System for Decision Rule Crafting
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611621
Chen Qian, Shiwei Liang, Zhaoyang Wang, Yin Lou
In many applications the training data do not always contain sufficient information to produce high-quality decision rules for standard (end-to-end) rule mining algorithms, and human experts have to incorporate domain knowledge during rule induction in order to get meaningful results. In this work we present Fanglue, a home-grown system inside Alipay, for interactive decision rule crafting. Fanglue is a distributed in-memory system and is highly responsive when processing large-scale datasets. In addition, Fanglue extends the standard representation of a decision rule by introducing disjunctive clauses. Having disjunctive clauses can improve the coverage and robustness of a decision rule, especially for fraud prevention in Fintech applications.
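The effect of disjunctive clauses on rule coverage can be illustrated with a small sketch. It assumes one plausible reading of the abstract: a standard rule is a conjunction of single conditions, and the extension lets each clause be a disjunction of conditions. The record fields, thresholds, and function names are hypothetical and are not taken from Fanglue.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Predicate = Callable[[Record], bool]
Clause = List[Predicate]  # a clause holds if ANY of its predicates holds


def rule_fires(rule: List[Clause], record: Record) -> bool:
    """A rule fires when EVERY clause has at least one satisfied predicate."""
    return all(any(pred(record) for pred in clause) for clause in rule)


def coverage(rule: List[Clause], records: List[Record]) -> int:
    """Number of records on which the rule fires."""
    return sum(rule_fires(rule, r) for r in records)


# Hypothetical predicates over hypothetical transaction fields.
def large_amount(r: Record) -> bool:
    return r["amount"] > 1000


def new_account(r: Record) -> bool:
    return r["account_age_days"] < 30


def many_failed_logins(r: Record) -> bool:
    return r["failed_logins"] >= 3


# Standard rule: every clause is a single condition.
conjunctive_rule = [[large_amount], [new_account]]
# Extended rule: the second clause is a disjunction of two conditions.
disjunctive_rule = [[large_amount], [new_account, many_failed_logins]]

records = [
    {"amount": 5000, "account_age_days": 10, "failed_logins": 0},
    {"amount": 5000, "account_age_days": 400, "failed_logins": 5},
    {"amount": 50, "account_age_days": 5, "failed_logins": 9},
]

conjunctive_hits = coverage(conjunctive_rule, records)
disjunctive_hits = coverage(disjunctive_rule, records)
```

On this toy data the conjunctive rule fires on one record and the extended rule on two: the second record is a large transfer from an old account with repeated failed logins, which only the disjunctive clause catches.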
Citations: 0
MagicScaler: Uncertainty-Aware, Predictive Autoscaling
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611566
Zhicheng Pan, Yihang Wang, Yingying Zhang, Sean Bin Yang, Yunyao Cheng, Peng Chen, Chenjuan Guo, Qingsong Wen, Xiduo Tian, Yunliang Dou, Zhiqiang Zhou, Chengcheng Yang, Aoying Zhou, Bin Yang
Predictive autoscaling is a key enabler for optimizing cloud resource allocation in Alibaba Cloud's computing platforms, which dynamically adjust the Elastic Compute Service (ECS) instances based on predicted user demands to ensure Quality of Service (QoS). However, user demands in the cloud are often highly complex, with high uncertainty and scale-sensitive temporal dependencies, thus posing great challenges for accurate prediction of future demands. These in turn make autoscaling challenging: autoscaling needs to properly account for demand uncertainty while maintaining a reasonable trade-off between two contradictory factors, i.e., low instance running costs vs. low QoS violation risks. To address the above challenges, we propose a novel predictive autoscaling framework, MagicScaler, consisting of a Multi-scale attentive Gaussian process based predictor and an uncertainty-aware scaler. First, the predictor carefully bridges the best of two successful prediction methodologies: multi-scale attention mechanisms, which are good at capturing complex, multi-scale features, and stochastic process regression, which can quantify prediction uncertainty. It thus achieves accurate demand prediction with quantified uncertainty. Second, the scaler takes the quantified future demand uncertainty into a judiciously designed loss function with stochastic constraints, enabling flexible trade-off between running costs and QoS violation risks. Extensive experiments on three clusters of Alibaba Cloud in different Chinese cities demonstrate the effectiveness and efficiency of MagicScaler, which outperforms other commonly adopted scalers, thus justifying our design choices.
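The cost-versus-risk trade-off can be made concrete with a deliberately simple sketch. It is not MagicScaler's loss function or predictor. It assumes the forecast for the next interval is a Gaussian with a known mean and standard deviation, and provisions enough instances to cover the demand quantile that matches a tolerated violation probability. The function name, parameters, and numbers are all hypothetical.

```python
import math
from statistics import NormalDist


def instances_needed(
    mean_demand: float,
    std_demand: float,
    capacity_per_instance: float,
    max_violation_prob: float,
) -> int:
    """Instances required so that P(demand > capacity) <= max_violation_prob,
    assuming demand ~ Normal(mean_demand, std_demand) with std_demand > 0."""
    target = NormalDist(mean_demand, std_demand).inv_cdf(1.0 - max_violation_prob)
    return max(1, math.ceil(target / capacity_per_instance))


# Same predicted mean, different predictive uncertainty.
confident = instances_needed(1000.0, 50.0, 100.0, 0.05)
uncertain = instances_needed(1000.0, 200.0, 100.0, 0.05)

# Same forecast, stricter QoS target.
strict = instances_needed(1000.0, 50.0, 100.0, 0.001)
```

With these numbers the confident forecast needs 11 instances and the uncertain one 14, although both predict the same mean demand. Tightening the tolerated violation probability also raises the instance count, which is the running-cost side of the trade-off the abstract describes.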
Citations: 1
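The MagicScaler abstract above describes a scaler that trades running cost against QoS violation risk using quantified demand uncertainty. The following is only a minimal sketch of that general idea, not the paper's loss function or algorithm: it assumes the predictor yields a Gaussian demand forecast (a mean and a standard deviation, measured in instances) and picks the smallest instance count whose probability of being exceeded stays under a chosen bound. The function name `instances_needed` and its parameters are hypothetical and do not come from the paper.

```python
import math
from statistics import NormalDist


def instances_needed(mean: float, std: float, max_violation_prob: float) -> int:
    """Smallest instance count n such that P(demand > n) <= max_violation_prob.

    Assumes (for illustration only) that forecast demand is Gaussian with the
    given mean and standard deviation, both expressed in instances.
    """
    if not 0.0 < max_violation_prob < 1.0:
        raise ValueError("max_violation_prob must be strictly between 0 and 1")
    if std < 0.0:
        raise ValueError("std must be non-negative")
    if std == 0.0:
        # No uncertainty: provision exactly the forecast demand.
        quantile = mean
    else:
        # Demand level that is exceeded with probability max_violation_prob.
        quantile = NormalDist(mean, std).inv_cdf(1.0 - max_violation_prob)
    # Instance counts are whole and cannot be negative.
    return max(0, math.ceil(quantile))


if __name__ == "__main__":
    # A tighter risk bound or a wider forecast both raise the provisioned count.
    for prob in (0.5, 0.05, 0.01):
        print(prob, instances_needed(100.0, 10.0, prob))
```

Lowering `max_violation_prob` buys a lower QoS violation risk at the price of more running instances, which is the trade-off the abstract refers to; a production scaler would additionally have to handle non-Gaussian forecasts and scaling delays.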
Common Sense: The Dark Matter of Language and Intelligence (VLDB 2023 Keynote)
Zone 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-08-01 · DOI: 10.14778/3611540.3611638
Yejin Choi
Scale appears to be the winning recipe in today's leaderboards. And yet, extreme-scale neural models are (un)surprisingly brittle and make errors that are often nonsensical and even counterintuitive. In this talk, I will argue for the importance of knowledge, especially commonsense knowledge, as well as inference-time reasoning algorithms, and demonstrate how smaller models developed in academia can still have an edge over larger industry-scale models, if powered with knowledge and/or reasoning algorithms.
Citations: 0