
Proceedings of the VLDB Endowment: Latest Articles

Automatic SQL Error Mitigation in Oracle
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611568
Krishna Kantikiran Pasupuleti, Jiakun Li, Hong Su, Mohamed Ziauddin
Despite best coding practices, software bugs are inevitable in a large codebase. In traditional databases, errors that occur during query processing disrupt user workflow until workarounds are found and applied. Identifying workarounds manually often relies on trial and error, a process that is not only time-consuming but also requires domain expertise that users often lack. In this paper, we propose a framework to automatically mitigate errors that occur during query compilation (including optimization and code generation) without any user intervention. The database intercepts an error internally, identifies a workaround for it, and recompiles the query using the workaround. The entire process remains transparent to the user, and the query executes seamlessly. The proposed technique provides three types of mitigation strategies: (i) fail over quickly to one of the readily available historical plans for the statement; (ii) apply targeted error-correcting directives (hints) identified from the optimizer context at the time of the error; (iii) modify the global configuration of the optimizer using hints. This feature has been implemented and will be released in an upcoming version of Oracle Autonomous Database.
Citations: 0
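The ordered fallback described in the Oracle abstract above (historical plan, then targeted hints, then global optimizer configuration) can be pictured as a retry chain around the compiler. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `compile_with_mitigation`, `toy_compiler`, the `Mitigation` class, and every hint name are hypothetical and do not correspond to real Oracle interfaces or hints.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class CompilationError(Exception):
    """Stand-in for an internal error raised during query compilation."""


@dataclass
class Mitigation:
    name: str
    hints: List[str]  # hypothetical directives handed to the compiler


def compile_with_mitigation(
    sql: str,
    compile_fn: Callable[[str, List[str]], str],
    mitigations: List[Mitigation],
) -> Tuple[str, Optional[str]]:
    """Compile normally, then retry with each mitigation in order.

    Returns (plan, name of the mitigation that worked, or None).
    """
    try:
        return compile_fn(sql, []), None
    except CompilationError:
        pass  # intercepted internally; the caller never sees this error
    for m in mitigations:
        try:
            return compile_fn(sql, m.hints), m.name
        except CompilationError:
            continue  # this workaround did not help; try the next one
    raise CompilationError("no workaround found for: " + sql)


def toy_compiler(sql: str, hints: List[str]) -> str:
    # Toy compiler: fails unless a made-up hint switches off the buggy step.
    if "NO_BUGGY_TRANSFORM" not in hints:
        raise CompilationError("internal error in transformation")
    return "PLAN(" + sql + ")"


# Ordered as in the abstract: historical plan, targeted hint, global config.
strategies = [
    Mitigation("historical_plan", ["USE_STORED_PLAN"]),
    Mitigation("targeted_hint", ["NO_BUGGY_TRANSFORM"]),
    Mitigation("global_config", ["OPTIMIZER_FEATURES_OFF"]),
]

plan, used = compile_with_mitigation("SELECT 1 FROM dual", toy_compiler, strategies)
print(plan, used)
```

In this toy run the plain compile and the first strategy both fail, so the second strategy supplies the plan. Ordering the strategies from cheapest and most specific to broadest is an assumption of the sketch, chosen to mirror the order in which the abstract lists them.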
Natural Language Interfaces for Databases with Deep Learning
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611575
George Katsogiannis-Meimarakis, Mike Xydas, Georgia Koutrika
In the age of the Digital Revolution, almost all human activities, from industrial and business operations to medical and academic research, are reliant on the constant integration and utilisation of ever-increasing volumes of data. However, the explosive volume and complexity of data make data querying and exploration challenging even for experts, and make the need to democratise access to data, even for non-technical users, all the more evident. It is time to lift all technical barriers by empowering users to access relational databases through conversation. We consider three main research areas on which a natural language data interface is based: Text-to-SQL, SQL-to-Text, and Data-to-Text. The purpose of this tutorial is a deep dive into these areas, covering state-of-the-art techniques and models, and explaining how progress in the deep learning field has led to impressive advancements. We will present benchmarks that sparked research and competition, and discuss open problems and research opportunities, with one of the most important challenges being the integration of these three research areas into one conversational system.
Citations: 0
Interactive Demonstration of EVA
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611626
Gaurav Tarlok Kakkar, Aryan Rajoria, Myna Prasanna Kalluraya, Ashmita Raju, Jiashen Cao, Kexin Rong, Joy Arulraj
In this demonstration, we will present EVA, an end-to-end AI-Relational database management system. We will demonstrate the capabilities and utility of EVA using three usage scenarios: (1) EVA serves as a backend for an exploratory video analytics interface developed using Streamlit and React, (2) EVA seamlessly integrates with the Python and Data Science ecosystems by allowing users to access EVA in a Python notebook alongside other popular libraries such as Pandas and Matplotlib, and (3) EVA facilitates bulk labeling with Label Studio, a widely-used labeling framework. By optimizing complex vision queries, we illustrate how EVA allows a wide range of application developers to harness the recent advances in computer vision.
Citations: 1
The Story of AWS Glue
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611547
Mohit Saxena, Benjamin Sowell, Daiyan Alamgir, Nitin Bahadur, Bijay Bisht, Santosh Chandrachood, Chitti Keswani, G. Krishnamoorthy, Austin Lee, Bohou Li, Zach Mitchell, Vaibhav Porwal, Maheedhar Reddy Chappidi, Brian Ross, Noritaka Sekiyama, Omer Zaki, Linchi Zhang, Mehul A. Shah
AWS Glue is Amazon's serverless data integration cloud service that makes it simple and cost-effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases and data warehouses and to build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience, including ETL specialists and data scientists, and includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month. In this paper, we describe the use cases and challenges cloud customers face in preparing data for analytics and the tenets we chose to drive Glue's design. We chose early on to focus on ease of use, scale, and extensibility. At its core, Glue offers serverless Apache Spark and Python engines backed by a purpose-built resource manager for fast startup and auto-scaling. In Spark, it offers a new data structure, DynamicFrames, for manipulating messy schema-free semi-structured data such as event logs; a variety of transformations and tooling to simplify data preparation; and a new shuffle plugin to offload to cloud storage. It also includes a Hive Metastore-compatible Data Catalog with Glue crawlers to build and manage metadata, e.g., for data lakes on Amazon S3. Finally, Glue Studio is its visual interface for authoring Spark and Python-based ETL jobs. We describe the innovations that differentiate AWS Glue and drive its popularity, and how it has evolved over the years.
Citations: 0
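To make the "messy schema-free semi-structured data" problem from the AWS Glue abstract above concrete, here is a small plain-Python sketch of the underlying idea: records describe themselves, so a field can be missing or arrive with different types, and the schema is discovered from the data and then reconciled. This is not the Glue or DynamicFrame API; `infer_field_types`, `cast_field`, and the sample `events` are hypothetical names invented for illustration.

```python
from collections import defaultdict


def infer_field_types(records):
    """Collect the set of Python type names observed for each field."""
    seen = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items()}


def cast_field(records, field, caster):
    """Return copies of the records with `field` cast wherever it is present."""
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if field in rec:
            rec[field] = caster(rec[field])
        out.append(rec)
    return out


# Event-log style input: latency arrives as int, as str, or not at all.
events = [
    {"user": "a", "latency_ms": 12},
    {"user": "b", "latency_ms": "15"},
    {"user": "c"},
]

types_before = infer_field_types(events)
cleaned = cast_field(events, "latency_ms", int)
types_after = infer_field_types(cleaned)
print(types_before)
print(types_after)
```

The sketch keeps the ambiguity visible (a field mapped to several types) until the caller resolves it explicitly, rather than forcing a single schema up front.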
Will LLMs Reshape, Supercharge, or Kill Data Science? (VLDB 2023 Panel)
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611634
Alon Halevy, Yejin Choi, Avrilia Floratou, Michael J. Franklin, Natasha Noy, Haixun Wang
Large language models (LLMs) have recently taken the world by storm, promising potentially game-changing opportunities in multiple fields. Naturally, there is significant promise in applying LLMs to the management of structured data, or more generally, to the processes involved in data science. At the very least, LLMs have the potential to provide substantial advancements in long-standing challenges that our community has been tackling for decades. On the other hand, they may introduce completely new capabilities that we have only dreamed of thus far. This panel will bring together a few leading experts who have been thinking about these opportunities from various perspectives and fielding them in research prototypes and even in commercial applications.
Citations: 1
CDSBen: Benchmarking the Performance of Storage Services in Cloud-Native Database System at ByteDance
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611549
Jiashu Zhang, Wen Jiang, Bo Tang, Haoxiang Ma, Lixun Cao, Zhongbin Jiang, Yuanyuan Nie, Fan Wang, Lei Zhang, Yuming Liang
In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate the computation and storage that traditional monolithic OLTP databases combine. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance's cloud-native database veDB. We then elaborate on the limitations of using standard benchmarks such as TPC-C and YCSB to resemble these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDSBen. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.
Citations: 0
Sniffer: A Novel Model Type Detection System against Machine-Learning-as-a-Service Platforms
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611591
Zhuo Ma, Yilong Yang, Bin Xiao, Yang Liu, Xinjing Liu, Zhuoran Ma, Tong Yang
Recent works explore several attacks against Machine-Learning-as-a-Service (MLaaS) platforms (e.g., the model stealing attack), allegedly posing potential real-world threats beyond viability in laboratories. However, hampered by their model-type sensitivity, most of the attacks can hardly break mainstream real-world MLaaS platforms. That is, many MLaaS attacks are designed against only one certain type of model, such as tree models or neural networks. As the black-box MLaaS interface hides model type information, the attacker cannot choose a proper attack method with confidence, which limits attack performance. In this paper, we demonstrate a system, named Sniffer, that is capable of making model-type-sensitive attacks "great again" in real-world applications. Specifically, Sniffer consists of four components: Generator, Querier, Probe, and Arsenal. The first two components prepare attack samples. Probe, the most characteristic component of Sniffer, implements a series of self-designed algorithms to determine the type of model hidden behind the black-box MLaaS interface. With the model type information revealed, an optimal method can be selected from Arsenal (which contains multiple attack methods) to carry out the attack. Our demonstration shows how the audience can interact with Sniffer in a web-based interface against five mainstream MLaaS platforms.
Citations: 0
DHive: Query Execution Performance Analysis via Dataflow in Apache Hive
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611605
Chaozu Zhang, Qiaomu Shen, Bo Tang
Apache Hive is now widely used for large-scale data analysis applications in many organizations. Various visual analytical tools have been developed to help Hive users quickly analyze the query execution process and identify the performance bottlenecks of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) and fail to provide enough evidence to analyze the root causes of slow execution progress. To tackle this problem, we develop a visual analytical system, DHive, to visualize and analyze query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels (query level, job level, and task level), which enables users to identify the key jobs and tasks and to explain their time usage by linking them to auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive with two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.
Citations: 0
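The multi-level view in the DHive abstract above (task, job, and query level) amounts to rolling fine-grained dataflow measurements up a hierarchy and then drilling back down to the outlier. The sketch below illustrates only that general idea and is not taken from DHive: the record layout, the `rollup` and `slowest_task` helpers, and the timing assumptions (tasks within a job run in parallel, jobs run one after another) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical task-level records: (job_id, task_id, bytes_read, seconds)
tasks = [
    ("job1", "t1", 100, 2.0),
    ("job1", "t2", 900, 9.0),
    ("job2", "t3", 50, 1.0),
]


def rollup(task_records):
    """Aggregate task-level dataflow into job-level and query-level summaries."""
    jobs = defaultdict(lambda: {"bytes": 0, "seconds": 0.0, "tasks": 0})
    for job_id, _task_id, nbytes, secs in task_records:
        job = jobs[job_id]
        job["bytes"] += nbytes
        # Assumption: tasks of a job run in parallel, so the slowest one
        # determines the job's duration.
        job["seconds"] = max(job["seconds"], secs)
        job["tasks"] += 1
    query = {
        "bytes": sum(j["bytes"] for j in jobs.values()),
        # Assumption: jobs run sequentially, so their durations add up.
        "seconds": sum(j["seconds"] for j in jobs.values()),
    }
    return dict(jobs), query


def slowest_task(task_records, job_id):
    """Drill down: return the id of the task that dominates a job's time."""
    in_job = [t for t in task_records if t[0] == job_id]
    return max(in_job, key=lambda t: t[3])[1]


jobs, query = rollup(tasks)
bottleneck_job = max(jobs, key=lambda j: jobs[j]["seconds"])
print(query, bottleneck_job, slowest_task(tasks, bottleneck_job))
```

Here the skewed task `t2` reads nine times more data than its sibling, which is the kind of evidence a dataflow view can surface and a time-only view cannot.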
Anser: Adaptive Information Sharing Framework of AnalyticDB
Zone 3, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611553
Liang Lin, Yuhan Li, Bin Wu, Huijun Mai, Renjie Lou, Jian Tan, Feifei Li
The surge in data analytics has fostered burgeoning demand for AnalyticDB on Alibaba Cloud, which has served thousands of customers from various business sectors. Its most notable feature is the diversity of the workloads it handles, including batch processing, real-time data analytics, and unstructured data analytics. To improve the overall performance for such diverse workloads, one of the major challenges is to optimize long-running complex queries without sacrificing the processing efficiency of short-running interactive queries. While existing methods attempt to utilize runtime dynamic statistics for adaptive query processing, they often focus on specific scenarios instead of providing a holistic solution. To address this challenge, we propose a new framework called Anser, which enhances the design of traditional distributed data warehouses by embedding a new information sharing mechanism. This allows for the efficient management of the production and consumption of various dynamic information across the system. Building on top of Anser, we introduce a novel scheduling policy that optimizes both data and information exchanges within the physical plan, enabling the acceleration of complex analytical queries without sacrificing the performance of short-running interactive queries. We conduct comprehensive experiments over public and in-house workloads to demonstrate the effectiveness and efficiency of our proposed information sharing framework.
Citations: 0
ScalarDB: Universal Transaction Manager for Polystores
Zone 3, Computer Science; Q2, Computer Science, Information Systems; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611563
Hiroyuki Yamada, Toshihiro Suzuki, Yuji Ito, Jun Nemoto
This paper presents ScalarDB, a universal transaction manager that achieves distributed transactions across multiple disparate databases. ScalarDB provides a database-agnostic transaction manager on top of its database abstraction; thus, it achieves transactions spanning various databases without depending on the transactional capability of underlying databases. ScalarDB is based on several research works and extended to provide a strong correctness guarantee (i.e., strict serializability), further performance optimizations, and several critical mechanisms for productization. In this paper, we describe the design and implementation of ScalarDB. We also present evaluation results showing that ScalarDB achieves database-spanning transactions with reasonable performance and near-linear scalability without sacrificing correctness. Finally, we share some case studies and lessons learned while building and running ScalarDB.
Citations: 1