
Latest Publications in Proceedings of the VLDB Endowment

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611569
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
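As a point of reference for how this abstraction surfaces to users, the sketch below wraps a toy model with PyTorch's public FSDP API (torch.distributed.fsdp). The model, the FULL_SHARD strategy, and the training loop are illustrative assumptions rather than the configuration evaluated in the paper, and the script assumes a multi-GPU node launched with torchrun.
```python
# Minimal FSDP sketch (launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py).
# The toy model, sharding strategy, and training loop are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def main():
    dist.init_process_group("nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                 # stand-in for a large model
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks and
    # all-gathers full parameters only around each unit's forward/backward pass.
    fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    for _ in range(10):                          # dummy training steps
        batch = torch.randn(8, 4096, device="cuda")
        loss = fsdp_model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
Outside the per-layer all-gathers, each rank holds only its shard of parameters, gradients, and optimizer state, which is what allows models far larger than a single GPU's memory to train with a DDP-like user experience.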
Citations: 43
Demonstrating ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Joins via Reinforcement Learning
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611629
Junxiong Wang, Mitchell Gray, Immanuel Trummer, Ahmet Kara, Dan Olteanu
Performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. It is challenging to identify suitable orders prior to query execution due to the huge search space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We demonstrate ADOPT, a novel query engine that integrates adaptive query processing with a worst-case optimal join algorithm. ADOPT divides query execution into episodes, during which different attribute orders are invoked. With runtime feedback on performance of different attribute orders, ADOPT rapidly approaches near-optimal orders. Moreover, ADOPT uses a unique data structure which keeps track of the processed input data to prevent redundant work across different episodes. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments, ADOPT outperforms baselines, including commercial and open-source systems utilizing worst-case optimal join algorithms, particularly for complex queries that are difficult to optimize.
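To make the episode mechanism concrete, here is a minimal sketch that treats attribute-order selection as a multi-armed bandit solved with a UCB policy. It only illustrates the adaptive idea from the abstract and is not ADOPT's algorithm: the `run_episode` function is a hypothetical stand-in for executing one bounded episode of the worst-case optimal join and returning its cost, and a real system cannot enumerate all permutations as this toy does.
```python
# Hedged illustration: UCB-style selection of join attribute orders across episodes.
# `run_episode` is a hypothetical placeholder for running one bounded episode of a
# worst-case optimal join under the given attribute order and returning its cost.
import math
import random
from itertools import permutations

def run_episode(order):
    # Placeholder: a synthetic, noisy per-episode cost that depends on the order.
    return random.random() + 0.1 * sum(i * a for i, a in enumerate(order))

def choose_order_ucb(attributes, episodes=200, c=1.4):
    arms = list(permutations(range(len(attributes))))   # toy: all candidate orders
    counts = {a: 0 for a in arms}
    mean_reward = {a: 0.0 for a in arms}

    for t in range(1, episodes + 1):
        def ucb(a):
            if counts[a] == 0:
                return float("inf")                      # try every order once
            return mean_reward[a] + c * math.sqrt(math.log(t) / counts[a])
        arm = max(arms, key=ucb)
        reward = -run_episode(arm)                       # lower cost => higher reward
        counts[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / counts[arm]

    best = max(arms, key=lambda a: mean_reward[a])
    return [attributes[i] for i in best]

print(choose_order_ucb(["R.a", "S.b", "T.c"]))
```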
Citations: 0
Demonstrating GPT-DB: Generating Query-Specific and Customizable Code for SQL Processing with GPT-4
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611630
Immanuel Trummer
GPT-DB generates code for SQL processing in general-purpose programming languages such as Python. Generated code can be freely customized using user-provided natural language instructions. This enables users, for instance, to try out specific libraries for SQL processing or to generate non-standard output while processing. GPT-DB is based on OpenAI's GPT model series, neural networks capable of translating natural language instructions into code. By default, GPT-DB exploits the most recently released GPT-4 model whereas visitors may also select prior versions for comparison. GPT-DB automatically generates query-specific prompts, instructing GPT on code generation. These prompts include a description of the target database, as well as logical query plans described as natural language text, and instructions for customization. GPT-DB automatically verifies, and possibly re-generates, code using a reference database system for result comparisons. It enables users to select code samples for training, thereby increasing accuracy for future queries. The proposed demonstration showcases code generation for various queries and with varying instructions for code customization.
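The following sketch shows the kind of query-specific prompt assembly the abstract describes, using the OpenAI Python client's chat-completions interface as it existed in 2023. The prompt wording, the schema and plan descriptions, and the helper names are illustrative assumptions, not GPT-DB's actual prompts or code.
```python
# Illustrative prompt assembly for query-specific code generation; the prompt text
# and helper names are assumptions, not GPT-DB's prompts. Uses the OpenAI Python
# client's 2023-era ChatCompletion API and expects OPENAI_API_KEY in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def build_prompt(schema_description, logical_plan, customization):
    return (
        "Generate Python code that executes the following SQL query plan.\n"
        f"Database schema:\n{schema_description}\n"
        f"Logical plan (natural language):\n{logical_plan}\n"
        f"Customization instructions:\n{customization}\n"
        "Return only runnable Python code."
    )

def generate_code(schema_description, logical_plan, customization, model="gpt-4"):
    prompt = build_prompt(schema_description, logical_plan, customization)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(generate_code(
    schema_description="orders(o_id INT, price FLOAT, year INT)",
    logical_plan="Filter orders to year 2022, then sum price.",
    customization="Use the pandas library and print the result.",
))
```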
Citations: 1
EQUI-VOCAL Demonstration: Synthesizing Video Queries from User Interactions
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611600
Enhao Zhang, Maureen Daum, Dong He, Manasi Ganti, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska
We demonstrate EQUI-VOCAL, a system that synthesizes compositional queries over videos from user feedback. EQUI-VOCAL enables users to query a video database for complex events by providing a few positive and negative examples of what they are looking for and labeling a small number of additional system-selected examples. Using those user inputs, EQUI-VOCAL synthesizes declarative queries that can then retrieve additional instances of the desired events. The demonstration makes two contributions: it introduces EQUI-VOCAL's graphical user interface and enables conference attendees to experiment with EQUI-VOCAL on a variety of queries. Both enable users to gain a better understanding of EQUI-VOCAL's query synthesis approach and to explore the impact of hyperparameters and label noise on system performance.
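As a toy illustration of example-driven synthesis (not EQUI-VOCAL's actual search, which synthesizes compositional queries over video), the sketch below scores hand-written candidate predicates against user-labeled positive and negative segments and returns the best-scoring one; all segment attributes and candidates are made up for the example.
```python
# Hedged illustration of example-driven query synthesis: score candidate predicates
# against user-labeled positive/negative video segments and keep the best one.
# A toy stand-in for the synthesis loop described above, not EQUI-VOCAL's algorithm.

def f1(predicate, positives, negatives):
    tp = sum(1 for seg in positives if predicate(seg))
    fp = sum(1 for seg in negatives if predicate(seg))
    fn = len(positives) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def synthesize(candidates, positives, negatives):
    # candidates: {query string: predicate over a segment dict}
    scored = sorted(candidates.items(),
                    key=lambda kv: f1(kv[1], positives, negatives),
                    reverse=True)
    return scored[0]  # best-scoring candidate query

positives = [{"car": True, "near_person": True}, {"car": True, "near_person": True}]
negatives = [{"car": True, "near_person": False}, {"car": False, "near_person": False}]
candidates = {
    "Car(o)": lambda s: s["car"],
    "Car(o) AND Near(o, p)": lambda s: s["car"] and s["near_person"],
}
print(synthesize(candidates, positives, negatives)[0])
```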
Citations: 0
A Demonstration of DLBD: Database Logic Bug Detection System
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611584
Xiu Tang, Sai Wu, Dongxiang Zhang, Ziyue Wang, Gongsheng Yuan, Gang Chen
Database management systems (DBMSs) are prone to logic bugs that can result in incorrect query results. Current debugging tools are limited to single table queries and struggle with issues like lack of ground-truth results and repetitive query space exploration. In this paper, we demonstrate DLBD, a system that automatically detects logic bugs in databases. DLBD offers holistic logic bug detection by providing automatic schema and query generation and ground-truth query result retrieval. Additionally, DLBD provides minimal test cases and root cause analysis for each bug to aid developers in reproducing and fixing detected bugs. DLBD incorporates heuristics and domain-specific knowledge to efficiently prune the search space and employs query space exploration mechanisms to avoid the repetitive search. Finally, DLBD utilizes a distributed processing framework to test database logic bugs in a scalable and efficient manner. Our system offers developers a reliable and effective way to detect and fix logic bugs in DBMSs.
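A minimal sketch of the ground-truth comparison idea: run the same generated schema and query on the system under test and on a reference engine, and report any mismatch as a candidate logic bug together with the inputs that reproduce it. This only illustrates the checking principle; DLBD's query/schema generation, pruning, and distributed execution are not shown, and SQLite stands in for both engines purely so the snippet runs.
```python
# Hedged illustration of ground-truth result checking for logic bug detection.
# SQLite plays both the reference engine and the system under test here so the
# example is runnable; in a real setting the second connection would target the
# DBMS being tested.
import sqlite3

def results(conn, query):
    return sorted(conn.execute(query).fetchall())

def check_query(query, setup_statements):
    ref = sqlite3.connect(":memory:")      # reference engine (ground truth)
    sut = sqlite3.connect(":memory:")      # stand-in for the system under test
    for conn in (ref, sut):
        for stmt in setup_statements:
            conn.execute(stmt)
    expected, actual = results(ref, query), results(sut, query)
    if expected != actual:
        # Minimal reproducing test case: schema, query, and the two result sets.
        return {"query": query, "expected": expected, "actual": actual}
    return None

setup = [
    "CREATE TABLE t(a INT, b INT)",
    "INSERT INTO t VALUES (1, 2), (3, NULL)",
]
print(check_query("SELECT a FROM t WHERE b IS NOT NULL", setup))
```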
Citations: 0
Explaining Differentially Private Query Results with DPXPlain
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611596
Tingyu Wang, Yuchao Tao, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy
Employing Differential Privacy (DP), the state-of-the-art privacy standard, to answer aggregate database queries poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? We propose to demonstrate DPXPlain, the first system for explaining group-by aggregate query answers with DP. DPXPlain allows users to compare values of two groups and receive a validity check, and further provides an explanation table with an interactive visualization, containing the approximately 'top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps.
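To make the noise-versus-data question concrete, here is a small numeric sketch of the two ingredients involved: a group-by count released with the Laplace mechanism under epsilon-DP, and a conservative confidence interval on the difference between two noisy groups, so a user can judge whether an observed gap is explainable by DP noise alone. The formulas are standard Laplace tail bounds, not DPXPlain's implementation.
```python
# Laplace mechanism for counts plus a conservative CI on the difference of two
# noisy group counts. Illustrative only; not DPXPlain's code.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

def diff_confidence_interval(noisy_a, noisy_b, epsilon, sensitivity=1.0, alpha=0.05):
    # For Laplace(0, b): P(|X| > t) = exp(-t / b), so t = b * ln(2 / alpha) bounds
    # each group's noise with probability 1 - alpha/2; summing the two bounds gives
    # a conservative (1 - alpha) interval for the true difference.
    b = sensitivity / epsilon
    t = b * np.log(2.0 / alpha)
    diff = noisy_a - noisy_b
    return diff - 2 * t, diff + 2 * t

eps = 0.5
group_a = dp_count(120, eps)   # e.g. noisy COUNT(*) for group A
group_b = dp_count(95, eps)    # e.g. noisy COUNT(*) for group B
low, high = diff_confidence_interval(group_a, group_b, eps)
print(f"noisy difference = {group_a - group_b:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```
If the interval excludes zero, the gap between the two groups is unlikely to be an artifact of the added noise, which is exactly the kind of validity check the abstract describes.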
Citations: 0
DataRinse: Semantic Transforms for Data Preparation Based on Code Mining
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611628
Ibrahim Abdelaziz, Julian Dolby, Udayan Khurana, Horst Samulowitz, Kavitha Srinivas
Data preparation is a crucial first step to any data analysis problem. This task is largely manual, performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms from large scale static analysis of repositories of code. Our motivation is that in any large enterprise, multiple personas such as data engineers and data scientists work on similar datasets. However, sharing or re-using that code is not obvious and difficult to execute. In this paper, we demonstrate DataRinse to handle data preparation, such that the system recommends code designed to help with the preparation of a column for data analysis more generally. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field such that related transforms appear coherently to a user. It is a human-in-the-loop system where the users select relevant code snippets produced by DataRinse to apply on their dataset.
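As a toy illustration of the code-mining direction (not DataRinse's analysis, which runs large-scale static analysis over repositories), the sketch below parses Python snippets with the `ast` module, extracts pandas-style column assignments, and groups the right-hand-side transforms by the column they touch so related transforms surface together.
```python
# Hedged sketch: statically scan Python snippets for pandas-style column transforms
# and group them by the column they modify. Requires Python 3.9+ for ast.unparse.
import ast
from collections import defaultdict

SNIPPETS = [
    "df['price'] = df['price'].str.replace('$', '').astype(float)",
    "df['price'] = df['price'].fillna(0)",
    "df['name'] = df['name'].str.strip().str.lower()",
]

def column_of(node):
    # Matches subscripts like df['col'] and returns 'col' when the index is a string.
    if isinstance(node, ast.Subscript) and isinstance(node.slice, ast.Constant):
        if isinstance(node.slice.value, str):
            return node.slice.value
    return None

def mine_transforms(snippets):
    by_column = defaultdict(list)
    for src in snippets:
        for stmt in ast.parse(src).body:
            if isinstance(stmt, ast.Assign) and len(stmt.targets) == 1:
                col = column_of(stmt.targets[0])
                if col is not None:
                    by_column[col].append(ast.unparse(stmt.value))
    return dict(by_column)

for col, transforms in mine_transforms(SNIPPETS).items():
    print(col, "->", transforms)
```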
Citations: 0
Demonstration of SPARQL ML: An Interfacing Language for Supporting Graph Machine Learning for RDF Graphs
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611599
Hussein Abdallah, Waleed Afandi, Essam Mansour
This demo paper presents KGNet, a graph machine learning-enabled RDF engine. KGNet integrates graph machine learning (GML) models with existing RDF engines as query operators to support node classification and link prediction tasks. For easy integration, KGNet extends the SPARQL language with user-defined predicates to support the GML operators. We refer to this extension as SPARQL ML query. Our SPARQL ML query optimizer is in charge of optimizing the selection of the near-optimal GML models. The development of KGNet poses research opportunities in various areas spanning KG management. In the paper, we demonstrate the ease of integration between the RDF engines and GML models through the SPARQL ML inference query language. We present several real use cases of different GML tasks on real KGs. Using KGNet, users do not need to learn a new scripting language or have a deep understanding of GML methods. The audience will experience KGNet with different KGs and GML models, as shown in our demo video and Colab notebook.
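For flavor, the sketch below issues a SPARQL query extended with a user-defined predicate for a node-classification task through the SPARQLWrapper client. The endpoint URL and the `kgnet:NodeClassifier` predicate IRI are hypothetical placeholders chosen for illustration; they are not taken from the paper.
```python
# Hedged sketch: send a SPARQL query containing a user-defined GML predicate to an
# endpoint. The predicate IRI and endpoint URL are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX kgnet: <http://kgnet.example/ontology/>
SELECT ?paper ?topic
WHERE {
  ?paper a <http://example.org/Publication> .
  ?paper kgnet:NodeClassifier ?topic .   # user-defined GML predicate (illustrative)
}
LIMIT 10
"""

def run(endpoint_url):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

if __name__ == "__main__":
    for row in run("http://localhost:8890/sparql"):   # hypothetical local endpoint
        print(row["paper"]["value"], row["topic"]["value"])
```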
Citations: 0
FastMosaic in Action: A New Mosaic Operator for Array DBMSs
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611590
Ramon Antonio Rodriges Zalipynis
Array DBMSs operate on N-d arrays. During the Data Ingestion phase, the widely used mosaic operator ingests a massive collection of overlapping arrays into a single large array, called a mosaic. The operator can utilize sophisticated statistical and machine learning techniques, e.g. Canonical Correlation Analysis (CCA), to produce a high-quality seamless mosaic in which the contrasts between the values of cells taken from the input overlapping arrays are minimized. However, performance becomes a major bottleneck when applying such advanced techniques to ever-growing array volumes. We introduce a new, scalable way to perform CCA for array mosaicking that is orders of magnitude faster than the popular Python scikit-learn library. Furthermore, we developed a hybrid web-desktop application to showcase our novel FastMosaic operator, based on this new CCA. A rich GUI enables users to comprehensively investigate input/output arrays and interactively guides them through an end-to-end mosaic construction on real-world geospatial arrays using FastMosaic, facilitating a convenient exploration of the FastMosaic pipeline and its internals.
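A minimal sketch of the statistical idea behind seamless mosaicking: fit CCA on cell values drawn from the overlap of two tiles, then map one tile's values toward the other's so the seam contrast shrinks. It uses scikit-learn's CCA (the baseline the abstract compares against) on synthetic data; it is an illustration of the principle, not the paper's optimized implementation.
```python
# Hedged sketch: use CCA fitted on an overlap region to harmonize one tile's values
# with another's. Synthetic 1-band data; real mosaics would use multi-band cells.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Two synthetic overlapping tiles: tile B is a brightness/contrast-shifted tile A.
overlap_a = rng.normal(100.0, 15.0, size=(1000, 1))
overlap_b = 1.2 * overlap_a + 10.0 + rng.normal(0.0, 2.0, size=(1000, 1))

cca = CCA(n_components=1)
cca.fit(overlap_b, overlap_a)                 # learn the mapping from tile B to tile A

tile_b_full = 1.2 * rng.normal(100.0, 15.0, size=(500, 1)) + 10.0
tile_b_adjusted = cca.predict(tile_b_full)    # re-expressed in tile A's value range

print("mean before:", float(tile_b_full.mean()),
      "mean after:", float(tile_b_adjusted.mean()))
```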
Citations: 0
Real-Time Workload Pattern Analysis for Large-Scale Cloud Databases
CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611557
Jiaqi Wang, Tianyi Li, Anni Wang, Xiaoze Liu, Lu Chen, Jie Chen, Jianye Liu, Junyang Wu, Feifei Li, Yunjun Gao
Hosting database services on cloud systems has become a common practice. This has led to an increasing volume of database workloads, which provides the opportunity for pattern analysis. Discovering workload patterns from a business-logic perspective is conducive to better understanding the trends and characteristics of the database system. However, existing workload pattern discovery systems are not suitable for large-scale cloud databases, which are commonly employed by the industry. This is because the workload patterns of large-scale cloud databases are generally far more complicated than those of ordinary databases. In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes query processing based on the discovered patterns. First, the Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, the Online Workload Mining Module separates encoded queries by business group and discovers the workload patterns for each group. Meanwhile, the Offline Training Module collects labels and trains the classification model using them. Finally, the Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting the discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduces the latency of online inference by 22%, compared with the state of the art.
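As a rough illustration of the offline side of workload pattern mining (not AWM's encoder, which uses rich semantic and execution features), the sketch below reduces raw SQL text to literal-free templates, vectorizes them with TF-IDF, and clusters the result to surface recurring patterns.
```python
# Hedged sketch: template raw SQL, embed with generic TF-IDF, and cluster to expose
# recurring workload patterns. A simplification of the pipeline described above.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def to_template(sql):
    sql = re.sub(r"'[^']*'", "?", sql)           # strip string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # strip numeric literals
    return sql.lower()

LOG = [
    "SELECT * FROM orders WHERE user_id = 42",
    "SELECT * FROM orders WHERE user_id = 97",
    "UPDATE inventory SET stock = stock - 1 WHERE sku = 'A-10'",
    "UPDATE inventory SET stock = stock - 3 WHERE sku = 'B-77'",
]

templates = [to_template(q) for q in LOG]
features = TfidfVectorizer(token_pattern=r"[a-z_?*]+").fit_transform(templates)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for query, label in zip(LOG, labels):
    print(label, query)
```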
Citations: 0