Proceedings of the VLDB Endowment: Latest Articles


Progressive Partitioning for Parallelized Query Execution in Google's Napa

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611541

Junichi Tatemura, Tao Zou, Jagan Sankaranarayanan, Yanlai Huang, Jim Chen, Yupu Zhang, Kevin Lai, Hao Zhang, Gokul Nath Babu Manoharan, Goetz Graefe, Divyakant Agrawal, Brad Adelberg, Shilpa Kolhar, Indrajit Roy

Napa holds Google's critical data warehouses in log-structured merge trees for real-time data ingestion and sub-second response for billions of queries per day. These queries are often multi-key look-ups in highly skewed tables and indexes. In our production experience, only progressive query-specific partitioning can achieve Napa's strict query latency SLOs. Here we advocate good-enough partitioning that keeps the per-query partitioning time low without risking uneven work distribution. Our design combines pragmatic system choices and algorithmic innovations. For instance, B-trees are augmented with statistics of key distributions, thus serving the dual purpose of aiding lookups and partitioning. Furthermore, progressive partitioning is designed to be "good enough" thereby balancing partitioning time with performance. The resulting system is robust and successfully serves day-in-day-out billions of queries with very high quality of service forming a core infrastructure at Google.
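The idea of reusing index statistics for "good enough" partitioning can be sketched roughly as follows. This is only an illustration under assumptions, not Napa's implementation: the `Node` class, the `max_share` threshold, and the greedy `assign` step are all hypothetical names chosen for this sketch. Each index node carries a row count for its subtree, and the partitioner descends only into subtrees that are too heavy, so partitioning time stays low where the data is not skewed.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int, int]  # (lowest key, highest key, estimated row count)


@dataclass
class Node:
    """Hypothetical index node annotated with the row count of its subtree."""
    lo: int
    hi: int
    count: int
    children: List["Node"] = field(default_factory=list)


def progressive_ranges(root: Node, max_share: float) -> List[Range]:
    """Descend only into subtrees holding more than max_share of all rows."""
    budget = root.count * max_share
    ranges: List[Range] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.count > budget and node.children:
            # Too heavy: refine this region one level further.
            stack.extend(reversed(node.children))
        else:
            # Good enough: accept the range without reading deeper levels.
            ranges.append((node.lo, node.hi, node.count))
    return ranges


def assign(ranges: List[Range], workers: int) -> List[List[Range]]:
    """Greedy packing: the largest remaining range goes to the least-loaded worker."""
    loads = [(0, w) for w in range(workers)]
    heapq.heapify(loads)
    plan: List[List[Range]] = [[] for _ in range(workers)]
    for r in sorted(ranges, key=lambda r: r[2], reverse=True):
        load, w = heapq.heappop(loads)
        plan[w].append(r)
        heapq.heappush(loads, (load + r[2], w))
    return plan


# A skewed toy index: 90% of the rows sit under the first child.
hot = Node(0, 49, 900, [Node(0, 24, 450), Node(25, 49, 450)])
cold = Node(50, 99, 100)
root = Node(0, 99, 1000, [hot, cold])

ranges = progressive_ranges(root, max_share=0.5)
plan = assign(ranges, workers=2)
print(ranges)
print([sum(r[2] for r in part) for part in plan])
```

In this toy tree only the heavy subtree is split further; the light one is accepted as a single range, which is the trade-off between partitioning time and balance that the abstract describes.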


Citations: 0

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611569

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.


Citations: 43

Demonstrating ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Joins via Reinforcement Learning

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611629

Junxiong Wang, Mitchell Gray, Immanuel Trummer, Ahmet Kara, Dan Olteanu

Performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. It is challenging to identify suitable orders prior to query execution due to the huge search space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We demonstrate ADOPT, a novel query engine that integrates adaptive query processing with a worst-case optimal join algorithm. ADOPT divides query execution into episodes, during which different attribute orders are invoked. With runtime feedback on performance of different attribute orders, ADOPT rapidly approaches near-optimal orders. Moreover, ADOPT uses a unique data structure which keeps track of the processed input data to prevent redundant work across different episodes. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments, ADOPT outperforms baselines, including commercial and open-source systems utilizing worst-case optimal join algorithms, particularly for complex queries that are difficult to optimize.
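The episode loop with its exploration/exploitation trade-off can be illustrated with a small bandit sketch. This is an assumption-laden stand-in, not ADOPT's actual learner: it uses plain UCB1 over all permutations of three attributes, and `reward_of` is a hypothetical function returning a per-episode reward in [0, 1] (for example, the fraction of input processed within the episode's time slice).

```python
import math
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Order = Tuple[str, ...]


def choose_order(stats: Dict[Order, List[float]], step: int) -> Order:
    """UCB1: try every order once, then pick mean reward plus exploration bonus."""
    for order, (pulls, _) in stats.items():
        if pulls == 0:
            return order

    def score(order: Order) -> float:
        pulls, total = stats[order]
        return total / pulls + math.sqrt(2.0 * math.log(step) / pulls)

    return max(stats, key=score)


def run(orders: Sequence[Order],
        reward_of: Callable[[Order], float],
        episodes: int) -> Dict[Order, List[float]]:
    """Run one attribute order per episode and record the observed reward."""
    stats: Dict[Order, List[float]] = {o: [0, 0.0] for o in orders}
    for step in range(episodes):
        order = choose_order(stats, step)
        stats[order][0] += 1
        stats[order][1] += reward_of(order)
    return stats


# Hypothetical workload in which one attribute order is much cheaper than the rest.
orders = list(permutations(("a", "b", "c")))
reward_of = lambda order: 0.9 if order == ("b", "a", "c") else 0.1

stats = run(orders, reward_of, episodes=300)
best = max(stats, key=lambda o: stats[o][0])
print(best, stats[best][0])
```

Every order is still tried occasionally, but most episodes end up running the order with the best observed reward. The sketch leaves out the data structure the abstract mentions for avoiding redundant work across episodes.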


Citations: 0

Demonstrating GPT-DB: Generating Query-Specific and Customizable Code for SQL Processing with GPT-4

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611630

Immanuel Trummer

GPT-DB generates code for SQL processing in general-purpose programming languages such as Python. Generated code can be freely customized using user-provided natural language instructions. This enables users, for instance, to try out specific libraries for SQL processing or to generate non-standard output while processing. GPT-DB is based on OpenAI's GPT model series, neural networks capable of translating natural language instructions into code. By default, GPT-DB exploits the most recently released GPT-4 model whereas visitors may also select prior versions for comparison. GPT-DB automatically generates query-specific prompts, instructing GPT on code generation. These prompts include a description of the target database, as well as logical query plans described as natural language text, and instructions for customization. GPT-DB automatically verifies, and possibly re-generates, code using a reference database system for result comparisons. It enables users to select code samples for training, thereby increasing accuracy for future queries. The proposed demonstration showcases code generation for various queries and with varying instructions for code customization.
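The verify-and-possibly-regenerate step can be sketched as a comparison against a reference database. This is a minimal sketch under assumptions, not GPT-DB's code: SQLite (from the Python standard library) plays the reference system, no model is called, and the two candidate functions are hypothetical stand-ins for successive pieces of generated code.

```python
import sqlite3
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple
Candidate = Callable[[List[Row]], Iterable[Row]]


def reference_rows(conn: sqlite3.Connection, sql: str) -> List[Row]:
    """Run the query on the reference database; sort so row order is ignored."""
    return sorted(conn.execute(sql).fetchall())


def verify(candidate: Candidate, conn: sqlite3.Connection,
           sql: str, table_rows: List[Row]) -> bool:
    """Accept generated code only if its result matches the reference result."""
    try:
        got = sorted(tuple(r) for r in candidate(table_rows))
    except Exception:
        return False  # code that crashes counts as a failed attempt
    return got == reference_rows(conn, sql)


def first_verified(candidates: Iterable[Candidate], conn: sqlite3.Connection,
                   sql: str, table_rows: List[Row]) -> Optional[Candidate]:
    """Stand-in for re-generation: keep trying attempts until one verifies."""
    for candidate in candidates:
        if verify(candidate, conn, sql, table_rows):
            return candidate
    return None


rows = [(1, 5), (2, 10), (3, 20)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
sql = "SELECT id FROM orders WHERE amount > 10"

# Hypothetical generated code: the first attempt has an off-by-one predicate.
wrong = lambda table: [(i,) for i, amount in table if amount >= 10]
right = lambda table: [(i,) for i, amount in table if amount > 10]

accepted = first_verified([wrong, right], conn, sql, rows)
print(accepted is right)
```

Comparing sorted results treats the output as a multiset of rows; a query with ORDER BY would need an order-sensitive comparison instead.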


Citations: 1

EQUI-VOCAL Demonstration: Synthesizing Video Queries from User Interactions

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611600

Enhao Zhang, Maureen Daum, Dong He, Manasi Ganti, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

We demonstrate EQUI-VOCAL, a system that synthesizes compositional queries over videos from user feedback. EQUI-VOCAL enables users to query a video database for complex events by providing a few positive and negative examples of what they are looking for and labeling a small number of additional system-selected examples. Using those user inputs, EQUI-VOCAL synthesizes declarative queries that can then retrieve additional instances of the desired events. The demonstration makes two contributions: it introduces EQUI-VOCAL's graphical user interface and enables conference attendees to experiment with EQUI-VOCAL on a variety of queries. Both enable users to gain a better understanding of EQUI-VOCAL's query synthesis approach and to explore the impact of hyperparameters and label noise on system performance.


Citations: 0

A Demonstration of DLBD: Database Logic Bug Detection System

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611584

Xiu Tang, Sai Wu, Dongxiang Zhang, Ziyue Wang, Gongsheng Yuan, Gang Chen

Database management systems (DBMSs) are prone to logic bugs that can result in incorrect query results. Current debugging tools are limited to single table queries and struggle with issues like lack of ground-truth results and repetitive query space exploration. In this paper, we demonstrate DLBD, a system that automatically detects logic bugs in databases. DLBD offers holistic logic bug detection by providing automatic schema and query generation and ground-truth query result retrieval. Additionally, DLBD provides minimal test cases and root cause analysis for each bug to aid developers in reproducing and fixing detected bugs. DLBD incorporates heuristics and domain-specific knowledge to efficiently prune the search space and employs query space exploration mechanisms to avoid the repetitive search. Finally, DLBD utilizes a distributed processing framework to test database logic bugs in a scalable and efficient manner. Our system offers developers a reliable and effective way to detect and fix logic bugs in DBMSs.


Citations: 0

Explaining Differentially Private Query Results with DPXPlain

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611596

Tingyu Wang, Yuchao Tao, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy

Employing Differential Privacy (DP), the state-of-the-art privacy standard, to answer aggregate database queries poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? We propose to demonstrate DPXPlain, the first system for explaining group-by aggregate query answers with DP. DPXPlain allows users to compare values of two groups and receive a validity check, and further provides an explanation table with an interactive visualization, containing the approximately 'top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps.
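The question of whether a gap between two groups reflects the data or only the added noise can be made concrete with a small sketch. It assumes counting queries with sensitivity 1 answered by the Laplace mechanism, and a conservative union-bound margin. The function names and the test itself are illustrative assumptions, not DPXPlain's actual validity check.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query (sensitivity 1, an assumption here)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def difference_margin(epsilon: float, beta: float) -> float:
    """Bound t with P(|noise_a - noise_b| > t) <= beta.

    Each Laplace(0, b) noise term exceeds b*ln(2/beta) in absolute value with
    probability beta/2, so by a union bound the difference stays within
    2*b*ln(2/beta) with probability at least 1 - beta.
    """
    scale = 1.0 / epsilon
    return 2.0 * scale * math.log(2.0 / beta)


def gap_is_real(noisy_a: float, noisy_b: float,
                epsilon: float, beta: float = 0.05) -> bool:
    """True if noise alone is unlikely (probability <= beta) to produce this gap."""
    return abs(noisy_a - noisy_b) > difference_margin(epsilon, beta)


rng = random.Random(0)
a = noisy_count(100, epsilon=1.0, rng=rng)
b = noisy_count(160, epsilon=1.0, rng=rng)
print(round(difference_margin(1.0, 0.05), 3))
print(gap_is_real(a, b, epsilon=1.0))
```

Because the check only post-processes values that were already released with noise, it does not consume additional privacy budget.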


Citations: 0

DataRinse: Semantic Transforms for Data Preparation Based on Code Mining

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611628

Ibrahim Abdelaziz, Julian Dolby, Udayan Khurana, Horst Samulowitz, Kavitha Srinivas

Data preparation is a crucial first step in any data analysis problem. This task is largely manual, performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms from large-scale static analysis of code repositories. Our motivation is that in any large enterprise, multiple personas such as data engineers and data scientists work on similar datasets. However, sharing or reusing their code is neither obvious nor easy to carry out. In this paper, we demonstrate how DataRinse handles data preparation: the system recommends code designed to help prepare a column for data analysis more generally. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field such that related transforms appear coherently to a user. It is a human-in-the-loop system where the users select relevant code snippets produced by DataRinse to apply to their dataset.


Citations: 0

Demonstration of SPARQL^ML: An Interfacing Language for Supporting Graph Machine Learning for RDF Graphs

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611599

Hussein Abdallah, Waleed Afandi, Essam Mansour

This demo paper presents KGNet, a graph machine learning-enabled RDF engine. KGNet integrates graph machine learning (GML) models with existing RDF engines as query operators to support node classification and link prediction tasks. For easy integration, KGNet extends the SPARQL language with user-defined predicates to support the GML operators. We refer to this extension as a SPARQL^ML query. Our SPARQL^ML query optimizer is in charge of optimizing the selection of the near-optimal GML models. The development of KGNet poses research opportunities in various areas spanning KG management. In the paper, we demonstrate the ease of integration between the RDF engines and GML models through the SPARQL^ML inference query language. We present several real use cases of different GML tasks on real KGs. Using KGNet, users do not need to learn a new scripting language or have a deep understanding of GML methods. The audience will experience KGNet with different KGs and GML models, as shown in our demo video and Colab notebook.


Citations: 0

FastMosaic in Action: A New Mosaic Operator for Array DBMSs

Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611590

Ramon Antonio Rodriges Zalipynis

Array DBMSs operate on N-d arrays. During the Data Ingestion phase, the widely used mosaic operator ingests a massive collection of overlapping arrays into a single large array, called a mosaic. The operator can utilize sophisticated statistical and machine learning techniques, e.g. Canonical Correlation Analysis (CCA), to produce a high-quality seamless mosaic where the contrasts between the values of cells taken from input overlapping arrays are minimized. However, the performance bottleneck becomes a major challenge when applying such advanced techniques to ever-growing array volumes. We introduce a new, scalable way to perform CCA that is orders of magnitude faster than Python's popular scikit-learn library for the purpose of array mosaicking. Furthermore, we developed a hybrid web-desktop application to showcase our novel FastMosaic operator, based on this new CCA. A rich GUI enables users to comprehensively investigate in/out arrays and interactively guides them through an end-to-end mosaic construction on real-world geospatial arrays using FastMosaic, facilitating a convenient exploration of the FastMosaic pipeline and its internals.


Citations: 0
