Proceedings of the VLDB Endowment: Latest Articles

Generations of Knowledge Graphs: The Crazy Ideas and the Business Impact
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611636
Xin Luna Dong
Knowledge Graphs (KGs) have been used to support a wide range of applications, from web search to personal assistants. In this paper, we describe three generations of knowledge graphs: entity-based KGs, which have been supporting general search and question answering (e.g., at Google and Bing); text-rich KGs, which have been supporting search and recommendations for products, bio-informatics, etc. (e.g., at Amazon and Alibaba); and the emerging integration of KGs and LLMs, which we call dual neural KGs. We describe the characteristics of each generation of KGs, the crazy ideas behind the scenes in constructing such KGs, and the techniques developed over time to enable industry impact. In addition, we use KGs as examples to demonstrate a recipe to evolve research ideas from innovations to production practice, and then to the next level of innovations, to advance both science and business.
Citations: 1
Efficient Execution of User-Defined Functions in SQL Queries
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611574
Yannis Foufoulas, Alkis Simitsis
User-defined functions (UDFs) have been widely used to overcome the expressivity limitations of SQL and complement its declarative nature with functional capabilities. UDFs are particularly useful in today's applications that involve complex data analytics and machine learning algorithms and logic. However, UDFs pose significant performance challenges in query processing and optimization, largely due to the mismatch of the UDF execution and SQL processing environments. In this tutorial, we present state-of-the-art methods and systems towards efficient execution of UDFs in SQL queries. We focus on low-level techniques for physical optimization and compilation of UDF queries, describe and compare the core, recent approaches in the area, discuss their advantages and limitations, identify critical gaps in theory and practice, and propose promising future research directions.
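The mismatch the abstract describes can be seen on a small scale with Python's built-in `sqlite3` module. The sketch below is only an illustration and is not taken from the tutorial: the table `items` and the UDF `cents_with_tax` are hypothetical names. It registers a scalar Python UDF, counts how often the engine has to call into Python, and compares the result with an equivalent expression that the engine evaluates natively.

```python
import sqlite3


def run_demo():
    """Run the same computation as a Python UDF and as native SQL."""
    call_count = [0]  # mutable counter captured by the UDF below

    def cents_with_tax(cents):
        # Hypothetical scalar UDF; the SQL engine treats it as a black box.
        call_count[0] += 1
        return cents * 120 // 100

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, cents INTEGER)")
    conn.executemany(
        "INSERT INTO items (cents) VALUES (?)", [(1000,), (2000,), (3000,)]
    )
    # Register the Python function under a SQL-visible name (one argument).
    conn.create_function("cents_with_tax", 1, cents_with_tax)

    udf_rows = conn.execute(
        "SELECT cents_with_tax(cents) FROM items ORDER BY id"
    ).fetchall()
    # Equivalent integer arithmetic evaluated entirely inside the engine.
    native_rows = conn.execute(
        "SELECT cents * 120 / 100 FROM items ORDER BY id"
    ).fetchall()
    conn.close()
    return udf_rows, native_rows, call_count[0]


if __name__ == "__main__":
    print(run_demo())
```

Both queries return the same rows, but the UDF version leaves the SQL engine for every row it processes, which is one simple instance of the execution-environment mismatch the abstract refers to.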
Citations: 0
Lynx: A Graph Query Framework for Multiple Heterogeneous Data Sources
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611587
Zhihong Shen, Chuan Hu, Zihao Zhao
Graph models are increasingly popular among modern applications for their ability to model complex relationships between entities. Users tend to query the data as a graph with graph operations (e.g., graph navigation and exploration). However, a large fraction of the data resides in relational databases or other storage systems. Challenges arise in uniformly querying multiple heterogeneous data sources as a graph. Traditional solutions are limited by time-consuming data integration, expensive development effort, and incomplete query requirements. Thus, we developed Lynx, a general graph query framework, to simplify querying graph data by converting complex statements into basic graph operations. Instead of connecting directly to the data sources, Lynx retrieves data through user-implemented interfaces for those graph operations. We demonstrate Lynx's capabilities through real-world scenarios, showcasing Lynx's ability to process graph queries on multiple heterogeneous data sources and also to be used as a generic graph query engine development framework.
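The idea of retrieving data through user-implemented interfaces for basic graph operations can be sketched as follows. This is a minimal illustration under stated assumptions, not Lynx's actual API: the `GraphSource` protocol, the two source classes, and the `expand` and `two_hop` helpers are all hypothetical names invented for this example.

```python
from typing import Iterable, List, Protocol


class GraphSource(Protocol):
    """Hypothetical interface: the one basic operation each source implements."""

    def neighbors(self, node_id: str, rel: str) -> Iterable[str]: ...


class RowSource:
    """Edges kept as relational-style rows of (src, rel, dst)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def neighbors(self, node_id, rel):
        return [dst for src, r, dst in self.rows if src == node_id and r == rel]


class AdjacencySource:
    """Edges kept in a key-value style map of (src, rel) -> [dst, ...]."""

    def __init__(self, adjacency):
        self.adjacency = adjacency

    def neighbors(self, node_id, rel):
        return list(self.adjacency.get((node_id, rel), []))


def expand(sources: List[GraphSource], node_id: str, rel: str) -> List[str]:
    """Basic operation: one navigation step, unioned over all sources."""
    seen, out = set(), []
    for source in sources:
        for dst in source.neighbors(node_id, rel):
            if dst not in seen:
                seen.add(dst)
                out.append(dst)
    return out


def two_hop(sources: List[GraphSource], start: str, rel: str) -> List[str]:
    """A more complex query decomposed into two basic expand steps."""
    result = []
    for mid in expand(sources, start, rel):
        for dst in expand(sources, mid, rel):
            if dst not in result:
                result.append(dst)
    return result


if __name__ == "__main__":
    rows = RowSource([("alice", "KNOWS", "bob")])
    kv = AdjacencySource({("bob", "KNOWS"): ["carol"]})
    print(two_hop([rows, kv], "alice", "KNOWS"))
```

In this sketch the query logic never touches a storage system directly; each source only has to answer the `neighbors` call, so a path can cross from one store into another.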
Citations: 0
PAINE Demo: Optimizing Video Selection Queries with Commonsense Knowledge
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611581
Wenjia He, Ibrahim Sabek, Yuze Lou, Michael Cafarella
Because video is becoming more popular and constitutes a major part of data collection, we need to process video selection queries: selecting videos that contain target objects. However, a naïve scan of a video corpus without optimization would be extremely inefficient because it applies complex detectors to irrelevant videos. This demo presents Paine, a video query system that employs a novel index mechanism to optimize video selection queries via commonsense knowledge. Paine samples video frames to build an inexpensive lossy index, then leverages probabilistic models based on existing commonsense knowledge sources to capture the semantic-level correlation among video frames, thereby allowing Paine to predict the content of unindexed video. These models can predict which videos are likely to satisfy selection predicates, so that Paine avoids processing irrelevant videos. We will demonstrate a system prototype of Paine for accelerating the processing of video selection queries, allowing VLDB'23 participants to use the Paine interface to run queries. Users can compare Paine with the baseline, the SCAN method.
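A rough sketch of how a lossy index plus correlation knowledge could prune a scan is shown below. It is not Paine's model: the noisy-OR combination, the independence assumption behind it, and every name and probability in the example (`estimate_hit_probability`, `select_candidates`, the co-occurrence table) are assumptions made for illustration.

```python
def estimate_hit_probability(sampled_labels, target, cooccurrence):
    """Estimate the chance that `target` appears somewhere in a video.

    `sampled_labels` holds the labels found in the few indexed frames.
    `cooccurrence[(label, target)]` is an assumed probability that a video
    showing `label` also shows `target`. Labels are combined with a
    noisy-OR, which treats them as independent pieces of evidence.
    """
    if target in sampled_labels:
        return 1.0
    miss = 1.0
    for label in sampled_labels:
        miss *= 1.0 - cooccurrence.get((label, target), 0.0)
    return 1.0 - miss


def select_candidates(index, target, cooccurrence, threshold):
    """Return the videos worth passing to the expensive detector."""
    return [
        video_id
        for video_id, labels in index.items()
        if estimate_hit_probability(labels, target, cooccurrence) >= threshold
    ]


if __name__ == "__main__":
    index = {
        "v1": {"road", "car"},
        "v2": {"kitchen"},
        "v3": {"stop sign"},
    }
    cooccurrence = {
        ("road", "stop sign"): 0.6,
        ("car", "stop sign"): 0.5,
        ("kitchen", "stop sign"): 0.01,
    }
    print(select_candidates(index, "stop sign", cooccurrence, threshold=0.5))
```

Only the videos returned by `select_candidates` would be handed to the detector; a plain scan, by contrast, would run it on every video.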
Citations: 0
ChainDash: An Ad-Hoc Blockchain Data Analytics System
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611611
Yushi Liu, Liwei Yuan, Zhihao Chen, Yekai Yu, Zhao Zhang, Cheqing Jin, Ying Yan
The emergence of digital asset applications, driven by Web 3.0 and powered by blockchain technology, has led to a growing demand for blockchain-specific graph analytics to unearth insights. However, current blockchain data analytics systems are unable to perform efficient ad-hoc graph analytics over both live and past time windows due to their inefficient data synchronization and slow graph snapshot retrieval. To address these issues, we propose ChainDash, a blockchain data analytics system that combines a highly parallelized data synchronization component with a retrieval-optimized temporal graph store. By leveraging these techniques, ChainDash supports efficient ad-hoc graph analytics of smart contract activities over arbitrary time windows. In the demonstration, we showcase the interactive visualization interfaces of ChainDash, where attendees will execute customized queries for ad-hoc graph analytics of blockchain data.
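Retrieving a graph snapshot for an arbitrary time window can be sketched with a time-ordered edge list and binary search. This is only an illustration of the general technique, not ChainDash's storage design; the `TemporalEdgeStore` class and the `out_degree` helper are hypothetical.

```python
import bisect
from collections import Counter


class TemporalEdgeStore:
    """Keeps edges sorted by timestamp so a window is one contiguous slice."""

    def __init__(self):
        self._times = []  # sorted timestamps
        self._edges = []  # (src, dst) pairs, aligned with self._times

    def add(self, timestamp, src, dst):
        # Insert at the position that keeps both lists time-ordered.
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._edges.insert(pos, (src, dst))

    def window(self, start, end):
        """Return edges with start <= timestamp < end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return self._edges[lo:hi]


def out_degree(edges):
    """A tiny ad-hoc analytic over a retrieved snapshot."""
    return dict(Counter(src for src, _ in edges))


if __name__ == "__main__":
    store = TemporalEdgeStore()
    store.add(10, "a", "b")
    store.add(30, "b", "c")
    store.add(20, "a", "c")  # arrives out of order
    snapshot = store.window(10, 30)
    print(snapshot, out_degree(snapshot))
```

Each window lookup costs two binary searches plus the size of the result, regardless of how far back the window lies. The list insertions in `add` are linear, so a real store would use a structure better suited to heavy write traffic.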
Citations: 0
Machine Learning for Subgraph Extraction: Methods, Applications and Challenges
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611571
Kai Siong Yow, Ningyi Liao, Siqiang Luo, Reynold Cheng
Subgraphs are obtained by extracting a subset of vertices and a subset of edges from the associated original graphs, and many graph properties are known to be inherited by subgraphs. Subgraphs can be applied in many areas such as social networks, recommender systems, biochemistry and fraud discovery. Researchers from various communities have devoted a great deal of attention to investigating numerous subgraph problems, by proposing algorithms that mainly extract important structures of a given graph. There are, however, some limitations that should be addressed with regard to the efficiency, effectiveness and scalability of these traditional algorithms. As a consequence, machine learning techniques, one of the latest trends, have recently been employed in the database community to address various subgraph problems, considering that they have been shown to be beneficial in dealing with graph-related problems. We discuss learning-based approaches for four well-known subgraph problems in this tutorial, namely subgraph isomorphism, maximum common subgraph, community detection and community search problems. We give a general description of each proposed model, and analyse its design and performance. To allow further investigations on relevant subgraph problems, we suggest some potential future directions in this area. We believe that this work can be used as one of the primary resources for researchers who intend to develop learning models in solving problems that are closely related to subgraphs.
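The definition in the first sentence of the abstract can be made concrete with a few lines of code. The sketch below is a plain illustration for undirected graphs and is unrelated to the learning-based methods the tutorial covers; the function names are chosen for this example.

```python
def is_subgraph(sub_vertices, sub_edges, vertices, edges):
    """Check that (sub_vertices, sub_edges) is a subgraph of (vertices, edges).

    Edges are frozensets of two vertices. A subgraph needs a subset of the
    vertices, a subset of the edges, and every kept edge must connect
    vertices that were also kept.
    """
    return (
        sub_vertices <= vertices
        and sub_edges <= edges
        and all(edge <= sub_vertices for edge in sub_edges)
    )


def induced_subgraph(vertices, edges, keep):
    """Keep the chosen vertices and every original edge between them."""
    kept = set(keep) & set(vertices)
    return kept, {edge for edge in edges if edge <= kept}


if __name__ == "__main__":
    V = {1, 2, 3, 4}
    E = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (4, 1)]}  # a 4-cycle
    sub_v, sub_e = induced_subgraph(V, E, {1, 2, 3})
    print(sub_v, sub_e, is_subgraph(sub_v, sub_e, V, E))
```

Dropping vertex 4 from the 4-cycle leaves the path 1, 2, 3: the two edges that touched vertex 4 disappear with it.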
Citations: 0
MINT: Detecting Fraudulent Behaviors from Time-Series Relational Data
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611551
Fei Xiao, Yuncheng Wu, Meihui Zhang, Gang Chen, Beng Chin Ooi
E-commerce platforms, such as Shopee, have accumulated a huge volume of time-series relational data, which contain useful information for differentiating fraudulent users from benign users. Existing fraud behavior detection approaches typically model the time-series data with a vanilla Recurrent Neural Network (RNN) or combine the whole sequence as a single intention without considering the temporal behavioral patterns, row-level interactions, and different view intentions. In this paper, we present MINT, a Multiview row-INteractive Time-aware framework to detect fraudulent behaviors from time-series structured data. The key idea of MINT is to build a time-aware behavior graph for each user's time-series relational data with each row represented as an action node. We utilize the user's temporal information to construct three different graph convolutional matrices for hierarchically learning the user's intentions from different views, that is, short-term, medium-term, and long-term intentions. To capture more meaningful row-level interactions and alleviate the over-smoothing issue in a vanilla time-aware behavior graph, we propose a novel gated neighbor interaction mechanism to calibrate the information aggregated by each action node. Since the receptive fields of the three graph convolutional layers are designed to grow nearly exponentially, MINT requires far fewer layers than traditional deep graph neural networks (GNNs) to capture multi-hop neighboring information, and avoids recurrent feedforward propagation, thus leading to higher training efficiency and scalability. Our extensive experiments on the large-scale e-commerce datasets from Shopee with up to 4.6 billion records and a public dataset from Amazon show that MINT achieves superior performance over 10 state-of-the-art models and provides better interpretability and scalability.
Citations: 0
Interpretable Clustering of Multivariate Time Series with Time2Feat
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611604
Angela Bonifati, Francesco Del Buono, Francesco Guerra, Miki Lombardi, Donato Tiano
This paper showcases Time2Feat, an end-to-end machine learning system for Multivariate Time Series (MTS) clustering. The system relies on interpretable inter-signal and intra-signal features extracted from the time series. Then, a dimensionality reduction technique is applied to select a subset of features that retain most of the information, thus enhancing the interpretability of the results. In addition, the system enables domain specialists to semi-supervise the process by submitting a small collection of MTS with a target cluster. This process further improves both accuracy and interpretability, by reducing the number of features used by the clustering process. The demonstration shows the application of Time2Feat to various MTS datasets, by creating clusters from MTS datasets of interest, experimenting with different settings and using the approach capabilities to interpret the clusters generated.
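The distinction between intra-signal and inter-signal features can be illustrated with a small, dependency-free sketch. The particular features below (mean, standard deviation, Pearson correlation) are stand-ins chosen for this example and are not claimed to be the feature set Time2Feat uses; all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def intra_signal_features(mts):
    """Features computed from each signal on its own."""
    features = {}
    for name, values in mts.items():
        features[f"{name}.mean"] = mean(values)
        features[f"{name}.std"] = pstdev(values)
    return features


def pearson(x, y):
    """Pearson correlation; defined as 0.0 when either signal is constant."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    covariance = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return covariance / (sx * sy)


def inter_signal_features(mts):
    """Features computed from pairs of signals."""
    return {
        f"corr({a},{b})": pearson(mts[a], mts[b])
        for a, b in combinations(sorted(mts), 2)
    }


def extract_features(mts):
    """One named feature vector per multivariate time series."""
    return {**intra_signal_features(mts), **inter_signal_features(mts)}


if __name__ == "__main__":
    series = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [5, 5, 5, 5]}
    for key, value in extract_features(series).items():
        print(key, value)
```

Because every feature keeps a readable name, a clustering built on the resulting vectors can be explained in terms of the signals involved, which is the kind of interpretability the abstract emphasizes.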
Citations: 0
DuckPGQ: Bringing SQL/PGQ to DuckDB
Tier 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611614
Daniel ten Wolde, Gábor Szárnyas, Peter Boncz
We demonstrate the most important new feature of SQL:2023, namely SQL/PGQ, which eases querying graphs using SQL by introducing new syntax for pattern matching and (shortest) path-finding. We show how support for SQL/PGQ can be integrated into an RDBMS, specifically in the DuckDB system, using an extension module called DuckPGQ. As such, we also demonstrate the use of the DuckDB extensibility mechanism, which allows us to add new functions, data types, operators, optimizer rules, storage systems, and even parsers to DuckDB. We also describe the new data structures and algorithms that the DuckPGQ module is based on, and how they are injected into SQL plans. While the demonstrated DuckPGQ extension module is lean and efficient, we sketch a roadmap to (i) improve its performance through new algorithms (factorized and WCOJ) and better parallelism and (ii) extend its functionality to scenarios beyond SQL, e.g., building and analyzing Graph Neural Networks.
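To make concrete what shortest path-finding over relational data involves, the sketch below runs a breadth-first search directly on edge-table rows. It is a textbook algorithm used here for illustration only; it is not DuckPGQ's implementation and does not use SQL/PGQ syntax, and the function name is hypothetical.

```python
from collections import deque


def shortest_path_length(edge_rows, source, target):
    """Hop count of a shortest directed path, or None if unreachable.

    `edge_rows` is an iterable of (src, dst) tuples, i.e. the rows of an
    edge table such as one a property graph could be defined over.
    """
    # Build an adjacency map once so each step is a dictionary lookup.
    adjacency = {}
    for src, dst in edge_rows:
        adjacency.setdefault(src, []).append(dst)

    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt == target:
                return dist + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    knows = [(1, 2), (2, 3), (1, 3), (3, 4)]
    print(shortest_path_length(knows, 1, 4))
    print(shortest_path_length(knows, 4, 1))
```

Expressing the same search in plain SQL typically requires a recursive query, which is part of the motivation for dedicated path-finding syntax.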
Citations: 0
AutoSteer: Learned Query Optimization for Any SQL Database
Zone 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611544
Christoph Anneser, Nesime Tatbul, David Cohen, Zhenggang Xu, Prithviraj Pandian, Nikolay Laptev, Ryan Marcus
This paper presents AutoSteer, a learning-based solution that automatically drives query optimization in any SQL database that exposes tunable optimizer knobs. AutoSteer builds on the Bandit optimizer (Bao) and extends it with new capabilities (e.g., automated hint-set discovery) to minimize integration effort and facilitate usability in both monolithic and disaggregated SQL systems. We successfully applied AutoSteer on PostgreSQL, PrestoDB, Spark-SQL, MySQL, and DuckDB - five popular open-source database engines with diverse query optimizers. We then conducted a detailed experimental evaluation with public benchmarks (JOB, Stackoverflow, TPC-DS) and a production workload from Meta's PrestoDB deployments. Our evaluation shows that AutoSteer can not only outperform these engines' native query optimizers (e.g., up to 40% improvements for PrestoDB) but can also match the performance of Bao-for-PostgreSQL with reduced human supervision and increased adaptivity, as it replaces Bao's static, expert-picked hint-sets with those that are automatically discovered. We also provide an open-source implementation of AutoSteer together with a visual tool for interactive use by query optimization experts.
Citations: 1