
The VLDB Journal: Latest Publications

A versatile framework for attributed network clustering via K-nearest neighbor augmentation
Pub Date : 2024-09-16 DOI: 10.1007/s00778-024-00875-8
Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo

Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into k disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present AHCKA as an efficient approach to attributed hypergraph clustering (AHC). AHCKA includes a carefully-crafted K-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop ANCKA as a versatile attributed network clustering framework, capable of attributed graph clustering, attributed multiplex graph clustering, and AHC. Moreover, we devise ANCKA-GPU with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.
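To make the K-nearest neighbor augmentation step concrete, the minimal Python sketch below links each node to the K nodes whose attribute vectors are most similar to it; the cosine similarity measure and the function name knn_augment are illustrative assumptions, not the exact construction used in AHCKA/ANCKA.

import numpy as np

def knn_augment(X, K=2):
    # Link each node i to the K nodes whose attribute vectors are most
    # similar to X[i] under cosine similarity. Illustrative stand-in for
    # the paper's KNN augmentation, not its exact variant.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)                  # exclude self-matches
    nbrs = np.argpartition(-sim, K, axis=1)[:, :K]  # top-K per row (unordered)
    return {(i, int(j)) for i in range(len(X)) for j in nbrs[i]}

# Toy usage: 6 nodes with 4-dimensional attributes.
rng = np.random.default_rng(0)
print(sorted(knn_augment(rng.normal(size=(6, 4)), K=2)))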

{"title":"A versatile framework for attributed network clustering via K-nearest neighbor augmentation","authors":"Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo","doi":"10.1007/s00778-024-00875-8","DOIUrl":"https://doi.org/10.1007/s00778-024-00875-8","url":null,"abstract":"<p>Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into <i>k</i> disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present <span>AHCKA</span> as an efficient approach to <i>attributed hypergraph clustering</i> (AHC). <span>AHCKA</span> includes a carefully-crafted <i>K</i>-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop <span>ANCKA</span> as a versatile attributed network clustering framework, capable of <i>attributed graph clustering</i>, <i>attributed multiplex graph clustering</i>, and AHC. Moreover, we devise <span>ANCKA-GPU</span> with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Discovering critical vertices for reinforcement of large-scale bipartite networks
Pub Date : 2024-08-24 DOI: 10.1007/s00778-024-00871-y
Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang

Bipartite networks model relationships between two types of vertices and are prevalent in real-world applications. The departure of vertices in a bipartite network reduces the connections of other vertices, triggering their departures as well. This may lead to a breakdown of the bipartite network and undermine any downstream applications. Such cascading vertex departure can be captured by the (α, β)-core, a cohesive subgraph model on bipartite networks that maintains the minimum engagement levels of vertices. Based on the (α, β)-core, we aim to ensure the vertices are highly engaged with the bipartite network from two perspectives. (1) From a pre-emptive perspective, we study the anchored (α, β)-core problem, which aims to maximize the size of the (α, β)-core by including some “anchor” vertices. (2) From a defensive perspective, we study the collapsed (α, β)-core problem, which aims to identify the critical vertices whose departure can lead to the largest shrink of the (α, β)-core. We prove the NP-hardness of these problems and resort to heuristic algorithms that choose the best anchor/collapser iteratively under a filter-verification framework. Filter-stage optimizations are proposed to reduce “dominated” candidates and allow computation-sharing. In the verification stage, we select multiple candidates for improved efficiency. Extensive experiments on 18 real-world datasets and a billion-scale synthetic dataset validate the effectiveness and efficiency of our proposed techniques.
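As background for the model used here, the sketch below computes the (α, β)-core of a bipartite graph by standard iterative peeling: repeatedly delete upper-side vertices with degree below α and lower-side vertices with degree below β. The anchored and collapsed problems studied in the paper are optimization problems defined on top of this computation; the code is a plain baseline, not the paper's filter-verification algorithm.

from collections import deque

def alpha_beta_core(upper_adj, lower_adj, alpha, beta):
    # Peel a bipartite graph down to its (alpha, beta)-core. upper_adj and
    # lower_adj map each vertex to the set of its neighbors on the other side.
    up = {u: set(vs) for u, vs in upper_adj.items()}
    lo = {v: set(us) for v, us in lower_adj.items()}
    queue = deque([('U', u) for u in up if len(up[u]) < alpha] +
                  [('L', v) for v in lo if len(lo[v]) < beta])
    while queue:
        side, x = queue.popleft()
        adj, other, thresh = (up, lo, beta) if side == 'U' else (lo, up, alpha)
        if x not in adj:
            continue                      # already deleted
        for y in adj.pop(x):              # delete x and update its neighbors
            other[y].discard(x)
            if len(other[y]) < thresh:
                queue.append(('L' if side == 'U' else 'U', y))
    return set(up), set(lo)

# Toy usage.
U = {1: {'a', 'b'}, 2: {'a', 'b', 'c'}, 3: {'c'}}
L = {'a': {1, 2}, 'b': {1, 2}, 'c': {2, 3}}
print(alpha_beta_core(U, L, alpha=2, beta=2))   # -> ({1, 2}, {'a', 'b'})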

{"title":"Discovering critical vertices for reinforcement of large-scale bipartite networks","authors":"Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang","doi":"10.1007/s00778-024-00871-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00871-y","url":null,"abstract":"<p>Bipartite networks model relationships between two types of vertices and are prevalent in real-world applications. The departure of vertices in a bipartite network reduces the connections of other vertices, triggering their departures as well. This may lead to a breakdown of the bipartite network and undermine any downstream applications. Such cascading vertex departure can be captured by <span>((alpha ,beta ))</span>-core, a cohesive subgraph model on bipartite networks that maintains the minimum engagement levels of vertices. Based on <span>((alpha ,beta ))</span>-core, we aim to ensure the vertices are highly engaged with the bipartite network from two perspectives. (1) From a pre-emptive perspective, we study the anchored <span>((alpha ,beta ))</span>-core problem, which aims to maximize the size of the <span>((alpha ,beta ))</span>-core by including some “anchor” vertices. (2) From a defensive perspective, we study the collapsed <span>((alpha ,beta ))</span>-core problem, which aims to identify the critical vertices whose departure can lead to the largest shrink of the <span>((alpha ,beta ))</span>-core. We prove the NP-hardness of these problems and resort to heuristic algorithms that choose the best anchor/collapser iteratively under a filter-verification framework. Filter-stage optimizations are proposed to reduce “dominated” candidates and allow computation-sharing. In the verification stage, we select multiple candidates for improved efficiency. Extensive experiments on 18 real-world datasets and a billion-scale synthetic dataset validate the effectiveness and efficiency of our proposed techniques.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search
Pub Date : 2024-08-21 DOI: 10.1007/s00778-024-00874-9
Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang

Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available. These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance. First, we observe the presence of a proximity-compactness trade-off related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index. Second, a skewed data distribution will negatively affect the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series. To fully leverage the potential of modern hardware including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms. An optimized approximate search algorithm, DumpyOS-F that prominently improves the search accuracy without violating the index, is also proposed. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability and search accuracy than its competitors. DumpyOS further improves on Dumpy, by delivering several times faster index building and querying, and DumpyOS-F improves the search accuracy of Dumpy-Fuzzy without the additional space cost of Dumpy-Fuzzy. This paper is an extension of the previously published SIGMOD paper [81].
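As background, iSAX-family indexes (which the Dumpy line of work builds on) rest on symbolizing each series with Piecewise Aggregate Approximation (PAA) followed by normal-distribution breakpoints. The Python sketch below shows that symbolization with the standard alphabet of size 4; it is a simplified illustration and not DumpyOS's adaptive multi-ary node-splitting scheme.

import numpy as np

# Standard-normal breakpoints for an alphabet of size 4 (quartiles of N(0, 1)).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])

def paa(series, segments):
    # Piecewise Aggregate Approximation: mean of each equal-length segment.
    return np.asarray(series, dtype=float).reshape(segments, -1).mean(axis=1)

def sax_word(series, segments=8):
    # Z-normalize, reduce with PAA, then map each segment mean to one of
    # four symbols (0..3). Series whose words collide land in the same
    # index node, which is what makes filtering cheap.
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    return np.searchsorted(BREAKPOINTS, paa(x, segments))

# Two differently shaped series get different words.
t = np.linspace(0, 2 * np.pi, 64)
print(sax_word(np.sin(t)), sax_word(np.cos(t)))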

{"title":"DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search","authors":"Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang","doi":"10.1007/s00778-024-00874-9","DOIUrl":"https://doi.org/10.1007/s00778-024-00874-9","url":null,"abstract":"<p>Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available. These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance. First, we observe the presence of a <i>proximity-compactness trade-off</i> related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index. Second, a skewed data distribution will negatively affect the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series. To fully leverage the potential of modern hardware including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms. An optimized approximate search algorithm, DumpyOS-F that prominently improves the search accuracy without violating the index, is also proposed. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability and search accuracy than its competitors. DumpyOS further improves on Dumpy, by delivering several times faster index building and querying, and DumpyOS-F improves the search accuracy of Dumpy-Fuzzy without the additional space cost of Dumpy-Fuzzy. This paper is an extension of the previously published SIGMOD paper [81].</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enabling space-time efficient range queries with REncoder
Pub Date : 2024-08-07 DOI: 10.1007/s00778-024-00873-w
Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui

A range filter is a data structure to answer range membership queries. Range queries are common in modern applications, and range filters have gained increasing attention for improving the performance of range queries by ruling out empty range queries. However, state-of-the-art range filters, such as SuRF and Rosetta, suffer from either a high false positive rate or low throughput. In this paper, we propose a novel range filter, called REncoder. It organizes all prefixes of keys into a segment tree, and locally encodes the segment tree into a Bloom filter to accelerate queries. REncoder supports diverse workloads by adaptively choosing how many levels of the segment tree to store. In addition, we propose a customized blacklist optimization for it to further improve the accuracy of multi-round queries. We theoretically prove that the error of REncoder is bounded and derive the asymptotic space complexity under the bounded error. We conduct extensive experiments on both synthetic datasets and real datasets. The experimental results show that REncoder outperforms all state-of-the-art range filters, and the proposed blacklist optimization can effectively further reduce the false positive rate.
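The prefix-encoding idea behind such filters can be illustrated as follows: store every prefix of every inserted key, and answer a range query by probing the dyadic blocks that exactly cover the range. In the sketch below a Python set stands in for the Bloom filter, and the dyadic decomposition is a generic illustrative choice rather than REncoder's actual query procedure.

class PrefixRangeFilter:
    # Sketch of a prefix-encoded range filter over b-bit integer keys. A real
    # filter (SuRF, Rosetta, REncoder) stores the prefixes in a Bloom-filter-like
    # structure instead of an exact set, trading false positives for space.

    def __init__(self, bits=16):
        self.bits = bits
        self.prefixes = set()            # stand-in for the Bloom filter

    def insert(self, key):
        for level in range(1, self.bits + 1):
            self.prefixes.add((level, key >> (self.bits - level)))

    def _dyadic_blocks(self, lo, hi):
        # Yield (level, prefix) blocks whose union is exactly [lo, hi].
        while lo <= hi:
            size = lo & -lo if lo else 1 << self.bits
            while size > hi - lo + 1:
                size //= 2
            level = self.bits - (size.bit_length() - 1)
            yield level, lo >> (self.bits - level)
            lo += size

    def may_contain(self, lo, hi):
        return any(block in self.prefixes for block in self._dyadic_blocks(lo, hi))

f = PrefixRangeFilter(bits=16)
for k in (100, 5000, 40000):
    f.insert(k)
print(f.may_contain(4990, 5010), f.may_contain(200, 300))   # True False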

{"title":"Enabling space-time efficient range queries with REncoder","authors":"Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui","doi":"10.1007/s00778-024-00873-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00873-w","url":null,"abstract":"<p>A range filter is a data structure to answer range membership queries. Range queries are common in modern applications, and range filters have gained rising attention for improving the performance of range queries by ruling out empty range queries. However, state-of-the-art range filters, such as SuRF and Rosetta, suffer either high false positive rate or low throughput. In this paper, we propose a novel range filter, called REncoder. It organizes all prefixes of keys into a segment tree, and locally encodes the segment tree into a Bloom filter to accelerate queries. REncoder supports diverse workloads by adaptively choosing how many levels of the segment tree to store. In addition, we also propose a customized blacklist optimization for it to further improve the accuracy of multi-round queries. We theoretically prove that the error of REncoder is bounded and derive the asymptotic space complexity under the bounded error. We conduct extensive experiments on both synthetic datasets and real datasets. The experimental results show that REncoder outperforms all state-of-the-art range filters, and the proposed blacklist optimization can effectively further reduce the false positive rate.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141948995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting
Pub Date : 2024-07-30 DOI: 10.1007/s00778-024-00872-x
Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen

Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. Recent deep learning based forecasting methods show strong capabilities at capturing both the temporal dynamics of time series and the spatial correlations among time series, thus achieving impressive accuracy. In particular, automated CTS forecasting, where a deep learning architecture is configured automatically, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS forecasting remains in its infancy, as existing proposals are only able to find optimal architectures for predefined hyperparameters and for specific datasets and forecasting settings (e.g., short vs. long term forecasting). These limitations hinder real-world industrial application, where forecasting faces diverse datasets and forecasting settings. We propose AutoCTS++, a zero-shot, joint search framework, to efficiently configure effective CTS forecasting models (including both neural architectures and hyperparameters), even when facing unseen datasets and forecasting settings. Specifically, we propose an architecture-hyperparameter joint search space by encoding candidate architectures and accompanying hyperparameters into a graph representation. We then introduce a zero-shot Task-aware Architecture-Hyperparameter Comparator (T-AHC) to rank architecture-hyperparameter pairs according to different tasks (i.e., datasets and forecasting settings). We propose zero-shot means to train T-AHC, enabling it to rank architecture-hyperparameter pairs given unseen datasets and forecasting settings. A final forecasting model is then selected from the top-ranked pairs. Extensive experiments involving multiple benchmark datasets and forecasting settings demonstrate that AutoCTS++ is able to efficiently devise forecasting models for unseen datasets and forecasting settings that are capable of outperforming existing manually designed and automated models.
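The zero-shot selection workflow can be pictured as: encode each candidate architecture-hyperparameter pair together with a task descriptor, score every pair with the trained comparator, and keep only the top-ranked pairs for final training and evaluation. The sketch below mimics that flow with made-up candidate fields and a random-projection stand-in for T-AHC; AutoCTS++ itself encodes candidates as graphs and learns the comparator, neither of which is reproduced here.

import numpy as np

rng = np.random.default_rng(0)

def encode(candidate, task):
    # Flatten an (architecture, hyperparameter) candidate plus a task
    # descriptor into one feature vector. Field names are invented for
    # illustration; AutoCTS++ uses a graph encoding instead.
    return np.array([candidate["layers"], candidate["hidden"], candidate["lr"],
                     task["horizon"], task["n_series"]], dtype=float)

# Stand-in for the trained T-AHC comparator: any function mapping the encoding
# to a score usable for ranking. Here, an untrained random projection.
W = rng.normal(size=5)
def score(candidate, task):
    return float(encode(candidate, task) @ W)

candidates = [{"layers": l, "hidden": h, "lr": lr}
              for l in (2, 4) for h in (32, 64) for lr in (1e-3, 1e-2)]
task = {"horizon": 24, "n_series": 300}     # an unseen forecasting task

# Zero-shot selection: rank all pairs for this task and keep the top few,
# then train/evaluate only those to pick the final forecasting model.
top = sorted(candidates, key=lambda c: score(c, task), reverse=True)[:3]
print(top)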

{"title":"AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting","authors":"Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen","doi":"10.1007/s00778-024-00872-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00872-x","url":null,"abstract":"<p>Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. Recent deep learning based forecasting methods show strong capabilities at capturing both the temporal dynamics of time series and the spatial correlations among time series, thus achieving impressive accuracy. In particular, automated CTS forecasting, where a deep learning architecture is configured automatically, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS forecasting remains in its infancy, as existing proposals are only able to find optimal architectures for predefined hyperparameters and for specific datasets and forecasting settings (e.g., short vs. long term forecasting). These limitations hinder real-world industrial application, where forecasting faces diverse datasets and forecasting settings. We propose AutoCTS++, a zero-shot, joint search framework, to efficiently configure effective CTS forecasting models (including both neural architectures and hyperparameters), even when facing unseen datasets and foreacsting settings. Specifically, we propose an architecture-hyperparameter joint search space by encoding candidate architecture and accompanying hyperparameters into a graph representation. We then introduce a zero-shot Task-aware Architecture-Hyperparameter Comparator (T-AHC) to rank architecture-hyperparameter pairs according to different tasks (i.e., datasets and forecasting settings). We propose zero-shot means to train T-AHC, enabling it to rank architecture-hyperparameter pairs given unseen datasets and forecasting settings. A final forecasting model is then selected from the top-ranked pairs. Extensive experiments involving multiple benchmark datasets and forecasting settings demonstrate that AutoCTS++ is able to efficiently devise forecasting models for unseen datasets and forecasting settings that are capable of outperforming existing manually designed and automated models.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
WavingSketch: an unbiased and generic sketch for finding top-k items in data streams
Pub Date : 2024-07-29 DOI: 10.1007/s00778-024-00869-6
Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang

Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is well acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch can provide unbiased estimation, and derive its error bound. WavingSketch is generic to measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch achieves 10× faster speed and 10³× smaller error on finding frequent items. For other applications, WavingSketch also achieves higher accuracy and faster speed. All related codes are open-sourced at GitHub (https://github.com/WavingSketch/Waving-Sketch).
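To make the task concrete, the sketch below implements classic Space-Saving, the deterministic ancestor of the Unbiased Space-Saving baseline named above: it keeps at most k counters and lets an unseen item evict and inherit the smallest one. It only illustrates the top-k frequent-items problem and is not WavingSketch itself.

def space_saving(stream, k):
    # Keep at most k counters. Known items are incremented; an unseen item
    # evicts the current minimum and inherits its count, so counts can
    # overestimate by at most the evicted minimum.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return sorted(counters.items(), key=lambda kv: -kv[1])

# Toy usage: the heavy items 'a', 'b', 'c' surface with large counts.
stream = list("aaaabbbccde") * 100 + list("xyz")
print(space_saving(stream, k=4))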

{"title":"WavingSketch: an unbiased and generic sketch for finding top-k items in data streams","authors":"Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang","doi":"10.1007/s00778-024-00869-6","DOIUrl":"https://doi.org/10.1007/s00778-024-00869-6","url":null,"abstract":"<p>Finding top-<i>k</i> items in data streams is a fundamental problem in data mining. Unbiased estimation is well acknowledged as an elegant and important property for top-<i>k</i> algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketchcan provide unbiased estimation, and derive its error bound. WavingSketchis generic to measurement tasks, and we apply it to five applications: finding top-<i>k</i> frequent items, finding top-<i>k</i> heavy changes, finding top-<i>k</i> persistent items, finding top-<i>k</i> Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketchachieves <span>(10 times )</span> faster speed and <span>(10^3 times )</span> smaller error on finding frequent items. For other applications, WavingSketchalso achieves higher accuracy and faster speed. All related codes are open-sourced at GitHub (https://github.com/WavingSketch/Waving-Sketch).</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FedST: secure federated shapelet transformation for time series classification
Pub Date : 2024-07-26 DOI: 10.1007/s00778-024-00865-w
Zhiyu Liang, Hongzhi Wang

This paper explores how to build a shapelet-based time series classification (TSC) model in the federated learning (FL) scenario, that is, using more data from multiple owners without actually sharing the data. We propose FedST, a novel federated TSC framework extended from a centralized shapelet transformation method. We recognize the federated shapelet search step as the kernel of FedST. Thus, we design a basic protocol for the FedST kernel that we prove to be secure and accurate. However, we identify that the basic protocol suffers from efficiency bottlenecks and the centralized acceleration techniques lose their efficacy due to the security issues. To speed up the federated protocol with security guarantee, we propose several optimizations tailored for the FL setting. Our theoretical analysis shows that the proposed methods are secure and more efficient. We conduct extensive experiments using both synthetic and real-world datasets. Empirical results show that our FedST solution is effective in terms of TSC accuracy, and the proposed optimizations can achieve three orders of magnitude of speedup.
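The primitive underlying any shapelet transformation is the distance from a shapelet to its best-matching window in a series; each series then becomes a vector of such distances on which an ordinary classifier is trained. The sketch below shows only this local computation; FedST's contribution, performing the shapelet search and transform securely across data owners, is not reflected here.

import numpy as np

def shapelet_distance(series, shapelet):
    # Minimum Euclidean distance between the shapelet and any same-length
    # sliding window of the series.
    m = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, m)
    return float(np.min(np.linalg.norm(windows - shapelet, axis=1)))

def shapelet_transform(dataset, shapelets):
    # Turn each series into a feature vector of its distances to every shapelet.
    return np.array([[shapelet_distance(s, sh) for sh in shapelets]
                     for s in dataset])

# Toy usage: two series distinguished by whether a bump appears.
bump = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
a = np.concatenate([np.zeros(10), bump, np.zeros(10)])
b = np.zeros(25)
print(shapelet_transform([a, b], [bump]))   # small distance for a, large for b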

{"title":"FedST: secure federated shapelet transformation for time series classification","authors":"Zhiyu Liang, Hongzhi Wang","doi":"10.1007/s00778-024-00865-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00865-w","url":null,"abstract":"<p>This paper explores how to build a shapelet-based time series classification (TSC) model in the federated learning (FL) scenario, that is, using more data from multiple owners without actually sharing the data. We propose FedST, a novel federated TSC framework extended from a centralized shapelet transformation method. We recognize the federated shapelet search step as the kernel of FedST. Thus, we design a basic protocol for the FedST kernel that we prove to be secure and accurate. However, we identify that the basic protocol suffers from efficiency bottlenecks and the centralized acceleration techniques lose their efficacy due to the security issues. To speed up the federated protocol with security guarantee, we propose several optimizations tailored for the FL setting. Our theoretical analysis shows that the proposed methods are secure and more efficient. We conduct extensive experiments using both synthetic and real-world datasets. Empirical results show that our FedST solution is effective in terms of TSC accuracy, and the proposed optimizations can achieve three orders of magnitude of speedup.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification
Pub Date : 2024-07-22 DOI: 10.1007/s00778-024-00870-z
Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang

Active learning for graph neural networks (GNNs) aims to select B nodes to label for the best possible GNN performance. Carefully selected labeled nodes can help improve GNN performance and hence motivate a line of research works. Unfortunately, existing methods still provide inferior GNN performance or cannot scale to large networks. Motivated by these limitations, in this paper, we present FICOM, an effective and scalable GNN active learning framework. Firstly, we formulate the node selection as an optimization problem where we consider the importance of a node from (i) the importance of a node during the feature propagation with a connection to the personalized PageRank (PPR), and (ii) the diversity a node brings in the embedding space generated by feature propagation. We show that the defined problem is submodular, and a greedy solution can provide a (1 - 1/e)-approximate solution. However, a standard greedy solution requires getting the node with the maximum marginal gain of the objective score in each iteration, which incurs a prohibitive running cost and cannot scale to large datasets. As our main contribution, we present FICOM, an efficient and scalable solution that provides a (1 - 1/e)-approximation guarantee and scales to graphs with millions of nodes on a single machine. The main idea is that we adaptively maintain the lower- and upper-bound of the marginal gain for each node v. In each iteration, we can first derive a small subset of candidate nodes and then compute the exact score for this subset of candidate nodes so that we can find the node with the maximum marginal gain efficiently. Extensive experiments on six benchmark datasets using four GNNs, including GCN, SGC, APPNP, and GCNII, show that our FICOM consistently outperforms existing active learning approaches on semi-supervised node classification tasks using different GNNs. Moreover, our solution can finish within 5 h on a million-node graph.
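The bound-based skipping can be illustrated with classic lazy (CELF-style) greedy selection for a monotone submodular objective: a stale marginal gain is a valid upper bound, so only the current top of a priority queue needs re-evaluation. The sketch below uses a generic coverage objective; FICOM maintains tighter per-node lower and upper bounds and a GNN-specific gain function, which this illustration does not capture.

import heapq

def lazy_greedy(candidates, gain, budget):
    # Lazy greedy: pop the candidate with the best (possibly stale) gain;
    # if its bound was computed in the current round it is truly the best,
    # otherwise recompute and push it back.
    selected = []
    heap = [(-gain(c, selected), 0, c) for c in candidates]  # (neg gain, round, item)
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg, rnd, c = heapq.heappop(heap)
        if rnd == len(selected):
            selected.append(c)
        else:
            heapq.heappush(heap, (-gain(c, selected), len(selected), c))
    return selected

# Toy usage: pick nodes covering the most labels (a submodular objective).
coverage = {1: {'a', 'b'}, 2: {'b', 'c'}, 3: {'c'}, 4: {'d'}}
def gain(node, chosen):
    covered = set().union(*(coverage[n] for n in chosen)) if chosen else set()
    return len(coverage[node] - covered)
print(lazy_greedy(list(coverage), gain, budget=2))   # -> [1, 2]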

{"title":"FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification","authors":"Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang","doi":"10.1007/s00778-024-00870-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00870-z","url":null,"abstract":"<p>Active learning for graph neural networks (GNNs) aims to select <i>B</i> nodes to label for the best possible GNN performance. Carefully selected labeled nodes can help improve GNN performance and hence motivates a line of research works. Unfortunately, existing methods still provide inferior GNN performance or cannot scale to large networks.Motivated by these limitations, in this paper, we present <i>FICOM</i>, an effective and scalable GNN active learning framework. Firstly, we formulate the node selection as an optimization problem where we consider the importance of a node from (i) the importance of a node during the feature propagation with a connection to the personalized PageRank (PPR), and (ii) the diversity of a node brings in the embedding space generated by feature propagation. We show that the defined problem is submodular, and a greedy solution can provide a <span>((1-1/e))</span>-approximate solution.However, a standard greedy solution requires getting the node with the maximum marginal gain of the objective score in each iteration, which incurs a prohibitive running cost and cannot scale to large datasets. As our main contribution, we present FICOM, an efficient and scalable solution that provides <span>((1-1/e))</span>-approximation guarantee and scales to graphs with millions of nodes on a single machine. The main idea is that we adaptively maintain the lower- and upper-bound of the marginal gain for each node <i>v</i>. In each iteration, we can first derive a small subset of candidate nodes and then compute the exact score for this subset of candidate nodes so that we can find the node with the maximum marginal gain efficiently. Extensive experiments on six benchmark datasets using four GNNs, including GCN, SGC, APPNP, and GCNII, show that our FICOM consistently outperforms existing active learning approaches on semi-supervised node classification tasks using different GNNs. Moreover, our solution can finish within 5 h on a million-node graph.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Correction to: “Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain”
Pub Date : 2024-07-18 DOI: 10.1007/s00778-024-00866-9
Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen
{"title":"Correction to: “Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain”","authors":"Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen","doi":"10.1007/s00778-024-00866-9","DOIUrl":"https://doi.org/10.1007/s00778-024-00866-9","url":null,"abstract":"","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":" 43","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141825444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Survey of vector database management systems
Pub Date : 2024-07-15 DOI: 10.1007/s00778-024-00864-x
James Jie Pan, Jianguo Wang, Guoliang Li

There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. Driving this shift from algorithms to systems are new data intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs, however there is no comprehensive survey to thoroughly review these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, large size of vectors, high cost of similarity comparison, lack of structural properties that can be used for indexing, and difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware accelerated query execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.
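One of the surveyed notions, the hybrid query, can be illustrated by a brute-force pre-filtering baseline: keep the rows whose attributes satisfy a predicate, then rank the survivors by vector similarity. The sketch below is such a reference implementation; a real VDBMS would replace both steps with an attribute index and a vector index (IVF, HNSW, and so on).

import numpy as np

def hybrid_query(vectors, attrs, predicate, q, k=3):
    # Pre-filtering hybrid search: filter by the attribute predicate, then
    # return the top-k rows by cosine similarity to the query vector q.
    mask = np.array([predicate(a) for a in attrs])
    idx = np.flatnonzero(mask)
    cand = vectors[idx]
    sims = (cand @ q) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:k]
    return [(int(idx[i]), float(sims[i])) for i in order]

# Toy usage: 1000 vectors with a scalar "price" attribute.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 64))
attrs = [{"price": float(p)} for p in rng.uniform(1, 100, size=1000)]
q = rng.normal(size=64)
print(hybrid_query(vectors, attrs, lambda a: a["price"] < 20, q, k=3))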

{"title":"Survey of vector database management systems","authors":"James Jie Pan, Jianguo Wang, Guoliang Li","doi":"10.1007/s00778-024-00864-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00864-x","url":null,"abstract":"<p>There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. Driving this shift from algorithms to systems are new data intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs, however there is no comprehensive survey to thoroughly review these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, large size of vectors, high cost of similarity comparison, lack of structural properties that can be used for indexing, and difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware accelerated query execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141719154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0