MM-DIRECT
Pub Date: 2024-03-27 | DOI: 10.1007/s00778-024-00846-z
Arlino Magalhaes, Angelo Brayner, Jose Maria Monteiro
Main memory database (MMDB) technology keeps the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, the volatility of RAM makes MMDBs much more sensitive to system failures: the contents of the database are lost, and the system may be unavailable for a long time until the recovery process finishes. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This paper presents MM-DIRECT (Main Memory Database Instant RECovery with Tuple consistent checkpoint), a recovery technique that enables MMDBs to schedule transactions concurrently with the database recovery process at system startup, giving the impression that the database is restored instantly. The approach implements a tuple-level consistent checkpoint to reduce recovery time. To validate the proposed approach, experiments were performed on a prototype implemented in the Redis database. The results show that the instant recovery technique sustains high transaction throughput both during the recovery process and during normal database processing.
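To make the idea of scheduling transactions while recovery is still running concrete, here is a minimal sketch of on-demand, per-tuple redo at restart. It is not the MM-DIRECT or Redis implementation; the RecoveringStore class, its log format, and its API are illustrative assumptions.

```python
# Illustrative sketch: on-demand, per-key redo during restart.
# Not the MM-DIRECT implementation; class, log format and methods are hypothetical.

class RecoveringStore:
    def __init__(self, checkpoint, redo_log):
        # checkpoint: dict key -> value taken at tuple-level consistency
        # redo_log: list of (key, value) entries written after the checkpoint
        self.data = dict(checkpoint)
        self.pending = {}                 # key -> log values not yet replayed
        for key, value in redo_log:
            self.pending.setdefault(key, []).append(value)

    def _replay(self, key):
        # Replay only this key's log suffix the first time it is touched.
        for value in self.pending.pop(key, []):
            self.data[key] = value

    def get(self, key):
        self._replay(key)                 # recover the tuple lazily
        return self.data.get(key)

    def put(self, key, value):
        self._replay(key)                 # older log entries must not overwrite new writes
        self.data[key] = value

# New transactions can run immediately after startup; a background task could
# keep calling _replay on the remaining keys to finish recovery eventually.
store = RecoveringStore(checkpoint={"a": 1}, redo_log=[("a", 2), ("b", 3)])
print(store.get("a"))   # 2 (log replayed on demand)
```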
{"title":"MM-DIRECT","authors":"Arlino Magalhaes, Angelo Brayner, Jose Maria Monteiro","doi":"10.1007/s00778-024-00846-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00846-z","url":null,"abstract":"<p>Main memory databases (MMDBs) technology handles the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, volatile memory makes MMDBs much more sensitive to system failures. The contents of the database are lost in these failures, and, as a result, systems may be unavailable for a long time until the database recovery process has been finished. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This paper presents <i>MM-DIRECT</i> (Main Memory Database Instant RECovery with Tuple consistent checkpoint), a recovery technique that enables MMDBs to schedule transactions simultaneously with the database recovery process at system startup. Thus, it gives the impression that the database is instantly restored. The approach implements a tuple-level consistent checkpoint to reduce the recovery time. To validate the proposed approach, experiments were performed in a prototype implemented on the Redis database. The results show that the instant recovery technique effectively provides high transaction throughput rates even during the recovery process and normal database processing.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"175 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How good are machine learning clouds? Benchmarking two snapshots over 5 years
Pub Date: 2024-03-15 | DOI: 10.1007/s00778-024-00842-3
Jiawei Jiang, Yi Wei, Yu Liu, Wentao Wu, Chuang Hu, Zhigao Zheng, Ziyi Zhang, Yingxia Shao, Ce Zhang
We conduct an empirical study of the machine learning functionalities provided by major cloud service providers, which we call machine learning clouds. Machine learning clouds hold the promise of hiding all the sophistication of running large-scale machine learning: instead of specifying how to run a machine learning task, users only specify what task to run, and the cloud figures out the rest. Raising the level of abstraction, however, rarely comes free; a performance penalty is possible. How good, then, are current machine learning clouds on real-world machine learning workloads? We study this question by benchmarking the mainstream machine learning clouds. Since these platforms continue to innovate, our benchmark tries to reflect their evolution. Concretely, this paper consists of two sub-benchmarks: mlbench and automlbench. When we first started this work in 2016, only two cloud platforms provided machine learning services, and they limited themselves to model training and simple hyper-parameter tuning. We therefore focus on binary classification problems and present mlbench, a novel benchmark constructed by harvesting datasets from Kaggle competitions. We then compare the performance of the top winning code available from Kaggle with that of the machine learning clouds from Azure and Amazon on mlbench. In recent years, more cloud providers have added machine learning services and included automatic machine learning (AutoML) techniques in their machine learning clouds. Their AutoML services ease manual tuning of the whole machine learning pipeline, including but not limited to data preprocessing, feature selection, model selection, hyper-parameter tuning, and model ensembling. To reflect these advancements, we design automlbench to assess the AutoML performance of four machine learning clouds on different kinds of workloads. Our comparative study reveals the strengths and weaknesses of existing machine learning clouds and points out potential directions for future improvement.
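As a rough illustration of the benchmarking protocol, the toy harness below scores several candidate models with an identical evaluation procedure (ROC AUC under cross-validation), the way a benchmark compares services against a strong baseline. It is not the mlbench or automlbench code; local scikit-learn models and a synthetic dataset stand in for cloud ML services and Kaggle data.

```python
# Toy benchmarking harness in the spirit of mlbench (not the authors' code).
# Local scikit-learn models stand in for cloud ML services and Kaggle baselines.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same protocol (ROC AUC, 5-fold CV).
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC AUC = {auc:.3f}")
```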
{"title":"How good are machine learning clouds? Benchmarking two snapshots over 5 years","authors":"Jiawei Jiang, Yi Wei, Yu Liu, Wentao Wu, Chuang Hu, Zhigao Zheng, Ziyi Zhang, Yingxia Shao, Ce Zhang","doi":"10.1007/s00778-024-00842-3","DOIUrl":"https://doi.org/10.1007/s00778-024-00842-3","url":null,"abstract":"<p>We conduct an empirical study of machine learning functionalities provided by major cloud service providers, which we call <i>machine learning clouds</i>. Machine learning clouds hold the promise of hiding all the sophistication of running large-scale machine learning: Instead of specifying <i>how</i> to run a machine learning task, users only specify <i>what</i> machine learning task to run and the cloud figures out the rest. Raising the level of abstraction, however, rarely comes free—a performance penalty is possible. <i>How good, then, are current machine learning clouds on real-world machine learning workloads?</i> We study this question by conducting benchmark on the mainstream machine learning clouds. Since these platforms continue to innovate, our benchmark tries to reflect their evolvement. Concretely, this paper consists of two sub-benchmarks—<span>mlbench</span> and <span>automlbench</span>. When we first started this work in 2016, only two cloud platforms provide machine learning services and limited themselves to model training and simple hyper-parameter tuning. We then focus on binary classification problems and present <span>mlbench</span>, a novel benchmark constructed by harvesting datasets from Kaggle competitions. We then compare the performance of the top winning code available from Kaggle with that of running machine learning clouds from both Azure and Amazon on <span>mlbench</span>. In the recent few years, more cloud providers support machine learning and include automatic machine learning (AutoML) techniques in their machine learning clouds. Their AutoML services can ease manual tuning on the whole machine learning pipeline, including but not limited to data preprocessing, feature selection, model selection, hyper-parameter, and model ensemble. To reflect these advancements, we design <span>automlbench</span> to assess the AutoML performance of four machine learning clouds using different kinds of workloads. Our comparative study reveals the strength and weakness of existing machine learning clouds and points out potential future directions for improvement.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain
Pub Date: 2024-02-28 | DOI: 10.1007/s00778-024-00839-y
Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen
Federated learning (FL) enables learning a model from data distributed across numerous workers while preserving data privacy. However, the classical FL technique is designed for Web2 applications where participants are trusted to produce correct computation results. Moreover, classical FL workers are assumed to voluntarily contribute their computational resources and have the same learning speed. Therefore, the classical FL technique is not applicable to Web3 applications, where participants are untrusted and self-interested players with potentially malicious behaviors and heterogeneous learning speeds. This paper proposes Refiner, a novel blockchain-powered decentralized FL system for Web3 applications. Refiner addresses the challenges introduced by Web3 participants by extending the classical FL technique with three interoperative extensions: (1) an incentive scheme for attracting self-interested participants, (2) a two-stage audit scheme for preventing malicious behavior, and (3) an incentive-aware semi-synchronous learning scheme for handling heterogeneous workers. We provide theoretical analyses of the security and efficiency of Refiner. Extensive experimental results on the CIFAR-10 and Shakespeare datasets confirm the effectiveness, security, and efficiency of Refiner.
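To show what the three extensions mean in practice, here is a minimal sketch of one semi-synchronous, audited, incentive-aware aggregation round. It is not Refiner's protocol; the deadline, the norm-bound audit rule, and the reward rule are invented for illustration.

```python
# Minimal sketch of one semi-synchronous, audited aggregation round.
# Not Refiner's protocol; deadline, audit and reward rules are illustrative.
import numpy as np

def aggregate_round(updates, deadline, norm_bound, reward_per_unit=1.0):
    """updates: list of dicts {worker, delta (np.ndarray), arrival, n_samples}."""
    accepted, rewards = [], {}
    for u in updates:
        if u["arrival"] > deadline:                    # slow workers miss this round
            continue
        if np.linalg.norm(u["delta"]) > norm_bound:    # crude audit: reject outlier updates
            continue
        accepted.append(u)
        rewards[u["worker"]] = reward_per_unit * u["n_samples"]   # incentive payout
    if not accepted:
        return None, rewards
    weights = np.array([u["n_samples"] for u in accepted], dtype=float)
    weights /= weights.sum()
    global_delta = sum(w * u["delta"] for w, u in zip(weights, accepted))
    return global_delta, rewards

updates = [
    {"worker": "w1", "delta": np.ones(4) * 0.1, "arrival": 0.8, "n_samples": 100},
    {"worker": "w2", "delta": np.ones(4) * 9.0, "arrival": 0.5, "n_samples": 50},   # fails audit
    {"worker": "w3", "delta": np.ones(4) * 0.2, "arrival": 2.0, "n_samples": 80},   # too late
]
print(aggregate_round(updates, deadline=1.0, norm_bound=1.0))
```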
{"title":"Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain","authors":"Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen","doi":"10.1007/s00778-024-00839-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00839-y","url":null,"abstract":"<p>Federated learning (FL) enables learning a model from data distributed across numerous workers while preserving data privacy. However, the classical FL technique is designed for Web2 applications where participants are trusted to produce correct computation results. Moreover, classical FL workers are assumed to voluntarily contribute their computational resources and have the same learning speed. Therefore, the classical FL technique is not applicable to Web3 applications, where participants are <i>untrusted</i> and <i>self-interested</i> players with <i>potentially malicious</i> behaviors and <i>heterogeneous</i> learning speeds. This paper proposes <i>Refiner</i>, a novel blockchain-powered decentralized FL system for Web3 applications. Refiner addresses the challenges introduced by Web3 participants by extending the classical FL technique with three interoperative extensions: (1) an incentive scheme for attracting self-interested participants, (2) a two-stage audit scheme for preventing malicious behavior, and (3) an incentive-aware semi-synchronous learning scheme for handling heterogeneous workers. We provide theoretical analyses of the security and efficiency of Refiner. Extensive experimental results on the CIFAR-10 and Shakespeare datasets confirm the effectiveness, security, and efficiency of Refiner.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140008737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ingress: an automated incremental graph processing system
Pub Date: 2024-02-20 | DOI: 10.1007/s00778-024-00838-z
Shufeng Gong, Chao Tian, Qiang Yin, Zhengdong Wang, Song Yu, Yanfeng Zhang, Wenyuan Yu, Liang Geng, Chong Fu, Ge Yu, Jingren Zhou
Graph data keep growing over time in real life. The ever-growing amount of dynamic graph data demands efficient techniques for incremental graph computation. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or to choose different memoization strategies for specific types of computation, limiting their usability and generality. In light of these challenges, we propose Ingress, an automated system for incremental graph processing. Ingress is able to deduce the incremental counterpart of a batch vertex-centric algorithm, without requiring redesigned logic or data structures from users. Underlying Ingress is an automated incrementalization framework equipped with four different memoization policies, supporting all kinds of vertex-centric computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies, and Ingress chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to its ease of use and generality, Ingress outperforms state-of-the-art incremental graph systems by 12.14× on average (up to 49.23×) in efficiency.
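The sketch below illustrates the general idea of incremental, vertex-centric recomputation on a memoized result: when edges are added, only vertices whose values can change are revisited. It uses single-source shortest paths as the example computation and is not Ingress itself; the function and variable names are illustrative.

```python
# Minimal sketch of memoized, incremental vertex-centric recomputation
# (single-source shortest paths); not the Ingress system.
import heapq

def incremental_sssp(adj, dist, new_edges):
    """adj: dict u -> list of (v, w); dist: memoized distances from the source.
    After adding new_edges, only vertices whose distance can improve are re-relaxed."""
    frontier = []
    for u, v, w in new_edges:
        adj.setdefault(u, []).append((v, w))
        if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
            dist[v] = dist[u] + w
            heapq.heappush(frontier, (dist[v], v))
    while frontier:                      # propagate changes, touching affected vertices only
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(frontier, (dist[v], v))
    return dist

adj = {"s": [("a", 1)], "a": [("b", 5)]}
dist = {"s": 0, "a": 1, "b": 6}          # memoized batch result
print(incremental_sssp(adj, dist, [("s", "b", 2)]))   # only b is revisited
```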
{"title":"Ingress: an automated incremental graph processing system","authors":"Shufeng Gong, Chao Tian, Qiang Yin, Zhengdong Wang, Song Yu, Yanfeng Zhang, Wenyuan Yu, Liang Geng, Chong Fu, Ge Yu, Jingren Zhou","doi":"10.1007/s00778-024-00838-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00838-z","url":null,"abstract":"<p>The graph data keep growing over time in real life. The ever-growing amount of dynamic graph data demands efficient techniques of incremental graph computation. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or choose different memoization strategies for certain specific types of computation, limiting the usability and generality. In light of these challenges, we propose <span>(textsf{Ingress})</span>, an automated system for <i><u>in</u></i><i>cremental</i> <i><u>gr</u></i><i>aph proc</i> <i><u>ess</u></i><i>ing</i>. <span>(textsf{Ingress})</span> is able to deduce the incremental counterpart of a batch vertex-centric algorithm, without the need of redesigned logic or data structures from users. Underlying <span>(textsf{Ingress})</span> is an automated incrementalization framework equipped with four different memoization policies, to support all kinds of vertex-centric computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies. <span>(textsf{Ingress})</span> chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to the ease-of-use and generalization, <span>(textsf{Ingress})</span> outperforms state-of-the-art incremental graph systems by <span>(12.14times )</span> on average (up to <span>(49.23times )</span>) in efficiency.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech-to-SQL: toward speech-driven SQL query generation from natural language question
Pub Date: 2024-02-16 | DOI: 10.1007/s00778-024-00837-0
Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao
Speech-based inputs have been gaining significant momentum with the popularity of smartphones and tablets in our daily lives, since voice is the most popular and efficient way of human–computer interaction. This paper works toward designing more effective speech-based interfaces to query the structured data in relational databases. We first identify a new task named Speech-to-SQL, which aims to understand the information conveyed by human speech and directly translate it into structured query language (SQL) statements. A naive solution to this problem can work in a cascaded manner, that is, an automatic speech recognition (ASR) component followed by a text-to-SQL component. However, this requires a high-quality ASR system and also suffers from the error-compounding problem between the two components, resulting in limited performance. To handle these challenges, we propose a novel end-to-end neural architecture named SpeechSQLNet to directly translate human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information present in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL from common natural language questions in spoken form, rather than from a natural-language-based version of SQL. To validate the effectiveness of the proposed problem and model, we further construct a dataset named SpeechQL by piggybacking on widely used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact-match accuracy. We expect Speech-to-SQL to inspire more research on effective and efficient human–machine interfaces that lower the barrier to using relational databases.
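For clarity, here is the shape of the cascaded baseline that the paper argues against: an ASR step whose transcript is handed to a text-to-SQL step, so any recognition error propagates into the generated query. Both functions are hypothetical placeholders, not real library calls or the authors' model.

```python
# Sketch of the cascaded baseline: ASR followed by text-to-SQL.
# Both components are hypothetical placeholders, not real library APIs.

def asr_transcribe(audio_path):
    # Placeholder for a speech recognizer; a real system returns a possibly noisy transcript.
    return "show the names of employees hired after 2020"

def text_to_sql(question, schema):
    # Placeholder for a text-to-SQL parser; transcript errors propagate into this step,
    # which is the compounding problem an end-to-end model avoids.
    return f"SELECT name FROM {schema['table']} WHERE hire_year > 2020"

schema = {"table": "employees"}
transcript = asr_transcribe("question.wav")
print(text_to_sql(transcript, schema))
```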
{"title":"Speech-to-SQL: toward speech-driven SQL query generation from natural language question","authors":"Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao","doi":"10.1007/s00778-024-00837-0","DOIUrl":"https://doi.org/10.1007/s00778-024-00837-0","url":null,"abstract":"<p>Speech-based inputs have been gaining significant momentum with the popularity of smartphones and tablets in our daily lives, since voice is the most popular and efficient way for human–computer interaction. This paper works toward designing more effective speech-based interfaces to query the structured data in relational databases. We first identify a new task named <i>Speech-to-SQL</i>, which aims to understand the information conveyed by human speech and directly translate it into structured query language (SQL) statements. A naive solution to this problem can work in a cascaded manner, that is, an automatic speech recognition component followed by a text-to-SQL component. However, it requires a high-quality ASR system and also suffers from the error compounding problem between the two components, resulting in limited performance. To handle these challenges, we propose a novel end-to-end neural architecture named <i>SpeechSQLNet</i> to directly translate human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information presented in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL based on common natural language questions in spoken form, rather than a natural language-based version of SQL. To validate the effectiveness of the proposed problem and model, we further construct a dataset named <i>SpeechQL</i>, by piggybacking the widely used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact match accuracies. We expect speech-to-SQL would inspire more research on more effective and efficient human–machine interfaces to lower the barrier of using relational databases.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new distributional treatment for time series anomaly detection
Pub Date: 2024-02-15 | DOI: 10.1007/s00778-023-00832-x
Kai Ming Ting, Zongyou Liu, Lei Gong, Hang Zhang, Ye Zhu
Time series are traditionally treated with two main approaches, the time domain approach and the frequency domain approach. These approaches must rely on a sliding window so that time-shifted versions of a sequence can be measured as similar. Coupled with the use of a point-to-point measure, existing methods often have quadratic time complexity. We offer a third approach: the $\mathbb{R}$ domain approach. It begins with an insight that sequences in a stationary time series can be treated as sets of independent and identically distributed (iid) points generated from an unknown distribution in $\mathbb{R}$. This $\mathbb{R}$ domain treatment enables two new possibilities: (a) the similarity between two sequences can be computed using a distributional measure such as the Wasserstein distance (WD), kernel mean embedding, or the isolation distributional kernel ($\mathcal{K}_I$); and (b) these distributional measures are not sliding-window-based. Together, they offer an alternative that has more effective similarity measurements and runs significantly faster than point-to-point and sliding-window-based measures. Our empirical evaluation shows that $\mathcal{K}_I$ is an effective and efficient distributional measure for time series, and that $\mathcal{K}_I$-based detectors have better detection accuracy than existing detectors in two tasks: (i) anomalous sequence detection in a stationary time series and (ii) anomalous time series detection in a dataset of non-stationary time series. The insight makes underutilized "old things new again", giving existing distributional measures and anomaly detectors a new life in time series anomaly detection that would otherwise be impossible.
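A small sketch of the $\mathbb{R}$ domain idea: each sequence is treated as a set of points and compared to the series' overall value distribution with a distributional measure, here SciPy's 1-D Wasserstein distance rather than the paper's $\mathcal{K}_I$ kernel, and without any sliding window. The data, split size, and scoring rule are invented for illustration.

```python
# Sketch of the R-domain treatment: compare value distributions of sequences
# directly (1-D Wasserstein distance) instead of point-to-point, sliding-window
# comparisons. Uses WD as a stand-in for the paper's isolation distributional kernel.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
series = rng.normal(0, 1, size=1000)
series[600:700] += 4                      # injected anomalous sequence

sequences = np.split(series, 10)          # non-overlapping sequences, no sliding window
reference = np.concatenate(sequences)     # empirical distribution of the whole series

scores = [wasserstein_distance(seq, reference) for seq in sequences]
print(np.argmax(scores))                  # 6: the anomalous sequence stands out
```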
{"title":"A new distributional treatment for time series anomaly detection","authors":"Kai Ming Ting, Zongyou Liu, Lei Gong, Hang Zhang, Ye Zhu","doi":"10.1007/s00778-023-00832-x","DOIUrl":"https://doi.org/10.1007/s00778-023-00832-x","url":null,"abstract":"<p>Time series is traditionally treated with two main approaches, i.e., the time domain approach and the frequency domain approach. These approaches must rely on a sliding window so that time-shift versions of a sequence can be measured to be similar. Coupled with the use of a root point-to-point measure, existing methods often have quadratic time complexity. We offer the third <span>(mathbb {R})</span> domain approach. It begins with an <i>insight</i> that sequences in a stationary time series can be treated as sets of independent and identically distributed (iid) points generated from an unknown distribution in <span>(mathbb {R})</span>. This <span>(mathbb {R})</span> domain treatment enables two new possibilities: (a) The similarity between two sequences can be computed using a distributional measure such as Wasserstein distance (WD), kernel mean embedding or isolation distributional kernel (<span>(mathcal {K}_I)</span>), and (b) these distributional measures become non-sliding-window-based. Together, they offer an alternative that has more effective similarity measurements and runs significantly faster than the point-to-point and sliding-window-based measures. Our empirical evaluation shows that <span>(mathcal {K}_I)</span> is an effective and efficient distributional measure for time series; and <span>(mathcal {K}_I)</span>-based detectors have better detection accuracy than existing detectors in two tasks: (i) anomalous sequence detection in a stationary time series and (ii) anomalous time series detection in a dataset of non-stationary time series. The <i>insight</i> makes underutilized “old things new again” which gives existing distributional measures and anomaly detectors a new life in time series anomaly detection that would otherwise be impossible.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A learning-based framework for spatial join processing: estimation, optimization and tuning
Pub Date: 2024-02-13 | DOI: 10.1007/s00778-024-00836-1
Tin Vu, Alberto Belussi, Sara Migliorini, Ahmed Eldawy
The importance and complexity of the spatial join operation have resulted in many join algorithms, some of which are tailored to big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization that can accommodate both the characteristics of the spatial datasets and the complexity of the different algorithms. The main challenge is developing portable cost models that, once trained, can be applied to any pair of input datasets, because they are able to capture the important input characteristics, such as data distribution and spatial partitioning, the logic of the spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently and that capture the intricate aspects of spatial join. It then uses these features to train five machine learning models that together identify the best spatial join algorithm. The first two are regression models that estimate two important measures of spatial join performance and act as the cost model. The third model chooses the best partitioning strategy to use with the spatial join. The fourth and fifth models further tune two important parameters, the number of partitions and the plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.
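The following sketch shows the general learning step such a framework relies on: dataset-pair features feed a regressor that estimates cost and a classifier that picks a join algorithm. It is not the authors' models or feature set; the features, targets, and algorithm labels are made up for illustration.

```python
# Sketch of the learning step (not the paper's models or features):
# dataset-pair features -> cost regressor + algorithm-selection classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical features per dataset pair: cardinality ratio, avg rectangle area, overlap ratio.
X = rng.random((500, 3))
runtime = 2.0 * X[:, 0] + 5.0 * X[:, 2] + rng.normal(0, 0.1, 500)   # synthetic cost target
best_algo = (X[:, 2] > 0.5).astype(int)    # 0 = partition-based join, 1 = broadcast join

cost_model = RandomForestRegressor(random_state=0).fit(X, runtime)
algo_model = RandomForestClassifier(random_state=0).fit(X, best_algo)

pair = np.array([[0.3, 0.7, 0.9]])
print("estimated cost:", cost_model.predict(pair)[0])
print("chosen algorithm:", ["partition-based", "broadcast"][algo_model.predict(pair)[0]])
```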
{"title":"A learning-based framework for spatial join processing: estimation, optimization and tuning","authors":"Tin Vu, Alberto Belussi, Sara Migliorini, Ahmed Eldawy","doi":"10.1007/s00778-024-00836-1","DOIUrl":"https://doi.org/10.1007/s00778-024-00836-1","url":null,"abstract":"<p>The importance and complexity of spatial join operation resulted in the availability of many join algorithms, some of which are tailored for big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization which can accommodate both the characteristics of spatial datasets and the complexity of the different algorithms. The main challenge is how to develop portable cost models that once trained can be applied to any pair of input datasets, because they are able to extract the important input characteristics, such as data distribution and spatial partitioning, the logic of spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently for the data to catch the intricate aspects of spatial join. Then, it uses these features to train five machine learning models that are used to identify the best spatial join algorithm. The first two are regression models that estimate two important measures of the spatial join performance and they act as the cost model. The third model chooses the best partitioning strategy to use with spatial join. The fourth and fifth models further tune two important parameters, number of partitions and plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"166 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assisted design of data science pipelines
Pub Date: 2024-02-13 | DOI: 10.1007/s00778-024-00835-2
Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl
When designing data science (DS) pipelines, end-users can get overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to support end-users by (semi-)automating the process. However, they are expensive to compute and have limited applicability across real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to find the optimal one; (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the assisted design of data science pipelines using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates, which the user can choose to execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines using efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves predictive performance similar to state-of-the-art long-running AutoML solutions while being real-time, generic with respect to evaluation processes and DS tasks, and extensible to new operators.
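A compact sketch of the two recommendation steps described above: retrieve pipelines that were executed on similar datasets via meta-feature similarity, then rank the candidates by Pareto dominance on two objectives. It is not the authors' system; the meta-features, pipeline corpus, and numbers are invented.

```python
# Sketch of similarity-based retrieval + multi-objective ranking of past pipelines.
# Not the paper's system; meta-features and the corpus are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Corpus of previously executed pipelines: dataset meta-features + observed outcomes.
meta_features = np.array([[1000, 10, 0.2], [50000, 300, 0.5], [1200, 12, 0.25]])
pipelines = [
    {"name": "impute->scale->logreg", "accuracy": 0.81, "runtime_s": 3.0},
    {"name": "impute->embed->gbdt",   "accuracy": 0.88, "runtime_s": 40.0},
    {"name": "scale->svm",            "accuracy": 0.79, "runtime_s": 2.0},
]

query = np.array([[1100, 11, 0.22]])      # meta-features of the user's dataset
nn = NearestNeighbors(n_neighbors=2).fit(meta_features)
_, idx = nn.kneighbors(query)
candidates = [pipelines[i] for i in idx[0]]

def dominated(p, q):
    # q dominates p: at least as good on both objectives and not identical
    return q["accuracy"] >= p["accuracy"] and q["runtime_s"] <= p["runtime_s"] and q != p

pareto = [p for p in candidates if not any(dominated(p, q) for q in candidates)]
print([p["name"] for p in pareto])        # ranked shortlist shown to the user
```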
{"title":"Assisted design of data science pipelines","authors":"Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl","doi":"10.1007/s00778-024-00835-2","DOIUrl":"https://doi.org/10.1007/s00778-024-00835-2","url":null,"abstract":"<p>When designing data science (DS) pipelines, end-users can get overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to facilitate end-users by (semi-)automating the process. However, they are expensive to compute and yield limited applicability for a wide range of real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to get the optimal one, (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the <i>assisted</i> <i>design</i> <i>of</i> <i>data</i> <i>science</i> <i>pipelines</i> using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates from which the user can choose to execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines utilizing efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves similar predictive performance with state-of-the-art long-running AutoML solutions while being real-time, generic to any evaluation processes and DS tasks, and extensible to new operators.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time series data encoding in Apache IoTDB: comparative analysis and recommendation
Pub Date: 2024-02-12 | DOI: 10.1007/s00778-024-00840-5
Tianrui Xia, Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang, Jianmin Wang
Both the vast range of applications and the distinct features of time series data have stimulated the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Given the distinct features of various time series, different encoding strategies may perform differently. In this study, we first summarize the features of time series data that may affect encoding performance and introduce the latest feature extraction results for these features. We then introduce the storage scheme of a typical time series database, Apache IoTDB, which prescribes the limits on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator and several real-world datasets, and present an extensive experimental evaluation. Notably, a quantitative analysis of encoding effectiveness with respect to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series according to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.
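The toy example below illustrates why data features drive encoding choice: regularly sampled timestamps have constant deltas, so a delta-style scheme shrinks them to near-constant small values, while an irregular series benefits far less. It is a generic delta codec for illustration, not IoTDB's encoder implementations.

```python
# Toy illustration of how a data feature (regular sampling) favors delta encoding.
# Generic codec, not Apache IoTDB's encoder.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

regular = [1000, 1100, 1200, 1300, 1400]        # fixed sampling interval
irregular = [1000, 1130, 1205, 1490, 1502]
print(delta_encode(regular))     # [1000, 100, 100, 100, 100] -> highly compressible
print(delta_encode(irregular))   # deltas stay scattered -> weaker compression
assert delta_decode(delta_encode(regular)) == regular
```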
{"title":"Time series data encoding in Apache IoTDB: comparative analysis and recommendation","authors":"Tianrui Xia, Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang, Jianmin Wang","doi":"10.1007/s00778-024-00840-5","DOIUrl":"https://doi.org/10.1007/s00778-024-00840-5","url":null,"abstract":"<p>Not only the vast applications but also the distinct features of time series data stimulate the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, OpenTSDB and so on. Almost all these systems employ columnar storage, with effective encoding of time series data. Given the distinct features of various time series data, different encoding strategies may perform variously. In this study, we first summarize the features of time series data that may affect encoding performance. We also introduce the latest feature extraction results in these features. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, prescribing the limits to implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator and several real-world datasets. Also, we present an extensive experimental evaluation. Remarkably, a quantitative analysis of encoding effectiveness regarding to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series referring to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-trajectory clustering with deep reinforcement learning
Pub Date: 2024-01-25 | DOI: 10.1007/s00778-023-00833-w
Sub-trajectory clustering is a fundamental problem in many trajectory applications. Existing approaches usually divide the clustering procedure into two phases: segmenting trajectories into sub-trajectories and then clustering these sub-trajectories. However, researchers need to develop complex hand-crafted segmentation rules for specific applications, making the clustering results sensitive to the segmentation rules and lacking in generality. To solve this problem, we propose a novel algorithm, based on reinforcement learning (RL), that uses the clustering results to guide the segmentation. The novelty is that the segmentation and clustering components cooperate closely and improve each other continuously to yield better clustering results. To devise our RL-based algorithm, we model the procedure of trajectory segmentation as a Markov decision process (MDP). We apply Deep Q-Network (DQN) learning to train an RL model for the segmentation and achieve excellent clustering results. Experimental results on real datasets demonstrate the superior performance of the proposed RL-based approach over state-of-the-art methods.
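To make the MDP framing tangible, here is a structural sketch of a segmentation environment: the state is the current sub-trajectory, the actions are EXTEND or CUT, and the reward is a stand-in clustering-quality signal. It is not the paper's agent or reward; the environment class and the variance-based reward are invented, and a trained DQN would replace the random policy shown.

```python
# Structural sketch of the segmentation MDP (not the paper's DQN agent).
# Reward here is an invented proxy: compact segments score higher.
import numpy as np

EXTEND, CUT = 0, 1

class SegmentationEnv:
    def __init__(self, trajectory):
        self.traj = trajectory            # array of (x, y) points
        self.reset()

    def reset(self):
        self.start, self.pos, self.segments = 0, 1, []
        return self.traj[self.start:self.pos]

    def step(self, action):
        if action == CUT:
            segment = self.traj[self.start:self.pos]
            reward = -float(np.var(segment))   # proxy clustering-quality signal
            self.segments.append(segment)
            self.start = self.pos
        else:
            reward = 0.0
        self.pos += 1
        done = self.pos > len(self.traj)
        return self.traj[self.start:self.pos], reward, done

rng = np.random.default_rng(0)
env = SegmentationEnv(rng.random((20, 2)))
state, done = env.reset(), False
while not done:                           # a trained DQN would replace this random policy
    state, reward, done = env.step(int(rng.integers(2)))
print(len(env.segments), "segments produced")
```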
{"title":"Sub-trajectory clustering with deep reinforcement learning","authors":"","doi":"10.1007/s00778-023-00833-w","DOIUrl":"https://doi.org/10.1007/s00778-023-00833-w","url":null,"abstract":"<h3>Abstract</h3> <p>Sub-trajectory clustering is a fundamental problem in many trajectory applications. Existing approaches usually divide the clustering procedure into two phases: segmenting trajectories into sub-trajectories and then clustering these sub-trajectories. However, researchers need to develop complex human-crafted segmentation rules for specific applications, making the clustering results sensitive to the segmentation rules and lacking in generality. To solve this problem, we propose a novel algorithm using the clustering results to guide the segmentation, which is based on reinforcement learning (RL). The novelty is that the segmentation and clustering components cooperate closely and improve each other continuously to yield better clustering results. To devise our RL-based algorithm, we model the procedure of trajectory segmentation as a Markov decision process (MDP). We apply Deep-Q-Network (DQN) learning to train an RL model for the segmentation and achieve excellent clustering results. Experimental results on real datasets demonstrate the superior performance of the proposed RL-based approach over state-of-the-art methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139561200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}