MM-DIRECT
Pub Date: 2024-03-27 | DOI: 10.1007/s00778-024-00846-z
Arlino Magalhaes, Angelo Brayner, Jose Maria Monteiro
Main memory database (MMDB) technology keeps the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, the volatility of RAM makes MMDBs much more sensitive to system failures: the contents of the database are lost, and the system may be unavailable for a long time until the recovery process finishes. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This paper presents MM-DIRECT (Main Memory Database Instant RECovery with Tuple consistent checkpoint), a recovery technique that enables MMDBs to schedule transactions concurrently with the database recovery process at system startup, giving the impression that the database is restored instantly. The approach implements a tuple-level consistent checkpoint to reduce recovery time. To validate the proposed approach, experiments were performed on a prototype implemented in the Redis database. The results show that the instant recovery technique sustains high transaction throughput both during the recovery process and during normal database processing.
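To make the idea of scheduling transactions while recovery is still running concrete, here is a minimal sketch of on-demand, per-tuple redo at restart. It is not the MM-DIRECT or Redis implementation; the RecoveringStore class, its log format, and its API are illustrative assumptions.

```python
# Illustrative sketch: on-demand, per-key redo during restart.
# Not the MM-DIRECT implementation; class, log format and methods are hypothetical.

class RecoveringStore:
    def __init__(self, checkpoint, redo_log):
        # checkpoint: dict key -> value taken at tuple-level consistency
        # redo_log: list of (key, value) entries written after the checkpoint
        self.data = dict(checkpoint)
        self.pending = {}                 # key -> log values not yet replayed
        for key, value in redo_log:
            self.pending.setdefault(key, []).append(value)

    def _replay(self, key):
        # Replay only this key's log suffix the first time it is touched.
        for value in self.pending.pop(key, []):
            self.data[key] = value

    def get(self, key):
        self._replay(key)                 # recover the tuple lazily
        return self.data.get(key)

    def put(self, key, value):
        self._replay(key)                 # older log entries must not overwrite new writes
        self.data[key] = value

# New transactions can run immediately after startup; a background task could
# keep calling _replay on the remaining keys to finish recovery eventually.
store = RecoveringStore(checkpoint={"a": 1}, redo_log=[("a", 2), ("b", 3)])
print(store.get("a"))   # 2 (log replayed on demand)
```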
{"title":"MM-DIRECT","authors":"Arlino Magalhaes, Angelo Brayner, Jose Maria Monteiro","doi":"10.1007/s00778-024-00846-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00846-z","url":null,"abstract":"<p>Main memory databases (MMDBs) technology handles the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, volatile memory makes MMDBs much more sensitive to system failures. The contents of the database are lost in these failures, and, as a result, systems may be unavailable for a long time until the database recovery process has been finished. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This paper presents <i>MM-DIRECT</i> (Main Memory Database Instant RECovery with Tuple consistent checkpoint), a recovery technique that enables MMDBs to schedule transactions simultaneously with the database recovery process at system startup. Thus, it gives the impression that the database is instantly restored. The approach implements a tuple-level consistent checkpoint to reduce the recovery time. To validate the proposed approach, experiments were performed in a prototype implemented on the Redis database. The results show that the instant recovery technique effectively provides high transaction throughput rates even during the recovery process and normal database processing.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"175 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How good are machine learning clouds? Benchmarking two snapshots over 5 years
Pub Date: 2024-03-15 | DOI: 10.1007/s00778-024-00842-3
Jiawei Jiang, Yi Wei, Yu Liu, Wentao Wu, Chuang Hu, Zhigao Zheng, Ziyi Zhang, Yingxia Shao, Ce Zhang
We conduct an empirical study of the machine learning functionalities provided by major cloud service providers, which we call machine learning clouds. Machine learning clouds hold the promise of hiding all the sophistication of running large-scale machine learning: instead of specifying how to run a machine learning task, users only specify what task to run, and the cloud figures out the rest. Raising the level of abstraction, however, rarely comes free; a performance penalty is possible. How good, then, are current machine learning clouds on real-world machine learning workloads? We study this question by benchmarking the mainstream machine learning clouds. Since these platforms continue to innovate, our benchmark tries to reflect their evolution. Concretely, this paper consists of two sub-benchmarks: mlbench and automlbench. When we first started this work in 2016, only two cloud platforms provided machine learning services, and they limited themselves to model training and simple hyper-parameter tuning. We therefore focus on binary classification problems and present mlbench, a novel benchmark constructed by harvesting datasets from Kaggle competitions. We then compare the performance of the top winning code available from Kaggle with that of the machine learning clouds from Azure and Amazon on mlbench. In recent years, more cloud providers have added machine learning services and included automatic machine learning (AutoML) techniques in their machine learning clouds. Their AutoML services ease manual tuning of the whole machine learning pipeline, including but not limited to data preprocessing, feature selection, model selection, hyper-parameter tuning, and model ensembling. To reflect these advancements, we design automlbench to assess the AutoML performance of four machine learning clouds on different kinds of workloads. Our comparative study reveals the strengths and weaknesses of existing machine learning clouds and points out potential directions for future improvement.
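As a rough illustration of the benchmarking protocol, the toy harness below scores several candidate models with an identical evaluation procedure (ROC AUC under cross-validation), the way a benchmark compares services against a strong baseline. It is not the mlbench or automlbench code; local scikit-learn models and a synthetic dataset stand in for cloud ML services and Kaggle data.

```python
# Toy benchmarking harness in the spirit of mlbench (not the authors' code).
# Local scikit-learn models stand in for cloud ML services and Kaggle baselines.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same protocol (ROC AUC, 5-fold CV).
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC AUC = {auc:.3f}")
```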
{"title":"How good are machine learning clouds? Benchmarking two snapshots over 5 years","authors":"Jiawei Jiang, Yi Wei, Yu Liu, Wentao Wu, Chuang Hu, Zhigao Zheng, Ziyi Zhang, Yingxia Shao, Ce Zhang","doi":"10.1007/s00778-024-00842-3","DOIUrl":"https://doi.org/10.1007/s00778-024-00842-3","url":null,"abstract":"<p>We conduct an empirical study of machine learning functionalities provided by major cloud service providers, which we call <i>machine learning clouds</i>. Machine learning clouds hold the promise of hiding all the sophistication of running large-scale machine learning: Instead of specifying <i>how</i> to run a machine learning task, users only specify <i>what</i> machine learning task to run and the cloud figures out the rest. Raising the level of abstraction, however, rarely comes free—a performance penalty is possible. <i>How good, then, are current machine learning clouds on real-world machine learning workloads?</i> We study this question by conducting benchmark on the mainstream machine learning clouds. Since these platforms continue to innovate, our benchmark tries to reflect their evolvement. Concretely, this paper consists of two sub-benchmarks—<span>mlbench</span> and <span>automlbench</span>. When we first started this work in 2016, only two cloud platforms provide machine learning services and limited themselves to model training and simple hyper-parameter tuning. We then focus on binary classification problems and present <span>mlbench</span>, a novel benchmark constructed by harvesting datasets from Kaggle competitions. We then compare the performance of the top winning code available from Kaggle with that of running machine learning clouds from both Azure and Amazon on <span>mlbench</span>. In the recent few years, more cloud providers support machine learning and include automatic machine learning (AutoML) techniques in their machine learning clouds. Their AutoML services can ease manual tuning on the whole machine learning pipeline, including but not limited to data preprocessing, feature selection, model selection, hyper-parameter, and model ensemble. To reflect these advancements, we design <span>automlbench</span> to assess the AutoML performance of four machine learning clouds using different kinds of workloads. Our comparative study reveals the strength and weakness of existing machine learning clouds and points out potential future directions for improvement.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain
Pub Date: 2024-02-28 | DOI: 10.1007/s00778-024-00839-y
Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen
Federated learning (FL) enables learning a model from data distributed across numerous workers while preserving data privacy. However, the classical FL technique is designed for Web2 applications where participants are trusted to produce correct computation results. Moreover, classical FL workers are assumed to voluntarily contribute their computational resources and have the same learning speed. Therefore, the classical FL technique is not applicable to Web3 applications, where participants are untrusted and self-interested players with potentially malicious behaviors and heterogeneous learning speeds. This paper proposes Refiner, a novel blockchain-powered decentralized FL system for Web3 applications. Refiner addresses the challenges introduced by Web3 participants by extending the classical FL technique with three interoperative extensions: (1) an incentive scheme for attracting self-interested participants, (2) a two-stage audit scheme for preventing malicious behavior, and (3) an incentive-aware semi-synchronous learning scheme for handling heterogeneous workers. We provide theoretical analyses of the security and efficiency of Refiner. Extensive experimental results on the CIFAR-10 and Shakespeare datasets confirm the effectiveness, security, and efficiency of Refiner.
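To show what the three extensions mean in practice, here is a minimal sketch of one semi-synchronous, audited, incentive-aware aggregation round. It is not Refiner's protocol; the deadline, the norm-bound audit rule, and the reward rule are invented for illustration.

```python
# Minimal sketch of one semi-synchronous, audited aggregation round.
# Not Refiner's protocol; deadline, audit and reward rules are illustrative.
import numpy as np

def aggregate_round(updates, deadline, norm_bound, reward_per_unit=1.0):
    """updates: list of dicts {worker, delta (np.ndarray), arrival, n_samples}."""
    accepted, rewards = [], {}
    for u in updates:
        if u["arrival"] > deadline:                    # slow workers miss this round
            continue
        if np.linalg.norm(u["delta"]) > norm_bound:    # crude audit: reject outlier updates
            continue
        accepted.append(u)
        rewards[u["worker"]] = reward_per_unit * u["n_samples"]   # incentive payout
    if not accepted:
        return None, rewards
    weights = np.array([u["n_samples"] for u in accepted], dtype=float)
    weights /= weights.sum()
    global_delta = sum(w * u["delta"] for w, u in zip(weights, accepted))
    return global_delta, rewards

updates = [
    {"worker": "w1", "delta": np.ones(4) * 0.1, "arrival": 0.8, "n_samples": 100},
    {"worker": "w2", "delta": np.ones(4) * 9.0, "arrival": 0.5, "n_samples": 50},   # fails audit
    {"worker": "w3", "delta": np.ones(4) * 0.2, "arrival": 2.0, "n_samples": 80},   # too late
]
print(aggregate_round(updates, deadline=1.0, norm_bound=1.0))
```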
{"title":"Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain","authors":"Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen","doi":"10.1007/s00778-024-00839-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00839-y","url":null,"abstract":"<p>Federated learning (FL) enables learning a model from data distributed across numerous workers while preserving data privacy. However, the classical FL technique is designed for Web2 applications where participants are trusted to produce correct computation results. Moreover, classical FL workers are assumed to voluntarily contribute their computational resources and have the same learning speed. Therefore, the classical FL technique is not applicable to Web3 applications, where participants are <i>untrusted</i> and <i>self-interested</i> players with <i>potentially malicious</i> behaviors and <i>heterogeneous</i> learning speeds. This paper proposes <i>Refiner</i>, a novel blockchain-powered decentralized FL system for Web3 applications. Refiner addresses the challenges introduced by Web3 participants by extending the classical FL technique with three interoperative extensions: (1) an incentive scheme for attracting self-interested participants, (2) a two-stage audit scheme for preventing malicious behavior, and (3) an incentive-aware semi-synchronous learning scheme for handling heterogeneous workers. We provide theoretical analyses of the security and efficiency of Refiner. Extensive experimental results on the CIFAR-10 and Shakespeare datasets confirm the effectiveness, security, and efficiency of Refiner.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140008737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ingress: an automated incremental graph processing system
Pub Date: 2024-02-20 | DOI: 10.1007/s00778-024-00838-z
Shufeng Gong, Chao Tian, Qiang Yin, Zhengdong Wang, Song Yu, Yanfeng Zhang, Wenyuan Yu, Liang Geng, Chong Fu, Ge Yu, Jingren Zhou
Graph data keep growing over time in real life. The ever-growing amount of dynamic graph data demands efficient techniques for incremental graph computation. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or to choose different memoization strategies for specific types of computation, limiting their usability and generality. In light of these challenges, we propose Ingress, an automated system for incremental graph processing. Ingress is able to deduce the incremental counterpart of a batch vertex-centric algorithm, without requiring redesigned logic or data structures from users. Underlying Ingress is an automated incrementalization framework equipped with four different memoization policies, supporting all kinds of vertex-centric computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies, and Ingress chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to its ease of use and generality, Ingress outperforms state-of-the-art incremental graph systems by 12.14× on average (up to 49.23×) in efficiency.
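The sketch below illustrates the general idea of incremental, vertex-centric recomputation on a memoized result: when edges are added, only vertices whose values can change are revisited. It uses single-source shortest paths as the example computation and is not Ingress itself; the function and variable names are illustrative.

```python
# Minimal sketch of memoized, incremental vertex-centric recomputation
# (single-source shortest paths); not the Ingress system.
import heapq

def incremental_sssp(adj, dist, new_edges):
    """adj: dict u -> list of (v, w); dist: memoized distances from the source.
    After adding new_edges, only vertices whose distance can improve are re-relaxed."""
    frontier = []
    for u, v, w in new_edges:
        adj.setdefault(u, []).append((v, w))
        if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
            dist[v] = dist[u] + w
            heapq.heappush(frontier, (dist[v], v))
    while frontier:                      # propagate changes, touching affected vertices only
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(frontier, (dist[v], v))
    return dist

adj = {"s": [("a", 1)], "a": [("b", 5)]}
dist = {"s": 0, "a": 1, "b": 6}          # memoized batch result
print(incremental_sssp(adj, dist, [("s", "b", 2)]))   # only b is revisited
```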
{"title":"Ingress: an automated incremental graph processing system","authors":"Shufeng Gong, Chao Tian, Qiang Yin, Zhengdong Wang, Song Yu, Yanfeng Zhang, Wenyuan Yu, Liang Geng, Chong Fu, Ge Yu, Jingren Zhou","doi":"10.1007/s00778-024-00838-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00838-z","url":null,"abstract":"<p>The graph data keep growing over time in real life. The ever-growing amount of dynamic graph data demands efficient techniques of incremental graph computation. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or choose different memoization strategies for certain specific types of computation, limiting the usability and generality. In light of these challenges, we propose <span>(textsf{Ingress})</span>, an automated system for <i><u>in</u></i><i>cremental</i> <i><u>gr</u></i><i>aph proc</i> <i><u>ess</u></i><i>ing</i>. <span>(textsf{Ingress})</span> is able to deduce the incremental counterpart of a batch vertex-centric algorithm, without the need of redesigned logic or data structures from users. Underlying <span>(textsf{Ingress})</span> is an automated incrementalization framework equipped with four different memoization policies, to support all kinds of vertex-centric computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies. <span>(textsf{Ingress})</span> chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to the ease-of-use and generalization, <span>(textsf{Ingress})</span> outperforms state-of-the-art incremental graph systems by <span>(12.14times )</span> on average (up to <span>(49.23times )</span>) in efficiency.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech-to-SQL: toward speech-driven SQL query generation from natural language question
Pub Date: 2024-02-16 | DOI: 10.1007/s00778-024-00837-0
Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao
Speech-based inputs have been gaining significant momentum with the popularity of smartphones and tablets in our daily lives, since voice is the most popular and efficient way of human–computer interaction. This paper works toward designing more effective speech-based interfaces to query the structured data in relational databases. We first identify a new task named Speech-to-SQL, which aims to understand the information conveyed by human speech and directly translate it into structured query language (SQL) statements. A naive solution to this problem can work in a cascaded manner, that is, an automatic speech recognition (ASR) component followed by a text-to-SQL component. However, this requires a high-quality ASR system and also suffers from the error-compounding problem between the two components, resulting in limited performance. To handle these challenges, we propose a novel end-to-end neural architecture named SpeechSQLNet to directly translate human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information present in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL from common natural language questions in spoken form, rather than from a natural-language-based version of SQL. To validate the effectiveness of the proposed problem and model, we further construct a dataset named SpeechQL by piggybacking on widely used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact-match accuracy. We expect Speech-to-SQL to inspire more research on effective and efficient human–machine interfaces that lower the barrier to using relational databases.
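For clarity, here is the shape of the cascaded baseline that the paper argues against: an ASR step whose transcript is handed to a text-to-SQL step, so any recognition error propagates into the generated query. Both functions are hypothetical placeholders, not real library calls or the authors' model.

```python
# Sketch of the cascaded baseline: ASR followed by text-to-SQL.
# Both components are hypothetical placeholders, not real library APIs.

def asr_transcribe(audio_path):
    # Placeholder for a speech recognizer; a real system returns a possibly noisy transcript.
    return "show the names of employees hired after 2020"

def text_to_sql(question, schema):
    # Placeholder for a text-to-SQL parser; transcript errors propagate into this step,
    # which is the compounding problem an end-to-end model avoids.
    return f"SELECT name FROM {schema['table']} WHERE hire_year > 2020"

schema = {"table": "employees"}
transcript = asr_transcribe("question.wav")
print(text_to_sql(transcript, schema))
```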
{"title":"Speech-to-SQL: toward speech-driven SQL query generation from natural language question","authors":"Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao","doi":"10.1007/s00778-024-00837-0","DOIUrl":"https://doi.org/10.1007/s00778-024-00837-0","url":null,"abstract":"<p>Speech-based inputs have been gaining significant momentum with the popularity of smartphones and tablets in our daily lives, since voice is the most popular and efficient way for human–computer interaction. This paper works toward designing more effective speech-based interfaces to query the structured data in relational databases. We first identify a new task named <i>Speech-to-SQL</i>, which aims to understand the information conveyed by human speech and directly translate it into structured query language (SQL) statements. A naive solution to this problem can work in a cascaded manner, that is, an automatic speech recognition component followed by a text-to-SQL component. However, it requires a high-quality ASR system and also suffers from the error compounding problem between the two components, resulting in limited performance. To handle these challenges, we propose a novel end-to-end neural architecture named <i>SpeechSQLNet</i> to directly translate human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information presented in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL based on common natural language questions in spoken form, rather than a natural language-based version of SQL. To validate the effectiveness of the proposed problem and model, we further construct a dataset named <i>SpeechQL</i>, by piggybacking the widely used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact match accuracies. We expect speech-to-SQL would inspire more research on more effective and efficient human–machine interfaces to lower the barrier of using relational databases.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new distributional treatment for time series anomaly detection
Pub Date: 2024-02-15 | DOI: 10.1007/s00778-023-00832-x
Kai Ming Ting, Zongyou Liu, Lei Gong, Hang Zhang, Ye Zhu
Time series are traditionally treated with two main approaches, the time domain approach and the frequency domain approach. These approaches must rely on a sliding window so that time-shifted versions of a sequence can be measured as similar. Coupled with the use of a point-to-point measure, existing methods often have quadratic time complexity. We offer a third approach: the $\mathbb{R}$ domain approach. It begins with an insight that sequences in a stationary time series can be treated as sets of independent and identically distributed (iid) points generated from an unknown distribution in $\mathbb{R}$. This $\mathbb{R}$ domain treatment enables two new possibilities: (a) the similarity between two sequences can be computed using a distributional measure such as the Wasserstein distance (WD), kernel mean embedding, or the isolation distributional kernel ($\mathcal{K}_I$); and (b) these distributional measures are not sliding-window-based. Together, they offer an alternative that has more effective similarity measurements and runs significantly faster than point-to-point and sliding-window-based measures. Our empirical evaluation shows that $\mathcal{K}_I$ is an effective and efficient distributional measure for time series, and that $\mathcal{K}_I$-based detectors have better detection accuracy than existing detectors in two tasks: (i) anomalous sequence detection in a stationary time series and (ii) anomalous time series detection in a dataset of non-stationary time series. The insight makes underutilized "old things new again", giving existing distributional measures and anomaly detectors a new life in time series anomaly detection that would otherwise be impossible.
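A small sketch of the $\mathbb{R}$ domain idea: each sequence is treated as a set of points and compared to the series' overall value distribution with a distributional measure, here SciPy's 1-D Wasserstein distance rather than the paper's $\mathcal{K}_I$ kernel, and without any sliding window. The data, split size, and scoring rule are invented for illustration.

```python
# Sketch of the R-domain treatment: compare value distributions of sequences
# directly (1-D Wasserstein distance) instead of point-to-point, sliding-window
# comparisons. Uses WD as a stand-in for the paper's isolation distributional kernel.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
series = rng.normal(0, 1, size=1000)
series[600:700] += 4                      # injected anomalous sequence

sequences = np.split(series, 10)          # non-overlapping sequences, no sliding window
reference = np.concatenate(sequences)     # empirical distribution of the whole series

scores = [wasserstein_distance(seq, reference) for seq in sequences]
print(np.argmax(scores))                  # 6: the anomalous sequence stands out
```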
{"title":"A new distributional treatment for time series anomaly detection","authors":"Kai Ming Ting, Zongyou Liu, Lei Gong, Hang Zhang, Ye Zhu","doi":"10.1007/s00778-023-00832-x","DOIUrl":"https://doi.org/10.1007/s00778-023-00832-x","url":null,"abstract":"<p>Time series is traditionally treated with two main approaches, i.e., the time domain approach and the frequency domain approach. These approaches must rely on a sliding window so that time-shift versions of a sequence can be measured to be similar. Coupled with the use of a root point-to-point measure, existing methods often have quadratic time complexity. We offer the third <span>(mathbb {R})</span> domain approach. It begins with an <i>insight</i> that sequences in a stationary time series can be treated as sets of independent and identically distributed (iid) points generated from an unknown distribution in <span>(mathbb {R})</span>. This <span>(mathbb {R})</span> domain treatment enables two new possibilities: (a) The similarity between two sequences can be computed using a distributional measure such as Wasserstein distance (WD), kernel mean embedding or isolation distributional kernel (<span>(mathcal {K}_I)</span>), and (b) these distributional measures become non-sliding-window-based. Together, they offer an alternative that has more effective similarity measurements and runs significantly faster than the point-to-point and sliding-window-based measures. Our empirical evaluation shows that <span>(mathcal {K}_I)</span> is an effective and efficient distributional measure for time series; and <span>(mathcal {K}_I)</span>-based detectors have better detection accuracy than existing detectors in two tasks: (i) anomalous sequence detection in a stationary time series and (ii) anomalous time series detection in a dataset of non-stationary time series. The <i>insight</i> makes underutilized “old things new again” which gives existing distributional measures and anomaly detectors a new life in time series anomaly detection that would otherwise be impossible.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A learning-based framework for spatial join processing: estimation, optimization and tuning
Pub Date: 2024-02-13 | DOI: 10.1007/s00778-024-00836-1
Tin Vu, Alberto Belussi, Sara Migliorini, Ahmed Eldawy
The importance and complexity of the spatial join operation have resulted in many join algorithms, some of which are tailored to big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization that can accommodate both the characteristics of the spatial datasets and the complexity of the different algorithms. The main challenge is developing portable cost models that, once trained, can be applied to any pair of input datasets, because they are able to capture the important input characteristics, such as data distribution and spatial partitioning, the logic of the spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently and that capture the intricate aspects of spatial join. It then uses these features to train five machine learning models that together identify the best spatial join algorithm. The first two are regression models that estimate two important measures of spatial join performance and act as the cost model. The third model chooses the best partitioning strategy to use with the spatial join. The fourth and fifth models further tune two important parameters, the number of partitions and the plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.
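The following sketch shows the general learning step such a framework relies on: dataset-pair features feed a regressor that estimates cost and a classifier that picks a join algorithm. It is not the authors' models or feature set; the features, targets, and algorithm labels are made up for illustration.

```python
# Sketch of the learning step (not the paper's models or features):
# dataset-pair features -> cost regressor + algorithm-selection classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical features per dataset pair: cardinality ratio, avg rectangle area, overlap ratio.
X = rng.random((500, 3))
runtime = 2.0 * X[:, 0] + 5.0 * X[:, 2] + rng.normal(0, 0.1, 500)   # synthetic cost target
best_algo = (X[:, 2] > 0.5).astype(int)    # 0 = partition-based join, 1 = broadcast join

cost_model = RandomForestRegressor(random_state=0).fit(X, runtime)
algo_model = RandomForestClassifier(random_state=0).fit(X, best_algo)

pair = np.array([[0.3, 0.7, 0.9]])
print("estimated cost:", cost_model.predict(pair)[0])
print("chosen algorithm:", ["partition-based", "broadcast"][algo_model.predict(pair)[0]])
```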
{"title":"A learning-based framework for spatial join processing: estimation, optimization and tuning","authors":"Tin Vu, Alberto Belussi, Sara Migliorini, Ahmed Eldawy","doi":"10.1007/s00778-024-00836-1","DOIUrl":"https://doi.org/10.1007/s00778-024-00836-1","url":null,"abstract":"<p>The importance and complexity of spatial join operation resulted in the availability of many join algorithms, some of which are tailored for big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization which can accommodate both the characteristics of spatial datasets and the complexity of the different algorithms. The main challenge is how to develop portable cost models that once trained can be applied to any pair of input datasets, because they are able to extract the important input characteristics, such as data distribution and spatial partitioning, the logic of spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently for the data to catch the intricate aspects of spatial join. Then, it uses these features to train five machine learning models that are used to identify the best spatial join algorithm. The first two are regression models that estimate two important measures of the spatial join performance and they act as the cost model. The third model chooses the best partitioning strategy to use with spatial join. The fourth and fifth models further tune two important parameters, number of partitions and plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"166 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assisted design of data science pipelines
Pub Date: 2024-02-13 | DOI: 10.1007/s00778-024-00835-2
Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl
When designing data science (DS) pipelines, end-users can get overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to support end-users by (semi-)automating the process. However, they are expensive to compute and have limited applicability across real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to find the optimal one; (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the assisted design of data science pipelines using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates, which the user can choose to execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines using efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves predictive performance similar to state-of-the-art long-running AutoML solutions while being real-time, generic with respect to evaluation processes and DS tasks, and extensible to new operators.
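A compact sketch of the two recommendation steps described above: retrieve pipelines that were executed on similar datasets via meta-feature similarity, then rank the candidates by Pareto dominance on two objectives. It is not the authors' system; the meta-features, pipeline corpus, and numbers are invented.

```python
# Sketch of similarity-based retrieval + multi-objective ranking of past pipelines.
# Not the paper's system; meta-features and the corpus are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Corpus of previously executed pipelines: dataset meta-features + observed outcomes.
meta_features = np.array([[1000, 10, 0.2], [50000, 300, 0.5], [1200, 12, 0.25]])
pipelines = [
    {"name": "impute->scale->logreg", "accuracy": 0.81, "runtime_s": 3.0},
    {"name": "impute->embed->gbdt",   "accuracy": 0.88, "runtime_s": 40.0},
    {"name": "scale->svm",            "accuracy": 0.79, "runtime_s": 2.0},
]

query = np.array([[1100, 11, 0.22]])      # meta-features of the user's dataset
nn = NearestNeighbors(n_neighbors=2).fit(meta_features)
_, idx = nn.kneighbors(query)
candidates = [pipelines[i] for i in idx[0]]

def dominated(p, q):
    # q dominates p: at least as good on both objectives and not identical
    return q["accuracy"] >= p["accuracy"] and q["runtime_s"] <= p["runtime_s"] and q != p

pareto = [p for p in candidates if not any(dominated(p, q) for q in candidates)]
print([p["name"] for p in pareto])        # ranked shortlist shown to the user
```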
{"title":"Assisted design of data science pipelines","authors":"Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl","doi":"10.1007/s00778-024-00835-2","DOIUrl":"https://doi.org/10.1007/s00778-024-00835-2","url":null,"abstract":"<p>When designing data science (DS) pipelines, end-users can get overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to facilitate end-users by (semi-)automating the process. However, they are expensive to compute and yield limited applicability for a wide range of real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to get the optimal one, (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the <i>assisted</i> <i>design</i> <i>of</i> <i>data</i> <i>science</i> <i>pipelines</i> using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates from which the user can choose to execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines utilizing efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves similar predictive performance with state-of-the-art long-running AutoML solutions while being real-time, generic to any evaluation processes and DS tasks, and extensible to new operators.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139769336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time series data encoding in Apache IoTDB: comparative analysis and recommendation
Pub Date: 2024-02-12 | DOI: 10.1007/s00778-024-00840-5
Tianrui Xia, Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang, Jianmin Wang
Both the vast range of applications and the distinct features of time series data have stimulated the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Given the distinct features of various time series, different encoding strategies may perform differently. In this study, we first summarize the features of time series data that may affect encoding performance and introduce the latest feature extraction results for these features. We then introduce the storage scheme of a typical time series database, Apache IoTDB, which prescribes the limits on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator and several real-world datasets, and present an extensive experimental evaluation. Notably, a quantitative analysis of encoding effectiveness with respect to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series according to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.
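The toy example below illustrates why data features drive encoding choice: regularly sampled timestamps have constant deltas, so a delta-style scheme shrinks them to near-constant small values, while an irregular series benefits far less. It is a generic delta codec for illustration, not IoTDB's encoder implementations.

```python
# Toy illustration of how a data feature (regular sampling) favors delta encoding.
# Generic codec, not Apache IoTDB's encoder.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

regular = [1000, 1100, 1200, 1300, 1400]        # fixed sampling interval
irregular = [1000, 1130, 1205, 1490, 1502]
print(delta_encode(regular))     # [1000, 100, 100, 100, 100] -> highly compressible
print(delta_encode(irregular))   # deltas stay scattered -> weaker compression
assert delta_decode(delta_encode(regular)) == regular
```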
{"title":"Time series data encoding in Apache IoTDB: comparative analysis and recommendation","authors":"Tianrui Xia, Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang, Jianmin Wang","doi":"10.1007/s00778-024-00840-5","DOIUrl":"https://doi.org/10.1007/s00778-024-00840-5","url":null,"abstract":"<p>Not only the vast applications but also the distinct features of time series data stimulate the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, OpenTSDB and so on. Almost all these systems employ columnar storage, with effective encoding of time series data. Given the distinct features of various time series data, different encoding strategies may perform variously. In this study, we first summarize the features of time series data that may affect encoding performance. We also introduce the latest feature extraction results in these features. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, prescribing the limits to implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator and several real-world datasets. Also, we present an extensive experimental evaluation. Remarkably, a quantitative analysis of encoding effectiveness regarding to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series referring to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-trajectory clustering with deep reinforcement learning
Pub Date: 2024-01-25 | DOI: 10.1007/s00778-023-00833-w
Sub-trajectory clustering is a fundamental problem in many trajectory applications. Existing approaches usually divide the clustering procedure into two phases: segmenting trajectories into sub-trajectories and then clustering these sub-trajectories. However, researchers need to develop complex hand-crafted segmentation rules for specific applications, making the clustering results sensitive to the segmentation rules and lacking in generality. To solve this problem, we propose a novel algorithm, based on reinforcement learning (RL), that uses the clustering results to guide the segmentation. The novelty is that the segmentation and clustering components cooperate closely and improve each other continuously to yield better clustering results. To devise our RL-based algorithm, we model the procedure of trajectory segmentation as a Markov decision process (MDP). We apply Deep Q-Network (DQN) learning to train an RL model for the segmentation and achieve excellent clustering results. Experimental results on real datasets demonstrate the superior performance of the proposed RL-based approach over state-of-the-art methods.
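To make the MDP framing tangible, here is a structural sketch of a segmentation environment: the state is the current sub-trajectory, the actions are EXTEND or CUT, and the reward is a stand-in clustering-quality signal. It is not the paper's agent or reward; the environment class and the variance-based reward are invented, and a trained DQN would replace the random policy shown.

```python
# Structural sketch of the segmentation MDP (not the paper's DQN agent).
# Reward here is an invented proxy: compact segments score higher.
import numpy as np

EXTEND, CUT = 0, 1

class SegmentationEnv:
    def __init__(self, trajectory):
        self.traj = trajectory            # array of (x, y) points
        self.reset()

    def reset(self):
        self.start, self.pos, self.segments = 0, 1, []
        return self.traj[self.start:self.pos]

    def step(self, action):
        if action == CUT:
            segment = self.traj[self.start:self.pos]
            reward = -float(np.var(segment))   # proxy clustering-quality signal
            self.segments.append(segment)
            self.start = self.pos
        else:
            reward = 0.0
        self.pos += 1
        done = self.pos > len(self.traj)
        return self.traj[self.start:self.pos], reward, done

rng = np.random.default_rng(0)
env = SegmentationEnv(rng.random((20, 2)))
state, done = env.reset(), False
while not done:                           # a trained DQN would replace this random policy
    state, reward, done = env.step(int(rng.integers(2)))
print(len(env.segments), "segments produced")
```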
{"title":"Sub-trajectory clustering with deep reinforcement learning","authors":"","doi":"10.1007/s00778-023-00833-w","DOIUrl":"https://doi.org/10.1007/s00778-023-00833-w","url":null,"abstract":"<h3>Abstract</h3> <p>Sub-trajectory clustering is a fundamental problem in many trajectory applications. Existing approaches usually divide the clustering procedure into two phases: segmenting trajectories into sub-trajectories and then clustering these sub-trajectories. However, researchers need to develop complex human-crafted segmentation rules for specific applications, making the clustering results sensitive to the segmentation rules and lacking in generality. To solve this problem, we propose a novel algorithm using the clustering results to guide the segmentation, which is based on reinforcement learning (RL). The novelty is that the segmentation and clustering components cooperate closely and improve each other continuously to yield better clustering results. To devise our RL-based algorithm, we model the procedure of trajectory segmentation as a Markov decision process (MDP). We apply Deep-Q-Network (DQN) learning to train an RL model for the segmentation and achieve excellent clustering results. Experimental results on real datasets demonstrate the superior performance of the proposed RL-based approach over state-of-the-art methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139561200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}