Machine learning inference applications have proliferated through diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems for improving the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can cause latency degradation due to interference and can subsequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale appropriately to heterogeneous inference serving systems where the number of co-location configurations grows exponentially with the number of models and machine types. This paper proposes an interference-aware scheduler for heterogeneous inference serving systems, reducing the latency degradation from co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., varying latency degradation across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation during co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.
{"title":"Interference-Aware Scheduling for Inference Serving","authors":"Daniel Mendoza, Francisco Romero, Qian Li, N. Yadwadkar, C. Kozyrakis","doi":"10.1145/3437984.3458837","DOIUrl":"https://doi.org/10.1145/3437984.3458837","url":null,"abstract":"Machine learning inference applications have proliferated through diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems for improving the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can cause latency degradation due to interference and can subsequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale appropriately to heterogeneous inference serving systems where the number of co-location configurations grows exponentially with the number of models and machine types. This paper proposes an interference-aware scheduler for heterogeneous inference serving systems, reducing the latency degradation from co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., varying latency degradation across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation during co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126722846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recommendation systems (RS) are a key component of modern commercial platforms, with Collaborative Filtering (CF) based RSs being the centrepiece. Relevant research has long focused on measuring and improving the effectiveness of such CF systems, but their efficiency - especially with regard to their time- and resource-consuming training phase - has received little to no attention. This work is a first step towards addressing this gap. To do so, we first perform a methodical study of the computational complexity of the training phase for a number of highly popular CF-based RSs, including approaches based on matrix factorisation, k-nearest neighbours, co-clustering, and slope one schemes. Based on this, we then build a simple yet effective predictor that, given a small sample of a dataset, is able to predict training times over the complete dataset. Our systematic experimental evaluation shows that our approach outperforms state-of-the-art regression schemes by a considerable margin.
{"title":"Are we there yet? Estimating Training Time for Recommendation Systems","authors":"I. Paun, Yashar Moshfeghi, Nikos Ntarmos","doi":"10.1145/3437984.3458832","DOIUrl":"https://doi.org/10.1145/3437984.3458832","url":null,"abstract":"Recommendation systems (RS) are a key component of modern commercial platforms, with Collaborative Filtering (CF) based RSs being the centrepiece. Relevant research has long focused on measuring and improving the effectiveness of such CF systems, but alas their efficiency - especially with regards to their time- and resource-consuming training phase - has received little to no attention. This work is a first step in the direction of addressing this gap. To do so, we first perform a methodical study of the computational complexity of the training phase for a number of highly popular CF-based RSs, including approaches based on matrix factorisation, k-nearest neighbours, co-clustering, and slope one schemes. Based on this, we then build a simple yet effective predictor that, given a small sample of a dataset, is able to predict training times over the complete dataset. Our systematic experimental evaluation shows that our approach outperforms state-of-the-art regression schemes by a considerable margin.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"357 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115940954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The microservice architecture allows applications to be designed in a modular format, whereby each microservice can implement a single functionality and can be independently managed and deployed. However, an undesirable side-effect of this modular design is the large state space of possibly inter-dependent configuration parameters (of the constituent microservices) which have to be tuned to improve application performance. This workshop paper investigates optimization techniques and dimensionality reduction strategies for tuning microservices applications, empirically demonstrating the significant tail latency improvements (as much as 23%) that can be achieved with configuration tuning.
{"title":"Towards Optimal Configuration of Microservices","authors":"Gagan Somashekar, Anshul Gandhi","doi":"10.1145/3437984.3458828","DOIUrl":"https://doi.org/10.1145/3437984.3458828","url":null,"abstract":"The microservice architecture allows applications to be designed in a modular format, whereby each microservice can implement a single functionality and can be independently managed and deployed. However, an undesirable side-effect of this modular design is the large state space of possibly inter-dependent configuration parameters (of the constituent microservices) which have to be tuned to improve application performance. This workshop paper investigates optimization techniques and dimensionality reduction strategies for tuning microservices applications, empirically demonstrating the significant tail latency improvements (as much as 23%) that can be achieved with configuration tuning.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125471517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in deep learning allow on-demand reduction of model complexity, without a need for re-training, thus enabling a dynamic trade-off between inference accuracy and energy savings. Approximate mobile computing, on the other hand, adapts the level of computation approximation as the context of usage, and consequently the computation or result-accuracy requirements, vary. In this work, we propose a synergy between the two directions and develop a context-aware method for dynamically adjusting the width of an on-device neural network based on the input and the context-dependent classification confidence. We implement our method on a human activity recognition neural network and, through measurements on a real-world embedded device, demonstrate that such a network would save up to 37.8% energy while inducing only a 1% loss of accuracy when used for continuous activity monitoring in the field of elderly care.
{"title":"Queen Jane Approximately: Enabling Efficient Neural Network Inference with Context-Adaptivity","authors":"O. Machidon, Davor Sluga, V. Pejović","doi":"10.1145/3437984.3458833","DOIUrl":"https://doi.org/10.1145/3437984.3458833","url":null,"abstract":"Recent advances in deep learning allow on-demand reduction of model complexity, without a need for re-training, thus enabling a dynamic trade-off between the inference accuracy and the energy savings. Approximate mobile computing, on the other hand, adapts the computation approximation level as the context of usage, and consequently the computation needs or result accuracy needs, vary. In this work, we propose a synergy between the two directions and develop a context-aware method for dynamically adjusting the width of an on-device neural network based on the input and context-dependent classification confidence. We implement our method on a human activity recognition neural network and through measurements on a real-world embedded device demonstrate that such a network would save up to 37.8% energy and induce only 1% loss of accuracy, if used for continuous activity monitoring in the field of elderly care.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115023828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The choice of convolutional routines (or primitives) for implementing the operations in a Convolutional Neural Network (CNN) has a tremendous impact on the inference time. To optimise the execution latency for a target system, a lengthy profiling stage is needed - iterating over all the implementations of convolutional primitives in the configuration of each layer to measure their execution time on that platform. Each primitive exercises the system resources in different ways, so new profiling is currently needed when optimising for another system. In this work, we replace this prohibitively expensive profiling stage with a machine learning based approach to performance modelling. Our approach drastically speeds up the optimisation by estimating the latency of convolutional primitives in any layer configuration running on a target system. We reduce the time needed for optimising the execution of large neural networks on an ARM Cortex-A73 system from hours to just seconds. Our performance model is easily transferable across target platforms. This is demonstrated by training a performance model on an Intel platform and transferring its predictive performance to AMD and ARM systems, using very few profiled samples from the target platforms for fine-tuning the performance model.
{"title":"Fast Optimisation of Convolutional Neural Network Inference using System Performance Models","authors":"Rik Mulder, Valentin Radu, Christophe Dubach","doi":"10.1145/3437984.3458840","DOIUrl":"https://doi.org/10.1145/3437984.3458840","url":null,"abstract":"The choice of convolutional routines (or primitives) for implementing the operations in a Convolutional Neural Network (CNN) has a tremendous impact over the inference time. To optimise the execution latency for a target system, a lengthy profiling stage is needed - iterating over all the implementations of convolutional primitives in the configuration of each layer to measure their execution time on that platform. Each primitive exercises the system resources in different ways, so new profiling is currently needed when optimising for another system. In this work, we replace this prohibitively expensive profiling stage with a machine learning based approach of performance modelling. Our approach drastically speeds up the optimisation by estimating the latency of convolutional primitives in any layer configuration running on a target system. We reduce the time needed for optimising the execution of large neural networks on an ARM Cortex-A73 system from hours to just seconds. Our performance model is easily transferable across target platforms. This is demonstrated by training a performance model on an Intel platform and transferring its predictive performance to AMD and ARM systems, using very few profiled samples from the target platforms for fine-tuning the performance model.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133821197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Private and public clouds require users to specify requests for resources such as CPU and memory (RAM) to be provisioned for their applications. The values of these requests do not necessarily relate to the application's run-time requirements, but only help the cloud infrastructure resource manager to map requested resources to physical resources. If an application exceeds these values, it might be throttled or even terminated. As a consequence, requested values are often overestimated, resulting in poor resource utilization in the cloud infrastructure. Autoscaling is a technique used to overcome these problems. We observed that Kubernetes Vertical Pod Autoscaler (VPA) might be using an autoscaling strategy that performs poorly on workloads that change periodically. Our experimental results show that, compared to VPA, predictive methods based on Holt-Winters exponential smoothing (HW) and Long Short-Term Memory (LSTM) can decrease CPU slack by over 40% while avoiding CPU insufficiency for various CPU workloads. Furthermore, LSTM has been shown to generate more stable predictions than HW, allowing for more robust scaling decisions.
{"title":"Predicting CPU usage for proactive autoscaling","authors":"Thomas Wang, Simone Ferlin Oliveira, Marco Chiesa","doi":"10.1145/3437984.3458831","DOIUrl":"https://doi.org/10.1145/3437984.3458831","url":null,"abstract":"Private and public clouds require users to specify requests for resources such as CPU and memory (RAM) to be provisioned for their applications. The values of these requests do not necessarily relate to the application's run-time requirements, but only help the cloud infrastructure resource manager to map requested resources to physical resources. If an application exceeds these values, it might be throttled or even terminated. As a consequence, requested values are often overestimated, resulting in poor resource utilization in the cloud infrastructure. Autoscaling is a technique used to overcome these problems. We observed that Kubernetes Vertical Pod Autoscaler (VPA) might be using an autoscaling strategy that performs poorly on workloads that periodically change. Our experimental results show that compared to VPA, predictive methods based on Holt-Winters exponential smoothing (HW) and Long Short-Term Memory (LSTM) can decrease CPU slack by over 40% while avoiding CPU insufficiency for various CPU workloads. Furthermore, LSTM has been shown to generate stabler predictions compared to that of HW, which allowed for more robust scaling decisions.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121171570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Learning (ML) for developing Intrusion Detection Systems (IDS) is a fast-evolving research area that has many unsolved domain challenges. Current IDS models face two challenges that limit their performance and robustness. Firstly, they require large datasets to train and their performance is highly dependent on the dataset size. Secondly, zero-day attacks demand that machine learning models are retrained in order to identify future attacks of this type. However, the sophistication and increasing rate of cyber attacks make retraining time prohibitive for practical implementation. This paper proposes a new IDS model that can learn from pair similarities rather than class discriminative features. Learning similarities requires less data for training and provides the ability to flexibly adapt to new cyber attacks, thus reducing the burden of retraining. The underlying model is based on Siamese Networks; therefore, given a number of instances, numerous similar and dissimilar pairs can be generated. The model is evaluated using three mainstream IDS datasets: CICIDS2017, KDD Cup'99, and NSL-KDD. The evaluation results confirm the ability of the Siamese Network model to suit IDS purposes by classifying cyber attacks based on similarity-based learning. This opens a new research direction for building adaptable IDS models using non-conventional ML techniques.
{"title":"Developing a Siamese Network for Intrusion Detection Systems","authors":"Hanan Hindy, C. Tachtatzis, Robert C. Atkinson, Ethan Bayne, X. Bellekens","doi":"10.1145/3437984.3458842","DOIUrl":"https://doi.org/10.1145/3437984.3458842","url":null,"abstract":"Machine Learning (ML) for developing Intrusion Detection Systems (IDS) is a fast-evolving research area that has many unsolved domain challenges. Current IDS models face two challenges that limit their performance and robustness. Firstly, they require large datasets to train and their performance is highly dependent on the dataset size. Secondly, zero-day attacks demand that machine learning models are retrained in order to identify future attacks of this type. However, the sophistication and increasing rate of cyber attacks make retraining time prohibitive for practical implementation. This paper proposes a new IDS model that can learn from pair similarities rather than class discriminative features. Learning similarities requires less data for training and provides the ability to flexibly adapt to new cyber attacks, thus reducing the burden of retraining. The underlying model is based on Siamese Networks, therefore, given a number of instances, numerous similar and dissimilar pairs can be generated. The model is evaluated using three mainstream IDS datasets; CICIDS2017, KDD Cup'99, and NSL-KDD. The evaluation results confirm the ability of the Siamese Network model to suit IDS purposes by classifying cyber attacks based on similarity-based learning. This opens a new research direction for building adaptable IDS models using non-conventional ML techniques.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"46 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130411247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, horizontal, and pipeline parallelism. However, selecting the best set of strategies for a given model and hardware configuration is challenging because debugging and testing on clusters is expensive. In this work we propose DistIR, an IR for explicitly representing distributed DNN computation that can capture many popular distribution strategies. We build an analysis framework for DistIR programs, including a simulator and reference executor that can be used to automatically search for an optimal distribution strategy. Our unified global representation also eases development of new distribution strategies, as one can reuse the lowering to per-rank backend programs. Preliminary results using a grid search over a hybrid data/horizontal/pipeline-parallel space suggest DistIR and its simulator can aid automatic DNN distribution.
{"title":"DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks","authors":"Keshav Santhanam, Siddharth Krishna, Ryota Tomioka, A. Fitzgibbon, Tim Harris","doi":"10.1145/3437984.3458829","DOIUrl":"https://doi.org/10.1145/3437984.3458829","url":null,"abstract":"The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, horizontal, and pipeline parallelism. However, selecting the best set of strategies for a given model and hardware configuration is challenging because debugging and testing on clusters is expensive. In this work we propose DistIR, an IR for explicitly representing distributed DNN computation that can capture many popular distribution strategies. We build an analysis framework for DistIR programs, including a simulator and reference executor that can be used to automatically search for an optimal distribution strategy. Our unified global representation also eases development of new distribution strategies, as one can reuse the lowering to per-rank backend programs. Preliminary results using a grid search over a hybrid data/horizontal/pipeline-parallel space suggest DistIR and its simulator can aid automatic DNN distribution.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130191170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated learning (FL) is increasingly becoming the norm for training models over distributed and private datasets. Major service providers rely on FL to improve services such as text auto-completion, virtual keyboards, and item recommendations. Nonetheless, training models with FL in practice requires a significant amount of time (days or even weeks) because FL tasks execute in highly heterogeneous environments where devices are widespread yet have only limited computing capabilities and network connectivity. In this paper, we focus on mitigating the extent of device heterogeneity, which is a main contributing factor to training time in FL. We propose AQFL, a simple and practical approach leveraging adaptive model quantization to homogenize the computing resources of the clients. We evaluate AQFL on five common FL benchmarks. The results show that, in heterogeneous settings, AQFL obtains nearly the same model quality and fairness as training in homogeneous settings.
{"title":"Towards Mitigating Device Heterogeneity in Federated Learning via Adaptive Model Quantization","authors":"A. Abdelmoniem, M. Canini","doi":"10.1145/3437984.3458839","DOIUrl":"https://doi.org/10.1145/3437984.3458839","url":null,"abstract":"Federated learning (FL) is increasingly becoming the norm for training models over distributed and private datasets. Major service providers rely on FL to improve services such as text auto-completion, virtual keyboards, and item recommendations. Nonetheless, training models with FL in practice requires significant amount of time (days or even weeks) because FL tasks execute in highly heterogeneous environments where devices only have widespread yet limited computing capabilities and network connectivity conditions. In this paper, we focus on mitigating the extent of device heterogeneity, which is a main contributing factor to training time in FL. We propose AQFL, a simple and practical approach leveraging adaptive model quantization to homogenize the computing resources of the clients. We evaluate AQFL on five common FL benchmarks. The results show that, in heterogeneous settings, AQFL obtains nearly the same quality and fairness of the model trained in homogeneous settings.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114426236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RocksDB is a general-purpose embedded key-value store used in multiple different settings. Its versatility comes at the cost of complex tuning configurations. This paper investigates maximizing the throughput of RocksDB IO operations by auto-tuning ten parameters of varying ranges. Off-the-shelf optimizers struggle with high-dimensional problem spaces and require a large number of training samples. We propose two techniques to tackle this problem: multitask modeling and dimensionality reduction through clustering. By incorporating adjacent optimization in the model, the model converged faster and found complicated settings that other tuners could not find. This approach had an additional computational complexity overhead, which we mitigated by manually assigning parameters to each sub-goal through our knowledge of RocksDB. The model is then incorporated in a standard Bayesian Optimization loop to find parameters that maximize RocksDB's IO throughput. Our method achieved a 1.3× improvement when benchmarked against a simulation of Facebook's social graph traffic, and converged in ten optimization steps compared to other state-of-the-art methods that required fifty steps.
{"title":"High-Dimensional Bayesian Optimization with Multi-Task Learning for RocksDB","authors":"Sami Alabed, Eiko Yoneki","doi":"10.1145/3437984.3458841","DOIUrl":"https://doi.org/10.1145/3437984.3458841","url":null,"abstract":"RocksDB is a general-purpose embedded key-value store used in multiple different settings. Its versatility comes at the cost of complex tuning configurations. This paper investigates maximizing the throughput of RocksDB 10 operations by auto-tuning ten parameters of varying ranges. Off-the-shelf optimizers struggle with high-dimensional problem spaces and require a large number of training samples. We propose two techniques to tackle this problem: multitask modeling and dimensionality reduction through clustering. By incorporating adjacent optimization in the model, the model converged faster and found complicated settings that other tuners could not find. This approach had an additional computational complexity overhead, which we mitigated by manually assigning parameters to each sub-goal through our knowledge of RocksDB. The model is then incorporated in a standard Bayesian Optimization loop to find parameters that maximize RocksDB's 10 throughput. Our method achieved x1.3 improvement when bench-marked against a simulation of Facebook's social graph traffic, and converged in ten optimization steps compared to other state-of-the-art methods that required fifty steps.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128115772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}