In this study, we use the GPU to accelerate the power flow computation employed in the modeling and analysis of electric power distribution systems. We use kernels and parallel computation patterns (i.e., segmented scan and reduction) running on the GPU to accelerate a common power flow method known as the “forward-backward sweep”. To evaluate our approach, we compare the GPU-accelerated parallel implementation of this method, written in CUDA, to a serial implementation that runs on the CPU. We perform our tests on binary power distribution trees with node counts ranging from 1K to 256K. Our results show that the parallel implementation delivers up to a 3.9x total speedup over the serial implementation. As expected, for the parts of the computation that run entirely on the GPU, larger speedups are achieved as the size of the distribution tree increases. We also discuss how the topology of the tree would affect the results.
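To make the structure of the computation concrete, below is a minimal serial Python sketch of a forward-backward sweep on a tiny radial distribution tree; the node numbering, impedances, loads, and fixed iteration count are illustrative assumptions, not the paper's CUDA implementation. The backward sweep accumulates branch currents from the leaves toward the root, and the forward sweep updates voltages from the root outward; these are the two traversals that the segmented scan and reduction patterns parallelize.

```python
# Minimal serial forward-backward sweep on a radial feeder (illustrative sketch).
# Node 0 is the substation/root; parent[i] gives the parent of node i > 0.
parent = [None, 0, 0, 1, 1]                    # a tiny binary tree with 5 nodes
z = [None, 0.02 + 0.04j, 0.02 + 0.04j,         # branch impedance from parent[i] to i
     0.03 + 0.05j, 0.03 + 0.05j]
s_load = [0, 0.5 + 0.2j, 0.4 + 0.1j,           # complex power demand at each node (p.u.)
          0.3 + 0.1j, 0.2 + 0.05j]

v = [1.0 + 0j] * 5                             # initial voltage guess at every node
children = {i: [j for j, p in enumerate(parent) if p == i] for i in range(5)}

for _ in range(20):                            # fixed iteration count for simplicity
    # Backward sweep: branch currents accumulated from leaves to root.
    i_branch = [0j] * 5
    for node in range(4, 0, -1):               # children have larger indices than parents
        i_inj = (s_load[node] / v[node]).conjugate()
        i_branch[node] = i_inj + sum(i_branch[c] for c in children[node])
    # Forward sweep: voltages updated from root to leaves using branch currents.
    for node in range(1, 5):
        v[node] = v[parent[node]] - z[node] * i_branch[node]

print([abs(x) for x in v])
```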
{"title":"Accelerating Forward-Backward Sweep Power Flow Computation on the GPU","authors":"Saumya Shah, M. Zarghami, Pınar Muyan-Özçelik","doi":"10.1145/3409390.3409397","DOIUrl":"https://doi.org/10.1145/3409390.3409397","url":null,"abstract":"In this study, we accelerate power flow computation used in modeling and analysis of electric power distribution systems utilizing the GPU. We use kernels and parallel computation patterns (i.e., segmented scan and reduction) running on the GPU to accelerate a common method that is used to perform power flow computation called “forward-backward sweep”. To evaluate our approach, we compare the GPU-accelerated parallel implementation of this method written in CUDA to the serial implementation that runs on the CPU. We perform our tests on binary power distribution trees that have number of nodes between 1K to 256K. Our results show that the parallel implementation brings up to 3.9x total speedup over the serial implementation. As expected, for the parts of the computation that entirely run on the GPU, larger speedups are achieved as the size of the distribution tree increases. We also provide a discussion on how the topology of the tree would affect the results.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134461910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
By modelling how the probability distributions of individuals’ states evolve as new information flows through a network, belief propagation has broad applicability, ranging from image correction to virus propagation to social networks. Yet its few existing implementations are largely confined to small Bayesian networks, leaving applications of the algorithm to large-scale graphs out of reach. To promote its broader adoption, we enable belief propagation on both small and large graphs using GPU processing. We explore a host of optimizations, including a simple yet extensible input format that lets belief propagation operate at massive scale, along with significant workload-processing updates and meticulous memory management, which allow our implementation to outperform prior work in raw execution time and input size on a single machine. Applying a suite of parallelization technologies and techniques to a diverse set of graphs, we demonstrate that our implementations can efficiently process even massive networks, achieving up to nearly 121x speedup versus our optimized single-threaded control implementations while supporting graphs of over ten million nodes, in contrast to previous works’ support for thousands of nodes using CPU-based multi-core and host solutions. To assist in choosing the optimal implementation for a given graph, we provide a method that uses a random forest classifier and graph metadata, achieving a nearly 95% F1-score in our initial benchmarking; the method is portable to different GPU architectures, where it achieves an F1-score of over 72% and a speedup of nearly 183x versus our control running in the new environment.
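The implementation-selection step can be illustrated with a small scikit-learn sketch: a random forest is trained on graph metadata to predict which belief-propagation variant to run. The feature set, labels, and data below are invented placeholders, not the authors' benchmark suite.

```python
# Illustrative sketch: predict the best parallel belief-propagation variant
# from graph metadata using a random forest (synthetic data throughout).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Each row: [num_nodes, num_edges, avg_degree, max_degree, is_bipartite]
X = rng.random((200, 5)) * [1e7, 5e7, 50, 500, 1]
y = rng.integers(0, 3, size=200)      # 0/1/2: hypothetical GPU implementations

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```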
{"title":"Rumor Has It: Optimizing the Belief Propagation Algorithm for Parallel Processing","authors":"Michael Trotter, Timothy Wood, H. H. Huang","doi":"10.1145/3409390.3409401","DOIUrl":"https://doi.org/10.1145/3409390.3409401","url":null,"abstract":"By modelling how the probability distributions of individuals’ states evolve as new information flows through a network, belief propagation has broad applicability ranging from image correction to virus propagation to even social networks. Yet, its scant implementations confine themselves largely to the realm of small Bayesian networks. Applications of the algorithm to graphs of large scale are thus unfortunately out of reach. To promote its broad acceptance, we enable belief propagation for both small and large scale graphs utilizing GPU processing. We therefore explore a host of optimizations including a new simple yet extensible input format enabling belief propagation to operate at massive scale, along with significant workload processing updates and meticulous memory management to enable our implementation to outperform prior works in terms of raw execution time and input size on a single machine. Utilizing a suite of parallelization technologies and techniques against a diverse set of graphs, we demonstrate that our implementations can efficiently process even massive networks, achieving up to nearly 121x speedups versus our control yet optimized single threaded implementations while supporting graphs of over ten million nodes in size in contrast to previous works’ support for thousands of nodes using CPU-based multi-core and host solutions. To assist in choosing the optimal implementation for a given graph, we provide a promising method utilizing a random forest classifier and graph metadata with a nearly 95% F1-score from our initial benchmarking and is portable to different GPU architectures to achieve over an F1-score of over 72% accuracy and a speedup of nearly 183x versus our control running in this new environment.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128349285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs
Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this work, we study how application-level scheduling techniques can leverage moldability (i.e., the flexibility to execute a task as either single-threaded or multithreaded) and explicit knowledge of task criticality to handle scenarios in which system performance is not only unknown but also changing over time. Our proposed task scheduler dynamically learns the performance characteristics of the underlying platform and uses this knowledge to devise schedules that are aware of dynamic performance asymmetry, thereby reducing the impact of interference. Our evaluation shows that both criticality-aware scheduling and parallelism tuning are effective schemes for addressing interference in both shared and distributed memory applications.
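As a rough illustration of the idea (not the authors' scheduler), the sketch below assigns a thread count to each task from a measured speedup curve: critical tasks are molded to the widest thread count that still yields acceptable parallel efficiency, while non-critical tasks stay single-threaded to leave resources for the critical path. The efficiency threshold, task names, and the synthetic speedup curve are assumptions.

```python
# Toy criticality-aware moldable scheduling sketch (illustrative only).
def pick_width(measured_speedup, max_threads, efficiency_floor=0.6):
    """Choose the largest thread count whose parallel efficiency stays acceptable."""
    best = 1
    for t in range(2, max_threads + 1):
        if measured_speedup(t) / t >= efficiency_floor:
            best = t
    return best

def schedule(tasks, max_threads, measured_speedup):
    plan = {}
    for task in tasks:                 # tasks: list of dicts with 'name', 'critical'
        if task["critical"]:
            plan[task["name"]] = pick_width(measured_speedup, max_threads)
        else:
            plan[task["name"]] = 1     # keep cores free for the critical path
    return plan

# The speedup curve would be learned online; here a synthetic diminishing-returns curve.
speedup = lambda t: t / (1 + 0.05 * (t - 1))
tasks = [{"name": "lu_panel", "critical": True},
         {"name": "trailing_update", "critical": False}]
print(schedule(tasks, max_threads=16, measured_speedup=speedup))
```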
{"title":"Scheduling Task-parallel Applications in Dynamically Asymmetric Environments","authors":"J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs","doi":"10.1145/3409390.3409408","DOIUrl":"https://doi.org/10.1145/3409390.3409408","url":null,"abstract":"Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this work, we study how application-level scheduling techniques can leverage moldability (i.e. flexibility to work as either single-threaded or multithreaded task) and explicit knowledge on task criticality to handle scenarios in which system performance is not only unknown but also changing over time. Our proposed task scheduler dynamically learns the performance characteristics of the underlying platform and uses this knowledge to devise better schedules aware of dynamic performance asymmetry, hence reducing the impact of interference. Our evaluation shows that both criticality-aware scheduling and parallelism tuning are effective schemes to address interference in both shared and distributed memory applications.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"575 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123127486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPUs are well established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application’s memory access patterns, and finding it is usually a tedious task for programmers, both in making the decision and in implementing it. We leverage this observation in a concept that automatically compiles single-GPU code into multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and on a toolchain that compiles CUDA applications fully automatically, without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.
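A toy Python sketch of the block-boundary partitioning idea is shown below; it splits a regular, element-wise kernel's iteration space across GPUs and records the contiguous buffer slice each device touches. The real toolchain derives these footprints from a polyhedral model of the memory accesses, so this is only a simplified analogue with made-up sizes.

```python
# Simplified block-boundary partitioning for a regular data-parallel kernel
# (e.g., element-wise vector add): assign whole thread blocks to GPUs and
# compute the contiguous slice of the buffer each device needs.
def partition(num_elements, block_size, num_gpus):
    num_blocks = (num_elements + block_size - 1) // block_size
    blocks_per_gpu = (num_blocks + num_gpus - 1) // num_gpus
    plan = []
    for gpu in range(num_gpus):
        first_block = gpu * blocks_per_gpu
        last_block = min(first_block + blocks_per_gpu, num_blocks)
        if first_block >= last_block:
            break
        start = first_block * block_size
        stop = min(last_block * block_size, num_elements)
        plan.append({"gpu": gpu, "elements": (start, stop),
                     "buffer_slice": slice(start, stop)})
    return plan

for part in partition(num_elements=1 << 20, block_size=256, num_gpus=4):
    print(part)
```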
{"title":"Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation","authors":"Alexander Matz, J. Doerfert, H. Fröning","doi":"10.1145/3409390.3409403","DOIUrl":"https://doi.org/10.1145/3409390.3409403","url":null,"abstract":"GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application’s memory access patterns, and usually a tedious task for programmers in terms of decision and implementation. We leverage this observation for a concept that automatically compiles single-GPU code to multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and the toolchain to fully automatically compile CUDA applications without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson
The traditional model of having simulations write data to disk for offline analysis can be prohibitively expensive on computers with limited storage capacity or I/O bandwidth. In situ data analysis has emerged as a necessary paradigm to address this issue and is expected to play an important role in exascale computing. We demonstrate the various aspects and challenges involved in setting up a comprehensive in situ data analysis pipeline that consists of a simulation coupled with compression and feature tracking routines, a framework for assessing compression quality, a middleware library for I/O and data management, and a workflow tool for composing and running the pipeline. We perform studies of compression mechanisms and parameters on two supercomputers, Summit at Oak Ridge National Laboratory and Theta at Argonne National Laboratory, for two example application pipelines. We show that the optimal choice of compression parameters varies with data, time, and analysis, and that periodic retuning of the in situ pipeline can improve compression quality. Finally, we discuss our perspective on the wider adoption of in situ data analysis and management practices and technologies in the HPC community.
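The kind of per-step quality assessment such a pipeline performs can be sketched as follows; a uniform quantizer stands in for a real lossy compressor (the paper studies actual compression mechanisms and parameters), and the error-bound values and synthetic field are arbitrary.

```python
# Sketch of an in situ compression-quality check run after each simulation step.
import numpy as np

def quantize(field, tolerance):
    """Lossy stand-in compressor: round values to a grid of spacing 2*tolerance."""
    step = 2.0 * tolerance
    return np.round(field / step) * step

def quality_metrics(original, reconstructed):
    err = original - reconstructed
    mse = float(np.mean(err ** 2))
    value_range = float(original.max() - original.min()) or 1.0
    psnr = 10.0 * np.log10(value_range ** 2 / mse) if mse > 0 else float("inf")
    return {"max_abs_error": float(np.abs(err).max()), "psnr_db": psnr}

field = np.sin(np.linspace(0, 8 * np.pi, 100_000))   # a synthetic "simulation" field
for tol in (1e-2, 1e-3, 1e-4):                       # periodic retuning would revisit tol
    print(tol, quality_metrics(field, quantize(field, tol)))
```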
{"title":"Feature-preserving Lossy Compression for In Situ Data Analysis","authors":"I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson","doi":"10.1145/3409390.3409400","DOIUrl":"https://doi.org/10.1145/3409390.3409400","url":null,"abstract":"The traditional model of having simulations write data to disk for offline analysis can be prohibitively expensive on computers with limited storage capacity or I/O bandwidth. In situ data analysis has emerged as a necessary paradigm to address this issue and is expected to play an important role in exascale computing. We demonstrate the various aspects and challenges involved in setting up a comprehensive in situ data analysis pipeline that consists of a simulation coupled with compression and feature tracking routines, a framework for assessing compression quality, a middleware library for I/O and data management, and a workflow tool for composing and running the pipeline. We perform studies of compression mechanisms and parameters on two supercomputers, Summit at Oak Ridge National Laboratory and Theta at Argonne National Laboratory, for two example application pipelines. We show that the optimal choice of compression parameters varies with data, time, and analysis, and that periodic retuning of the in situ pipeline can improve compression quality. Finally, we discuss our perspective on the wider adoption of in situ data analysis and management practices and technologies in the HPC community.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130585927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond
In recent years, cloud computing has grown in popularity because it gives users easy and almost instantaneous access to different computational resources. Some cloud providers, like Amazon, have taken advantage of this growing popularity and offer their VMs under different purchasing models: on-demand, reserved, and spot. The last type is usually offered at lower prices but can be terminated by the provider at any time. To deal with such failures, checkpoint and recovery procedures are typically used. In this context, we propose and analyze checkpoint and recovery procedures for spot VMs using three different storage services from Amazon: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (EFS). These procedures were built upon the HADS framework, which is designed to schedule bag-of-tasks applications to spot and on-demand VMs. Our results showed that EBS outperformed the other approaches in the time spent recording a checkpoint, but required more time in the recovery procedure. EFS presented checkpointing and recovery times close to EBS, but with higher monetary costs than the other services. S3 proved to be the best option in terms of monetary cost, but required a longer time to record an individual checkpoint. However, when concurrent checkpoints were analyzed, which can occur in a real application with many tasks, S3 also outperformed EFS in execution time in our tests.
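A minimal sketch of the two checkpoint paths is shown below, assuming boto3 for the S3 upload and an ordinary file write for a mounted EBS or EFS volume; the bucket name, mount path, and checkpoint contents are placeholders, and the timing is only illustrative of how the services could be compared.

```python
# Two checkpoint paths: file write to a mounted volume vs. object upload to S3.
import pickle, time
import boto3

def checkpoint_to_volume(state, path="/mnt/ebs/checkpoint.pkl"):
    t0 = time.time()
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return time.time() - t0

def checkpoint_to_s3(state, bucket="my-checkpoint-bucket", key="task42/checkpoint.pkl"):
    t0 = time.time()
    body = pickle.dumps(state)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return time.time() - t0

state = {"task_id": 42, "iteration": 1000, "partial_result": list(range(10_000))}
print("volume write (s):", checkpoint_to_volume(state))
print("s3 upload    (s):", checkpoint_to_s3(state))
```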
{"title":"Developing Checkpointing and Recovery Procedures with the Storage Services of Amazon Web Services","authors":"Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond","doi":"10.1145/3409390.3409407","DOIUrl":"https://doi.org/10.1145/3409390.3409407","url":null,"abstract":"In recent years, cloud computing has grown in popularity as they give users easy and almost instantaneous access to different computational resources. Some cloud providers, like Amazon, took advantage of the growing popularity and offered their VMs in some different hiring types: on-demand, reserved, and spot. The last type is usually offered at lower prices but can be terminated by the provider at any time. To deal with those failures, checkpoint and recovery procedures are typically used. In this context, we propose and analyze checkpoint and recovery procedures using three different storage services from Amazon: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS), considering spot VMs. These procedures were built upon the HADS framework, designed to schedule bag-of-tasks applications to spot and on-demand VMs. Our results showed that EBS outperformed the other approaches in terms of time spent on recording a checkpoint. But it required more time in the recovery procedure. EFS presented checkpointing and recovery times close to EBS but with higher monetary costs than the other services. S3 proved to be the best option in terms of monetary cost but required a longer time for recording a checkpoint, individually. However, when concurrent checkpoints were analysed, which can occur in a real application with lots of tasks, in our tests, S3 outperformed EFS in terms of execution time also.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115514201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In natural language processing (NLP), the general way to capture the meaning of a word is via word embedding. A word embedding training model converts words into multidimensional vectors, turning words that carry no machine-readable “meaning” into vectors that do. Well-known word embedding training models include FastText, Word2Vec, and GloVe. They train words into vectors that are then used for further semantic classification. In this paper, we work on efficient support for FastText, an open-source library created by the Facebook AI Research (FAIR) lab that allows users to learn word embeddings and text classification. We focus on the word representation application in FastText, in which general matrix-vector multiplication (GEMV) is one of the most computationally intensive operations. We adjust the software architecture of FastText and pre-process the pre-trained model offline. In addition, we introduce a new acceleration method based on sparse matrix compression in Halide, which improves performance by compressing the matrix. Our support with Halide sparse compression schedulers includes hybrid compression schemes and re-ordering methods to improve performance.
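The core idea of compressing the embedding matrix before GEMV can be sketched with SciPy's CSR format (a stand-in for the Halide schedulers described in the paper); the matrix size and sparsity below are made up.

```python
# Sparse GEMV sketch: store only the nonzeros of a pruned embedding matrix in
# CSR form and multiply it against a query vector.
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
dense_shape = (50_000, 300)                       # vocabulary x embedding dimension
W = sparse_random(*dense_shape, density=0.05,     # 95% of entries pruned to zero
                  random_state=0, format="csr")
x = rng.standard_normal(dense_shape[1])

y = W @ x                                         # sparse GEMV: only nonzeros touched
print(y.shape, W.nnz, "stored values instead of", dense_shape[0] * dense_shape[1])
```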
{"title":"Devise Sparse Compression Schedulers to Enhance FastText Methods","authors":"Chen-Ting Chao, Wei-Hsu Chu, Chao-Lin Lee, Jenq-Kuen Lee, Ming-Yu Hung, Hsiang-Wei Sung","doi":"10.1145/3409390.3409394","DOIUrl":"https://doi.org/10.1145/3409390.3409394","url":null,"abstract":"In natural language processing(NLP), the general way to understand the meaning of a word is via word embedding. The word embedding training model can convert words into multidimensional vectors and make the words that do not know “meaning” into vectors with “meaning”. Famous word embedding training models, include models such as FastText, Word2Vec, and GloVe. They can train words into vectors and then they are used for further semantic classifications. In this paper, we work on the efficient support for the FastText. FastText is an open source library created by Facebook(FAIR) lab that allows users to learn word embedding and text classification. We focus on the word representation application in FastText, in which general matrix-Vector multiplication(GEMV) is one of the most computationally intensive operations. In this paper, we adjust the software architecture of FastText, and pre-process the pre-trained model offline. In addition, we introduce a new accelerating method with sparse matrix compression in Halide, which improves performance by compressing the matrix. Our support with Halide sparse compression schedulers include hybrid compression schemes and re-ordering methods to improve the performance.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122866417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo
The emergence of applications that can produce results at different levels of accuracy allows cloud consumers to leverage the advantages of elastic cloud resources and the pay-per-use pricing model. However, the trade-off between cost, accuracy, and execution time of cloud applications has not been well studied due to multiple challenges. A key challenge faced by a cloud consumer is tuning the application and determining, within a large configuration space, a cloud resource configuration that achieves the desired application accuracy. This paper proposes an approach to improve the cost-accuracy performance of cloud applications for a given cost and accuracy. To illustrate our approach, we use inference with two popular convolutional neural networks (CNNs) as examples, with pruning as the tuning knob for changing accuracy, and we draw several insights. Firstly, we show the existence of multiple degrees of pruning that act as “sweet spots”, where inference time and cost can be reduced without losing accuracy. Combining such sweet spots can halve inference cost and time with a one-tenth reduction in accuracy for the CaffeNet CNN. Secondly, we show that in the large resource configuration space, these sweet spots form the cost-accuracy and time-accuracy Pareto frontiers, whereby a Pareto-optimal configuration can reduce cost and execution time by 55% and 50%, respectively, while achieving the highest possible inference accuracy. Lastly, to quantify the accuracy performance of cloud applications, we introduce the Time Accuracy Ratio (TAR) and Cost Accuracy Ratio (CAR) metrics. We show that using TAR and CAR reduces the time complexity of determining cloud resource configurations from exponential to polynomial time.
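A small sketch of how such configurations might be ranked is given below. The abstract does not spell out the exact TAR and CAR formulas, so the sketch assumes TAR = inference time / accuracy and CAR = cost / accuracy (lower is better), and all configuration data, instance names, and pruning levels are synthetic.

```python
# Rank (pruning level, instance type) configurations and find a cost-accuracy
# Pareto front over synthetic data.
configs = [
    # (pruning_fraction, instance, time_s, cost_usd, accuracy)
    (0.0, "c5.xlarge", 120.0, 0.60, 0.79),
    (0.3, "c5.xlarge",  85.0, 0.42, 0.79),   # a "sweet spot": cheaper, same accuracy
    (0.5, "c5.large",   70.0, 0.30, 0.75),
    (0.8, "c5.large",   40.0, 0.17, 0.62),
]

def tar(c): return c[2] / c[4]          # assumed Time Accuracy Ratio
def car(c): return c[3] / c[4]          # assumed Cost Accuracy Ratio

def pareto(points, keys):
    """Keep configurations not dominated on all given objectives (lower is better)."""
    front = []
    for p in points:
        dominated = any(all(k(q) <= k(p) for k in keys) and
                        any(k(q) < k(p) for k in keys) for q in points)
        if not dominated:
            front.append(p)
    return front

for c in sorted(configs, key=car):
    print(c, "TAR=%.1f" % tar(c), "CAR=%.3f" % car(c))
print("cost-accuracy Pareto front:")
for c in pareto(configs, [lambda c: c[3], lambda c: -c[4]]):
    print(" ", c)
```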
{"title":"Characterizing the Cost-Accuracy Performance of Cloud Applications","authors":"Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo","doi":"10.1145/3409390.3409409","DOIUrl":"https://doi.org/10.1145/3409390.3409409","url":null,"abstract":"Emergence of applications that produce results with different accuracy allows cloud consumers to leverage the advantages of elastic cloud resources and pay-per-use pricing model. However, the trade-off between cost, accuracy and execution time of cloud applications has not been well studied due to multiple challenges. A key challenge faced by a cloud consumer is tuning the application and determining cloud resource configuration that achieves the desired application accuracy among the configuration space. This paper proposes an approach to improve the cost-accuracy performance of cloud applications for a given cost and accuracy. To illustrate our approach, we use two popular convolution neural networks’ (CNN) inference as examples with pruning as a tuning tool for changing the accuracy, and yield several insights. Firstly, we show the existence of multiple degrees of pruning as “sweet-spots”, where inference time and cost can be reduced without losing accuracy. Combining such sweet-spots can halve inference cost and time with one-tenth reduction in accuracy for Caffenet CNN. Secondly, we show that in the large resource configuration space, these “sweet-spots” form the cost-accuracy and time-accuracy Pareto-frontiers whereby a Pareto-optimal configuration can reduce cost and execution time by 55% and 50% respectively for achieving the highest possible inference accuracy. Lastly, to quantify the accuracy performance of cloud applications, we introduce Time Accuracy Ratio (TAR) and Cost Accuracy Ratio (CAR) metrics. We show that using TAR and CAR reduces the time complexity in determining cloud resource configurations from exponential to polynomial-time.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing need for computational performance is resulting in an increase in the energy consumption of HPC systems, which is a major challenge on the path to Exascale computing. To overcome this challenge, we developed a tuning plugin that targets applications exhibiting dynamically changing characteristics between iterations of the time loop, as well as changes in the control flow within the time loop itself. To analyze this inter-loop dynamism, we propose features that characterize the behaviour of loops for clustering via DBSCAN and spectral clustering. To save tuning time and cost, we implemented a random search strategy with a Gaussian probability distribution model to test a large number of system configurations in a single application run. The goal is to select the best CPU and uncore frequency configurations for groups of similarly behaving loops, as well as for individual instances of regions called within these loops, based on their unique computational characteristics. During production runs, the configurations are switched dynamically for different code regions. The results of our experiments on two highly dynamic real-world applications highlight the effectiveness of our methodology in optimizing energy efficiency.
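The loop-clustering step can be illustrated with scikit-learn's DBSCAN on synthetic per-loop feature vectors (the feature choice here, such as IPC, memory bandwidth, and arithmetic intensity, is an assumption; the paper defines its own features); each resulting cluster would then receive its own CPU and uncore frequency configuration.

```python
# Cluster per-loop feature vectors with DBSCAN so that similarly behaving loops
# can share one frequency configuration (synthetic data, invented features).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic behaviours: memory-bound loops and compute-bound loops.
memory_bound = rng.normal([0.6, 80.0, 0.2], [0.05, 5.0, 0.05], size=(30, 3))
compute_bound = rng.normal([2.5, 10.0, 4.0], [0.10, 2.0, 0.30], size=(30, 3))
features = np.vstack([memory_bound, compute_bound])   # columns: IPC, GB/s, flops/byte

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(
    StandardScaler().fit_transform(features))
for cluster in sorted(set(labels)):
    print("cluster", cluster, "->", int(np.sum(labels == cluster)), "loop instances")
```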
{"title":"Exploiting Dynamism in HPC Applications to Optimize Energy-Efficiency","authors":"Madhura Kumaraswamy, M. Gerndt","doi":"10.1145/3409390.3409399","DOIUrl":"https://doi.org/10.1145/3409390.3409399","url":null,"abstract":"The growing need for computational performance is resulting in an increase in the energy consumption of HPC systems, which is a major challenge to reach Exascale computing. To overcome this challenge, we developed a tuning plugin that targets applications that exhibit dynamically changing characteristics between iterations of the time loop as well as change in the control flow within the time loop itself. To analyze the inter-loop dynamism, we propose features to characterize the behaviour of loops for clustering via DBSCAN and spectral clustering. To save tuning time and costs, we implemented a random search strategy with a Gaussian probability distribution model to test a large number of system configurations in a single application run. The goal is to select the best configurations of the CPU and uncore frequencies for groups of similarly behaving loops, as well as individual instances of regions called within these loops based on their unique computational characteristics. During production runs, the configurations are dynamically switched for different code regions. The results of our experiments for two highly dynamic real-world applications highlight the effectiveness of our methodology in optimizing energy-efficiency.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With machine learning on the rise, mobile platforms are striving to offer inference acceleration on edge devices so that related applications can achieve satisfactory performance. With this background, this work aims at interfacing inference on Android with TVM, an inference-focused compiler for machine learning, and NNAPI, the official neural network API provided by Android. This work presents a flow for integrating NNAPI into TVM-generated inference models, together with a partition algorithm that determines which parts of the model should be computed on NNAPI and which should not. Our experiments show that properly partitioned models can achieve significant speedup using NNAPI compared to pure TVM-generated CPU inference. In addition, our enabling flow potentially benefits both frameworks by allowing them to leverage each other in AI model deployments.
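A simplified sketch of the partition decision is shown below: operators are walked in order and assigned to NNAPI when they appear in a supported-op set, with contiguous operators grouped to limit boundary transfers. The operator names and supported set are hypothetical, and the real flow operates on the TVM graph representation rather than a flat list.

```python
# Group a model's operator sequence into NNAPI and TVM-CPU segments based on a
# (hypothetical) set of NNAPI-supported operators.
NNAPI_SUPPORTED = {"conv2d", "depthwise_conv2d", "relu", "add", "avg_pool2d"}

def partition(ops):
    segments, current, current_target = [], [], None
    for op in ops:
        target = "nnapi" if op in NNAPI_SUPPORTED else "tvm_cpu"
        if target != current_target and current:
            segments.append((current_target, current))
            current = []
        current_target = target
        current.append(op)
    if current:
        segments.append((current_target, current))
    return segments

model_ops = ["conv2d", "relu", "conv2d", "relu", "softmax", "argmax"]
for target, seg in partition(model_ops):
    print(f"{target}: {seg}")
```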
{"title":"Enabling Android NNAPI Flow for TVM Runtime","authors":"Ming-Yi Lai, Chia-Yu Sung, Jenq-Kuen Lee, Ming-Yu Hung","doi":"10.1145/3409390.3409393","DOIUrl":"https://doi.org/10.1145/3409390.3409393","url":null,"abstract":"With machine learning on the rise, mobile platforms are striving to offer inference acceleration on edge devices so that related applications can achieve satisfiable performance. With this background, this work aims at interfacing inference on Android with TVM, an inference-focusing compiler for machine learning, and NNAPI, the official neural network API provided by Android. This work presents a flow to integrate NNAPI into TVM-generated inference model with a partition algorithm to determine which parts of the model should be computed on NNAPI and which should not. Conducted experiments show that properly partitioned models can achieve significant speedup using NNAPI when compared to pure TVM-generated CPU inference. In addition, our enable flow potentially benefits both frameworks by allowing them to leverage each other in AI model deployments.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125204430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}