In this study, we use the GPU to accelerate the power flow computation employed in the modeling and analysis of electric power distribution systems. We use kernels and parallel computation patterns (i.e., segmented scan and reduction) running on the GPU to accelerate a common power flow method known as the “forward-backward sweep”. To evaluate our approach, we compare the GPU-accelerated parallel implementation of this method, written in CUDA, to a serial implementation that runs on the CPU. We perform our tests on binary power distribution trees with node counts ranging from 1K to 256K. Our results show that the parallel implementation delivers up to a 3.9x total speedup over the serial implementation. As expected, for the parts of the computation that run entirely on the GPU, larger speedups are achieved as the size of the distribution tree increases. We also discuss how the topology of the tree would affect the results.
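To make the structure of the computation concrete, below is a minimal serial Python sketch of a forward-backward sweep on a tiny radial distribution tree; the node numbering, impedances, loads, and fixed iteration count are illustrative assumptions, not the paper's CUDA implementation. The backward sweep accumulates branch currents from the leaves toward the root, and the forward sweep updates voltages from the root outward; these are the two traversals that the segmented scan and reduction patterns parallelize.

```python
# Minimal serial forward-backward sweep on a radial feeder (illustrative sketch).
# Node 0 is the substation/root; parent[i] gives the parent of node i > 0.
parent = [None, 0, 0, 1, 1]                    # a tiny binary tree with 5 nodes
z = [None, 0.02 + 0.04j, 0.02 + 0.04j,         # branch impedance from parent[i] to i
     0.03 + 0.05j, 0.03 + 0.05j]
s_load = [0, 0.5 + 0.2j, 0.4 + 0.1j,           # complex power demand at each node (p.u.)
          0.3 + 0.1j, 0.2 + 0.05j]

v = [1.0 + 0j] * 5                             # initial voltage guess at every node
children = {i: [j for j, p in enumerate(parent) if p == i] for i in range(5)}

for _ in range(20):                            # fixed iteration count for simplicity
    # Backward sweep: branch currents accumulated from leaves to root.
    i_branch = [0j] * 5
    for node in range(4, 0, -1):               # children have larger indices than parents
        i_inj = (s_load[node] / v[node]).conjugate()
        i_branch[node] = i_inj + sum(i_branch[c] for c in children[node])
    # Forward sweep: voltages updated from root to leaves using branch currents.
    for node in range(1, 5):
        v[node] = v[parent[node]] - z[node] * i_branch[node]

print([abs(x) for x in v])
```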
{"title":"Accelerating Forward-Backward Sweep Power Flow Computation on the GPU","authors":"Saumya Shah, M. Zarghami, Pınar Muyan-Özçelik","doi":"10.1145/3409390.3409397","DOIUrl":"https://doi.org/10.1145/3409390.3409397","url":null,"abstract":"In this study, we accelerate power flow computation used in modeling and analysis of electric power distribution systems utilizing the GPU. We use kernels and parallel computation patterns (i.e., segmented scan and reduction) running on the GPU to accelerate a common method that is used to perform power flow computation called “forward-backward sweep”. To evaluate our approach, we compare the GPU-accelerated parallel implementation of this method written in CUDA to the serial implementation that runs on the CPU. We perform our tests on binary power distribution trees that have number of nodes between 1K to 256K. Our results show that the parallel implementation brings up to 3.9x total speedup over the serial implementation. As expected, for the parts of the computation that entirely run on the GPU, larger speedups are achieved as the size of the distribution tree increases. We also provide a discussion on how the topology of the tree would affect the results.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134461910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
By modelling how the probability distributions of individuals’ states evolve as new information flows through a network, belief propagation has broad applicability, ranging from image correction to virus propagation to social networks. Yet its few existing implementations are largely confined to small Bayesian networks, leaving applications of the algorithm to large-scale graphs out of reach. To promote its broader adoption, we enable belief propagation on both small and large graphs using GPU processing. We explore a host of optimizations, including a simple yet extensible input format that lets belief propagation operate at massive scale, along with significant workload-processing updates and meticulous memory management, which allow our implementation to outperform prior work in raw execution time and input size on a single machine. Applying a suite of parallelization technologies and techniques to a diverse set of graphs, we demonstrate that our implementations can efficiently process even massive networks, achieving up to nearly 121x speedup versus our optimized single-threaded control implementations while supporting graphs of over ten million nodes, in contrast to previous works’ support for thousands of nodes using CPU-based multi-core and host solutions. To assist in choosing the optimal implementation for a given graph, we provide a method that uses a random forest classifier and graph metadata, achieving a nearly 95% F1-score in our initial benchmarking; the method is portable to different GPU architectures, where it achieves an F1-score of over 72% and a speedup of nearly 183x versus our control running in the new environment.
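The implementation-selection step can be illustrated with a small scikit-learn sketch: a random forest is trained on graph metadata to predict which belief-propagation variant to run. The feature set, labels, and data below are invented placeholders, not the authors' benchmark suite.

```python
# Illustrative sketch: predict the best parallel belief-propagation variant
# from graph metadata using a random forest (synthetic data throughout).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Each row: [num_nodes, num_edges, avg_degree, max_degree, is_bipartite]
X = rng.random((200, 5)) * [1e7, 5e7, 50, 500, 1]
y = rng.integers(0, 3, size=200)      # 0/1/2: hypothetical GPU implementations

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```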
{"title":"Rumor Has It: Optimizing the Belief Propagation Algorithm for Parallel Processing","authors":"Michael Trotter, Timothy Wood, H. H. Huang","doi":"10.1145/3409390.3409401","DOIUrl":"https://doi.org/10.1145/3409390.3409401","url":null,"abstract":"By modelling how the probability distributions of individuals’ states evolve as new information flows through a network, belief propagation has broad applicability ranging from image correction to virus propagation to even social networks. Yet, its scant implementations confine themselves largely to the realm of small Bayesian networks. Applications of the algorithm to graphs of large scale are thus unfortunately out of reach. To promote its broad acceptance, we enable belief propagation for both small and large scale graphs utilizing GPU processing. We therefore explore a host of optimizations including a new simple yet extensible input format enabling belief propagation to operate at massive scale, along with significant workload processing updates and meticulous memory management to enable our implementation to outperform prior works in terms of raw execution time and input size on a single machine. Utilizing a suite of parallelization technologies and techniques against a diverse set of graphs, we demonstrate that our implementations can efficiently process even massive networks, achieving up to nearly 121x speedups versus our control yet optimized single threaded implementations while supporting graphs of over ten million nodes in size in contrast to previous works’ support for thousands of nodes using CPU-based multi-core and host solutions. To assist in choosing the optimal implementation for a given graph, we provide a promising method utilizing a random forest classifier and graph metadata with a nearly 95% F1-score from our initial benchmarking and is portable to different GPU architectures to achieve over an F1-score of over 72% accuracy and a speedup of nearly 183x versus our control running in this new environment.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128349285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs
Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this work, we study how application-level scheduling techniques can leverage moldability (i.e., the flexibility to execute a task as either single-threaded or multithreaded) and explicit knowledge of task criticality to handle scenarios in which system performance is not only unknown but also changing over time. Our proposed task scheduler dynamically learns the performance characteristics of the underlying platform and uses this knowledge to devise schedules that are aware of dynamic performance asymmetry, thereby reducing the impact of interference. Our evaluation shows that both criticality-aware scheduling and parallelism tuning are effective schemes for addressing interference in both shared and distributed memory applications.
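As a rough illustration of the idea (not the authors' scheduler), the sketch below assigns a thread count to each task from a measured speedup curve: critical tasks are molded to the widest thread count that still yields acceptable parallel efficiency, while non-critical tasks stay single-threaded to leave resources for the critical path. The efficiency threshold, task names, and the synthetic speedup curve are assumptions.

```python
# Toy criticality-aware moldable scheduling sketch (illustrative only).
def pick_width(measured_speedup, max_threads, efficiency_floor=0.6):
    """Choose the largest thread count whose parallel efficiency stays acceptable."""
    best = 1
    for t in range(2, max_threads + 1):
        if measured_speedup(t) / t >= efficiency_floor:
            best = t
    return best

def schedule(tasks, max_threads, measured_speedup):
    plan = {}
    for task in tasks:                 # tasks: list of dicts with 'name', 'critical'
        if task["critical"]:
            plan[task["name"]] = pick_width(measured_speedup, max_threads)
        else:
            plan[task["name"]] = 1     # keep cores free for the critical path
    return plan

# The speedup curve would be learned online; here a synthetic diminishing-returns curve.
speedup = lambda t: t / (1 + 0.05 * (t - 1))
tasks = [{"name": "lu_panel", "critical": True},
         {"name": "trailing_update", "critical": False}]
print(schedule(tasks, max_threads=16, measured_speedup=speedup))
```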
{"title":"Scheduling Task-parallel Applications in Dynamically Asymmetric Environments","authors":"J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs","doi":"10.1145/3409390.3409408","DOIUrl":"https://doi.org/10.1145/3409390.3409408","url":null,"abstract":"Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this work, we study how application-level scheduling techniques can leverage moldability (i.e. flexibility to work as either single-threaded or multithreaded task) and explicit knowledge on task criticality to handle scenarios in which system performance is not only unknown but also changing over time. Our proposed task scheduler dynamically learns the performance characteristics of the underlying platform and uses this knowledge to devise better schedules aware of dynamic performance asymmetry, hence reducing the impact of interference. Our evaluation shows that both criticality-aware scheduling and parallelism tuning are effective schemes to address interference in both shared and distributed memory applications.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"575 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123127486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPUs are well established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application’s memory access patterns, and finding it is usually a tedious task for programmers, both in making the decision and in implementing it. We leverage this observation in a concept that automatically compiles single-GPU code into multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and on a toolchain that compiles CUDA applications fully automatically, without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.
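A toy Python sketch of the block-boundary partitioning idea is shown below; it splits a regular, element-wise kernel's iteration space across GPUs and records the contiguous buffer slice each device touches. The real toolchain derives these footprints from a polyhedral model of the memory accesses, so this is only a simplified analogue with made-up sizes.

```python
# Simplified block-boundary partitioning for a regular data-parallel kernel
# (e.g., element-wise vector add): assign whole thread blocks to GPUs and
# compute the contiguous slice of the buffer each device needs.
def partition(num_elements, block_size, num_gpus):
    num_blocks = (num_elements + block_size - 1) // block_size
    blocks_per_gpu = (num_blocks + num_gpus - 1) // num_gpus
    plan = []
    for gpu in range(num_gpus):
        first_block = gpu * blocks_per_gpu
        last_block = min(first_block + blocks_per_gpu, num_blocks)
        if first_block >= last_block:
            break
        start = first_block * block_size
        stop = min(last_block * block_size, num_elements)
        plan.append({"gpu": gpu, "elements": (start, stop),
                     "buffer_slice": slice(start, stop)})
    return plan

for part in partition(num_elements=1 << 20, block_size=256, num_gpus=4):
    print(part)
```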
{"title":"Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation","authors":"Alexander Matz, J. Doerfert, H. Fröning","doi":"10.1145/3409390.3409403","DOIUrl":"https://doi.org/10.1145/3409390.3409403","url":null,"abstract":"GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application’s memory access patterns, and usually a tedious task for programmers in terms of decision and implementation. We leverage this observation for a concept that automatically compiles single-GPU code to multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and the toolchain to fully automatically compile CUDA applications without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson
The traditional model of having simulations write data to disk for offline analysis can be prohibitively expensive on computers with limited storage capacity or I/O bandwidth. In situ data analysis has emerged as a necessary paradigm to address this issue and is expected to play an important role in exascale computing. We demonstrate the various aspects and challenges involved in setting up a comprehensive in situ data analysis pipeline that consists of a simulation coupled with compression and feature tracking routines, a framework for assessing compression quality, a middleware library for I/O and data management, and a workflow tool for composing and running the pipeline. We perform studies of compression mechanisms and parameters on two supercomputers, Summit at Oak Ridge National Laboratory and Theta at Argonne National Laboratory, for two example application pipelines. We show that the optimal choice of compression parameters varies with data, time, and analysis, and that periodic retuning of the in situ pipeline can improve compression quality. Finally, we discuss our perspective on the wider adoption of in situ data analysis and management practices and technologies in the HPC community.
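The kind of per-step quality assessment such a pipeline performs can be sketched as follows; a uniform quantizer stands in for a real lossy compressor (the paper studies actual compression mechanisms and parameters), and the error-bound values and synthetic field are arbitrary.

```python
# Sketch of an in situ compression-quality check run after each simulation step.
import numpy as np

def quantize(field, tolerance):
    """Lossy stand-in compressor: round values to a grid of spacing 2*tolerance."""
    step = 2.0 * tolerance
    return np.round(field / step) * step

def quality_metrics(original, reconstructed):
    err = original - reconstructed
    mse = float(np.mean(err ** 2))
    value_range = float(original.max() - original.min()) or 1.0
    psnr = 10.0 * np.log10(value_range ** 2 / mse) if mse > 0 else float("inf")
    return {"max_abs_error": float(np.abs(err).max()), "psnr_db": psnr}

field = np.sin(np.linspace(0, 8 * np.pi, 100_000))   # a synthetic "simulation" field
for tol in (1e-2, 1e-3, 1e-4):                       # periodic retuning would revisit tol
    print(tol, quality_metrics(field, quantize(field, tol)))
```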
{"title":"Feature-preserving Lossy Compression for In Situ Data Analysis","authors":"I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson","doi":"10.1145/3409390.3409400","DOIUrl":"https://doi.org/10.1145/3409390.3409400","url":null,"abstract":"The traditional model of having simulations write data to disk for offline analysis can be prohibitively expensive on computers with limited storage capacity or I/O bandwidth. In situ data analysis has emerged as a necessary paradigm to address this issue and is expected to play an important role in exascale computing. We demonstrate the various aspects and challenges involved in setting up a comprehensive in situ data analysis pipeline that consists of a simulation coupled with compression and feature tracking routines, a framework for assessing compression quality, a middleware library for I/O and data management, and a workflow tool for composing and running the pipeline. We perform studies of compression mechanisms and parameters on two supercomputers, Summit at Oak Ridge National Laboratory and Theta at Argonne National Laboratory, for two example application pipelines. We show that the optimal choice of compression parameters varies with data, time, and analysis, and that periodic retuning of the in situ pipeline can improve compression quality. Finally, we discuss our perspective on the wider adoption of in situ data analysis and management practices and technologies in the HPC community.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130585927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond
In recent years, cloud computing has grown in popularity because it gives users easy and almost instantaneous access to different computational resources. Some cloud providers, like Amazon, have taken advantage of this growing popularity and offer their VMs under different purchasing models: on-demand, reserved, and spot. The last type is usually offered at lower prices but can be terminated by the provider at any time. To deal with such failures, checkpoint and recovery procedures are typically used. In this context, we propose and analyze checkpoint and recovery procedures for spot VMs using three different storage services from Amazon: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (EFS). These procedures were built upon the HADS framework, which is designed to schedule bag-of-tasks applications to spot and on-demand VMs. Our results showed that EBS outperformed the other approaches in the time spent recording a checkpoint, but required more time in the recovery procedure. EFS presented checkpointing and recovery times close to EBS, but with higher monetary costs than the other services. S3 proved to be the best option in terms of monetary cost, but required a longer time to record an individual checkpoint. However, when concurrent checkpoints were analyzed, which can occur in a real application with many tasks, S3 also outperformed EFS in execution time in our tests.
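A minimal sketch of the two checkpoint paths is shown below, assuming boto3 for the S3 upload and an ordinary file write for a mounted EBS or EFS volume; the bucket name, mount path, and checkpoint contents are placeholders, and the timing is only illustrative of how the services could be compared.

```python
# Two checkpoint paths: file write to a mounted volume vs. object upload to S3.
import pickle, time
import boto3

def checkpoint_to_volume(state, path="/mnt/ebs/checkpoint.pkl"):
    t0 = time.time()
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return time.time() - t0

def checkpoint_to_s3(state, bucket="my-checkpoint-bucket", key="task42/checkpoint.pkl"):
    t0 = time.time()
    body = pickle.dumps(state)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return time.time() - t0

state = {"task_id": 42, "iteration": 1000, "partial_result": list(range(10_000))}
print("volume write (s):", checkpoint_to_volume(state))
print("s3 upload    (s):", checkpoint_to_s3(state))
```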
{"title":"Developing Checkpointing and Recovery Procedures with the Storage Services of Amazon Web Services","authors":"Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond","doi":"10.1145/3409390.3409407","DOIUrl":"https://doi.org/10.1145/3409390.3409407","url":null,"abstract":"In recent years, cloud computing has grown in popularity as they give users easy and almost instantaneous access to different computational resources. Some cloud providers, like Amazon, took advantage of the growing popularity and offered their VMs in some different hiring types: on-demand, reserved, and spot. The last type is usually offered at lower prices but can be terminated by the provider at any time. To deal with those failures, checkpoint and recovery procedures are typically used. In this context, we propose and analyze checkpoint and recovery procedures using three different storage services from Amazon: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS), considering spot VMs. These procedures were built upon the HADS framework, designed to schedule bag-of-tasks applications to spot and on-demand VMs. Our results showed that EBS outperformed the other approaches in terms of time spent on recording a checkpoint. But it required more time in the recovery procedure. EFS presented checkpointing and recovery times close to EBS but with higher monetary costs than the other services. S3 proved to be the best option in terms of monetary cost but required a longer time for recording a checkpoint, individually. However, when concurrent checkpoints were analysed, which can occur in a real application with lots of tasks, in our tests, S3 outperformed EFS in terms of execution time also.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115514201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In natural language processing (NLP), the general way to capture the meaning of a word is via word embedding. A word embedding training model converts words into multidimensional vectors, turning words that carry no machine-readable “meaning” into vectors that do. Well-known word embedding training models include FastText, Word2Vec, and GloVe. They train words into vectors that are then used for further semantic classification. In this paper, we work on efficient support for FastText, an open-source library created by the Facebook AI Research (FAIR) lab that allows users to learn word embeddings and text classification. We focus on the word representation application in FastText, in which general matrix-vector multiplication (GEMV) is one of the most computationally intensive operations. We adjust the software architecture of FastText and pre-process the pre-trained model offline. In addition, we introduce a new acceleration method based on sparse matrix compression in Halide, which improves performance by compressing the matrix. Our support with Halide sparse compression schedulers includes hybrid compression schemes and re-ordering methods to improve performance.
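The core idea of compressing the embedding matrix before GEMV can be sketched with SciPy's CSR format (a stand-in for the Halide schedulers described in the paper); the matrix size and sparsity below are made up.

```python
# Sparse GEMV sketch: store only the nonzeros of a pruned embedding matrix in
# CSR form and multiply it against a query vector.
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
dense_shape = (50_000, 300)                       # vocabulary x embedding dimension
W = sparse_random(*dense_shape, density=0.05,     # 95% of entries pruned to zero
                  random_state=0, format="csr")
x = rng.standard_normal(dense_shape[1])

y = W @ x                                         # sparse GEMV: only nonzeros touched
print(y.shape, W.nnz, "stored values instead of", dense_shape[0] * dense_shape[1])
```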
{"title":"Devise Sparse Compression Schedulers to Enhance FastText Methods","authors":"Chen-Ting Chao, Wei-Hsu Chu, Chao-Lin Lee, Jenq-Kuen Lee, Ming-Yu Hung, Hsiang-Wei Sung","doi":"10.1145/3409390.3409394","DOIUrl":"https://doi.org/10.1145/3409390.3409394","url":null,"abstract":"In natural language processing(NLP), the general way to understand the meaning of a word is via word embedding. The word embedding training model can convert words into multidimensional vectors and make the words that do not know “meaning” into vectors with “meaning”. Famous word embedding training models, include models such as FastText, Word2Vec, and GloVe. They can train words into vectors and then they are used for further semantic classifications. In this paper, we work on the efficient support for the FastText. FastText is an open source library created by Facebook(FAIR) lab that allows users to learn word embedding and text classification. We focus on the word representation application in FastText, in which general matrix-Vector multiplication(GEMV) is one of the most computationally intensive operations. In this paper, we adjust the software architecture of FastText, and pre-process the pre-trained model offline. In addition, we introduce a new accelerating method with sparse matrix compression in Halide, which improves performance by compressing the matrix. Our support with Halide sparse compression schedulers include hybrid compression schemes and re-ordering methods to improve the performance.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122866417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo
The emergence of applications that can produce results at different levels of accuracy allows cloud consumers to leverage the advantages of elastic cloud resources and the pay-per-use pricing model. However, the trade-off between cost, accuracy, and execution time of cloud applications has not been well studied due to multiple challenges. A key challenge faced by a cloud consumer is tuning the application and determining, within a large configuration space, a cloud resource configuration that achieves the desired application accuracy. This paper proposes an approach to improve the cost-accuracy performance of cloud applications for a given cost and accuracy. To illustrate our approach, we use inference with two popular convolutional neural networks (CNNs) as examples, with pruning as the tuning knob for changing accuracy, and we draw several insights. Firstly, we show the existence of multiple degrees of pruning that act as “sweet spots”, where inference time and cost can be reduced without losing accuracy. Combining such sweet spots can halve inference cost and time with a one-tenth reduction in accuracy for the CaffeNet CNN. Secondly, we show that in the large resource configuration space, these sweet spots form the cost-accuracy and time-accuracy Pareto frontiers, whereby a Pareto-optimal configuration can reduce cost and execution time by 55% and 50%, respectively, while achieving the highest possible inference accuracy. Lastly, to quantify the accuracy performance of cloud applications, we introduce the Time Accuracy Ratio (TAR) and Cost Accuracy Ratio (CAR) metrics. We show that using TAR and CAR reduces the time complexity of determining cloud resource configurations from exponential to polynomial time.
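A small sketch of how such configurations might be ranked is given below. The abstract does not spell out the exact TAR and CAR formulas, so the sketch assumes TAR = inference time / accuracy and CAR = cost / accuracy (lower is better), and all configuration data, instance names, and pruning levels are synthetic.

```python
# Rank (pruning level, instance type) configurations and find a cost-accuracy
# Pareto front over synthetic data.
configs = [
    # (pruning_fraction, instance, time_s, cost_usd, accuracy)
    (0.0, "c5.xlarge", 120.0, 0.60, 0.79),
    (0.3, "c5.xlarge",  85.0, 0.42, 0.79),   # a "sweet spot": cheaper, same accuracy
    (0.5, "c5.large",   70.0, 0.30, 0.75),
    (0.8, "c5.large",   40.0, 0.17, 0.62),
]

def tar(c): return c[2] / c[4]          # assumed Time Accuracy Ratio
def car(c): return c[3] / c[4]          # assumed Cost Accuracy Ratio

def pareto(points, keys):
    """Keep configurations not dominated on all given objectives (lower is better)."""
    front = []
    for p in points:
        dominated = any(all(k(q) <= k(p) for k in keys) and
                        any(k(q) < k(p) for k in keys) for q in points)
        if not dominated:
            front.append(p)
    return front

for c in sorted(configs, key=car):
    print(c, "TAR=%.1f" % tar(c), "CAR=%.3f" % car(c))
print("cost-accuracy Pareto front:")
for c in pareto(configs, [lambda c: c[3], lambda c: -c[4]]):
    print(" ", c)
```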
{"title":"Characterizing the Cost-Accuracy Performance of Cloud Applications","authors":"Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo","doi":"10.1145/3409390.3409409","DOIUrl":"https://doi.org/10.1145/3409390.3409409","url":null,"abstract":"Emergence of applications that produce results with different accuracy allows cloud consumers to leverage the advantages of elastic cloud resources and pay-per-use pricing model. However, the trade-off between cost, accuracy and execution time of cloud applications has not been well studied due to multiple challenges. A key challenge faced by a cloud consumer is tuning the application and determining cloud resource configuration that achieves the desired application accuracy among the configuration space. This paper proposes an approach to improve the cost-accuracy performance of cloud applications for a given cost and accuracy. To illustrate our approach, we use two popular convolution neural networks’ (CNN) inference as examples with pruning as a tuning tool for changing the accuracy, and yield several insights. Firstly, we show the existence of multiple degrees of pruning as “sweet-spots”, where inference time and cost can be reduced without losing accuracy. Combining such sweet-spots can halve inference cost and time with one-tenth reduction in accuracy for Caffenet CNN. Secondly, we show that in the large resource configuration space, these “sweet-spots” form the cost-accuracy and time-accuracy Pareto-frontiers whereby a Pareto-optimal configuration can reduce cost and execution time by 55% and 50% respectively for achieving the highest possible inference accuracy. Lastly, to quantify the accuracy performance of cloud applications, we introduce Time Accuracy Ratio (TAR) and Cost Accuracy Ratio (CAR) metrics. We show that using TAR and CAR reduces the time complexity in determining cloud resource configurations from exponential to polynomial-time.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing need for computational performance is resulting in an increase in the energy consumption of HPC systems, which is a major challenge on the path to Exascale computing. To overcome this challenge, we developed a tuning plugin that targets applications exhibiting dynamically changing characteristics between iterations of the time loop, as well as changes in the control flow within the time loop itself. To analyze this inter-loop dynamism, we propose features that characterize the behaviour of loops for clustering via DBSCAN and spectral clustering. To save tuning time and cost, we implemented a random search strategy with a Gaussian probability distribution model to test a large number of system configurations in a single application run. The goal is to select the best CPU and uncore frequency configurations for groups of similarly behaving loops, as well as for individual instances of regions called within these loops, based on their unique computational characteristics. During production runs, the configurations are switched dynamically for different code regions. The results of our experiments on two highly dynamic real-world applications highlight the effectiveness of our methodology in optimizing energy efficiency.
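The loop-clustering step can be illustrated with scikit-learn's DBSCAN on synthetic per-loop feature vectors (the feature choice here, such as IPC, memory bandwidth, and arithmetic intensity, is an assumption; the paper defines its own features); each resulting cluster would then receive its own CPU and uncore frequency configuration.

```python
# Cluster per-loop feature vectors with DBSCAN so that similarly behaving loops
# can share one frequency configuration (synthetic data, invented features).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic behaviours: memory-bound loops and compute-bound loops.
memory_bound = rng.normal([0.6, 80.0, 0.2], [0.05, 5.0, 0.05], size=(30, 3))
compute_bound = rng.normal([2.5, 10.0, 4.0], [0.10, 2.0, 0.30], size=(30, 3))
features = np.vstack([memory_bound, compute_bound])   # columns: IPC, GB/s, flops/byte

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(
    StandardScaler().fit_transform(features))
for cluster in sorted(set(labels)):
    print("cluster", cluster, "->", int(np.sum(labels == cluster)), "loop instances")
```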
{"title":"Exploiting Dynamism in HPC Applications to Optimize Energy-Efficiency","authors":"Madhura Kumaraswamy, M. Gerndt","doi":"10.1145/3409390.3409399","DOIUrl":"https://doi.org/10.1145/3409390.3409399","url":null,"abstract":"The growing need for computational performance is resulting in an increase in the energy consumption of HPC systems, which is a major challenge to reach Exascale computing. To overcome this challenge, we developed a tuning plugin that targets applications that exhibit dynamically changing characteristics between iterations of the time loop as well as change in the control flow within the time loop itself. To analyze the inter-loop dynamism, we propose features to characterize the behaviour of loops for clustering via DBSCAN and spectral clustering. To save tuning time and costs, we implemented a random search strategy with a Gaussian probability distribution model to test a large number of system configurations in a single application run. The goal is to select the best configurations of the CPU and uncore frequencies for groups of similarly behaving loops, as well as individual instances of regions called within these loops based on their unique computational characteristics. During production runs, the configurations are dynamically switched for different code regions. The results of our experiments for two highly dynamic real-world applications highlight the effectiveness of our methodology in optimizing energy-efficiency.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With machine learning on the rise, mobile platforms are striving to offer inference acceleration on edge devices so that related applications can achieve satisfactory performance. With this background, this work aims at interfacing inference on Android with TVM, an inference-focused compiler for machine learning, and NNAPI, the official neural network API provided by Android. This work presents a flow for integrating NNAPI into TVM-generated inference models, together with a partition algorithm that determines which parts of the model should be computed on NNAPI and which should not. Our experiments show that properly partitioned models can achieve significant speedup using NNAPI compared to pure TVM-generated CPU inference. In addition, our enabling flow potentially benefits both frameworks by allowing them to leverage each other in AI model deployments.
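A simplified sketch of the partition decision is shown below: operators are walked in order and assigned to NNAPI when they appear in a supported-op set, with contiguous operators grouped to limit boundary transfers. The operator names and supported set are hypothetical, and the real flow operates on the TVM graph representation rather than a flat list.

```python
# Group a model's operator sequence into NNAPI and TVM-CPU segments based on a
# (hypothetical) set of NNAPI-supported operators.
NNAPI_SUPPORTED = {"conv2d", "depthwise_conv2d", "relu", "add", "avg_pool2d"}

def partition(ops):
    segments, current, current_target = [], [], None
    for op in ops:
        target = "nnapi" if op in NNAPI_SUPPORTED else "tvm_cpu"
        if target != current_target and current:
            segments.append((current_target, current))
            current = []
        current_target = target
        current.append(op)
    if current:
        segments.append((current_target, current))
    return segments

model_ops = ["conv2d", "relu", "conv2d", "relu", "softmax", "argmax"]
for target, seg in partition(model_ops):
    print(f"{target}: {seg}")
```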
{"title":"Enabling Android NNAPI Flow for TVM Runtime","authors":"Ming-Yi Lai, Chia-Yu Sung, Jenq-Kuen Lee, Ming-Yu Hung","doi":"10.1145/3409390.3409393","DOIUrl":"https://doi.org/10.1145/3409390.3409393","url":null,"abstract":"With machine learning on the rise, mobile platforms are striving to offer inference acceleration on edge devices so that related applications can achieve satisfiable performance. With this background, this work aims at interfacing inference on Android with TVM, an inference-focusing compiler for machine learning, and NNAPI, the official neural network API provided by Android. This work presents a flow to integrate NNAPI into TVM-generated inference model with a partition algorithm to determine which parts of the model should be computed on NNAPI and which should not. Conducted experiments show that properly partitioned models can achieve significant speedup using NNAPI when compared to pure TVM-generated CPU inference. In addition, our enable flow potentially benefits both frameworks by allowing them to leverage each other in AI model deployments.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125204430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}