
2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA): Latest Publications

Platform Agnostic Streaming Data Application Performance Models
Clayton J. Faber, Tom Plano, Samatha Kodali, Zhili Xiao, Abhishek Dwaraki, J. Buhler, R. Chamberlain, A. Cabrera
The mapping of computational needs onto execution resources is, by and large, a manual task, and users are frequently guided simply by intuition and past experience. We present a queueing-theory-based performance model for streaming data applications that takes steps toward a better understanding of resource mapping decisions, thereby helping application developers make good mapping choices. The performance model (and associated cost model) is agnostic to the specific properties of the compute resource and application, characterizing them simply by their achievable data throughput. We illustrate the model with a pair of applications, one drawn from computational biology and the other a classic machine learning problem.
{"title":"Platform Agnostic Streaming Data Application Performance Models","authors":"Clayton J. Faber, Tom Plano, Samatha Kodali, Zhili Xiao, Abhishek Dwaraki, J. Buhler, R. Chamberlain, A. Cabrera","doi":"10.1109/rsdha54838.2021.00008","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00008","url":null,"abstract":"The mapping of computational needs onto execution resources is, by and large, a manual task, and users are frequently guided simply by intuition and past experiences. We present a queueing theory based performance model for streaming data applications that takes steps towards a better understanding of resource mapping decisions, thereby assisting application developers to make good mapping choices. The performance model (and associated cost model) are agnostic to the specific properties of the compute resource and application, simply characterizing them by their achievable data throughput. We illustrate the model with a pair of applications, one chosen from the field of computational biology and the second is a classic machine learning problem.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128052767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
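The abstract does not give the model's equations. As a rough illustration of the throughput-centric queueing approach it describes, the minimal sketch below treats each candidate mapping of a streaming stage onto a compute resource as an M/M/1 server characterized only by its achievable throughput, and estimates mean latency alongside a simple cost. The resource names, rates, and cost figures are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch: compare resource mappings for a streaming stage
# using M/M/1 queueing formulas. Each resource is characterized solely by
# its achievable throughput (items/s), mirroring the platform-agnostic idea.

from dataclasses import dataclass

@dataclass
class Mapping:
    name: str             # e.g., "CPU", "GPU" -- illustrative labels
    throughput: float     # achievable service rate mu (items/s)
    cost_per_hour: float  # assumed pricing, not from the paper

def mm1_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean sojourn time W = 1 / (mu - lambda) for a stable M/M/1 queue."""
    if arrival_rate >= service_rate:
        return float("inf")  # unstable: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

def evaluate(mappings, arrival_rate):
    for m in mappings:
        w = mm1_latency(arrival_rate, m.throughput)
        print(f"{m.name:4s} latency={w * 1e3:8.2f} ms  cost=${m.cost_per_hour:.2f}/h")

evaluate(
    [Mapping("CPU", 1200.0, 0.10), Mapping("GPU", 9000.0, 0.90)],
    arrival_rate=1000.0,  # offered load (items/s)
)
```

Under this toy model the GPU mapping wins on latency but costs 9x more per hour, which is exactly the kind of throughput-versus-cost trade-off the paper's model is meant to expose.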
[Copyright notice]
{"title":"[Copyright notice]","authors":"","doi":"10.1109/rsdha54838.2021.00002","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00002","url":null,"abstract":"","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130231700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ELIχR: Eliminating Computation Redundancy in CNN-Based Video Processing
Jordan Schmerge, Daniel Mawhirter, Connor Holmes, Jedidiah McClurg, Bo-Zong Wu
Video processing frequently relies on applying convolutional neural networks (CNNs) for various tasks, including object tracking, real-time action classification, and image recognition. Due to complicated network design, processing even a single frame requires many operations, leading to low throughput and high latency. This process can be parallelized, but since consecutive images have similar content, most of these operations produce identical results, leading to inefficient usage of parallel hardware accelerators. In this paper, we present ELIχR, a software system that systematically addresses this computation redundancy problem in an architecture-independent way, using two key techniques. First, ELIχR implements a lightweight change propagation algorithm to automatically determine which data to recompute for each new frame based on changes in the input. Second, ELIχR implements a dynamic check to further reduce needed computations by leveraging special operators in the model (e.g., ReLU), and trading off accuracy for performance. We evaluate ELIχR on two real-world models, Inception V3 and Resnet-50, and two video streams. We show that ELIχR running on the CPU produces up to 3.49X speedup (1.76X on average) compared with frame sampling, given the same accuracy and real-time processing requirements, and we describe how our approach can be applied in an architecture-independent way to improve CNN performance in heterogeneous systems.
{"title":"ELIχR: Eliminating Computation Redundancy in CNN-Based Video Processing","authors":"Jordan Schmerge, Daniel Mawhirter, Connor Holmes, Jedidiah McClurg, Bo-Zong Wu","doi":"10.1109/rsdha54838.2021.00010","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00010","url":null,"abstract":"Video processing frequently relies on applying convolutional neural networks (CNNs) for various tasks, including object tracking, real-time action classification, and image recognition. Due to complicated network design, processing even a single frame requires many operations, leading to low throughput and high latency. This process can be parallelized, but since consecutive images have similar content, most of these operations produce identical results, leading to inefficient usage of parallel hardware accelerators. In this paper, we present ELIχR, a software system that systematically addresses this computation redundancy problem in an architecture-independent way, using two key techniques. First, ELIχR implements a lightweight change propagation algorithm to automatically determine which data to recompute for each new frame based on changes in the input. Second, ELIχR implements a dynamic check to further reduce needed computations by leveraging special operators in the model (e.g., ReLU), and trading off accuracy for performance. We evaluate ELIχR on two real-world models, Inception V3 and Resnet-50, and two video streams. We show that ELIχR running on the CPU produces up to 3.49X speedup (1.76X on average) compared with frame sampling, given the same accuracy and real-time processing requirements, and we describe how our approach can be applied in an architecture-independent way to improve CNN performance in heterogeneous systems.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129088666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
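ELIχR's change propagation algorithm and ReLU-based dynamic check are not spelled out in the abstract. The sketch below is a hand-written illustration of the general idea only: cache a convolution output and recompute just the positions whose input window changed between consecutive frames. The 1-D setting, change threshold, and function names are all simplifying assumptions.

```python
# Illustrative sketch (not ELIχR's actual algorithm): incremental 1-D
# convolution across video frames. Only output positions whose receptive
# field changed since the previous frame are recomputed.

import numpy as np

def conv1d_full(x, k):
    # Cross-correlation, so output[i] = dot(x[i:i+len(k)], k).
    return np.correlate(x, k, mode="valid")

def conv1d_incremental(x_new, x_old, y_old, k, eps=1e-6):
    """Recompute y[i] only where the receptive field of i changed."""
    changed = np.abs(x_new - x_old) > eps          # per-input change mask
    y = y_old.copy()
    klen = len(k)
    for i in range(len(y)):
        if changed[i:i + klen].any():              # window touched a change
            y[i] = np.dot(x_new[i:i + klen], k)
    return y

rng = np.random.default_rng(0)
frame0 = rng.standard_normal(1024)
frame1 = frame0.copy()
frame1[100:110] += 0.5                             # small inter-frame change
kernel = rng.standard_normal(7)

y0 = conv1d_full(frame0, kernel)
y1 = conv1d_incremental(frame1, frame0, y0, kernel)
assert np.allclose(y1, conv1d_full(frame1, kernel))
```

With a 10-element change and a 7-tap kernel, only 16 of 1018 outputs are recomputed, which is the source of the frame-to-frame savings the paper exploits at CNN scale.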
Energy Efficient Task Graph Execution Using Compute Unit Masking in GPUs
M. Chow, K. Ranganath, R. Lerias, Mika Shanela Carodan, Daniel Wong
The frontiers of supercomputing are being pushed by novel discrete accelerators. Accelerators such as GPUs are employed to speed up machine learning, scientific, and high-performance computing applications. However, it has become harder to extract additional parallelism from traditional workloads, which is why attention has shifted to task graphs. AMD's Directed Acyclic Graph Execution Engine (DAGEE) lets the programmer define a workload as fine-grained tasks, with the system handling the dependencies at a lower level. We evaluate DAGEE with the Winograd-Strassen matrix multiplication algorithm and show that DAGEE achieves on average a 15.3% speedup over the traditional matrix multiplication algorithm. Using DAGEE, however, may increase contention among kernels due to the increased parallelism. AMD allows the programmer to set the number of active Compute Units (CUs) by masking; this fine-grained scaling lets system software enable only the required number of CUs within a GPU. Using this mechanism, we develop a runtime that masks CUs for each task during task graph execution and partitions tasks onto separate CUs, reducing overall contention and energy consumption. We show that our CU-masking runtime reduces energy by 18% on average.
{"title":"Energy Efficient Task Graph Execution Using Compute Unit Masking in GPUs","authors":"M. Chow, K. Ranganath, R. Lerias, Mika Shanela Carodan, Daniel Wong","doi":"10.1109/rsdha54838.2021.00011","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00011","url":null,"abstract":"The frontiers of Supercomputers are pushed by novel discrete accelerators. Accelerators such as GPUs are employed to enable faster execution of Machine Learning, Scientific and High-Performance Computing applications. However, it has been harder to gain increased parallelism in traditional workloads. This is why more focus has been into Task Graphs. AMD’s Directed Acyclic Graph Execution Engine (DAGEE) allows the programmer to define a workload in fine-grained tasks, and the system handles the dependencies at the lower-level. We evaluate DAGEE with the Winograd-Strassen Matrix Multiplication algorithm and show that DAGEE achieves on average 15.3% speed up over the traditional matrix multiplication algorithm.While using DAGEE this may increase the contention among kernels due to the increased amount of parallelism. However, AMD allows the programmer to set the number of active Compute Unit (CU) by masking. This fine-grain scaling allows the system software to enable only the required number of Computation Units within a GPU. Using this mechanism we develop a Runtime that masks CU’s for each task during a task graph execution and partitions each task into their separate CU’s, reducing overall contention and energy consumption. We show that our CU Masking runtime on average reduces energy by 18%.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130701513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
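AMD exposes CU masking through bitmask-based APIs (for example, HIP's hipExtStreamCreateWithCUMask); the runtime described above builds on such a mechanism. As a language-neutral sketch, the Python snippet below shows only the partitioning step: carving a GPU's CUs into disjoint bitmasks sized by each task's assumed compute demand. The demand weights, CU count, and contiguous-mask layout are illustrative assumptions, and nothing here touches a real driver.

```python
# Illustrative sketch: partition a GPU's compute units (CUs) into disjoint
# bitmasks for concurrently running tasks, proportional to assumed demand.
# The mask is an integer bitfield, as consumed by CU-masking APIs such as
# HIP's hipExtStreamCreateWithCUMask (not called here).

def partition_cu_masks(total_cus, demands):
    """Return one CU bitmask (as an int) per task; masks do not overlap."""
    total = sum(demands)
    shares = [max(1, round(total_cus * d / total)) for d in demands]
    # Trim rounding overflow so shares sum to at most total_cus.
    while sum(shares) > total_cus:
        shares[shares.index(max(shares))] -= 1
    masks, next_cu = [], 0
    for s in shares:
        masks.append(((1 << s) - 1) << next_cu)  # s contiguous CUs
        next_cu += s
    return masks

# Example: a 60-CU GPU shared by three task-graph tasks of unequal weight.
for i, m in enumerate(partition_cu_masks(60, [3.0, 1.0, 2.0])):
    print(f"task {i}: mask={m:#017x} ({bin(m).count('1')} CUs)")
```

Because the masks are disjoint, concurrent tasks no longer compete for the same CUs, which is the contention reduction the abstract credits for its energy savings.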
Multi-accelerator Neural Network Inference in Diversely Heterogeneous Embedded Systems
Ismet Dagli, M. Belviranli
Neural network inference (NNI) is commonly used in mobile and autonomous systems for latency-sensitive critical operations such as obstacle detection and avoidance. In addition to latency, energy consumption is also an important factor in such workloads, since the battery is a limited resource in these systems. The energy and latency demands of critical workload execution can vary with the physical system state. For example, the remaining energy in a battery running low should be prioritized for motor consumption in a quadcopter; on the other hand, if the quadcopter is flying through obstacles, latency-aware execution becomes the priority. Many recent mobile and autonomous systems-on-chip embed a diverse range of accelerators with varying power and performance characteristics, which can be exploited to achieve this fine trade-off between energy and latency. In this paper, we investigate multi-accelerator execution (MAE) on diversely heterogeneous embedded systems, where sub-components of a given workload, such as NNI, can be assigned to different types of accelerators to achieve a desired latency or energy goal. We first analyze the energy and performance characteristics of executing neural network layers on different types of accelerators. We then explore energy/performance trade-offs via layer-wise scheduling for NNI by considering different layer-to-PE mappings. Finally, we propose a customizable metric, called multi-accelerator execution gain (MAEG), to measure the energy or performance benefits of MAE for a given workload. Our empirical results on Jetson Xavier SoCs show that our methodology can provide up to a 28% energy/performance trade-off benefit compared to assigning all layers to a single PE.
{"title":"Multi-accelerator Neural Network Inference in Diversely Heterogeneous Embedded Systems","authors":"Ismet Dagli, M. Belviranli","doi":"10.1109/rsdha54838.2021.00006","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00006","url":null,"abstract":"Neural network inference (NNI) is commonly used in mobile and autonomous systems for latency-sensitive critical operations such as obstacle detection and avoidance. In addition to latency, energy consumption is also an important factor in such workloads, since the battery is a limited resource in such systems. Energy and latency demands of critical workload execution in such systems can vary based on the physical system state. For example, the remaining energy on a low-running battery should be prioritized for motor consumption in a quadcopter. On the other hand, if the quadcopter is flying through obstacles, latency-aware execution becomes a priority. Many recent mobile and autonomous system-on-chips embed a diverse range of accelerators with varying power and performance characteristics which can be utilized to achieve this fine trade-off between energy and latency.In this paper, we investigate Multi-accelerator Execution (MAE) on diversely heterogeneous embedded systems, where sub-components of a given workload, such as NNI, can be assigned to different type of accelerators to achieve a desired latency or energy goal. We first analyze the energy and performance characteristics of execution of neural network layers on different type of accelerators. We then explore energy/performance trade-offs via layer-wise scheduling for NNI by considering different layer-to-PE mappings. We finally propose a customizable metric, called multi-accelerator execution gain (MAEG), in order to measure the energy or performance benefits of MAE of a given workload. Our empirical results on Jetson Xavier SoCs show that our methodology can provide up to 28% energy/performance trade-off benefit when compared to the case where all layers are assigned to a single PE.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131100623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
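The abstract names layer-to-PE mapping and the MAEG metric without defining either. The sketch below is an assumed formulation, not the paper's: pick, per layer, the processing element minimizing a weighted energy/latency objective, then report the gain over the best single-PE assignment. The cost tables, the weight alpha, and the gain formula are invented, and data-movement costs between PEs are deliberately ignored.

```python
# Illustrative sketch: per-layer PE selection under a weighted
# energy/latency objective, ignoring inter-PE data-movement costs.
# A real model would normalize units; here alpha just blends ms and mJ.

def schedule_layers(costs, alpha=0.5):
    """costs[layer][pe] = (latency_ms, energy_mj). Returns (plan, objective)."""
    plan, total = [], 0.0
    for layer in costs:
        pe, (lat, en) = min(
            layer.items(), key=lambda kv: alpha * kv[1][0] + (1 - alpha) * kv[1][1]
        )
        plan.append(pe)
        total += alpha * lat + (1 - alpha) * en
    return plan, total

def single_pe_objective(costs, pe, alpha=0.5):
    return sum(alpha * c[pe][0] + (1 - alpha) * c[pe][1] for c in costs)

# Invented per-layer (latency_ms, energy_mJ) numbers for three PEs.
costs = [
    {"GPU": (1.0, 9.0), "DLA": (2.5, 3.0), "CPU": (6.0, 5.0)},
    {"GPU": (0.8, 7.5), "DLA": (1.2, 2.0), "CPU": (5.0, 4.5)},
    {"GPU": (3.0, 20.0), "DLA": (6.0, 6.5), "CPU": (9.0, 9.0)},
]

plan, obj = schedule_layers(costs)
best_single = min(single_pe_objective(costs, pe) for pe in ("GPU", "DLA", "CPU"))
print("plan:", plan, " gain vs best single PE:", 1 - obj / best_single)
```

Sweeping alpha from 0 (energy only) to 1 (latency only) traces out the energy/performance trade-off curve the paper explores, with the printed gain playing the role of a MAEG-like figure.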
Comparing LLC-Memory Traffic between CPU and GPU Architectures
Mohammad Alaul Haque Monil, Seyong Lee, J. Vetter, A. Malony
The cache hierarchy in modern CPUs and GPUs is becoming increasingly complex, which makes understanding the handshake between the memory access patterns and the cache hierarchy difficult. Moreover, the details of different cache policies are not publicly available. Therefore, the research community relies on observation to understand the relationship between memory access patterns and cache hierarchy. Our previous studies delved into the different microarchitectures of Intel CPUs. In this study, GPUs from NVIDIA and AMD are considered. Even though the execution models in CPUs and GPUs are distinct, this study attempts to correlate the behavior of the cache hierarchy of CPUs and GPUs. Using the knowledge gathered from studying Intel CPUs, the similarities and dissimilarities between CPUs and GPUs are identified. Through model evaluation, this study provides a proof of concept that traffic between last-level cache and memory can be predicted for sequential streaming and strided access patterns on GPUs.
{"title":"Comparing LLC-Memory Traffic between CPU and GPU Architectures","authors":"Mohammad Alaul Haque Monil, Seyong Lee, J. Vetter, A. Malony","doi":"10.1109/rsdha54838.2021.00007","DOIUrl":"https://doi.org/10.1109/rsdha54838.2021.00007","url":null,"abstract":"The cache hierarchy in modern CPUs and GPUs is becoming increasingly complex, which makes understanding the handshake between the memory access patterns and the cache hierarchy difficult. Moreover, the details of different cache policies are not publicly available. Therefore, the research community relies on observation to understand the relationship between memory access patterns and cache hierarchy. Our previous studies delved into the different microarchitectures of Intel CPUs. In this study, GPUs from NVIDIA and AMD are considered. Even though the execution models in CPUs and GPUs are distinct, this study attempts to correlate the behavior of the cache hierarchy of CPUs and GPUs. Using the knowledge gathered from studying Intel CPUs, the similarities and dissimilarities between CPUs and GPUs are identified. Through model evaluation, this study provides a proof of concept that traffic between last-level cache and memory can be predicted for sequential streaming and strided access patterns on GPUs.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125715197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
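The abstract claims LLC-memory traffic can be predicted for sequential and strided patterns but gives no formula. The snippet below encodes the common first-order estimate, traffic = distinct cache lines touched times line size, as an assumed stand-in for the paper's model; it presumes a 64-byte line and a working set large enough that every line misses once.

```python
# First-order traffic estimate (an assumption, not the paper's model):
# bytes moved between LLC and DRAM equal the number of distinct 64-byte
# cache lines touched, valid when the working set exceeds the LLC.

LINE = 64  # cache-line size in bytes

def read_traffic_bytes(n_elems, elem_size=8, stride_elems=1):
    """Estimate DRAM read traffic for accessing n_elems at a fixed stride."""
    stride_bytes = stride_elems * elem_size
    if stride_bytes >= LINE:
        return n_elems * LINE          # every access lands on a new line
    span = (n_elems - 1) * stride_bytes + elem_size
    return -(-span // LINE) * LINE     # ceil(span / LINE) lines touched

# Sequential stream of 1M doubles vs. the same elements at stride 16.
print(read_traffic_bytes(1_000_000))                   # ~8 MB
print(read_traffic_bytes(1_000_000, stride_elems=16))  # 64 MB, 8x more
```

The 8x blow-up for the strided case shows why the same element count can generate very different LLC-DRAM traffic, the effect the paper validates against measurements on NVIDIA and AMD GPUs.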
Distributed Training for High Resolution Images: A Domain and Spatial Decomposition Approach
A. Tsaris, Jacob D. Hinkle, D. Lunga, P. Dias
In this work we developed two PyTorch libraries, built on the PyTorch RPC interface, for distributed deep learning on high-resolution images. The spatial decomposition library allows distributed training on very large images that would otherwise not fit on a single GPU. The domain parallelism library allows distributed training across multiple domains of unlabeled data by leveraging the domain separation architecture. Both libraries were tested at moderate scale on the Summit supercomputer at Oak Ridge National Laboratory.
{"title":"Distributed Training for High Resolution Images: A Domain and Spatial Decomposition Approach","authors":"A. Tsaris, Jacob D. Hinkle, D. Lunga, P. Dias","doi":"10.2172/1827010","DOIUrl":"https://doi.org/10.2172/1827010","url":null,"abstract":"In this work we developed two Pytorch libraries using the PyTorch RPC interface for distributed deep learning approaches on high resolution images. The spatial decomposition library allows for distributed training on very large images, which otherwise wouldn’t be possible on a single GPU. The domain parallelism library allows for distributed training across multiple domain unlabeled data, by leveraging the domain separation architecture. Both of those libraries where tested on the Summit supercomputer at Oak Ridge National Laboratory at a moderate scale.","PeriodicalId":119942,"journal":{"name":"2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116867405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
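The abstract does not show the libraries' APIs. Below is a minimal, self-contained sketch of the spatial-decomposition idea on the public torch.distributed.rpc interface: rank 0 splits an image along its height and fans the tiles out to workers. The tile function, sizes, and the absence of halo exchange or a backward pass are simplifying assumptions, not the library's actual design.

```python
# Minimal sketch of spatial decomposition over torch.distributed.rpc:
# forward-only tile fan-out, no halo exchange, no training loop.

import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def process_tile(tile):
    """Runs on a worker: any per-tile computation; here a cheap stand-in."""
    return torch.relu(tile) * 2.0

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        image = torch.randn(1, 3, 256, 256)               # stand-in "large" image
        tiles = list(image.chunk(world_size - 1, dim=2))  # split along height
        futures = [
            rpc.rpc_async(f"worker{i + 1}", process_tile, args=(t,))
            for i, t in enumerate(tiles)
        ]
        out = torch.cat([f.wait() for f in futures], dim=2)
        print("reassembled:", out.shape)
    rpc.shutdown()  # blocks until outstanding RPCs complete on every rank

if __name__ == "__main__":
    mp.spawn(run, args=(3,), nprocs=3, join=True)
```

A real spatial-decomposition library would also exchange tile borders between neighbors before convolutions and propagate gradients back through the RPC boundary; this sketch only shows the decomposition and reassembly pattern.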