AI on the Edge: Characterizing AI-based IoT Applications Using Specialized Edge Architectures
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00023 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Qianlin Liang, P. Shenoy, David E. Irwin
Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting.
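To make the split-processing idea above concrete, here is a minimal sketch of the kind of decision a dynamic split-selection method has to make: pick the layer boundary that minimizes edge compute, plus transfer of the intermediate data, plus cloud compute. The layer names, latencies, data sizes, and uplink bandwidth below are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical sketch: choose a DNN split point between an edge accelerator and
# the cloud by minimizing estimated end-to-end latency. All numbers are made up.

# Per-layer estimates: (name, edge latency ms, cloud latency ms, output size KB).
LAYERS = [
    ("conv1", 4.0, 0.8, 800),
    ("conv2", 6.0, 1.2, 400),
    ("conv3", 5.0, 1.0, 100),
    ("fc",    50.0, 0.2, 4),
]
UPLINK_KB_PER_MS = 12.5   # assumed ~100 Mbps uplink
INPUT_KB = 1500           # assumed raw input size if everything runs in the cloud

def end_to_end_latency(split):
    """Run layers [0, split) on the edge, the rest in the cloud."""
    edge = sum(l[1] for l in LAYERS[:split])
    cloud = sum(l[2] for l in LAYERS[split:])
    # Data crossing the tier boundary: the raw input if split == 0, otherwise
    # the intermediate activation produced by the last edge-side layer.
    transfer_kb = INPUT_KB if split == 0 else LAYERS[split - 1][3]
    transfer = 0.0 if split == len(LAYERS) else transfer_kb / UPLINK_KB_PER_MS
    return edge + transfer + cloud

best = min(range(len(LAYERS) + 1), key=end_to_end_latency)
print("best split index:", best, "latency (ms):", round(end_to_end_latency(best), 1))
```

Because the best split depends on the relative compute speeds and the link bandwidth, which all vary at run time, a static choice is easily suboptimal; this is the motivation for the dynamic methods the abstract calls for.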
{"title":"AI on the Edge: Characterizing AI-based IoT Applications Using Specialized Edge Architectures","authors":"Qianlin Liang, P. Shenoy, David E. Irwin","doi":"10.1109/IISWC50251.2020.00023","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00023","url":null,"abstract":"Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115773067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing the impact of last-level cache replacement policies on big-data workloads
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00022 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Alexandre Valentin Jamet, Lluc Alvarez, Daniel A. Jiménez, Marc Casas
The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policies. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite (and, more recently, the SPEC CPU 2017 benchmark suite) for evaluation. However, these workloads are representative of only a subset of current High-Performance Computing (HPC) workloads. In this paper we evaluate the behavior of a mix of graph processing, scientific, and industrial workloads (GAP, XSBench, and Qualcomm), along with the well-known SPEC CPU 2006 and SPEC CPU 2017 workloads, on state-of-the-art LLC replacement policies such as Multiperspective Reuse Prediction (MPPPB), Glider, Hawkeye, SHiP, DRRIP, and SRRIP. Our evaluation reveals that, even though current state-of-the-art LLC replacement policies provide a significant performance improvement over LRU for both SPEC CPU 2006 and SPEC CPU 2017 workloads, those policies are hardly able to capture the access patterns of, and yield sensible improvement on, current HPC and big-data workloads due to their highly complex behavior. In addition, this paper introduces two new LLC replacement policies derived from MPPPB. The first proposed replacement policy, Multi-Sampler Multiperspective (MS-MPPPB), uses multiple samplers instead of a single one and dynamically selects the best-behaving sampler to drive reuse distance predictions. The second replacement policy presented in this paper, Multiperspective with Dynamic Features Selector (DS-MPPPB), selects the best-behaving features among a set of 64 features to improve the accuracy of the predictions. On a large set of workloads that stress the LLC, MS-MPPPB achieves a geometric mean speedup of 8.3% over LRU, while DS-MPPPB outperforms LRU by a geometric mean speedup of 8.0%. For big-data and HPC workloads, the two proposed techniques present higher performance benefits than state-of-the-art approaches such as MPPPB, Glider, and Hawkeye, which yield geometric mean speedups of 7.0%, 5.0%, and 4.8% over LRU, respectively.
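The multi-sampler selection idea in MS-MPPPB can be illustrated at a very high level: several predictors observe the same sampled accesses, their recent accuracy is tracked, and the one currently performing best drives the reuse predictions. The sketch below is only a schematic of that dynamic-selection pattern, with trivial stand-in predictors and an invented accuracy window; it is not the MPPPB machinery itself.

```python
# Schematic sketch of dynamic selection among several reuse predictors,
# in the spirit of "pick the best-behaving sampler". The predictors here
# are trivial stand-ins, not actual MPPPB samplers.
import random
from collections import deque

class SelectingPredictor:
    def __init__(self, predictors, window=1024):
        self.predictors = predictors  # callables: (pc, addr) -> bool (predicted reuse)
        self.history = [deque(maxlen=window) for _ in predictors]  # 1 = correct, 0 = wrong

    def best(self):
        scores = [sum(h) / len(h) if h else 0.0 for h in self.history]
        return max(range(len(self.predictors)), key=lambda i: scores[i])

    def predict(self, pc, addr):
        # Only the currently best-scoring predictor drives the replacement decision.
        return self.predictors[self.best()](pc, addr)

    def train(self, pc, addr, reused):
        # On a sampled access, every predictor is scored against the observed outcome.
        for i, p in enumerate(self.predictors):
            self.history[i].append(1 if p(pc, addr) == reused else 0)

# Toy predictors: always-reuse, never-reuse, and a PC-hash heuristic.
sel = SelectingPredictor([
    lambda pc, addr: True,
    lambda pc, addr: False,
    lambda pc, addr: (pc >> 2) % 3 != 0,
])

# Toy demo on a synthetic access stream.
random.seed(0)
for _ in range(2000):
    pc, addr = random.randrange(1 << 16), random.randrange(1 << 20)
    sel.train(pc, addr, reused=(addr % 4 != 0))
print("selected sampler index:", sel.best())
```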
{"title":"Characterizing the impact of last-level cache replacement policies on big-data workloads","authors":"Alexandre Valentin Jamet, Lluc Alvarez, Daniel A. Jiménez, Marc Casas","doi":"10.1109/IISWC50251.2020.00022","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00022","url":null,"abstract":"The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policy. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite - and more recently on the SPEC CPU 2017 benchmark suite - for evaluation. However, these workloads are representative for only a subset of current High-Performance Computing (HPC) workloads. In this paper we evaluate the behavior of a mix of graph processing, scientific and industrial workloads (GAP, XSBench and Qualcomm) along with the well-known SPEC CPU 2006 and SPEC CPU 2017 workloads on state-of-the-art LLC replacement policies such as Multiperspective Reuse Prediction (MPPPB), Glider, Hawkeye, SHiP, DRRIP and SRRIP. Our evaluation reveals that, even though current state-of-the-art LLC replacement policies provide a significant performance improvement over LRU for both SPEC CPU 2006 and SPEC CPU 2017 workloads, those policies are hardly able to capture the access patterns and yield sensible improvement on current HPC and big data workloads due to their highly complex behavior. In addition, this paper introduces two new LLC replacement policies derived from MPPPB. The first proposed replacement policy, Multi-Sampler Multiperspective (MS-MPPPB), uses multiple samplers instead of a single one and dynamically selects the best-behaving sampler to drive reuse distance predictions. The second replacement policy presented in this paper, Multiperspective with Dynamic Features Selector (DS-MPPPB), selects the best behaving features among a set of 64 features to improve the accuracy of the predictions. On a large set of workloads that stress the LLC, MS-MPPPB achieves a geometric mean speed-up of 8.3% over LRU, while DS-MPPPB outperforms LRU by a geometric mean speedup of 8.0%. For big data and HPC workloads, the two proposed techniques present higher performance benefits than state-of-the-art approaches such as MPPPB, Glider and Hawkeye, which yield geometric mean speedups of 7.0%, 5.0% and 4.8% over LRU, respectively.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114739964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU Microarchitectural Performance Characterization of Cloud Video Transcoding
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00016 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Yuhan Chen, Jingyuan Zhu, Tanvir Ahmed Khan, Baris Kasikci
Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end-users are encoded to reduce their size and make efficient use of network bandwidth, and are decoded when played on end-users' devices. Videos have to be transcoded, i.e., converted from one encoding format to another, to fit users' differing needs for resolution, framerate, and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) perform a large number of transcoding operations. Optimizing the performance of transcoding to provide a speedup of even a few percent can save millions of dollars in computational and energy costs. While prior work identified microarchitectural characteristics of the transcoding operation for different classes of videos, other parameters of video transcoding and their impact on CPU performance have yet to be studied. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg, with all of its major configurable parameters across videos of different complexity (e.g., videos with high motion and frequent scene transitions are more complex). Based on our profiling results, we find key bottlenecks in the instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. We leverage several state-of-the-art compiler approaches to mitigate performance bottlenecks of video transcoding operations. We apply AutoFDO, a feedback-directed optimization (FDO) tool, to improve instruction cache and branch prediction performance. To improve data cache performance, we leverage Graphite, a polyhedral optimizer. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42%, respectively. We also set up simulation settings with different microarchitecture configurations and explore the potential improvement using a smart scheduler that assigns transcoding tasks to the best-fit configuration based on transcoding parameter values. The smart scheduler performs 3.72% better than the random scheduler and matches the performance of the best scheduler 75% of the time.
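For context on the "major configurable parameters" mentioned above, a single FFmpeg transcode exposing the knobs the abstract lists (encoding format, resolution, framerate, plus quality and encoder-speed settings) might be invoked as in the sketch below. The file names and parameter values are placeholders; the flags shown are standard FFmpeg options, but this is not the exact parameter sweep used in the study.

```python
# Minimal sketch of invoking FFmpeg with a few of its major transcoding knobs.
# File names and parameter values are placeholders, not the paper's sweep.
import subprocess

def transcode(src, dst, codec="libx264", width=1280, height=720, fps=30,
              crf=23, preset="medium"):
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-c:v", codec,                      # target encoding format
        "-vf", f"scale={width}:{height}",   # target resolution
        "-r", str(fps),                     # target framerate
        "-crf", str(crf),                   # quality / bitrate trade-off
        "-preset", preset,                  # encoder speed / efficiency trade-off
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example (assumes input.mp4 exists): transcode("input.mp4", "out_720p30.mp4")
```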
{"title":"CPU Microarchitectural Performance Characterization of Cloud Video Transcoding","authors":"Yuhan Chen, Jingyuan Zhu, Tanvir Ahmed Khan, Baris Kasikci","doi":"10.1109/IISWC50251.2020.00016","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00016","url":null,"abstract":"Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end-users are encoded to reduce their size in order to efficiently use the Internet traffic, and are decoded when played at end-users' devices. Videos have to be transcoded-i.e., where one encoding format is converted to another-to fit users' different needs of resolution, framerate and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) employ a large number of transcoding operations. Optimizing the performance of transcoding to provide speedup of a few percent can save millions of dollars in computational and energy costs. While prior works identified microarchitectural characteristics of the transcoding operation for different classes of videos, other parameters of video transcoding and their impact on CPU performance has yet to be studied. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg with all of its major configurable parameters across videos with different complexity (e.g., videos with high motion and frequent scene transition are more complex). Based on our profiling results, we find key bottlenecks in instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. We leverage several state-of-the-art compiler approaches to mitigate performance bottlenecks of video transcoding operations. We apply AutoFDO, a feedback-directed optimization (FDO) tool to improve instruction cache and branch prediction performance. To improve data cache performance, we leverage Graphite, a polyhedral optimizer. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42% respectively. We also set up simulation settings with different microarchitecture configurations, and explore the potential improvement using a smart scheduler that assigns transcoding tasks to the best-fit configuration based on transcoding parameter values. The smart scheduler performs 3.72% better than the random scheduler and matches the performance of the best scheduler 75% of the time. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg with all of its major configurable parameters across videos with different complexity (e.g., videos with high motion and frequent scene transition are more complex). Based on our profiling results, we find key bottlenecks in instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. 
We lev","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"106 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122986333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Case for Generalizable DNN Cost Models for Mobile Devices
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00025 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Vinod Ganesan, Surya Selvam, Sanchari Sen, Pratyush Kumar, A. Raghunathan
Accurate workload characterization of Deep Neural Networks (DNNs) is challenged by both network and hardware diversity. Networks are being designed with newer motifs such as depthwise separable convolutions, bottleneck layers, etc., which have widely varying performance characteristics. Further, the adoption of Neural Architecture Search (NAS) is creating a Cambrian explosion of networks, greatly expanding the space of networks that must be modeled. On the hardware front, myriad accelerators are being built for DNNs, while compiler improvements are enabling more efficient execution of DNNs on a wide range of CPUs and GPUs. Clearly, characterizing each DNN on each hardware system is infeasible. We thus need cost models to estimate performance that generalize across both devices and networks. In this work, we address this challenge by building a cost model of DNNs on mobile devices. The modeling and evaluation are based on latency measurements of 118 networks on 105 mobile System-on-Chips (SoCs). As a key contribution, we propose that a hardware platform can be represented by its measured latencies on a judiciously chosen, small set of networks, which we call the signature set. We also design a machine learning model that takes as inputs (i) the target hardware representation (measured latencies of the signature set on the hardware) and (ii) a representation of the structure of the DNN to be evaluated, and predicts the latency of the DNN on the target hardware. We propose and evaluate different algorithms to select the signature set. Our results show that by carefully choosing the signature set, the network representation, and the machine learning algorithm, we can train accurate cost models that generalize well. We demonstrate the value of such a cost model in a collaborative workload characterization setup, wherein every mobile device contributes a small set of latency measurements to a centralized repository. With even a small number of measurements per new device, we show that the proposed cost model matches the accuracy of device-specific models trained on an order-of-magnitude larger number of measurements. The entire codebase is released at https://github.com/iitm-sysdl/Generalizable-DNN-cost-models.
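The cost-model setup described above can be sketched roughly as follows: concatenate a device's measured latencies on the signature set with a feature vector describing the target network, and fit a regressor on (device, network, latency) examples. The feature construction, the synthetic data, and the choice of a gradient-boosting regressor are illustrative assumptions, not the authors' model; their actual implementation is in the repository linked above.

```python
# Hedged sketch of the cost-model idea: concatenate (a) a device's measured
# latencies on a small signature set with (b) features describing the target
# network, and train a regressor to predict that network's latency on the device.
# Feature choices, synthetic data, and the regressor are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_features(signature_latencies, network_features):
    # signature_latencies: latencies (ms) of the signature networks on the device
    # network_features: e.g. counts of layer types, total FLOPs, parameter count
    return np.concatenate([signature_latencies, network_features])

# Toy training data: (device, network) pairs with synthetic latency labels.
rng = np.random.default_rng(0)
X = rng.random((200, 16))                               # 8 signature latencies + 8 network features
y = X[:, :8].mean(axis=1) * (1 + X[:, 8:].sum(axis=1))  # synthetic target latency

model = GradientBoostingRegressor().fit(X, y)

new_device_sig = rng.random(8)   # measured once per new device
new_network = rng.random(8)      # derived from the DNN's structure
pred = model.predict(build_features(new_device_sig, new_network).reshape(1, -1))
print("predicted latency (arbitrary units):", float(pred[0]))
```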
{"title":"A Case for Generalizable DNN Cost Models for Mobile Devices","authors":"Vinod Ganesan, Surya Selvam, Sanchari Sen, Pratyush Kumar, A. Raghunathan","doi":"10.1109/IISWC50251.2020.00025","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00025","url":null,"abstract":"Accurate workload characterization of Deep Neural Networks (DNNs) is challenged by both network and hardware diversity. Networks are being designed with newer motifs such as depthwise separable convolutions, bottleneck layers, etc., which have widely varying performance characteristics. Further, the adoption of Neural Architecture Search (NAS) is creating a Cambrian explosion of networks, greatly expanding the space of networks that must be modeled. On the hardware front, myriad accelerators are being built for DNNs, while compiler improvements are enabling more efficient execution of DNNs on a wide range of CPUs and GPUs. Clearly, characterizing each DNN on each hardware system is infeasible. We thus need cost models to estimate performance that generalize across both devices and networks. In this work, we address this challenge by building a cost model of DNNs on mobile devices. The modeling and evaluation are based on latency measurements of 118 networks on 105 mobile System-on-Chips (SoCs). As a key contribution, we propose that a hardware platform can be represented by its measured latencies on a judiciously chosen, small set of networks, which we call the signature set. We also design a machine learning model that takes as inputs (i) the target hardware representation (measured latencies of the signature set on the hardware) and (ii) a representation of the structure of the DNN to be evaluated, and predicts the latency of the DNN on the target hardware. We propose and evaluate different algorithms to select the signature set. Our results show that by carefully choosing the signature set, the network representation, and the machine learning algorithm, we can train accurate cost models that generalize well. We demonstrate the value of such a cost model in a collaborative workload characterization setup, wherein every mobile device contributes a small set of latency measurements to a centralized repository. With even a small number of measurements per new device, we show that the proposed cost model matches the accuracy of device-specific models trained on an order-of-magnitude larger number of measurements. The entire codebase is released at https://github.com/iitm-sysdl/Generalizable-DNN-cost-models.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128937382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPC-MixPBench: An HPC Benchmark Suite for Mixed-Precision Analysis
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00012 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
K. Parasyris, I. Laguna, Harshitha Menon, M. Schordan, D. Osei-Kuffuor, G. Georgakoudis, Michael O. Lam, T. Vanderbruggen
With the increasing interest in applying approximate computing to HPC applications, representative benchmarks are needed to evaluate and compare various approximate computing algorithms and programming frameworks. To this end, we propose HPC-MixPBench, a benchmark suite consisting of a representative set of kernels and benchmarks that are widely used in the HPC domain. HPC-MixPBench has a test harness framework into which different tools can be plugged and evaluated on the set of benchmarks. We demonstrate the effectiveness of our benchmark suite by evaluating several mixed-precision algorithms implemented in FloatSmith, a tool for floating-point mixed-precision approximation analysis. We report several insights about the mixed-precision algorithms that we compare, which we expect can help users of these methods choose the right method for their workload. We envision that this benchmark suite will evolve into a standard set of HPC benchmarks for comparing different approximate computing techniques.
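As a minimal illustration of what a mixed-precision analysis checks, the sketch below runs a kernel at reduced precision and compares it against a double-precision baseline; tools such as FloatSmith automate the search over which variables can be demoted while keeping such errors acceptable. The kernel and the error threshold are invented for this example and are not part of the benchmark suite.

```python
# Illustrative sketch of a mixed-precision check: run a kernel at reduced
# precision and compare against a double-precision baseline. The kernel and
# the acceptance threshold are made up for this example.
import numpy as np

def sum_of_squares(a, dtype):
    x = a.astype(dtype)
    return (x * x).sum(dtype=dtype)

rng = np.random.default_rng(1)
a = rng.standard_normal(1_000_000)

ref = sum_of_squares(a, np.float64)      # double-precision baseline
approx = sum_of_squares(a, np.float32)   # candidate reduced precision
rel_err = abs(float(approx) - float(ref)) / abs(float(ref))
print(f"relative error at float32: {rel_err:.2e}",
      "(acceptable)" if rel_err < 1e-4 else "(too large)")
```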
{"title":"HPC-MixPBench: An HPC Benchmark Suite for Mixed-Precision Analysis","authors":"K. Parasyris, I. Laguna, Harshitha Menon, M. Schordan, D. Osei-Kuffuor, G. Georgakoudis, Michael O. Lam, T. Vanderbruggen","doi":"10.1109/IISWC50251.2020.00012","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00012","url":null,"abstract":"With the increasing interest in applying approximate computing to HPC applications, representative benchmarks are needed to evaluate and compare various approximate computing algorithms and programming frameworks. To this end, we propose HPC-MixPBench, a benchmark suite consisting of a representative set of kernels and benchmarks that are widely used in HPC domain. HPC-MixPBench has a test harness framework where different tools can be plugged in and evaluated on the set of benchmarks. We demonstrate the effectiveness of our benchmark suite by evaluating several mixed-precision algorithms implemented in FloatSmith, a tool for floating-point mixed-precision approximation analysis. We report several insights about the mixed-precision algorithms that we compare, which we expect can help users of these methods choose the right method for their workload. We envision that this benchmark suite will evolve into a standard set of HPC benchmarks for comparing different approximate computing techniques.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129050198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-Stack Workload Characterization of Deep Recommendation Systems
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00024 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, D. Brooks
Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can yield up to a 15x speedup. To better understand the bottlenecks for further optimization, we look at both the software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.
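To illustrate the "batch size granularity" knob mentioned above, the sketch below times a small embedding-plus-MLP model, loosely shaped like a deep recommendation model, at several batch sizes on the CPU. The model dimensions, lookup counts, and batch sizes are arbitrary; this is not one of the paper's eight industry-representative models.

```python
# Rough sketch of how batch-size granularity affects throughput for an
# embedding + MLP model, loosely shaped like a deep recommendation model.
# Dimensions and batch sizes are arbitrary, not the paper's configurations.
import time
import torch

emb = torch.nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=32, mode="sum")
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))

def throughput(batch_size, iters=20):
    ids = torch.randint(0, 100_000, (batch_size, 8))   # 8 sparse lookups per sample
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            mlp(emb(ids))
        elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

for bs in (1, 16, 256, 4096):
    print(f"batch={bs:5d}  ~{throughput(bs):,.0f} samples/s")
```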
{"title":"Cross-Stack Workload Characterization of Deep Recommendation Systems","authors":"Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, D. Brooks","doi":"10.1109/IISWC50251.2020.00024","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00024","url":null,"abstract":"Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches - ranging from near memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can give us up to 15x speedup. To better understand the bottlenecks for further optimization, we look at both software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126431609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00011 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
S. Pal, Kuba Kaszyk, Siying Feng, Björn Franke, M. Cole, M. O’Boyle, T. Mudge, R. Dreslinski
The rising complexity of large-scale heterogeneous architectures, such as those composed of off-the-shelf processors coupled with fixed-function logic, has imposed challenges for traditional simulation methodologies. While prior work has explored trace-based simulation techniques that offer good tradeoffs between simulation accuracy and speed, most such proposals are limited to simulating chip multiprocessors (CMPs) with up to hundreds of threads. There is a gap for a framework that can flexibly and accurately model different heterogeneous systems and scale to a larger number of cores. We implement a solution called HETSIM, a trace-driven, synchronization- and dependency-aware framework for fast and accurate pre-silicon performance and power estimation for heterogeneous systems with up to thousands of cores. HETSIM operates in four stages: compilation, emulation, trace generation, and trace replay. Given (i) a specification file, (ii) a multithreaded implementation of the target application, and (iii) an architectural and power model of the target hardware, HETSIM generates performance and power estimates with no further user intervention. HETSIM distinguishes itself from existing approaches through emulation of target hardware functionality as software primitives. HETSIM is packaged with primitives that are commonplace across many accelerator designs, and the framework can easily be extended to support custom primitives. We demonstrate the utility of HETSIM through design-space exploration on two recent target architectures: (i) a reconfigurable many-core accelerator, and (ii) a heterogeneous, domain-specific accelerator. Overall, HETSIM demonstrates simulation-time speedups of 3.2×-10.4× (average 5.0×) over gem5 in syscall emulation mode, with average deviations in simulated time and power consumption of 15.1% and 10.9%, respectively. HETSIM is validated against silicon for the second target and estimates performance within a deviation of 25.5%, on average.
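The dependency-aware trace-replay stage can be pictured with a toy example: each trace entry carries a latency and the IDs of the entries it depends on, and replay resolves an entry once all of its dependencies have finished. The trace format and numbers below are invented for illustration and are not HETSIM's actual trace format or timing model.

```python
# Toy dependency-aware trace replay in the spirit described above: each trace
# entry finishes after all entries it depends on, plus its own latency.
# The trace format and numbers are invented, not HETSIM's.

# (entry_id, latency_cycles, dependency_ids), assumed topologically ordered.
TRACE = [
    (0, 10, []),        # e.g. a load primitive
    (1, 4,  [0]),       # compute depending on the load
    (2, 4,  [0]),       # independent compute on another processing element
    (3, 2,  [1, 2]),    # synchronization point joining both
]

def replay(trace):
    finish = {}
    for entry_id, latency, deps in trace:
        ready = max((finish[d] for d in deps), default=0)
        finish[entry_id] = ready + latency
    return max(finish.values())

print("simulated completion time:", replay(TRACE), "cycles")
```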
{"title":"HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework","authors":"S. Pal, Kuba Kaszyk, Siying Feng, Björn Franke, M. Cole, M. O’Boyle, T. Mudge, R. Dreslinski","doi":"10.1109/IISWC50251.2020.00011","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00011","url":null,"abstract":"The rising complexity of large-scale heterogeneous architectures, such as those composed of off-the-shelf processors coupled with fixed-function logic, has imposed challenges for traditional simulation methodologies. While prior work has explored trace-based simulation techniques that offer good tradeoffs between simulation accuracy and speed, most such proposals are limited to simulating chip multiprocessors (CMPs) with up to hundreds of threads. There exists a gap for a framework that can flexibly and accurately model different heterogeneous systems, as well as scales to a larger number of cores. We implement a solution called HETSIM, a trace-driven, synchronization and dependency-aware framework for fast and accurate pre-silicon performance and power estimations for heterogeneous systems with up to thousands of cores. HETSIM operates in four stages: compilation, emulation, trace generation and trace replay. Given (i) a specification file, (ii) a multithreaded implementation of the target application, and (iii) an architectural and power model of the target hardware, HETSIM generates performance and power estimates with no further user intervention. HETSIM distinguishes itself from existing approaches through emulation of target hardware functionality as software primitives. HETSIM is packaged with primitives that are commonplace across many accelerator designs, and the framework can easily be extended to support custom primitives. We demonstrate the utility of HETSIM through design-space exploration on two recent target architectures: (i) a reconfigurable many-core accelerator, and (ii) a heterogeneous, domain-specific accelerator. Overall, HETSIM demonstrates simulation time speedups of 3.2×-10.4× (average 5.0×) over gem5 in syscall emulation mode, with average deviations in simulated time and power consumption of 15.1% and 10.9%, respectively. HETSIM is validated against silicon for the second target and estimates performance within a deviation of 25.5%, on average.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"59 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129494612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}