AI on the Edge: Characterizing AI-based IoT Applications Using Specialized Edge Architectures
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00023 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Qianlin Liang, P. Shenoy, David E. Irwin
Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting.
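To make the split-processing idea above concrete, here is a minimal sketch of the kind of decision a dynamic split-selection method has to make: pick the layer boundary that minimizes edge compute, plus transfer of the intermediate data, plus cloud compute. The layer names, latencies, data sizes, and uplink bandwidth below are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical sketch: choose a DNN split point between an edge accelerator and
# the cloud by minimizing estimated end-to-end latency. All numbers are made up.

# Per-layer estimates: (name, edge latency ms, cloud latency ms, output size KB).
LAYERS = [
    ("conv1", 4.0, 0.8, 800),
    ("conv2", 6.0, 1.2, 400),
    ("conv3", 5.0, 1.0, 100),
    ("fc",    50.0, 0.2, 4),
]
UPLINK_KB_PER_MS = 12.5   # assumed ~100 Mbps uplink
INPUT_KB = 1500           # assumed raw input size if everything runs in the cloud

def end_to_end_latency(split):
    """Run layers [0, split) on the edge, the rest in the cloud."""
    edge = sum(l[1] for l in LAYERS[:split])
    cloud = sum(l[2] for l in LAYERS[split:])
    # Data crossing the tier boundary: the raw input if split == 0, otherwise
    # the intermediate activation produced by the last edge-side layer.
    transfer_kb = INPUT_KB if split == 0 else LAYERS[split - 1][3]
    transfer = 0.0 if split == len(LAYERS) else transfer_kb / UPLINK_KB_PER_MS
    return edge + transfer + cloud

best = min(range(len(LAYERS) + 1), key=end_to_end_latency)
print("best split index:", best, "latency (ms):", round(end_to_end_latency(best), 1))
```

Because the best split depends on the relative compute speeds and the link bandwidth, which all vary at run time, a static choice is easily suboptimal; this is the motivation for the dynamic methods the abstract calls for.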
{"title":"AI on the Edge: Characterizing AI-based IoT Applications Using Specialized Edge Architectures","authors":"Qianlin Liang, P. Shenoy, David E. Irwin","doi":"10.1109/IISWC50251.2020.00023","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00023","url":null,"abstract":"Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115773067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing the impact of last-level cache replacement policies on big-data workloads
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00022 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Alexandre Valentin Jamet, Lluc Alvarez, Daniel A. Jiménez, Marc Casas
The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policies. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite (and, more recently, the SPEC CPU 2017 benchmark suite) for evaluation. However, these workloads are representative of only a subset of current High-Performance Computing (HPC) workloads. In this paper we evaluate the behavior of a mix of graph processing, scientific, and industrial workloads (GAP, XSBench, and Qualcomm), along with the well-known SPEC CPU 2006 and SPEC CPU 2017 workloads, on state-of-the-art LLC replacement policies such as Multiperspective Reuse Prediction (MPPPB), Glider, Hawkeye, SHiP, DRRIP, and SRRIP. Our evaluation reveals that, even though current state-of-the-art LLC replacement policies provide a significant performance improvement over LRU for both SPEC CPU 2006 and SPEC CPU 2017 workloads, those policies are hardly able to capture the access patterns of, and yield sensible improvement on, current HPC and big-data workloads due to their highly complex behavior. In addition, this paper introduces two new LLC replacement policies derived from MPPPB. The first proposed replacement policy, Multi-Sampler Multiperspective (MS-MPPPB), uses multiple samplers instead of a single one and dynamically selects the best-behaving sampler to drive reuse distance predictions. The second replacement policy presented in this paper, Multiperspective with Dynamic Features Selector (DS-MPPPB), selects the best-behaving features among a set of 64 features to improve the accuracy of the predictions. On a large set of workloads that stress the LLC, MS-MPPPB achieves a geometric mean speedup of 8.3% over LRU, while DS-MPPPB outperforms LRU by a geometric mean speedup of 8.0%. For big-data and HPC workloads, the two proposed techniques present higher performance benefits than state-of-the-art approaches such as MPPPB, Glider, and Hawkeye, which yield geometric mean speedups of 7.0%, 5.0%, and 4.8% over LRU, respectively.
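The multi-sampler selection idea in MS-MPPPB can be illustrated at a very high level: several predictors observe the same sampled accesses, their recent accuracy is tracked, and the one currently performing best drives the reuse predictions. The sketch below is only a schematic of that dynamic-selection pattern, with trivial stand-in predictors and an invented accuracy window; it is not the MPPPB machinery itself.

```python
# Schematic sketch of dynamic selection among several reuse predictors,
# in the spirit of "pick the best-behaving sampler". The predictors here
# are trivial stand-ins, not actual MPPPB samplers.
import random
from collections import deque

class SelectingPredictor:
    def __init__(self, predictors, window=1024):
        self.predictors = predictors  # callables: (pc, addr) -> bool (predicted reuse)
        self.history = [deque(maxlen=window) for _ in predictors]  # 1 = correct, 0 = wrong

    def best(self):
        scores = [sum(h) / len(h) if h else 0.0 for h in self.history]
        return max(range(len(self.predictors)), key=lambda i: scores[i])

    def predict(self, pc, addr):
        # Only the currently best-scoring predictor drives the replacement decision.
        return self.predictors[self.best()](pc, addr)

    def train(self, pc, addr, reused):
        # On a sampled access, every predictor is scored against the observed outcome.
        for i, p in enumerate(self.predictors):
            self.history[i].append(1 if p(pc, addr) == reused else 0)

# Toy predictors: always-reuse, never-reuse, and a PC-hash heuristic.
sel = SelectingPredictor([
    lambda pc, addr: True,
    lambda pc, addr: False,
    lambda pc, addr: (pc >> 2) % 3 != 0,
])

# Toy demo on a synthetic access stream.
random.seed(0)
for _ in range(2000):
    pc, addr = random.randrange(1 << 16), random.randrange(1 << 20)
    sel.train(pc, addr, reused=(addr % 4 != 0))
print("selected sampler index:", sel.best())
```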
{"title":"Characterizing the impact of last-level cache replacement policies on big-data workloads","authors":"Alexandre Valentin Jamet, Lluc Alvarez, Daniel A. Jiménez, Marc Casas","doi":"10.1109/IISWC50251.2020.00022","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00022","url":null,"abstract":"The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policy. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite - and more recently on the SPEC CPU 2017 benchmark suite - for evaluation. However, these workloads are representative for only a subset of current High-Performance Computing (HPC) workloads. In this paper we evaluate the behavior of a mix of graph processing, scientific and industrial workloads (GAP, XSBench and Qualcomm) along with the well-known SPEC CPU 2006 and SPEC CPU 2017 workloads on state-of-the-art LLC replacement policies such as Multiperspective Reuse Prediction (MPPPB), Glider, Hawkeye, SHiP, DRRIP and SRRIP. Our evaluation reveals that, even though current state-of-the-art LLC replacement policies provide a significant performance improvement over LRU for both SPEC CPU 2006 and SPEC CPU 2017 workloads, those policies are hardly able to capture the access patterns and yield sensible improvement on current HPC and big data workloads due to their highly complex behavior. In addition, this paper introduces two new LLC replacement policies derived from MPPPB. The first proposed replacement policy, Multi-Sampler Multiperspective (MS-MPPPB), uses multiple samplers instead of a single one and dynamically selects the best-behaving sampler to drive reuse distance predictions. The second replacement policy presented in this paper, Multiperspective with Dynamic Features Selector (DS-MPPPB), selects the best behaving features among a set of 64 features to improve the accuracy of the predictions. On a large set of workloads that stress the LLC, MS-MPPPB achieves a geometric mean speed-up of 8.3% over LRU, while DS-MPPPB outperforms LRU by a geometric mean speedup of 8.0%. For big data and HPC workloads, the two proposed techniques present higher performance benefits than state-of-the-art approaches such as MPPPB, Glider and Hawkeye, which yield geometric mean speedups of 7.0%, 5.0% and 4.8% over LRU, respectively.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114739964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU Microarchitectural Performance Characterization of Cloud Video Transcoding
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00016 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Yuhan Chen, Jingyuan Zhu, Tanvir Ahmed Khan, Baris Kasikci
Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end-users are encoded to reduce their size and make efficient use of network bandwidth, and are decoded when played on end-users' devices. Videos have to be transcoded, i.e., converted from one encoding format to another, to fit users' differing needs for resolution, framerate, and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) perform a large number of transcoding operations. Optimizing the performance of transcoding to provide a speedup of even a few percent can save millions of dollars in computational and energy costs. While prior work identified microarchitectural characteristics of the transcoding operation for different classes of videos, other parameters of video transcoding and their impact on CPU performance have yet to be studied. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg, with all of its major configurable parameters across videos of different complexity (e.g., videos with high motion and frequent scene transitions are more complex). Based on our profiling results, we find key bottlenecks in the instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. We leverage several state-of-the-art compiler approaches to mitigate performance bottlenecks of video transcoding operations. We apply AutoFDO, a feedback-directed optimization (FDO) tool, to improve instruction cache and branch prediction performance. To improve data cache performance, we leverage Graphite, a polyhedral optimizer. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42%, respectively. We also set up simulation settings with different microarchitecture configurations and explore the potential improvement using a smart scheduler that assigns transcoding tasks to the best-fit configuration based on transcoding parameter values. The smart scheduler performs 3.72% better than the random scheduler and matches the performance of the best scheduler 75% of the time.
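For context on the "major configurable parameters" mentioned above, a single FFmpeg transcode exposing the knobs the abstract lists (encoding format, resolution, framerate, plus quality and encoder-speed settings) might be invoked as in the sketch below. The file names and parameter values are placeholders; the flags shown are standard FFmpeg options, but this is not the exact parameter sweep used in the study.

```python
# Minimal sketch of invoking FFmpeg with a few of its major transcoding knobs.
# File names and parameter values are placeholders, not the paper's sweep.
import subprocess

def transcode(src, dst, codec="libx264", width=1280, height=720, fps=30,
              crf=23, preset="medium"):
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-c:v", codec,                      # target encoding format
        "-vf", f"scale={width}:{height}",   # target resolution
        "-r", str(fps),                     # target framerate
        "-crf", str(crf),                   # quality / bitrate trade-off
        "-preset", preset,                  # encoder speed / efficiency trade-off
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example (assumes input.mp4 exists): transcode("input.mp4", "out_720p30.mp4")
```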
{"title":"CPU Microarchitectural Performance Characterization of Cloud Video Transcoding","authors":"Yuhan Chen, Jingyuan Zhu, Tanvir Ahmed Khan, Baris Kasikci","doi":"10.1109/IISWC50251.2020.00016","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00016","url":null,"abstract":"Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end-users are encoded to reduce their size in order to efficiently use the Internet traffic, and are decoded when played at end-users' devices. Videos have to be transcoded-i.e., where one encoding format is converted to another-to fit users' different needs of resolution, framerate and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) employ a large number of transcoding operations. Optimizing the performance of transcoding to provide speedup of a few percent can save millions of dollars in computational and energy costs. While prior works identified microarchitectural characteristics of the transcoding operation for different classes of videos, other parameters of video transcoding and their impact on CPU performance has yet to be studied. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg with all of its major configurable parameters across videos with different complexity (e.g., videos with high motion and frequent scene transition are more complex). Based on our profiling results, we find key bottlenecks in instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. We leverage several state-of-the-art compiler approaches to mitigate performance bottlenecks of video transcoding operations. We apply AutoFDO, a feedback-directed optimization (FDO) tool to improve instruction cache and branch prediction performance. To improve data cache performance, we leverage Graphite, a polyhedral optimizer. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42% respectively. We also set up simulation settings with different microarchitecture configurations, and explore the potential improvement using a smart scheduler that assigns transcoding tasks to the best-fit configuration based on transcoding parameter values. The smart scheduler performs 3.72% better than the random scheduler and matches the performance of the best scheduler 75% of the time. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg with all of its major configurable parameters across videos with different complexity (e.g., videos with high motion and frequent scene transition are more complex). Based on our profiling results, we find key bottlenecks in instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. 
We lev","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"106 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122986333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Case for Generalizable DNN Cost Models for Mobile Devices
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00025 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Vinod Ganesan, Surya Selvam, Sanchari Sen, Pratyush Kumar, A. Raghunathan
Accurate workload characterization of Deep Neural Networks (DNNs) is challenged by both network and hardware diversity. Networks are being designed with newer motifs such as depthwise separable convolutions, bottleneck layers, etc., which have widely varying performance characteristics. Further, the adoption of Neural Architecture Search (NAS) is creating a Cambrian explosion of networks, greatly expanding the space of networks that must be modeled. On the hardware front, myriad accelerators are being built for DNNs, while compiler improvements are enabling more efficient execution of DNNs on a wide range of CPUs and GPUs. Clearly, characterizing each DNN on each hardware system is infeasible. We thus need cost models to estimate performance that generalize across both devices and networks. In this work, we address this challenge by building a cost model of DNNs on mobile devices. The modeling and evaluation are based on latency measurements of 118 networks on 105 mobile System-on-Chips (SoCs). As a key contribution, we propose that a hardware platform can be represented by its measured latencies on a judiciously chosen, small set of networks, which we call the signature set. We also design a machine learning model that takes as inputs (i) the target hardware representation (measured latencies of the signature set on the hardware) and (ii) a representation of the structure of the DNN to be evaluated, and predicts the latency of the DNN on the target hardware. We propose and evaluate different algorithms to select the signature set. Our results show that by carefully choosing the signature set, the network representation, and the machine learning algorithm, we can train accurate cost models that generalize well. We demonstrate the value of such a cost model in a collaborative workload characterization setup, wherein every mobile device contributes a small set of latency measurements to a centralized repository. With even a small number of measurements per new device, we show that the proposed cost model matches the accuracy of device-specific models trained on an order-of-magnitude larger number of measurements. The entire codebase is released at https://github.com/iitm-sysdl/Generalizable-DNN-cost-models.
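The cost-model setup described above can be sketched roughly as follows: concatenate a device's measured latencies on the signature set with a feature vector describing the target network, and fit a regressor on (device, network, latency) examples. The feature construction, the synthetic data, and the choice of a gradient-boosting regressor are illustrative assumptions, not the authors' model; their actual implementation is in the repository linked above.

```python
# Hedged sketch of the cost-model idea: concatenate (a) a device's measured
# latencies on a small signature set with (b) features describing the target
# network, and train a regressor to predict that network's latency on the device.
# Feature choices, synthetic data, and the regressor are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_features(signature_latencies, network_features):
    # signature_latencies: latencies (ms) of the signature networks on the device
    # network_features: e.g. counts of layer types, total FLOPs, parameter count
    return np.concatenate([signature_latencies, network_features])

# Toy training data: (device, network) pairs with synthetic latency labels.
rng = np.random.default_rng(0)
X = rng.random((200, 16))                               # 8 signature latencies + 8 network features
y = X[:, :8].mean(axis=1) * (1 + X[:, 8:].sum(axis=1))  # synthetic target latency

model = GradientBoostingRegressor().fit(X, y)

new_device_sig = rng.random(8)   # measured once per new device
new_network = rng.random(8)      # derived from the DNN's structure
pred = model.predict(build_features(new_device_sig, new_network).reshape(1, -1))
print("predicted latency (arbitrary units):", float(pred[0]))
```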
{"title":"A Case for Generalizable DNN Cost Models for Mobile Devices","authors":"Vinod Ganesan, Surya Selvam, Sanchari Sen, Pratyush Kumar, A. Raghunathan","doi":"10.1109/IISWC50251.2020.00025","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00025","url":null,"abstract":"Accurate workload characterization of Deep Neural Networks (DNNs) is challenged by both network and hardware diversity. Networks are being designed with newer motifs such as depthwise separable convolutions, bottleneck layers, etc., which have widely varying performance characteristics. Further, the adoption of Neural Architecture Search (NAS) is creating a Cambrian explosion of networks, greatly expanding the space of networks that must be modeled. On the hardware front, myriad accelerators are being built for DNNs, while compiler improvements are enabling more efficient execution of DNNs on a wide range of CPUs and GPUs. Clearly, characterizing each DNN on each hardware system is infeasible. We thus need cost models to estimate performance that generalize across both devices and networks. In this work, we address this challenge by building a cost model of DNNs on mobile devices. The modeling and evaluation are based on latency measurements of 118 networks on 105 mobile System-on-Chips (SoCs). As a key contribution, we propose that a hardware platform can be represented by its measured latencies on a judiciously chosen, small set of networks, which we call the signature set. We also design a machine learning model that takes as inputs (i) the target hardware representation (measured latencies of the signature set on the hardware) and (ii) a representation of the structure of the DNN to be evaluated, and predicts the latency of the DNN on the target hardware. We propose and evaluate different algorithms to select the signature set. Our results show that by carefully choosing the signature set, the network representation, and the machine learning algorithm, we can train accurate cost models that generalize well. We demonstrate the value of such a cost model in a collaborative workload characterization setup, wherein every mobile device contributes a small set of latency measurements to a centralized repository. With even a small number of measurements per new device, we show that the proposed cost model matches the accuracy of device-specific models trained on an order-of-magnitude larger number of measurements. The entire codebase is released at https://github.com/iitm-sysdl/Generalizable-DNN-cost-models.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128937382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPC-MixPBench: An HPC Benchmark Suite for Mixed-Precision Analysis
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00012 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
K. Parasyris, I. Laguna, Harshitha Menon, M. Schordan, D. Osei-Kuffuor, G. Georgakoudis, Michael O. Lam, T. Vanderbruggen
With the increasing interest in applying approximate computing to HPC applications, representative benchmarks are needed to evaluate and compare various approximate computing algorithms and programming frameworks. To this end, we propose HPC-MixPBench, a benchmark suite consisting of a representative set of kernels and benchmarks that are widely used in the HPC domain. HPC-MixPBench has a test harness framework into which different tools can be plugged and evaluated on the set of benchmarks. We demonstrate the effectiveness of our benchmark suite by evaluating several mixed-precision algorithms implemented in FloatSmith, a tool for floating-point mixed-precision approximation analysis. We report several insights about the mixed-precision algorithms that we compare, which we expect can help users of these methods choose the right method for their workload. We envision that this benchmark suite will evolve into a standard set of HPC benchmarks for comparing different approximate computing techniques.
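As a minimal illustration of what a mixed-precision analysis checks, the sketch below runs a kernel at reduced precision and compares it against a double-precision baseline; tools such as FloatSmith automate the search over which variables can be demoted while keeping such errors acceptable. The kernel and the error threshold are invented for this example and are not part of the benchmark suite.

```python
# Illustrative sketch of a mixed-precision check: run a kernel at reduced
# precision and compare against a double-precision baseline. The kernel and
# the acceptance threshold are made up for this example.
import numpy as np

def sum_of_squares(a, dtype):
    x = a.astype(dtype)
    return (x * x).sum(dtype=dtype)

rng = np.random.default_rng(1)
a = rng.standard_normal(1_000_000)

ref = sum_of_squares(a, np.float64)      # double-precision baseline
approx = sum_of_squares(a, np.float32)   # candidate reduced precision
rel_err = abs(float(approx) - float(ref)) / abs(float(ref))
print(f"relative error at float32: {rel_err:.2e}",
      "(acceptable)" if rel_err < 1e-4 else "(too large)")
```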
{"title":"HPC-MixPBench: An HPC Benchmark Suite for Mixed-Precision Analysis","authors":"K. Parasyris, I. Laguna, Harshitha Menon, M. Schordan, D. Osei-Kuffuor, G. Georgakoudis, Michael O. Lam, T. Vanderbruggen","doi":"10.1109/IISWC50251.2020.00012","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00012","url":null,"abstract":"With the increasing interest in applying approximate computing to HPC applications, representative benchmarks are needed to evaluate and compare various approximate computing algorithms and programming frameworks. To this end, we propose HPC-MixPBench, a benchmark suite consisting of a representative set of kernels and benchmarks that are widely used in HPC domain. HPC-MixPBench has a test harness framework where different tools can be plugged in and evaluated on the set of benchmarks. We demonstrate the effectiveness of our benchmark suite by evaluating several mixed-precision algorithms implemented in FloatSmith, a tool for floating-point mixed-precision approximation analysis. We report several insights about the mixed-precision algorithms that we compare, which we expect can help users of these methods choose the right method for their workload. We envision that this benchmark suite will evolve into a standard set of HPC benchmarks for comparing different approximate computing techniques.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129050198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-Stack Workload Characterization of Deep Recommendation Systems
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00024 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, D. Brooks
Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can yield up to a 15x speedup. To better understand the bottlenecks for further optimization, we look at both the software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.
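To illustrate the "batch size granularity" knob mentioned above, the sketch below times a small embedding-plus-MLP model, loosely shaped like a deep recommendation model, at several batch sizes on the CPU. The model dimensions, lookup counts, and batch sizes are arbitrary; this is not one of the paper's eight industry-representative models.

```python
# Rough sketch of how batch-size granularity affects throughput for an
# embedding + MLP model, loosely shaped like a deep recommendation model.
# Dimensions and batch sizes are arbitrary, not the paper's configurations.
import time
import torch

emb = torch.nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=32, mode="sum")
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))

def throughput(batch_size, iters=20):
    ids = torch.randint(0, 100_000, (batch_size, 8))   # 8 sparse lookups per sample
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            mlp(emb(ids))
        elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

for bs in (1, 16, 256, 4096):
    print(f"batch={bs:5d}  ~{throughput(bs):,.0f} samples/s")
```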
{"title":"Cross-Stack Workload Characterization of Deep Recommendation Systems","authors":"Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, D. Brooks","doi":"10.1109/IISWC50251.2020.00024","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00024","url":null,"abstract":"Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches - ranging from near memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can give us up to 15x speedup. To better understand the bottlenecks for further optimization, we look at both software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126431609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework
Pub Date: 2020-10-01 | DOI: 10.1109/IISWC50251.2020.00011 | 2020 IEEE International Symposium on Workload Characterization (IISWC)
S. Pal, Kuba Kaszyk, Siying Feng, Björn Franke, M. Cole, M. O’Boyle, T. Mudge, R. Dreslinski
The rising complexity of large-scale heterogeneous architectures, such as those composed of off-the-shelf processors coupled with fixed-function logic, has imposed challenges for traditional simulation methodologies. While prior work has explored trace-based simulation techniques that offer good tradeoffs between simulation accuracy and speed, most such proposals are limited to simulating chip multiprocessors (CMPs) with up to hundreds of threads. There is a gap for a framework that can flexibly and accurately model different heterogeneous systems and scale to a larger number of cores. We implement a solution called HETSIM, a trace-driven, synchronization- and dependency-aware framework for fast and accurate pre-silicon performance and power estimation for heterogeneous systems with up to thousands of cores. HETSIM operates in four stages: compilation, emulation, trace generation, and trace replay. Given (i) a specification file, (ii) a multithreaded implementation of the target application, and (iii) an architectural and power model of the target hardware, HETSIM generates performance and power estimates with no further user intervention. HETSIM distinguishes itself from existing approaches through emulation of target hardware functionality as software primitives. HETSIM is packaged with primitives that are commonplace across many accelerator designs, and the framework can easily be extended to support custom primitives. We demonstrate the utility of HETSIM through design-space exploration on two recent target architectures: (i) a reconfigurable many-core accelerator, and (ii) a heterogeneous, domain-specific accelerator. Overall, HETSIM demonstrates simulation-time speedups of 3.2×-10.4× (average 5.0×) over gem5 in syscall emulation mode, with average deviations in simulated time and power consumption of 15.1% and 10.9%, respectively. HETSIM is validated against silicon for the second target and estimates performance within a deviation of 25.5%, on average.
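The dependency-aware trace-replay stage can be pictured with a toy example: each trace entry carries a latency and the IDs of the entries it depends on, and replay resolves an entry once all of its dependencies have finished. The trace format and numbers below are invented for illustration and are not HETSIM's actual trace format or timing model.

```python
# Toy dependency-aware trace replay in the spirit described above: each trace
# entry finishes after all entries it depends on, plus its own latency.
# The trace format and numbers are invented, not HETSIM's.

# (entry_id, latency_cycles, dependency_ids), assumed topologically ordered.
TRACE = [
    (0, 10, []),        # e.g. a load primitive
    (1, 4,  [0]),       # compute depending on the load
    (2, 4,  [0]),       # independent compute on another processing element
    (3, 2,  [1, 2]),    # synchronization point joining both
]

def replay(trace):
    finish = {}
    for entry_id, latency, deps in trace:
        ready = max((finish[d] for d in deps), default=0)
        finish[entry_id] = ready + latency
    return max(finish.values())

print("simulated completion time:", replay(TRACE), "cycles")
```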
{"title":"HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework","authors":"S. Pal, Kuba Kaszyk, Siying Feng, Björn Franke, M. Cole, M. O’Boyle, T. Mudge, R. Dreslinski","doi":"10.1109/IISWC50251.2020.00011","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00011","url":null,"abstract":"The rising complexity of large-scale heterogeneous architectures, such as those composed of off-the-shelf processors coupled with fixed-function logic, has imposed challenges for traditional simulation methodologies. While prior work has explored trace-based simulation techniques that offer good tradeoffs between simulation accuracy and speed, most such proposals are limited to simulating chip multiprocessors (CMPs) with up to hundreds of threads. There exists a gap for a framework that can flexibly and accurately model different heterogeneous systems, as well as scales to a larger number of cores. We implement a solution called HETSIM, a trace-driven, synchronization and dependency-aware framework for fast and accurate pre-silicon performance and power estimations for heterogeneous systems with up to thousands of cores. HETSIM operates in four stages: compilation, emulation, trace generation and trace replay. Given (i) a specification file, (ii) a multithreaded implementation of the target application, and (iii) an architectural and power model of the target hardware, HETSIM generates performance and power estimates with no further user intervention. HETSIM distinguishes itself from existing approaches through emulation of target hardware functionality as software primitives. HETSIM is packaged with primitives that are commonplace across many accelerator designs, and the framework can easily be extended to support custom primitives. We demonstrate the utility of HETSIM through design-space exploration on two recent target architectures: (i) a reconfigurable many-core accelerator, and (ii) a heterogeneous, domain-specific accelerator. Overall, HETSIM demonstrates simulation time speedups of 3.2×-10.4× (average 5.0×) over gem5 in syscall emulation mode, with average deviations in simulated time and power consumption of 15.1% and 10.9%, respectively. HETSIM is validated against silicon for the second target and estimates performance within a deviation of 25.5%, on average.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":"59 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129494612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}