Off-chip memory traffic has been a major performance bottleneck in deep learning accelerators. While reusing on-chip data is a promising way to reduce off-chip traffic, the opportunity of reusing shortcut connection data in deep networks (e.g., residual networks) has been largely neglected. Such shortcut data accounts for nearly 40% of the total feature map data. In this paper, we propose Shortcut Mining, a novel approach that “mines” the unexploited opportunity of on-chip data reuse. We introduce the abstraction of logical buffers to address the lack of flexibility in existing buffer architectures, and then propose a sequence of procedures which, collectively, can effectively reuse both shortcut and non-shortcut feature maps. The proposed procedures are also able to reuse shortcut data across any number of intermediate layers without using additional buffer resources. Experimental results from prototyping on FPGAs show that the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reductions in off-chip feature map traffic for SqueezeNet, ResNet-34, and ResNet-152, respectively, and a 1.93× increase in throughput compared with a state-of-the-art accelerator.
{"title":"Shortcut Mining: Exploiting Cross-Layer Shortcut Reuse in DCNN Accelerators","authors":"Arash AziziMazreah, Lizhong Chen","doi":"10.1109/HPCA.2019.00030","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00030","url":null,"abstract":"Off-chip memory traffic has been a major performance bottleneck in deep learning accelerators. While reusing on-chip data is a promising way to reduce off-chip traffic, the opportunity on reusing shortcut connection data in deep networks (e.g., residual networks) have been largely neglected. Those shortcut data accounts for nearly 40% of the total feature map data. In this paper, we propose Shortcut Mining, a novel approach that “mines” the unexploited opportunity of on-chip data reusing. We introduce the abstraction of logical buffers to address the lack of flexibility in existing buffer architecture, and then propose a sequence of procedures which, collectively, can effectively reuse both shortcut and non-shortcut feature maps. The proposed procedures are also able to reuse shortcut data across any number of intermediate layers without using additional buffer resources. Experiment results from prototyping on FPGAs show that, the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reduction in off-chip feature map traffic for SqueezeNet, ResNet-34, and ResNet152, respectively and a 1.93X increase in throughput compared with a state-of-the-art accelerator.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122050394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amirhossein Mirhosseini, Akshitha Sriraman, T. Wenisch
We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high-throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive, and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, each of which primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads’ state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice’s QoS violations. Our evaluation demonstrates that Duplexity achieves 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency than an SMT-based server design, on average.
{"title":"Enhancing Server Efficiency in the Face of Killer Microseconds","authors":"Amirhossein Mirhosseini, Akshitha Sriraman, T. Wenisch","doi":"10.1109/HPCA.2019.00037","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00037","url":null,"abstract":"We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds, without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, which each primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads’ state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice’s QoS violation. Our evaluation demonstrates that Duplexity is able to achieve 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency over an SMT-based server design, on average.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114508056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present the Versatile Inference Processor (VIP), a highly programmable architecture for machine learning inference. VIP consists of 128 lightweight processing engines employing a vector processing paradigm, with a simple ISA and carefully chosen microarchitecture features. It is coupled with a modern, lightly customized, 3D-stacked memory system. Through detailed execution-driven simulations backed by RTL synthesis, we show that we can achieve online, real-time vision throughput (24 fps) at low power consumption, for both full-HD depth-from-stereo using belief propagation, and VGG-16 and VGG-19 deep neural networks (batch size of 1). Our RTL synthesis of a VIP processing engine in TSMC 28 nm technology, using a commercial standard-cell library supplied by ARM, results in 18 mm² of silicon area and 3.5 W to 4.8 W of power consumption for all 128 VIP processing engines combined.
{"title":"VIP: A Versatile Inference Processor","authors":"Skand Hurkat, José F. Martínez","doi":"10.1109/HPCA.2019.00049","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00049","url":null,"abstract":"We present Versatile Inference Processor (VIP), a highly programmable architecture for machine learning inference. VIP consists of 128 lightweight processing engines employing a vector processing paradigm, with a simple ISA and carefully chosen microarchitecture features. It is coupled with a modern, lightly customized, 3D-stacked memory system. Through detailed execution-driven simulations backed by RTL synthesis, we show that we can achieve online, real-time vision throughput (24 fps), at low power consumption, for both fullHD depth-from-stereo using belief propagation, and VGG-16 and VGG-19 deep neural networks (batch size of 1). Our RTL synthesis of a VIP processing engine in TSMC 28 nm technology, using a commercial standard-cell library supplied by ARM, results in 18 mm2 of silicon area and 3.5 W to 4.8 W of power consumption for all 128 VIP processing engines combined.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130662278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“Amoeba: An Autonomous Backup and Recovery SSD for Ransomware Attack Defense”, Donghyun Min, Donggyu Park, Jinwoo Ahn, Ryan Walker, Junghee Lee, Sungyong Park, Youngjae Kim, Sogang University and University of Texas at San Antonio
“The Architectural Implications of Cloud Microservices”, Yu Gan and Christina Delimitrou, Cornell University
“An Alternative Analytical Approach to Associative Processing”, Soroosh Khoram, Yue Zha, and Jing Li, University of Wisconsin-Madison
{"title":"The Best of IEEE Computer Architecture Letters in 2018","authors":"P. Gratz","doi":"10.1109/HPCA.2019.00060","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00060","url":null,"abstract":"“Amoeba: An Autonomous Backup and Recovery SSD for Ransomware Attack Defense”, Donghyun Min, Donggyu Park, Jinwoo Ahn, Ryan Walker, Junghee Lee, Sungyong Park, Youngjae Kim, Sogang University and University of Texas at San Antonio “The Architectural Implications of Cloud Microservices”, Yu Gan and Christina Delimitrou, Cornell University “An Alternative Analytical Approach to Associative Processing”, Soroosh Khoram, Yue Zha, and Jing Li, University of Wisconsin-Madison","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131006620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Timing Margin (ATM) is a technology that improves processor efficiency by reducing the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. Although ATM has already been shown to yield substantial performance benefits, its full potential has yet to be unlocked. In this paper, we investigate how to maximize ATM’s efficiency gain with a new means of exposing the inter-core speed variation: fine-tuning the ATM control loop. We conduct our analysis and evaluation on a production-grade POWER7+ system. On the POWER7+ server platform, we fine-tune the ATM control loop by programming its Critical Path Monitors, a key component of its ATM design that measures the cores’ timing margins. With a robust stress-test procedure, we expose over 200 MHz of inherent inter-core speed differential by fine-tuning the per-core ATM control loop. Exploiting this differential, we manage to double the ATM frequency gain over the static timing margin; this is not possible using conventional means, i.e., by setting fixed <v, f> points for each core, because the core-level <v, f> setting must account for chip-wide worst-case voltage variation. To manage the significant performance heterogeneity of fine-tuned systems, we propose application scheduling and throttling to cope with the chip’s process and voltage variation. Our proposal improves application performance by more than 10% over the static margin, almost doubling the 6% improvement of the default, unmanaged ATM system. Our technique is general enough that it can be adopted by any system that employs an active timing margin control loop.
Keywords: Active timing margin; Performance; Power efficiency; Reliability; Critical path monitors
{"title":"Fine-Tuning the Active Timing Margin (ATM) Control Loop for Maximizing Multi-core Efficiency on an IBM POWER Server","authors":"Yazhou Zu, Daniel Richins, C. Lefurgy, V. Reddi","doi":"10.1109/HPCA.2019.00031","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00031","url":null,"abstract":"Active Timing Margin (ATM) is a technology that improves processor efficiency by reducing the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. Although ATM has already been shown to yield substantial performance benefits, its full potential has yet to be unlocked. In this paper, we investigate how to maximize ATM’s efficiency gain with a new means of exposing the inter-core speed variation: finetuning the ATM control loop. We conduct our analysis and evaluation on a production-grade POWER7+ system. On the POWER7+ server platform, we fine-tune the ATM control loop by programming its Critical Path Monitors, a key component of its ATM design that measures the cores’ timing margins. With a robust stress-test procedure, we expose over 200 MHz of inherent inter-core speed differential by fine-tuning the percore ATM control loop. Exploiting this differential, we manage to double the ATM frequency gain over the static timing margin; this is not possible using conventional means, i.e. by setting fixed <v, f> points for each core, because the corelevel <v, f> must account for chip-wide worst-case voltage variation. To manage the significant performance heterogeneity of fine-tuned systems, we propose application scheduling and throttling to manage the chip’s process and voltage variation. Our proposal improves application performance by more than 10% over the static margin, almost doubling the 6% improvement of the default, unmanaged ATM system. Our technique is general enough that it can be adopted by any system that employs an active timing margin control loop. Keywords-Active timing margin, Performance, Power efficiency, Reliability, Critical path monitors","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129655605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaowei Wang, Jiecao Yu, C. Augustine, R. Iyer, R. Das
We propose Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks, an in-SRAM architecture for accelerating Convolutional Neural Network (CNN) inference by leveraging network redundancy and massive parallelism. The network redundancy is exploited in two ways. First, we prune and fine-tune the trained network model and develop two distinct methods, coalescing and overlapping, to run inference efficiently with sparse models. Second, we propose an architecture for network models with a reduced bit width by leveraging bit-serial computation. Our proposed architecture achieves a 17.7×/3.7× speedup over server-class CPU/GPU, and a 1.6× speedup compared to the relevant in-cache accelerator, with 2% area overhead per processor die and no loss in top-1 accuracy for AlexNet. With a relaxed accuracy limit, our tunable architecture achieves higher speedups.
Keywords: In-Memory Computing; Cache; Neural Network Pruning; Low-Precision Neural Network
{"title":"Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks","authors":"Xiaowei Wang, Jiecao Yu, C. Augustine, R. Iyer, R. Das","doi":"10.1109/HPCA.2019.00029","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00029","url":null,"abstract":"We propose Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks an in-SRAM architecture for accelerating Convolutional Neural Network (CNN) inference by leveraging network redundancy and massive parallelism. The network redundancy is exploited in two ways. First, we prune and fine-tune the trained network model and develop two distinct methods coalescing and overlapping to run inferences efficiently with sparse models. Second, we propose an architecture for network models with a reduced bit width by leveraging bit-serial computation. Our proposed architecture achieves a 17.7×/3.7× speedup over server class CPU/GPU, and a 1.6× speedup compared to the relevant in-cache accelerator, with 2% area overhead each processor die, and no loss on top-1 accuracy for AlexNet. With a relaxed accuracy limit, our tunable architecture achieves higher speedups. Keywords-In-Memory Computing; Cache; Neural Network Pruning; Low Precision Neural Network.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126433614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ilias Vougioukas, Nikos Nikoleris, Andreas Sandberg, S. Diestelhorst, B. Al-Hashimi, G. Merrett
Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve the higher accuracies needed in high-performance cores. However, branch prediction can also be a source of side-channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation while retaining accuracy. Achieving both has proven to be challenging. In this work we address this by (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold with little, if any, added area and no warm-up overheads. At the same time, our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.
{"title":"BRB: Mitigating Branch Predictor Side-Channels.","authors":"Ilias Vougioukas, Nikos Nikoleris, Andreas Sandberg, S. Diestelhorst, B. Al-Hashimi, G. Merrett","doi":"10.1109/HPCA.2019.00058","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00058","url":null,"abstract":"Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve higher accuracies which are needed in high performance cores. However, branch prediction can also be a source of side channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation when retaining the accuracy. Achieving both has proven to be challenging. In this work we address this by, (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold with little if any added area and no warm-up overheads. At the same time our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133653081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Li, C. Lefurgy, K. Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, D. Heimsoth, Saugata Ghose, O. Mutlu
Power management is a key component of modern data center design. Power managers must (1) ensure the cost- and energy-efficient utilization of the data center infrastructure, (2) maintain availability of the services provided by the center, and (3) address environmental concerns associated with the center’s power consumption. While several power management techniques have been proposed and deployed in production data centers, there are still many challenges to comprehensive data center power management. This is particularly true in public cloud environments, where different jobs have different priority levels, and where high availability is critical. One example of the challenges facing public cloud data centers involves power capping. As power delivery must be highly reliable and tolerate wide variation in the load drawn by the data center components, the power infrastructure (e.g., power supplies, circuit breakers, UPS) has high redundancy and overprovisioning. During normal operation (i.e., typical server power demands, and no failures in the center), the power infrastructure is significantly underutilized. Power capping is a common solution to reduce this underutilization, by allowing more servers to be added safely (i.e., without power shortfalls) to the existing power infrastructure, and throttling power consumption in the infrequent cases where the demanded power exceeds the provisioned power capacity to avoid shortfalls. However, state-of-the-art power capping solutions are (1) not directly applicable to the redundant power infrastructure used in highly-available data centers; and (2) oblivious to differing workload priorities across the entire center when power consumption needs to be throttled, which can unnecessarily slow down high-priority work. To address this need, we develop CapMaestro, a new power management architecture with three key features for public cloud data centers. First, CapMaestro is designed to work with multiple power feeds (i.e., sources), and exploits server-level power capping to independently cap the load on each feed of a server. Second, CapMaestro uses a scalable, global priority-aware power capping approach, which accounts for power capacity at each level of the power distribution hierarchy. It exploits the underutilization of commonly-employed redundant power infrastructure at each level of the hierarchy to safely accommodate a much greater number of servers. Third, CapMaestro exploits stranded power (i.e., power budgets that are not utilized) in redundant power infrastructure to boost the performance of workloads in the data center. We add CapMaestro to a real cloud data center control plane, and demonstrate the effectiveness of all three key features. Using a large-scale data center simulation, we demonstrate that CapMaestro significantly and safely increases the number of servers for existing infrastructure. We also call out other key technical challenges the industry faces in data center power management.
{"title":"A Scalable Priority-Aware Approach to Managing Data Center Server Power","authors":"Y. Li, C. Lefurgy, K. Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, D. Heimsoth, Saugata Ghose, O. Mutlu","doi":"10.1109/HPCA.2019.00067","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00067","url":null,"abstract":"Power management is a key component of modern data center design. Power managers must (1) ensure the costand energy-efficient utilization of the data center infrastructure, (2) maintain availability of the services provided by the center, and (3) address environmental concerns associated with the center’s power consumption. While several power management techniques have been proposed and deployed in production data centers, there are still many challenges to comprehensive data center power management. This is particularly true in public cloud environments, where different jobs have different priority levels, and where high availability is critical. One example of the challenges facing public cloud data centers involves power capping. As power delivery must be highly reliable and tolerate wide variation in the load drawn by the data center components, the power infrastructure (e.g., power supplies, circuit breakers, UPS) has high redundancy and overprovisioning. During normal operation (i.e., typical server power demands, and no failures in the center), the power infrastructure is significantly underutilized. Power capping is a common solution to reduce this underutilization, by allowing more servers to be added safely (i.e., without power shortfalls) to the existing power infrastructure, and throttling power consumption in the infrequent cases where the demanded power exceeds the provisioned power capacity to avoid shortfalls. However, state-of-the-art power capping solutions are (1) not directly applicable to the redundant power infrastructure used in highly-available data centers; and (2) oblivious to differing workload priorities across the entire center when power consumption needs to be throttled, which can unnecessarily slow down high-priority work. To address this need, we develop CapMaestro, a new power management architecture with three key features for public cloud data centers. First, CapMaestro is designed to work with multiple power feeds (i.e., sources), and exploits server-level power capping to independently cap the load on each feed of a server. Second, CapMaestro uses a scalable, global priority-aware power capping approach, which accounts for power capacity at each level of the power distribution hierarchy. It exploits the underutilization of commonly-employed redundant power infrastructure at each level of the hierarchy to safely accommodate a much greater number of servers. Third, CapMaestro exploits stranded power (i.e., power budgets that are not utilized) in redundant power infrastructure to boost the performance of workloads in the data center. We add CapMaestro to a real cloud data center control plane, and demonstrate the effectiveness of all three key features. Using a large-scale data center simulation, we demonstrate that CapMaestro significantly and safely increases the number of servers for existing infrastructure. 
We also call out other key technical challenges the industry faces in data center power management.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133865997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
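The global priority-aware capping step can be illustrated with a small allocation routine for one level of the power hierarchy (the policy below, strict priority across classes with proportional throttling inside a class, is an assumption made for illustration and not necessarily CapMaestro's exact algorithm).

```python
# Illustrative budget assignment for one node of a power-distribution hierarchy:
# higher-priority children receive their demand first; a priority class that does
# not fit in the remaining budget is throttled proportionally.

def assign_budgets(node_budget_w, children):
    """children: list of (name, priority, demand_w); lower priority value = more critical."""
    budgets, remaining = {}, node_budget_w
    for prio in sorted({p for _, p, _ in children}):
        group = [c for c in children if c[1] == prio]
        demand = sum(d for _, _, d in group)
        grant = min(demand, remaining)
        for name, _, d in group:
            budgets[name] = d if demand <= remaining else grant * d / demand
        remaining -= grant
    return budgets

print(assign_budgets(1000, [("latency-critical", 0, 600),
                            ("batch-a", 1, 300), ("batch-b", 1, 300)]))
# {'latency-critical': 600, 'batch-a': 200.0, 'batch-b': 200.0}
```

In a full hierarchy the same routine would be applied recursively, with each child's grant becoming the node budget for the level below it.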
Yatish Turakhia, Sneha D. Goenka, G. Bejerano, W. Dally
Whole genome alignment (WGA) is an indispensable tool in comparative genomics to study how different lifeforms have been shaped by evolution at the molecular level. Existing software whole genome aligners require several CPU weeks to compare a pair of mammalian genomes and still miss several biologically-meaningful, high-scoring alignment regions. These aligners are based on the seed-filter-and-extend paradigm with an ungapped filtering stage. Ungapped filtering is responsible for the low sensitivity of these aligners but is used because it is 200× faster than performing gapped alignment, using dynamic programming, in software. In this paper, we show that both performance and sensitivity can be greatly improved by using a hardware accelerator for WGA. Using the genomes of two roundworms (C. elegans and C. briggsae) and four fruit flies (D. melanogaster, D. simulans, D. yakuba, and D. pseudoobscura), we show that replacing ungapped filtering with gapped filtering increases the number of matching base-pairs in alignments by up to 3×. Our accelerator, Darwin-WGA, is the first hardware accelerator for whole genome alignment and accelerates the gapped filtering stage. Darwin-WGA also employs GACT-X, a novel algorithm used in the extension stage to align arbitrarily long genome sequences using a small on-chip memory, which provides better-quality alignments at a 2× improvement in memory and speed over the previously published GACT algorithm. Implemented on an FPGA, Darwin-WGA provides up to 24× improvement (performance/$) in WGA over iso-sensitive software. An ASIC implementation of the proposed architecture in TSMC 40 nm technology takes around 43 W of power and 36 mm² of area. It achieves up to 10× performance/watt improvement on whole genome alignments over state-of-the-art software at higher sensitivity, and up to 1,500× performance/watt improvement compared to iso-sensitive software. Darwin-WGA is released under the open-source MIT license and is available from https://github.com/gsneha26/Darwin-WGA.
Keywords: Co-processor; Comparative Genomics; Whole Genome Alignment; Gapped Filtering
{"title":"Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup","authors":"Yatish Turakhia, Sneha D. Goenka, G. Bejerano, W. Dally","doi":"10.1109/HPCA.2019.00050","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00050","url":null,"abstract":"Whole genome alignment (WGA) is an indispensable tool in comparative genomics to study how different lifeforms have been shaped by evolution at the molecular level. Existing software whole genome aligners require several CPU weeks to compare a pair of mammalian genomes and still miss several biologically-meaningful, high-scoring alignment regions. These aligners are based on the seed-filter-and-extend paradigm with an ungapped filtering stage. Ungapped filtering is responsible for the low sensitivity of these aligners but is used because it is 200× faster than performing gapped alignment, using dynamic programming, in software. In this paper, we show that both performance and sensitivity can be greatly improved by using a hardware accelerator for WGA. Using the genomes of two roundworms (C. elegans and C. Briggsae) and four fruit flies (D. melanogaster, D. simulans, D. yakuba, and D. pseudoobscura), we show that replacing ungapped filtering with gapped filtering increases the number of matching base-pairs in alignments by up to 3×. Our accelerator, Darwin-WGA, is the first hardware accelerator for whole genome alignment and accelerates the gapped filtering stage. Darwin-WGA also employs GACT-X, a novel algorithm used in the extension stage to align arbitrarily long genome sequences using a small on-chip memory, that provides better quality alignments at 2× improvement in memory and speed over the previously published GACT algorithm. Implemented on an FPGA, Darwin-WGA provides up to 24× improvement (performance/$) in WGA over iso-sensitive software. An ASIC implementation of the proposed architecture on TSMC 40nm technology takes around 43W power with 36mm area. It achieves up to 10× performance/watt improvement on whole genome alignments over state-of-the-art software at higher sensitivity, and up to 1,500× performance/watt improvement compared to iso-sensitive software. Darwin-WGA is released under open-source MIT license and is available from https://github.com/gsneha26/Darwin-WGA. Keywords-Co-processor, Comparative Genomics, Whole Genome Alignment, Gapped Filtering","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131376261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, R. Boyapati, K. H. Yum, Eun Jung Kim
The explosion of data availability and the demand for faster data analysis have led to the emergence of applications exhibiting large memory footprints and low data reuse rates. These workloads, ranging from neural networks to graph processing, expose compute kernels that operate over myriads of data. The significant data movement requirements of these kernels impose heavy stress on modern memory subsystems and communication fabrics. To mitigate the worsening gap between high CPU computation density and deficient memory bandwidth, solutions like memory networks and near-data processing designs are being architected to improve system performance substantially. In this work, we examine the idea of mapping compute kernels to the memory network so as to leverage in-network computing in data-flow style, by means of near-data processing. We propose Active-Routing, an in-network compute architecture that enables computation on the way for near-data processing by exploiting patterns of aggregation over intermediate results of arithmetic operators. The proposed architecture leverages the massive memory-level parallelism and network concurrency to optimize the aggregation operations along a dynamically built Active-Routing Tree. Our evaluations show that Active-Routing can achieve up to 7× speedup with an average of 60% performance improvement, and reduce the energy-delay product by 80% across various benchmarks compared to the state-of-the-art processing-in-memory architecture.
{"title":"Active-Routing: Compute on the Way for Near-Data Processing","authors":"Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, R. Boyapati, K. H. Yum, Eun Jung Kim","doi":"10.1109/HPCA.2019.00018","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00018","url":null,"abstract":"—The explosion of data availability and the demand for faster data analysis have led to the emergence of applications exhibiting large memory footprint and low data reuse rate. These workloads, ranging from neural networks to graph processing, expose compute kernels that operate over myriads of data. Significant data movement requirements of these kernels impose heavy stress on modern memory subsystems and communication fabrics. To mitigate the worsening gap between high CPU computation density and deficient memory bandwidth, solutions like memory networks and near-data processing designs are being architected to improve system performance substantially. In this work, we examine the idea of mapping compute ker- nels to the memory network so as to leverage in-network computing in data-flow style, by means of near-data processing. We propose Active-Routing , an in-network compute architecture that enables computation on the way for near-data processing by exploiting patterns of aggregation over intermediate results of arithmetic operators. The proposed architecture leverages the massive memory-level parallelism and network concurrency to optimize the aggregation operations along a dynamically built Active-Routing Tree . Our evaluations show that Active-Routing can achieve upto 7 × speedup with an average of 60% performance improvement, and reduce the energy-delay product by 80% across various benchmarks compared to the state-of-the-art processing-in-memory architecture.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123010735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}