Elastic Instruction Fetching
Arthur Perais, Rami Sheikh, Luke Yen, Michael McIlvaine, R. Clancy
DOI: 10.1109/HPCA.2019.00059
Branch prediction (i.e., the generation of fetch addresses) and instruction cache accesses need not be tightly coupled. When the instruction fetch stage stalls because of an ICache miss or back-pressure, the branch predictor can run ahead and generate future fetch addresses that can be used for optimizations such as instruction prefetching and, more importantly, hiding taken-branch fetch bubbles. This approach is used in many commercially available high-performance designs. However, decoupling branch prediction from instruction retrieval has several drawbacks. First, it can increase the pipeline depth, leading to more expensive pipeline flushes. Second, it requires a large Branch Target Buffer (BTB) to store branch targets so that the branch predictor can follow taken branches without decoding instruction bytes; a BTB miss also causes additional bubbles. In some classes of workloads, those drawbacks may significantly offset the benefits of decoupling. In this paper, we present ELastic Fetching (ELF), a hybrid mechanism that decouples branch prediction from instruction retrieval while minimizing the additional bubbles caused by pipeline flushes and BTB misses. We present two implementations that trade off complexity for additional performance; they outperform a baseline decoupled fetcher design by up to 3.7% and 5.2%, respectively.
{"title":"Keynote Abstracts","authors":"","doi":"10.1109/hpca.2019.00-44","DOIUrl":"https://doi.org/10.1109/hpca.2019.00-44","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115576431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CIDR: A Cost-Effective In-Line Data Reduction System for Terabit-Per-Second Scale SSD Arrays
M. Ajdari, Pyeongsu Park, Joonsung Kim, Dongup Kwon, Jang-Hyun Kim
DOI: 10.1109/HPCA.2019.00025
An SSD array, a storage system consisting of multiple SSDs per node, has become a common design choice for fast primary storage, and modern storage architects now aim to achieve terabit-per-second-scale performance with next-generation SSD arrays. To reduce storage cost and improve device endurance, such an array must employ data reduction schemes (i.e., deduplication and compression) that provide high data reduction capability at minimum cost. However, existing data reduction schemes do not scale with the rapidly increasing performance of an SSD array, whether because of the prohibitive amount of CPU resources they consume (software-based schemes), a low data reduction ratio (SSD-device-wide deduplication), or cost-ineffectiveness in the face of changing datacenter workloads (ASIC-based acceleration). In this paper, we propose CIDR, a novel FPGA-based, cost-effective data reduction system for an SSD array that achieves terabit-per-second-scale storage performance. Our key ideas are as follows. First, we decouple data-reduction-related computing tasks from the unscalable host CPUs by offloading them to a scalable array of FPGA boards. Second, we employ a centralized, node-wide metadata management scheme to achieve high, SSD-array-wide data reduction. Third, our FPGA-based reconfiguration adapts to different workload patterns by dynamically rebalancing the software and hardware tasks running on CPUs and FPGAs, respectively. For evaluation, we built a CIDR prototype achieving up to 12.8 GB/s (0.1 Tbps) on one FPGA. CIDR outperforms the baseline by up to 2.47x for a write-only workload and by a projected 3.2x for a mixed read-write workload. We demonstrated CIDR's scalability toward Tbps-scale performance by measuring a two-FPGA configuration and projecting the performance impact of additional FPGAs.
Keywords: deduplication; compression; FPGA; SSD array
{"title":"NAND-Net: Minimizing Computational Complexity of In-Memory Processing for Binary Neural Networks","authors":"Hyeonuk Kim, Jaehyeong Sim, Yeongjae Choi, L. Kim","doi":"10.1109/HPCA.2019.00017","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00017","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127354861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Voltage/Frequency Scaling and Core Allocation for Balanced Energy and Performance on Multicore CPUs
G. Papadimitriou, Athanasios Chatzidimitriou, D. Gizopoulos
DOI: 10.1109/HPCA.2019.00033
Energy efficiency is a major concern for computing system designers. Significant effort is devoted to power optimization of modern systems, especially in large-scale installations such as data centers, where both high performance and energy efficiency are important. Power optimization can be achieved through different approaches, several of which focus on adaptive voltage regulation. In this paper, we present a comprehensive exploration of how two server-grade systems behave across frequency and core-allocation configurations beyond nominal voltage operation. Our analysis, built on two state-of-the-art ARMv8 microprocessor chips (Applied Micro's X-Gene 2 and X-Gene 3), aims (1) to identify the best performance-per-watt operating points across voltage/frequency combinations, (2) to reveal how and why different core-allocation options across the microprocessor's available cores affect energy consumption, and (3) to enhance the default Linux scheduler to make task-allocation decisions that balance performance and energy efficiency. Our findings, obtained on actual server hardware, have been integrated into a lightweight online monitoring daemon that decides the optimal combination of voltage, core allocation, and clock frequency for higher energy efficiency. Compared to the default system configuration, our approach reduces energy on average by 25.2% on X-Gene 2 and 22.3% on X-Gene 3, with a minimal performance penalty of 3.2% and 2.5%, respectively.
Keywords: energy efficiency; voltage and frequency scaling; power consumption; multicore characterization; micro-servers
eQASM: An Executable Quantum Instruction Set Architecture
X. Fu, L. Riesebos, M. A. Rol, J. V. Straten, J. V. Someren, N. Khammassi, Imran Ashraf, R. Vermeulen, V. Newsum, K. Loh, J. C. D. Sterke, W. Vlothuizen, R. N. Schouten, C. G. Almudever, Leonardo DiCarlo, K. Bertels
Bingo Spatial Data Prefetcher
Mohammad Bakhshalipour, Mehran Shakerinava, P. Lotfi-Kamran, H. Sarbazi-Azad
DOI: 10.1109/HPCA.2019.00053
Applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide the long latency of DRAM accesses. While state-of-the-art spatial data prefetchers are effective at reducing the number of data misses, we observe that there is still significant room for improvement. To select an access pattern for prefetching, existing spatial prefetchers associate observed access patterns with either a short event that has a high probability of recurrence or a long event that has a low probability of recurrence. Consequently, these prefetchers either offer low accuracy or lose significant prediction opportunities. We identify that associating observed spatial patterns with just a single event significantly limits the effectiveness of spatial data prefetchers. In this paper, we make the case for associating observed spatial patterns with both short and long events, achieving high accuracy without losing prediction opportunities. We propose the Bingo spatial data prefetcher, in which both short and long events are used to select the best access pattern for prefetching. We propose a storage-efficient design for Bingo in which just one history table maintains the associations between access patterns and the long and short events. Through a detailed evaluation of a set of big-data applications, we show that Bingo improves system performance by 60% over a baseline with no data prefetcher and by 11% over the best-performing prior spatial data prefetcher.
μDPM: Dynamic Power Management for the Microsecond Era
C. Chou, L. Bhuyan, Daniel Wong
DOI: 10.1109/HPCA.2019.00032
The complex, distributed nature of data centers has spawned the adoption of distributed, multi-tiered software architectures consisting of many interconnected microservices. These microservices exhibit extremely short request service times, often less than 250 µs. We show that these "killer microsecond" service times can cause state-of-the-art dynamic power management techniques to break down, due to short idle period lengths and low-power-state transition overheads. In this paper, we propose µDPM, a dynamic power management scheme for the microsecond era that coordinates request delaying, per-core sleep states, and voltage/frequency scaling. The idea is to postpone a CPU's wake-up as long as possible and then adjust the frequency so that requests' tail latency constraints are satisfied just in time. µDPM reduces processor energy consumption by up to 32% and consistently outperforms state-of-the-art techniques by 2x.
{"title":"Industry Session Program Committee","authors":"","doi":"10.1109/hpca.2019.00-46","DOIUrl":"https://doi.org/10.1109/hpca.2019.00-46","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123704943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power Aware Heterogeneous Node Assembly
Bilge Acun, A. Buyuktosunoglu, Eun Kyung Lee, Yoonho Park
DOI: 10.1109/HPCA.2019.00068
To meet ever-increasing computational requirements, supercomputers and data centers are beginning to utilize fat compute nodes with multiple hardware components such as manycore CPUs and accelerators. These components exhibit intrinsic power variations, even among same-model components from the same manufacturer. In this paper, we argue that node assembly techniques that account for these intrinsic power variations can achieve better power efficiency without any performance trade-off in large-scale supercomputing facilities and data centers. We propose three node assembly techniques: (1) Sorted Assembly, (2) Balanced Power Assembly, and (3) Application-Aware Assembly. In Sorted Assembly, components are sorted into groups according to their power efficiency, and components from the same group are assembled into a node. In Balanced Power Assembly, components are assembled to minimize node-to-node power variations. In Application-Aware Assembly, the components most heavily used by the application are selected for the highest power efficiency. We evaluate the effectiveness and cost savings of the three techniques against standard random assembly under different node counts and variability scenarios.