Raghavendra Pradyumna Pothukuchi, Amin Ansari, Bhargava Gopireddy, J. Torrellas
Networks-on-Chip (NoCs) in chip multiprocessors are prone to within-die process variation as they span the whole chip. To tolerate variation, their voltages (Vdd) carry over-provisioned guardbands. As a result, prior work has proposed to save energy by operating at reduced Vdd while occasionally suffering and fixing errors. Unfortunately, these proposals use heuristic controller designs that provide no guarantees on error bounds. In this work, we develop a scheme that dynamically minimizes the Vdd of groups of routers in a variation-prone NoC using formal control-theoretic methods. The scheme, called Sthira, saves substantial energy while guaranteeing the stability and convergence of error rates. We also enhance the scheme with a low-cost secondary network that retransmits erroneous packets for higher energy efficiency. The enhanced scheme is called Sthira+. We evaluate Sthira and Sthira+ with simulations of NoCs with 64-100 routers. In an NoC with 8 routers per Vdd domain, our schemes reduce the average energy consumption of the NoC by 27%; in a futuristic NoC with one router per Vdd domain, Sthira+ and Sthira reduce the average energy consumption by 36% and 32%, respectively. The performance impact is negligible. These are significant savings over the state-of-the-art. We conclude that formal control is essential, and that the cheaper Sthira is more cost-effective than Sthira+.
{"title":"Sthira: A Formal Approach to Minimize Voltage Guardbands under Variation in Networks-on-Chip for Energy Efficiency","authors":"Raghavendra Pradyumna Pothukuchi, Amin Ansari, Bhargava Gopireddy, J. Torrellas","doi":"10.1109/PACT.2017.23","DOIUrl":"https://doi.org/10.1109/PACT.2017.23","url":null,"abstract":"Networks-on-Chip (NoCs) in chip multiprocessors are prone to within-die process variation as they span the whole chip. To tolerate variation, their voltages (Vdd) carry over-provisioned guardbands. As a result, prior work has proposed to save energy by operating at reduced Vdd while occasionally suffering and fixing errors. Unfortunately, these proposals use heuristic controller designs that provide no error bounds guarantees.In this work, we develop a scheme that dynamically minimizes the Vdd of groups of routers in a variation-prone NoC using formal control-theoretic methods. The scheme, called Sthira, saves substantial energy while guaranteeing the stability and convergence of error rates. We also enhance the scheme with a low-cost secondary network that retransmits erroneous packets for higher energy efficiency. The enhanced scheme is called Sthira+. We evaluate Sthira and Sthira+ with simulations of NoCs with 64-100 routers. In an NoC with 8 routers per Vdd domain, our schemes reduce the average energy consumptionof the NoC by 27%; in a futuristic NoC with one router per Vdd domain, Sthira+ and Sthira reduce the average energy consumption by 36% and 32%, respectively. The performance impact is negligible. These are significant savings over the state-of-the-art. We conclude that formal control is essential, and that the cheaper Sthira is more cost-effective than Sthira+.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132931050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amro Awad, Arkaprava Basu, S. Blagodurov, Yan Solihin, G. Loh
Updates to a process's page table entry (PTE) render any existing copies of that PTE in any of a system's TLBs stale. To prevent a process from making illegal memory accesses using stale TLB entries, the operating system (OS) performs a costly TLB shootdown operation. Rather than explicitly issuing shootdowns, we propose a coordinated TLB and page table management mechanism where an expiration time is associated with each TLB entry. An expired TLB entry is treated as invalid. For each PTE, the OS then tracks the latest expiration time of any TLB entry potentially caching that PTE. No shootdown is issued if the OS modifies a PTE when its corresponding latest expiration time has already passed. In this paper, we explain the hardware and OS support required to support Self-invalidating TLB entries (SITE). As an emerging use case that needs fast TLB shootdowns, we consider memory systems consisting of different types of memory (e.g., faster DRAM and slower non-volatile memory), where aggressive migrations are desirable to keep frequently accessed pages in faster memory, but pages cannot migrate too often because each migration requires a PTE update and corresponding TLB shootdown. We demonstrate that such heterogeneous memory systems augmented with SITE can achieve an average performance improvement of 45.5% over a similar system with traditional TLB shootdowns by avoiding more than 65% of the shootdowns.
{"title":"Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries","authors":"Amro Awad, Arkaprava Basu, S. Blagodurov, Yan Solihin, G. Loh","doi":"10.1109/PACT.2017.38","DOIUrl":"https://doi.org/10.1109/PACT.2017.38","url":null,"abstract":"Updates to a process's page table entry (PTE) renders any existing copies of that PTE in any of a system's TLBs stale. To prevent a process from making illegal memory accesses using stale TLB entries, the operating system (OS) performs a costly TLB shootdown operation. Rather than explicitly issuing shootdowns, we propose a coordinated TLB and page table management mechanism where an expirationtime is associated with each TLB entry. An expired TLB entry is treated as invalid. For each PTE, the OS then tracks the latest expiration time of any TLB entry potentially caching that PTE. No shootdown is issued if the OS modifies a PTE when its corresponding latest expiration time has already passed.In this paper, we explain the hardware and OS support required to support Self-invalidating TLB entries (SITE). As an emerging use case that needs fast TLB shootdowns, we consider memory systems consisting of different types of memory (e.g., faster DRAM and slower non-volatile memory) where aggressive migrations are desirable to keep frequently accessed pages in faster memory, but pages cannot migratetoo often because each migration requires a PTE update and corresponding TLB shootdown. We demonstrate that such heterogeneous memory systems augmented with SITE can allow an average performance improvement of 45.5% over a similar system with traditional TLB shootdowns by avoiding more than 65% of the shootdowns.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124064345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In spite of the multicore revolution, high single-thread performance still plays an important role in ensuring a decent overall gain. Look-ahead is a proven strategy for uncovering implicit parallelism; however, a conventional out-of-order core quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an independent look-ahead thread running on a separate context, guided by a program slice known as the skeleton. We observe that fixed heuristics to generate skeletons are often suboptimal. As a consequence, the look-ahead agent cannot target sufficient bottlenecks to reap all the benefits it should. In this paper, we present DRUT, a holistic hardware-software solution that achieves good single-thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branch-based code modules (which we call Do-It-Yourself, or DIY) that enable a faster look-ahead thread without compromising the quality of the look-ahead. Second, we extend our tuning mechanism to any arbitrary code region and use a profile-driven technique to tune the skeleton for the whole program. Assisted by these techniques, the look-ahead thread improves the performance of a baseline decoupled look-ahead by up to 1.93× with a geometric mean of 1.15×. Our techniques, combined with the weak dependence removal technique, improve the performance of a baseline look-ahead by up to 2.12× with a geometric mean of 1.20×. This is an impressive performance gain of 1.61× over the single-thread baseline, which is much better than conventional Turbo Boost with a comparable energy budget.
{"title":"DRUT: An Efficient Turbo Boost Solution via Load Balancing in Decoupled Look-Ahead Architecture","authors":"Raj Parihar, Michael C. Huang","doi":"10.1109/PACT.2017.35","DOIUrl":"https://doi.org/10.1109/PACT.2017.35","url":null,"abstract":"In spite of the multicore revolution, high single thread performance still plays an important role in ensuring a decentoverall gain. Look-ahead is a proven strategy in uncoveringimplicit parallelism; however, a conventional out-of-ordercore quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an in-dependent look-ahead thread running on a separate contextguided by a program slice known as the skeleton. We observethat fixed heuristics to generate skeletons are often suboptimal. As a consequence, look-ahead agent is not able to targetsufficient bottlenecks to reap all the benefits it should.In this paper, we present DRUT, a holistic hardware-software solution, which achieves good single thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branchbased code modules (we call them Do-It-Yourself or DIY)that enable a faster look-ahead thread without compromisingthe quality of the look-ahead. Second, we extend our tuningmechanism to any arbitrary code region and use a profile-driven technique to tune the skeleton for the whole program.Assisted by the aforementioned techniques, look-aheadthread improves the performance of a baseline decoupledlook-ahead by up to 1.93× with a geometric mean of 1.15×. Our techniques, combined with the weak dependence removal technique, improve the performance of a baselinelook-ahead by up to 2.12× with a geometric mean of 1.20×. This is an impressive performance gain of 1.61× over thesingle-thread baseline, which is much better compared toconventional Turbo Boost with a comparable energy budget.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124696415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Taewook Oh, S. Beard, Nick P. Johnson, S. Popovych, David I. August
Computational scientists are typically not expert programmers, and thus work in easy-to-use dynamic languages. However, they have very high performance requirements, due to their large datasets and experimental setups. Thus, the performance required for computational science must be extracted from dynamic languages in a manner that is transparent to the programmer. Current approaches to optimize and parallelize dynamic languages, such as just-in-time compilation and highly optimized interpreters, require a huge amount of implementation effort and are typically only effective for a single language. However, scientists in different fields use different languages, depending upon their needs. This paper presents techniques to enable automatic extraction of parallelism within scripts that are universally applicable across multiple different dynamic scripting languages. The key insight is that combining a script with its interpreter, through program specialization techniques, will embed any parallelism within the script into the combined program, which can then be extracted via automatic parallelization techniques. Additionally, this paper presents several enhancements to existing speculative automatic parallelization techniques to handle the dependence patterns created by the specialization process. A prototype of the proposed technique, called Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script interpreters with 6 input linear algebra kernel scripts each. The resulting geomean speedup of 5.10× on a 24-core machine shows the potential of the generalized approach in automatic extraction of parallelism in dynamic scripting languages.
{"title":"A Generalized Framework for Automatic Scripting Language Parallelization","authors":"Taewook Oh, S. Beard, Nick P. Johnson, S. Popovych, David I. August","doi":"10.1109/PACT.2017.28","DOIUrl":"https://doi.org/10.1109/PACT.2017.28","url":null,"abstract":"Computational scientists are typically not expert programmers, and thus work in easy to use dynamic languages. However, they have very high performance requirements, due to their large datasets and experimental setups. Thus, the performance required for computational science must be extracted from dynamic languages in a manner that is transparent to the programmer. Current approaches to optimize and parallelize dynamic languages, such as just-in-time compilation and highly optimized interpreters, require a huge amount of implementation effort and are typically only effective for a single language. However, scientists in different fields use different languages, depending upon their needs.This paper presents techniques to enable automatic extraction of parallelism within scripts that are universally applicable across multiple different dynamic scripting languages. The key insight is that combining a script with its interpreter, through program specialization techniques, will embed any parallelism within the script into the combined program that can then be extracted via automatic parallelization techniques. Additionally, this paper presents several enhancements to existing speculative automatic parallelization techniques to handle the dependence patterns created by the specialization process. A prototype of the proposed technique, called Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script interpreters with 6 input linear algebra kernel scripts each. The resulting geomean speedup of 5.10× on a 24-core machine shows the potential of the generalized approach in automatic extraction of parallelism in dynamic scripting languages.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116332147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das
Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6] and bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPGPUs perform poorly on automata processing due to irregular memory accesses and can process only a few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads.

Micron Automata Processor: The Micron AP repurposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined, and (2) state-transition, where each of the matched states activates its corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing; they are faster and are integrated on processor dies.

Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that the AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing, and only has a packing density comparable to caches.

Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz) would lead to an operating frequency comparable to the DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing the operating frequency further is possible only by architecting (1) an in-situ computation model that is cognizant of the internal geometry of LLC slices, and (2) accelerated state-match (array read) and state-transition (switch + wire propagation delay) phases of symbol processing.

Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 stored states can match every cycle, leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches.

Accelerating state-transition: Accelerating state-transition at low-area cost requir
{"title":"Cache Automaton: Repurposing Caches for Automata Processing","authors":"Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das","doi":"10.1109/PACT.2017.51","DOIUrl":"https://doi.org/10.1109/PACT.2017.51","url":null,"abstract":"Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6], bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPG-PUs perform poorly on automata processing due to ir-regular memory accesses and can process only few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads. Micron Automata Processor: The Micron AP re-purposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined and (2) state-transition, where each of the matched states activates their corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing that are faster and integrated on processor dies. Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing and only has a packing density comparable to caches. Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz), would lead to an operating frequency comparable to DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing operating frequency further can be made possible only by architecting an (1) in-situ computation model which is cognizant of internal geometry of LLC slices, and (2) accelerating state-match (array read) and state-transition (switch+wire propagation delay) phases of symbol processing. Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 states stored can match every cycle leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches. 
Accelerating state-transition: Accelerating state-transition at low-area cost requir","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127500744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
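The two-phase symbol processing described above (state-match, then state-transition) maps naturally onto bit-vector operations. The sketch below simulates a homogeneous NFA that way in ordinary C++; it mirrors the computational model only, not the SRAM array organization or sense-amp techniques, and the tiny automaton and its always-enabled start state are assumptions made for the example.

```cpp
// Software sketch of the homogeneous-NFA model used by automata processors:
// phase 1 (state-match) selects the states whose label matches the input
// symbol, phase 2 (state-transition) ORs together their next-state sets.
#include <array>
#include <bitset>
#include <iostream>
#include <string>

constexpr int kStates = 4;  // tiny NFA recognizing occurrences of "ab"

int main() {
    // match_vector[c]: bit s is set if state s is labeled with symbol c.
    std::array<std::bitset<kStates>, 256> match_vector{};
    // next_states[s]: states activated when state s matches.
    std::array<std::bitset<kStates>, kStates> next_states{};
    std::bitset<kStates> start, accept, active;

    // State 0: start state (enabled on every symbol), matches 'a', enables state 1.
    // State 1: matches 'b', accepting.
    match_vector['a'].set(0);
    match_vector['b'].set(1);
    next_states[0].set(1);
    start.set(0);
    accept.set(1);

    std::string input = "xabab";
    for (char c : input) {
        // Phase 1: state-match -- which enabled states are labeled with c?
        std::bitset<kStates> matched =
            (active | start) & match_vector[(unsigned char)c];

        // Phase 2: state-transition -- union of the matched states' next sets.
        std::bitset<kStates> next;
        for (int s = 0; s < kStates; ++s)
            if (matched.test(s)) next |= next_states[s];

        if ((matched & accept).any())
            std::cout << "match ending at symbol '" << c << "'\n";
        active = next;
    }
}
```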
The memory model for RISC-V, a newly developed open-source ISA, has not been finalized yet and thus offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except the overtaking of loads by a store. We show that this restriction has little impact on performance and that it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. We give the operational definitions of both models using Instantaneous Instruction Execution (I2E), which has been used in the definitions of SC and TSO. We also show how both models can be implemented using conventional cache-coherent memory systems and out-of-order processors, and how they encompass the behaviors of most known optimizations.
{"title":"Weak Memory Models: Balancing Definitional Simplicity and Implementation Flexibility","authors":"Sizhuo Zhang, M. Vijayaraghavan, Arvind","doi":"10.1109/PACT.2017.29","DOIUrl":"https://doi.org/10.1109/PACT.2017.29","url":null,"abstract":"The memory model for RISC-V, a newly developed open source ISA, has not been finalized yet and thus, offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. We give the operational definitions of both models using Instantaneous Instruction Execution (I2E), which has been used in the definitions of SC and TSO. We also show how both models can be implemented using conventional cache-coherent memory systems and out-of-order processors, and encompasses the behaviors of most known optimizations.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128317347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Early design-space evaluation of computer systems is usually performed using performance models such as detailed simulators, RTL-based models, etc. Unfortunately, it is very challenging (often impossible) to run many emerging applications on detailed performance models owing to their complex application software stacks, significantly long run times, system dependencies, and the limited speed/potential of early performance models. To overcome these challenges in benchmarking complex, long-running database applications, we propose a fast and efficient proxy generation methodology, PerfProx, that can generate miniature proxy benchmarks which are representative of the performance of real-world database applications and yet converge to results quickly and do not need any complex software-stack support. Past research on proxy generation utilizes detailed microarchitecture-independent metrics derived from detailed functional simulators, which are often difficult to generate for many emerging applications. PerfProx enables fast and efficient proxy generation using performance metrics derived primarily from hardware performance counters. We evaluate the proposed proxy generation approach on three modern, real-world SQL and NoSQL databases (Cassandra, MongoDB, and MySQL) running both the data-serving and data-analytics classes of applications on different hardware platforms and cache/TLB configurations. The proxy benchmarks mimic the performance (IPC) of the original database applications with ∼94.2% (avg) accuracy. We further demonstrate that the proxies mimic original application performance across several other key metrics, while significantly reducing the instruction counts.
{"title":"Proxy Benchmarks for Emerging Big-Data Workloads","authors":"Reena Panda, L. John","doi":"10.1109/PACT.2017.44","DOIUrl":"https://doi.org/10.1109/PACT.2017.44","url":null,"abstract":"Early design-space evaluation of computer-systems is usually performed using performance models such as detailed simulators, RTL-based models etc. Unfortunately, it is very challenging (often impossible) to run many emerging applications on detailed performance models owing to their complex application software-stacks, significantly long run times, system dependencies and the limited speed/potential of early performance models. To overcome these challenges in benchmarking complex, long-running database applications, we propose a fast and efficient proxy generation methodology, PerfProx that can generate miniature proxy benchmarks, which are representative of the performance of real-world database applications and yet, converge to results quickly and do not need any complex software-stack support. Past research on proxy generation utilizes detailed micro-architecture independent metrics derived from detailed functional simulators, which are often difficult to generate for many emerging applications. PerfProx enables fast and efficient proxy generation using performance metrics derived primarily from hardware performance counters. We evaluate the proposed proxy generation approach on three modern, real-world SQL and NoSQL databases, Cassandra, MongoDB and MySQL running both the data-serving and data-analytics class of applications on different hardware platforms and cache/TLB configurations. The proxy benchmarks mimic the performance (IPC) of the original database applications with ∼94.2% (avg) accuracy. We further demonstrate that the proxies mimic original application performance across several other key metrics, while significantly reducing the instruction counts.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent data structures, such as skiplists, which are widely used in general-purpose computations. Our design utilizes array-based nodes which are accessed and updated by warp-cooperative functions, thus taking advantage of the fact that GPUs are most efficient when memory accesses are coalesced and execution divergence is minimized. The proposed design has been implemented, and measurements demonstrate improved performance of up to 11.6x over existing GPU skiplist designs.
{"title":"A GPU-Friendly Skiplist Algorithm","authors":"Nurit Moscovici, Nachshon Cohen, E. Petrank","doi":"10.1145/3018743.3019032","DOIUrl":"https://doi.org/10.1145/3018743.3019032","url":null,"abstract":"We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent data structures, such as skiplists, which are widely used in general purpose computations. Our design utilizes array-based nodes which are accessed and updated by warp-cooperative functions, thus taking advantage of the fact that GPUs are most efficient when memory accesses are coalesced and execution divergence is minimized. The proposed design has been implemented, and measurements demonstrate improved performance of up to 11.6x over skiplist designs for the GPU existing today.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.
{"title":"Near-Memory Address Translation","authors":"Javier Picorel, Djordje Jevdjic, B. Falsafi","doi":"10.1109/PACT.2017.56","DOIUrl":"https://doi.org/10.1109/PACT.2017.56","url":null,"abstract":"Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common.In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117166223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, L. Ceze, M. Oskin
Similarity search is key to important applications such as content-based search, deduplication, natural language processing, computer vision, databases, and graphics. At its core, similarity search manifests as k-nearest neighbors (kNN), which consists of parallel distance calculations and a top-k sort. While kNN is poorly supported by today's architectures, it is ideal for near-data processing because of its high memory bandwidth requirements. This work proposes a near-data processing accelerator for similarity search: the similarity search associative memory (SSAM).
{"title":"POSTER: Application-Driven Near-Data Processing for Similarity Search","authors":"Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, L. Ceze, M. Oskin","doi":"10.1109/PACT.2017.25","DOIUrl":"https://doi.org/10.1109/PACT.2017.25","url":null,"abstract":"Similarity search is a key to important applications such as content-based search, deduplication, natural language processing, computer vision, databases, and graphics. At its core, similarity search manifests as k-nearest neighbors (kNN) which consists of parallel distance calculations and a top-k sort. While kNN is poorly supported by today's architectures, it is ideal for near-data processing because of its high memory bandwidth requirements. This work proposes a near-data processing accelerator for similarity search: the similarity search associative memory (SSAM).","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125983075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}