A DSL for Performance Orchestration
Thiago Teixeira, D. Padua, W. Gropp. DOI: 10.1109/PACT.2017.50
The complexity and diversity of today's computer architectures demand more attention from software developers in order to harness all the available computing power. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a high fraction of its nominal peak speed. This raises challenges of performance portability and code maintainability: in particular, how to manage different optimized versions of the same code tailored to different architectures, and how to keep them up to date as new algorithmic features are added. The increasing complexity of architectures and the expanding optimization space tend to make compilers deliver unsatisfactory performance, and the gap between hand-tuned and compiler-generated code has grown dramatically; even advanced optimization flags are not enough to narrow it. On the other hand, optimizing applications manually is very time-consuming, and the developer must understand and interact with many different hardware features of each architecture. Successful research has assisted the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures; nonetheless, adoption of these works has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed ICE, a framework that decouples the performance-expert role from the application-expert role (separation of concerns). It allows architecture-specific optimizations while keeping the code maintainable over the long term. The framework orchestrates the application of multiple optimization tools to the application's baseline version, which is assumed to contain no architecture- or compiler-specific optimizations, and performs an empirical search to find the best sequence of optimizations and their parameters. The optimizations and the empirical search are directed by a domain-specific language (DSL) kept in an external file. An application's code is often dramatically altered when multiple optimization cases are added for each target architecture; the DSL instead lets the performance expert apply optimizations without disarranging the original code. The DSL has constructs that expose the options of each optimization and generate a search space that can be traversed by different search tools; for instance, conditional statements can specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported.
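The abstract does not reproduce the DSL itself. As a rough illustration of the empirical-search half of the workflow only (our construction, not ICE's actual interface or DSL), the sketch below times variants of a kernel across one tunable knob, an unroll factor, and reports the timings, standing in for the search a tool like ICE would run over transformation sequences and their parameters.

```cpp
// Minimal sketch of an empirical search over one optimization knob.
// Illustration only; not ICE's DSL or tool interface.
#include <chrono>
#include <cstdio>
#include <vector>

// Baseline kernel with a compile-time unroll factor standing in for one
// parameter a source-to-source transformation might expose.
template <int UNROLL>
double kernel(const std::vector<double>& a) {
    double s = 0.0;
    size_t i = 0;
    for (; i + UNROLL <= a.size(); i += UNROLL)
        for (int u = 0; u < UNROLL; ++u) s += a[i + u];
    for (; i < a.size(); ++i) s += a[i];
    return s;
}

template <int UNROLL>
double time_variant(const std::vector<double>& a) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = kernel<UNROLL>(a);  // volatile: keep the call alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<double> a(1 << 22, 1.0);
    // The "search space": one knob with four candidate values.
    std::printf("unroll=1: %fs\n", time_variant<1>(a));
    std::printf("unroll=2: %fs\n", time_variant<2>(a));
    std::printf("unroll=4: %fs\n", time_variant<4>(a));
    std::printf("unroll=8: %fs\n", time_variant<8>(a));
}
```

A real autotuner would traverse a much larger space of transformation sequences and persist the winner, which is the role the abstract assigns to the DSL file.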
POSTER: Improving Datacenter Efficiency Through Partitioning-Aware Scheduling
H. Kasture, Xu Ji, Nosayba El-Sayed, Nathan Beckmann, Xiaosong Ma, Daniel Sánchez. DOI: 10.1109/PACT.2017.43
Datacenter servers often colocate multiple applications to improve utilization and efficiency. However, colocated applications interfere in shared resources, e.g., the last-level cache (LLC) and DRAM bandwidth, causing performance inefficiencies. Prior work has proposed two disjoint approaches to address interference. First, techniques that partition shared resources like the LLC can provide isolation and trade performance among colocated applications within a single node. But partitioning techniques are limited by the fixed resource demands of the applications running on the node. Second, interference-aware schedulers try to find resource-compatible applications and schedule them across nodes to improve performance. But prior schedulers are hampered by the lack of partitioning hardware in conventional multicores, and are forced to take conservative colocation decisions, leaving significant performance on the table. We show that memory-system partitioning and scheduling are complementary, and performing them in a coordinated fashion yields significant benefits. We present Shepherd, a joint scheduler and resource partitioner that seeks to maximize cluster-wide throughput. Shepherd uses detailed application profiling data to partition the shared LLC and to estimate the impact of DRAM bandwidth contention among colocated applications. Shepherd's scheduler leverages this information to colocate applications with complementary resource requirements, improving resource utilization and cluster throughput. We evaluate Shepherd in simulation and on a real cluster with hardware support for cache partitioning. When managing mixes of server and scientific applications, Shepherd improves cluster throughput over an unpartitioned system by 38% on average.
{"title":"POSTER: Improving Datacenter Efficiency Through Partitioning-Aware Scheduling","authors":"H. Kasture, Xu Ji, Nosayba El-Sayed, Nathan Beckmann, Xiaosong Ma, Daniel Sánchez","doi":"10.1109/PACT.2017.43","DOIUrl":"https://doi.org/10.1109/PACT.2017.43","url":null,"abstract":"Datacenter servers often colocate multiple applications to improve utilization and efficiency. However, colocated applications interfere in shared resources, e.g., the last-level cache (LLC) and DRAM bandwidth, causing performance inefficiencies. Prior work has proposed two disjoint approaches to address interference. First, techniques that partition shared resources like the LLC can provide isolation and trade performance among colocated applications within a single node. But partitioning techniques are limited by the fixed resource demands of the applications running on the node. Second, interference-aware schedulers try to find resource-compatible applications and schedule them across nodes to improve performance. But prior schedulers are hampered by the lack of partitioning hardware in conventional multicores, and are forced to take conservative colocation decisions, leaving significant performance on the table. We show that memory-system partitioning and scheduling are complementary, and performing them in a coordinated fashion yields significant benefits. We present Shepherd, a joint scheduler and resource partitioner that seeks to maximize cluster-wide throughput. Shepherd uses detailed application profiling data to partition the shared LLC and to estimate the impact of DRAM bandwidth contention among colocated applications. Shepherd's scheduler leverages this information to colocate applications with complementary resource requirements, improving resource utilization and cluster throughput. We evaluate Shepherd in simulation and on a real cluster with hardware support for cache partitioning. When managing mixes of server and scientific applications, Shepherd improves cluster throughput over an unpartitioned system by 38% on average.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115542725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Improvement via Always-Abort HTM
Joseph Izraelevitz, Lingxiang Xiang, M. Scott. DOI: 10.1109/PACT.2017.16
Several research groups have noted that hardware transactional memory (HTM), even in the case of aborts, can have the side effect of warming up the branch predictor and caches, thereby accelerating subsequent execution. We propose to employ this side effect deliberately, in cases where execution must wait for action in another thread. In doing so, we allow "warm-up" transactions to observe inconsistent state. We must therefore ensure that they never accidentally commit. To that end, we propose that the hardware allow the program to specify, at the start of a transaction, that it should in all cases abort, even if it (accidentally) executes a commit instruction. We discuss several scenarios in which always-abort HTM (AAHTM) can be useful, and present lock and barrier implementations that employ it. We demonstrate the value of these implementations on several real-world applications, obtaining performance improvements of up to 2.5x with almost no programmer effort.
Multilayer Compute Resource Management with Robust Control Theory
Raghavendra Pradyumna Pothukuchi, Sweta Yamini Pothukuchi, P. Voulgaris, J. Torrellas. DOI: 10.1109/PACT.2017.54
Multicores increasingly execute in constrained environments, and are being equipped with controllers for resource management. However, modern multicore systems are structured in multiple complex layers, such as the hardware, OS, and networking layers, each with its own resources. Managing such a system scalably and portably requires that we have a controller in each layer, and that the different controllers coordinate their operation. We present a novel methodology to build coordinated multilevel formal controllers in computing. We consider Robust Control Theory, which focuses on decision making in uncertain environments, and pick the popular Structured Singular Value (SSV) controller. This is the first work to utilize Robust Control Theory for compute resource management. We show the effectiveness of multilevel SSV controllers on a real multicore system.
{"title":"Multilayer Compute Resource Management with Robust Control Theory","authors":"Raghavendra Pradyumna Pothukuchi, Sweta Yamini Pothukuchi, P. Voulgaris, J. Torrellas","doi":"10.1109/PACT.2017.54","DOIUrl":"https://doi.org/10.1109/PACT.2017.54","url":null,"abstract":"Multicores increasingly execute in constrained environments, and are being equipped with controllers for resource management. However, modern multicore systems are structured in multiple complex layers, such as the hardware, OS, and networking layers, each with its own resources. Managing such a system scalably and portably requires that we have a controller in each layer, and that the different controllers coordinate their operation. We present a novel methodology to build coordinated multilevel formal controllers in computing. We consider Robust Control Theory, which focuses on decision making in uncertain environments, and pick the popular Structured Singular Value (SSV) controller. This is the first work to utilize Robust Control Theory for compute resource management. We show the effectiveness of multilevel SSV controllers on a real multicore system.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129015784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing
Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang. DOI: 10.1109/PACT.2017.34
Recent studies showed that DRAM restore time degrades as technology scales, which imposes large performance and energy overheads. This problem, prolonged restore time (PRT), has been identified by the DRAM industry as one of three major scaling challenges. This paper proposes DrMP, a novel fine-grained precision-aware DRAM restore scheduling approach, to mitigate PRT. The approach exploits process variations (PVs) within and across DRAM rows to save data with mixed precision. The paper describes three variants of the approach: DrMP-A, DrMP-P, and DrMP-U. DrMP-A supports approximate computing by mapping important data bits to fast row segments to reduce restore time for improved performance at a low application error rate. DrMP-P pairs memory rows together to reduce the average restore time for precise computing. DrMP-U combines DrMP-A and DrMP-P to better trade off performance, energy consumption, and computation precision. Our experimental results show that, on average, DrMP achieves a 20% performance improvement and a 15% energy reduction over a precision-oblivious baseline. Further, DrMP achieves an error rate of less than 1% at the application level for a suite of benchmarks, including applications that exhibit unacceptable error rates under simple approximation that does not differentiate the importance of different bits.
{"title":"DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing","authors":"Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang","doi":"10.1109/PACT.2017.34","DOIUrl":"https://doi.org/10.1109/PACT.2017.34","url":null,"abstract":"Recent studies showed that DRAM restore time degrades as technology scales, which imposes large performance and energy overheads. This problem, prolonged restore time (PRT), has been identified by the DRAM industry as one of three major scaling challenges.This paper proposes DrMP, a novel fine-grained precision-aware DRAM restore scheduling approach, to mitigate PRT. The approach exploits process variations (PVs) within and across DRAM rows to save data with mixed precision. The paper describes three variants of the approach: DrMP-A, DrMP-P, and DrMP-U. DrMP-A supports approximate computing by mapping important data bits to fast row segments to reduce restore time for improved performance at a low application error rate. DrMP-P pairs memory rows together to reduce the average restore time for precise computing. DrMP-U combines DrMP-A and DrMP-P to better trade performance, energy consumption, and computation precision. Our experimental results show that, on average, DrMP achieves 20% performance improvement and 15% energy reduction over a precision-oblivious baseline. Further, DrMP achieves an error rate less than 1% at the application level for a suite of benchmarks, including applications that exhibit unacceptable error rates under simple approximation that does not differentiate the importance of different bits.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115290078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: BigBus: A Scalable Optical Interconnect
E. Peter, Janibul Bashir, S. Sarangi. DOI: 10.1109/PACT.2017.18
This paper presents BigBus, a novel on-chip photonic network for a 1024-node system. The crux of the idea is to segment the entire system into smaller clusters of nodes and adopt a hybrid strategy for each segment that includes conventional laser modulation as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token allows a node to send a message to any other node in its cluster. We allow optical stations to arbitrate for tokens, and at the global level we predict the number of token equivalents of power that the off-chip laser needs to generate.
Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory
Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin. DOI: 10.1109/PACT.2017.58
Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited, as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this paper, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we log only enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, reducing execution time overheads and improving NVMM write endurance at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model built on gem5 that supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% with logging and 207% with checkpointing. Furthermore, recompute adds only 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.
{"title":"Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory","authors":"Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin","doi":"10.1109/PACT.2017.58","DOIUrl":"https://doi.org/10.1109/PACT.2017.58","url":null,"abstract":"Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance.In this paper, we propose a novel recompute-based failure safety approach, and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123147895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nexus: A New Approach to Replication in Distributed Shared Caches
Po-An Tsai, Nathan Beckmann, Daniel Sánchez. DOI: 10.1109/PACT.2017.42
Last-level caches are increasingly distributed, consisting of many small banks. To perform well, most accesses must be served by banks near requesting cores. An attractive approach is to replicate read-only data so that a copy is available nearby. But replication introduces a delicate tradeoff between capacity and latency: too little replication forces cores to access faraway banks, while too much replication wastes cache space and causes excessive off-chip misses. Workloads vary widely in their desired amount of replication, demanding an adaptive approach. Prior adaptive replication techniques only replicate data in each tile's local bank, so they focus on selecting which data to replicate. Unfortunately, data that is not replicated still incurs a full network traversal, limiting the performance of these techniques. We argue that a better strategy is to let cores share replicas and that adaptive schemes should focus on selecting how much to replicate (i.e., how many replicas to have across the chip). This idea fully exploits the latency-capacity tradeoff, achieving qualitatively higher performance than prior adaptive replication techniques. It can be applied to many prior cache organizations, and we demonstrate it on two: Nexus-R extends R-NUCA, and Nexus-J extends Jigsaw. We evaluate Nexus on HPC and server workloads running on a 144-core chip, where it outperforms prior adaptive replication schemes and improves performance by up to 90% and by 23% on average across all workloads sensitive to replication.
POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling
Younghyun Cho, Camilo A. Celis Guzman, Bernhard Egger. DOI: 10.1109/PACT.2017.27
This work proposes a co-scheduling technique for co-located parallel applications on Non-Uniform Memory Access (NUMA) multi-socket multi-core platforms. The technique allocates core resources for running parallel applications such that both the utilization of the memory controllers and the CPU cores are maximized. Utilization is predicted using an online performance prediction model based on queuing systems. At runtime, the core allocation is periodically re-evaluated and cores are re-assigned to executing applications. Experimental results show that the proposed co-scheduling technique is able to execute co-located parallel applications in significantly less total execution time than the default Linux scheduler and a conventional scalability-based scheduler.
{"title":"POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling","authors":"Younghyun Cho, Camilo A. Celis Guzman, Bernhard Egger","doi":"10.1109/PACT.2017.27","DOIUrl":"https://doi.org/10.1109/PACT.2017.27","url":null,"abstract":"This work proposes a co-scheduling technique for co-located parallel applications on Non-Uniform Memory Access (NUMA) multi-socket multi-core platforms. The technique allocates core resources for running parallel applications such that both the utilization of the memory controllers and the CPU cores are maximized. Utilization is predicted using an online performance prediction model based on queuing systems. At runtime, the core allocation is periodically re-evaluated and cores are re-assigned to executing applications. Experimental results show that the proposed co-scheduling technique is able to execute co-located parallel applications in significantly less total execution time than the default Linux scheduler and a conventional scalability-based scheduler.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116770126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: The Liberation Day of Nondeterministic Programs
E. A. Deiana, Vincent St-Amour, P. Dinda, N. Hardavellas, Simone Campanoni. DOI: 10.1109/PACT.2017.26
The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.
{"title":"POSTER: The Liberation Day of Nondeterministic Programs","authors":"E. A. Deiana, Vincent St-Amour, P. Dinda, N. Hardavellas, Simone Campanoni","doi":"10.1109/PACT.2017.26","DOIUrl":"https://doi.org/10.1109/PACT.2017.26","url":null,"abstract":"The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call \"state dependences\" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of \"state dependences\". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124014259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}