Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095806
C. Ho, Venkatraman Govindaraju, Tony Nowatzki, R. Nagaraju, Zachary Marzec, Preeti Agarwal, Chris Frericks, Ryan Cofell, K. Sankaralingam
Specialization and accelerators are being proposed as an effective way to address the slowdown of Dennard scaling. DySER is one such accelerator; it dynamically synthesizes large compound functional units to match program regions, using a co-designed compiler and microarchitecture. We have completed a full prototype implementation of DySER integrated into the OpenSPARC processor (called SPARC-DySER), a co-designed compiler in LLVM, and a detailed performance evaluation on an FPGA system that runs an Ubuntu Linux distribution and full applications. Through the prototype, this paper evaluates the fundamental principles of DySER acceleration. Our two key findings are: i) the DySER execution model and microarchitecture provide energy-efficient speedups, and integrating DySER does not introduce overheads; overall, DySER improves OpenSPARC's performance by 6X while consuming only 200 mW; ii) on the compiler side, the DySER compiler is effective at extracting computationally intensive regular and irregular code. For non-computationally intense irregular code, two control-flow shapes curtail the compiler's effectiveness, and we identify potential adaptive mechanisms. Finally, our experience of bringing up an end-to-end prototype of an ISA-exposed accelerator has made clear that two artifacts are greatly needed to perform this type of design more quickly and effectively: 1) open-source implementations of high-performance baseline processors, and 2) declarative tools for quickly specifying combinations of known compiler transforms.
Title: Performance evaluation of a DySER FPGA prototype system spanning the compiler, microarchitecture, and hardware implementation
Published in: 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095788
Hu-Qiu Liu, Jia-Ju Bai, Yuping Wang, Zhe Bian, Shimin Hu
Drivers use kernel extension functions to manage devices, and there are often many rules on how these functions should be used. Among these rules, the use of paired functions, meaning functions that must be called in pairs from two different functions, is extremely complex and important. However, such pairing rules are not well documented, and they are easily violated by programmers who unconsciously ignore or forget them. It is therefore useful to have a tool that automatically extracts paired functions from the kernel source and detects incorrect usages. In this paper we present PairMiner, which applies heuristic and statistical mechanisms, tailored to the structure of drivers' source code, to identify paired functions between related operations and then detect violations of the extracted pairing rules. In our experimental evaluation, PairMiner found 1023 paired functions in Linux 3.10.10. We evaluated PairMiner's utility by analyzing the source code of Linux 2.6.38 and 3.10.10: it located 265 paired-function violations in 2.6.38 that had been fixed by 3.10.10, and identified 1994 paired-function violations not yet fixed in 3.10.10. We have reported some of these violations as potential bugs to the developers by email; so far, 27 developers have replied, 20 bugs have been confirmed, and 2 violations have been confirmed as false positives.
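The statistical idea behind this kind of mining can be sketched as follows. This is a minimal illustration, not PairMiner's actual algorithm: it assumes pairing is inferred from support and confidence over co-occurrence counts across paired driver operations (e.g. a driver's probe/remove handlers), and the kernel function names in the example are purely illustrative.

```python
from collections import Counter

def mine_paired_functions(op_pairs, min_support=3, min_confidence=0.9):
    """Mine candidate paired functions from call lists of paired driver
    operations. op_pairs is a list of (forward_calls, backward_calls)
    tuples, e.g. the calls made by a driver's .probe and .remove handlers."""
    pair_count = Counter()   # (f, g) observed together across op pairs
    fwd_count = Counter()    # f observed in any forward operation
    for fwd_calls, bwd_calls in op_pairs:
        for f in set(fwd_calls):
            fwd_count[f] += 1
            for g in set(bwd_calls):
                pair_count[(f, g)] += 1
    mined = {}
    for (f, g), n in pair_count.items():
        # Keep (f, g) only if it is frequent and f rarely appears without g.
        if n >= min_support and n / fwd_count[f] >= min_confidence:
            mined.setdefault(f, set()).add(g)
    return mined

def find_violations(mined, fwd_calls, bwd_calls):
    """Report forward calls whose mined counterpart is missing."""
    return [(f, sorted(mined[f])) for f in fwd_calls
            if f in mined and not mined[f] & set(bwd_calls)]
```

With three drivers that all pair `request_irq` with `free_irq`, only that pair survives the support and confidence thresholds, and a remove handler lacking `free_irq` is flagged as a violation.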
Title: Pairminer: mining for paired functions in Kernel extensions
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095779
Jian Chen, R. Clapp
Efficient and scalable performance modeling is essential to high-performance cluster computing. Critical-path-based performance analysis is widely used because it provides valuable insights into the performance of parallel programs, but it is also expensive, inefficient, and inflexible due to its strong reliance on trace-driven simulation. This paper presents a performance modeling framework based on the novel concept of critical-path candidates: a group of paths that could potentially become the critical path. Using instruction and communication counts as metrics, the critical-path candidates capture the intrinsic computation and communication dependencies, and hence can be reused to explore multiple design options. Using real-world MPI workloads, we show that the proposed framework achieves a modeling accuracy within 10% of the measured runtime for up to 16K MPI ranks. The framework provides an efficient and scalable platform for performance analysis as well as load-imbalance analysis.
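The reuse property described above can be illustrated with a toy model (our own sketch, not the paper's formulation): each candidate path is summarized by its instruction and message counts, and the predicted runtime for a design point is the slowest candidate under that point's machine parameters, so the same summaries serve multiple design points without re-tracing.

```python
def predict_runtime(candidates, ips, msg_latency_s):
    """Predict runtime as the slowest critical-path candidate.

    candidates: (instruction_count, message_count) summaries, one per
    path that could become the critical path.
    ips: modeled sustained instruction rate (instructions/second).
    msg_latency_s: modeled per-message communication latency (seconds).
    """
    return max(n_instr / ips + n_msgs * msg_latency_s
               for n_instr, n_msgs in candidates)

# The candidate summaries are reused across design points; note how a
# faster network moves the critical path from the communication-heavy
# candidate to the computation-heavy one.
candidates = [(8e9, 1_000), (6e9, 600_000)]
baseline = predict_runtime(candidates, ips=2e9, msg_latency_s=5e-6)
faster_net = predict_runtime(candidates, ips=2e9, msg_latency_s=1e-6)
```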
Title: Critical-path candidates: scalable performance modeling for MPI workloads
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095816
Gokcen Kestor, R. Gioiosa, D. Chavarría-Miranda
Modeling the performance of non-deterministic parallel applications on future many-core systems requires novel simulation and emulation techniques and tools. We present Prometheus, a fast, accurate, and modular emulation framework for task-based applications. By raising the level of abstraction and focusing on runtime synchronization, Prometheus can accurately predict applications' performance on very large many-core systems. We validate our emulation framework against two real platforms (AMD Interlagos and Intel MIC) and report error rates generally below 4%. We then evaluate Prometheus' performance and scalability: our results show that Prometheus can emulate a task-based application on a system with 512K cores in 11.5 hours. We present two test cases showing how Prometheus can be used to study the performance and behavior of systems with some of the characteristics expected of exascale supercomputer nodes, such as active power management and processors with many cores but reduced cache per core.
Title: Prometheus: scalable and accurate emulation of task-based applications on many-core systems
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095786
Adam N. Jacobvitz, Andrew D. Hilton, Daniel J. Sorin
Although the definition of single-program benchmarks is relatively straightforward (a benchmark is a program plus a specific input), the definition of multi-program benchmarks is more complex. Each program may have a different runtime, and the programs may interact differently depending on how they align with each other. While prior work has focused on sampling multi-program benchmarks, little attention has been paid to defining the benchmarks in their entirety. In this work, we propose a four-tuple that formally defines multi-program benchmarks in a well-defined way. We then examine how four different classes of benchmarks, created by varying the elements of this tuple, align with real-world use cases. We evaluate the impact of these variations on real hardware and see drastic variation in results between different benchmarks constructed from the same programs. Notable differences include significant speedups versus slowdowns (e.g., +57% vs. -5%, or +26% vs. -18%) and large differences in magnitude even when the results are in the same direction (e.g., 67% versus 11%).
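The abstract does not spell out the four tuple elements, so the following sketch is hypothetical: it illustrates the kind of parameters (programs plus inputs, launch alignment, termination policy, restart behavior) that such a formal definition would have to pin down for a multi-program benchmark to be reproducible.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class MultiProgramBenchmark:
    """Hypothetical four-tuple for a multi-program benchmark; the field
    choices below are illustrative, not the paper's actual tuple."""
    workloads: Tuple[Tuple[str, str], ...]   # (program, input) pairs
    start_offsets_s: Tuple[float, ...]       # relative launch times
    termination: str                         # e.g. "first-exit" or "all-exit"
    restart_finished: bool                   # rerun programs that finish early?

bench = MultiProgramBenchmark(
    workloads=(("bzip2", "input.source"), ("mcf", "inp.in")),
    start_offsets_s=(0.0, 0.0),
    termination="first-exit",
    restart_finished=True,
)
```

Fixing every field removes the ambiguity the abstract describes: two experimenters running the same `workloads` with different `termination` policies are, in effect, running different benchmarks.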
Title: Multi-program benchmark definition
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095801
M. Andersch, J. Lucas, M. Alvarez-Mesa, B. Juurlink
Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. While it appears to be well understood how performance can be improved by increasing throughput, it is less clear how microarchitectural latencies affect the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavior, and performance impact of microarchitectural latency components in modern GPUs. This work attempts to fill that gap by analyzing both the idle (static) and loaded (dynamic) latency behavior of GPU microarchitectural components. Our results show that GPUs are not as effective at latency hiding as commonly thought; based on this, we argue that latency should be a GPU design consideration alongside throughput.
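One way to see why loaded latency can matter despite massive multithreading is a Little's-law estimate (our own illustration with made-up numbers, not measurements from the paper): the concurrency needed to cover a latency grows linearly with that latency.

```python
import math

def warps_needed(latency_cycles, issue_rate_per_cycle):
    """Little's-law estimate: concurrent ready warps needed so that
    issuing can continue while an operation of the given latency
    completes. Numbers fed to this are illustrative assumptions."""
    return math.ceil(latency_cycles * issue_rate_per_cycle)

# At one instruction per cycle per scheduler, a 20-cycle ALU dependency
# needs about 20 ready warps, while a 400-cycle loaded memory latency
# needs about 400 -- far more warps than a single scheduler typically
# has available, so loaded latency can still limit throughput.
```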
Title: On latency in GPU throughput microarchitectures
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095810
Xin Tong, Andreas Moshovos
This work presents QTrace, an open-source instrumentation extension API for QEMU [1] that can instrument unmodified applications and OS binaries on uni- and multi-processor systems. QTrace facilitates the development of custom, full-system instrumentation tools for the x86 guest architecture, enabling statistics collection and program-execution studies that include system-level code. This paper illustrates QTrace's API through instrumentation examples, discusses how QEMU was modified to implement QTrace, explains the validation testing procedures, demonstrates QTrace's usefulness relative to a user-level binary instrumentation tool on workloads that spend significant time in the kernel, and shows that QTrace does not impose a significant performance penalty. Experiments show that, for an instruction-count plug-in, QTrace is 12.2X slower than PIN [2], a user-level-only instrumentation tool, and 4.1X faster than BOCHS [3], a full-system emulator. QTrace without instrumentation performs similarly to unmodified QEMU.
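QTrace's real API is C-based and not reproduced here; the following Python model is a hypothetical sketch of the callback pattern such a full-system instruction-count plug-in relies on: the emulator invokes a hook for every executed instruction, including kernel-mode ones, which is exactly what user-level tools cannot observe.

```python
class InstructionCountPlugin:
    """Hypothetical callback-style plug-in, in the spirit of (but not
    copied from) QTrace's API: counts user and kernel instructions
    separately as the emulator reports them."""
    def __init__(self):
        self.user_insns = 0
        self.kernel_insns = 0

    def on_instruction(self, pc, in_kernel):
        if in_kernel:
            self.kernel_insns += 1
        else:
            self.user_insns += 1

def emulate(trace, plugin):
    """Stand-in for the emulator's dispatch loop, which would call the
    plug-in hook once per executed guest instruction."""
    for pc, in_kernel in trace:
        plugin.on_instruction(pc, in_kernel)

plugin = InstructionCountPlugin()
emulate([(0x400000, False), (0xFFFF8000, True), (0x400004, False)], plugin)
```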
Title: QTrace: a framework for customizable full system instrumentation
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095809
R. D. Jong, Andreas Hansson
As mobile devices gain ever more capabilities, their software and hardware complexity increases. Full-system performance generally depends on complex hardware/software interactions, making it hard to reason about the performance impact of new components. These challenges are especially prominent for flash storage, which involves a complex hardware architecture and an extensive software stack. Universal Flash Storage (UFS) is an emerging flash interface proposed to address the growing demands of mobile workloads. However, due to the complexity of the system, it is hard to determine the storage device's contribution to the performance perceived by the user. To study the impact of flash storage on modern mobile systems, and to evaluate next-generation flash devices, we introduce a detailed UFS device model in the open-source full-system simulator gem5. We show the impact of flash devices of varying performance on real mobile workloads and compare the results with existing systems. Contrary to claims made by related work, we show that web browsing performance in itself is independent of flash performance, but that tasks that heavily utilize the flash storage clearly benefit from UFS. Our work enables performance analysis in both the hardware and software layers of the storage system, and thus provides a platform for further research into mobile flash storage.
Title: A full-system approach to analyze the impact of next-generation mobile flash storage
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095804
M. Yoon, Yunho Oh, Sangpil Lee, Seung-Hun Kim, Deokho Kim, W. Ro
Hiding operation stalls is an important issue in suppressing performance degradation of Graphics Processing Units (GPUs). In this paper, we first conduct a detailed study of the factors affecting operation stalls in terms of the fetch group size used by the warp scheduler. We find that the fetch group size is highly involved in hiding various types of operation stalls. Short-latency stalls can be hidden by issuing other available warps from the same fetch group; they may therefore not be hidden well with a small fetch group, since the group has only a limited number of issuable warps. By contrast, long-latency stalls can be hidden by dividing warps into multiple fetch groups: the scheduler switches fetch groups when the warps in the current group reach a long-latency memory operation. These stalls may therefore not be hidden well with a large fetch group, since increasing the fetch group size reduces the number of fetch groups available for switching. In addition, load/store unit stalls are caused by the limited hardware resources available to handle memory operations. To hide all of these stalls effectively, we propose the Dynamic Resizing on Active Warps (DRAW) scheduler, which adjusts the size of the active fetch group.
Our evaluation shows that the DRAW scheduler reduces stall cycles by an average of 16.3% and improves performance by an average of 11.3% compared to the conventional two-level warp scheduler.
Title: DRAW: investigating benefits of adaptive fetch group size on GPU
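The adaptive trade-off described in the abstract can be caricatured in a few lines. This heuristic is our own illustration, not the DRAW algorithm: it grows the active fetch group when short-latency stalls dominate (more issuable warps inside the group) and shrinks it when long-latency stalls dominate (more groups to switch between on memory operations).

```python
def resize_fetch_group(total_warps, group_size,
                       short_stall_rate, long_stall_rate):
    """Toy resizing heuristic (illustrative only): double the fetch
    group when short-latency stalls dominate, halve it when
    long-latency stalls dominate, otherwise leave it unchanged."""
    if short_stall_rate > long_stall_rate and group_size < total_warps:
        return min(total_warps, group_size * 2)
    if long_stall_rate > short_stall_rate and group_size > 2:
        return max(2, group_size // 2)
    return group_size
```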
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095798
Robert Smolinski, Rakesh Komuravelli, Hyojin Sung, S. Adve
While many techniques have been shown to reduce the amount of on-chip network traffic, no studies have shown how close a combined approach would come to eliminating all unnecessary data traffic, nor have any studies provided insight into where the remaining challenges lie. This paper systematically analyzes the traffic inefficiencies of a directory-based MESI protocol and of a more efficient hardware-software co-designed protocol, DeNovo. We classify data waste into several categories and explore simple optimizations that extend DeNovo with the aim of eliminating all on-chip network traffic waste. With all the proposed optimizations, we completely eliminate (100%) on-chip network traffic waste at L2 for some applications (93.5% on average) compared to the previous DeNovo protocol.
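The notion of traffic waste can be illustrated with a toy accounting (our own sketch; the paper's waste taxonomy is more fine-grained): words of a transferred cache line that are never touched before eviction count as wasted traffic.

```python
def categorize_traffic(transfers, line_words=16):
    """Toy waste accounting for cache-line transfers (64-byte lines of
    sixteen 4-byte words by default): each transfer is the set of word
    indices actually touched; untouched words count as waste."""
    useful = wasted = 0
    for words_touched in transfers:
        used = len(set(words_touched))
        useful += used
        wasted += line_words - used
    return useful, wasted
```

For example, a line with only 4 of 16 words used next to a fully used line yields 20 useful and 12 wasted words, and the wasted fraction is what a co-designed protocol like DeNovo tries to drive to zero.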
Title: Eliminating on-chip traffic waste: are we there yet?