
Latest publications from the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Performance evaluation of a DySER FPGA prototype system spanning the compiler, microarchitecture, and hardware implementation
C. Ho, Venkatraman Govindaraju, Tony Nowatzki, R. Nagaraju, Zachary Marzec, Preeti Agarwal, Chris Frericks, Ryan Cofell, K. Sankaralingam
Specialization and accelerators are being proposed as an effective way to address the slowdown of Dennard scaling. DySER is one such accelerator, which dynamically synthesizes large compound functional units to match program regions, using a co-designed compiler and microarchitecture. We have completed a full prototype implementation of DySER integrated into the OpenSPARC processor (called SPARC-DySER), a co-designed compiler in LLVM, and a detailed performance evaluation on an FPGA system, which runs an Ubuntu Linux distribution and full applications. Through the prototype, this paper evaluates the fundamental principles of DySER acceleration. Our two key findings are: i) the DySER execution model and microarchitecture provide energy-efficient speedups, and the integration of DySER does not introduce overheads; overall, DySER's performance improvement to OpenSPARC is 6X, consuming only 200 mW; ii) on the compiler side, the DySER compiler is effective at extracting computationally intensive regular and irregular code. For non-computationally-intensive irregular code, two control flow shapes curtail the compiler's effectiveness, and we identify potential adaptive mechanisms. Finally, our experience of bringing up an end-to-end prototype of an ISA-exposed accelerator has made clear that two particular artifacts are greatly needed to perform this type of design more quickly and effectively: 1) open-source implementations of high-performance baseline processors, and 2) declarative tools for quickly specifying combinations of known compiler transforms.
Citations: 24
Pairminer: mining for paired functions in Kernel extensions
Hu-Qiu Liu, Jia-Ju Bai, Yuping Wang, Zhe Bian, Shimin Hu
Drivers use kernel extension functions to manage devices, and there are often many rules on how these functions should be used. Among these rules, the use of paired functions, i.e., functions that must be called in pairs from two different functions, is extremely complex and important. However, such pairing rules are not well documented, and programmers can easily violate them when they unconsciously ignore or forget about them. It is therefore useful to develop a tool that automatically extracts paired functions from the kernel source and detects incorrect usages. In this paper we put forward a method called PairMiner. It combines heuristic and statistical mechanisms, tailored to the particular structure of driver source code, to discover paired functions across related operations, and then uses the extracted pairs to detect violations. In the experimental evaluation, we successfully found 1023 paired functions in Linux 3.10.10. The utility of PairMiner was evaluated by analyzing the source code of Linux 2.6.38 and 3.10.10. PairMiner located 265 bugs involving paired-function violations in 2.6.38 that have been fixed in 3.10.10. We also identified 1994 paired-function violations that have not yet been fixed in 3.10.10. We have reported some violations as potential bugs by email to the developers; 27 developers have replied, 20 bugs have been confirmed so far, and 2 violations have been confirmed as false positives.
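The mining idea described in the abstract, ranking candidate pairs by support and confidence statistics over call sequences and then flagging sequences that call the opening function without its counterpart, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and thresholds below are hypothetical.

```python
from collections import Counter

def mine_paired_functions(call_sequences, min_support=3, min_confidence=0.9):
    """Mine candidate paired functions (e.g. an acquire/release pattern)
    from per-function call sequences using support/confidence statistics."""
    support = Counter()      # times (a, b) observed with b after a
    occurrences = Counter()  # times the "opening" call a observed at all
    for calls in call_sequences:
        for i, a in enumerate(calls):
            occurrences[a] += 1
            for b in calls[i + 1:]:
                support[(a, b)] += 1
    pairs = []
    for (a, b), s in support.items():
        if s >= min_support and s / occurrences[a] >= min_confidence:
            pairs.append((a, b))
    return pairs

def find_violations(call_sequences, pairs):
    """Report sequences that call the opening function without its pair."""
    violations = []
    for idx, calls in enumerate(call_sequences):
        for a, b in pairs:
            if a in calls and b not in calls:
                violations.append((idx, a, b))
    return violations
```

Running the miner on nine sequences that pair `open` with `close` plus one that omits `close` yields `("open", "close")` as a mined pair and flags the tenth sequence as a violation; a real tool like PairMiner additionally exploits driver code structure, which this sketch omits.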
Citations: 3
Critical-path candidates: scalable performance modeling for MPI workloads
Jian Chen, R. Clapp
Efficient and scalable performance modeling is essential to high-performance cluster computing. Critical-path-based performance analysis is widely used because it provides valuable insights into the performance of parallel programs, but it is also expensive, inefficient, and inflexible due to its strong reliance on trace-driven simulation. This paper presents an innovative performance modeling framework based on a novel concept of critical-path candidates: a group of paths that could potentially be the critical path. Using instruction and communication counts as the metrics, the critical-path candidates capture the intrinsic computation and communication dependencies, and hence can be reused for exploring multiple design options. Using real-world MPI workloads, we show that the proposed framework achieves a modeling accuracy within 10% of the measured runtime for up to 16K MPI ranks. The framework provides an efficient and scalable platform for performance analysis as well as load-imbalance analysis.
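The reuse idea in the abstract (keep a set of candidate paths characterized only by instruction and communication counts, then re-evaluate them under different machine parameters) can be sketched on a toy task DAG. This is an illustrative model, not the paper's framework; the CPI and per-message-latency parameters are hypothetical machine knobs.

```python
def path_time(path, node_insns, edge_msgs, cpi, msg_latency):
    """Evaluate one candidate path for a given machine model:
    instruction count x CPI plus message count x per-message latency."""
    t = sum(node_insns[n] * cpi for n in path)
    t += sum(edge_msgs[(a, b)] * msg_latency for a, b in zip(path, path[1:]))
    return t

def all_paths(graph, src, dst):
    """Enumerate source-to-sink paths in a small DAG (the candidate set)."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            yield path
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, path + [nxt]))

def critical_path(graph, src, dst, node_insns, edge_msgs, cpi, msg_latency):
    """Pick the critical path among the candidates for one design point."""
    return max(all_paths(graph, src, dst),
               key=lambda p: path_time(p, node_insns, edge_msgs, cpi, msg_latency))
```

With a compute-heavy branch and a communication-heavy branch, a fast network makes the compute branch critical while a slow network shifts the critical path to the communication branch; the candidate set itself never changes, only its evaluation.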
Citations: 12
Prometheus: scalable and accurate emulation of task-based applications on many-core systems
Gokcen Kestor, R. Gioiosa, D. Chavarría-Miranda
Modeling the performance of non-deterministic parallel applications on future many-core systems requires the development of novel simulation and emulation techniques and tools. We present Prometheus, a fast, accurate, and modular emulation framework for task-based applications. By raising the level of abstraction and focusing on runtime synchronization, Prometheus can accurately predict applications' performance on very large many-core systems. We validate our emulation framework against two real platforms (AMD Interlagos and Intel MIC) and report error rates generally below 4%. We then evaluate Prometheus' performance and scalability: our results show that Prometheus can emulate a task-based application on a system with 512K cores in 11.5 hours. We present two test cases that show how Prometheus can be used to study the performance and behavior of systems with some of the characteristics expected from exascale supercomputer nodes, such as active power management and processors with a high number of cores but reduced cache per core.
Citations: 9
Multi-program benchmark definition
Adam N. Jacobvitz, Andrew D. Hilton, Daniel J. Sorin
Although the definition of a single-program benchmark is relatively straightforward (a benchmark is a program plus a specific input), the definition of multi-program benchmarks is more complex. Each program may have a different runtime, and the programs may interact differently depending on how they align with each other. While prior work has focused on sampling multi-program benchmarks, little attention has been paid to defining the benchmarks in their entirety. In this work, we propose a four-tuple that formally defines multi-program benchmarks in a well-defined way. We then examine how four different classes of benchmarks, created by varying the elements of this tuple, align with real-world use cases. We evaluate the impact of these variations on real hardware, and see drastic differences in results between benchmarks constructed from the same programs. Notable differences include significant speedups versus slowdowns (e.g., +57% vs. -5%, or +26% vs. -18%) and large differences in magnitude even when the results are in the same direction (e.g., 67% versus 11%).
Citations: 9
On latency in GPU throughput microarchitectures
M. Andersch, J. Lucas, M. Alvarez-Mesa, B. Juurlink
Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. Presently, while it appears to be well understood how performance can be improved by increasing throughput, it is less clear what the effects of micro-architectural latencies are on the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavior, and performance impact of microarchitecture latency components in modern GPUs. This work attempts to fill that gap by analyzing both the idle (static) as well as loaded (dynamic) latency behavior of GPU microarchitectural components. Our results show that GPUs are not as effective in latency hiding as commonly thought and based on that, we argue that latency should also be a GPU design consideration besides throughput.
Citations: 11
QTrace: a framework for customizable full system instrumentation
Xin Tong, Andreas Moshovos
This work presents QTrace, an open-source instrumentation extension API for QEMU [1] that can instrument unmodified applications and OS binaries for uni- and multi-processor systems. QTrace facilitates the development of custom, full-system instrumentation tools for the x86 guest architecture, enabling statistics collection and program-execution studies that include system-level code. This paper illustrates QTrace's API through instrumentation examples, discusses how QEMU was modified to implement QTrace, explains the validation testing procedures, shows QTrace's usefulness compared with a user-level binary instrumentation tool on workloads that spend significant time in the kernel, and demonstrates that QTrace does not impose a significant performance penalty. Experiments show that for an instruction-count plug-in, QTrace is 12.2X slower than Pin [2], a user-level-only instrumentation tool, and 4.1X faster than BOCHS [3], a full-system emulator. QTrace without instrumentation performs similarly to the unmodified QEMU.
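Conceptually, the instruction-count plug-in benchmarked above reduces to a per-instruction callback that the emulator invokes during guest execution. The sketch below models that interface in plain Python; the callback name and trace format are hypothetical illustrations, not QTrace's actual API.

```python
class InstructionCountPlugin:
    """Toy model of a full-system instruction-count plug-in: the emulator
    calls on_instruction() once per executed guest instruction, and the
    plug-in separates user-mode from kernel-mode execution."""

    def __init__(self):
        self.user = 0
        self.kernel = 0

    def on_instruction(self, pc, in_kernel):
        # A full-system tool also observes kernel-mode instructions, which
        # a user-level-only instrumentation tool would miss.
        if in_kernel:
            self.kernel += 1
        else:
            self.user += 1

# Hypothetical (pc, in_kernel) trace standing in for emulated execution.
trace = [(0x400000, False), (0x400004, False),
         (0xffffffff81000000, True), (0x400008, False)]
plugin = InstructionCountPlugin()
for pc, in_kernel in trace:
    plugin.on_instruction(pc, in_kernel)
```

The split between user and kernel counts is exactly what distinguishes full-system instrumentation from user-level tools on kernel-heavy workloads, as the comparison against Pin in the abstract highlights.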
Citations: 5
A full-system approach to analyze the impact of next-generation mobile flash storage
R. D. Jong, Andreas Hansson
As mobile devices gain ever more capabilities, their software and hardware complexity increases. Full-system performance generally depends on complex hardware/software interactions, making it hard to reason about the performance impact of new components. These challenges are especially prominent for flash storage, as it involves a complex hardware architecture and an extensive software stack. Universal Flash Storage (UFS) is an emerging flash interface proposed to address the growing demands of mobile workloads. However, due to the complexity of the system, it is hard to determine the contribution of the storage device to the performance perceived by the user. To study the impact of flash storage on modern mobile systems, and to evaluate next-generation flash devices, we introduce a detailed UFS device model in the open-source full-system simulator gem5. We show the impact of flash devices with different performance levels on real mobile workloads, and compare the results with existing systems. Contrary to claims made by related work, we show that web browsing performance in itself is independent of flash performance, but that tasks heavily utilizing flash storage clearly see the benefits of UFS. Our work enables performance analysis in both the hardware and software layers of the storage system, and thus provides a platform for further research into mobile flash storage.
Citations: 2
DRAW: investigating benefits of adaptive fetch group size on GPU
M. Yoon, Yunho Oh, Sangpil Lee, Seung-Hun Kim, Deokho Kim, W. Ro
Hiding operation stalls is one of the important issues in suppressing the performance degradation of Graphics Processing Units (GPUs). In this paper, we first conduct a detailed study of the factors affecting operation stalls in terms of the fetch group size of the warp scheduler. Throughout this paper, we find that the size of the fetch group is highly involved in hiding various types of operation stalls. Short-latency stalls can be hidden by issuing other available warps from the same fetch group; therefore, they may not be hidden well with a small fetch group, since the group has only a limited number of issuable warps to hide stalls. Conversely, long-latency stalls can be hidden by dividing warps into multiple fetch groups: the scheduler switches fetch groups when the warps in the active fetch group reach a long-latency memory operation. Therefore, these stalls may not be hidden well with a large fetch group, since increasing the fetch group size reduces the number of fetch groups available to hide the stalls. In addition, load/store unit stalls are caused by the limited hardware resources for handling memory operations. To hide all these stalls effectively, we propose a Dynamic Resizing on Active Warps (DRAW) scheduler, which adjusts the size of the active fetch group. From the evaluation results, the DRAW scheduler reduces stall cycles by an average of 16.3% and improves performance by an average of 11.3% compared to the conventional two-level warp scheduler.
DOI: 10.1109/ISPASS.2015.7095804
Citations: 6
Eliminating on-chip traffic waste: are we there yet?
Robert Smolinski, Rakesh Komuravelli, Hyojin Sung, S. Adve
While many techniques have been shown to be successful at reducing the amount of on-chip network traffic, no studies have shown how close a combined approach would come to eliminating all unnecessary data traffic, nor have any studies provided insight into where the remaining challenges lie. This paper systematically analyzes the traffic inefficiencies of a directory-based MESI protocol and a more efficient hardware-software co-designed protocol, DeNovo. We categorize data waste into various categories and explore several simple optimizations extending DeNovo with the aim of eliminating all of the on-chip network traffic waste. With all the proposed optimizations, we are able to completely eliminate (100%) on-chip network traffic waste at L2 for some of the applications (93.5% on average) compared to the previous DeNovo protocol.
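The waste metric implied by the abstract can be made concrete: every word of a cache line that is moved on-chip but never actually accessed counts as wasted traffic. The sketch below is a hypothetical illustration of that accounting, not the paper's methodology; the line size and data representation are assumptions.

```python
# Hypothetical sketch of a traffic-waste metric: the fraction of
# transferred words that were never used by the requesting core.

WORDS_PER_LINE = 16  # e.g. a 64-byte line of 4-byte words (assumed)

def traffic_waste(transfers):
    """Compute the wasted fraction of on-chip data traffic.

    transfers: a list with one entry per transferred cache line; each
    entry is the set of word indices within that line that were
    actually read or written after the transfer.
    """
    total_words = len(transfers) * WORDS_PER_LINE
    used_words = sum(len(used) for used in transfers)
    return (total_words - used_words) / total_words
```

Under this accounting, transferring two lines and touching every word of one but none of the other yields 50% waste; a protocol that moved only the used words would report 0%.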
DOI: 10.1109/ISPASS.2015.7095798
Citations: 2
Journal
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)