Mark Wilkening, Vilas Sridharan, Si Li, Fritz G. Previlon, S. Gurumurthi, D. Kaeli
Reliability is an important design constraint in modern microprocessors, and one of the fundamental reliability challenges is combating the effects of transient faults. This requires extensive analysis, including significant fault modeling, to allow architects to make informed reliability tradeoffs. Recent data shows that multi-bit transient faults are becoming more common, increasing from 0.5% of static random-access memory (SRAM) faults at 180nm to 3.9% at 22nm. Such faults are predicted to be even more prevalent in smaller technology nodes. Therefore, accurately modeling the effects of multi-bit transient faults is increasingly important to the microprocessor design process. Architectural vulnerability factor (AVF) analysis is a method to model the effects of single-bit transient faults. In this paper, we propose a method to calculate AVFs for spatial multi-bit transient faults (MB-AVFs) and provide insights that can help reduce the impact of these faults. First, we describe a novel multi-bit AVF analysis approach for detected uncorrected errors (DUEs) and show how to measure DUE MB-AVFs in a performance simulator. We then extend our approach to measure silent data corruption (SDC) MB-AVFs. We find that MB-AVFs are not derivable from single-bit AVFs. We also find that larger fault modes have higher MB-AVFs. Finally, we present a case study on using MB-AVF analysis to optimize processor design, yielding SDC reductions of 86% in a GPU vector register file.
{"title":"Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults","authors":"Mark Wilkening, Vilas Sridharan, Si Li, Fritz G. Previlon, S. Gurumurthi, D. Kaeli","doi":"10.1109/MICRO.2014.15","DOIUrl":"https://doi.org/10.1109/MICRO.2014.15","url":null,"abstract":"Reliability is an important design constraint in modern microprocessors, and one of the fundamental reliability challenges is combating the effects of transient faults. This requires extensive analysis, including significant fault modelling allow architects to make informed reliability tradeoffs. Recent data shows that multi-bit transient faults are becoming more common, increasing from 0.5% of static random-access memory (SRAM) faults in 180nm to 3.9% in 22nm. Such faults are predicted to be even more prevalent in smaller technology nodes. Therefore, accurately modeling the effects of multi-bit transient faults is increasingly important to the microprocessor design process. Architecture vulnerability factor (AVF) analysis is a method to model the effects of single-bit transient faults. In this paper, we propose a method to calculate AVFs for spatial multibittransient faults (MB-AVFs) and provide insights that can help reduce the impact of these faults. First, we describe a novel multi-bit AVF analysis approach for detected uncorrected errors (DUEs) and show how to measure DUE MB-AVFs in a performance simulator. We then extend our approach to measure silent data corruption (SDC) MB-AVFs. We find that MB-AVFs are not derivable from single-bit AVFs. We also find that larger fault modes have higher MB-AVFs. Finally, we present a case study on using MB-AVF analysis to optimize processor design, yielding SDC reductions of 86% in a GPU vector register file.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"25 1","pages":"293-305"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82779557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunsup Lee, Vinod Grover, R. Krashinsky, M. Stephenson, S. Keckler, K. Asanović
Data-parallel architectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide data paths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches, where divergence is explicitly managed by the compiler, to hardware solutions, where divergence is managed implicitly by the microarchitecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallel architectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance with one that has hardware support for tracking divergence.
{"title":"Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures","authors":"Yunsup Lee, Vinod Grover, R. Krashinsky, M. Stephenson, S. Keckler, K. Asanović","doi":"10.1109/MICRO.2014.48","DOIUrl":"https://doi.org/10.1109/MICRO.2014.48","url":null,"abstract":"Data-parallel architectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide data paths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches where divergence is explicitly managed by the compiler, to hardware solutions where divergence is managed implicitly by the micro architecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallel architectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance to one with hardware support for tracking divergence.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"136 1","pages":"101-113"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76383792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
General-purpose compilers aim to extract the best average performance for all possible user applications. Due to the lack of specialization for different types of computations, compiler-attained performance often lags behind that of manually optimized libraries. In this paper, we demonstrate a new approach, programmable composition, to enable the specialization of compiler optimizations without compromising their generality. Our approach uses a single pass of source-level analysis to recognize a common pattern among dense matrix computations. It then tags the recognized patterns to trigger a sequence of general-purpose compiler optimizations specially composed for them. We show that by allowing different optimizations to adequately communicate with each other through a set of coordination handles and dynamic tags inserted inside the optimized code, we can specialize the composition of general-purpose compiler optimizations to attain a level of performance comparable to that of assembly code manually written by experts, thereby allowing selected computations in applications to benefit from levels of optimization similar to those manually applied by experts.
{"title":"Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations","authors":"Qing Yi, Qian Wang, Huimin Cui","doi":"10.1109/MICRO.2014.14","DOIUrl":"https://doi.org/10.1109/MICRO.2014.14","url":null,"abstract":"General purpose compilers aim to extract the best average performance for all possible user applications. Due to the lack of specializations for different types of computations, compiler attained performance often lags behind those of the manually optimized libraries. In this paper, we demonstrate a new approach, programmable composition, to enable the specialization of compiler optimizations without compromising their generality. Our approach uses a single pass of source-level analysis to recognize a common pattern among dense matrix computations. It then tags the recognized patterns to trigger a sequence of general-purpose compiler optimizations specially composed for them. We show that by allowing different optimizations to adequately communicate with each other through a set of coordination handles and dynamic tags inserted inside the optimized code, we can specialize the composition of general-purpose compiler optimizations to attain a level of performance comparable to those of manually written assembly code by experts, thereby allowing selected computations in applications to benefit from similar levels of optimizations as those manually applied by experts.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"1 1","pages":"596-608"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72944041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunqi Zhang, M. Laurenzano, Jason Mars, Lingjia Tang
One of the key challenges for improving efficiency in warehouse scale computers (WSCs) is to improve server utilization while guaranteeing the quality of service (QoS) of latency-sensitive applications. To this end, prior work has proposed techniques to precisely predict performance and QoS interference to identify 'safe' application co-locations. However, such techniques are only applicable to resources shared across cores. Achieving such precise interference prediction on real-system simultaneous multithreading (SMT) architectures has remained a significantly challenging open problem due to the complexity introduced by sharing resources within a core. In this paper, we demonstrate through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way we model interference. For SMT servers, the interference on different shared resources, including private caches, memory ports, and integer and floating-point functional units, does not correlate across resources. This insight suggests the necessity of decoupling interference into multiple resource-sharing dimensions. In this work, we propose SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors. With a set of Rulers, which are carefully designed software stressors that apply pressure to a multidimensional space of shared resources, we quantify application sensitivity and contentiousness in a decoupled manner. We then establish a regression model that combines sensitivity and contentiousness in different dimensions to predict performance interference. Using this methodology, we are able to precisely predict the performance interference in SMT co-location with an average error of 2.80% on SPEC CPU2006 and 1.79% on CloudSuite. Our evaluation shows that SMiTe allows us to improve the utilization of WSCs by up to 42.57% while enforcing an application's QoS requirements.
{"title":"SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers","authors":"Yunqi Zhang, M. Laurenzano, Jason Mars, Lingjia Tang","doi":"10.1109/MICRO.2014.53","DOIUrl":"https://doi.org/10.1109/MICRO.2014.53","url":null,"abstract":"One of the key challenges for improving efficiency in warehouse scale computers (WSCs) is to improve server utilization while guaranteeing the quality of service (QoS) of latency-sensitive applications. To this end, prior work has proposed techniques to precisely predict performance and QoS interference to identify 'safe' application co-locations. However, such techniques are only applicable to resources shared across cores. Achieving such precise interference prediction on real-system simultaneous multithreading (SMT) architectures has been a significantly challenging open problem due to the complexity introduced by sharing resources within a core. In this paper, we demonstrate through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way we model interference. For SMT servers, the interference on different shared resources, including private caches, memory ports, as well as integer and floating-point functional units, do not correlate with each other. This insight suggests the necessity of decoupling interference into multiple resource sharing dimensions. In this work, we propose SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors. With a set of Rulers, which are carefully designed software stressors that apply pressure to a multidimensional space of shared resources, we quantify application sensitivity and contentiousness in a decoupled manner. We then establish a regression model to combine the sensitivity and contentiousness in different dimensions to predict performance interference. Using this methodology, we are able to precisely predict the performance interference in SMT co-location with an average error of 2.80% on SPEC CPU2006 and 1.79% on Cloud Suite. Our evaluation shows that SMiTe allows us to improve the utilization of WSCs by up to 42.57% while enforcing an application's QoS requirements.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"267 1","pages":"406-418"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73304228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prior research in neurally-inspired perceptron predictors and geometric-history-length-based TAGE predictors has shown significant improvements in branch prediction accuracy by exploiting correlations in long branch histories. However, not all branches in the long branch history provide useful context. Biased branches resolve as either taken or not-taken virtually every time. Including them in the branch predictor's history does not directly contribute any useful information, but all existing history-based predictors include them anyway. In this work, we propose Bias-Free branch predictors that are structured to learn correlations only with non-biased conditional branches, i.e., branches whose dynamic behavior varies during a program's execution. This, combined with a recency-stack-like management policy for the global history register, opens up the opportunity for a modest history length to include much older and much richer context to predict future branches more accurately. With a 64KB storage budget, the Bias-Free predictor delivers 2.49 MPKI (mispredictions per 1000 instructions), improves by 5.32% over the most accurate neural predictor, and achieves accuracy comparable to that of the TAGE predictor with fewer predictor tables, or better accuracy with the same number of tables. This ultimately translates to lower energy dissipated in the memory arrays per prediction.
{"title":"Bias-Free Branch Predictor","authors":"Dibakar Gope, Mikko H. Lipasti","doi":"10.1109/MICRO.2014.32","DOIUrl":"https://doi.org/10.1109/MICRO.2014.32","url":null,"abstract":"Prior research in neutrally-inspired perceptron predictors and Geometric History Length-based TAGE predictors has shown significant improvements in branch prediction accuracy by exploiting correlations in long branch histories. However, not all branches in the long branch history provide useful context. Biased branches resolve as either taken or not-taken virtually every time. Including them in the branch predictor's history does not directly contribute any useful information, but all existing history-based predictors include them anyway. In this work, we propose Bias-Free branch predictors theatre structured to learn correlations only with non-biased conditional branches, aka. Branches whose dynamic behaviorvaries during a program's execution. This, combined with a recency-stack-like management policy for the global history register, opens up the opportunity for a modest history length to include much older and much richer context to predict future branches more accurately. With a 64KB storage budget, the Bias-Free predictor delivers 2.49 MPKI (mispredictions per1000 instructions), improves by 5.32% over the most accurate neural predictor and achieves comparable accuracy to that of the TAGE predictor with fewer predictor tables or better accuracy with same number of tables. This eventually will translate to lower energy dissipated in the memory arrays per prediction.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"190 1","pages":"521-532"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85461983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and a reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
{"title":"Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache","authors":"Djordje Jevdjic, G. Loh, Cansu Kaynak, B. Falsafi","doi":"10.1109/MICRO.2014.51","DOIUrl":"https://doi.org/10.1109/MICRO.2014.51","url":null,"abstract":"Recent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the art page-based designs that require impractical SRAM-based tags of around 50MB.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"21 1","pages":"25-37"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86939730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The construction of trustworthy systems demands that the execution of every piece of code is validated as genuine, that is, that the executed code does exactly what it is supposed to do. Pre-execution validations of code integrity fail to detect run-time compromises like code injection, return- and jump-oriented programming, and illegal dynamic linking of program modules. We propose and evaluate a generalized mechanism called REV (for Run-time Execution Validator) that can be easily integrated into a contemporary out-of-order processor to validate, as the program executes, the control flow path and the instructions executed along that path. To prevent memory from being tainted by compromised code, REV also prevents updates to memory from a basic block until its execution has been authenticated. Although control-flow-signature-based authentication of an execution has been suggested before for software testing and for restricted cases of embedded systems, extending it to out-of-order cores is a non-incremental effort from a microarchitectural standpoint. Unlike REV, the existing solutions do not scale with binary sizes, require binaries to be altered or require new ISA support, fail to contain errors, and, in general, impose a heavy performance penalty. We show, using a detailed cycle-accurate microarchitectural simulator for an out-of-order pipeline implementing the x86 ISA, that the performance overhead of REV is limited to 1.87% on average across the SPEC 2006 benchmarks.
{"title":"Continuous, Low Overhead, Run-Time Validation of Program Executions","authors":"E. Aktaş, F. Afram, K. Ghose","doi":"10.1109/MICRO.2014.18","DOIUrl":"https://doi.org/10.1109/MICRO.2014.18","url":null,"abstract":"The construction of trustworthy systems demands that the execution of every piece of code is validated as genuine, that is, the executed codes do exactly what they are supposed to do. Pre-execution validations of code integrity fail to detect run time compromises like code injection, return and jump-oriented programming, and illegal dynamic linking of program modules. We propose and evaluate a generalized mechanism called REV (for Run-time Execution Validator) that can be easily integrated into a contemporary out-of-order processor to validate, as the program executes, the control flow path and instructions executed along the control flow path. To prevent memory from being tainted by compromised code, REV also prevents updates to the memory from a basic block until its execution has been authenticated. Although control flow signature based authentication of an execution has been suggested before for software testing and for restricted cases of embedded systems, their extensions to out-of-order cores is a non-incremental effort from a micro architectural standpoint. Unlike REV, the existing solutions do not scale with binary sizes, require binaries to be altered or require new ISA support and also fail to contain errors and, in general, impose a heavy performance penalty. We show, using a detailed cycle-accurate micro architectural simulator for an out-of-order pipeline implementing the X86 ISA that the performance overhead of REV is limited to 1.87% on the average across the SPEC 2006 benchmarks.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"11 1","pages":"229-241"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84609129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Technology trends prompting architects to consider greater heterogeneity and hardware specialization have exposed an increasing need for vertically integrated research methodologies that can effectively assess performance, area, and energy metrics of future architectures. However, constructing such a methodology with existing tools is a significant challenge due to the unique languages, design patterns, and tools used in functional-level (FL), cycle-level (CL), and register-transfer-level (RTL) modeling. We introduce a new framework called PyMTL that aims to close this computer architecture research methodology gap by providing a unified design environment for FL, CL, and RTL modeling. PyMTL leverages the Python programming language to create a highly productive domain-specific embedded language for concurrent-structural modeling and hardware design. While the use of Python as a modeling and framework implementation language provides considerable benefits in terms of productivity, it comes at the cost of significantly longer simulation times. We address this performance-productivity gap with a hybrid JIT compilation and JIT specialization approach. We introduce SimJIT, a custom JIT specialization engine that automatically generates optimized C++ for CL and RTL models. To reduce the performance impact of the remaining unspecialized code, we combine SimJIT with an off-the-shelf Python interpreter with a meta-tracing JIT compiler (PyPy). SimJIT+PyPy provides speedups of up to 72× for CL models and 200× for RTL models, bringing us within 4-6× of optimized C++ code while providing significant benefits in terms of productivity and usability.
{"title":"PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research","authors":"Derek Lockhart, Gary Zibrat, C. Batten","doi":"10.1109/MICRO.2014.50","DOIUrl":"https://doi.org/10.1109/MICRO.2014.50","url":null,"abstract":"Technology trends prompting architects to consider greater heterogeneity and hardware specialization have exposed an increasing need for vertically integrated research methodologies that can effectively assess performance, area, and energy metrics of future architectures. However, constructing such a methodology with existing tools is a significant challenge due to the unique languages, design patterns, and tools used in functional-level (FL), cycle-level (CL), and register-transfer-level (RTL) modeling. We introduce a new framework called PyMTL that aims to close this computer architecture research methodology gap by providing a unified design environment for FL, CL, and RTL modeling. PyMTL leverages the Python programming language to create a highly productive domain-specific embedded language for concurrent-structural modeling and hardware design. While the use of Python as a modeling and framework implementation language provides considerable benefits in terms of productivity, it comes at the cost of significantly longer simulation times. We address this performance-productivity gap with a hybrid JIT compilation and JIT specialization approach. We introduce Sim JIT, a custom JIT specialization engine that automatically generates optimized C++ for CL and RTL models. To reduce the performance impact of the remaining unspecialized code, we combine Sim JIT with an off-the-shelf Python interpreter with a meta-tracing JIT compiler (PyPy). Sim JIT+PyPy provides speedups of up to 72× for CL models and 200× for RTL models, bringing us within 4-6× of optimized C++ code while providing significant benefits in terms of productivity and usability.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"03 1","pages":"280-292"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86101598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present PipeCheck, a methodology and automated tool for verifying that a particular microarchitecture correctly implements the consistency model required by its architectural specification. PipeCheck adapts the notion of a "happens before" graph from architecture-level analysis techniques to the microarchitecture space. Each node in the "microarchitecturally happens before" (μhb) graph represents not only a memory instruction, but also a particular location (e.g., pipeline stage) within the microarchitecture. Architectural specifications such as "preserved program order" are then treated as propositions to be verified, rather than simply as assumptions. PipeCheck allows an architect to easily and rigorously test whether a microarchitecture is stronger than, equal in strength to, or weaker than its architecturally-specified consistency model. We also specify and analyze the behavior of common microarchitectural optimizations such as speculative load reordering, which technically violate formal architecture-level definitions. We evaluate PipeCheck using a library of established litmus tests on a set of open-source pipelines. Using PipeCheck, we were able to validate the largest pipeline, the OpenSPARC T2, in just minutes. We also identified a bug in the O3 pipeline of the gem5 simulator.
{"title":"PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models","authors":"Daniel Lustig, Michael Pellauer, M. Martonosi","doi":"10.1109/MICRO.2014.38","DOIUrl":"https://doi.org/10.1109/MICRO.2014.38","url":null,"abstract":"We present PipeCheck, a methodology and automated tool for verifying that a particular micro architecture correctly implements the consistency model required by its architectural specification. PipeCheck adapts the notion of a \"happens before\" graph from architecture-level analysis techniques to the micro architecture space. Each node in the \"micro architecturally happens before\" (μhb) graph represents not only a memory instruction, but also a particular location (e.g., Pipeline stage) within the micro architecture. Architectural specifications such as \"preserved program order\" are then treated as propositions to be verified, rather than simply as assumptions. PipeCheck allows an architect to easily and rigorously test whether a micro architecture is stronger than, equal in strength to, or weaker than its architecturally-specified consistency model. We also specify and analyze the behavior of common micro architectural optimizations such as speculative load reordering which technically violate formal architecture-level definitions. We evaluate PipeCheck using a library of established litmus tests on a set of open-source pipelines. Using PipeCheck, we were able to validate the largest pipeline, the Open SPARC T2, in just minutes. We also identified a bug in the O3 pipeline of the gem5 simulator.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"47 1","pages":"635-646"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90735156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Su, Junli Gu, Li Shen, Wei Huang, J. Greathouse, Zhiying Wang
Performance, power, and energy (PPE) are critical aspects of modern computing. It is challenging to accurately predict, in real time, the effect of dynamic voltage and frequency scaling (DVFS) on PPE across a wide range of voltages and frequencies. This results in the use of reactive, iterative, and inefficient algorithms for dynamically finding good DVFS states. We propose PPEP, an online PPE prediction framework that proactively and rapidly searches the DVFS space. PPEP uses hardware events to implement both a cycles-per-instruction (CPI) model and a per-core power model in order to predict PPE across all DVFS states. We verify on modern AMD CPUs that the PPEP power model achieves an average error of 4.6% (2.8% standard deviation) on 152 benchmark combinations at 5 distinct voltage-frequency states. Predicting average chip power across different DVFS states achieves an average error of 4.2% with a 3.6% standard deviation. Further, we demonstrate the usage of PPEP by creating and evaluating a highly responsive power capping mechanism that can meet power targets in a single step. PPEP also provides insights for future development of DVFS technologies. For example, we find that it is important to carefully consider background workloads for DVFS policies and that enabling northbridge DVFS can offer up to 20% additional energy savings or a 1.4x performance improvement.
{"title":"PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration","authors":"Bo Su, Junli Gu, Li Shen, Wei Huang, J. Greathouse, Zhiying Wang","doi":"10.1109/MICRO.2014.17","DOIUrl":"https://doi.org/10.1109/MICRO.2014.17","url":null,"abstract":"Performance, power, and energy (PPE) are critical aspects of modern computing. It is challenging to accurately predict, in real time, the effect of dynamic voltage and frequency scaling (DVFS) on PPE across a wide range of voltages and frequencies. This results in the use of reactive, iterative, and inefficient algorithms for dynamically finding good DVFS states. We propose PPEP, an online PPE prediction framework that proactively and rapidly searches the DVFS space. PPEP uses hardware events to implement both a cycles-per-instruction (CPI) model as well as a per-core power model in order to predict PPE across all DVFS states. We verify on modern AMD CPUs that the PPEP power model achieves an average error of 4.6% (2.8% standard deviation) on 152 benchmark combinations at 5 distinct voltage-frequency states. Predicting average chip power across different DVFS states achieves an average error of 4.2% with a 3.6% standard deviation. Further, we demonstrate the usage of PPEP by creating and evaluating a highly responsive power capping mechanism that can meet power targets in a single step. PPEP also provides insights for future development of DVFS technologies. For example, we find that it is important to carefully consider background workloads for DVFS policies and that enabling north bridge DVFS can offer up to 20% additional energy saving or a 1.4x performance improvement.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"111 1","pages":"445-457"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76037780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}