International Symposium on Code Generation and Optimization (CGO'07): Latest Publications
Evaluating Indirect Branch Handling Mechanisms in Software Dynamic Translation Systems
Pub Date : 2011-07-01 DOI: 10.1145/1970386.1970390
Jason Hiser, Daniel W. Williams, Wei Hu, J. Davidson, Jason Mars, B. Childers
Software dynamic translation (SDT) systems are used for program instrumentation, dynamic optimization, security, intrusion detection, and many other purposes. As noted by many researchers, a major source of SDT overhead is the execution of the code needed to translate an indirect branch's target address into the address of the translated destination block. This paper discusses the sources of indirect branch (IB) overhead in SDT systems and evaluates several techniques for reducing it. Measurements using SPEC CPU2000 show that the appropriate choice and configuration of IB translation mechanisms can significantly reduce IB handling overhead. In addition, a cross-architecture evaluation of IB handling mechanisms reveals that the most efficient implementation and configuration can be highly dependent on the implementation of the underlying architecture.
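The lookup cost the abstract describes can be sketched as a translation table consulted on every indirect branch, with a fall-back into the (expensive) translator on a miss. This is only an illustrative model; the addresses, the fixed code-cache offset, and the class names are hypothetical, not the paper's mechanism.

```python
# Toy model of indirect-branch (IB) handling in an SDT system: a map from
# original target addresses to translated-block addresses, consulted on
# every IB; misses fall back to the (expensive) translator.

class Translator:
    def __init__(self):
        self.cache = {}          # original address -> translated address
        self.slow_lookups = 0    # counts fall-back translations

    def translate_block(self, target):
        """Stand-in for block translation: expensive, runs once per target."""
        self.slow_lookups += 1
        return target + 0x1000   # pretend translated code lives at a fixed offset

    def handle_indirect_branch(self, target):
        translated = self.cache.get(target)
        if translated is None:               # miss: invoke the translator
            translated = self.translate_block(target)
            self.cache[target] = translated  # memoize for later IBs
        return translated

t = Translator()
first = t.handle_indirect_branch(0x400500)   # cold: falls into the translator
second = t.handle_indirect_branch(0x400500)  # warm: served from the table
```

The techniques the paper compares differ mainly in where this table lives and how cheaply the hit path can be inlined into translated code.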
Citations: 1
Graph-Based Procedural Abstraction
Pub Date : 2008-03-25 DOI: 10.1109/CGO.2007.14
A. Dreweke, M. Wörlein, I. Fischer, Dominic Schell, T. Meinl, M. Philippsen
Procedural abstraction (PA) extracts duplicate code segments into a newly created method and hence reduces code size. For embedded microcomputers the amount of memory is still limited, so code reduction is an important issue. This paper presents a novel approach to PA that is especially targeted at embedded systems. Earlier approaches to PA are blind with respect to code reordering: two code segments with the same semantic effect but different instruction orders were not detected as candidates for PA. Instead of instruction sequences, our approach considers the data-flow graphs of basic blocks. Compared to known PA techniques, more than twice the number of instructions can be saved on a set of binaries by detecting frequently appearing graph fragments with a graph-mining tool based on the well-known gSpan algorithm. Detecting and extracting graph fragments is not as straightforward as extracting sequential code fragments: NP-complete graph operations and special rules to decide which parts can be abstracted are needed. However, this effort pays off, as smaller sizes significantly reduce costs for mass-produced embedded systems.
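The reordering blindness the abstract points at can be shown in miniature: two blocks of independent instructions in different orders are textually different but semantically identical, and a canonical form exposes the match. The paper mines data-flow graphs with gSpan; sorting independent instructions, as below, is only a crude stand-in for that normalization.

```python
# Why order-blind matching matters for procedural abstraction: reordered
# but semantically identical blocks miss a textual matcher, yet compare
# equal once put into a canonical form. (Real normalization uses the
# blocks' data-flow graphs; sorting works here only because the two
# instructions are independent.)

def canonical(block):
    """Canonicalize a block of mutually independent instructions."""
    return tuple(sorted(block))

block_a = [("add", "r1", "r2"), ("mul", "r3", "r4")]
block_b = [("mul", "r3", "r4"), ("add", "r1", "r2")]  # same effect, reordered

textually_equal = (block_a == block_b)                  # False: sequence matcher misses it
canonically_equal = (canonical(block_a) == canonical(block_b))  # True: graph-style matcher finds it
```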
Citations: 26
Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success
Pub Date : 2007-03-14 DOI: 10.1145/1229428.1229430
J. Fang
Summary form only given. Moore's Law will continue to increase the number of transistors on die for a couple of decades, as silicon technology moves from 65 nm today to 45 nm, 32 nm, and 22 nm in the future. Since power and thermal constraints increase with frequency, multi-core or many-core will be the way of the future microprocessor. In the near future, HW platforms will have many cores (>16) on die to achieve >1 TIPS of computation power, communicating with each other through an on-die interconnect fabric with >1 TB/s of on-die bandwidth and <30 cycles of latency. Off-die D-cache will employ 3D stacked-memory technology to tremendously increase off-die cache/memory bandwidth and reduce latency. Fast copper flex cables will link CPU and DRAM on the socket, and optical silicon photonics will provide up to 1 Tb/s of I/O bandwidth between boxes. A HW system with TIPS of compute power operating on terabytes of data makes this a "Tera-scale" platform. What are the SW implications as HW changes from uniprocessors to Tera-scale platforms with many cores as "the way of the future"? It will be a great challenge for programming environments to help programmers develop concurrent code for most client software. A good concurrent programming environment should extend the existing programming languages that typical programmers are familiar with, and bring benefits for concurrent programming. There are many research topics. Examples include flexible parallel programming models based on application needs, better synchronization mechanisms such as transactional memory to replace the simple "thread + lock" structure, nested data-parallel language primitives with new protocols, fine-grained synchronization mechanisms with HW support, perhaps fine-grained message passing, advanced compiler optimizations for threaded code, and SW tools in the concurrent programming environment. A more interesting problem is how to use such a many-core system to improve single-threaded performance.
Citations: 1
A Dimension Abstraction Approach to Vectorization in Matlab
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.1
N. Birkbeck, J. Levesque, J. N. Amaral
Matlab is a matrix-processing language that offers very efficient built-in operations for data organized in arrays. However, Matlab is slow when the program accesses data through interpreted loops. Often, during the development of a Matlab application, writing loop-based code is more intuitive than crafting the data organization into arrays. Furthermore, many Matlab users do not command the linear-algebra expertise necessary to write efficient code. Thus, loop-based Matlab coding is a fairly common practice. This paper presents a tool that automatically converts loop-based Matlab code into equivalent array-based form using built-in Matlab constructs. Array-based code is produced by checking the input and output dimensions of equations within loops, and by transposing terms when necessary to generate correct code. This paper also describes an extensible loop-pattern database that allows user-defined patterns to be discovered and replaced by more efficient Matlab routines that perform the same computation. The safe conversion of loop-based code into more efficient array-based code is made possible by the introduction of a new abstract representation for dimensions.
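The loop-to-array transformation the tool automates can be illustrated with a Python analogue (in Matlab the array form would be roughly `s = sum(a .* b)`); the data values here are made up for illustration:

```python
# The kind of rewrite the paper's tool performs, shown on a dot product:
# an interpreted element-by-element loop vs. the equivalent "array-based"
# form built from bulk operations.

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# Loop-based form: intuitive to write, but every iteration pays
# interpreter overhead (the slow case the abstract describes).
s_loop = 0.0
for i in range(len(a)):
    s_loop += a[i] * b[i]

# Array-based form: one pass through optimized built-ins,
# analogous to Matlab's sum(a .* b).
s_vec = sum(x * y for x, y in zip(a, b))
```

The tool's harder job, which the sketch hides, is proving from the operands' dimensions that the two forms compute the same thing.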
Citations: 23
Microarchitecture Sensitive Empirical Models for Compiler Optimizations
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.25
K. Vaswani, M. J. Thazhuthaveetil, Y. Srikant, P. Joseph
This paper proposes the use of empirical modeling techniques for building microarchitecture-sensitive models for compiler optimizations. The models we build relate program performance to settings of compiler optimization flags, associated heuristics, and key microarchitectural parameters. Unlike traditional analytical modeling methods, this relationship is learned entirely from data obtained by measuring performance at a small number of carefully selected compiler/microarchitecture configurations. We evaluate three different learning techniques in this context, namely linear regression, adaptive regression splines, and radial basis function networks. We use the generated models to a) predict program performance at arbitrary compiler/microarchitecture configurations, b) quantify the significance of complex interactions between optimizations and the microarchitecture, and c) efficiently search for 'optimal' settings of optimization flags and heuristics for any given microarchitectural configuration. Our evaluation using benchmarks from the SPEC CPU2000 suites suggests that accurate models (< 5% average prediction error) can be generated using a reasonable number of simulations. We also find that using compiler settings prescribed by a model-based search can improve program performance by as much as 19% (with an average of 9.5%) over highly optimized binaries.
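A minimal stand-in for the empirical-modeling idea: fit a linear model of runtime as a function of flag settings from a few measured configurations, then use the model to pick the predicted-best configuration. The flags, the runtimes, and the purely linear form are illustrative assumptions, not the paper's actual regressors (which include spline and RBF models over microarchitectural parameters as well).

```python
# Learn runtime = b0 + b1*f1 + b2*f2 from four "training" measurements of
# two binary optimization flags, then search the configuration space with
# the model instead of re-measuring. Data is made up and exactly linear,
# so the fit reduces to average per-flag effects.

from itertools import product

measurements = {(0, 0): 10.0, (0, 1): 9.0, (1, 0): 8.0, (1, 1): 7.0}

b1 = ((measurements[(1, 0)] - measurements[(0, 0)]) +
      (measurements[(1, 1)] - measurements[(0, 1)])) / 2  # effect of flag 1
b2 = ((measurements[(0, 1)] - measurements[(0, 0)]) +
      (measurements[(1, 1)] - measurements[(1, 0)])) / 2  # effect of flag 2
b0 = measurements[(0, 0)]                                 # baseline runtime

def predict(f1, f2):
    return b0 + b1 * f1 + b2 * f2

# Model-based search: cheap to evaluate at every configuration.
best = min(product([0, 1], repeat=2), key=lambda cfg: predict(*cfg))
```

The paper's point is that once such a model is accurate, searches like the `min` above replace expensive compile-and-measure cycles.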
Citations: 45
Evaluating Heuristic Optimization Phase Order Search Algorithms
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.9
P. Kulkarni, D. Whalley, G. Tyson
Program-specific or function-specific optimization phase sequences are universally accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase-order space search algorithms have been devised to find customized phase orderings that achieve high performance for each function. However, to make this approach of iterative compilation more widely accepted and deployed in mainstream compilers, it is essential to modify existing algorithms, or develop new ones, that find near-optimal solutions quickly. As a step in this direction, in this paper we attempt to identify and understand the important properties of some commonly employed heuristic search methods by using information collected during an exhaustive exploration of the phase-order search space. We compare the performance obtained by each algorithm with all others, as well as with the optimal phase-ordering performance. Finally, we show how we can use the features of the phase-order space to improve existing algorithms as well as devise new and better-performing search algorithms.
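One member of the heuristic family the paper evaluates can be sketched as greedy hill climbing over phase orderings: keep swapping adjacent phases while a measured cost improves. The phase names and the cost model below are made up; a real system would compile the function and time it at each step.

```python
# Greedy hill climbing over optimization-phase orderings. evaluate() is a
# mock benchmark run that rewards certain phase-before-phase relationships;
# the preferences are invented purely for illustration.

def evaluate(order):
    """Mock 'compile and measure': lower is better."""
    prefs = {("inline", "dce"): -2, ("licm", "cse"): -1}
    cost = 10
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            cost += prefs.get((a, b), 0)   # reward a appearing before b
    return cost

def hill_climb(order):
    order = list(order)
    improved = True
    while improved:                        # stop at a local optimum
        improved = False
        for i in range(len(order) - 1):
            trial = order[:]
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            if evaluate(trial) < evaluate(order):   # keep improving swaps
                order, improved = trial, True
    return order

start = ["dce", "inline", "cse", "licm"]
best = hill_climb(start)
```

The paper's exhaustive-enumeration data is what lets it say how often such local optima coincide with the true optimum.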
Citations: 49
Pipelined Execution of Critical Sections Using Software-Controlled Caching in Network Processors
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.30
J. Dai, Long Li, Bo Huang
To keep up with explosive Internet packet-processing demands, modern network processors (NPs) employ a highly parallel, multi-threaded, and multi-core architecture. In such a parallel paradigm, accesses to shared variables in external memory (and the associated memory latency) are contained in critical sections, so that they can be executed atomically and sequentially by different threads in the network processor. In this paper, we present a novel program transformation, used in the Intel® Auto-partitioning C Compiler for IXP, that exploits the inherent finer-grained parallelism of those critical sections using the software-controlled caching mechanism available in the NPs. Consequently, those critical sections can be executed in a pipelined fashion by different threads, thereby effectively hiding the memory latency and improving the performance of network applications. Experimental results show that the proposed transformation provides impressive speedup (up to 9.9x) and scalability (up to 80 threads) for a real-world network application (a 10 Gbps Ethernet Core/Metro Router).
Citations: 2
Understanding Tradeoffs in Software Transactional Memory
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.38
D. Dice, N. Shavit
There has been a flurry of recent work on the design of high-performance software and hybrid hardware/software transactional memories (STMs and HyTMs). This paper re-examines the design decisions behind several of these state-of-the-art algorithms, adopting some ideas and rejecting others, all in an attempt to make STMs faster. We created the transactional locking (TL) framework of STM algorithms and used it to conduct a range of comparisons of the performance of non-blocking, lock-based, and hybrid STM algorithms versus fine-grained hand-crafted ones. We were able to make several illuminating observations regarding lock-acquisition order, the interaction of STMs with memory-management schemes, and the role of overheads and abort rates in STM performance.
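The abort-rate tradeoff the abstract mentions can be shown with a toy, single-threaded rendition of a TL-style design: a global version clock, per-location versioned values, commit-time write-back, and read-set validation that aborts stale transactions. Real STMs add per-location locks and true concurrency; this sketch only illustrates why aborts happen.

```python
# Minimal versioned-memory STM sketch (TL2-flavored, single-threaded).
# A transaction records the clock at begin, buffers writes, and validates
# its read set at commit; a location updated after the transaction began
# forces an abort.

class STM:
    def __init__(self):
        self.clock = 0
        self.mem = {}      # addr -> (value, version)

    def begin(self):
        return {"rv": self.clock, "reads": {}, "writes": {}}

    def read(self, tx, addr):
        if addr in tx["writes"]:
            return tx["writes"][addr]        # read-your-own-writes
        value, version = self.mem[addr]
        if version > tx["rv"]:               # changed after tx began
            raise RuntimeError("abort")
        tx["reads"][addr] = version
        return value

    def write(self, tx, addr, value):
        tx["writes"][addr] = value           # buffered until commit

    def commit(self, tx):
        for addr, seen in tx["reads"].items():   # validate read set
            if self.mem[addr][1] != seen:
                raise RuntimeError("abort")
        self.clock += 1
        for addr, value in tx["writes"].items():
            self.mem[addr] = (value, self.clock)

stm = STM()
stm.mem["x"] = (0, 0)

t1 = stm.begin()
_ = stm.read(t1, "x")        # t1 observes x at version 0

t2 = stm.begin()             # a conflicting writer commits first
stm.write(t2, "x", 42)
stm.commit(t2)

aborted = False
try:
    stm.commit(t1)           # t1's read of "x" is now stale
except RuntimeError:
    aborted = True
```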
Citations: 84
Loop Optimization using Hierarchical Compilation and Kernel Decomposition
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.22
Denis Barthou, S. Donadio, Patrick Carribault, Alexandre Duchateau, W. Jalby
The increasing complexity of hardware features in recent processors makes high-performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction-level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high-performance code that relies on the use of state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to optimize dense matrix multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The performance of the optimized codes on Itanium 2 and Pentium 4 architectures outperforms ATLAS and, in most cases, matches hand-tuned vendor libraries (e.g. MKL).
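The decomposition idea can be shown in miniature: a 2D dense matrix multiply expressed entirely as calls to a simple 1D kernel (an axpy, `y += alpha * x`), which is the kind of loop that can be tuned in isolation. This is a pure-Python sketch of the structure, not the paper's tuned kernels.

```python
# Kernel decomposition in miniature: build matmul from a 1-D axpy kernel.
# In the hierarchical scheme, the 1-D kernel is the unit handed to the
# compiler for aggressive tuning; the outer loops only sequence kernel calls.

def axpy(alpha, x, y):
    """1-D kernel: y[i] += alpha * x[i] for all i."""
    for i in range(len(y)):
        y[i] += alpha * x[i]

def matmul(a, b):
    """C = A*B, with all arithmetic inside axpy: C[i,:] += A[i][p] * B[p,:]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            axpy(a[i][p], b[p], c[i])
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)
```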
{"title":"Loop Optimization using Hierarchical Compilation and Kernel Decomposition","authors":"Denis Barthou, S. Donadio, Patrick Carribault, Alexandre Duchateau, W. Jalby","doi":"10.1109/CGO.2007.22","DOIUrl":"https://doi.org/10.1109/CGO.2007.22","url":null,"abstract":"The increasing complexity of hardware features in recent processors makes high performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high performance code relying on the use of state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to optimize dense matrix multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The optimized codes outperform ATLAS on Itanium 2 and Pentium 4 architectures and, in most cases, match hand-tuned vendor libraries (e.g. MKL)","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115773952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 11
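The decomposition described in the abstract — reduce a large loop nest to simple 1D/2D kernels that a compiler can optimize in isolation — can be sketched for matrix multiply. This is a hypothetical Python illustration of the structure only (the paper generates optimized C/assembly with production compilers); `T` is an assumed tile size, not a value from the paper, and the rectangular case is handled as the abstract claims.

```python
def microkernel(A, B, C, i0, j0, k0, T):
    """The simple 2-D kernel the decomposition produces: a T x T block
    update, the unit a compiler can aggressively unroll and schedule."""
    for i in range(i0, min(i0 + T, len(A))):
        for j in range(j0, min(j0 + T, len(B[0]))):
            acc = C[i][j]
            for k in range(k0, min(k0 + T, len(B))):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc

def matmul_tiled(A, B, T=4):
    """Outer driver: the original loop nest reduced to calls on simple
    kernels. Works for rectangular shapes, not only square ones."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, T):          # tile loops choose cache blocking...
        for j0 in range(0, p, T):
            for k0 in range(0, m, T):  # ...independently of the kernel body
                microkernel(A, B, C, i0, j0, k0, T)
    return C
```

The point of the split is that the cache-blocking decisions (tile loops) and the instruction-level-parallelism decisions (kernel body) can then be tuned against different, otherwise contradictory, optimization targets.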
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.7
Cheng Wang, Ho-Seop Kim, Youfeng Wu, V. Ying
As transistors become increasingly smaller and faster with tighter noise margins, modern processors are becoming increasingly susceptible to transient hardware faults. Existing hardware-based redundant multi-threading (HRMT) approaches rely mostly on special-purpose hardware to replicate the program into redundant execution threads and compare their computation results. In this paper, we present a software-based redundant multi-threading (SRMT) approach for transient fault detection. Our SRMT technique uses a compiler to automatically generate redundant threads so they can run on general-purpose chip multi-processors (CMPs). We exploit high-level program information available at compile time to optimize data communication between redundant threads. Furthermore, our software-based technique provides a flexible program execution environment in which legacy binary codes and reliability-enhanced codes can co-exist in a mix-and-match fashion, depending on the desired level of reliability and software compatibility. Our experimental results show that compiler analysis and optimization techniques can reduce the data communication requirement by up to 88% relative to HRMT. With general-purpose intra-chip communication mechanisms in a CMP machine, SRMT overhead can be as low as 19%. Moreover, the SRMT technique achieves error coverage rates of 99.98% and 99.6% for the SPEC CPU2000 integer and floating-point benchmarks, respectively. These results demonstrate the competitiveness of SRMT with HRMT approaches
{"title":"Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection","authors":"Cheng Wang, Ho-Seop Kim, Youfeng Wu, V. Ying","doi":"10.1109/CGO.2007.7","DOIUrl":"https://doi.org/10.1109/CGO.2007.7","url":null,"abstract":"As transistors become increasingly smaller and faster with tighter noise margins, modern processors are becoming increasingly susceptible to transient hardware faults. Existing hardware-based redundant multi-threading (HRMT) approaches rely mostly on special-purpose hardware to replicate the program into redundant execution threads and compare their computation results. In this paper, we present a software-based redundant multi-threading (SRMT) approach for transient fault detection. Our SRMT technique uses a compiler to automatically generate redundant threads so they can run on general-purpose chip multi-processors (CMPs). We exploit high-level program information available at compile time to optimize data communication between redundant threads. Furthermore, our software-based technique provides a flexible program execution environment in which legacy binary codes and reliability-enhanced codes can co-exist in a mix-and-match fashion, depending on the desired level of reliability and software compatibility. Our experimental results show that compiler analysis and optimization techniques can reduce the data communication requirement by up to 88% relative to HRMT. With general-purpose intra-chip communication mechanisms in a CMP machine, SRMT overhead can be as low as 19%. Moreover, the SRMT technique achieves error coverage rates of 99.98% and 99.6% for the SPEC CPU2000 integer and floating-point benchmarks, respectively. These results demonstrate the competitiveness of SRMT with HRMT approaches","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124326388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 142
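The core detection idea — execute the computation in two redundant threads and compare their results before they become externally visible — can be sketched at a coarse granularity. This is a hypothetical Python illustration only: the paper's compiler replicates execution at instruction granularity and compares values at memory writes, whereas here a whole function call is the sphere of replication, and `srmt_call` is a name invented for the sketch.

```python
import threading

def srmt_call(fn, *args):
    """Execute fn twice, in a leading and a trailing thread, then
    compare the two results before returning; a mismatch signals a
    (transient) fault."""
    results = [None, None]

    def worker(slot):
        results[slot] = fn(*args)   # one redundant execution copy

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if results[0] != results[1]:
        raise RuntimeError("transient fault detected: redundant results differ")
    return results[0]
```

For example, `srmt_call(sum, [1, 2, 3])` returns 6 only when both copies agree. The compiler analyses described in the abstract exist precisely to shrink what must flow between the two copies — here the entire argument list and result, in the paper only the values that cannot be recomputed locally.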