
International Symposium on Code Generation and Optimization, 2003. CGO 2003. Latest publications

METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191553
Jaydeep Marathe, F. Mueller, Tushar Mohan, B. Supinski, S. Mckee, A. Yoo
We present METRIC, an environment for determining memory inefficiencies by examining data traces. METRIC is designed to alter the performance behavior of applications that are mostly constrained by their latency to resolve memory references. We make four primary contributions. First, we present methods to extract partial data traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial data traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial data traces. By exploiting summarized memory metrics, by-reference metrics as well as cache evictor information, we can pin-point the sources of performance problems. Fourth, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several experiments resulting in up to 40% lower miss ratios.
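The constant-space representation of regular references that the METRIC abstract mentions can be illustrated with a toy stride-run encoder (a generic sketch, not the paper's actual compression technique): a constant-stride reference stream, such as the addresses touched by a simple array loop, collapses into a single (start, stride, count) run.

```python
def compress_trace(addresses):
    """Compress an address stream into (start, stride, count) runs.

    Regular (constant-stride) reference streams collapse into a single
    run, so the compressed form stays constant-size however long the
    loop executes. Irregular accesses start fresh runs.
    """
    runs = []
    for addr in addresses:
        if runs:
            start, stride, count = runs[-1]
            if count == 1:
                # The second element of a run fixes its stride.
                runs[-1] = (start, addr - start, 2)
                continue
            if addr == start + stride * count:
                runs[-1] = (start, stride, count + 1)
                continue
        runs.append((addr, 0, 1))  # begin a new run
    return runs
```

For example, the eight 4-byte accesses of a unit-stride loop compress to the single run `(1000, 4, 8)`.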
Citations: 40
Retargetable and reconfigurable software dynamic translation
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191531
K. Scott, Naveen Kumar, S. Velusamy, B. Childers, J. Davidson, M. Soffa
Software dynamic translation (SDT) is a technology that permits the modification of an executing program's instructions. In recent years, SDT has received increased attention, from both industry and academia, as a feasible and effective approach to solving a variety of significant problems. Despite this increased attention, the task of initiating a new project in software dynamic translation remains a difficult one. To address this concern, and in particular, to promote the adoption of SDT technology into an even wider range of applications, we have implemented Strata, a cross-platform infrastructure for building software dynamic translators. This paper describes Strata's architecture, our experience retargeting it to three different processors, and our use of Strata to build two novel SDT systems - one for safe execution of untrusted binaries and one for fast prototyping of architectural simulators.
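The translate-then-cache loop at the heart of any SDT system like Strata can be sketched in miniature. This hypothetical toy interpreter (not Strata's implementation) translates a straight-line fragment the first time its entry point is reached, caches it, and reuses the cached fragment on later visits:

```python
fragment_cache = {}  # entry pc -> translated fragment

def translate(pc, program):
    """'Translate' a straight-line fragment starting at pc: here we just
    capture ops up to and including the next branch."""
    frag = []
    while pc < len(program):
        op = program[pc]
        frag.append(op)
        pc += 1
        if op[0] == "jmp":
            break
    return frag

def run(program):
    """Execute via the fragment cache, translating on first touch."""
    pc, acc = 0, 0
    while pc < len(program):
        if pc not in fragment_cache:
            fragment_cache[pc] = translate(pc, program)
        for op in fragment_cache[pc]:
            if op[0] == "add":
                acc += op[1]
                pc += 1
            elif op[0] == "jmp":
                pc = op[1]
    return acc
```

A real SDT system does this at the machine-instruction level and rewrites branches to jump between cached fragments; the shape of the loop is the same.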
Citations: 227
Jumbo: run-time code generation for Java and its applications
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191532
Samuel N. Kamin, L. Clausen, Ava Jarvis
Run-time code generation is a well-known technique for improving the efficiency of programs by exploiting dynamic information. Unfortunately, the difficulty of constructing run-time code-generators has hampered their widespread use. We describe Jumbo, a tool for easily creating run-time code generators for Java. Jumbo is a compiler for a two-level version of Java, where programs can contain quoted code fragments. The Jumbo API allows the code fragments to be combined at run-time and then executed. We illustrate Jumbo with several examples that show significant speedups over similar code written in plain Java, and argue further that Jumbo is a generalized software component system.
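Python's built-in `compile`/`exec` make it easy to mimic what Jumbo provides for Java: assembling code from fragments at run time to exploit dynamically known values. The sketch below is only an analogy, not the Jumbo API; it specializes a power function once the exponent is known.

```python
def specialize_power(n):
    """Generate a power-by-n function at run time. The exponent, known
    only dynamically, is baked into the generated code, so the returned
    function contains no loop and no reference to n."""
    body = " * ".join(["x"] * n) or "1"
    src = f"def power(x):\n    return {body}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["power"]
```

Calling `specialize_power(3)` yields a function equivalent to `lambda x: x * x * x`; Jumbo's quoted code fragments play the role of the string template here, with compilation happening inside the JVM.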
Citations: 40
Dynamic profiling and trace cache generation
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191552
Marc Berndl, L. Hendren
Dynamic program optimization is increasingly important for achieving good runtime performance. A key issue is how to select which code to optimize. One approach is to dynamically detect traces, long sequences of instructions spanning multiple methods, which are likely to execute to completion. Traces are easy to optimize and have been shown to be a good unit for optimization. The paper reports on a new approach for dynamically detecting, creating and storing traces in a Java virtual machine. We first describe four important criteria for a successful trace strategy: good instruction stream coverage, low dispatch rate, cache stability, and optimizability of traces. We then present our approach based on branch correlation graphs. A branch correlation graph stores information about the correlation between pairs of branches, as well as additional state information. We present the complete design for an efficient implementation of the system, including a detailed discussion of the trace cache and profiling mechanisms. We have implemented an experimental framework to measure the traces generated by our approach in a direct-threaded Java VM (SableVM) and we present experimental results to show that the traces we generate meet the design criteria.
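A simplified view of profile-guided trace growth (a generic sketch, not the paper's branch-correlation-graph mechanism): from an observed block stream, grow a trace by repeatedly following each block's most frequent successor, stopping once the successor is no longer likely enough or the trace would form a cycle. The threshold and length cap below are illustrative parameters.

```python
from collections import Counter, defaultdict

def build_trace(block_stream, head, min_prob=0.7, max_len=8):
    """Grow a trace from `head` along hot successor edges."""
    # Successor frequency profile: block -> Counter of next blocks.
    succ = defaultdict(Counter)
    for a, b in zip(block_stream, block_stream[1:]):
        succ[a][b] += 1

    trace = [head]
    while len(trace) < max_len:
        counts = succ[trace[-1]]
        if not counts:
            break
        nxt, n = counts.most_common(1)[0]
        # Stop when the hot successor is not hot enough, or would cycle.
        if n / sum(counts.values()) < min_prob or nxt in trace:
            break
        trace.append(nxt)
    return trace
```

With stream A B C A B C A B D, the edge B->C occurs two times out of three, so a 0.6 threshold admits it while the default 0.7 threshold ends the trace at B.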
Citations: 19
Reality-based optimization
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191533
S. McFarling
Profile-based optimization has been studied extensively. Numerous papers and real systems have shown substantial improvements. However, most of these papers have been limited to either branch prediction or instruction cache performance. Also, most of these papers have looked at small applications with a limited number of testing and training scenarios. In this paper, we look at real use of large real-world desktop applications. We also assume memory consumption and disk performance are the primary metrics of interest. For this domain, we show that it is very difficult to get adequate coverage of large applications even with an extensive collection of training scenarios. We propose instead to augment traditional scenarios with data derived from real use. We show that this methodology allows us to reduce memory pressure by 29% and disk reads by 33% compared to traditional approaches.
Citations: 13
Hiding program slices for software security
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191556
X. Zhang, Rajiv Gupta
Given the high cost of producing software, development of technology for prevention of software piracy is important for the software industry. In this paper we present a novel approach for preventing the creation of unauthorized copies of software. Our approach splits software modules into open and hidden components. The open components are installed (executed) on an insecure machine while the hidden components are installed (executed) on a secure machine. We assume that while open components can be stolen, to obtain a fully functioning copy of the software, the hidden components must be recovered. We describe an algorithm that constructs hidden components by slicing the original software components. We argue that recovery of hidden components constructed through slicing, in order to obtain a fully functioning copy of the software, is a complex task. We further develop security analysis to capture the complexity of recovering hidden components. Finally we apply our technique to several large Java programs to study the complexity of recovering constructed hidden components and to measure the runtime overhead introduced by splitting of software into open and hidden components.
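The slicing step underlying the hidden-component construction can be illustrated with a minimal backward-slice routine over a statement-level dependence map (a generic sketch of program slicing, not the authors' algorithm): the slice of a statement is everything it transitively depends on.

```python
def backward_slice(deps, target):
    """Compute the backward slice of `target`.

    `deps` maps each statement to the statements whose values it uses;
    the slice is the transitive closure of that relation, including
    `target` itself.
    """
    slice_, work = set(), [target]
    while work:
        s = work.pop()
        if s not in slice_:
            slice_.add(s)
            work.extend(deps.get(s, []))
    return slice_
```

In the splitting scheme described above, a slice like this would be the candidate for relocation to the secure machine, leaving the open component without the statements needed to reconstruct it.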
Citations: 46
Code optimization for code compression
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191555
M. Drinic, D. Kirovski, Hoi Vo
With the emergence of software delivery platforms such as Microsoft's .NET, the reduced size of transmitted binaries has become a very important system parameter, strongly affecting system performance. We present two novel pre-processing steps for code compression that explore program binaries' syntax and semantics to achieve superior compression ratios. The first preprocessing step involves heuristic partitioning of a program binary into streams with high auto-correlation. The second preprocessing step uses code optimization via instruction rescheduling in order to improve prediction probabilities for a given compression engine. We have developed three heuristics for instruction rescheduling that explore tradeoffs of the solution quality versus algorithm run-time. The pre-processing steps are integrated with the generic paradigm of prediction by partial matching (PPM) which is the basis of our compression codec. The compression algorithm is implemented for x86 binaries and tested on several large Microsoft applications. Binaries compressed using our compression codec are 18-24% smaller than those compressed using the best available off-the-shelf compressor.
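The intuition behind prediction by partial matching, and why rescheduling instructions to make streams more predictable helps, can be shown with a toy order-1 context model that measures how often its top prediction is correct (a stand-in for a real PPM coder, not the paper's codec): the higher the hit rate, the fewer bits the entropy coder needs per symbol.

```python
from collections import Counter, defaultdict

def order1_hit_rate(data):
    """Fraction of symbols correctly guessed by an adaptive order-1
    context model (predict the most frequent follower of the previous
    symbol). Only positions with a usable context are counted."""
    model = defaultdict(Counter)
    hits = total = 0
    prev = None
    for sym in data:
        if prev is not None:
            if model[prev]:
                total += 1
                if model[prev].most_common(1)[0][0] == sym:
                    hits += 1
            model[prev][sym] += 1  # update after predicting
        prev = sym
    return hits / total if total else 0.0
```

A perfectly regular stream like "abababab" is predicted with a 1.0 hit rate once the model has seen each context; an instruction scheduler that groups similar instructions pushes real binaries toward this regime.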
Citations: 11
Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191545
E. Gibert, F. Sánchez, Antonio González
Clustering is a common technique to deal with wire delays. Fully-distributed architectures, where the register file, the functional units and the cache memory are partitioned, are particularly effective to deal with these constraints and besides they are very scalable. However the distribution of the data cache introduces a new problem: memory instructions may reach the cache in an order different to the sequential program order, thus possibly violating its contents. In this paper two local scheduling mechanisms that guarantee the serialization of aliased memory instructions are proposed and evaluated: the construction of memory dependent chains (MDC solution), and two transformations (store replication and load-store synchronization) applied to the original data dependence graph (DDGT solution). These solutions do not require any extra hardware. The proposed scheduling techniques are evaluated for a word-interleaved cache clustered VLIW processor (although these techniques can also be used for any other distributed cache configuration). Results for the Mediabench benchmark suite demonstrate the effectiveness of such techniques. In particular, the DDGT solution increases the proportion of local accesses by 16% compared to MDC, and stall time is reduced by 32% since load instructions can be freely scheduled in any cluster. However, the MDC solution reduces compute time and it often outperforms the former. Finally, the impact of both techniques on an architecture with attraction buffers is studied and evaluated.
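The core of the MDC idea, keeping possibly aliasing memory instructions serialized, can be sketched as a pass that emits ordering edges between every pair of memory operations where at least one is a store and the references may alias (a hypothetical helper, not the paper's scheduler; the `may_alias` predicate stands in for real alias analysis):

```python
def memory_dependence_edges(ops, may_alias):
    """Return ordering edges (i, j) that a scheduler must respect.

    ops: list of (kind, ref) with kind in {"load", "store"}.
    An edge is needed when two ops may touch the same location and at
    least one writes it; load/load pairs commute freely.
    """
    edges = []
    for i, (kind_i, ref_i) in enumerate(ops):
        for j in range(i + 1, len(ops)):
            kind_j, ref_j = ops[j]
            if "store" in (kind_i, kind_j) and may_alias(ref_i, ref_j):
                edges.append((i, j))
    return edges
```

Chaining these edges into the data dependence graph forces aliased accesses to reach the distributed cache in program order, which is exactly the coherence guarantee the abstract describes.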
Citations: 9
Optimal and efficient speculation-based partial redundancy elimination
Pub Date: 2003-03-23 DOI: 10.1109/CGO.2003.1191536
Qiong Cai, Jingling Xue
Existing profile-guided partial redundancy elimination (PRE) methods use speculation to enable the removal of partial redundancies along more frequently executed paths at the expense of introducing additional expression evaluations along less frequently executed paths. While being capable of minimizing the number of expression evaluations in some cases, they are, in general, not computationally optimal in achieving this objective. In addition, the experimental results for their effectiveness are mostly missing. This work addresses the following three problems: (1) Is the computational optimality of speculative PRE solvable in polynomial time? (2) Is edge profiling - less costly than path profiling - sufficient to guarantee the computational optimality? (3) Is the optimal algorithm (if one exists) lightweight enough to be used efficiently in a dynamic compiler? In this paper, we provide positive answers to the first two problems and promising results to the third. We present an algorithm that analyzes edge insertion points based on an edge profile. Our algorithm guarantees optimally that the total number of computations for an expression in the transformed code is always minimized with respect to the edge profile given. This implies that edge profiling, which is less costly than path profiling, is sufficient to guarantee this optimality. The key in the development of our algorithm lies in the removal of some non-essential edges (and consequently, all resulting non-essential nodes) from a flow graph so that the problem of finding an optimal code motion is reduced to one of finding a minimal cut in the reduced (flow) graph thus obtained. We have implemented our algorithm in Intel's Open Runtime Platform (ORP). Our preliminary results over a number of Java benchmarks show that our algorithm is lightweight and can be potentially a practical component in a dynamic compiler. As a result, our algorithm can also be profitably employed in a profile-guided static compiler in which compilation cost can often be sacrificed for code efficiency.
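The reduction to a minimal cut can be made concrete with a small Edmonds-Karp max-flow routine (by the max-flow min-cut theorem, the returned flow value equals the minimum cut). The graph, node names and capacities below are purely illustrative, not taken from the paper; in speculative PRE the edge capacities would come from the edge profile, so the cheapest cut picks the insertion points with the fewest dynamic evaluations.

```python
from collections import deque

def min_cut_value(graph, s, t):
    """Edmonds-Karp max-flow; equals the min s-t cut capacity.

    graph: {u: {v: capacity}} with non-negative integer capacities.
    """
    # Residual capacities, including zero-capacity reverse edges.
    cap = {u: dict(vs) for u, vs in graph.items()}
    for u, vs in graph.items():
        for v in vs:
            cap.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # BFS for a shortest augmenting path with residual capacity.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Collect the path, push the bottleneck along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck
```

On a tiny example graph `{"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}}` the minimum s-t cut has capacity 4 (edges a->t and s->b).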
引用次数: 41
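The key step described in this abstract, reducing optimal code motion to a minimal cut in a reduced flow graph whose edge weights come from an edge profile, can be illustrated with a toy max-flow/min-cut computation. This is only a sketch of the underlying graph problem, not the paper's algorithm (which first strips non-essential edges and nodes from the real control-flow graph); the node names and edge frequencies below are hypothetical.

```python
from collections import defaultdict, deque

def max_flow_min_cut(edges, s, t):
    """Edmonds-Karp max flow; returns (flow_value, min_cut_edges).
    edges: {(u, v): capacity}. Capacities here are edge-profile
    frequencies, so the min cut is the cheapest set of edges on which
    to insert an evaluation of the expression."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in edges.items():
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # residual back-edge
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[e] for e in path)  # bottleneck frequency
        for u, v in path:
            cap[(u, v)] -= aug
            cap[(v, u)] += aug
        flow += aug
    # cut = original edges leaving the source side of the residual graph
    reach, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reach and cap[(u, v)] > 0:
                reach.add(v)
                q.append(v)
    cut = [e for e in edges if e[0] in reach and e[1] not in reach]
    return flow, cut

# Hypothetical reduced flow graph: 'entry' reaches a redundant use of
# the expression only through node 'p' and two cold successors.
profile = {('entry', 'p'): 15, ('p', 'cold1'): 5, ('p', 'cold2'): 10,
           ('cold1', 'use'): 5, ('cold2', 'use'): 10}
value, cut = max_flow_min_cut(profile, 'entry', 'use')
# value == 15; cut == [('entry', 'p')]: one insertion point suffices
```

The min-cut edges are the cheapest set of insertion points covering every path from `entry` to the redundant use, so the number of dynamic evaluations introduced equals the cut value.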
Optimizing memory accesses for spatial computation 优化空间计算的内存访问
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191547
M. Budiu, S. Goldstein
We present the internal representation and optimizations used by the CASH compiler for improving the memory parallelism of pointer-based programs. CASH uses an SSA-based representation for memory, which compactly summarizes both control-flow and dependence information. In CASH, memory optimization is a four-step process: (1) first, an initial, relatively coarse representation of memory dependences is built; (2) next, unnecessary memory dependences are removed using dependence tests; (3) third, redundant memory operations are removed; (4) finally, parallelism is increased by pipelining memory accesses in loops. While the first three steps above are very general, the loop pipelining transformations are particularly applicable to spatial computation, which is the primary target of CASH. The redundant memory removal optimizations presented are: load/store hoisting (subsuming partial redundancy elimination and common-subexpression elimination), load-after-store removal, store-before-store removal (dead store removal), and loop-invariant load motion. One of our loop pipelining transformations is a new form of loop parallelization, called loop decoupling. This transformation separates independent memory accesses within a loop body into several independent loops, which are allowed to slip dynamically with respect to each other. A new computational primitive, a token generator, is used to dynamically control the amount of slip, allowing maximum freedom while guaranteeing that no memory dependences are violated.
我们给出了CASH编译器用于改进基于指针的程序的内存并行性的内部表示和优化。CASH使用基于ssa的内存表示,它简洁地总结了控制流和依赖信息。在CASH中,内存优化是一个四步过程:(1)首先建立一个初始的、相对粗糙的内存依赖表示;(2)接下来,使用依赖性测试去除不必要的内存依赖性;(3)第三,删除冗余的内存操作(4)最后,通过循环管道内存访问来增加并行性。虽然上面的前三个步骤非常通用,但是循环流水线转换特别适用于空间计算,这是CASH的主要目标。提出的冗余内存删除优化包括:负载/存储提升(包括部分冗余消除和公共子表达式消除),存储后负载移除,存储前存储移除(死存储移除)和循环不变负载运动。我们的循环流水线转换之一是循环并行化的一种新形式,称为循环解耦。这种转换将循环体中的独立内存访问分离为几个独立的循环,这些循环允许动态地相对于彼此滑动。一个新的计算原语,一个令牌生成器用于动态控制滑动的数量,允许最大的自由度,同时保证不违反内存依赖。
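Two of the removals listed in the abstract, load-after-store removal and store-before-store (dead store) removal, can be sketched as a forward and a backward pass over a straight-line block. This is a deliberately simplified model using symbolic, non-aliasing address names, not CASH's SSA-based implementation (which relies on dependence tests to disambiguate pointers); the instruction encoding is invented for illustration.

```python
def remove_redundant_memops(block):
    """block: list of ('load' | 'store', addr) over symbolic addresses
    that are assumed never to alias (a strong simplification)."""
    # Forward pass: load-after-store (and load-after-load) removal.
    # After a store or load of addr, its value sits in a register, so
    # a later load of addr can reuse that value and be dropped.
    out, available = [], set()
    for op, addr in block:
        if op == 'load' and addr in available:
            continue  # value forwarded from the earlier access
        available.add(addr)
        out.append((op, addr))
    # Backward pass: store-before-store (dead store) removal.
    # A store is dead if a later store to the same address follows it
    # with no load of that address in between.
    kept, overwritten = [], set()
    for op, addr in reversed(out):
        if op == 'store':
            if addr in overwritten:
                continue  # killed by a later store
            overwritten.add(addr)
        else:
            overwritten.discard(addr)  # a load keeps earlier stores live
        kept.append((op, addr))
    kept.reverse()
    return kept

block = [('store', 'x'), ('load', 'x'), ('store', 'y'),
         ('store', 'x'), ('load', 'y')]
optimized = remove_redundant_memops(block)
# → [('store', 'y'), ('store', 'x')]: both loads are forwarded, and the
# first store to x is dead once nothing reads it before the re-store
```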
引用次数: 29
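Loop decoupling as described in the abstract can be approximated sequentially: partition the loop body into statement groups that touch disjoint memory, and give each group its own loop. The dynamic slip between loops and the token generator that bounds it are not modeled here; the statement encoding below is invented for illustration.

```python
def decouple(stmts):
    """Group loop-body statements whose read/write sets overlap; groups
    that share no arrays are independent and become separate loops.
    stmts: list of dicts with 'reads'/'writes' (sets of array names)
    and 'run' (the statement body, a function of the iteration index)."""
    groups = []
    for s in stmts:
        touched = s['reads'] | s['writes']
        keep, merged = [], []
        for g in groups:
            if any((t['reads'] | t['writes']) & touched for t in g):
                merged.extend(g)  # conservative dependence: same group
            else:
                keep.append(g)
        keep.append(merged + [s])
        groups = keep
    return groups

def run_decoupled(stmts, n):
    # Each independent group becomes its own n-iteration loop; because
    # the groups share no memory, the loops could run concurrently and
    # slip relative to one another (here they simply run in sequence).
    for group in decouple(stmts):
        for i in range(n):
            for s in group:
                s['run'](i)

a, b, c = [0] * 4, [1, 2, 3, 4], [10, 20, 30, 40]
stmts = [
    {'reads': {'b'}, 'writes': {'a'},
     'run': lambda i: a.__setitem__(i, b[i] * 2)},   # a[i] = b[i] * 2
    {'reads': {'c'}, 'writes': {'c'},
     'run': lambda i: c.__setitem__(i, c[i] + 1)},   # c[i] = c[i] + 1
]
run_decoupled(stmts, 4)
# a == [2, 4, 6, 8]; c == [11, 21, 31, 41], same as the fused loop
```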