
2001 Innovative Architecture for Future Generation High-Performance Processors and Systems: Latest Publications

Power efficient instruction cache for wide-issue processors
A.-M. Badulescu, A. Veidenbaum
The paper focuses on reducing power in the instruction cache by eliminating the fetching of instructions that are not needed from a cache line. We propose a mechanism that predicts which instructions out of a cache line are going to be used, before that line is fetched into the instruction buffer. The average instruction cache power saving obtained by using our fetch predictor is 22% for the SPEC95 benchmark suite.
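To make the idea concrete, here is a minimal sketch of one way such a fetch predictor could work; it is not the authors' actual design, and the table size, line width, and names are illustrative assumptions. A usage bitmask recorded the last time a line was fetched gates which instruction slots are driven out on the next visit.

```c
/* Minimal sketch (not the paper's design): a last-visit usage bitmask
 * per cache line decides which instruction slots are fetched next time.
 * TABLE_SIZE and LINE_INSTRS are assumed, illustrative parameters. */
#include <stdint.h>

#define LINE_INSTRS 8      /* instructions per cache line (assumed) */
#define TABLE_SIZE  1024   /* prediction-table entries (assumed)    */

static uint8_t used_mask[TABLE_SIZE];  /* bit i set => slot i used last visit */

static unsigned index_of(uint32_t line_addr) {
    return line_addr % TABLE_SIZE;     /* simple direct-mapped index */
}

/* Before fetching a line: predict which slots to read out.
 * An empty mask (first visit) falls back to fetching everything. */
uint8_t predict_fetch_mask(uint32_t line_addr) {
    uint8_t m = used_mask[index_of(line_addr)];
    return m ? m : (uint8_t)((1u << LINE_INSTRS) - 1);
}

/* After the line is consumed: remember which slots were actually used,
 * so the next visit reads out only those. */
void update_fetch_mask(uint32_t line_addr, uint8_t actually_used) {
    used_mask[index_of(line_addr)] = actually_used;
}
```

A mispredicted slot would simply trigger a normal fetch of the missing instructions, trading a small delay for the power saved on correctly skipped bytes.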
Citations: 5
An efficient algorithm for pointer-to-array access conversion for compiling and optimizing DSP applications
Robert A. van Engelen, K. Gallivan
The complexity of Digital Signal Processing (DSP) applications has been steadily increasing due to advances in hardware design for embedded processors. To meet critical power consumption and timing constraints, many DSP applications are hand-coded in assembly. Because the cost of hand-coding is becoming prohibitive for developing an embedded system, there is a trend toward the use of high-level programming languages, particularly C, and the use of optimizing compilers for software development. Consequently, more than ever there is a need for compilers to optimize DSP applications to make effective use of the available hardware resources. Existing DSP codes are often riddled with pointer-based data accesses, because DSP programmers have the mistaken belief that a compiler will always generate better target code from them. The use of extensive pointer arithmetic makes analysis and optimization difficult for compilers for modern DSPs with regular architectures and large homogeneous register sets. In this paper, we present a novel algorithm for converting pointer-based code to code with explicit array accesses. The conversion enables a compiler to perform data-flow analysis and loop optimizations on DSP codes.
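As a hedged before/after illustration of the kind of rewrite such a conversion performs (the functions below are invented for illustration, not taken from the paper):

```c
/* Before: pointer-based DSP-style loop. The induction on p and q is
 * implicit, which obstructs data-flow analysis and loop optimization. */
void scale_ptr(short *x, short *y, int n, short k) {
    short *p = x, *q = y;
    while (n-- > 0)
        *q++ = (short)(*p++ * k);
}

/* After pointer-to-array access conversion: explicit, affine array
 * subscripts that dependence analysis and loop optimizers can handle. */
void scale_arr(short x[], short y[], int n, short k) {
    for (int i = 0; i < n; i++)
        y[i] = (short)(x[i] * k);
}
```

Both versions compute the same result; the second exposes the access pattern to the compiler, which is the point of the conversion.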
Citations: 28
Present status of development of the Earth Simulator
M. Yokokawa
The Earth Simulator is an ultra-high-speed supercomputer. Research and development of the Earth Simulator began in 1997 as one of the approaches in the Earth Simulator project, which aims to promote research and development for understanding and predicting global environmental change. The Earth Simulator is a distributed-memory parallel system consisting of 640 processor nodes connected by a single-stage full crossbar switch. Each processor node is a shared-memory system composed of eight vector processors. The total peak performance and main memory capacity are 40 Tflop/s and 10 TB, respectively. In this paper, the concept of the Earth Simulator and an outline of the Earth Simulator system are described.
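A quick sanity check on the quoted figures, assuming peak performance and memory are split evenly across the machine:

```latex
% 640 nodes x 8 vector processors = 5120 processors, so:
\[
\frac{40\ \mathrm{Tflop/s}}{640 \times 8} \approx 8\ \mathrm{Gflop/s\ per\ vector\ processor},
\qquad
\frac{10\ \mathrm{TB}}{640\ \mathrm{nodes}} \approx 16\ \mathrm{GB\ per\ node}.
\]
```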
Citations: 17
Characteristics of loop unrolling effect: software pipelining and memory latency hiding
S. Hiroyuki, Y. Teruhiko
Recently, loop unrolling has been seen in a new light from the superscalar architectural point of view. In this paper, we show that in addition to the superscalar effect and the scalar-replacement effect, loop unrolling can hide memory latency, and that the combination of these effects improves the performance of loop unrolling. A major contribution of this paper is that the analysis is done symbolically and quantitatively. Although these have been known as major factors affecting the performance of loop unrolling, no quantitative approach has been tried. Our analysis clarifies the behaviour of the superscalar effect and of memory latency hiding in loop unrolling.
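For readers unfamiliar with the effects being analyzed, the C sketch below shows a plain unroll-by-4 with split accumulators. The four independent loads per iteration can be in flight at once (memory latency hiding), and the four accumulators expose instruction-level parallelism to a superscalar scheduler. This illustrates the effects themselves, not the paper's symbolic model of them.

```c
/* Dot product, unrolled by 4 with separate accumulators. */
double dot(const double *a, const double *b, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        /* Four independent load pairs: their latencies overlap. */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        s0 += a[i] * b[i];
    return s0 + s1 + s2 + s3;
}
```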
Citations: 6
Wrapped system call in communication and execution fusion OS: CEFOS
Hiroshi Nakayama, Takuya Tanabayashi, Makoto Amamiya
This paper proposes an operating system, CEFOS (Communication and Execution Fusion OS), which fuses inter-processor communication and intra-processor computation. In CEFOS, the fusion of communication and internal execution is achieved both in execution and in the function interfaces. CEFOS is based on a fine-grain multithreading approach. In fine-grain thread control, one of the major problems is how to reduce the frequency of context switching and communication between user processes and the CEFOS kernel. The key to resolving this problem is to design an environment that supports efficient cooperation between user processes and the CEFOS kernel. We propose a Display Request and Data (DRD) function and a Wrapped System Call (WSC) mechanism built on it, which together provide efficient cooperation between user processes and the CEFOS kernel. DRD and WSC reduce the number of invocations from user-process threads to the OS kernel and achieve high-speed thread switching.
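The abstract does not spell out the WSC interface, so the sketch below only gestures at the general batching idea behind wrapping system calls; every name in it is hypothetical and the real CEFOS mechanism differs. Requests accumulate in a user-level queue, and a single kernel crossing submits many of them at once, amortizing one trap over many logical calls.

```c
/* Hypothetical batching sketch, NOT the CEFOS interface. */
#include <stddef.h>

enum { MAX_BATCH = 32 };

struct wsc_req { int op; long arg0, arg1; };

static struct wsc_req batch[MAX_BATCH];
static size_t batch_len;

/* The one real user-to-kernel trap (hypothetical entry point). */
extern long kernel_submit(struct wsc_req *reqs, size_t n);

/* Record a logical system call without trapping into the kernel. */
void wsc_call(int op, long arg0, long arg1) {
    batch[batch_len].op   = op;
    batch[batch_len].arg0 = arg0;
    batch[batch_len].arg1 = arg1;
    if (++batch_len == MAX_BATCH) {   /* queue full: flush now */
        kernel_submit(batch, batch_len);
        batch_len = 0;
    }
}

/* Flush pending requests, e.g. at a thread switch. */
void wsc_flush(void) {
    if (batch_len) {
        kernel_submit(batch, batch_len);
        batch_len = 0;
    }
}
```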
Citations: 0
Cache-In-Memory
J. T. Zawodny, P. Kogge
The new technology of Processing-In-Memory now allows relatively large DRAM memory macros to be placed on the same die as processing logic. Despite the high bandwidth and low latency possible with such macros, more of both is always better. Classical techniques such as caching are typically used for such performance gains, but at the cost of high power. The paper summarizes some recent work on the potential of utilizing structures within such memory macros as cache substitutes, and on the conditions under which power savings may result.
Citations: 6
An architecture of on-chip-memory multi-threading processor
T. Matsuzaki, H. Tomiyasu, M. Amamiya
This paper proposes an on-chip-memory processor architecture: FUCE. FUCE stands for Fusion of Communication and Execution. The goal of the FUCE processor project is to fuse intra-processor execution and inter-processor communication. To achieve this goal, the FUCE processor integrates the processor units, memory units and communication units into a single chip. The FUCE processor provides a next-generation memory system architecture. In this architecture, no data cache memory is required, since memory access latency can be hidden by the simultaneous multithreading mechanism and by the on-chip-memory system with the FUCE processor's broad-bandwidth, low-latency internal bus. This approach can reduce the performance gap between instruction execution and memory and network accesses.
Citations: 2
An approach towards an analytical characterization of locality and its portability
G. Bilardi, E. Peserico
The evolution of computing technology towards the ultimate physical limits makes communication the dominant cost of computing. It would then be desirable to have a framework for the study of locality, which we define as the property of an algorithm that enables implementations with reduced communication overheads. We discuss the issue of useful characterizations of the locality of an algorithm with reference to both single machines and classes of machines. We then consider the question of portability of locality. We illustrate the proposed approach with its application to the study of temporal locality, the property of an algorithm that enables efficient implementations on machines where memory accesses have a variable latency, depending on the location being accessed. We discuss how, for a fixed operation schedule, temporal locality can be characterized for interesting classes of uniform hierarchical machines by a set of metrics, the width lengths of the schedule. Moreover, a portable memory management of any schedule can be obtained for such classes of machines. The situation becomes more complex when the schedule is a degree of freedom of the implementation. Then, while some computations do admit a single schedule, optimal across many machines, this is not always the case. Thus, in general, only the less stringent notion of portability based on parametrized schedules can be pursued. Correspondingly, a concise characterization of temporal locality becomes harder to achieve and still remains an open problem.
Citations: 6
Pipelined memory hierarchies: scalable organizations and application performance
G. Bilardi, K. Ekanadham, P. Pattnaik
The time to perform a random access to main memory has been increasing for decades relative to processor speed and is currently of the order of a few hundred cycles. To alleviate this problem, one resorts to memory organizations that are hierarchical to exploit locality of the computation, and pipelinable to exploit parallelism. The goal of the study is to begin a systematic exploration of the performance advantages of such memories, achieving scalability even when the underlying principles are pushed to the limit permitted by physical laws. First, we propose memory organizations with the ability to accept requests at a constant rate without significantly affecting the latency of individual requests, which is within a constant factor of the minimum value achievable under fundamental physical constraints. Second, we discuss how the pipeline capability can be effectively exploited by memory management techniques in order to reduce execution time for applications. We conclude by outlining the issues that require further work in order to pursue systematically the potential of pipelined hierarchical memories.
Citations: 2
Power reduction in superscalar datapaths through dynamic bit-slice activation
Dmitry Ponomarev, Gurhan Kucuk, K. Ghose
We show, by simulating the execution of the SPEC 95 benchmarks on a true hardware-level, cycle-by-cycle simulator for a superscalar CPU, that about half of the bytes of operands flowing on the datapath, particularly the leading bytes, are all zeros. Furthermore, a significant number of the bits within the non-zero part of the data flowing on the various paths within the processor do not change from their prior values. We show how these two facts, attesting to the lack of a high level of entropy in the data streams, can be exploited to reduce power dissipation within all explicit and implicit storage components of a typical superscalar datapath, such as register files, dispatch buffers and reorder buffers, as well as interconnections such as buses and direct links. Our simulation results and SPICE measurements from representative VLSI layouts show power savings of about 25% on average over all SPEC 95 benchmarks.
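As an illustration of how the zero-byte observation could be measured (this is only the shape of the instrumentation; the authors used a cycle-level simulator), the following counts the leading zero bytes of 32-bit operands:

```c
/* Count how many operand bytes, scanning from the most significant
 * end, are zero, and report the fraction over a trace of operands. */
#include <stdint.h>

/* Number of leading zero bytes in a 32-bit operand (0..4). */
int leading_zero_bytes(uint32_t v) {
    int n = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        if ((v >> shift) & 0xFF)
            break;
        n++;
    }
    return n;
}

/* Fraction of all operand bytes that were leading zeros. */
double zero_byte_fraction(const uint32_t *ops, int count) {
    long zeros = 0;
    for (int i = 0; i < count; i++)
        zeros += leading_zero_bytes(ops[i]);
    return count ? (double)zeros / (4.0 * count) : 0.0;
}
```

A result near 0.5 over a benchmark trace would match the paper's "about half of the bytes" observation, and the per-byte masks suggest where bit-slice deactivation can save power.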
Citations: 5