
Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001: Latest Publications

Adaptive interfacing with reconfigurable computers
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903347
N. Bergmann, Anwar S. Dawood
A reconfigurable computer consists of reconfigurable logic circuits added to a conventional processor, giving a computer in which both the hardware and the software can be programmed on an application-by-application basis. Despite significant research, reconfigurable computers have failed to gain widespread acceptance as a high-speed computing replacement for conventional supercomputers. This paper describes the reasons for this failure and argues that the domain of real-time, reactive computer systems provides a better potential application area. An experimental Adaptive Instrument Module, based on reconfigurable reactive computing technology, will be flown on the FedSat low earth orbit satellite to test these ideas.
Citations: 2
DStride: data-cache miss-address-based stride prefetching scheme for multimedia processors
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903360
G. Hariprakash, R. Achutharaman, A. Omondi
Prefetching reduces cache miss latency by moving data up the memory hierarchy before it is actually needed. Recent hardware-based stride prefetching techniques mostly rely on processor pipeline information (e.g. the program counter and branch prediction table) for prediction. Continuing developments in processor microarchitecture drastically change core pipeline design and require that existing hardware-based stride prefetching techniques be adapted to the evolving new processor architectures. In this paper we present a new hardware-based stride prefetching technique, called DStride, that is independent of processor pipeline design changes. In this new design, the first-level data-cache miss address stream is used for stride prediction. The miss addresses are separated into a load stream and a store stream to increase the efficiency of the predictor. They are checked separately against the recent miss address stream to detect strides. The detected steady strides are maintained in a table that also performs look-ahead stride prefetching when the processor stride reference rate is higher than the prefetch request service rate. We evaluated our design with multimedia workloads using execution-driven simulation with the SimpleScalar toolset. Our experiments show that DStride is very effective in reducing overall pipeline stalls due to cache miss latency, especially for stride-intensive applications such as multimedia workloads.
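To make the miss-address-driven flavour of this scheme concrete, the following Python sketch tracks one stride entry per miss stream, confirms a stride once it repeats, and then emits a look-ahead prefetch address. It is only an illustration under simplifying assumptions (a single-entry table per stream, a confirmation threshold of two, the class and method names), not the DStride hardware described in the paper.

```python
class StrideEntry:
    def __init__(self):
        self.last_addr = None   # most recent miss address seen in this stream
        self.stride = 0         # last observed stride between misses
        self.confirmed = 0      # how many times that stride has repeated

class StridePredictor:
    def __init__(self, confirm_threshold=2):
        # one entry per miss stream; a real table would be indexed more finely
        self.streams = {"load": StrideEntry(), "store": StrideEntry()}
        self.confirm_threshold = confirm_threshold

    def on_miss(self, kind, addr):
        """Record an L1 miss address; return a prefetch address once the
        stride looks steady, otherwise None."""
        e = self.streams[kind]
        if e.last_addr is None:
            e.last_addr = addr
            return None
        stride = addr - e.last_addr
        e.last_addr = addr
        if stride != 0 and stride == e.stride:
            e.confirmed += 1
        else:
            e.stride, e.confirmed = stride, 1
        if e.confirmed >= self.confirm_threshold:
            return addr + e.stride          # look-ahead: one stride ahead
        return None

# feed miss addresses from a loop that misses every 64 bytes
pred = StridePredictor()
for a in range(0x1000, 0x1200, 64):
    pf = pred.on_miss("load", a)
    if pf is not None:
        print(f"prefetch 0x{pf:x}")
```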
Citations: 6
Fault-tolerant routing on Complete Josephus Cubes
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903366
P. Loh, W. Hsu
This paper introduces the Complete Josephus Cube, a fault-tolerant class of the recently proposed Josephus Cube, and proposes a cost-effective, fault-tolerant routing strategy for it. For a Complete Josephus Cube of order r, the routing algorithm can tolerate up to (r+1) encountered component faults in a message's path and generates routes that are both deadlock-free and livelock-free. The message is guaranteed to be optimally (respectively, sub-optimally) delivered within a maximum of r (respectively, 2r+1) hops. The message overhead incurred is only a single (r+2)-bit routing vector accompanying the message to be communicated.
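The routing rules of the Complete Josephus Cube depend on its extra links, which the abstract does not spell out, so the sketch below only illustrates the general flavour of fault-tolerant routing on a plain binary hypercube: prefer a productive, non-faulty dimension and detour through a spare dimension when every productive link at the current node is faulty. All names and the detour policy are assumptions; this is not the paper's algorithm.

```python
def hypercube_route(src, dst, n_dims, faulty_links):
    """Route on a binary n-cube; faulty_links is a set of frozenset({a, b})
    node pairs whose link is down."""
    path, node = [src], src
    while node != dst:
        diff = node ^ dst
        productive = [d for d in range(n_dims) if diff >> d & 1]
        spare = [d for d in range(n_dims) if not diff >> d & 1]   # detours
        for d in productive + spare:
            nxt = node ^ (1 << d)
            if frozenset({node, nxt}) not in faulty_links and nxt not in path:
                node = nxt
                path.append(node)
                break
        else:
            raise RuntimeError("no non-faulty link available at this node")
    return path

faults = {frozenset({0b000, 0b001})}              # link 000-001 is down
print(hypercube_route(0b000, 0b001, 3, faults))   # [0, 2, 3, 1]: detours around the fault
```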
Citations: 5
Password-capabilities: their evolution from the password-capability system into Walnut and beyond
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903370
R. Pose
Since we first devised and defined password capabilities as a new technique for building capability-based operating systems, a number of research systems around the world have used them as the bases for a variety of operating systems. Our original Password-Capability System was implemented on custom built hardware with a novel address translation and protection scheme specifically designed to support password-capabilities. The password-capability concept later formed the basis of Opal developed at the University of Washington, and Mungi from the University of New South Wales, both of which used commercially available hardware. A second generation password-capability based system, Walnut, was developed at Monash University in the 1990s. Walnut was designed to run on commercially available hardware. It addressed some shortcomings of the original Password-Capability System but had to sacrifice some features that depended on hardware support. A third generation system that will extend Walnut to support mandatory security policies and other advanced features is currently being considered. This paper analyses the evolution of the Password-Capability System into Walnut, examines the shortcomings of the systems, and identifies issues to be addressed in the new system.
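A minimal sketch of the password-capability idea itself, assuming nothing about the Password-Capability System's or Walnut's real data structures: a capability is just an (object identifier, password) pair, where the password is a sparse, hard-to-guess value, and the kernel grants only the rights that were bound to that pair when it was minted. The class and method names below are illustrative assumptions.

```python
import secrets

class CapabilityKernel:
    def __init__(self):
        self._caps = {}   # (object_id, password) -> set of rights

    def mint(self, object_id, rights):
        """Create a new capability for an object with the given rights."""
        password = secrets.token_hex(8)            # sparse, unguessable value
        self._caps[(object_id, password)] = set(rights)
        return object_id, password                 # the capability itself

    def access(self, capability, right):
        """Return True iff the presented capability carries the right."""
        rights = self._caps.get(capability)
        return rights is not None and right in rights

kernel = CapabilityKernel()
cap = kernel.mint("volume:17", {"read"})
print(kernel.access(cap, "read"))                       # True
print(kernel.access(cap, "write"))                      # False: right not bound
print(kernel.access(("volume:17", "guessed"), "read"))  # False: wrong password
```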
Citations: 20
Performance evaluation of a partial retraining scheme for defective multi-layer neural networks
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903376
K. Yamamori, T. Abe, S. Horiguchi
This paper addresses an efficient stuck-defect compensation scheme for multi-layer artificial neural networks implemented in hardware devices. To compensate for stuck defects, we have proposed a two-stage partial retraining scheme that adjusts the weights belonging to a neuron affected by defects, using the back-propagation (BP) algorithm between two layers. For input neurons, the partial retraining scheme is applied twice: the first stage between the input layer and the hidden layer, and the second stage between the hidden layer and the output layer. The partial retraining scheme needs no additional circuits if the hardware neural network already has circuits for learning. In this paper we discuss the performance of the partial retraining scheme: retraining time, network yield and generalization ability. As a result, the partial retraining scheme could compensate for stuck neuron defects about 10 times faster than retraining the whole network with the BP algorithm. In addition, network yields are also improved. The partial retraining scheme achieved a recognition ratio of more than 80% for noisy input patterns when 16% of the network's neurons have stuck-at-0 or stuck-at-1 defects.
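The sketch below illustrates the underlying idea of compensating locally rather than retraining everything: a hidden neuron is forced to stuck-at-0, and only the weights of the layer that reads from the hidden layer are then retrained with BP while the rest of the network stays frozen. It is a loose, assumption-level illustration in NumPy; the paper's two-stage scheme and its hardware setting are more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 4))                      # toy inputs
T = (X.sum(axis=1, keepdims=True) > 0).astype(float)      # toy targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, stuck=None):
    h = sigmoid(X @ W1)
    if stuck is not None:
        h[:, stuck] = 0.0              # simulate a stuck-at-0 hidden neuron
    return h, sigmoid(h @ W2)

def mse(y):
    return float(np.mean((y - T) ** 2))

# train a healthy 4-8-1 network with plain BP on the toy task
W1 = rng.normal(0.0, 1.0, (4, 8))
W2 = rng.normal(0.0, 1.0, (8, 1))
for _ in range(2000):
    h, y = forward(W1, W2)
    d2 = (y - T) * y * (1 - y) / len(X)
    d1 = (d2 @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d2)
    W1 -= 0.5 * (X.T @ d1)

_, y = forward(W1, W2, stuck=3)
print("error with stuck neuron:", mse(y))

# partial retraining: only W2 (the layer reading the defective neuron) learns
for _ in range(500):
    h, y = forward(W1, W2, stuck=3)
    d2 = (y - T) * y * (1 - y) / len(X)
    W2 -= 0.5 * (h.T @ d2)

_, y = forward(W1, W2, stuck=3)
print("error after partial retraining:", mse(y))
```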
Citations: 2
Application domains for fixed-length block structured architectures
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903353
L. Eeckhout, T. Vander Aa, B. Goeman, H. Vandierendonck, R. Lauwereins, K. De Bosschere
In order to tackle the growing complexity and interconnect problems in modern microprocessor architectures, computer architects have come up with new architectural paradigms. A fixed-length block structured architecture (BSA) is one of these paradigms. The basic idea of a BSA is to generate blocks of instructions, called BSA-blocks, statically (by the compiler) and to execute these blocks on a decentralized microarchitecture. In this paper, we focus on possible application domains for this architectural paradigm. To investigate this issue, we have set up several experiments with 43 benchmarks coming from SPECint95, SPECfp95 and the MediaBench suite, plus a set of MPEG-4-like algorithms. The main conclusion of this paper is twofold. First, multimedia applications are less control-intensive than the SPECint95 benchmarks and more control-intensive than the SPECfp95 benchmarks. As a result, a compiler for a BSA will find more opportunities to fill BSA-blocks with instructions from the actually executed control flow paths for SPECfp95 than for multimedia applications, and more for multimedia applications than for SPECint95. Second, 16 instructions per BSA-block is appropriate for all application domains. Larger BSA-blocks, on the other hand, result in higher branch misprediction rates for most applications and lead to a less effective use of the virtual window size.
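As a toy illustration of why control-intensive code under-fills fixed-length blocks, the sketch below packs a linear instruction path into 16-slot blocks and closes a block early at every branch; the instruction names, the early-close rule and the nop padding are assumptions made for the example, not the BSA compiler's actual block formation.

```python
BLOCK_SIZE = 16

def form_blocks(instructions, block_size=BLOCK_SIZE):
    """Pack a linear instruction stream into fixed-length blocks, closing a
    block early at every branch as a crude stand-in for a control-flow break."""
    blocks, current = [], []
    for insn in instructions:
        current.append(insn)
        if insn.startswith("br") or len(current) == block_size:
            current += ["nop"] * (block_size - len(current))
            blocks.append(current)
            current = []
    if current:
        current += ["nop"] * (block_size - len(current))
        blocks.append(current)
    return blocks

# a branchy (control-intensive) path fills blocks far worse than a straight one
branchy = ["add", "mul", "br_if"] * 8                 # a branch every 3 instructions
straight = ["add"] * 23 + ["br_back"]                 # one loop-closing branch
for name, path in [("branchy", branchy), ("straight", straight)]:
    blocks = form_blocks(path)
    used = sum(op != "nop" for b in blocks for op in b)
    print(name, "blocks:", len(blocks),
          "utilization:", round(used / (len(blocks) * BLOCK_SIZE), 2))
```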
Citations: 2
The SawMill framework for virtual memory diversity
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903345
M. Aron, J. Liedtke, Kevin Elphinstone, Yoonho Park, T. Jaeger, Luke Deller
We present a framework that allows applications to build and customize VM services on the L4 microkernel. While the L4 microkernel's abstractions are quite powerful, using these abstractions effectively requires higher-level paradigms. We propose the dataspace paradigm, which provides a modular VM framework. The modularity introduced by the dataspace paradigm facilitates implementation and permits dynamic configurability. Initial performance results from a prototype are promising.
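To give the dataspace paradigm some shape, here is an illustrative sketch in which any memory object implements a common fault-resolution interface and a region mapper directs page faults to the dataspace attached at the faulting address. The interface and class names are assumptions for illustration, not SawMill's or L4's actual API.

```python
from abc import ABC, abstractmethod

PAGE = 4096

class Dataspace(ABC):
    @abstractmethod
    def resolve_fault(self, offset: int) -> bytes:
        """Return the page of backing data at this offset into the dataspace."""

class AnonymousDataspace(Dataspace):
    def resolve_fault(self, offset):
        return bytes(PAGE)                        # zero-filled page on demand

class FileDataspace(Dataspace):
    def __init__(self, path):
        self.path = path
    def resolve_fault(self, offset):
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(PAGE).ljust(PAGE, b"\0")

class RegionMapper:
    """Maps virtual-address regions of one task to dataspaces."""
    def __init__(self):
        self.regions = []                         # (start, size, dataspace)
    def attach(self, start, size, ds):
        self.regions.append((start, size, ds))
    def page_fault(self, vaddr):
        for start, size, ds in self.regions:
            if start <= vaddr < start + size:
                return ds.resolve_fault((vaddr - start) // PAGE * PAGE)
        raise MemoryError(f"unmapped address {vaddr:#x}")

rm = RegionMapper()
rm.attach(0x1000_0000, 1 << 20, AnonymousDataspace())
page = rm.page_fault(0x1000_2345)                 # resolved as a demand-zero page
```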
Citations: 31
Exploiting Java instruction/thread level parallelism with horizontal multithreading
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903373
Kenji Watanabe, Wanming Chu, Yamin Li
Java bytecodes can be executed with the following three methods: a Java interpreter running on a particular machine interprets bytecodes; a Just-in-Time (JIT) compiler translates bytecodes to the native primitives of the particular machine and the machine executes the translated code; or a Java processor executes bytecodes directly. The first two methods require no special hardware support for the execution of Java bytecodes and are widely used currently. The last method requires an embedded Java processor, picoJavaI or picoJavaII for instance. The picoJavaI and picoJavaII are simple pipelined processors with no ILP (instruction level parallelism) or TLP (thread level parallelism) support. The so-called MAJC (microprocessor architecture for Java computing) design can exploit ILP and TLP by using a modified VLIW (very long instruction word) architecture and a vertical multithreading technique, but it has its own instruction set and cannot execute Java bytecodes directly. In this paper, we investigate a processor architecture which can directly execute Java bytecodes while exploiting Java ILP and TLP simultaneously. The proposed processor consists of multiple slots implementing horizontal multithreading and multiple functional units shared by all threads executing in parallel. Our architectural simulation results show that the Java processor could achieve an average of 20 IPC (instructions per cycle), or 7.33 EIPC (effective IPC), with 8 slots and a 4-instruction scheduling window for each slot. We also examine other configurations and give the utilization of the functional units as well as the performance improvement under various kinds of workloads.
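A rough behavioural model (not the paper's simulator) of horizontal multithreading is sketched below: each slot holds one thread, and in every cycle each slot may issue its next instruction if a shared functional unit of the required kind is still free, so throughput grows with the number of slots until the shared units saturate. The slot count, functional-unit mix and single-issue-per-slot rule are assumptions chosen for brevity.

```python
def simulate(threads, fu_counts):
    """threads: list of instruction lists, each instruction being an FU kind.
    fu_counts: how many shared functional units of each kind exist.
    Assumes every kind used by a thread has at least one unit."""
    pcs = [0] * len(threads)
    cycles = issued = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        free = dict(fu_counts)                 # FU budget for this cycle
        for slot, t in enumerate(threads):     # one issue chance per slot
            if pcs[slot] < len(t) and free.get(t[pcs[slot]], 0) > 0:
                free[t[pcs[slot]]] -= 1        # claim a shared unit
                pcs[slot] += 1
                issued += 1
        cycles += 1
    return issued / cycles

threads = [["alu", "alu", "mem", "alu"] for _ in range(8)]   # 8 slots, one thread each
print("IPC:", simulate(threads, {"alu": 4, "mem": 2}))
```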
Citations: 6
Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903363
C. Jesshope
This paper looks at a combination of two techniques, one of which, the use of a vector instruction set, has a long history dating back to pipelined vector supercomputers such as the Cray 1 and its successors. The other technique, multi-threading, is also well understood. The novel approach proposed in this paper combines both vertical and horizontal micro-threading with vector instruction descriptors. It is shown that a family of threads can represent a vector instruction, with dependencies between the instances of that family, the iterations. This technique gives a very low overhead in implementing an n-way loop and is able to tolerate high memory latency. The use of micro-threading to handle dependencies between threads provides the ability to trade off instruction level parallelism against loop parallelism. The paper describes the means by which instruction classes may be instanced as independent parallel micro-threads and illustrates the speed-up that may be obtained compared to using a conventional loop.
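The sketch below illustrates the thread-family idea behaviourally: a descriptor gives the iteration range and a loop-carried dependency distance, and each iteration runs as a tiny "micro-thread" only after the iteration it depends on has completed. The descriptor fields and scheduling loop are assumptions for illustration; the paper's proposal is a hardware pipeline mechanism, not library code.

```python
from dataclasses import dataclass

@dataclass
class ThreadFamily:
    start: int
    limit: int
    step: int
    dep_distance: int      # iteration i depends on iteration i - dep_distance

def run_family(fam, body, state):
    done = set()
    pending = list(range(fam.start, fam.limit, fam.step))
    while pending:
        for i in list(pending):
            dep = i - fam.dep_distance * fam.step
            if dep < fam.start or dep in done:      # dependency satisfied?
                body(i, state)                       # the "micro-thread" runs
                done.add(i)
                pending.remove(i)
    return state

# a loop with a carried dependency of distance 1: x[i] += x[i-1] (prefix sum)
x = [1, 2, 3, 4, 5]
def body(i, x):
    if i > 0:
        x[i] += x[i - 1]
run_family(ThreadFamily(0, len(x), 1, 1), body, x)
print(x)   # [1, 3, 6, 10, 15]
```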
Citations: 29
Retargetable cache simulation using high level processor models
Pub Date : 2001-01-29 DOI: 10.1109/ACAC.2001.903371
Rajiv A. Ravindran, R. Moona
During processor design, it is often necessary to evaluate multiple cache configurations. This paper describes the design and implementation of a retargetable on-line cache simulator. The cache simulator has been implemented using a retargetable instruction set simulator based on the Sim-nML processor description language. The retargetability helps in cache simulation and evaluation well before the actual processor design is finalized.
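As an example of the kind of configurable cache model such a simulator evaluates on-line, the sketch below implements a generic set-associative cache with LRU replacement and replays one address trace against two configurations. The parameters and structure are ordinary textbook assumptions, not the Sim-nML-based implementation described in the paper.

```python
from collections import OrderedDict

class Cache:
    def __init__(self, size_bytes, line_bytes, assoc):
        self.line = line_bytes
        self.sets = size_bytes // (line_bytes * assoc)
        self.assoc = assoc
        self.array = [OrderedDict() for _ in range(self.sets)]  # LRU per set
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line // self.sets
        idx = (addr // self.line) % self.sets
        s = self.array[idx]
        if tag in s:
            s.move_to_end(tag)          # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.assoc:
                s.popitem(last=False)   # evict the least recently used line
            s[tag] = True

# evaluate two configurations on the same address trace
trace = [i * 64 for i in range(1024)] * 2         # two passes over 64 KiB
for cfg in [(32 * 1024, 64, 2), (64 * 1024, 64, 4)]:
    c = Cache(*cfg)
    for a in trace:
        c.access(a)
    print(cfg, "miss rate:", c.misses / (c.hits + c.misses))
```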
Citations: 4