Adaptive middleware for heterogeneous defence networks - an exploratory simulation study
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824324
B. McClure, T. Au, J. Indulska
This paper presents the design, and evaluation through a discrete event simulation, of an ODP-based Adaptive Computing Architecture which manages network resources in large-scale, heterogeneous, error-prone networks. The emphasis is on the network (communication) adaptation of this architecture, simulated for an exemplar defence network. The results show that, for this network, the architecture provides a significant improvement in the proportion of higher-priority requests meeting their QoS requirements and in adaptation to link failure under heavy link utilisation. In addition, link utilisation is lower when the architecture is active.
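The paper's simulator is not available here; the following is a minimal sketch, under invented assumptions (a single shared link, two priority levels, an adaptation policy that sheds low-priority flows), of the kind of discrete event simulation and priority/QoS comparison the abstract describes. All names and parameters are illustrative only.

```python
import heapq, random

LINK_CAPACITY = 10.0   # arbitrary bandwidth units (an assumption of this sketch)

def simulate(adaptive, n_requests=2000, seed=1):
    """Toy discrete event simulation: prioritised requests compete for one link."""
    random.seed(seed)
    events = []                      # (time, seq, kind, request)
    active = {}                      # request id -> (priority, bandwidth)
    used = 0.0
    met = {1: 0, 2: 0}               # requests admitted at their requested bandwidth
    total = {1: 0, 2: 0}
    t, seq = 0.0, 0
    for i in range(n_requests):
        t += random.expovariate(1.0)                 # Poisson arrivals
        prio = 1 if random.random() < 0.3 else 2     # 1 = high priority
        req = {"id": i, "prio": prio, "bw": random.uniform(0.5, 2.0)}
        heapq.heappush(events, (t, seq, "arrive", req)); seq += 1
    while events:
        now, _, kind, req = heapq.heappop(events)
        if kind == "depart":
            if req["id"] in active:                  # may already have been shed
                used -= active.pop(req["id"])[1]
            continue
        total[req["prio"]] += 1
        if adaptive and req["prio"] == 1 and used + req["bw"] > LINK_CAPACITY:
            # adaptation: shed one low-priority flow to admit the high-priority one
            victim = next((k for k, v in active.items() if v[0] == 2), None)
            if victim is not None:
                used -= active.pop(victim)[1]
        if used + req["bw"] <= LINK_CAPACITY:
            active[req["id"]] = (req["prio"], req["bw"])
            used += req["bw"]
            met[req["prio"]] += 1
            heapq.heappush(events, (now + random.expovariate(0.1), seq, "depart", req))
            seq += 1
    return {p: round(met[p] / max(total[p], 1), 3) for p in (1, 2)}

for mode in (False, True):
    print("adaptive" if mode else "baseline", simulate(mode))
```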
{"title":"Adaptive middleware for heterogeneous defence networks-an exploratory simulation study","authors":"B. McClure, T. Au, J. Indulska","doi":"10.1109/ACAC.2000.824324","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824324","url":null,"abstract":"This paper presents the design and evaluation through a discrete event simulation of an ODP-based Adaptive Computing Architecture which manages network resources in large-scale heterogeneous error-prone networks. The emphasis is given to network (communication) adaptation of this architecture simulated for an exemplar defence network. The results show that, for this network, the architecture provides significant improvement in terms of higher priority requests meeting their QoS requirements and adaptation to link failure under heavy link utilisation. In addition, link utilisation is lower with the architecture active.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129270931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A scalable re-configurable processor
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824325
John Morris, G. Bundell, S. Tham
Several commercial and research projects have produced a variety of 'computing surfaces' based on FPGAs with some interconnection pattern. However, because the majority of these projects have constrained themselves to two-dimensional structures that can be fabricated on a single planar substrate, the interconnect patterns are fixed and severely constrain the ability of a problem to be mapped onto the prototyping system. This paper describes a simple development of the Achilles interprocessor switch. Achilles' 3D stack of processors provides a flexible and scalable system: any number of stacks may be connected together in a small volume, and a user may set up a connection pattern quite different from any envisaged by the hardware designer. Simulation of control systems with large numbers of objects, such as traffic flows or network message traffic, is CPU intensive and generally requires inordinately long runs on conventional sequential processors. We have therefore chosen Petri Net simulation as a feasibility study for Achilles as a reconfigurable processor. This showed that the architecture is particularly suitable for Petri Net simulations, as hundreds of places in a net can be simultaneously active, reducing by orders of magnitude the time necessary for simulations.
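As a software illustration of why Petri nets suit this kind of parallel hardware, the sketch below fires, in each step, a maximal non-conflicting set of enabled transitions; on Achilles each place and transition would be realised as circuitry evaluating concurrently rather than as Python code. The example net is invented.

```python
# Toy Petri net: places hold token counts; a transition consumes tokens from its
# input places and produces tokens on its output places.  One simulation step
# computes a maximal non-conflicting set of enabled transitions and fires it.

transitions = {                            # name -> (inputs, outputs); invented net
    "t1": ({"p1": 1}, {"p2": 1}),
    "t2": ({"p2": 1}, {"p3": 1, "p4": 1}),
    "t3": ({"p4": 1}, {"p1": 1}),
}
marking = {"p1": 2, "p2": 0, "p3": 0, "p4": 0}

def step(m):
    """Fire a maximal set of concurrently enabled, non-conflicting transitions."""
    firing, reserved = [], dict(m)
    for t, (ins, _) in transitions.items():
        if all(reserved[p] >= n for p, n in ins.items()):
            for p, n in ins.items():
                reserved[p] -= n           # reserve tokens so firings don't conflict
            firing.append(t)
    for t in firing:
        ins, outs = transitions[t]
        for p, n in ins.items():
            m[p] -= n
        for p, n in outs.items():
            m[p] = m.get(p, 0) + n
    return firing

for i in range(5):
    fired = step(marking)
    print("step", i, "fired", fired, "marking", marking)
```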
{"title":"A scalable re-configurable processor","authors":"John Morris, G. Bundell, S. Tham","doi":"10.1109/ACAC.2000.824325","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824325","url":null,"abstract":"Several commercial and research projects have produced a variety of 'computing surfaces' based on FPGAs with some interconnection pattern. However, because the majority of these projects have constrained themselves to two-dimensional structures that can be fabricated on a single planar substrate, the interconnect patterns are fixed and severely constrain the ability of a problem to be mapped on to the prototyping system. This paper describes a simple development of the Achilles interprocessor switch. Achilles' 3D stack of processors provides a flexible and scalable system-any number of stacks may be connected together in a small volume and a user may set up a connection pattern quite different from any envisaged by the hardware designer. Simulation of control systems where there are large numbers of objects such as traffic flows, network message traffic, etc, is CPU intensive and generally requires inordinately long runs on conventional sequential processors. So we have chosen Petri Net simulation for a feasibility study for Achilles as a reconfigurable processor. This showed that the architecture is particularly suitable for Petri Net simulations as hundreds of places in a net can be simultaneously active-reducing by orders of magnitude the time necessary for simulations.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130467669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Static scheduling for out-of-order instruction issue processors
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824329
D. Tate, G. Steven, F. Steven
Superscalar processors strive to increase the number of instructions issued in each processor cycle. Compilers therefore need to expose as much Instruction Level Parallelism (ILP) as possible by using increasingly complex code optimisations. However, the knowledge base of instruction scheduling is focused on in-order instruction issue. It has previously been determined that aggressive static instruction scheduling impedes the speedup achieved by out-of-order instruction issue given an ideal environment. This paper examines how the scheduling process impairs the performance of out-of-order instruction issue. The use of Boolean guards, function in-lining, register renaming and percolation both between basic blocks and around loop back edges is evaluated. The results show that removing Boolean guards and severely limiting percolation while retaining function in-lining produces an improvement over unscheduled benchmarks.
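As an illustration of one of the transformations evaluated, the sketch below "percolates" the first instruction of a successor basic block above its branch under a Boolean guard; the instruction encoding and register names are invented and do not reflect the authors' scheduler or target ISA.

```python
# Illustrative only: move an instruction from a successor basic block above the
# branch that guards it, attaching the branch predicate as a Boolean guard so
# the instruction has no architectural effect on the untaken path.

def percolate_first(pred_block, succ_block, guard_reg):
    inst = dict(succ_block.pop(0))
    inst["guard"] = guard_reg                        # execute only if guard_reg is true
    pred_block.insert(len(pred_block) - 1, inst)     # just before the trailing branch

then_block = [
    {"op": "add", "dst": "r3", "src": ["r4", "r5"]},
    {"op": "st",  "addr": "r6", "src": "r3"},
]
entry_block = [
    {"op": "cmp.gt", "dst": "b1", "src": ["r1", "r2"]},
    {"op": "br",     "guard": "b1", "target": "THEN"},
]

percolate_first(entry_block, then_block, "b1")
for inst in entry_block:
    print(inst)
```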
{"title":"Static scheduling for out-of-order instruction issue processors","authors":"D. Tate, G. Steven, F. Steven","doi":"10.1109/ACAC.2000.824329","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824329","url":null,"abstract":"Superscalar processors strive to increase the number of instructions issued in each processor cycle. Compilers therefore need to expose as much Instruction Level Parallelism (ILP) as possible by using increasingly complex code optimisations. However, the knowledge base of instruction scheduling is focused on in-order instruction issue. It has previously been determined that aggressive static instruction scheduling impedes the speedup achieved by out-of-order instruction issue given an ideal environment. This paper examines how the scheduling process impairs the performance of out-of-order instruction issue. The use of Boolean guards, function in-lining, register renaming and percolation both between basic blocks and around loop back edges is evaluated. The results show that removing Boolean guards and severely limiting percolation while retaining function in-lining produces an improvement over unscheduled benchmarks.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114674365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dataflow Java: implicitly parallel Java
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824321
Gareth Lee, John Morris
Dataflow computation models enable simpler and more efficient management of the memory hierarchy, a key barrier to the performance of many parallel programs. This paper describes a dataflow language based on Java. Use of the dataflow model enables a programmer to generate parallel programs without explicit directions for message passing, work allocation and synchronisation. A small handful of additional syntactic constructs are required. A pre-processor is used to convert Dataflow Java programs to standard portable Java. The underlying run-time system was easy to implement using Java's object modelling and communications primitives. Although raw performance lags behind an equivalent C-based system, we were able to demonstrate useful speedups in a heterogeneous environment, thus amply illustrating the potential power of the Dataflow Java approach to use all machines, of whatever type, that might be available on a network when Java JIT compiler technology matures.
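Dataflow Java's actual syntax is not reproduced here; the sketch below, in plain Python with futures, only illustrates the underlying model in which each node fires as soon as its operands are available, so parallelism follows from data dependences rather than explicit synchronisation.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustration only: each node fires once its operands are available, so no
# explicit message passing, work allocation or locking appears in the program.
pool = ThreadPoolExecutor(max_workers=8)

def const(x):
    return pool.submit(lambda: x)

def node(fn, *inputs):
    """Schedule fn to run as soon as all of its input futures have values."""
    return pool.submit(lambda: fn(*[f.result() for f in inputs]))

# dataflow graph for (a + b) * (a - b)
a, b = const(6), const(4)
s = node(lambda x, y: x + y, a, b)
d = node(lambda x, y: x - y, a, b)
p = node(lambda x, y: x * y, s, d)

print(p.result())   # 20; a real dataflow runtime would schedule nodes by
pool.shutdown()     # readiness instead of blocking worker threads on result()
```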
{"title":"Dataflow Java: implicitly parallel Java","authors":"Gareth Lee, John Morris","doi":"10.1109/ACAC.2000.824321","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824321","url":null,"abstract":"Dataflow computation models enable simpler and more efficient management of the memory hierarchy-a key barrier to the performance of many parallel programs. This paper describes a dataflow language based on Java. Use of the dataflow model enables a programmer to generate parallel programs without explicit directions for message passing, work allocation and synchronisation. A small handful of additional syntactic constructs are required. A pre-processor is used to convert Dataflow Java programs to standard portable Java. The underlying run-time system was easy to implement using Java's object modelling and communications primitives. Although raw performance lags behind an equivalent C-based system, we were able to demonstrate useful speedups in a heterogeneous environment, thus amply illustrating the potential power of the Dataflow Java approach to use all machines-of whatever type-that might be available on a network when Java JIT compiler technology matures.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116153955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel architecture for the implementation of the embedded zerotree wavelet algorithm
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824316
H. Cheung, L. Ang, K. Eshraghian
We propose a parallel architecture for the implementation of the embedded zerotree wavelet (EZW) algorithm, based on the depth-first search (DFS) bit stream (BS) architecture. Using a depth-first search of the wavelet coefficient tree, the wavelet coefficients are first partitioned into independent sub-trees. In the case of full parallelism, each of the sub-trees is processed by an independent processor. The output from each processor is then multiplexed back into a single output bit stream. While the output bit stream from each sub-tree processor is in the depth-first search format, the overall multiplexed output bit stream represents a search of the sub-trees in parallel. The implementation of each sub-tree EZW processor is based on the DFS BS architecture, which accepts the bits of the coefficients in decreasing order of significance from a sub-tree. All the bits in a significant bit plane are processed to produce the output bit stream from the architecture in one scan of the sub-trees. The use of the DFS BS structure also makes partial parallelism possible, where a sub-tree processor can process two or more sub-trees in sequence. This provides flexibility to design the overall processor to optimally match the speed of the overall input bit stream. The emphasis in this paper is on the parallel processing aspect of the DFS BS architecture. A sub-tree processor can easily be modified to perform any improved EZW algorithm, and the multiplexer for the output bit streams from the processors can be modified to produce the format of EZW algorithms based on other tree-searching schemes, such as the SPIHT algorithm.
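A compact sketch of the partitioning and multiplexing structure described above: the coefficient tree is split at the children of the root into independent sub-trees, each sub-tree is scanned depth-first to emit the significance bits of one bit plane, and the per-sub-tree streams are multiplexed into a single output. The tree, coefficients and bit semantics are simplified for illustration and omit the full EZW symbol set.

```python
# Simplified illustration of the sub-tree parallelism: each child of the root
# starts an independent sub-tree; a DFS over each sub-tree emits one bit per
# coefficient for the current bit plane (1 = significant at this threshold).
# Real EZW emits zerotree/positive/negative/isolated-zero symbols; this sketch
# keeps only the structure relevant to partitioning and multiplexing.

tree = {                      # node -> children (invented two-level example)
    "root": ["a", "b"],
    "a": ["a0", "a1"], "b": ["b0", "b1"],
    "a0": [], "a1": [], "b0": [], "b1": [],
}
coeff = {"root": 57, "a": -29, "b": 14, "a0": 10, "a1": -3, "b0": 7, "b1": 1}

def dfs_bits(node, threshold):
    """Depth-first significance scan of one sub-tree for one bit plane."""
    bits = [1 if abs(coeff[node]) >= threshold else 0]
    for child in tree[node]:
        bits += dfs_bits(child, threshold)
    return bits

def encode_bitplane(threshold):
    # In full parallelism each sub-tree would be handled by its own processor;
    # here we simply collect the per-sub-tree streams and multiplex them.
    streams = [dfs_bits(sub, threshold) for sub in tree["root"]]
    out = []
    for i in range(max(len(s) for s in streams)):
        for s in streams:
            if i < len(s):
                out.append(s[i])      # round-robin multiplex of the sub-tree streams
    return streams, out

streams, multiplexed = encode_bitplane(16)   # 2**4, first significant bit plane here
print("per-sub-tree:", streams)
print("multiplexed: ", multiplexed)
```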
{"title":"Parallel architecture for the implementation of the embedded zerotree wavelet algorithm","authors":"H. Cheung, L. Ang, K. Eshraghian","doi":"10.1109/ACAC.2000.824316","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824316","url":null,"abstract":"We propose a parallel architecture for the implementation of the embedded zerotree wavelet (EZW) algorithm, based on the depth-first search (DFS) bit stream (BS) architecture. Using the depth-first search of the wavelet coefficient tree, the wavelet coefficients in the coefficient tree are first partitioned into independent sub-trees. In the case of full parallelism, each of the sub-trees is processed by an independent processor. The output from each processor is then multiplexed back into a single output bit stream. While the output bit stream from each sub-tree processor is in the depth-first search format, the overall multiplexed output bit stream represents the search of the sub-trees in parallel. The implementation of each of the sub-tree EZW processor is based on the DFS BS architecture, which accepts the bits of the coefficients in decreasing order of significance from a sub-tree. All the bits in a significant bit plane are processed to produce the output bit stream from the architecture in one scan of the sub-trees. The rise of the DFS BS structure also makes it possible for partial parallelism where a sub-tree processor can process two or more sub-trees in sequence. This provides flexibility for the design of the overall processor optimally to match the speed of the overall input bit stream. The emphasis in this paper is on the parallel processing aspect of the DFS BS architecture. A sub-tree processor can be easily modified to perform any improved EZW algorithm, and the multiplexer for the output bit streams from the processors can be modified to produce the format of the EZW algorithm based on other tree searching schemes similar to the SPIHT algorithm.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132215310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micro-threading: a new approach to future RISC
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824320
C. Jesshope, Bing Luo
This paper briefly reviews current research into RISC microprocessor architecture, which now seems to be so complex as to make the acronym something of an oxymoron. In response to this development we present a new approach to RISC micro-architecture named micro-threading. Micro-threading exploits instruction-level parallelism through multi-threading, but the threads are all assumed to be drawn from the same context and are thus represented by just a program counter. This approach attempts to overcome the limits of RISC instruction control (branches, loops, etc.) and data control (data misses, etc.) by providing such a low context-switch time that it can be used not only to tolerate high-latency memory but also to avoid speculation in instruction execution. It is therefore able to provide a more efficient approach to instruction pipelining. To demonstrate this approach, we compile simple examples to illustrate the concept of micro-threading within the same context. One possible architecture of a micro-threaded pipeline is then presented in detail. Finally, we give some comparisons and conclusions.
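A toy model of the scheduling idea, assuming threads that share one context and are named only by a resumption point: each microthread below is a Python generator, yields when it issues a load that would miss, and the "pipeline" switches to another ready thread so the miss latency is covered by other threads' instructions. Thread bodies and latencies are invented.

```python
from collections import deque

# Each microthread shares the register context and is identified only by its
# resumption point (here, the frame of a Python generator).  A thread yields
# the latency of a load that would miss; the scheduler then issues from another
# ready thread instead of stalling, so misses are overlapped with useful work.

def microthread(n_instructions):
    for i in range(n_instructions):
        if i % 3 == 2:
            yield ("load-miss", 5)        # pretend every third instruction misses
        else:
            yield ("alu", 0)

def run(threads):
    ready = deque(threads)                # (name, generator) pairs
    waiting = []                          # (wake_cycle, name, generator)
    cycle = 0
    while ready or waiting:
        cycle += 1
        for entry in list(waiting):       # wake threads whose miss has returned
            if entry[0] <= cycle:
                waiting.remove(entry)
                ready.append((entry[1], entry[2]))
        if not ready:
            continue                      # a genuine pipeline bubble
        name, g = ready.popleft()
        try:
            kind, latency = next(g)       # "issue" one instruction
        except StopIteration:
            continue
        if kind == "load-miss":
            waiting.append((cycle + latency, name, g))
        else:
            ready.append((name, g))       # context switch costs nothing: just a PC
    return cycle

print("cycles, 1 thread :", run([("t0", microthread(9))]))
print("cycles, 4 threads:", run([(f"t{i}", microthread(9)) for i in range(4)]))
```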
{"title":"Micro-threading: a new approach to future RISC","authors":"C. Jesshope, Bing Luo","doi":"10.1109/ACAC.2000.824320","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824320","url":null,"abstract":"This paper briefly reviews the current research into RISC microprocessor architecture, which now seems to be so complex as to make the acronym somewhat of an oxymoron. In response to this development we present a new approach to RISC micro-architecture named micro-threading. Micro-threading exploits instruction-level parallelism by multi-threading but where the threads are all assumed to be drawn from the same context and are thus represented by just a program counter. This approach attempts to overcomes the limit of RISC instruction control (branch, loop, etc.) and data control (data miss, etc.) by providing such a low context switch time that it can be used not only to tolerate high latency memory but also avoid speculation in instruction execution. It is therefore able to provide a more efficient approach to instruction pipelining. In order to demonstrate this approach we compile simple examples to illustrate the concept of micro-threading within the same context. Then one possible architecture of a micro-threaded pipeline is presented in detail. At last, we give some comparisons and a conclusion.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"103 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113963823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast address-space switching on the StrongARM SA-1100 processor
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824330
Adam Wiggins, G. Heiser
The StrongARM SA-1100 is a high-speed, low-power processor aimed at embedded and portable applications. Its architecture features virtual caches and TLBs which are not tagged by an address-space identifier. Consequently, context switches on that processor are potentially very expensive, as they may require complete flushes of the TLBs and caches. This paper presents the design of an address-space management technique for the StrongARM which minimises TLB and cache flushes and thus context-switching costs. The basic idea is to implement the top level of the (hardware-walked) page table as a cache for page directory entries of different address spaces. This allows switching address spaces with minimal overhead as long as the working sets do not overlap. For small (≤32 MB) address spaces, further improvements are possible by making use of the StrongARM's re-mapping facility. Our technique is discussed in the context of the L4 microkernel, in which it will be implemented.
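A schematic model of the idea, not StrongARM or L4 code: the top-level page table is treated as a cache of page-directory entries from several address spaces, so a context switch normally just installs the new space's entries and a flush is needed only when the regions in use actually collide. Sizes and names are invented.

```python
# Schematic model only (real code manipulates the ARM top-level page table and
# the virtually-addressed caches/TLB directly).  Each slot of the top-level
# table maps one section; a context switch installs the new address space's
# directory entries, and a full flush is needed only on an actual collision.

SECTIONS = 16          # toy size; the SA-1100 top-level table has 4096 entries

class TopLevelCache:
    def __init__(self):
        self.slot_owner = [None] * SECTIONS   # which address space owns each slot
        self.flushes = 0

    def switch_to(self, asid, regions):
        """Install directory entries for `regions` (slot indices) of space `asid`."""
        if any(self.slot_owner[s] not in (None, asid) for s in regions):
            # working sets overlap: evict conflicting entries (and, on the real
            # hardware, flush the virtually-addressed caches and the TLB)
            self.flushes += 1
            self.slot_owner = [o if o == asid else None for o in self.slot_owner]
        for s in regions:
            self.slot_owner[s] = asid

table = TopLevelCache()
table.switch_to("A", [0, 1, 2])
table.switch_to("B", [8, 9])       # disjoint working sets: no flush needed
table.switch_to("A", [0, 1, 2])    # A's entries are still resident
table.switch_to("C", [1, 9])       # collides with A and B: flush required
print("flushes:", table.flushes)   # 1
```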
{"title":"Fast address-space switching on the StrongARM SA-1100 processor","authors":"Adam Wiggins, G. Heiser","doi":"10.1109/ACAC.2000.824330","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824330","url":null,"abstract":"The StrongARM SA-1100 is a high-speed low-power processor aimed at embedded and portable applications. Its architecture features virtual caches and TLBs which are not tagged by an address-space identifier. Consequently, context switches on that processor are potentially very expensive, as they may require complete flushes of TLBs and caches. This paper presents the design of an address-space management technique for the StrongARM which minimises TLB and cache flushes and thus context switching costs. The basic idea is to implement the top-level of the (hardware-walked) page-table as a cache for page directory entries for different address spaces. This allows switching address spaces with minimal overhead as long as the working sets do not overlap. For small (/spl les/32 MB) address spaces further improvements are possible by making use of the StrongARM's re-mapping facility. Our technique is discussed in the context of the LA microkernel in which it will be implemented.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129245870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The circuit object organisation library
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824319
B. Gunther
The Circuit Object Organisation Library is a C++ class library for developing continuously executing circuit generator programs used in real-time, adaptive reconfigurable computing applications. A C++ program linked with COOL can execute autonomously, since COOL provides a high-speed place and route facility for realising fine grained FPGA circuits from object-oriented structural descriptions. With COOL the need for separate hardware description and software programming languages disappears. The class inheritance concept is used to define specialised circuits, composed of gate, port, and wire objects. An applications programming interface borrowing from graphical user interface toolkits, automatic storage reclamation, and use of operator overloading make circuit description intuitive and relatively accessible to developers without a strong hardware background. COOL features constructive placement algorithms, and a two-stage router that minimises average run time, yet handles difficult routes via a last-resort Lee maze router. Preliminary tests reveal that COOL can realise circuits at rates of tens of thousands of gates per second on a low-end PC.
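COOL's real C++ API is not reproduced here; the sketch below only mirrors, in Python with invented class and method names, the style the abstract describes, in which a specialised circuit is defined by inheritance and composed from gate, port and wire objects.

```python
# Conceptual sketch only: class and method names are invented and do not
# reflect COOL's actual C++ API.  It mirrors the style described above, where
# a specialised circuit is a subclass composed of gate, port and wire objects.

class Gate:
    def __init__(self, kind, inputs, output):
        self.kind, self.inputs, self.output = kind, inputs, output

class Circuit:
    def __init__(self, name):
        self.name, self.gates, self.ports = name, [], {}

    def port(self, port_name):
        return self.ports.setdefault(port_name, f"{self.name}.{port_name}")

    def gate(self, kind, *inputs):
        out = f"{self.name}.w{len(self.gates)}"      # fresh wire for the output
        self.gates.append(Gate(kind, list(inputs), out))
        return out

class FullAdder(Circuit):                 # specialisation by inheritance
    def __init__(self, name):
        super().__init__(name)
        a, b, cin = self.port("a"), self.port("b"), self.port("cin")
        s1 = self.gate("XOR", a, b)
        self.ports["sum"] = self.gate("XOR", s1, cin)
        self.ports["cout"] = self.gate("OR",
                                       self.gate("AND", a, b),
                                       self.gate("AND", s1, cin))

fa = FullAdder("fa0")
for g in fa.gates:
    print(g.kind, g.inputs, "->", g.output)
```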
{"title":"The circuit object organisation library","authors":"B. Gunther","doi":"10.1109/ACAC.2000.824319","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824319","url":null,"abstract":"The Circuit Object Organisation Library is a C++ class library for developing continuously executing circuit generator programs used in real-time, adaptive reconfigurable computing applications. A C++ program linked with COOL can execute autonomously, since COOL provides a high-speed place and route facility for realising fine grained FPGA circuits from object-oriented structural descriptions. With COOL the need for separate hardware description and software programming languages disappears. The class inheritance concept is used to define specialised circuits, composed of gate, port, and wire objects. An applications programming interface borrowing from graphical user interface toolkits, automatic storage reclamation, and use of operator overloading make circuit description intuitive and relatively accessible to developers without a strong hardware background. COOL features constructive placement algorithms, and a two-stage router that minimises average run time, yet handles difficult routes via a last-resort Lee maze router. Preliminary tests reveal that COOL can realise circuits at rates of tens of thousands of gates per second on a low-end PC.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130093212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconfigurable computing based on universal configurable blocks - a new approach for supporting performance- and realtime-dominated applications
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824328
Christian Siemers, Sybille Siemers
A novel architecture for reconfigurable computing, based on a coarse-grain FPGA-like structure, is introduced. The basic blocks contain all arithmetic and logic capabilities as well as some registers, and will be programmable by sequential instruction streams produced by a software compiler. Reconfiguration is related to hyper-blocks of instructions. For the composed reconfigurable processors, a classification is introduced to describe their real-time, multithreading and performance capabilities.
{"title":"Reconfigurable computing based on universal configurable blocks-a new approach for supporting performance- and realtime-dominated applications","authors":"Christian Siemers, Sybille Siemers","doi":"10.1109/ACAC.2000.824328","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824328","url":null,"abstract":"A novel architecture for reconfigurable computing based on a coarse grain FPGA-like architecture is introduced. The basic blocks contain all arithmetical and logical capacities as well as some registers and will be programmable by sequential instruction streams produced by software compiler. Reconfiguration is related to hyper-blocks of instructions. For the composed reconfigurable processors a classification is introduced for describing realtime, multithreading and performance capabilities.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the feasibility of fixed-length block structured architectures
Pub Date: 2000-01-31 | DOI: 10.1109/ACAC.2000.824318
L. Eeckhout, K. D. Bosschere, H. Neefs
Scaling contemporary superscalar microarchitectures to higher levels of parallelism in future technologies seems impractical due to their increasing complexity. In this paper, we show that a fixed-length block structured instruction set architecture (BSA) is capable of reducing hardware complexity and is therefore feasible, in future technologies, as an alternative architectural paradigm to traditional architectures with large virtual window sizes. This is achieved through two major interventions. First, statically grouping instructions from various basic blocks into larger, fixed-length atomic units of work, called blocks, makes fetching easier. Second, a decentralized microarchitecture reduces the processor core logic significantly, resulting in higher clock frequencies. The performance evaluation methodology used in this paper considers both IPC (the number of useful instructions retired per clock cycle) and clock cycle period. In addition, a broad design space is explored by quantifying the influence of various microarchitectural parameters on overall performance.
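An illustrative sketch of the first intervention, statically packing instructions from consecutive basic blocks into fixed-length atomic blocks padded with no-ops; the instruction stream, block size and packing policy are invented for illustration.

```python
# Illustration of fixed-length block formation: instructions from consecutive
# basic blocks are packed into atomic blocks of BLOCK_SIZE slots, padded with
# no-ops, so the fetch unit always requests one aligned, fixed-size unit.
# Assumes every basic block fits within a single block.

BLOCK_SIZE = 4

basic_blocks = [
    ["add r1,r2,r3", "ld  r4,0(r1)"],
    ["mul r5,r4,r4", "sub r6,r5,r2", "bne r6,L2"],
    ["st  r6,4(r1)"],
]

def form_blocks(bbs, size):
    blocks, current = [], []
    for bb in bbs:
        if len(current) + len(bb) > size:           # basic blocks stay whole here
            current += ["nop"] * (size - len(current))
            blocks.append(current)
            current = []
        current += bb
    if current:
        current += ["nop"] * (size - len(current))
        blocks.append(current)
    return blocks

for i, blk in enumerate(form_blocks(basic_blocks, BLOCK_SIZE)):
    print(f"block {i}: {blk}")
```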
{"title":"On the feasibility of fixed-length block structured architectures","authors":"L. Eeckhout, K. D. Bosschere, H. Neefs","doi":"10.1109/ACAC.2000.824318","DOIUrl":"https://doi.org/10.1109/ACAC.2000.824318","url":null,"abstract":"Scaling contemporary superscalar microarchitectures to higher levels of parallelism in future technologies seems to be impractical due to the increasing complexity. In this paper, we show that a fixed-length block structured instruction set architecture (BSA), is capable of reducing the hardware complexity and is therefore feasible as an alternative architectural paradigm for traditional architectures with large virtual window sizes for future technologies. This is reached through two major interventions. First, statically, grouping instructions from various basic blocks into larger atomic units of work with a fixed length, called blocks, makes fetching easier. Second, a decentralized microarchitecture reduces the processor core logic significantly resulting in higher clock frequencies. The performance evaluation methodology used in this paper both considers IPC (number of useful instructions retired per clock cycle) and clock cycle period. In addition, a broad design space is explored by quantifying the influence of various microarchitectural parameters on overall performance.","PeriodicalId":129890,"journal":{"name":"Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115266435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}