Bruce K. Holmer, B. Sano, M. Carlton, P. V. Roy, R. Haygood, W. Bush, A. Despain, J. Pendleton, T. Dobry
Most Prolog machines have been based on specialized architectures. The authors' goal is to start with a general-purpose architecture and determine a minimal set of extensions for high-performance Prolog execution. They have developed the architecture and the optimizing compiler simultaneously, drawing on results of previous implementations. They find that most Prolog-specific operations can be done satisfactorily in software; however, there is a crucial set of features that the architecture must support to achieve the best Prolog performance. The emphasis in this study is on the authors' architecture and instruction set. The costs and benefits of the special architectural features and instructions are analyzed. Simulated performance results are presented and indicate a peak compiled Prolog performance of 3.68 million logical inferences per second.
{"title":"Fast Prolog with an extended general purpose architecture","authors":"Bruce K. Holmer, B. Sano, M. Carlton, P. V. Roy, R. Haygood, W. Bush, A. Despain, J. Pendleton, T. Dobry","doi":"10.1145/325164.325154","DOIUrl":"https://doi.org/10.1145/325164.325154","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"126549216"}
Data-flow architectures tolerate long unpredictable communication delays and support generation and coordination of parallel activities directly in hardware, instead of assuming that program mapping will cause these issues to disappear. However, the proposed mechanisms are complex and introduce new mapping complications. A greatly simplified approach to data-flow execution, called the explicit token store (ETS) architecture, and its current realization in Monsoon are presented. The essence of dynamic data-flow execution is captured by a simple transition on state bits associated with storage local to a processor. Low-level storage management is performed by the compiler in assigning nodes to slots in an activation frame, rather than dynamically in hardware. The processor is simple, highly pipelined, and quite general. It may be viewed as a generalization of a fairly primitive von Neumann architecture. Although the addressing capability is restrictive, there is exactly one instruction executed for each action on the data-flow graph. Thus, the machine-oriented ETS model provides new understanding of the merits and the real cost of direct execution of data-flow graphs.
{"title":"Monsoon: an explicit token-store architecture","authors":"G. Papadopoulos, D. Culler","doi":"10.1145/285930.285999","DOIUrl":"https://doi.org/10.1145/285930.285999","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"116666173"}
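The "simple transition on state bits" mentioned in the Monsoon abstract can be illustrated with a minimal sketch of operand matching for a two-input instruction. This is a simplified reading, not the Monsoon hardware: the names `Slot`, `Instruction`, `Token`, and `step` are hypothetical, only two presence states are modeled, and the two operands are assumed to arrive on distinct ports. The first token to arrive is parked in the frame slot the compiler assigned; the second finds the slot occupied, empties it, and fires the instruction.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Optional

EMPTY, PRESENT = 0, 1  # the two presence states modeled for a frame slot


@dataclass
class Slot:
    state: int = EMPTY
    value: Optional[float] = None


@dataclass
class Instruction:
    op: Callable[[float, float], float]
    slot_offset: int  # frame-relative slot assigned by the compiler (assumed)


@dataclass
class Token:
    frame: int    # base index of the activation frame
    instr: int    # index of the destination instruction
    port: str     # "L" or "R": which operand this token carries
    value: float


def step(token: Token, program: List[Instruction],
         store: List[Slot]) -> Optional[float]:
    """Process one token; return a result only if the instruction fires."""
    ins = program[token.instr]
    slot = store[token.frame + ins.slot_offset]
    if slot.state == EMPTY:
        # First operand: park it in the slot and wait for its partner.
        slot.state, slot.value = PRESENT, token.value
        return None
    # Second operand: take the partner, reset the slot, and fire.
    partner = slot.value
    slot.state, slot.value = EMPTY, None
    if token.port == "L":
        left, right = token.value, partner
    else:
        left, right = partner, token.value
    return ins.op(left, right)


if __name__ == "__main__":
    program = [Instruction(op=operator.sub, slot_offset=2)]
    store = [Slot() for _ in range(8)]
    print(step(Token(frame=4, instr=0, port="R", value=3.0), program, store))
    print(step(Token(frame=4, instr=0, port="L", value=10.0), program, store))
```

Because the slot address is just frame base plus a compile-time offset, no associative lookup is needed in this sketch, which is the sense in which storage management moves from hardware to the compiler.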
Existing methods of generating and analyzing traces suffer from a variety of limitations, including complexity, inaccuracy, short length, inflexibility, or applicability only to CISC (complex-instruction-set-computer) machines. The authors use a trace-generation mechanism based on link-time code modification which is simple to use, generates accurate long traces of multiuser programs, runs on a RISC (reduced-instruction-set-computer) machine, and can be flexibly controlled. Accurate performance data for large second-level caches can be obtained by on-the-fly analysis of the traces. A comparison is made of the performance of systems with 512 KB to 16 MB second-level caches, and it is shown that, for today's large programs, second-level caches of more than 4 MB may be unnecessary. It is also shown that set associativity in second-level caches of more than 1 MB does not significantly improve system performance. In addition, the experiments provide insights into first-level and second-level cache line size.
{"title":"Generation and analysis of very long address traces","authors":"A. Borg, R. Kessler, D. W. Wall","doi":"10.1145/325164.325153","DOIUrl":"https://doi.org/10.1145/325164.325153","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"123426876"}
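The idea of analyzing an address trace on the fly, as described in the abstract above, can be sketched with a small trace-driven cache model. This is an illustrative sketch only, not the authors' tool: the class `Cache`, the function `miss_ratio`, and the LRU replacement policy are assumptions introduced here. The trace is any iterable of byte addresses and is consumed one reference at a time, so it never has to be stored.

```python
from collections import OrderedDict
from typing import Iterable


class Cache:
    """Set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes: int, line_bytes: int, ways: int) -> None:
        assert size_bytes % (line_bytes * ways) == 0
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered map per set; the oldest entry is the LRU line.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Simulate one reference; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False


def miss_ratio(trace: Iterable[int], cache: Cache) -> float:
    """Feed a trace through the cache as it is produced."""
    for addr in trace:
        cache.access(addr)
    total = cache.hits + cache.misses
    return cache.misses / total if total else 0.0
```

Running the same trace through `Cache` objects that differ only in `size_bytes`, `ways`, or `line_bytes` is one simple way to compare configurations of the kind the abstract discusses; any numbers obtained from toy traces say nothing about the authors' measurements.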
The issue of I/O device access in HARTS (Hexagonal Architecture for Real-Time Systems), a distributed real-time computer system under construction at the University of Michigan, is explicitly addressed. Several candidate solutions are introduced, explored and evaluated according to cost, complexity, reliability, and performance: (1) 'node-direct' distribution with the intranode bus and a local I/O bus; (2) use of dedicated I/O nodes, which are placed in the hexagonal mesh as regular applications nodes, but which provide I/O services rather than computing services; and (3) use of a separate I/O network, which has led to the proposal of an 'interlaced' I/O network. The interlaced I/O network is intended to provide both high performance without burdening node processors with I/O overhead and a high degree of reliability. Both static and dynamic multiownership protocols are developed for managing I/O device access in this I/O network. The relative merits of the two protocols are explored, and the performance and accessibility which each provides are simulated.
{"title":"A distributed I/O architecture for HARTS","authors":"K. Shin, G. Dykema","doi":"10.1145/325164.325159","DOIUrl":"https://doi.org/10.1145/325164.325159","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"130712843"}
The architecture for issuing multiple instructions per clock in the NonStop Cyclone processor is described. Pairs of instructions are fetched and decoded by a dual two-stage prefetch pipeline and passed to a dual six-stage pipeline for execution. Dynamic branch prediction is used to reduce branch penalties. A unique microcode routine for each pair is stored in the large duplexed control store. The microcode controls parallel data paths optimized for executing the most frequent instruction pairs. Other features of the architecture include cache support for unaligned double-precision accesses, a virtually addressed main memory, and a novel precise exception mechanism.
{"title":"Multiple instruction issue in the NonStop Cyclone processor","authors":"R. Horst, R. L. Harris, Robert L. Jardine","doi":"10.1145/325164.325147","DOIUrl":"https://doi.org/10.1145/325164.325147","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"134083732"}
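The abstract above mentions dynamic branch prediction without saying which scheme the Cyclone processor uses. As a generic illustration only, the sketch below implements the common textbook scheme of two-bit saturating counters indexed by branch address. It is an assumption chosen for illustration, not a description of the Cyclone design, and the names `TwoBitPredictor`, `predict`, and `update` are hypothetical.

```python
from typing import List


class TwoBitPredictor:
    """Table of two-bit saturating counters (generic textbook scheme)."""

    def __init__(self, entries: int = 1024, initial: int = 1) -> None:
        # Counter values 0..3; values 2 and 3 mean "predict taken".
        self.table: List[int] = [initial] * entries

    def _index(self, pc: int) -> int:
        return pc % len(self.table)

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_correct(pc: int, outcomes: List[bool],
                  predictor: TwoBitPredictor) -> int:
    """Replay the outcomes of one branch and count correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct
```

The two-bit hysteresis means a single loop exit does not immediately flip the prediction for the next execution of the loop, which is the usual motivation for this scheme over a one-bit history.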
In an attempt to reduce the number of operand memory references, many RISC (reduced-instruction-set-computer) machines have 32 or more general-purpose registers (e.g. MIPS, ARM, Spectrum, 88K). Without special compiler optimizations, such as inlining or interprocedural register allocation, it is rare that a compiler will use a majority of these registers for a function. The authors explore the possibility of using some of these registers to hold branch target addresses and the corresponding instruction at each branch target. To evaluate the effectiveness of this scheme, two machines were designed and emulated. One machine had 32 general-purpose registers used for data references, while the other machine had 16 data registers and 16 registers used for branching. The results show that using registers for branching can effectively reduce the cost of transfers of control.
{"title":"Reducing the cost of branches by using registers","authors":"J. Davidson, D. Whalley","doi":"10.1145/325164.325138","DOIUrl":"https://doi.org/10.1145/325164.325138","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"125609027"}
A new lock-based cache scheme which incorporates synchronization into the cache coherency mechanism is presented. With this scheme, high-level synchronization primitives, as well as low-level ones, can be implemented without excessive overhead. Cost functions for well-known synchronization methods are derived for invalidation schemes, write update schemes, and the authors' lock-based scheme. To predict the performance implications of the new scheme accurately, a new simulation model embodying a widely accepted paradigm of parallel programming is developed. It is shown that the authors' lock-based protocol outperforms existing cache protocols.
{"title":"Synchronization with multiprocessor caches","authors":"Joonwon Lee, U. Ramachandran","doi":"10.1145/325164.325107","DOIUrl":"https://doi.org/10.1145/325164.325107","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"130595396"}
M. Annaratone, Marco Fillo, K. Nakabayashi, M. Viredaz
K2 is a distributed-memory parallel processor designed to support a multiuser, multitasking, time-sharing operating system and an automatically parallelizing Fortran compiler. The architecture and the hardware implementation of K2 are presented. The authors focus on the architectural features required by the operating system and the compiler. A prototype machine with 24 processors is currently being developed.
{"title":"The K2 parallel processor: architecture and hardware implementation","authors":"M. Annaratone, Marco Fillo, K. Nakabayashi, M. Viredaz","doi":"10.1145/325164.325118","DOIUrl":"https://doi.org/10.1145/325164.325118","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"131086499"}
Architectural support is proposed for goal management as part of a special-purpose processor architecture for the efficient execution of Flat Concurrent Prolog. Goal management operations, namely, halt, spawn, suspend, and commit, are decoupled from goal reduction and overlapped in the goal management unit. Their efficient execution is enabled using a goal cache. The authors evaluate the performance of the goal management support using an analytic performance model and program parameters characteristic of the system's development workload. Most goal management operations are completely overlapped, resulting in a speedup of 2. Higher speedups are obtained for workloads that exhibit greater goal management complexity.
{"title":"Architectural support for the management of tightly-coupled fine-grain goals in Flat Concurrent Prolog","authors":"L. Alkalaj, T. Lang, M. Ercegovac","doi":"10.1145/325164.325155","DOIUrl":"https://doi.org/10.1145/325164.325155","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"114107237"}
The VAX architecture has been extended to include an integrated, register-based vector processor. This extension allows both high-end and low-end implementations and can be supported with only small changes by VAX/VMS and VAX/ULTRIX operating systems. The extension is effectively exploited by the new vectorizing capabilities of VAX Fortran. Features of the VAX vector architecture and the design decisions which make it a consistent extension of the VAX architecture are discussed.
{"title":"VAX vector architecture","authors":"D. Bhandarkar, Richard Brunner","doi":"10.1145/325164.325145","DOIUrl":"https://doi.org/10.1145/325164.325145","journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture"},"publicationDate":"1990-05-01","platform":"Semanticscholar","paperid":"114127130"}