Current software engineering practice relies heavily on the reliability of software implementation languages and their underlying architectures. However, both the languages currently in use and traditional architectures provide too little built-in security. This paper presents an architecture that is heavily influenced by two properties of secure languages: coercion and exception handling. It is shown that appropriate design decisions lead to an architecture with a compact data representation that supports both generic and nongeneric instructions. The architecture is object oriented, and object addressing is controlled by the operand stream, with optimization opportunities to bypass descriptor inspection.
{"title":"DOAS: an object oriented architecture supporting secure languages","authors":"A. J. Goor, H. Corporaal","doi":"10.1145/75362.75409","DOIUrl":"https://doi.org/10.1145/75362.75409","url":null,"abstract":"Current software engineering practice heavily relies on the reliability of software implementation languages and underlying architectures. However, both the currently used languages, as well as the traditional architectures suffer from a shortage of built-in security. In this paper, an architecture is presented, which is heavily influenced by two properties of secure languages: coercion and exception handling. It is shown that proper design decisions lead to an architecture having a compact data representation, allowing both generic and nongeneric instructions. The architecture is object oriented, and object addressing is under control of the operand stream, with optimization possibilities to bypass descriptor inspection.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127607126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
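The DOAS abstract names generic and nongeneric instructions, coercion, exception handling, and descriptor inspection without giving their encodings. The Python sketch below is a hypothetical illustration of how those ideas can fit together, not the DOAS instruction set: `Obj`, `TypeTrap`, `generic_add`, `add_int`, and the int-to-real coercion rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Obj:
    """Hypothetical tagged object: a descriptor (type tag) plus a value."""
    tag: str  # descriptor, e.g. "int", "real", "bool" (illustrative set)
    value: Union[int, float, bool]


class TypeTrap(Exception):
    """Raised when a generic instruction finds incompatible descriptors."""


# Coercion order assumed for this sketch: an int may be widened to a real.
_RANK = {"int": 0, "real": 1}


def generic_add(a: Obj, b: Obj) -> Obj:
    # Generic instruction: inspect both descriptors, coerce, then operate.
    if a.tag not in _RANK or b.tag not in _RANK:
        raise TypeTrap(f"cannot add {a.tag} and {b.tag}")
    result_tag = a.tag if _RANK[a.tag] >= _RANK[b.tag] else b.tag
    if result_tag == "real":
        return Obj("real", float(a.value) + float(b.value))
    return Obj("int", a.value + b.value)


def add_int(a: Obj, b: Obj) -> Obj:
    # Nongeneric instruction: the compiler has already proven that both
    # operands are ints, so descriptor inspection is bypassed entirely.
    return Obj("int", a.value + b.value)
```

For example, `generic_add(Obj("int", 2), Obj("real", 0.5))` coerces the int and returns `Obj("real", 2.5)`, while adding an `Obj("bool", True)` raises `TypeTrap` instead of silently producing a value. The nongeneric path trades that check for speed, which is only safe when the types are known statically.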
Unification is known to be the most frequently executed operation in logic programming and in PROLOG interpreters, so speeding up the execution of logic programs requires improving the performance of unification. We propose a parallel unification machine that speeds up the unification algorithm. The machine is simulated at the register-transfer level, and the simulation results are presented together with a performance comparison against a serial unification coprocessor.
{"title":"Design and performance measurements of a parallel machine for the unification algorithm","authors":"F. Sibai, K. Watson, Mi Lu","doi":"10.1145/75362.75398","DOIUrl":"https://doi.org/10.1145/75362.75398","url":null,"abstract":"Unification is known to be the most repeated operation in logic programming and PROLOG interpreters. To speed up the execution of logic programs, the performance of unification must be improved. We propose a parallel unification machine for speeding up the unification algorithm. The machine is simulated at the register transfer level and the simulation results as well as performance comparison with a serial unification coprocessor are presented.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131786346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
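The abstract does not spell out the unification algorithm that the parallel machine accelerates. For reference, a minimal serial unification in Python (Robinson-style, with an occurs check) is sketched below. The term encoding is an assumption of this sketch: capitalized strings are variables (as in PROLOG), lowercase strings are atoms, and tuples `(functor, arg1, ...)` are compound terms. It says nothing about how Sibai, Watson, and Lu parallelize the operation.

```python
from typing import Dict, Optional, Tuple, Union

Term = Union[str, Tuple]  # variable "X", atom "a", or compound ("f", arg1, ...)
Subst = Dict[str, Term]


def is_var(t: Term) -> bool:
    return isinstance(t, str) and t[:1].isupper()


def walk(t: Term, s: Subst) -> Term:
    # Follow variable bindings until reaching an unbound variable or a non-variable.
    while is_var(t) and t in s:
        t = s[t]
    return t


def occurs(v: str, t: Term, s: Subst) -> bool:
    # True if variable v appears inside t (after applying bindings).
    t = walk(t, s)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, arg, s) for arg in t[1:])
    return False


def unify(a: Term, b: Term, s: Optional[Subst] = None) -> Optional[Subst]:
    """Return a substitution that unifies a and b, or None if none exists."""
    s = {} if s is None else dict(s)
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        x, y = walk(x, s), walk(y, s)
        if x == y:
            continue
        if is_var(x):
            if occurs(x, y, s):
                return None  # binding would create an infinite term
            s[x] = y
        elif is_var(y):
            if occurs(y, x, s):
                return None
            s[y] = x
        elif isinstance(x, tuple) and isinstance(y, tuple):
            if x[0] != y[0] or len(x) != len(y):
                return None  # functor or arity clash
            stack.extend(zip(x[1:], y[1:]))
        else:
            return None  # distinct atoms, or atom versus compound term
    return s
```

Unifying `f(X, g(Y))` with `f(a, g(X))` binds both `X` and `Y` to `a`. Many PROLOG systems omit the occurs check for speed; it is kept here so that `unify("X", ("f", "X"))` fails rather than building a cyclic term.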
{"title":"Definition of elementary arithmetic operations by using ACM","authors":"S. D'Angelo, G. Sechi","doi":"10.1145/75362.75414","DOIUrl":"https://doi.org/10.1145/75362.75414","url":null,"abstract":"","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115206619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining is a local compiler optimization technique that can enhance the performance of global compaction techniques for VLIW machines. Given two adjacent operations of a certain class that are flow (read-after-write) dependent and therefore cannot be placed in the same micro-instruction, the combining technique transforms the operations so that the modified operations are no longer dependent. The transformed operations can then be executed in the same micro-instruction, reducing the total execution time of the program. In this paper, combining a pair of flow-dependent operations into one wide instruction word is proposed as an important compilation technique for VLIW architectures. Combining is particularly effective with software pipelining and loop unrolling, since these techniques make it more likely that combinable operations end up next to each other. We implemented combining in our parallelizing compiler for the wide instruction word architecture now being built at the IBM T. J. Watson Research Center. A 10% speedup is obtained on the Stanford integer benchmarks and other mature sequential C programs, compared with compaction techniques that do not use combining. For a class of inner loops, combining can remove the inter-iteration dependences completely and can improve performance in proportion to the unrolling factor.
{"title":"“Combining” as a compilation technique for VLIW architectures","authors":"T. Nakatani, K. Ebcioglu","doi":"10.1145/75362.75401","DOIUrl":"https://doi.org/10.1145/75362.75401","url":null,"abstract":"Combining is a local compiler optimization technique that can enhance the performance of global compaction techniques for VLIW machines. Given two adjacent operations of a certain class that are flow (read-after-write) dependent and that cannot be placed in the same micro-instruction, the combining technique can transform the operations so that the modified operations have no dependence. The transformed operations can be executed in the same micro-instruction, thus allowing the total execution time of the program to be reduced. In this paper, combining a pair of flow-dependent operations into a wide instruction word is suggested as an important compilation technique for VLIW architectures. Combining is particularly effective with software pipelining and loop unrolling since combinable operations can come together with a higher probability when these compilation techniques are used. We implemented combining in our parallelizing compiler for the wide instruction word architecture, which is now being built at the IBM T. J. Watson Research Center. It is shown that ten percent speedup is obtained on the Stanford integer benchmarks and other sequential-matured C programs, in comparison to compaction techniques that do not use combining. For a class of inner loops, combining can remove the inter-iteration dependencies completely and can improve performance in the same ratio as the loop is unrolled.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114808962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
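The abstract does not list which operation classes are combinable. One simple case that fits its description is a pair of add-immediate operations, where `r1 = r0 + 4` followed by `r2 = r1 + 8` can be rewritten so the second operation reads `r0` directly. The sketch below illustrates that case only; `AddImm`, `combine`, the two interpreters, and the register names are hypothetical, and it assumes that operations sharing one wide instruction all read their operands before any result is written.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AddImm:
    """Hypothetical add-immediate operation: dst <- src + imm."""
    dst: str
    src: str
    imm: int


def combine(first: AddImm, second: AddImm) -> Optional[AddImm]:
    """Rewrite `second` so it no longer reads the result of `first`.

    Assumes both operations will be issued in the same wide instruction,
    where every operand is read before any result is written.
    Returns None when the pair is not combinable.
    """
    if second.src != first.dst:
        return None  # no flow (read-after-write) dependence to remove
    if second.dst == first.dst:
        return None  # both would write the same register in one instruction
    # second computed (first.src + first.imm) + second.imm; fold the constants.
    return AddImm(second.dst, first.src, first.imm + second.imm)


def run_sequential(ops: List[AddImm], regs: Dict[str, int]) -> Dict[str, int]:
    # Reference semantics: one operation at a time, in program order.
    regs = dict(regs)
    for op in ops:
        regs[op.dst] = regs[op.src] + op.imm
    return regs


def run_parallel(ops: List[AddImm], regs: Dict[str, int]) -> Dict[str, int]:
    # One wide instruction: all operations read old values, then all write.
    results = {op.dst: regs[op.src] + op.imm for op in ops}
    return {**regs, **results}
```

For `first = AddImm("r1", "r0", 4)` and `second = AddImm("r2", "r1", 8)`, `combine` returns `AddImm("r2", "r0", 12)`. Issuing `first` and the combined operation together in `run_parallel` yields the same registers as running the original pair with `run_sequential`, but in one instruction instead of two.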
Two major obstacles to designing cost-effective application-specific architectures are the recurring costs of system-software development and of hardware implementation, in particular VLSI implementation, for each new architecture. The SCalable ARChitecture Experiment (SCARCE) aims to provide a framework for application-specific processor design that allows functionality, implementation complexity, and performance to be scaled. The SCARCE framework comprises, in part still under development: an architecture framework defining the constraints for the design of application-specific architectures; tools for synthesizing architectures from an application or application area; VLSI cell libraries and tools for quickly generating application-specific processors; and a system-software platform that can be quickly retargeted to fit the application-specific architecture. This paper concentrates on the micro-architecture framework of SCARCE and outlines the process of generating VLSI processors.
{"title":"A flexible VLSI core for an adaptable architecture","authors":"Hans M. Mulder, P. Stravers","doi":"10.1145/75362.75423","DOIUrl":"https://doi.org/10.1145/75362.75423","url":null,"abstract":"Two major limitations concerning the design of cost-effective application-specific architectures are the recurrent costs of system-software development and hardware implementation, in particular VLSI implementation, for each architecture. The SCalable ARChitecture Experiment (SCARCE) aims to provide a framework for application-specific processor design. The framework allows scaling of functionality, implementation complexity, and performance. The SCARCE framework consists and will consist of: an architecture framework defining the constraints for the design of application-specific architectures; tools for synthesizing architectures from application or application-area; VLSI cell libraries and tools for quick generation of application-specific processors; a system-software platform which can be retargeted quickly to fit the application-specific architecture; This paper concentrates on the micro-architecture framework of SCARCE and outlines the process of generating VLSI processors.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117341817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new instruction fetch method, forward semantic, is proposed to enable deeply pipelined processors to fetch one useful instruction every cycle. Forward semantic is an improved alternative to delayed branching (with or without squashing) and offers five major advantages. First, no restriction is imposed on the type of instructions filling the branch slots, which allows a large number of slots to be filled. Second, no modification to offsets and displacements is necessary when an instruction is copied to fill a branch slot, which simplifies the linker implementation. Third, an interrupted program can resume execution with a single program counter, eliminating the need to reload the instruction pipeline before resuming execution. Fourth, programs compiled with N slots can execute on pipelines requiring K (K ≤ N) slots, which makes compatibility across an architecture family possible. Lastly, the filling of branch slots is completely transparent to code compaction and software interlocking schemes. Together, these advantages provide an efficient instruction fetch mechanism and eliminate artificial penalties on branch cost. At the cost of 11% static code expansion, forward semantic achieves an instruction fetch cost of 1.2 cycles for pipelines requiring 10 slots for each taken branch, a level of instruction fetch efficiency not previously achieved with conventional instruction fetch methods. The branch cost is determined by the accuracy of compile-time branch prediction rather than by artificial limitations, such as data dependences, that prevent the slots from being filled. These results are measured from the execution of real UNIX and CAD programs with complex control structures.
{"title":"Forward semantic: a compiler-assisted instruction fetch method for heavily pipelined processors","authors":"P. Chang, Wen-mei W. Hwu","doi":"10.1145/75362.75418","DOIUrl":"https://doi.org/10.1145/75362.75418","url":null,"abstract":"A new instruction fetch method, forward semantic, is offered to enable the deeply pipelined processors to fetch one useful instruction every cycle. Forward semantic is an improved alternative to the delayed branching (with or without squashing), with five major advantages. First, no restriction is imposed on the type of instructions filling the branch slots, which allows a large number of slots to be filled. Second, no modification to the offsets and displacements is necessary when an instruction is copied to fill a branch slot, which simplifies the linker implementation. Third, an interrupted program can resume execution with a single program counter, eliminating the need for reloading the instruction pipeline before resuming execution. Fourth, programs compiled with N slots can execute on pipelines requiring K (K ≤ N) slots, which makes family architecture compatibility possible. Lastly, the filling of branch slots is totally transparent to code compaction and software interlocking schemes. These advantages combine to provide an efficient instruction fetch mechanism and to eliminate artificial penalties on branch cost. At the cost of 11% static code expansion, forward semantic achieves an instruction fetch cost of 1.2 cycles for pipelines requiring 10 slots for each taken branch. This level of instruction fetch efficiency has never been achieved before with conventional instruction fetch methods. The branch cost is dictated by the accuracy of the compile-time branch prediction rather than artificial limitations, such as data dependencies, which prevent the slots from being filled. These results are measured from the execution of real UNIX and CAD programs with complex control structures.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"239 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125708314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of automatic loop parallelization has received considerable attention in the area of parallelizing compilers, and several algorithms exist for it. In this paper we address the problem of time-optimal parallelization of loops with conditional jumps. We prove that even for machines with unlimited resources, there are simple loops for which no semantically and algorithmically equivalent time-optimal program exists.
{"title":"On optimal loop parallelization","authors":"F. Gasperoni, U. Schwiegelshohn, K. Ebcioglu","doi":"10.1145/75362.75411","DOIUrl":"https://doi.org/10.1145/75362.75411","url":null,"abstract":"The problem of automatic loop parallelization has received a lot of attention in the area of parallelizing compilers. Automatic loop parallelization can be achieved by several algorithms. In this paper we address the problem of time optimal parallelization of loops with conditional jumps. We prove that even for machines with unlimited resources there are simple loops for which no semantically and algorithmically equivalent time optimal program exists.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131131191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a method for reordering straight-line instruction streams for pipelined computers that have a single instruction issue unit but may contain multiple function units. The objective is to make the most efficient use of the pipelines in the system. The input to the scheduler is the intermediate code of a compiler, represented as a data dependence graph (DDG). The scheduler is a list scheduler: during reordering, it takes into account both the data dependences and the pipeline effects of the function units to find the most suitable time slot for each node. The scheduler has been implemented and tested on several scientific application programs. The results show that in most cases the scheduler achieves the optimal result, with an average instruction issue rate above 96%. By comparison, the issue rate of an ordinary compiler is only 22%, and that of a compiler that accounts for pipeline effects but does not reorder the instruction stream is about 45%.
{"title":"On reordering instruction streams for pipelined computers","authors":"J. Shieh, C. Papachristou","doi":"10.1145/75362.75419","DOIUrl":"https://doi.org/10.1145/75362.75419","url":null,"abstract":"This paper describes a method to reorder the straight line instruction streams for pipelined computers which have one instruction issue unit but may contain multiple function units. The objective is to make the most efficient usage of the pipelines within the computer system. The input to the scheduler is the intermediate code of a compiler, and is represented by a data dependence graph (DDG). The scheduler is a kind of list scheduler. The data dependence and the pipeline effect of the function units within the system have been considered for finding a most suitable time slot for each node during reordering time. The scheduler has been implemented and several scientific application programs have been tested. The results show that in most of the cases the scheduler will achieve the optimal result. The average instruction issue rate is over 96%. As a comparison, the issue rate of an ordinary compiler is only 22%; and the issue rate of a compiler with the effect of pipeline but without reordering the instruction stream is about 45%.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115199432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
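The abstract does not give the scheduler's priority function or its exact machine model. The sketch below is a generic single-issue list scheduler over a DDG, assuming every function unit is fully pipelined (so a new operation can issue every cycle) and using the longest latency-weighted path to a sink as the priority, a common heuristic that may differ from the one Shieh and Papachristou use. The function name and the node names and latencies in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def list_schedule(latency: Dict[str, int],
                  edges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Return the issue cycle of every node of an acyclic DDG.

    latency[n]: cycles before n's result can be used by a successor.
    edges: (producer, consumer) data dependences.
    Models one issue unit and fully pipelined function units, so at most
    one operation issues per cycle and units never block each other.
    """
    succs: Dict[str, List[str]] = {n: [] for n in latency}
    preds: Dict[str, List[str]] = {n: [] for n in latency}
    for p, c in edges:
        succs[p].append(c)
        preds[c].append(p)

    # Priority: latency-weighted length of the longest path to a sink.
    height: Dict[str, int] = {}

    def compute_height(n: str) -> int:
        if n not in height:
            height[n] = latency[n] + max(
                (compute_height(s) for s in succs[n]), default=0)
        return height[n]

    for n in latency:
        compute_height(n)

    issue: Dict[str, int] = {}
    cycle = 0
    while len(issue) < len(latency):
        # A node is ready once every predecessor's result is available.
        ready = [n for n in latency
                 if n not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[n])]
        if ready:
            # Highest priority first; ties broken by name for determinism.
            best = min(ready, key=lambda n: (-height[n], n))
            issue[best] = cycle
        cycle += 1  # a cycle with nothing ready is a pipeline stall
    return issue
```

With loads `ld1`, `ld2`, `ld3` (latency 2), a `mul` (latency 3) consuming `ld1` and `ld2`, and an `add` (latency 1) consuming `mul` and `ld3`, the scheduler hoists `ld3` into the slot where `mul` would otherwise stall, giving issue cycles 0, 1, 2, 3, 6 for `ld1`, `ld2`, `ld3`, `mul`, `add`: five operations in 7 cycles. The issue rate reported in the abstract is the same kind of ratio, operations issued divided by cycles.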
Under timing constraints, local compaction may fail because of poor scheduling decisions. Su [SDWX87] uses foresight to avoid some of these poor decisions, but foresight takes a considerable amount of time. This paper introduces the Incremental Foresight algorithm. Experiments on four different target architectures show that Incremental Foresight performs as well as foresight while saving around 48 percent of the excess time.
{"title":"Incremental foresighted local compaction","authors":"Pantung Wijaya, V. Allan","doi":"10.1145/75362.75415","DOIUrl":"https://doi.org/10.1145/75362.75415","url":null,"abstract":"Under timing constraints, local compaction may fail because of poor scheduling decisions. Su [SDWX87] uses foresight to avoid some of the poor scheduling decisions. However, the foresight takes a considerable amount of time. In this paper the Incremental Foresight algorithm is introduced. Experiments using four different target architectures show that the Incremental Foresight algorithm works as well as foresight, and saves around 48 percent of the excess time.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114337084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes MIES, a design tool for the modeling, visualization, and analysis of VLSI microarchitectures. MIES combines a graphical data-path model with a symbolic control model and provides several user interfaces for creating, simulating, and evaluating these models.
{"title":"MIES: a microarchitecture design tool","authors":"J. Nestor, B. Soudan, Z. Mayet","doi":"10.1145/75362.75422","DOIUrl":"https://doi.org/10.1145/75362.75422","url":null,"abstract":"This paper describes MIES, a design tool for the modeling, visualization, and analysis of VLSI microarchitectures. MIES combines a graphical data path model and symbolic control model and provides a number of user interfaces which allow these models to be created, simulated, and evaluated.","PeriodicalId":365456,"journal":{"name":"MICRO 22","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114369205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}