Pub Date: 1995-10-01, DOI: 10.1016/0165-6074(95)00020-O
Joachim König, Lothar Thiele
Domain-specific architectures are gaining increasing attention in high-performance applications where general-purpose processors cannot achieve the desired throughput. The algorithms to be implemented strongly influence the architecture of the design, and vice versa. A systematic approach based on provably correct transformations, herein called Algorithm-Architecture Co-design, is demonstrated on the design of a processor for long integer arithmetic.
Title: Algorithm-architecture co-design by example: a coprocessor for on-line arithmetic. Journal: Microprocessing and Microprogramming, vol. 41, no. 5, pp. 339-357.
Pub Date: 1995-10-01, DOI: 10.1016/0165-6074(95)00019-K
F. Catthoor, M. Moonen
In this introduction, we will summarize the main contributions of the papers collected in this special issue. Moreover, the topics addressed in these papers will be linked to the major research trends in the domain of parallel algorithms, architectures and compilation.
Title: Parallel programmable architectures and compilation for multi-dimensional processing. Journal: Microprocessing and Microprogramming, vol. 41, no. 5, pp. 333-337.
Pub Date: 1995-10-01, DOI: 10.1016/0165-6074(95)00023-H
J. Kneip, M. Ohmacht, K. Rönner, P. Pirsch
A highly parallel single-chip image signal processor architecture has been derived by analysis of image processing algorithms. Available levels of parallelism and their associated demands on data access, control and complexity of operations were taken into account. The RISC architecture, called “HiPAR-DSP”, consists of a control unit, 16 parallel ASIMD-controlled datapaths with autonomous addressing and instruction selection capability, a local data cache per datapath, a shared memory with matrix-type data access and a powerful DMA unit. The architecture was designed by analyzing characteristic algorithm properties with respect to their inherent parallelization resources, achievable speed-up and implementation costs. This resulted in a proper balance between the degree of parallelism and flexibility, leading to high performance for a wide field of applications. Additional measures were taken to support efficient high-level programmability of the processor. This was achieved by the concurrent implementation of special architectural features and a C++ programming environment, which consists of an adaptation of the GNU C++ compiler and an optimizing assembler, supporting all levels of concurrency offered by the hardware. While most levels of parallelization are kept invisible to the programmer, data-level parallelism is expressed by the programmer using special new data types added to the standard C/C++ data types. The processor, clocked at 100 MHz, achieves a sustained performance of about 2.0 giga-operations per second for numerous image processing algorithms; for example, a normalized correlation of a 512 × 512 image with a 32 × 32 correlation mask takes 450 ms. Thus, a programmable parallel processor architecture achieves a performance that hitherto required a dedicated integrated circuit.
Title: Architecture and C++-programming environment of a highly parallel image signal processor. Journal: Microprocessing and Microprogramming, vol. 41, no. 5, pp. 391-408.
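As a rough plausibility check on the HiPAR-DSP figures quoted above, the Python sketch below relates the 2.0 giga-operations per second, the 100 MHz clock, the 16 datapaths and the 450 ms correlation time. It assumes, hypothetically, that the correlation is evaluated directly at every position where the mask fits entirely inside the image; the abstract does not say how borders are handled or which operations are counted, so the per-pixel figure is only indicative. All function names are illustrative.

```python
# Figures taken from the abstract.
CLOCK_HZ = 100e6             # 100 MHz clock
SUSTAINED_OPS_PER_S = 2.0e9  # about 2.0 giga-operations per second
DATAPATHS = 16               # parallel datapaths
IMAGE_SIDE = 512             # 512 x 512 image
MASK_SIDE = 32               # 32 x 32 correlation mask
CORRELATION_TIME_S = 0.450   # reported processing time


def ops_per_cycle():
    """Sustained operations completed per clock cycle, chip-wide."""
    return SUSTAINED_OPS_PER_S / CLOCK_HZ


def ops_per_datapath_per_cycle():
    """Sustained operations per cycle attributed to one datapath."""
    return ops_per_cycle() / DATAPATHS


def mask_pixel_pairs():
    """Image/mask pixel pairs visited by a direct correlation.

    Assumption (not from the abstract): the mask is only placed where it
    lies fully inside the image.
    """
    positions = (IMAGE_SIDE - MASK_SIDE + 1) ** 2
    return positions * MASK_SIDE ** 2


def ops_per_pixel_pair():
    """Operation budget available per pixel pair within the reported time."""
    budget = SUSTAINED_OPS_PER_S * CORRELATION_TIME_S
    return budget / mask_pixel_pairs()


if __name__ == "__main__":
    print("ops per cycle:", ops_per_cycle())
    print("ops per datapath per cycle:", ops_per_datapath_per_cycle())
    print("pixel pairs:", mask_pixel_pairs())
    print("ops per pixel pair:", round(ops_per_pixel_pair(), 2))
```

Under these assumptions the sustained rate corresponds to 20 operations per clock cycle, or 1.25 per datapath, and the 450 ms budget leaves a little under four operations per image-mask pixel pair, which seems plausible for multiply-accumulate work plus the sums needed for normalization.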
Pub Date: 1995-10-01, DOI: 10.1016/0165-6074(95)99030-9
E. De Greef, F. Catthoor, H. De Man
In this paper, an architectural template is presented that can execute the full-search motion estimation algorithm, or other similar video or image processing algorithms, in real time. The architecture is based on a set of programmable video signal processors (VSPs). It is also possible to integrate the processor cores and their local memories on a (set of) chip(s). Owing to its programmability, the system is very flexible and can be used for emulation of other similar block-oriented local-neighborhood algorithms. The architecture can easily be divided into several partitions, without data exchange between partitions. Special attention is paid to memory size and transfer optimization, which are dominant factors for both area and power cost. The trade-offs and techniques used to arrive at these solutions are explained in detail. It is shown that careful optimizations can lead to large savings in memory size (up to 66%) and bandwidth requirements (up to a factor of 4) compared to a straightforward solution.
Title: Mapping real-time motion estimation type algorithms to memory efficient, programmable multi-processor architectures. Journal: Microprocessing and Microprogramming, vol. 41, no. 5, pp. 409-423.
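For readers unfamiliar with the full-search motion estimation algorithm named in the abstract above, the following Python sketch shows its general shape: every displacement inside a search window is tried, and the one with the lowest matching cost wins. The sum of absolute differences (SAD) used here is an assumed cost function, since the abstract does not name one, and the function and parameter names are hypothetical. It is a plain software illustration, not the memory-optimized multiprocessor mapping the paper describes.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of the current
    frame at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2 and return the best
    (cost, dx, dy); candidates that leave the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= height and
                      0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    # Reference frame with distinct pixel values; the current frame is the
    # reference shifted by (dx, dy) = (2, 1), zero-filled at the border.
    size = 12
    ref = [[y * 16 + x for x in range(size)] for y in range(size)]
    cur = [[ref[y + 1][x + 2] if y + 1 < size and x + 2 < size else 0
            for x in range(size)] for y in range(size)]
    print(full_search(cur, ref, bx=4, by=4, n=4, search=2))
```

Each candidate block re-reads almost the same reference pixels as its neighbor, which illustrates why memory size and transfer counts, rather than arithmetic, dominate the cost of this kind of algorithm.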
Pub Date: 1995-10-01, DOI: 10.1016/0165-6074(95)00021-F
Thibault Duboux, Afonso Ferreira, Michel Gastaldo
Most of the VLSI dictionary machines proposed in the literature were designed to fit on a single chip. If the number of acquired elements is larger than the number of VLSI cells, another chip has to be designed and manufactured to accommodate the larger dictionary. In this paper, we propose a new design for dictionary machines that assembles blocks of standard existing dictionary machines. Our machine is as efficient as the best machines described in the literature, with the enormous advantage of scaling up quite easily, with no degradation of its performance, by simply adding more standard blocks.
Title: A scalable design for VLSI dictionary machines. Journal: Microprocessing and Microprogramming, vol. 41, no. 5, pp. 359-372.
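The idea of growing a dictionary machine by adding standard fixed-size blocks, as described in the abstract above, can be pictured with a small software analogy in Python. This is only a sketch under stated assumptions: the classes `Block` and `BlockDictionary`, the operation set (insert, delete, member, extract-min) and the sequential routing of keys to blocks are all hypothetical, and the code says nothing about the timing or interconnect of the authors' hardware design.

```python
import bisect


class Block:
    """Stand-in for one fixed-capacity standard dictionary machine."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []  # kept sorted

    def is_full(self):
        return len(self.keys) >= self.capacity

    def member(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        bisect.insort(self.keys, key)

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            return True
        return False


class BlockDictionary:
    """Dictionary assembled from blocks; capacity grows by adding blocks."""

    def __init__(self, block_capacity):
        self.block_capacity = block_capacity
        self.blocks = [Block(block_capacity)]

    def member(self, key):
        return any(block.member(key) for block in self.blocks)

    def insert(self, key):
        if self.member(key):
            return
        for block in self.blocks:
            if not block.is_full():
                block.insert(key)
                return
        # Every block is full: scale up by attaching one more block.
        new_block = Block(self.block_capacity)
        new_block.insert(key)
        self.blocks.append(new_block)

    def delete(self, key):
        return any(block.delete(key) for block in self.blocks)

    def extract_min(self):
        non_empty = [block for block in self.blocks if block.keys]
        if not non_empty:
            return None
        smallest = min(non_empty, key=lambda block: block.keys[0])
        return smallest.keys.pop(0)
```

For example, a `BlockDictionary(2)` that receives five distinct keys ends up with three blocks, without any change to the block design itself.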
Pub Date: 1995-08-01, DOI: 10.1016/0165-6074(95)00010-L
Q. Song, E.K. Teoh, D.P. Mital
Performance analysis and comparison are carried out for one- and two-dimensional systolic arrays based on transputers. The one-dimensional array is found to have low efficiency because of communication overhead. The systolic algorithm is extended to the two-dimensional array to achieve full parallelism in each layer's calculation, which speeds up simulation of the network. Experimental results are provided to support the performance evaluation.
Title: Multilayered neural network implementation on transputer systolic array. Journal: Microprocessing and Microprogramming, vol. 41, no. 4, pp. 289-299.
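To make the one-dimensional case in the abstract above more concrete, the Python sketch below simulates a common ring formulation of a systolic layer computation: each processing element (PE) holds one row of the weight matrix, and the input values circulate around the ring so that every PE sees every input once. This is an assumed textbook-style scheme for a square layer, not necessarily the algorithm used in the paper, and the name `ring_systolic_layer` is hypothetical.

```python
import math


def ring_systolic_layer(weights, inputs, activation=math.tanh):
    """Simulate one neural-network layer on a ring of n PEs.

    PE i stores row i of `weights` and starts out holding inputs[i].
    In each of the n steps every PE multiplies the value it currently
    holds by the matching weight, then passes the value to its neighbor.
    Assumes a square layer (n neurons, n inputs).
    """
    n = len(inputs)
    acc = [0.0] * n
    held = list(inputs)
    for step in range(n):
        for i in range(n):
            j = (i + step) % n           # index of the input PE i holds now
            acc[i] += weights[i][j] * held[i]
        held = held[1:] + held[:1]       # communication: shift around the ring
    return [activation(a) for a in acc]


if __name__ == "__main__":
    w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    x = [1, 0, -1]
    # With the identity activation the result equals the product w @ x.
    print(ring_systolic_layer(w, x, activation=lambda v: v))
```

Every one of the n steps includes a communication phase, so for small layers the shifting can rival the arithmetic in cost, which is consistent with the communication overhead the abstract reports for the one-dimensional array.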
Pub Date: 1995-08-01, DOI: 10.1016/0165-6074(95)90003-9
Title: Calendar of forthcoming conferences and events. Journal: Microprocessing and Microprogramming, vol. 41, no. 4, pp. 331-332.
Pub Date: 1995-08-01, DOI: 10.1016/0165-6074(95)00011-C
Jim-Min Lin, Shang Rong Tsai
User or program mobility in distributed computing systems is becoming increasingly significant, since users may change their working locations frequently. Job migration is supplementary to remote login in the support of user mobility. However, the migration facility is not yet a common feature in distributed systems, mainly because of the inherent complexity of implementing such a facility. This paper proposes a logical machine migration mechanism that can effectively support software environment migration. The basic idea behind logical machine migration is to migrate a logical machine, including the running processes and their execution environment, by a single mechanism. Thus most of the migration difficulties due to the dependency on the operating system kernel are eliminated. We have realized an experimental system, called DLMS386, which successfully demonstrates this idea.
Title: Supporting user mobility in the Distributed Logical Machine System. Journal: Microprocessing and Microprogramming, vol. 41, no. 4, pp. 315-330.
Pub Date: 1995-08-01, DOI: 10.1016/0165-6074(95)00012-D
R. Posch, F. Pucher
This paper deals with an experimental network application. As an example, an environment with on-screen display of low-density information is selected. Such an environment can be found in hospitals, where patients have to be guided from one place to another, as well as in many other settings, such as airports. The method used is a fiber-based 20 Mbit/s network. To obtain a homogeneous structure, transputer links are used throughout. Both packet-oriented interprocessor communication and low-level bit streams for the video frames can coexist over these links. Uniformity in the physical layer [1] ensures maximum reliability and flexibility. With the use of transputer links, fault detection in this application is inherent. The overall design is a highly distributed and low-cost solution. Interfaces to standard networks are easily available.
Title: An experimental mixed purpose network. Journal: Microprocessing and Microprogramming, vol. 41, no. 4, pp. 263-271.
Pub Date: 1995-08-01, DOI: 10.1016/0165-6074(95)00015-G
S.K. Basu, J. Datta Gupta, R. Datta Gupta
In this paper we propose a VLSI-implementable architecture called Cube Connected Tree (CCT), which combines advantageous properties of both the tree and the hypercube. This structure has a fixed, low node degree for any network size, unlike the hypercube, where the node degree depends on the size of the hypercube. The degree-diameter product metric [26] of the CCT is low compared to that of a hypercube of comparable size. It overcomes the data congestion problem near the root of the binary tree by having multiple roots in the structure, thereby enhancing the I/O bandwidth of the system. The complexity of the VLSI layout of this structure has been addressed within the grid model of Thompson [12]. By using spare links and PEs, the fault tolerance capabilities of the system have been enhanced. Easy programmability of this structure has been demonstrated by designing polylogarithmic algorithms for sorting and the discrete Fourier transform.
Title: On synthesizing cube and tree for parallel processing. Journal: Microprocessing and Microprogramming, vol. 41, no. 4, pp. 273-288.
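The hypercube side of the degree-diameter comparison mentioned in the abstract above can be checked with a few lines of Python. In a d-dimensional hypercube each of the 2^d nodes has degree d, and the diameter is also d, so the product grows as d squared. The abstract does not give the corresponding figures for the Cube Connected Tree, so only the hypercube is sketched here; the function names are illustrative.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Neighbors of a node: flip each of the dim address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_diameter(dim):
    """Diameter found by breadth-first search from node 0.

    The hypercube is vertex-transitive, so the largest distance from any
    single node equals the diameter of the whole network.
    """
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in hypercube_neighbors(u, dim):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def degree_diameter_product(dim):
    """Node degree (dim) times diameter for a dim-dimensional hypercube."""
    return dim * hypercube_diameter(dim)


if __name__ == "__main__":
    for dim in range(1, 7):
        print(2 ** dim, "nodes: product =", degree_diameter_product(dim))
```

Because both factors grow with the logarithm of the node count, the product rises as the network scales, whereas a topology with a fixed node degree only pays for growth through its diameter.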