MAMACG: a tool for automatic mapping of matrix algorithms onto mesh array computational graphs
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218548
D. Lê, M. Ercegovac, T. Lang, J. Moreno
The design of MAMACG, a software tool for automatically mapping an important class of matrix algorithms onto mesh array computational graphs, is described. MAMACG is a concrete realization of the multimesh graph (MMG) method, implemented in Elk, a dialect of LISP with built-in X-graphics capabilities.
{"title":"MAMACG: a tool for automatic mapping of matrix algorithms onto mesh array computational graphs","authors":"D. Lê, M. Ercegovac, T. Lang, J. Moreno","doi":"10.1109/ASAP.1992.218548","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218548","url":null,"abstract":"The design of MAMACG, a software tool for automatically mapping an important class of matrix algorithms into mesh array computational graphs, is described. MAMACG is a concrete realization of the multimesh graph (MMG) method, implemented in Elk, a dialect of LISP with built-in X-graphics capabilities.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131101895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High level software synthesis for signal processing systems
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218536
S. Ritz, M. Pankert, H. Meyr
For the design of complex digital signal processing systems, block-diagram-oriented simulation has become a widely accepted standard. Current research is concerned with the coupling of heterogeneous simulation engines and the transition from simulation to implementation of digital signal processing systems. Because complex design spaces are difficult to master, high-level hardware and software synthesis is becoming increasingly important. The authors concentrate on block-diagram-oriented software synthesis of digital signal processing systems for programmable processors, such as digital signal processors (DSPs). They present the synthesis environment DESCARTES, illustrating novel optimization strategies. Furthermore, they discuss goal-directed software synthesis, by which code is generated interactively or automatically and adapted to the application-specific needs imposed by constraints on memory space, sampling rate, or latency.
{"title":"High level software synthesis for signal processing systems","authors":"S. Ritz, M. Pankert, H. Meyr","doi":"10.1109/ASAP.1992.218536","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218536","url":null,"abstract":"For the design of complex digital signal processing systems, block diagram oriented simulation has become a widely accepted standard. Current research is concerned with the coupling of heterogenous simulation engines and the transition from simulation to the implementation of digital signal processing systems. Due to the difficulty in mastering complex design spaces high level hardware and software synthesis is becoming increasingly important. The authors concentrate on the block diagram oriented software synthesis of digital signal processing systems for programmable processors, such as digital signal processors (DSP). They present the synthesis environment DESCARTES illustrating novel optimization strategies. Furthermore they discuss goal directed software synthesis, by which code is interactively or automatically generated, which can be adapted to the application specific needs imposed by constraints on memory space, sampling rate or latency.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129916284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scheduling partitions in systolic algorithms
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218540
A. Suarez, J. Llabería, A. Fernandez
The authors present a technique for scheduling partitions in systolic algorithms (SAs). The technique can be combined with any projection used in the design of problem-dependent-size SAs and with any spatial mapping used for the partitions. They also present the code transformations needed to convert the sequential code into the code executed in a processing element (PE) of the systolic processor (SP). The technique is applied to the matrix-by-vector problem using a non-unimodular transformation matrix and taking input and output of data into account. Cut&pile spatial mapping is used for the partitions.
{"title":"Scheduling partitions in systolic algorithms","authors":"A. Suarez, J. Llabería, A. Fernandez","doi":"10.1109/ASAP.1992.218540","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218540","url":null,"abstract":"The authors present a technique for scheduling partitions in systolic algorithms (SA). This technique can be used in combination with any possible projection used for the problem dependent size SA design and with any possible spatial mapping used for the partitions. They also present the necessary code transformations to transform the sequential code into the code that is executed in a processing element (PE) of the systolic processor (SP). This technique is applied to the matrix by vector problem using a non-unimodular transformation matrix and taking into account input and output of data. Cut&pile spatial mapping is used for the partitions.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic Boltzmann machine VLSI can be scaled using multi-chip modules
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218571
Michael Murray, J. Burr, D. Stork, Ming-Tak Leung, K. Boonyanit, G. Wolff, A. Peterson
The authors describe a special-purpose, very high speed, digital deterministic Boltzmann neural network VLSI chip. Each chip has 32 physical neural processors, which can be apportioned into an arbitrary topology (input, multiple hidden, and output layers) of up to 160 virtual neurons in total. Under typical conditions, the chip learns at approximately 5×10⁸ connection updates per second (CUPS). Through relatively minor (subsequent) modifications, the authors' chips can be 'tiled' in multi-chip modules to build multi-layer networks of arbitrary size, suffering only slight communication delays and overhead. In this way, the number of CUPS can be made arbitrarily large, limited only by the number of chips tiled. The chip's high speed is due to massively parallel array computation of the inner products of connection weights and neural activations, limited (but adequate) precision for weights and activations (5 bits), a high clock rate (180 MHz), and several algorithmic and design insights.
{"title":"Deterministic Boltzmann machine VLSI can be scaled using multi-chip modules","authors":"Michael Murray, J. Burr, D. Stork, Ming-Tak Leung, K. Boonyanit, G. Wolff, A. Peterson","doi":"10.1109/ASAP.1992.218571","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218571","url":null,"abstract":"Describes a special purpose, very high speed, digital deterministic Boltzmann neural network VLSI chip. Each chip has 32 physical neural processors, which can be apportioned into an arbitrary topology (input, multiple hidden and output layers) of up to 160 virtual neurons total. Under typical conditions, the chip learns at approximately 5*10/sup 8/ connection updates/second (CUPS). Through relatively minor (subsequent) modifications, the authors' chips can be 'tiled' in multi-chip modules, to make multi-layer networks of arbitrary size suffering only slight communications delays and overhead. In this way, the number of CUPS can be made arbitrarily large, limited only by the number of chips tiled. The chip's high speed is due to massively parallel array computation of the inner products of connection weights and neural activations, limited (but adequate) precision for weights and activations (5 bits), high clock rate (180 MHz), as well as several algorithmic and design insights.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"358 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133320318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programming systolic arrays
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218541
R. Hughey
This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential hazards of inter-processor communication, allowing data flow only via abstract systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.
{"title":"Programming systolic arrays","authors":"R. Hughey","doi":"10.1109/ASAP.1992.218541","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218541","url":null,"abstract":"This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential hazards of inter-processor communication, allowing data flow only via abstract systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124810085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical scheduling of DSP programs onto multiprocessors for maximum throughput
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218584
P. Hoang, J. Rabaey
A multiprocessor scheduling algorithm that simultaneously considers pipelining, retiming, parallel execution, and hierarchical node decomposition to maximize throughput is presented. The algorithm takes into account interprocessor communication delays as well as memory and processor availability constraints. Results on a set of benchmarks demonstrate the algorithm's ability to achieve near-optimal speedups across a wide range of applications exhibiting various types of concurrency, with good scalability with respect to processor count.
{"title":"Hierarchical scheduling of DSP programs onto multiprocessors for maximum throughput","authors":"P. Hoang, J. Rabaey","doi":"10.1109/ASAP.1992.218584","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218584","url":null,"abstract":"A multiprocessor scheduling algorithm that simultaneously considers pipelining, retiming, parallel execution and hierarchical node decomposition to maximize performance throughput is presented. The algorithm is able to take into account interprocessor communication delays, and memory and processor availability constraints. The results on a set of benchmarks demonstrate the algorithm's ability to achieve near optimal speedups across a wide range of applications of various types of concurrency, with good scalability with respect to processor count.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131396008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrete wavelet transforms in VLSI
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218570
M. Vishwanath, R. Owens, M. J. Irwin
Three architectures, based on linear systolic arrays, for computing the discrete wavelet transform (DWT) are described. The AT² lower bound for computing the DWT in a systolic model is derived and shown to be AT² = Ω(N² N_w k). Two of the architectures are within a factor of log N of optimal, but they are of practical importance due to their regular structure, scalability, and limited I/O needs. The third architecture is optimal, but it requires complex control.
{"title":"Discrete wavelet transforms in VLSI","authors":"M. Vishwanath, R. Owens, M. J. Irwin","doi":"10.1109/ASAP.1992.218570","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218570","url":null,"abstract":"Three architectures, based on linear systolic arrays, for computing the discrete wavelet transform, are described. The AT/sup 2/ lower bound for computing the DWT in a systolic model is derived and shown to be AT/sup 2/= Omega (N/sup 2/N/sub w/k). Two of the architectures are within a factor of log N from optimal, but they are of practical importance due to their regular structure, scalability and limited I/O needs. The third architecture is optimal, but it requires complex control.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133544898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully static multiprocessor realization for real-time recursive DSP algorithms
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218537
Duen-Jeng Wang, Y. Hu
A systematic approach to implementing a real-time recursive digital signal processing algorithm on a dedicated multiprocessor array is presented. First, the authors unfold the algorithm so that its corresponding dependence graph becomes a newly defined generalized perfect-rate graph. They prove that the dependence graph of a recursive algorithm admits a desirable rate-optimal, fully static multiprocessor implementation if and only if it is a generalized perfect-rate graph. Based on these results, an efficient heuristic algorithm is presented that performs optimal multiprocessor scheduling and task assignment so that the number of processors required is minimized.
{"title":"Fully static multiprocessor realization for real-time recursive DSP algorithms","authors":"Duen-Jeng Wang, Y. Hu","doi":"10.1109/ASAP.1992.218537","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218537","url":null,"abstract":"A systematic approach to implement a real time recursive digital signal processing algorithm on a dedicated multiprocessor array is presented. First, the authors unfold the algorithm so that its corresponding dependence graph becomes a newly defined generalized perfect rate graph. They prove that the dependence graph of a recursive algorithm admits a desirable rate optimal, full static multiprocessor implementation if and only if it is a generalized perfect rate graph. Based on these results, an efficient heuristic algorithm is presented to perform optimal multi-processor scheduling and task assignment so that the number of processors required is minimized.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127238305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix computations in arrays of DSPs
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218549
Jaime Moreno, M. Medina
The authors present the use of the multimesh graph representation to map matrix algorithms onto arrays of digital signal processors (DSPs), using the TMS320C30 as an example. This processor, like most DSPs, is characterized by a two-level memory subsystem and a built-in DMA controller. The mapping process focuses on large matrices which do not fit in the first level of memory. Good utilization of the DSP resources is achieved by programming the execution of the algorithms by prisms from the multimesh graph; the optimal size of the prisms is obtained. Performance estimates indicate that it is possible to program the DSP in such a way that the impact of the slower second-level memory is not significant (around 7% degradation with five wait states). Good load balancing throughout the array is achieved by allocating partitions of the problem of nonuniform size to processors, as suggested in a previous publication. The proposed schedule of operations deviates from the conventional ordering, in which the inner product of two vectors is computed in full at once. Instead, the proposed schedule divides inner products into portions which are executed in an interleaved manner throughout the computation of the entire problem, each portion using as an input the partial result obtained from the execution of the corresponding previous portion.
{"title":"Matrix computations in arrays of DSPs","authors":"Jeime Moreno, M. Medina","doi":"10.1109/ASAP.1992.218549","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218549","url":null,"abstract":"The authors present the use of the multimesh graph representation to map matrix algorithms onto arrays of digital signal processors (DSPs), using the TMS 320C30 as example. This processor, as most DSPs, is characterized by a two-level memory subsystem and a built-in DMA controller. The mapping process focuses on large matrices which do not fit in the first level of memory. Good utilization of the DSP resources is achieved by programming the execution of the algorithms by prisms from the multimesh graph; the optimal size of the prisms is obtained. Performance estimates indicate that it is possible to program the DSP in such a way that the impact of slower second-level memory is not significant (around 7% degradation with five wait states). Good load balancing throughout the array is achieved by allocating to processors partitions of the problem of nonuniform size, as suggested in a previous publication. The schedule of operations proposed deviates from the conventional ordering, wherein the inner-product among two vectors is fully computed at once. Instead, the proposed schedule divides inner-products into portions which are executed in interleaved manner throughout the computation of the entire problem, each portion using as an input the partial result obtained earlier from the execution of the corresponding previous portion.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123180437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPERT: a VLIW/SIMD microprocessor for artificial neural network computations
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218573
K. Asanović, J. Beck, Brian E. D. Kingsbury, P. Kohn, N. Morgan, J. Wawrzynek
SPERT (synthetic perceptron testbed) is a fully programmable single-chip microprocessor designed for efficient execution of artificial neural network algorithms. The first implementation is in a 1.2 μm CMOS technology with a 50 MHz clock rate, and a prototype system is being designed to occupy a double SBus slot within a Sun SPARCstation. SPERT sustains over 300×10⁶ connections per second during pattern classification, and around 100×10⁶ connection updates per second while running the popular error backpropagation training algorithm. This represents a speedup of around two orders of magnitude over a SPARCstation 2 for algorithms of interest. An earlier system produced by the group, the Ring Array Processor (RAP), used commercial DSP chips. Compared with a RAP multiprocessor of similar performance, SPERT represents over an order of magnitude reduction in cost for problems where fixed-point arithmetic is satisfactory.
{"title":"SPERT: a VLIW/SIMD microprocessor for artificial neural network computations","authors":"K. Asanović, J. Beck, Brian E. D. Kingsbury, P. Kohn, N. Morgan, J. Wawrzynek","doi":"10.1109/ASAP.1992.218573","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218573","url":null,"abstract":"SPERT (synthetic perceptron testbed) is a fully programmable single chip microprocessor designed for efficient execution of artificial neural network algorithms. The first implementation is in a 1.2 mu m CMOS technology with a 50 MHz clock rate, and a prototype system is being designed to occupy a double SBus slot within a Sun Sparcstation. SPERT sustains over 300*10/sup 6/ connections per second during pattern classification, and around 100*10/sup 6/ connection updates per second while running the popular error backpropagation training algorithm. This represents a speedup of around two orders of magnitude over a Sparcstation-2 for algorithms of interest. An earlier system produced by the group, the Ring Array Processor (RAP), used commercial DSP chips. Compared with a RAP multiprocessor of similar performance, SPERT represents over an order of magnitude reduction in cost for problems where fixed-point arithmetic is satisfactory.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114314806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}