Title: Special purpose neurocomputers: an automatic design approach
Authors: A. Basaglia, W. Fornaciari, F. Salice
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651532
In: Proceedings of 3rd International Conference on Algorithms and Architectures for Parallel Processing (ICAPP 1997)
Abstract: A methodology for designing a digital special-purpose neurocomputer implementing feedforward multilayer neural networks is presented. The design flow consists of three stages: weight discretization, which relaxes the precision requirements while maintaining compatibility with the original model; architectural synthesis, which transforms the abstract description into an optimized digital structure; and VHDL model generation, which produces the VHDL description of the neurocomputer from a set of parametric components.
Title: Update based distributed shared memory integrated into RHODOS' memory management
Authors: J. Silcock, A. Gościński
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651494
Abstract: The DSM system proposed in this paper is implemented entirely at the operating-system level, as a component of RHODOS' Memory (Space) Manager. It is integrated with RHODOS' existing invalidation-based DSM, allowing programmers to choose the consistency protocol best suited to their application. These factors enable RHODOS DSM to provide the user with a transparent, efficient and scalable shared-memory programming environment. We describe the logical design, implementation and performance study of an update-based DSM that strictly adheres to these criteria, which let the user program with a familiar model while taking advantage of the greater scalability of clusters of workstations (COWs).
Title: Modeling and evaluation of a new cluster-based system for commercial applications
Authors: W. Hahn, Suk-Han Yoon, Kangwoo Lee, M. Dubois
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651489
Abstract: We model and evaluate SPAX, a new parallel processing system for commercial applications. SPAX cost-effectively overcomes the limitations of the SMP by providing both the scalability of a parallel processing system and the application portability of an SMP. To investigate whether the new architecture satisfies the requirements of commercial applications such as OLTP, we built system and workload models. Simulation results show that the IO subsystem becomes the bottleneck before the newly developed system network does. We find that SPAX can still meet the IO requirements of the OLTP workload, since its network and IO nodes support a flexible IO subsystem in terms of the number of disk drives and IO nodes relative to the number of processing nodes.
Title: Subtorii allocation strategies for torus connected networks
Authors: S. Gupta, P. Srimani
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651498
Abstract: In this paper we investigate the problem of scheduling n independent jobs on an m×m torus-based network. We develop a model to quantify the effect of contention for communication links on the dilation of job execution time when multiple jobs share communication links.
Title: Generating efficient parallel code for successive over-relaxation
Authors: P. Tang
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651517
Abstract: A complete suite of algorithms for parallelizing compilers to generate efficient SPMD code for SOR problems is presented. By applying a unimodular transformation before loop tiling and parallelization, the number of messages per iteration per processor is reduced from 3^n − 1 in the conventional parallel SOR algorithm to 2^n − 1, where n is the dimensionality of the data set. To maintain memory scalability, a novel approach that uses the local dynamic memory of parallel processors to implement the skewed data set is proposed.
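The message counts quoted in this abstract follow from counting neighboring tiles in an n-dimensional decomposition; a minimal sketch (the interpretation of 3^n − 1 as all adjacent tiles including diagonals is an assumption, not stated in the abstract):

```python
# Messages per iteration per processor in n-dimensional SOR tiling,
# per the abstract: 3**n - 1 with conventional tiling (every adjacent
# tile, diagonals included) versus 2**n - 1 after the unimodular
# transformation described in the paper.
def conventional_msgs(n: int) -> int:
    return 3**n - 1

def transformed_msgs(n: int) -> int:
    return 2**n - 1

for n in (1, 2, 3):
    print(f"n={n}: {conventional_msgs(n)} -> {transformed_msgs(n)}")
```

For a 2-D data set this is 8 messages reduced to 3, and for 3-D, 26 reduced to 7.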
Title: ATME: a parallel programming environment for applications with conditional task attributes
Authors: Lin Huang, M. Oudshoorn
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651497
Abstract: Parallel applications with varying usage patterns present a significant challenge to programmers, in that the spawning of tasks and the communication between them may be conditional (termed "conditional parallel programming"). Ideally, the programmer should not be burdened with operational issues that have little relationship to the application itself. This paper proposes a new parallel programming environment, ATME, to automate task scheduling in conditional parallel programming. By adaptively producing accurate estimates of the task model prior to execution, ATME adjusts task distribution to improve system and application performance.
Title: Parallel neural network training on Multi-Spert
Authors: P. Farber, K. Asanović
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651531
Abstract: Multi-Spert is a scalable parallel system built from multiple Spert-II nodes which we have constructed to speed error backpropagation neural network training for speech recognition research. We present the Multi-Spert hardware and software architecture, and describe our implementation of two alternative parallelization strategies for the backprop algorithm. We have developed detailed analytic models of the two strategies which allow us to predict performance over a range of network and machine parameters. The models' predictions are validated by measurements for a prototype five-node Multi-Spert system. This prototype achieves a neural network training performance of over 530 million connection updates per second (MCUPS) while training a realistic speech application neural network. The model predicts that performance will scale to over 800 MCUPS for eight nodes.
Title: Parallel implementation of synthetic aperture radar on high performance computing platforms
Authors: Jinwoo Suh, M. Ung, Viktor K. Prasanna
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651522
Abstract: We present a high-throughput implementation of synthetic aperture radar (SAR) processing on high performance computing (HPC) platforms. In our implementation, the processors are divided into two groups of sizes M and N: the first group of M processors computes the frequency-domain convolution (FDC) in the range dimension, and the second group of N processors computes the FDC in the azimuth dimension. M and N are determined by the computational requirements of the FDC in the range and azimuth dimensions, respectively. The key contribution of this paper is a general high-throughput M-to-N communication algorithm, a basic communication primitive used in many signal processing applications when a software task pipeline is employed to obtain high throughput. Our algorithm reduces the number of communication steps to lg(N/M + 1) + n(k − 1), where k ≥ 2 and n = ⌈lg_k M⌉. Implementation results on the IBM SP2 and the Cray T3D based on the MITRE real-time benchmarks are presented. The results show that, for a 1K×1K image, the minimum number of processors required to process the SAR benchmarks can be reduced by 50% by using the proposed communication algorithm.
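The step-count formula can be evaluated directly; a minimal sketch, assuming "lg" means log base 2 and that both terms round up to whole steps (neither assumption is spelled out in the abstract):

```python
import math

# Communication steps for the M-to-N algorithm as given in the abstract:
# lg(N/M + 1) + n*(k - 1), with k >= 2 and n = ceil(log_k(M)).
# Assumed: lg is log base 2, and fractional step counts round up.
def mn_comm_steps(M: int, N: int, k: int = 2) -> int:
    n = math.ceil(math.log(M, k)) if M > 1 else 0
    return math.ceil(math.log2(N / M + 1)) + n * (k - 1)

# Hypothetical group sizes for illustration: M=4 range processors,
# N=12 azimuth processors, radix k=2.
print(mn_comm_steps(4, 12))
```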
Title: An efficient local address generation for the block-cyclic distribution
Authors: Oh-Young Kwon, Tae-Geun Kim, T. Han, Sung-Bong Yang, Shin-Dug Kim
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651507
Abstract: To generate local addresses for an array section A(l:h:s) with a block-cyclic distribution, an efficient compilation method is required. In this paper, two local address generation methods for the block-cyclic distribution are presented. One is a simple method modified from the virtual-block scheme; the other is a linear-time ΔM-table construction method. The array elements of A(l:h:s) accessed at run time form a family of lines, and by using the equations of these lines a ΔM table can be generated in O(k) time. Experimental results show that the simple method performs poorly, while the linear-time ΔM-table method is faster than other algorithms in both table generation time and access time for 10,000 array elements.
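For readers unfamiliar with the problem this abstract addresses, a naive reference sketch (this is the brute-force enumeration the paper's ΔM-table method is designed to avoid, not the paper's algorithm; block size b, processor count P and the mapping convention are assumptions):

```python
# Brute-force local address generation for a block-cyclic distribution:
# global index g lives in block g // b; blocks are dealt round-robin to
# P processors, so g belongs to processor (g // b) % P, and its local
# address is its position within that processor's stored blocks.
def local_elements(l, h, s, b, P, p):
    out = []
    for g in range(l, h + 1, s):          # elements of the section A(l:h:s)
        block, offset = divmod(g, b)
        if block % P == p:                # element is on processor p
            local = (block // P) * b + offset
            out.append((g, local))
    return out

# Example: section A(0:19:3), block size 4, 2 processors, processor 0.
print(local_elements(0, 19, 3, 4, 2, 0))
```

An efficient method computes the stride pattern between successive local addresses (the ΔM table) instead of scanning every global index.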
Title: Adaptive routing for a bus-based multiprocessor
Authors: V. Fazio
Pub Date: 1997-12-10 | DOI: 10.1109/ICAPP.1997.651478
Abstract: This paper describes and compares an implementation of an unusual hot-spot-resistant adaptive routing architecture, and evaluates its performance.